**Ask an MLer: Classification and Hypothesis Testing**

Often scientists want to use machine learning as an analytical tool to answer a question about a data set. However, transitioning from classification accuracy to a hypothesis test can be tricky. Today we answer a question from a neuroscientist on the subject.

**How can I tell if my classifier is performing significantly above chance? Can I use a t test?**

The short answer is: with a permutation test, and no, you can’t use a t test.

**Why won’t a t test work here?**

Let’s start by thinking about what flavor of t-test we might be tempted to perform. Given that you have a distribution of classification accuracies (either from cross-validation or a held-out data set), and you know that there are k classes in your data set, it seems natural to perform a one-sample t test, as follows:

t = (x̄ − μ₀) / (s / √n)

Here, x̄ is your average classification accuracy, s is the standard deviation of your accuracies, n is the number of samples, and μ₀ = 1/k, which is chance performance.
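For concreteness, here is what that tempting test looks like in code. The per-fold accuracies below are made-up numbers for a hypothetical 4-class problem; as we’ll see, the assumptions behind this test are violated, so this is the approach to avoid, not a recipe.

```python
# The tempting (but flawed) one-sample t test against 1/k.
# `accuracies` is hypothetical: e.g. per-fold accuracies from
# 10-fold cross-validation on a 4-class problem, so chance = 1/4.
import numpy as np
from scipy import stats

accuracies = np.array([0.31, 0.28, 0.35, 0.30, 0.27,
                       0.33, 0.29, 0.32, 0.26, 0.34])
k = 4  # number of classes

# Tests H0: mean accuracy == 1/k, assuming independent, Gaussian samples.
t_stat, p_value = stats.ttest_1samp(accuracies, popmean=1.0 / k)
```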

This test is asking whether your average classification accuracy is equal to μ₀ = 1/k. If 1/k is chance, this seems pretty sensible at first glance. However, there are a few important assumptions being made here:


- It should be possible for the distribution of mean accuracies to be symmetric about 1/k.
- 1/k is the correct representation of chance performance.
- The distribution of x̄ should be Gaussian. This is only true if the individual samples that went into the computation of x̄ are independent.

It should be obvious that the first assumption is violated by the nature of classification accuracies. Below-chance performance is extremely rare (although not impossible), so the distribution of accuracies is skewed above 1/k rather than symmetric about it. As a result, even a classifier that is not performing well could still pass this test.

This leads into the second assumption. The truth is that 1/k is only “chance” with infinite data, which you definitely do not have. If you actually estimate chance empirically (as described below), it is often not exactly 1/k. The estimate of chance you get is a distribution whose mean AND variance matter.
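A quick simulation makes the point. Here we model “chance” as a classifier guessing uniformly at random on a finite test set (the class count and sample size are arbitrary choices for illustration); the resulting accuracies center near 1/k but have real spread.

```python
# Simulating chance-level accuracy on a finite data set:
# it is a distribution around 1/k, not the single value 1/k.
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 40        # 4 classes, 40 test samples (hypothetical)
sims = 10_000

# Each simulation: a random guesser gets each of n samples right
# with probability 1/k; record its accuracy.
chance_accs = rng.binomial(n, 1.0 / k, size=sims) / n

print(chance_accs.mean())  # close to 1/k = 0.25...
print(chance_accs.std())   # ...but with substantial spread
```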

The last assumption is tricky. Are your estimates of classification accuracy independent? Not if you’ve cross-validated: the folds share training data and are therefore correlated. Not only does this make the computation of x̄ problematic for the above formula, but the standard deviation estimate s is also affected. Furthermore, accuracies can’t possibly fall below 0 or above 1, so there is absolutely no way x̄ could be drawn from a true Gaussian.

So, what’s a scientist to do? If you can’t use the standard statistical toolbox to evaluate the performance of your classifier, you can use a **permutation test**.

**What is a permutation test? How do I do it?**

Let’s think about what we actually want to test: what is the probability that we found a connection between training data and labels by chance alone? For example, if we are trying to predict the word a person is reading from brain images, could it be possible that we are getting above chance accuracy just by luck? More than that, how *probable* is it?

A permutation test creates the scenario where there is no connection between the training data and the labels, and simulates the accuracy we would observe by chance. We do this by permuting the order of the labels, thereby assigning “incorrect” labels to each of the training instances. We then run the same machine learning pipeline that we ran on the original data, but use the permuted labels. We do this many times (say, 100 times) and record the accuracy each time. These accuracies create a null distribution for the regime where any relation between the training data and the labels is entirely coincidental.
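Here is a minimal sketch of that procedure with scikit-learn. The data set and classifier are toy stand-ins; in practice you would substitute your own pipeline and run the identical pipeline on the permuted labels.

```python
# Sketch of a permutation test: same pipeline, permuted labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=20,
                           n_classes=2, random_state=0)
clf = LogisticRegression(max_iter=1000)

# True accuracy: whatever pipeline you would normally run.
true_acc = cross_val_score(clf, X, y, cv=5).mean()

# Null distribution: rerun the identical pipeline many times
# with the label order permuted, breaking any data-label link.
n_perms = 100
null_accs = np.array([
    cross_val_score(clf, X, rng.permutation(y), cv=5).mean()
    for _ in range(n_perms)
])
```

(scikit-learn also ships `sklearn.model_selection.permutation_test_score`, which wraps this exact loop for you.)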

If we would like to be able to say that our true performance is above chance with p < 0.05, we look at where the true accuracy falls relative to the distribution of permuted accuracies. If it is larger than 95% of those accuracies, then we can assign it a p value of 0.05.
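In code, that comparison is usually written as a fraction (the null accuracies and true accuracy below are hypothetical stand-ins). A common convention adds 1 to both numerator and denominator, counting the observed accuracy as a member of the null so the p value can never be exactly zero.

```python
# Computing the permutation p value from a null distribution.
import numpy as np

rng = np.random.default_rng(0)
null_accs = rng.normal(0.25, 0.05, size=100)  # stand-in null distribution
true_acc = 0.40                                # stand-in observed accuracy

# Fraction of the null at least as large as the observed accuracy,
# with the observed value itself included ("+1" in both places).
p_value = (1 + np.sum(null_accs >= true_acc)) / (1 + len(null_accs))
```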

So permutation tests are pretty easy! You do whatever you did to measure the true accuracy, just on the permuted labels. The biggest drawback of this method is the computational load required to run hundreds of tests. Especially if you have multiple subjects and many time windows/ROIs to test (as in MEG or fMRI), these tests can take hours or even days to run. We’ve effectively traded analytical effort for computational effort.

Remember that if you are testing across time or ROIs, you still need to correct for multiple comparisons! We’ll talk about that more in a future post.

**Further Reading:**

- Treating cross-validation folds as independent
- Circularity and optimistic accuracy
- Larger folds may be better
- What is chance performance?

**Do you have a burning Machine Learning question? Ask us and we’ll answer it in a post!**

## 3 comments

Satpreet says:

Jul 22, 2017

Thank you for the interesting writeup! Is there a size of dataset (and thus, holdout set) above which this (i.e. hypothesis testing) is not a concern?

On a different note, is there ever a scenario in which the features (i.e. rows of individual features/columns of the design matrix) should be shuffled instead? (i.e. decorrelate the input features)

Alona and Nicole says:

Jul 24, 2017

Thanks for reading! I’m not sure I understand your first question. This post describes the situation in which you want to show that there is signal in the data and that the classifier performs better than it would by chance. No matter how much data you have, you would still have to demonstrate that the classifier beats chance via a hypothesis test.

To respond to your second question: if your input features are correlated (e.g. they are timepoints) that will definitely affect classification performance and you may need to think carefully about what kind of classifier (and regularization technique) to use. I’m having trouble coming up with a scenario in which decorrelating them via shuffling could provide a good baseline. For example, let’s say you want to shuffle timepoints to see if that destroys some important correlation you found. That’s actually not a stringent test, because time has a lot of structure in it that could result in spurious correlations, and you don’t get those if you completely shuffle it. I think most people do a linear shift.

Satpreet says:

Jul 22, 2017

Also, podcast recommendation: http://unsupervisedthinkingpodcast.blogspot.co.uk/p/podcast-episodes.html