As we gear up for another semester, the “Introduction to Machine Learning” course offered by CMU is once again filled to capacity, with a long waitlist to boot. It has become one of the most popular courses for graduate students at CMU – several departments even list it as a required course.

Why so popular? Because machine learning applies to a wide range of research. Like its cousin, statistics, machine learning provides tools for anyone interested in analyzing data. It has achieved buzzword status, and many people just want to make sure they’re not missing out.

Despite its popularity, the course has yet to stabilize in terms of content. Each professor who gets saddled with it has to decide which topics to teach, as well as how deeply to cover each one. This is a challenge not to be underestimated, because the audience for the course is not homogeneous.

Broadly, the students who sign up for machine learning fall into two groups:

  1. Those who want to use machine learning effectively, but are primarily interested in other fields of study.
  2. Those who want to advance machine learning itself, who work in the field or in related fields (e.g. robotics or theory).

Group 2 is somewhat easier to work with, because you can assume that they have a strong mathematical background, programming skills, and at the very least a love/hate relationship with proofs. The ML department has actually created a course just for this group, which ameliorates the problem somewhat.

However, even a classroom made up entirely of Group 1 poses a challenge. This group is very heterogeneous in terms of math background, as well as tolerance for/interest in theoretical proofs.  This is the group that drops out of the intro to ML course, leaves strongly negative course reviews, and complains that machine learning is “just too hard.”

It is very tempting to say “good riddance” and write these people off as posers. But what is the point of inventing machine learning algorithms if nobody understands them well enough to effectively use them?

The field of statistics has been down this road before. While there are many who are interested in statistics itself, most scientists view it as a data analysis tool. Thus, we have courses with titles like “Statistics for Psychologists” and so on. Statistics is the language by which scientists communicate their results to each other and to the rest of the world. Everyone should learn how to use it properly.

Whether statisticians have succeeded in teaching scientists the language of statistics is a matter of debate. Anyone who follows the reproducibility/p-hacking problem can tell you that misunderstanding is pervasive, evident both in papers and in the community’s response to the problem.

How can we prevent a similar misuse/misinterpretation of machine learning? There is some evidence that it is already prevalent in the field of neuroscience, as multi-voxel pattern analysis (MVPA) becomes a go-to tool for neuroimaging analysis.

One potential solution is to create a course with a title like “Machine Learning: a User’s Guide.” The open question is what to include in such a course. Here are my thoughts on the subject:

First, what is data? How can we characterize data for use in machine learning? Here we introduce the vocabulary that machine learners use (samples and features), and review some basic statistical concepts. This sounds like a no-brainer, but let me assure you from personal experience that getting everyone on the same page here will save you time in the long run.
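
To make that vocabulary concrete, here is a minimal sketch (using NumPy; the feature names are made up for illustration) of the standard layout: one row per sample, one column per feature.

```python
import numpy as np

# A toy dataset in the standard machine learning layout: one row per sample,
# one column per feature. The feature names below are invented for illustration.
# Features: [age_years, resting_heart_rate, reaction_time_ms]
X = np.array([
    [23, 61, 310.0],
    [35, 72, 295.5],
    [41, 58, 330.2],
    [29, 80, 288.7],
])
y = np.array([0, 1, 0, 1])  # one label per sample

n_samples, n_features = X.shape
print(n_samples, n_features)  # → 4 3
print(X.mean(axis=0))         # per-feature means: a basic statistical summary
```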

Then you need to establish an understanding of some basic algorithms: think Naive Bayes, logistic regression, SVMs, and k-NN. These algorithms serve as useful examples for illustrating correct usage, and they remain go-to choices for those working in a small-data regime (e.g. neuroscience). So, what does someone need to know about an algorithm? Here’s a (likely incomplete) list:

  • Its requirements on the input data: continuous or discrete features? What dimensionality?
  • The assumptions it makes about the data – specifically, the conditions under which it will and won’t work well.
  • The form of the learned model
  • How the model is trained

This motivates the question: how does one evaluate an algorithm’s performance? Here you can begin talking about cross-validation, held-out sets, and circularity of analysis/overfitting. What are the pros and cons of the different evaluation techniques?
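
K-fold cross-validation, for instance, can be sketched in a few lines of NumPy (an illustrative sketch; the fold count and seed are arbitrary choices):

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)      # shuffle once, up front
    folds = np.array_split(idx, n_folds)  # roughly equal-sized folds
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# Every sample lands in exactly one test fold, so each prediction comes from
# a model that never saw that sample during training.
for train_idx, test_idx in kfold_indices(10, n_folds=5):
    assert set(train_idx).isdisjoint(test_idx)
```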

Now you can get into feature selection. What do you do when you have more features than data samples? How can you optimize an algorithm’s performance? What is the appropriate way to do that without introducing circularity? You can then zoom out and talk about the bias/variance trade-off in general: the more complex the model, the more data you need to fit it.
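
The circularity pitfall is easy to show in code. In this sketch (NumPy, with a simple correlation-based filter invented for illustration), selecting features on the full dataset before cross-validating leaks test information into the selection step; the fix is to redo the selection inside each training fold.

```python
import numpy as np

def select_top_features(X, y, k=10):
    """Rank features by absolute correlation with the labels (a simple filter)."""
    scores = np.abs(np.corrcoef(X.T, y)[-1, :-1])  # corr of each feature with y
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 500))  # 40 samples, 500 features: more features than samples
y = rng.integers(0, 2, size=40)

# WRONG (circular): selecting features on ALL the data before cross-validating
# lets information from the test folds leak into the selection step.
leaky_features = select_top_features(X, y)

# RIGHT: redo the selection inside each training fold only.
train_idx, test_idx = np.arange(20), np.arange(20, 40)
fold_features = select_top_features(X[train_idx], y[train_idx])
X_fold_train = X[np.ix_(train_idx, fold_features)]  # the held-out half played no part
```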

This would conclude the basic user’s guide to machine learning. Some advanced topics that you could add to the end (if you have time) would include:

  • Optimization: how do we construct objective functions and optimize them? [for a more mathematically inclined group]
  • More algorithms: neural networks, graphical models
  • Establishing statistical significance: permutation tests, Wilcoxon signed-rank test
  • Data visualization: unsupervised methods, dimensionality reduction
  • Interpreting machine learning results: what do the weight values of a linear classifier mean?
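
As one example from this list, a simple permutation test for classifier accuracy can be sketched in NumPy. (This variant permutes the labels against fixed predictions; stricter variants retrain the model on every permutation.)

```python
import numpy as np

def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels agree with the predictions at least
    as well as the real labels do -- a simple permutation test for accuracy."""
    rng = np.random.default_rng(seed)
    observed = np.mean(y_true == y_pred)
    null = np.array([
        np.mean(rng.permutation(y_true) == y_pred)
        for _ in range(n_permutations)
    ])
    # fraction of permutations scoring at least as well as the observed accuracy
    # (the +1 terms avoid reporting an impossible p-value of exactly zero)
    return (np.sum(null >= observed) + 1) / (n_permutations + 1)

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1] * 5)  # 40 balanced labels
y_pred = y_true.copy()
y_pred[:4] = 1 - y_pred[:4]                      # a classifier that gets 36/40 right
print(permutation_p_value(y_true, y_pred))       # small p-value: better than chance
```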

The goal of a course like this would be to equip students with the tools to apply machine learning to a data set properly, and use it to advance scientific understanding of that data. To that end, the most effective assignments would be a combination of hands-on programming tasks and open-ended, thought-experiment-type questions. There is not much value in having students derive gradient updates for known algorithms by hand – no one who simply uses ML algorithms does that. Ideally, students should have some programming knowledge, but advanced knowledge of calculus and linear algebra is not necessary.

Machine learning encompasses a wide range of useful data analysis tools. If we want those tools to be used to advance science, we need to target scientists as users. Current machine learning courses focus mainly on algorithms and proofs, touching on bias/variance and cross-validation in later lectures. Circularity in analysis will continue to be prevalent so long as these topics continue to take a back seat.

I am not advocating removing or dumbing down the other kind of introduction to machine learning. The goal here is to promote an accessible course so that machine learning can be universally (and correctly) adopted by scientists. I’m sure my biases influenced this course plan – if you disagree about something, do share in the comments 🙂
