A neural network is a computational model inspired by neurons and the neuronal circuits observed in biological systems.  The history behind neural networks is long and storied and could be its own blog post (and of course it is already its own blog post), so we won’t get into that here.  Instead, let’s just cover what an artificial neuron is, and how we put them together to make a network.

An artificial neuron is just a function.  The function has input (x) and produces as output some value (y).  Typically, the input to the function is actually a vector of numbers (x1 … xm).  We multiply each element of the input x by some weight wi, and add them all up.  
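The weighted sum described above can be sketched in a few lines of Python. The function name `weighted_sum` and the example numbers are made up for illustration.

```python
def weighted_sum(x, w):
    """Multiply each input x_i by its weight w_i and add the products."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


# Hypothetical input vector and weights, chosen only for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
print(weighted_sum(x, w))  # 0.5 - 2.0 + 6.0 = 4.5
```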

The output of this function is a continuous value, and can be any number between negative and positive infinity.  Because biological neurons actually produce binary output (they either fire or they don’t fire), we can pass the summation to some other function to create a binary output.  Let’s redefine f by incorporating a new function sigma (σ).
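In the notation used so far, with inputs x1 … xm and weights w1 … wm, the two versions of f can be written as below. This is a reconstruction from the surrounding text, so the exact form in the original may differ.

```latex
% Before: the plain weighted sum
f(x) = \sum_{i=1}^{m} w_i x_i

% After: the weighted sum passed through sigma
f(x) = \sigma\!\left( \sum_{i=1}^{m} w_i x_i \right)
```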

If sigma is a threshold function that returns 0 if the input is below 0, and 1 otherwise, this model is the perceptron, the simplest kind of neural network.  We can use many functions for sigma; another possible choice is the logistic function, σ(z) = 1 / (1 + e^(−z)).

The logistic function has some technical drawbacks, so it’s less often used in practice, but it has the nice property that, like the threshold function of the perceptron, its input can be any number between negative and positive infinity, and its output is a number between 0 and 1.
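To make the two choices of sigma concrete, here is a minimal sketch of a single neuron that accepts either one. The logistic function is written in its standard form, 1 / (1 + e^(−z)); the names `threshold`, `logistic` and `neuron` and the example numbers are illustrative assumptions.

```python
import math


def threshold(z):
    """Perceptron-style sigma: 0 if the input is below 0, and 1 otherwise."""
    return 0.0 if z < 0 else 1.0


def logistic(z):
    """Logistic sigma, 1 / (1 + e^(-z)), split into two branches so that
    math.exp never overflows for large positive or negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neuron(x, w, sigma):
    """Weighted sum of the inputs, passed through sigma."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return sigma(z)


# Hypothetical inputs and weights; the weighted sum here is 4.5.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
print(neuron(x, w, threshold))  # 1.0, because 4.5 is not below 0
print(neuron(x, w, logistic))   # a value between 0 and 1, close to 1
```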


Now, why would we want the output of our network to be a number between 0 and 1?  Well, aside from the fact that that’s how biological neurons function, often we interpret the output of networks in order to make what is called a classification.  For example, we might want to know if a picture is of a cat.   

In this (highly simplified) scenario, the input to our network could be the individual pixels of an image.  If our network is working properly, the output of the function f will be a value close to 1 if the input is of a cat, and close to zero otherwise.  
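One simple way to turn such an output into a yes/no answer is to compare it with a cutoff. The sketch below assumes a cutoff of 0.5 and uses the hypothetical name `classify`; neither comes from the text above.

```python
def classify(score, cutoff=0.5):
    """Map a network output in [0, 1] to a label.

    The cutoff of 0.5 is an assumption made for this sketch.
    """
    return "cat" if score >= cutoff else "not cat"


# Made-up network outputs, for illustration only.
print(classify(0.93))  # cat
print(classify(0.08))  # not cat
```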


We can visualize our single neuron h like so:

In the above picture, circles (nodes) represent numbers and the lines (edges) represent where those numbers are sent.  The x’s have no input edges because they come directly from our data.  The edges represent the flow of data through the network.  Each edge is associated with a weight, and we typically learn a different weight for every edge of the network.

Above, h1 is connected to x1…x3 because those values are input to the function f that determines h1.  As we link together many neurons, the value h (computed via f) can become the input to another neuron.  Thus, we think of the output h as a “hidden” quantity – it is neither an input nor an output of the overall system.  


The power of neural networks really comes from chaining multiple levels of these neurons together.  Here’s a neural network with two hidden layers, each having 3 hidden values.


Here, every purple circle is computed with our function f, and the edges arriving at the left side of the circle define the input to that function.  Each function f can be parameterized with a different set of weights w.  The edges leaving from the right side of the circle containing an h represent that h becoming the input to another function.


The hidden values h1, h2 and h3 all take x as input.  The second layer of hidden values takes the hidden values from the first layer as input.  The final output, y-hat, takes the second layer of hidden values as input.  We then compare the predicted y-hat (e.g. the prediction of whether the image is of a cat) to the true y (e.g. the truth of whether the image is of a cat) to tell if our network has learned the task we aimed to teach it (e.g. the network is able to “see” cats).
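The flow of values just described (x into the first hidden layer, the first hidden layer into the second, the second into y-hat) can be sketched as a forward pass. This sketch assumes three inputs, the logistic function for sigma, and arbitrary made-up weights; the names `layer` and `forward` are illustrative.

```python
import math


def logistic(z):
    # Plain logistic function; adequate for the small values used here.
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows):
    """Compute one layer: each row of weights defines one neuron f."""
    return [
        logistic(sum(w * v for w, v in zip(row, inputs)))
        for row in weight_rows
    ]


def forward(x, W1, W2, w_out):
    """Two hidden layers followed by a single output neuron."""
    h_first = layer(x, W1)           # first layer takes x as input
    h_second = layer(h_first, W2)    # second layer takes the first layer
    return layer(h_second, [w_out])[0]  # y-hat takes the second layer


# Hypothetical input and weights, chosen arbitrarily for illustration.
x = [0.2, 0.7, 0.1]
W1 = [[0.5, -0.3, 0.8], [0.1, 0.4, -0.6], [-0.7, 0.2, 0.3]]
W2 = [[0.3, 0.3, -0.2], [-0.5, 0.6, 0.1], [0.2, -0.4, 0.9]]
w_out = [0.7, -0.1, 0.4]

y_hat = forward(x, W1, W2, w_out)
print(y_hat)  # a number between 0 and 1
```

Each row of `W1` and `W2` holds the weights on the edges arriving at one hidden value, so every neuron gets its own set of weights.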


It turns out that chaining neurons together like this gives the network the power to approximate any continuous function on a bounded input range to any desired accuracy, given enough hidden values (and a nonlinear σ).  The catch is that we have to find the right settings for each of the weights (w) for every edge of the network.


So, how do we go about finding the right w parameters for our network?  We need to be able to tune the weights w to make good predictions for a given input.  Essentially, we will pass an example (x) for which we have the answer (y) to our network, and measure the difference between what we want to predict (y) and what we actually predicted (y-hat).
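There are several ways to measure that difference. As one possibility, the sketch below uses the squared error; this choice and the name `squared_error` are assumptions, since the text does not name a specific measure.

```python
def squared_error(y, y_hat):
    """Squared difference between the true answer y and the prediction y_hat.

    It is 0 for a perfect prediction and grows as the prediction gets worse.
    """
    return (y - y_hat) ** 2


# Made-up predictions for an image that really is a cat (y = 1).
print(squared_error(1.0, 0.9))  # small error: prediction close to the truth
print(squared_error(1.0, 0.2))  # larger error: prediction far from the truth
```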


Tuning the weights of a neural network is typically done using Stochastic Gradient Descent (SGD).  SGD is a very general framework for updating the parameters of a model in response to the mistakes it makes on training data (i.e. the difference between y and y-hat).  SGD assigns “blame” for a mistake to individual parameters in the network (using the gradient of the error with respect to each weight, which is computed by an algorithm called backpropagation), and updates the parameters in a way that reduces the chance the same mistake will be made again.  This is how a neural network learns from data.  Given enough data, a neural network is able to learn a parameterization that picks up on the patterns in the data (so long as reliable patterns exist).
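To give a feel for what an SGD update looks like, here is a minimal sketch for a single logistic neuron (not a full multi-layer network). It assumes a squared-error measure of the mistake, a made-up toy dataset (a logical OR with a constant first input), and arbitrary settings for the learning rate and number of passes; all names are illustrative.

```python
import math
import random


def logistic(z):
    # Two branches so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, w):
    return logistic(sum(w_i * x_i for w_i, x_i in zip(w, x)))


def sgd_step(x, y, w, learning_rate):
    """One update from one example, for the error (y_hat - y)^2.

    By the chain rule, the "blame" for weight w_i is
    2 * (y_hat - y) * y_hat * (1 - y_hat) * x_i.
    """
    y_hat = predict(x, w)
    shared = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
    return [w_i - learning_rate * shared * x_i for w_i, x_i in zip(w, x)]


def mean_error(examples, w):
    return sum((y - predict(x, w)) ** 2 for x, y in examples) / len(examples)


def train(examples, n_inputs, epochs=200, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    w = [0.0] * n_inputs
    for _ in range(epochs):
        order = list(examples)
        rng.shuffle(order)  # "stochastic": visit examples in random order
        for x, y in order:
            w = sgd_step(x, y, w, learning_rate)
    return w


# Toy data (an assumption for this sketch): the first input is always 1
# and acts as a bias; the label is 1 if either of the other inputs is 1.
examples = [
    ([1.0, 0.0, 0.0], 0.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]

error_before = mean_error(examples, [0.0, 0.0, 0.0])
w = train(examples, n_inputs=3)
error_after = mean_error(examples, w)
print(error_before, error_after)  # the error should drop after training
```

In a network with hidden layers the same idea applies, but the blame for each weight has to be passed backwards through the layers, which is what backpropagation does.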


Recent breakthroughs in training these networks have allowed for networks that are both very deep (many hidden layers) and very wide (each layer having many hidden values).  Training these large networks is referred to as Deep Learning.  The ability to train these large networks has led to breakthroughs on multiple fronts, from computer vision, to beating humans at board games and poker, to language translation and self-driving cars.


However, some are skeptical about deep learning.  For many machine learning methods there is strong theoretical justification for their formulation, and we have expectations for the model’s behavior under certain circumstances.  Not so for deep learning.  In fact, there’s not a whole lot known about what these networks are learning, or how to interpret the weights of a trained network.  This is changing, and some theory has begun to emerge, but deep learning and neural networks remain fairly mysterious.  More reading here and here, and a fun visualization here.