In our previous post we talked about how the brain might optimize cost functions. Now we’ll explore how cost functions may be generated, represented, and change over time in the brain.

Marblestone et al. outline several ways that cost functions could be generated. In particular, they talk about specialized circuitry for comparing a system's predicted output to its desired output, which is necessary for supervised learning and for autoencoder-style unsupervised learning.
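
To make this concrete, here is a minimal sketch (our own toy NumPy code, not anything from the paper) of how the same "compare predicted to desired" machinery can serve both a supervised cost and an autoencoder-style unsupervised cost:

```python
import numpy as np

def supervised_cost(predicted, desired):
    """Supervised case: compare the network's output to an externally provided target."""
    return np.mean((predicted - desired) ** 2)

def autoencoder_cost(inputs, reconstruction):
    """Unsupervised case: the 'desired output' is just the input itself,
    so the same comparison circuitry scores how well the input was reconstructed."""
    return np.mean((inputs - reconstruction) ** 2)
```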

For unsupervised learning, genetically determined biological priors could produce architectures designed to facilitate a successful representation of the world (e.g. object permanence, object immutability). The brain also has access to multiple types of input (e.g. visual, auditory, tactile), which may allow it to learn about the world more efficiently than a vision-only system (see this review paper for examples of multi-modal semantic learning).

Reinforcement learning was inspired by decades of work on behavior and reinforcement in psychology. Marblestone et al. discuss how RL might be further assisted by genetic priors encoded in neural architecture, as well as how social constructs might further learning (e.g. through imitation).

This section (and the paper more generally) is very broad, covering the concept of cost functions from the neuronal level to the more abstract psychological level. We’re going to dig into the arguments at the bottom and at the top.

 

This video of Geoff Hinton explaining why CNNs are the wrong way to do computer vision got us interested in the idea of cortical capsules. Capsules are meant to add extra structure to neural networks: each capsule is a group of neurons that acts as a feature detector, and its output is a vector that is more informative than the single scalar output of an individual neuron.

The paper that explains Hinton’s “capsule” concept can be summed up in a direct quote:

Instead of aiming for viewpoint invariance [a system that can recognize objects regardless of their orientation] in the activities of “neurons” that use a single scalar output to summarize the activities of a local pool of replicated feature detectors, artificial neural networks should use local “capsules” that perform some quite complicated internal computations on their inputs and then encapsulate the results of these computations into a small vector of highly informative outputs.  

This paper shows some interesting qualitative results for capsule learning: the model can generate new images that depict an object from a different viewpoint (see below).

Left: an input stereo image of an object; middle: the generated viewpoint-transformed image; right: the target viewpoint-transformed image. From Hinton, G. E., Krizhevsky, A., & Wang, S. D. (2011).
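
To make the quote above a bit more concrete, here is a toy sketch (our own illustration, not Hinton's implementation) of a capsule as a small module that outputs a presence probability plus a short pose vector instead of a single scalar:

```python
import numpy as np

rng = np.random.default_rng(0)

def capsule(x, W_hidden, W_presence, W_pose):
    """Toy capsule: a shared hidden layer feeds two heads, a scalar probability
    that the capsule's feature is present, and a small pose vector describing
    how (e.g. where) the feature appears in the input."""
    h = np.tanh(x @ W_hidden)
    presence = 1.0 / (1.0 + np.exp(-(h @ W_presence)))  # is the feature there?
    pose = h @ W_pose                                    # where / how does it appear?
    return presence, pose

# Random placeholder weights; in practice these would be learned.
x = rng.normal(size=(1, 64))                 # features from an image patch
W_hidden = 0.1 * rng.normal(size=(64, 32))
W_presence = 0.1 * rng.normal(size=(32, 1))
W_pose = 0.1 * rng.normal(size=(32, 2))      # a 2-D pose, e.g. (dx, dy)
presence, pose = capsule(x, W_hidden, W_presence, W_pose)
```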

A few papers have taken inspiration from the idea of "capsules", including Goroshin et al., which describes capsules whose "representation has a locally stable 'what' component and a locally linear, or equivariant 'where' component". That is, given two temporally adjacent frames of a video, a particular capsule will produce outputs that are also adjacent in the learned latent space. This could be useful for tracking objects as they move through space, for example.
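
A rough sketch of the training signal this implies (our paraphrase with a simple squared-difference penalty, not Goroshin et al.'s actual objective; the `true_shift` supervision here is our own simplification):

```python
import numpy as np

def what_where_loss(what_t, what_t1, where_t, where_t1, true_shift):
    """Penalize change in the locally stable 'what' component across two
    adjacent frames, while asking the 'where' component to shift by the
    amount the object actually moved (equivariance rather than invariance)."""
    stability = np.mean((what_t1 - what_t) ** 2)
    equivariance = np.mean((where_t1 - (where_t + true_shift)) ** 2)
    return stability + equivariance
```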

The idea of capsules and their ability to track objects in a viewpoint-invariant way was inspired by the properties of biological visual systems. The hope is that a more biologically plausible neural system might be able to learn from videos as fast as a biological visual system can.

 

Let’s move one layer up: how do humans learn more complex concepts from sensory input?

A central attribute of human learning is that we chain simple concepts together to help understand complex concepts. But how far back does the chain go? Are our initial, simplest concepts the result of statistical learning as infants, or are some things genetically determined?

As Marblestone et al. discuss, it is quite difficult to learn some of the complex, socially important concepts that we use regularly through statistical learning alone. Furthermore, we know from neuroimaging results that some regions of the brain are specialized to perform certain tasks (e.g. the fusiform face area (FFA), although scholars disagree on what "specialized" means there).

How would we replicate the effect of a genetically predetermined specialization in a cost function? The natural answer would be through strong priors, or even through an explicit restriction of the cost function space. Some example priors could be sparsity (in the sense that the output of an object recognition system should be sparse) and slowness (in the sense that objects generally don't change very quickly over time).
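
A minimal sketch (toy code with made-up weighting constants) of how such priors can be hard-wired into a cost function as extra penalty terms:

```python
import numpy as np

def cost_with_priors(task_loss, features_t, features_t1,
                     sparsity_weight=1e-3, slowness_weight=1e-2):
    """Add two fixed priors to whatever task loss the network is optimizing:
    - sparsity: only a few feature detectors should respond to any given input
    - slowness: feature activations should change little between consecutive
      time steps, since objects rarely change that fast."""
    sparsity = sparsity_weight * np.mean(np.abs(features_t))
    slowness = slowness_weight * np.mean((features_t1 - features_t) ** 2)
    return task_loss + sparsity + slowness
```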

Ullman et al. (2012) attempt to formalize biological priors by creating an algorithm that detects hands, relying on the predetermined and simple "proto-concept" of a "mover event." That is, they strongly bias the model to believe that "movers" are likely to be hands. In their work, a "mover" is defined as a moving object that changes stationary objects. Here is a depiction of a "mover event" taken from the paper:

What they find is that this biasing does in fact improve the data efficiency, and even the generalizability, of their approach. In particular, they see a performance boost when they combine two different object detection algorithms (appearance-based and context-based).
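
In spirit (our paraphrase, not Ullman et al.'s actual code; the region attributes and the blending scheme below are hypothetical), the "mover" proto-concept acts as a strong prior that can be combined with the two learned detectors:

```python
def mover_score(region):
    """Proto-concept: a region counts as a 'mover' if it is itself moving and
    it changes a previously stationary object. `is_moving` and
    `changed_static_object` are hypothetical attributes standing in for the
    motion analysis described in the paper."""
    return 1.0 if (region.is_moving and region.changed_static_object) else 0.0

def hand_score(region, appearance_model, context_model, prior_weight=0.5):
    """Blend the appearance-based and context-based detectors mentioned in the
    text, with the mover proto-concept acting as a strong prior on which
    regions are likely to be hands."""
    learned = 0.5 * appearance_model(region) + 0.5 * context_model(region)
    return (1.0 - prior_weight) * learned + prior_weight * mover_score(region)
```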

Overall I think the idea has merit: it’s one way to get at the data efficiency gap we talked about in our last post. That said, the evidence put forth by Ullman et al. is preliminary. I’d like to see this approach tried for a wider variety of tasks, and for other “proto-concepts”.

If we want machines to learn like humans, we need a more thorough understanding of how humans learn, not just on a neural level but on a more abstract, psychological level. The approach of hard-coding bias into a model worked for Ullman et al., but only because they had a reasonable idea of what bias to code. Note that a "mover event" is a considerably more complex prior than simple slowness in time.

How can we integrate these biases (perhaps assumptions is a better term) into our models? We could stick to a strictly connectionist approach and try to enforce priors on a network, or we could give symbolic approaches another chance. For all you nerds who are unfamiliar with the topic, here's a classic paper on symbolic vs. connectionist approaches, from before deep learning became feasible as an approach to learning problems.

Just as connectionism faded from the limelight back in the '90s, symbolic approaches are definitely considered unfashionable today. But there is something very data-efficient about taking a symbolic approach, and it's possible that the brain does something similar by genetically predetermining certain functions of neural circuits. This reminds us of the central pattern generators involved in much of locomotion: they mechanize some of the simple sequences of actions needed to achieve a goal (e.g. walking). It will be very interesting to see if symbolic AI makes a comeback as we push the limits of deep nets.