Neural networks have a rather uncanny knack for turning meaning into numbers. Data flows from the input to the output, getting pushed through a series of transformations which process the data into increasingly abstruse vectors of representations. These numbers, the activations of the network, carry useful information from one layer of the network to the next, and are believed to represent the data at different layers of abstraction. But the vectors themselves have thus far defied interpretation.
In this blog post I put forward a possible interpretation of these vectors. I argue we shouldn't take these vectors literally, but rather as an encoding for a simpler, sparse data structure. This gives rise to a simple technique (the k-SVD) for reverse engineering this data structure, and gives us the tools to decode the vectors' meaning.
Applying this trick to a Variational Autoencoder, trained on a dataset of faces, produces a decomposition of this face into its bare components
and applying it to the image captioning system NeuralTalk2 yields a breakdown of a sentence into the sum of a few simple sentence fragments. This image gets captioned as
Let's restrict our attention to a common pattern in neural network design. A little old school, perhaps, but still elegant, this pattern combines two powerhouses of deep learning, an encoder and a decoder, via composition to produce our net,
The encoder is the lens through which we see the data. It sees the raw data, and through a sequence of linear transformations and activations, distills all of its salient information into an incomprehensible array of numbers (one typically smaller than the input size). The decoder takes this vector, and through another poorly understood process, interprets it, returning the desired output. This formula is simple, but effective. A good number of papers in deep learning are a result of finding such fruitful combinations.
[Figure: example encoder/decoder architectures. Panel labels: Lamb et al.; Kingma et al.; Image Captioning (Vinyals et al., NeuralTalk2); Image Synthesis (Reed et al.); Sutskever et al.; Vinyals et al.]
The defining characteristic of these architectures is the information bottleneck. This bottleneck is the output of the encoder and the input of the decoder, and is illustrated by the solid black rectangle in the above diagrams. At least in principle, the information bottleneck forces a compression of the input data by mapping it into a lower dimension, so that only its meaningful dimensions of variation are captured - and the rest, 'noise', is discarded.
This vector has been called, by various people, an "embedding", a "representational vector" or a "latent vector". But Geoff Hinton, in a stroke of marketing genius, gave it the name "thought vector".
Thought vectors have been observed empirically to possess some curious properties. The most fascinating among these is known colloquially as "Linear Structure" - the observation that certain directions in thought-space can be given semantic meaning.
White observes the following property, for example. Take an autoencoder trained on faces. First we identify (by hand) a few images containing a certain qualitative feature - a smile, for example. We then encode them and average the results to get a single vector.
This output, which White calls the smile vector, can be interpreted as a code for an abstract smile. We can visualize this vector by looking at its output on the decoder.
The background of this image is grey - a clue about its indifference to background color. And though the face is vaguely feminine, it is also generic, apart from an exaggerated, toothy grin. Adding this vector to any thought vector makes the decoder's output smile.
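As a rough sketch of the recipe just described, here is the averaging and adding step in NumPy. The names `attribute_vector` and `add_attribute` are my own, and the random arrays are stand-ins for what an encoder would produce on hand-picked smiling faces; nothing here comes from White's code.

```python
import numpy as np

def attribute_vector(codes):
    """Average a stack of thought vectors (one per row) into a single direction."""
    return np.asarray(codes, dtype=float).mean(axis=0)

def add_attribute(thought, direction, strength=1.0):
    """Shift a thought vector along an attribute direction."""
    return np.asarray(thought, dtype=float) + strength * direction

# Stand-in data (assumption): each row plays the role of encoder(image)
# for one hand-picked smiling face.
rng = np.random.default_rng(0)
smiling_codes = rng.normal(size=(16, 8))
smile = attribute_vector(smiling_codes)

# Any other thought vector can be nudged toward "smiling"; the result
# would then be handed to the decoder.
other_face = rng.normal(size=8)
smiling_face = add_attribute(other_face, smile)
```

The `strength` parameter is an addition of mine; it only scales how far the vector is pushed along the direction.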
These vectors are used to great effect for image processing in Upchurch et al.
There is a very straightforward line of thought which follows from the observation of linear structure. Taking the dogma that directions are meanings to its logical conclusion, it's a small leap to conjecture that the whole thought vector is nothing more than a sum of these directions. Let's refer to these directions as atoms. If the input were a picture of a man with short hair wearing sunglasses, for example, a decomposition might look like
If we had access to the full complement of atoms (now numbered) in , we can write
where is small. In this interpretation of a thought vector, the 's themselves have no meaning - they just need to, collectively, satisfy a few simple mathematical properties. In essence, they just need to be "different enough". Different enough that its presence in the sum can be reliably determined.
The fact that detecting these atoms is even possible may seem counterintuitive. Adding numbers together is typically an irreversible operation. Say if we add , there's no way to recover from ( works just as well). The story is different, however, in higher dimensions. If the 's were orthogonal, we can check for 's presence by probing the encoding on the left by taking the dot product with .
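To make the orthogonal case concrete, here is a small NumPy sketch. The atoms are the columns of a random orthogonal matrix (obtained from a QR factorization), which is purely an illustrative choice of mine.

```python
import numpy as np

# Four orthonormal atoms in R^4: the columns of an orthogonal matrix.
rng = np.random.default_rng(0)
atoms, _ = np.linalg.qr(rng.normal(size=(4, 4)))

# Encode: a weighted sum that uses only the second and fourth atoms.
weights = np.array([0.0, 1.5, 0.0, -2.0])
encoding = atoms @ weights

# Decode: the dot product with each atom recovers its weight,
# because atoms.T @ atoms is the identity.
recovered = atoms.T @ encoding
present = np.abs(recovered) > 1e-8
```

Here `present` flags exactly the atoms that went into the sum, which is the sense in which the addition is reversible.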
This allows for the storage of atoms in a vector of size . This is rather obvious, but bear with me. We can in fact store far more atoms than if is sparse. By the magic of sparse recovery, we can store as many as atoms in a humble vector of length nondestructively. needs to satisfy a few technical properties, of course, but what is critical is that (the number of nonzero entries in ) is small. The sparser is, the more atoms we can handle. Think of this as a tradeoff between storing information in and storing information in the nonzero locations of .
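Here is a small sketch of this kind of recovery. It packs 128 atoms into a 64-dimensional vector and recovers a 3-sparse combination with orthogonal matching pursuit (OMP). Both the dictionary (the identity basis joined with a normalized Hadamard basis) and the choice of OMP are my own assumptions for illustration, picked because this pair of bases has low mutual coherence.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H

def omp(D, y, k):
    """Orthogonal matching pursuit: greedily pick k columns of D to explain y."""
    residual, support, coef = y.copy(), [], np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with what is still unexplained.
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        # Refit all chosen atoms jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

d = 64
# 128 unit-norm atoms living in only 64 dimensions.
D = np.hstack([np.eye(d), hadamard(d) / np.sqrt(d)])

# A sparse code: three active atoms out of 128.
x_true = np.zeros(2 * d)
x_true[[5, 70, 100]] = [1.0, -2.0, 0.5]

y = D @ x_true            # the dense 64-dimensional encoding
x_hat = omp(D, y, k=3)    # recovered sparse code
```

The encoding `y` has no zero entries to speak of, yet the three active atoms and their weights can be read back out of it.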
This is more than an information theoretic trick, however - it's also a natural way of representing information. Think of this as a tagging system, where each nonzero entry represents a tag. Every data point can be tagged with only a few tags from a list (e.g. sunglasses, facial hair, blonde, etc.), which can be large. Each tag has its own rating, but it is the tags themselves, not the individual ratings, which contain the salient information. This is the basis of sparse coding.
Sparse structure has already been found in some shallow embedding systems, such as word2vec, by Arora et al. And if such sparse structure exists in deep networks, we can derive a simple two-step process to interpret thought vectors.
The second step in this process can be achieved heuristically with the k-SVD. For these experiments, I used pyksvd.
Once we have the dictionary, we can decompose any output of the encoder (even those not in the training set) via sparse coding
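For readers who want to see the moving parts, here is a from-scratch sketch of the k-SVD idea in NumPy. It is not the pyksvd API, and the function names `omp` and `ksvd`, the initialization from data columns, and the toy data are all assumptions of mine. It alternates between sparse coding every thought vector against the current dictionary and updating each atom with a rank-one SVD of the residual it is responsible for.

```python
import numpy as np

def omp(D, y, k, tol=1e-10):
    """Greedy sparse coding: explain y with at most k columns of D."""
    residual, support, coef = y.astype(float), [], np.zeros(0)
    for _ in range(k):
        if np.linalg.norm(residual) < tol:
            break
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = 0.0        # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    if support:
        x[support] = coef
    return x

def ksvd(Y, n_atoms, k, n_iter=10, seed=0):
    """Y is (dim, n_samples). Returns D (dim, n_atoms) and codes X (n_atoms, n_samples)."""
    rng = np.random.default_rng(seed)
    # Initialise the atoms from randomly chosen, normalised data columns.
    D = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        # Sparse coding step: code every sample against the current dictionary.
        X = np.column_stack([omp(D, y, k) for y in Y.T])
        # Dictionary update step: refit one atom at a time.
        for j in range(n_atoms):
            users = np.flatnonzero(X[j])
            if users.size == 0:
                continue
            # Residual of the samples using atom j, with atom j's part added back.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]
            X[j, users] = s[0] * Vt[0]
    return D, X

# Toy stand-in for a matrix of thought vectors: 40 samples in 8 dimensions,
# each built from 2 of 12 hidden atoms.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 12))
codes = np.zeros((12, 40))
for col in range(40):
    codes[rng.choice(12, 2, replace=False), col] = rng.normal(size=2)
Y = hidden @ codes
D, X = ksvd(Y, n_atoms=12, k=2, n_iter=5)
```

Decomposing a new vector afterwards is a single call to `omp(D, y, k)` with the learned `D`.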
My first experiment will be on a variational autoencoder designed by Lamb et al., trained on the Celeb-A dataset of faces. The thought vector is a element vector, and the neural net does a noble job of capturing most of the dimensions of variation in the images.
Using a dictionary of total atoms, with each element a sum of sparse atoms, I can produce a pretty respectable reconstruction of most of the faces. The atoms are signed, and the positive atoms look like this and the negative atoms look like this. The reconstructions can be visualized incrementally, by finding the best reconstruction for increasing values of .
Notice how the reconstructor works much like an artist does. First it paints a fairly generic face in broad strokes. Then it fills in small local details which make the face recognizable. Here is another example
On some of the more generic faces, the reconstruction is pretty decent even at . This allows us to visualize each atom's exact contribution to the overall picture.
True to our hypothesis, each vector in the dictionary has meaning. Like the smile vector, adding these vectors to existing images produces interesting effects. The usual suspects are all present, corresponding to
Facial Hair and Accessories
and Facial Expressions.
But there are also a few interesting surprises. Like the following atoms for
or a whole slew of atoms dedicated to modeling the Forehead and Hair.
And we can run this entire thing in reverse. This provides a way of querying our training set, to find, for example, all faces which are lit strongly from the front
people wearing sunglasses
and fans of heavy metal music.
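Running things in reverse amounts to a lookup over the sparse codes of the training set. A minimal sketch, assuming the codes are stored as a matrix with one row per atom and one column per training image (the name `query_by_atom` and the toy numbers are mine):

```python
import numpy as np

def query_by_atom(X, atom, top=5):
    """Indices of the samples that use `atom` most strongly, largest weight first."""
    weights = X[atom]
    order = np.argsort(-weights)
    # Keep only samples whose code actually contains the atom (positive weight).
    return [int(i) for i in order[:top] if weights[i] > 0]

# Toy codes: 3 atoms (rows) by 5 training samples (columns).
X = np.array([
    [0.0, 2.0, 0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
hits = query_by_atom(X, atom=0)
```

For signed atoms one would presumably run the same query on the negated row to find the images on the other side of the direction.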
We move on from interpolating images to interpolating sentences. Here we investigate the image captioning system NeuralTalk2. This network is a hybrid of a convolutional net (the VGGNet) and a recurrent net, fine-tuned to the captioning task on the COCO dataset. This network has a thought vector of size . The network takes a picture, turns it into a thought vector, and turns that vector into a sentence using a recurrent neural net. The captions look like this:
(1) a woman is holding a cat in a kitchen
(2) a couple of people walking down a street holding umbrellas
(3) a group of four different types of computer equipment
(4) a man in a suit and tie standing in a room
which range in quality from the technically correct to pretty good. A conspicuous mistake the system often makes is the omission of certain critical, but somewhat out of place elements of a picture.
The captioning system does marvelously on (1) and (2), picking up on subtle cues that the woman is holding a dog, and is in a kitchen. But where is the knife in (3), and the lego figurines in (4)? Our method of analysis provides a means to debug such problems. Surprisingly, using only a sparsity of and a dictionary size of was enough.
Let us visualize these thought vectors. Our language model does not generate a single sentence, but a probability distribution over possible sentences. So we visualize its output in the form of a dialog tree. Each path from the root to a leaf represents a sentence, with its probability the product of the thicknesses of the respective edges. I generated these figures by sampling from this distribution a few times, and combining the data in a trie.
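The trie construction can be sketched in a few lines of plain Python. The sample sentences below are made up for illustration, and the helper names are mine. Each edge stores how many sampled sentences passed through it, and the probability of a path is the product of the conditional edge frequencies along it.

```python
def build_trie(sentences):
    """Nested-dict trie: each node maps a word to [count, child_node]."""
    root = {}
    for sentence in sentences:
        node = root
        for word in sentence.split():
            entry = node.setdefault(word, [0, {}])
            entry[0] += 1
            node = entry[1]
    return root

def path_probability(root, sentence, total):
    """Product of the conditional edge frequencies along one path from the root."""
    prob, node, parent_count = 1.0, root, total
    for word in sentence.split():
        count, child = node[word]
        prob *= count / parent_count
        node, parent_count = child, count
    return prob

# Hypothetical samples drawn from the language model.
samples = [
    "a woman holding a cat",
    "a woman holding a cat",
    "a woman holding a dog",
    "a dog in a kitchen",
]
trie = build_trie(samples)
p = path_probability(trie, "a woman holding a cat", total=len(samples))
```

In this toy example `p` comes out to one half, matching the two out of four samples that follow that path. The sketch has no end-of-sentence marker, so a sentence that is a prefix of another would need one.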
Image (1) produces the following output.
I've taken the liberty of giving them my own labels to facilitate navigation. The output of the algorithm is an excellent synthesis of the concepts of "dog", "woman at counter", "woman in front of cake", and surprisingly, the verb/noun combo "holding a cat". We can visualize each atom by looking at examples of images whose thought vectors contain these elements.
Holding a cat
Girl/Woman and Cake
Woman at Counter
Image (2) consists of two meaningful atoms (I have omitted two atoms with small weights which were misclassifications). The first corresponds to "a black and white picture", and the other to "man with an umbrella".
And true to form, we can look at all the other pictures which contain these atoms
Black and White Photo of
People with Umbrellas
Image (3) produces the following output.
The keyboard is detected, of course. But surprisingly, so is the presence of the lego figures, as "skateboarders". Failing to combine these two elements, the language model simply chose to omit what wasn't convenient to articulate.
Image (4) demonstrates the same problem.
The atoms are visualized here:
This "failure to synthesize" is surprisingly common. Taking linear combinations of two unrelated sentences, for example, would often result in the outputs interpolating in a discontinuous manner. Usually one or the other would dominate the output depending on their relative weights. This "bug" can be interpreted as a feature, however. The language model's resistance to combining two semantically alien ideas seems to be an emergence of a kind of "common sense" - but this sensibility comes at the expense of a more free-form creativity.
Here is another example of this behavior. Take this atom for statue. We can combine it with other atoms
But the language model itself dictates how the sentences are merged. You might want the language model to spit out "a statue of a man riding a snowboard". But the language model actually returns "a statue of a man riding a horse". In a way, the model can be forgiven for this - the model has never seen a statue of a man on a snowboard, and so is reluctant to caption it so. The model has an internal idea of what statues are.
The dictionary elements are not restricted to nouns. It permits certain modifiers too. This element stands for "a group of".
and here a similar atom seems to be able to count up to 4.
Rather curiously, it turns "airplanes" into "knives". I do not understand why this happens.
A final note. Unlike the previous model, the dictionary elements produced by k-SVD are largely unsigned. Though some negative vectors do bleed into the 's they are generally small in magnitude and don't seem to have meaningful interpretations. This seems to be a consequence of the thought vectors being taken after a activation - forcing the entire vector to be positive. Since never sees a negative vector, the entirety of is "dead space", and the atoms can only interact constructively.
The final question that should be asked is why this structure should even exist in the first place. How does this structure emerge from training? And how does the decoder work?
Identifying sparse elements in a thought vector may not be as difficult a task as it initially seems. Given the right conditions on the dictionary, it can be done quite efficiently by solving the convex sparse coding problem:
This is pretty encouraging. It has been hypothesized by Gregor et al. that the decoder might be implementing an unfolded sparse coding algorithm, at least for a single iteration. Perhaps this theory can be confirmed by correlating various constellations of activations to the atoms of our dictionary. And perhaps there's a possibility we can read right off the decoder.
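To give a feel for what "an unfolded sparse coding algorithm" could look like, here is a sketch of iterative soft thresholding (ISTA), a standard solver for an l1-penalized least squares problem. I am assuming this is the kind of algorithm meant; the objective in the docstring, the symbols `D`, `y`, `lam`, and the toy data are my own choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=200):
    """Minimise 0.5 * ||D x - y||^2 + lam * ||x||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        # Gradient step on the quadratic term, then shrinkage for the l1 term.
        x = soft_threshold(x - step * D.T @ (D @ x - y), step * lam)
    return x

# Toy problem (assumption): 50 random unit-norm atoms in 20 dimensions.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(50)
x_true[[3, 17]] = [2.0, -1.5]
y = D @ x_true

x_hat = ista(D, y, lam=0.05, n_iter=500)
one_step = ista(D, y, lam=0.05, n_iter=1)
```

Starting from zero, a single iteration is just a linear map of `y` followed by an elementwise shrinkage, which is why one unfolded step looks so much like one layer of a network.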
The former riddle is more difficult to answer. And it breaks down into a bevy of minor mysteries when probed. Is this structure specific to certain neural architectures (perhaps those which use activations)? Or does it come from the data? Was this structure discovered automatically, or were the assumptions of sparsity hidden in the network structure? Does sparse structure exist in all levels of representation, or only encoder/decoder networks? Is sparse coding even the true model for the data, or is this just an approximation to how the data is really represented? But lacking any formal theory of deep learning, these questions are still open to investigation. I hope to have convinced you, at least, that this is an avenue worth investigating.
Thanks to http://distill.pub for the inspiration in design and a certain yellow icon.