What Deep Learning Shares With Little Kids
Deep learning offers a new interpretation of child development.
Posted October 4, 2018
I was standing around at the Turing Institute in London recently and overheard someone say, "It's starting to smell a lot like algorithms in here." That was definitely true. The Turing Institute is fragrant with algorithms.
Part of the freshness of that smell is that deep neural networks, a kind of computing system inspired by the workings of the brain, are changing the way we think about learning. Some of their tricks are not that different from what psychologists have been describing for years. But that's actually good because the world through the eyes of the algorithm is just familiar enough that old ideas start to make sense in new ways.
Take the "gavagai problem," which has been around for years among philosophers and developmental psychologists. Suppose you're traveling abroad and come upon a group of people who speak a language you've never heard of. One points to a cornfield and yells, "semomo!"
You look. There's corn. There's a road. There's some livestock. There's a tractor. What the heck does 'semomo' mean?
Now suppose a little later someone offers you a stick with some fried meat on the end. As they hand it to you they say, "semomo."
You're starting to get the picture now. Semomo perhaps means some kind of animal, maybe some of the livestock you saw in the field. Maybe the sheep.
Child development researchers like this problem because it captures the kind of problem children face when they start to learn language. They call the kind of learning needed to solve it cross-situational learning. The idea is that by hearing the same word in different contexts, one eventually establishes what the word refers to. If someone says semomo and there's no sheep around, you should start to get suspicious.
Cross-situational learning is also a kind of statistical learning. All that is required to learn in an environment like this is to keep track of the statistics between things in the environment and, in this case, the words used to describe them. The brain basically solves an accounting problem. No innate language learning mechanisms are required.
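To make the accounting concrete, here is a minimal sketch in Python of cross-situational learning as plain co-occurrence counting. The two scenes are a hypothetical simplification of the semomo story (the fried meat is simply listed as "sheep"), and the function names are made up for illustration; real models of word learning are considerably richer than this.

```python
from collections import Counter, defaultdict

def cross_situational_counts(episodes):
    """Count how often each word is heard while each object is present."""
    counts = defaultdict(Counter)
    for words, objects in episodes:
        for word in words:
            # set() so an object counts at most once per scene
            counts[word].update(set(objects))
    return counts

def best_guess(counts, word):
    """Return the object(s) seen most often alongside the word."""
    tally = counts[word]
    if not tally:
        return []
    top = max(tally.values())
    return sorted(obj for obj, n in tally.items() if n == top)

# Hypothetical scenes: (words heard, things in view)
episodes = [
    (["semomo"], ["corn", "road", "sheep", "tractor"]),  # the cornfield
    (["semomo"], ["stick", "sheep"]),                    # the fried meat
]

counts = cross_situational_counts(episodes)
print(best_guess(counts, "semomo"))
```

After the first scene alone, every object in the field is an equally good candidate. The second scene is what breaks the tie.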
Deep learning algorithms appear to learn information in much the same way. A prominent theory that is seeing some pickup lately is called the information bottleneck theory. (Naftali Tishby is one of the most vocal proponents of the theory and he describes it well in his YouTube video.)
The basic idea is that if you are trying to create a mapping between two objects, like the word semomo and sheep, then what an optimal algorithm needs is a way to determine what is relevant about all the situations that contain sheep. Relevant in this case means they still predict the word semomo. Though the algorithm doesn't know it at first, through a process of filtering out the unwanted information, it eventually figures out that semomo means sheep, not field, corn, or blue sky.
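For readers who want the math, the information bottleneck is usually written as a trade-off between two quantities. In the standard formulation, X is the input (here, everything in the scene), Y is the thing to be predicted (the word semomo), and T is the compressed internal representation. Mapping X and Y onto the semomo example is my own gloss, not something taken from the paper.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

Here I(·;·) is mutual information. The first term rewards throwing away detail about the scene, the second rewards keeping whatever still predicts the word, and the parameter β sets how much prediction matters relative to compression.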
This is an advance on information theory as initially developed by Claude Shannon. Shannon did not include anything about semantics or coherence in his formulation of information. His main contribution was the reduction of information to 0s and 1s and the mathematical formulation for figuring out how much information a message has. This forms the basis of modern computing but it doesn't exactly solve the gavagai problem.
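Here is a small sketch of what Shannon's measure does and does not capture. The function below (its name is made up for illustration) estimates the entropy of a message in bits per symbol from its own character frequencies, which is a simplifying assumption; Shannon's formula is defined over the probabilities of the source, not of one short string.

```python
import math
from collections import Counter

def entropy_bits(message):
    """Shannon entropy in bits per symbol, using the message's own
    character frequencies as the probability estimates."""
    counts = Counter(message)
    total = len(message)
    # H = sum over symbols of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(entropy_bits("semomo"))
print(entropy_bits("omosem"))  # same letters, scrambled
```

Both calls print the same number (up to floating-point rounding), because the formula only looks at how often each symbol occurs. Nothing in it knows that one string is a word for sheep and the other is nonsense.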
Tishby, along with his collaborators Noga Zaslavsky and Ravid Shwartz-Ziv, describes what deep neural networks do as a process of fitting followed by compression. During the fitting phase, the network learns to label training data (such as data from a series of images). During compression, the network sheds information about the input that isn't relevant to the labels, which is what helps it perform well on new data. (Their advance was to mathematically derive the optimal limit for compression in a deep neural network and then to show experimentally that the networks they studied approach this limit.)
Children do this as well. Children are excellent at learning that a word like 'horse' is the right word for the horse picture in their animal book. Then they proceed to use this word to label all four-legged animals: dogs, cats, cows, and so on. This is called overgeneralization. Over time, though, children learn that 'horse' has a more specific meaning. This sounds a lot like Tishby's compression phase.
So learning in deep neural networks shares at least a few things with the way children learn. It probably shares a lot of things, and it's probably not just children it shares them with. Adults often overgeneralize as they learn what a new concept means. They learn a term like 'cognitive dissonance' and start to see it everywhere, whether it's there or not. Overgeneralizing bad theories is precisely what good scientists try to do. As Feynman put it, "Science is the belief in the ignorance of experts." That quote feels a little dangerous right now, but suffice it to say that scientists are experts at correcting their errors. And they do this by purposefully making them. The strength of deep neural networks is that they seem to be able to learn from their mistakes a lot faster than humans can.
Shwartz-Ziv, R., & Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.