The resurgence of artificial intelligence (AI) is largely due to advances in pattern recognition made possible by deep learning, a form of machine learning that does not require explicit hard-coding. The architecture of deep neural networks is loosely inspired by the biological brain and neuroscience. As with the biological brain, exactly why deep networks work remains largely unexplained, and there is no single unifying theory. Recently, researchers at the Massachusetts Institute of Technology (MIT) revealed new insights into how deep learning networks work, helping to further demystify the black box of AI machine learning.
The MIT research trio of Tomaso Poggio, Andrzej Banburski, and Qianli Liao at the Center for Brains, Minds, and Machines developed a new theory as to why deep networks work and published their study on June 9, 2020 in PNAS (Proceedings of the National Academy of Sciences of the United States of America).
The researchers focused their study on the approximation by deep networks of certain classes of multivariate functions that avoid the curse of dimensionality, a phenomenon in which the number of parameters needed to reach a given accuracy grows exponentially with the dimension. Frequently in applied machine learning, the data is high-dimensional. Examples of domains with high-dimensional data include facial recognition, customer purchase history, patient healthcare records, and financial market analysis.
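To get a feel for that exponential dependence, here is a minimal sketch that counts how many small cells are needed to tile a unit cube at a fixed resolution as the dimension grows. The function name `grid_cells` and the resolution of 0.1 are illustrative assumptions, not taken from the study.

```python
def grid_cells(dim: int, resolution: float) -> int:
    """Number of cells of side `resolution` needed to tile the unit cube [0, 1]^dim."""
    per_axis = round(1 / resolution)  # cells along a single axis
    return per_axis ** dim            # the count multiplies once per extra dimension


# The count grows exponentially with the dimension at a fixed resolution.
for dim in (1, 2, 10, 100):
    print(f"dimension {dim:>3}: {grid_cells(dim, 0.1)} cells")
```

At a resolution of 0.1, each added dimension multiplies the number of cells by ten, which is why a naive approach that samples the input space evenly quickly becomes infeasible.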
The depth in deep networks refers to the number of computational layers: the more computational layers, the deeper the network. To formulate their theory, the team examined deep learning’s approximation power, dynamics of optimization, and out-of-sample performance.
In the study, the researchers compared deep and shallow networks that both used identical sets of procedures, such as pooling, convolution, linear combinations, a fixed nonlinear function of one variable, and dot products. The question they asked: why do deep networks have such great approximation power, and tend to achieve better results than shallow networks, given that both are universal approximators?
The scientists observed that for convolutional deep neural networks with hierarchical locality, the exponential cost of approximation vanishes and the dependence on dimension becomes roughly linear. They then demonstrated that the curse of dimensionality can be avoided by deep networks of the convolutional type for certain types of compositional functions. The implication is that for problems with hierarchical locality, such as image classification, deep networks are exponentially more powerful than shallow networks.
“In approximation theory, both shallow and deep networks are known to approximate any continuous functions at an exponential cost,” the researchers wrote. “However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality.”
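A compositional function with hierarchical locality can be pictured as a binary tree in which every node combines only two neighboring values. The sketch below builds such a function and compares two order-of-magnitude unit counts. The rates in `shallow_units` and `deep_units` are assumptions of the kind reported in the approximation-theory literature for this setting (constants ignored, smoothness `m` set to 1 by default); all function names are hypothetical and are not the authors' code.

```python
def compose_binary_tree(inputs, h):
    """Evaluate a function with hierarchical locality.

    Each node `h` sees only two neighboring values, and every layer halves
    the width. Assumes len(inputs) is a power of two.
    """
    values = list(inputs)
    while len(values) > 1:
        values = [h(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return values[0]


def shallow_units(dim, eps, smoothness=1):
    # Assumed rate for a generic function of `dim` variables: eps^(-dim/m).
    return eps ** (-dim / smoothness)


def deep_units(dim, eps, smoothness=1):
    # Assumed rate for a binary-tree compositional function: (dim - 1) * eps^(-2/m).
    return (dim - 1) * eps ** (-2 / smoothness)


# A toy 8-variable compositional function built from a 2-variable node.
print(compose_binary_tree([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a * b + 1))

for dim in (8, 16, 32):
    print(dim, f"shallow ~ {shallow_units(dim, 0.1):.3g}", f"deep ~ {deep_units(dim, 0.1):.3g}")
```

Under these assumed rates, the shallow count is exponential in the number of inputs, while the deep count grows only with the number of tree nodes.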
The team then set out to explain why deep networks, which tend to be over-parameterized, perform well on out-of-sample data. The researchers demonstrated that for classification problems, given a standard deep network trained with gradient descent algorithms, it is the direction of the weights in parameter space that matters, rather than their norm or size.
“In characterizing minimization of the empirical exponential loss we consider the gradient flow of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to normalized networks,” the coauthors wrote. “The dynamics of normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem.”
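One way to see why direction matters more than norm in classification: in a bias-free ReLU network, multiplying every weight by the same positive factor rescales the output but never flips its sign, so the predicted class stays the same. The following is a minimal sketch with a made-up two-layer network. The helper names and weight values are illustrative assumptions, not from the study.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def network(weights, x):
    """Bias-free ReLU network; the last layer is linear with a scalar output."""
    h = x
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)[0]


def rescale(weights, factor):
    """Multiply every weight in every layer by the same positive factor."""
    return [[[factor * w for w in row] for row in W] for W in weights]


weights = [
    [[0.5, -1.0], [1.5, 0.25]],  # layer 1: 2 inputs -> 2 hidden units
    [[1.0, -2.0]],               # layer 2: 2 hidden units -> 1 output
]
x = [1.0, 2.0]

base = network(weights, x)
scaled = network(rescale(weights, 3.0), x)

# The output is multiplied by 3 ** (number of layers); its sign is unchanged.
print(base, scaled, (base > 0) == (scaled > 0))
```

Because the sign of the output decides the class in a binary problem, only the normalized weights affect the decision, which is consistent with the focus on weight directions in the quote above.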
The implication is that the dynamics of gradient descent on deep networks are equivalent to those of training with an explicit constraint on the norm of the parameters, and that gradient descent converges to the max-margin solution. The team noted a parallel with linear models, in which gradient descent converges to the pseudoinverse solution, that is, the solution with the minimum norm.
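The linear case can be checked directly. The sketch below runs plain gradient descent on an underdetermined least-squares problem (two equations, three unknowns) and compares the result with the pseudoinverse solution. It illustrates only the linear analogue, not the deep-network result, and the data, learning rate, and function names are illustrative assumptions.

```python
def gradient_descent(X, y, w0, lr=0.1, steps=2000):
    """Minimize 0.5 * ||X w - y||^2 with plain gradient descent."""
    w = list(w0)
    for _ in range(steps):
        residual = [sum(xi * wi for xi, wi in zip(row, w)) - yi for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(residual, X)) for j in range(len(w))]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def min_norm_solution(X, y):
    """Pseudoinverse solution X^T (X X^T)^{-1} y for a 2-row X of full row rank."""
    a = sum(v * v for v in X[0])
    b = sum(u * v for u, v in zip(X[0], X[1]))
    d = sum(v * v for v in X[1])
    det = a * d - b * b
    alpha0 = (d * y[0] - b * y[1]) / det
    alpha1 = (a * y[1] - b * y[0]) / det
    return [alpha0 * u + alpha1 * v for u, v in zip(X[0], X[1])]


def norm(w):
    return sum(v * v for v in w) ** 0.5


X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # 2 equations, 3 unknowns
y = [2.0, 3.0]

w_star = min_norm_solution(X, y)
w_zero = gradient_descent(X, y, w0=[0.0, 0.0, 0.0])   # start at the origin
w_other = gradient_descent(X, y, w0=[1.0, 1.0, 1.0])  # start elsewhere

print("pseudoinverse:", w_star, "norm", norm(w_star))
print("GD from zero: ", w_zero, "norm", norm(w_zero))
print("GD from ones: ", w_other, "norm", norm(w_other))
```

Started from zero, gradient descent lands on the minimum-norm solution without any explicit penalty on the weights. Started elsewhere, it still fits the data exactly but ends at a solution with a larger norm.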
In effect, the team posits that the act of training deep networks serves to provide implicit regularization and norm control. The scientists attribute the ability of deep networks to generalize, without explicit capacity controls such as a regularization term or a constraint on the norm of the weights, to the mathematical computation that shows the unit vector (computed from the solution of gradient descent) remains the same, whether or not the constraint is enforced during gradient descent. In other words, deep networks select minimum norm solutions, hence the gradient flow of deep networks with an exponential-type loss locally minimizes the expected error.
“We think that our results are especially interesting because they represent a potential explanation for one of the greatest puzzles that has emerged from the field of deep learning, that is, the unreasonable effectiveness of convolutional deep networks in a number of sensory problems,” wrote the researchers.
Through the interdisciplinary combination of applied mathematics, statistics, engineering, cognitive science, and computer science, MIT researchers developed a theory on why deep learning works that may enable the development of novel machine learning techniques and accelerate artificial intelligence breakthroughs in the future.
Copyright © 2020 Cami Rosso All rights reserved.