Revolutionary AI Algorithm Speeds Up Deep Learning on CPUs

Rice University and Intel's SLIDE runs fast on CPUs, no GPUs required.

Posted Mar 04, 2020

Source: PublicDomainPictures/Pixabay

This week at the MLSys Conference in Austin, Texas, researchers from Rice University, in collaboration with Intel Corporation, announced a breakthrough deep learning algorithm called SLIDE (sub-linear deep learning engine) that can rapidly train deep neural networks on CPUs (central processing units) and, in the researchers' tests, outperform GPUs (graphics processing units).

This new deep learning technique is a potential game-changer not only for the hardware and AI software industries, but also for any organization using deep learning. Understanding why requires some background on the role of GPUs in artificial intelligence (AI). The CPU is the brain of the computer, where general-purpose calculations are performed. GPUs have far more cores than CPUs, although each core is simpler. CPUs are better at processing single, complex computations sequentially, such as parsing or interpreting code logic, whereas GPUs are better at processing many simpler computations in parallel.

GPUs are well-suited to training deep neural networks with backpropagation, short for the backward propagation of errors. In 1986, Geoffrey Hinton and his research colleagues popularized the concept of using backpropagation through networks of neuron-like units. In backpropagation, the weights of connections in the network are adjusted in a way that minimizes the difference between the actual output vector of the net and the desired output vector. Given an error function and an artificial neural network, it works by calculating the gradient of the error function with respect to the network's weights. The gradient is calculated backward through the layers: the computation starts with the final layer, works through the intermediate layers, and ends with the first layer. It relies on matrix multiplication, which GPUs handle well.
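
To make the backward pass concrete, here is a minimal sketch in plain Python. It is not taken from the SLIDE paper or from any library: the tiny two-layer network, the sigmoid hidden layer, the squared-error function, and every name in it (`forward`, `backward`, `W1`, `W2`) are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Forward pass: input -> hidden layer (sigmoid) -> output layer (linear)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sum(w * hj for w, hj in zip(row, h)) for row in W2]
    return h, y

def loss(y, target):
    """Half the squared error between the actual and desired output vectors."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))

def backward(x, target, W1, W2):
    """Return (dL/dW1, dL/dW2), computed from the final layer back to the first."""
    h, y = forward(x, W1, W2)
    # Final layer: the error signal is (actual output - desired output).
    delta_out = [yi - ti for yi, ti in zip(y, target)]
    grad_W2 = [[d * hj for hj in h] for d in delta_out]
    # Hidden layer: push the error back through W2, then through the sigmoid.
    delta_hid = []
    for j, hj in enumerate(h):
        back = sum(delta_out[k] * W2[k][j] for k in range(len(W2)))
        delta_hid.append(back * hj * (1.0 - hj))
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    return grad_W1, grad_W2

# Hypothetical toy network: 2 inputs, 3 hidden units, 1 output.
x = [0.5, -1.0]
target = [1.0]
W1 = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]
W2 = [[0.7, -0.8, 0.9]]
grad_W1, grad_W2 = backward(x, target, W1, W2)
```

Each weight would then be nudged against its gradient to reduce the error. In a real network, the nested loops above become large matrix multiplications, which is the part GPUs accelerate.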

The current global surge in artificial intelligence (AI) across industries and sectors is mainly due to the improved pattern-recognition capabilities of deep learning, a subset of machine learning. Deep neural networks have contributed to progress in computer voice, speech, and image recognition. Another major contributing factor in the AI boom is the availability of big data to train deep learning algorithms. But processing large amounts of data can be time-consuming and costly. With the rise of computer gaming came the massively parallel processing power of GPU technology. It is largely owing to the confluence of deep learning, big data, and GPUs that the AI winter has thawed.

The research paper presented at the MLSys Conference by the researchers from Rice University and Intel Corporation describes an alternative to conventional backpropagation training built on full matrix multiplication, opening the door to faster, cheaper deep learning using CPUs instead of more expensive GPUs.

The researchers recast neural network training as a search problem that can be solved using hash tables, data structures that map keys to values. Instead of full matrix multiplication, they used locality-sensitive hashing (LSH), a method that hashes similar input items into the same buckets with high probability. For each input, a hash lookup picks out a small set of neurons likely to be strongly activated, so only those neurons need to be computed and updated. Rather than use PyTorch or TensorFlow, the researchers wrote their algorithm in C++, an object-oriented programming language often used for embedded firmware, client-server applications, system and application software, and drivers.
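
As a rough illustration of how locality-sensitive hashing can turn "which neurons matter for this input?" into a hash-table lookup, here is a minimal sketch in plain Python. It is not the researchers' C++ code. It assumes one common LSH family, signed random projections, and a single hash table. The article does not say which hash functions or how many tables SLIDE uses, and every name here (`make_hyperplanes`, `lsh_key`, `build_table`, `active_neurons`) is hypothetical.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_bits, dim, seed=0):
    """Random hyperplanes for signed-random-projection LSH (an assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_key(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )

def build_table(neuron_weights, hyperplanes):
    """Hash table mapping an LSH key to the ids of neurons whose weights hash to it."""
    table = defaultdict(list)
    for neuron_id, w in enumerate(neuron_weights):
        table[lsh_key(w, hyperplanes)].append(neuron_id)
    return table

def active_neurons(x, table, hyperplanes):
    """Ids of neurons sharing the input's bucket: candidates for high activation."""
    return table.get(lsh_key(x, hyperplanes), [])

# Hypothetical layer: 1000 neurons with 16 weights each.
rng = random.Random(42)
dim = 16
neuron_weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(1000)]
hyperplanes = make_hyperplanes(num_bits=6, dim=dim)
table = build_table(neuron_weights, hyperplanes)

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
candidates = active_neurons(x, table, hyperplanes)
print(f"{len(candidates)} of {len(neuron_weights)} neurons selected")
```

Vectors that point in similar directions tend to land on the same side of most hyperplanes and so share a bucket. The lookup therefore returns a small candidate set without computing every neuron's activation. More hash bits make buckets smaller and more selective.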

What is innovative about SLIDE is that it is data parallel: two data instances can be processed independently and at the same time. For example, if SLIDE is trained on pictures of a pedestrian and a stop sign, the two data instances would likely excite different nodes in the artificial neural network, and the algorithm can process them independently. SLIDE uses batch gradient descent with the Adam optimizer, where each data instance in the batch runs in its own thread and the gradients are computed in parallel. Each neuron stores two additional arrays to track the input-specific neuron activations and error gradients.
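
The data-parallel idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not SLIDE's implementation: it uses a hypothetical single-neuron linear model and Python's standard `ThreadPoolExecutor`, and it stops at the averaged gradient rather than applying an Adam update. The functions `instance_gradient` and `batch_gradient` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def instance_gradient(weights, x, target):
    """Gradient of 0.5 * (w . x - target)^2 for a single data instance."""
    error = sum(w * xi for w, xi in zip(weights, x)) - target
    return [error * xi for xi in x]

def batch_gradient(weights, batch, max_workers=4):
    """Compute each instance's gradient in a worker thread, then average them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Instances do not depend on one another, so they can run concurrently.
        grads = list(pool.map(lambda item: instance_gradient(weights, *item), batch))
    n = len(batch)
    return [sum(g[i] for g in grads) / n for i in range(len(weights))]

# Hypothetical batch of (input vector, target) pairs.
weights = [0.0, 0.0]
batch = [([1.0, 2.0], 1.0), ([3.0, -1.0], 2.0), ([0.5, 0.5], 0.0)]
grad = batch_gradient(weights, batch)
```

In standard CPython, threads like these will not speed up pure-Python arithmetic, so the sketch shows only the structure. The researchers' own quote, later in the article, credits OpenMP parallelism on the CPU.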

The researchers reported that training with SLIDE on a 44-core CPU was over 3.5 times faster than using TensorFlow on an NVIDIA Tesla V100 GPU at any given accuracy level. “Using just a CPU, SLIDE drastically reduces the computations during both training and inference outperforming an optimized implementation of Tensorflow (TF) on the best available GPU,” wrote the researchers. “We provide the first evidence that a smart algorithm with modest CPU OpenMP parallelism can outperform the best available hardware NVIDIA-V100, for training large deep learning architectures.”

Copyright © 2020 Cami Rosso All rights reserved.