World’s First in AI: IBM Research's 4-Bit Machine Learning
Novel training for deep neural networks may reduce AI's heavy carbon footprint.
Posted December 7, 2020 | Reviewed by Ekua Hagan

The artificial intelligence (AI) renaissance is largely due to advances in deep learning, a type of machine learning with architectural elements inspired by the biological brain. However, unlike the energy-efficient human brain, the process of training large-scale deep neural networks is enormously energy-intensive, requiring colossal amounts of computing memory and power. In a world’s first, IBM Research reveals at this week’s NeurIPS conference an unprecedented 4-bit AI training system that may help reduce machine learning’s heavy carbon footprint.
“Training AI models has become extremely expensive and generates a massive carbon footprint. IBM Research over the last five years has introduced a number of key techniques to address these challenges and dramatically improve how we train neural network models,” said Kailash Gopalakrishnan, IBM Fellow and Senior Manager, Accelerator Architectures and Machine Learning, IBM Research.
Traditional deep learning has a heavy carbon footprint, as training requires extensive computational processing of big data sets. For example, a University of Massachusetts Amherst study from last year found that training a large AI model with neural architecture search can generate 284 metric tons of carbon dioxide—an amount roughly equivalent to the lifetime emissions of five average U.S. cars.
“We're excited to unveil at this year's Conference on Neural Information Processing Systems (NeurIPS) the world's first 4-bit training system,” said Gopalakrishnan. “This novel technique is significant in three ways. First, it boosts the efficiency of the best training systems available commercially today by more than seven times, cutting energy and costs. Second, these gains bring training closer to the edge, a major advancement for privacy and security of AI models. And third, it offers a technological upgrade for hybrid cloud infrastructure which is becoming increasingly critical to companies in their transitions to hybrid cloud environments.”
A deep neural network consists of many computational layers with nodes that act like artificial neurons. Activation functions, also known as transfer functions, determine whether a node is activated based on its weighted-sum calculation and bias, and they inject non-linearity into the node’s output and the network as a whole. Training a deep neural network is memory-intensive because the activations from the forward pass must be retained in order to compute the error gradients in the backward pass.
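The idea can be sketched with a single artificial neuron. This is a minimal illustration, not IBM's code; the function names and values are made up. Note how the backward pass reuses the activation `a` saved during the forward pass, which is exactly why training consumes so much memory.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    a = sigmoid(z)                                # non-linear activation
    return a                                      # retained for the backward pass

def backward(a, x, grad_out):
    # d(sigmoid)/dz = a * (1 - a): the stored activation `a` is needed here,
    # so every layer's activations must be kept in memory during training.
    grad_z = grad_out * a * (1.0 - a)
    grad_w = [grad_z * xi for xi in x]  # gradients w.r.t. the weights
    grad_b = grad_z                     # gradient w.r.t. the bias
    return grad_w, grad_b

a = forward([1.0, 2.0], [0.5, -0.25], 0.1)
grad_w, grad_b = backward(a, [1.0, 2.0], grad_out=1.0)
```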
Backpropagation, short for the backward propagation of errors, is often used in supervised learning. It is an algorithm that enables the updating of individual weights in the network to minimize the loss function, a measure of the difference between the neural network’s predictions and the data labels. In machine learning, gradient descent is an iterative optimization algorithm commonly used for this minimization: at each step, the weights are nudged in the direction that most reduces the loss.
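A toy example makes the loop concrete. The following sketch (illustrative values, not from the paper) runs gradient descent on a one-parameter model `y = w * x`, repeatedly stepping the weight against the gradient of a squared loss:

```python
# One-parameter model y = w * x fit to a single labeled example (x, y)
# by gradient descent on the squared loss.

def loss(w, x, y):
    return (w * x - y) ** 2

def grad(w, x, y):
    return 2.0 * (w * x - y) * x  # derivative of the loss w.r.t. w

w, lr = 0.0, 0.1      # initial weight and learning rate
x, y = 1.0, 2.0       # training input and its label
for _ in range(50):
    w -= lr * grad(w, x, y)  # step against the gradient

# After 50 steps, w has converged very close to the target value 2.0.
```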
Reducing the number of bits lowers computational costs. A bit is a binary digit that represents a value of either “1” or “0,” corresponding to the concepts of “on” or “off,” “true” or “false,” and “yes” or “no.” Eight bits constitute a byte, the basic unit of memory and processing. Machine learning systems often use 32-bit “single-precision” or 64-bit “double-precision” floating-point (FP) numbers—numbers without a fixed number of digits before and after the decimal point. Fortunately, many deep learning projects do not require ultra-high precision in order to serve their purpose and perform with a high degree of accuracy. Increasingly, 16-bit “half-precision” floating-point numbers are being used for deep learning, given that lower precision leads to lower computing costs.
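The precision trade-off is easy to see by rounding the same value to different floating-point widths. Python's standard `struct` module supports all three formats (`'d'` = 64-bit double, `'f'` = 32-bit single, `'e'` = 16-bit half precision):

```python
import struct

def round_to(value, fmt):
    # Pack the value into the given floating-point format, then unpack it,
    # so the result is the nearest number representable at that width.
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

pi = 3.141592653589793
fp64 = round_to(pi, "d")  # 64-bit: ~16 significant decimal digits
fp32 = round_to(pi, "f")  # 32-bit: ~7 significant decimal digits
fp16 = round_to(pi, "e")  # 16-bit: ~3-4 significant decimal digits
```

The half-precision result is noticeably less accurate than the single-precision one, but for many deep learning workloads that loss is acceptable in exchange for the savings in memory and compute.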
Within a span of half a decade, IBM Research has continued to break barriers toward energy-efficient AI deep learning. In 2015, IBM researchers successfully reduced deep learning training systems to 16-bit fixed-point number representations by using stochastic rounding, without compromising accuracy.
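Stochastic rounding, in general, rounds a value up with probability equal to its fractional part, so the rounding error averages out to zero over many operations. The following sketch illustrates the general technique (the function and values here are illustrative, not IBM's implementation):

```python
import random

def stochastic_round(x, rng=random):
    # Round x down, then round up with probability equal to the
    # fractional part, making the rounding unbiased in expectation.
    floor = int(x // 1)
    frac = x - floor
    return floor + (1 if rng.random() < frac else 0)

random.seed(0)  # fixed seed for reproducibility
samples = [stochastic_round(2.3) for _ in range(10000)]
mean = sum(samples) / len(samples)  # close to 2.3 on average
```

Each individual result is an integer (2 or 3), yet the average recovers the original value, which is what lets low-precision accumulations preserve accuracy during training.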
Gopalakrishnan, along with other IBM researchers Naigang Wang, Jungwook Choi, Daniel Brand, and Chia-Yu Chen, achieved an AI milestone by introducing the first training system based on 8-bit floating-point numbers at NeurIPS 2018. Last year, IBM researchers Jungwook Choi and Swagath Venkataramani presented state-of-the-art 2-bit inference results at SysML 2019.
In another breakthrough achievement, Gopalakrishnan, Wang, and Chen, along with Xiao Sun, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, and Vijayalakshmi Srinivasan at the IBM T. J. Watson Research Center in Yorktown Heights, New York, unveil high-performance 4-bit training that is also more eco-friendly. The researchers wrote in their NeurIPS 2020 paper, “To the best of our knowledge, no studies so far have demonstrated deep learning model convergence using 4-bits for all tensors (weights, activations, and gradients).”
The researchers created end-to-end 4-bit deep neural network training by using an innovative combination of a novel radix-4 FP4 format to enable the representation of gradients with a wide dynamic range, a per-layer trainable gradient scaling called GradScale that aligns gradients to the FP4 range, and a two-phase quantization process to minimize FP4 gradient errors—both mean squared and expected errors.
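One way to build intuition for the radix-4 idea: a 4-bit format can be viewed as a sign bit plus a 3-bit exponent, so the representable magnitudes are powers of 4, covering a wide dynamic range with very few bits. The sketch below is a simplified illustration under that assumption, not IBM's implementation; the `scale` parameter loosely mimics the role of a per-layer gradient scale like GradScale, and the exponent range is made up.

```python
import math

EXP_MIN, EXP_MAX = -3, 3  # illustrative 3-bit exponent range

def quantize_radix4(g, scale=1.0):
    # Map a gradient to the nearest representable value of the form
    # sign * scale * 4**e, with e clamped to the 3-bit exponent range.
    if g == 0.0:
        return 0.0
    sign = 1.0 if g > 0 else -1.0
    e = round(math.log(abs(g) / scale, 4))  # nearest power-of-4 exponent
    e = max(EXP_MIN, min(EXP_MAX, e))       # clamp to the exponent range
    return sign * scale * 4.0 ** e

q = quantize_radix4(0.07)  # snaps to the nearest power of 4
```

Because successive levels differ by a factor of 4 rather than 2, the format trades fine-grained resolution for dynamic range, which matches the highly skewed distribution of gradient magnitudes; the per-layer scale then keeps each layer's gradients centered on the representable range.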
“While AI continues fueling historic transformations across countless industries, IBM Research hopes this latest breakthrough will accelerate even more innovation through leaner and more efficient training of large-scale systems,” said Gopalakrishnan.
Copyright © 2020 Cami Rosso All rights reserved.