Researchers Make Low Energy LLM Breakthrough

UC Santa Cruz researchers found that they could power a billion-parameter-scale language model on just 13 watts, roughly the energy needed to power a lightbulb. Illustration by Molly Fine.

Revolutionizing Energy Efficiency in Large Language Models: A Breakthrough from UC Santa Cruz

Researchers at UC Santa Cruz have achieved a groundbreaking feat in the realm of large language models. Running an advanced model such as ChatGPT typically comes with immense energy and financial costs: roughly $700,000 per day in energy expenses, along with a significant carbon footprint. A new preprint paper, however, shows that it is possible to run a high-performing language model on the energy required to power a lightbulb.

Eliminating the Expensive Element: Matrix Multiplication

In their innovative approach, the researchers tackled the most computationally expensive part of running large language models: matrix multiplication. By eliminating this step and employing custom hardware, they found that a billion-parameter-scale language model could run on just 13 watts of power, making it more than 50 times more energy-efficient than typical hardware.

“We got the same performance at way less cost — all we had to do was fundamentally change how neural networks work,” explained Jason Eshraghian, lead author and assistant professor of electrical and computer engineering at the University of California Santa Cruz Baskin School of Engineering (BSE). His team not only revamped the algorithm but also built custom hardware to maximize efficiency.

Understanding the Costs

Modern neural networks rely heavily on matrix multiplication, where words are represented as numbers in matrices that are multiplied to generate language. These operations are typically carried out on GPUs, which are specialized for handling large datasets but come with high energy costs due to the need to move data between physically separated units.
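To make that cost concrete, here is a minimal sketch (not the researchers' code, and with hypothetical sizes) of how a single dense layer multiplies token embeddings by a weight matrix; this multiply-add workload is what GPUs spend most of their energy on:

```python
# Toy illustration of why matrix multiplication dominates the cost of a
# language model layer: every token's embedding is multiplied by a large
# weight matrix, so compute grows with n_tokens * d_model^2.
import numpy as np

d_model, n_tokens = 1024, 16             # hypothetical sizes for illustration
x = np.random.randn(n_tokens, d_model)   # token embeddings (one row per token)
W = np.random.randn(d_model, d_model)    # a dense weight matrix

y = x @ W          # the matrix multiplication: ~n_tokens * d_model^2 multiply-adds
print(y.shape)     # (16, 1024)
```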

Innovative Approach: Ternary Numbers

The team adopted a method that uses ternary numbers (negative one, zero, and positive one), reducing computation to summing numbers rather than multiplying them. This approach was inspired by previous work but went further by eliminating matrix multiplication entirely. The researchers devised a strategy to overlay matrices and perform only the most crucial operations, maintaining performance while cutting costs.
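As an illustration (an assumed toy example, not the researchers' implementation), the snippet below shows why ternary values make multiplication unnecessary: multiplying an input by +1, -1, or 0 amounts to adding it, subtracting it, or skipping it.

```python
# Assumed toy example: computing W @ x without any multiplications when the
# entries of W are restricted to the ternary values -1, 0, +1.
import numpy as np

def ternary_matvec(W_ternary, x):
    """Compute W @ x using only additions and subtractions (W is ternary)."""
    out = np.zeros(W_ternary.shape[0])
    for i, row in enumerate(W_ternary):
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:
                acc += xj      # +1: add the input value
            elif w == -1:
                acc -= xj      # -1: subtract the input value
            # 0: skip the input entirely
        out[i] = acc
    return out

W = np.array([[1, 0, -1], [0, 1, 1]])
x = np.array([0.5, -2.0, 3.0])
print(ternary_matvec(W, x))   # [-2.5  1. ]
print(W @ x)                  # matches the standard matrix product
```

A production kernel would of course vectorize these additions, but the point stands: the expensive multiplications are gone, which is what makes the approach so much cheaper in hardware.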

Custom Hardware Development

To further enhance energy efficiency, the team created custom hardware using field-programmable gate arrays (FPGAs). This highly customizable hardware allowed the researchers to exploit all of the energy-saving features of their redesigned neural network. The result was a model that could produce words faster than a person can read, using just 13 watts of power, a staggering improvement over the 700 watts required by standard GPUs. “We replaced the expensive operation with cheaper operations,” said Rui-Jie Zhu, the paper’s first author and a graduate student in BSE and Eshraghian’s group.

Future Implications

The researchers believe there’s potential for even greater efficiency. “These numbers are already really solid, but it is very easy to make them much better,” Eshraghian noted. “If we’re able to do this within 13 watts, just imagine what we could do with a whole data center's worth of computing power. We’ve got all these resources, but let’s use them effectively.”

This innovative work paves the way for more sustainable AI development, reducing both energy consumption and environmental impact. The researchers have made their model open source, inviting further advancements in this promising field.