Roost.ai blog on Generative AI and Large Language Models

#148 The Pipe Dream of Running Inference on CPUs

Written by Rishi Yadav | February 2024

<< Previous Edition: LLM Personalization

A key characteristic of large language models (LLMs) is how differently they consume resources during the training and inference phases. Training these advanced models demands considerable computational resources, often requiring powerful GPU clusters to manage the extensive data and calculations involved. This phase is critical as the model learns, tweaking its weights and biases in response to the dataset, a task that consumes substantial computing power and energy.

On the other hand, inference—the phase where the trained model is used to make predictions or generate text based on new inputs—is considerably less demanding. In essence, inference involves passing input data through the model's pre-trained network of weights and biases to produce an output. Because this process does not require the model to learn or adjust its parameters, it is inherently less resource-intensive than training.

This disparity has fueled the optimistic view that, although training may rely on GPUs, inference could be efficiently performed on CPUs. This shift is appealing due to the widespread availability and cost-effectiveness of CPUs compared to specialized GPU hardware, potentially broadening the accessibility of LLMs and AI technologies for diverse applications and users without access to high-end computing resources.

Is It All a Pipe Dream? Likely So

However, a critical question arises: is this feasibility a pipe dream? Our continuous exploration of generative AI innovations reveals a concerning insight: even basic inference tasks on CPUs can be 5-6 times slower than on GPUs. This observation challenges the notion that CPUs are nearing parity with GPUs for inference tasks and highlights significant hurdles, most notably the limited parallel processing capabilities of CPUs.
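
To get a rough feel for that gap on your own hardware, here is a minimal PyTorch sketch that times a single large matrix multiply, the core operation of inference, on the CPU and, if available, on a CUDA GPU. The matrix size and repeat count are illustrative assumptions, not a rigorous benchmark.

import time
import torch

def time_matmul(device, n=2048, repeats=3):
    # A large matrix multiply stands in for applying one layer's weights.
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)  # warm-up so one-time setup does not distort timing
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

cpu_time = time_matmul("cpu")
print(f"CPU: {cpu_time * 1000:.1f} ms per matmul")
if torch.cuda.is_available():
    gpu_time = time_matmul("cuda")
    print(f"GPU: {gpu_time * 1000:.1f} ms per matmul "
          f"({cpu_time / gpu_time:.1f}x faster)")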

Inference, in simpler terms, is about how a model makes sense of new data using what it has learned. Picture this: each neuron in the model uses the formula

y = f(W·X + b)

to process information, where y is the output, W represents weights, X is the input, and b signifies biases. The activation function f transforms the weighted input plus bias into an output that the model can use.
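As a minimal sketch of that formula for a single neuron, here is a NumPy version; the input values, weights, bias, and the choice of a sigmoid activation are purely illustrative (real LLMs typically use activations such as GELU).

import numpy as np

def sigmoid(z):
    # One possible activation function f.
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([0.5, -1.2, 3.0])   # input vector
W = np.array([0.8, 0.1, -0.4])   # the neuron's weights
b = 0.25                         # the neuron's bias

y = sigmoid(np.dot(W, X) + b)    # y = f(W·X + b)
print(y)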

However, inference isn't the job of a lone neuron. Numerous neurons work in parallel, each equipped with its own set of weights and biases, performing similar calculations. Imagine activation as a tool that refines these raw outputs into something meaningful.

Then, it's all about competition among these outputs. Each is evaluated for its likelihood of being the right answer, with probabilities assigned accordingly. The output with the highest probability gets selected as the final answer, effectively becoming the "chosen one."
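Putting those two ideas together, the sketch below shows a small layer of neurons computed in one matrix multiply, followed by the "competition" step: a softmax turns the raw outputs into probabilities and the highest one is selected. The shapes and random values are illustrative only.

import numpy as np

def softmax(z):
    # Convert raw scores into probabilities that sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

X = np.array([0.5, -1.2, 3.0])   # one input vector
W = np.random.randn(4, 3)        # 4 neurons, each with its own 3 weights
b = np.random.randn(4)           # one bias per neuron

logits = W @ X + b               # every neuron computes W·X + b at once
probs = softmax(logits)          # each output's likelihood of being right
winner = int(np.argmax(probs))   # the highest probability is the "chosen one"
print(probs, winner)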

Coming back to the pipe dream: executing inference efficiently on CPUs presents considerable challenges, casting doubt on the feasibility of this approach for high-performance needs. Let's discuss the key ones.

Parallelism

The necessity for multiple neurons to process information simultaneously underscores the need for parallelism. While CPUs can handle parallel tasks through multi-threading, the level of parallelism is relatively limited and coarse-grained compared to the fine-grained, extensive parallel processing capabilities of GPUs. This discrepancy significantly impacts the ability of CPUs to match the performance efficiency required for complex neural network computations.
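One way to see how coarse-grained CPU parallelism is in practice is to vary the number of threads PyTorch is allowed to use for the same matrix multiply and watch how quickly the speedup flattens out. This is a rough sketch under assumed thread counts and matrix size, and the exact scaling will depend on the machine and the BLAS backend.

import time
import torch

def timed_matmul(n=2048, repeats=3):
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    torch.matmul(a, b)  # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    return (time.perf_counter() - start) / repeats

for threads in (1, 2, 4, 8):
    torch.set_num_threads(threads)  # cap the CPU threads used for the op
    print(f"{threads} thread(s): {timed_matmul() * 1000:.1f} ms per matmul")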

Memory Access

Another critical issue lies in how CPUs and GPUs allocate and access memory. GPUs pair their compute units with high-bandwidth memory and hide access latency by keeping thousands of threads in flight, which is crucial for streaming the large weight matrices that neural network computations require. In contrast, CPUs typically offer far lower memory bandwidth, so during inference the model's weights cannot be fed to the compute units fast enough, leading to bottlenecks and reduced performance efficiency.
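A crude way to feel the memory side of the problem is to measure how fast a CPU can simply stream a weight-sized array through memory. The sketch below is an assumption-laden microbenchmark, not a proper memory test: the array size is arbitrary, and it reports effective copy bandwidth (counting both the read and the write) for a single pass.

import time
import numpy as np

n = 256 * 1024 * 1024 // 4           # ~256 MB of float32, a stand-in for one chunk of weights
src = np.ones(n, dtype=np.float32)
dst = np.empty_like(src)

start = time.perf_counter()
np.copyto(dst, src)                  # stream the data through memory once
elapsed = time.perf_counter() - start

traffic_gb = src.nbytes * 2 / 1e9    # bytes read plus bytes written
print(f"Effective bandwidth: {traffic_gb / elapsed:.1f} GB/s")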

Conclusion

While there's a collective aspiration to efficiently conduct inference tasks on CPUs in the future, the array of challenges currently facing CPUs suggests that achieving this goal might be a long journey. We've delved into two primary obstacles, parallelism and memory access, yet other significant hurdles remain, including energy efficiency. These challenges collectively underscore the considerable gap that CPUs need to bridge to rival the performance capabilities of GPUs for inference tasks.

>> Next Edition: Life of a Neuron Bee