Nvidia has announced a new parallel processing method that could radically change how AI models handle long contexts in real time. The technique, called “Helix Parallelism,” allows large language models to process inputs spanning millions of words far more efficiently and with lower latency. It is built around Nvidia’s latest GPU architecture, Blackwell.
Increasing Efficiency in AI
One of the most fundamental challenges facing AI models is recalling past context while continuing to generate new content. In applications that depend on long histories, such as law, medicine, or customer support systems, the model must re-scan that history for every new word it generates.

However, this process places an ever-growing load on GPU memory. For each new word, the model must re-access both the key-value (KV) cache and the large feed-forward network (FFN) weights. This rapidly consumes system resources and introduces latency.
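To get a sense of the scale, here is a rough back-of-envelope calculation of the KV cache for a single request. The layer count, head count, head dimension, and precision below are illustrative assumptions, not the configuration of DeepSeek-R1 or any other specific model.

```python
# Rough KV-cache size for one request; all model dimensions are assumptions.
def kv_cache_bytes(context_tokens: int,
                   num_layers: int = 64,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:   # fp16/bf16
    # Each token stores one key and one value vector per layer and KV head.
    per_token = num_layers * num_kv_heads * head_dim * 2 * bytes_per_value
    return context_tokens * per_token

for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> ~{kv_cache_bytes(tokens) / 2**30:,.1f} GiB of KV cache")
```

Under these assumptions, a million-token context alone occupies on the order of a couple of hundred gigabytes of KV cache per request, before the model weights are even counted.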
Nvidia addresses these two bottlenecks directly with Helix Parallelism. The method handles a model’s attention and FFN layers separately. For the attention mechanism, it applies a technique called KV Parallelism (KVP), which shards the historical data, the KV cache, across different GPUs.
This way, each GPU is responsible only for its own slice of the history, so no single GPU has to re-scan the entire context at every step. The FFN layer is run with traditional Tensor Parallelism (TP). Both stages run across the same pool of GPUs, so the hardware is not left sitting idle.
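The core idea behind sharding the KV cache can be sketched in a few lines of NumPy. This is a toy illustration of sequence-sharded attention, not Nvidia’s implementation; each “GPU” is just an array slice, and the shard count and sizes are made up. The partial results are merged with the standard log-sum-exp combination, so the sharded computation matches the single-device result exactly.

```python
import numpy as np

def partial_attention(q, k_shard, v_shard):
    """One 'GPU' attends only over its own slice of the KV cache."""
    scores = k_shard @ q                       # (shard_len,); sqrt(d) scaling omitted
    m = scores.max()                           # local max for a stable softmax
    w = np.exp(scores - m)
    return m, w.sum(), w @ v_shard             # local max, denominator, weighted values

def combine(partials):
    """Merge per-shard results into the full-sequence attention output."""
    m_global = max(m for m, _, _ in partials)
    denom = sum(l * np.exp(m - m_global) for m, l, _ in partials)
    numer = sum(o * np.exp(m - m_global) for m, _, o in partials)
    return numer / denom

rng = np.random.default_rng(0)
seq_len, d = 1024, 64
q = rng.standard_normal(d)
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))

# Reference: attention over the full cache on one device.
scores = k @ q
weights = np.exp(scores - scores.max())
full = weights @ v / weights.sum()

# Sharded: split the cache across 4 "GPUs" and merge the partial results.
partials = [partial_attention(q, ks, vs)
            for ks, vs in zip(np.array_split(k, 4), np.array_split(v, 4))]
print(np.allclose(full, combine(partials)))    # True
```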
Data moves over Nvidia’s high-speed NVLink interconnect in NVL72 rack-scale systems. On top of that, a new method called HOP-B minimizes latency by overlapping computation with data transfer, so communication and processing proceed at the same time instead of waiting on each other.
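The general pattern of overlapping transfers with computation can be sketched with a simple double-buffering loop. This is a generic schematic of compute/communication overlap, not HOP-B itself; the functions and timings are invented stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):          # stand-in for moving data over the interconnect
    time.sleep(0.05)
    return chunk

def compute(chunk):           # stand-in for attention/FFN work on a GPU
    time.sleep(0.05)
    return chunk * 2

def sequential(chunks):
    # Transfer, then compute, one chunk at a time: the two costs add up.
    return [compute(transfer(c)) for c in chunks]

def overlapped(chunks):
    # Prefetch the next chunk while the current one is being processed.
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            ready = pending.result()
            pending = comm.submit(transfer, nxt)   # next transfer starts now...
            results.append(compute(ready))         # ...while this chunk is computed
        results.append(compute(pending.result()))
    return results

chunks = list(range(8))
t0 = time.perf_counter(); sequential(chunks); t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); overlapped(chunks); t_ovl = time.perf_counter() - t0
print(f"sequential ~{t_seq:.2f}s, overlapped ~{t_ovl:.2f}s")
```

With equal transfer and compute times, the overlapped loop hides nearly all of the communication cost, which is the effect HOP-B is described as achieving at the GPU level.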
Simulation results show that the new approach delivers a significant leap in performance. In tests with the DeepSeek-R1 671B model, Helix served up to 32 times more concurrent users at the same latency. In low-concurrency scenarios, response times improved by up to 1.5x, yielding both higher performance and better resource efficiency.