The company’s new Aegaeon system reportedly slashes GPU requirements by 82 per cent while improving throughput almost ninefold.
The work was detailed in a paper presented at the 2025 ACM Symposium on Operating Systems Principles (SOSP) in Seoul, written by engineers from Alibaba's infrastructure division and Peking University. During several months of production testing, the number of Nvidia H20 accelerators needed to support dozens of large language models fell from 1,192 to just 213.
Unlike most research that focuses on faster model training, Aegaeon tackles the waste that happens during inference. It acts as a scheduler, parcelling out tiny slices of GPU time across different models with unpredictable demand. This approach keeps the chips busier and allows one H20 to serve several models simultaneously.
The result, according to the paper, is a sharp rise in the useful work each GPU delivers, a metric the authors call "goodput". By virtualising GPU access at the token level, Aegaeon maintains high utilisation even when workloads spike or idle unpredictably.
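The core idea — interleaving per-token decode steps from several models' request queues on one device instead of pinning each model to its own GPU — can be sketched in miniature. This is an illustrative toy, not Alibaba's implementation: all class and method names are hypothetical, and a real system like Aegaeon must also manage swapping model weights and KV caches on and off the accelerator, which this sketch omits.

```python
from collections import deque


class TokenLevelScheduler:
    """Toy round-robin token-level scheduler (hypothetical sketch).

    Each scheduling step runs a single decode step (one token) for
    the next pending request, switching between models' queues so
    that no model monopolises the device while others sit idle.
    """

    def __init__(self):
        # model name -> queue of in-flight requests
        self.queues = {}

    def submit(self, model, prompt, n_tokens):
        """Enqueue a request that needs n_tokens decode steps."""
        self.queues.setdefault(model, deque()).append(
            {"prompt": prompt, "remaining": n_tokens, "output": []}
        )

    def run(self):
        """Interleave one-token decode steps across models until all
        requests finish; returns the order in which models ran."""
        trace = []
        while any(self.queues.values()):
            for model, q in list(self.queues.items()):
                if not q:
                    continue
                req = q[0]
                # Stand-in for one real decode step on the GPU.
                req["output"].append(f"{model}_tok{len(req['output'])}")
                req["remaining"] -= 1
                trace.append(model)
                if req["remaining"] == 0:
                    q.popleft()
        return trace


sched = TokenLevelScheduler()
sched.submit("modelA", "hello", 2)
sched.submit("modelB", "bonjour", 3)
print(sched.run())  # decode steps alternate between the two models
```

Even in this simplified form, the design choice is visible: the unit of scheduling is a single token, not a whole request or a whole model, which is what lets one accelerator stay busy serving several sporadically used models at once.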
The South China Morning Post reported that all the tests used Nvidia’s H20 accelerator, one of the few still available to Chinese buyers under current US export restrictions.
If the figures stand up to scrutiny, Alibaba may have shown a way for Chinese data centres to do more with the limited hardware they can still import.