nncase: An End-to-End Compiler for Efficient LLM Deployment on Heterogeneous Storage Architectures

Dec 25, 2025 · 6:14
Distributed, Parallel, and Cluster Computing · Machine Learning

Abstract

The efficient deployment of large language models (LLMs) is hindered by memory architecture heterogeneity, where traditional compilers suffer from fragmented workflows and high adaptation costs. We present nncase, an open-source, end-to-end compilation framework designed to unify optimization across diverse targets. Central to nncase is an e-graph-based term rewriting engine that mitigates the phase ordering problem, enabling global exploration of computation and data movement strategies. The framework integrates three key modules: Auto Vectorize for adapting to heterogeneous computing units, Auto Distribution for searching parallel strategies with cost-aware communication optimization, and Auto Schedule for maximizing on-chip cache locality. Furthermore, a buffer-aware Codegen phase ensures efficient kernel instantiation. Evaluations show that nncase outperforms mainstream frameworks like MLC LLM and Intel IPEX on Qwen3 series models and achieves performance comparable to the hand-optimized llama.cpp on CPUs, demonstrating the viability of automated compilation for high-performance LLM deployment. The source code is available at https://github.com/kendryte/nncase.

Summary

The paper introduces nncase, an open-source, end-to-end compilation framework designed for efficient deployment of large language models (LLMs) on heterogeneous storage architectures. The main research problem addressed is the challenge of adapting LLMs to diverse hardware targets with varying memory hierarchies and compute units, a task that traditional compilers struggle with due to fragmented workflows and high adaptation costs. nncase employs an e-graph-based term rewriting engine to overcome the phase ordering problem, enabling global exploration of computation and data movement strategies. The framework integrates three key modules: Auto Vectorize, Auto Distribution, and Auto Schedule, which handle heterogeneous computing unit adaptation, parallel strategy search with cost-aware communication optimization, and on-chip cache locality maximization, respectively. A buffer-aware Codegen phase ensures efficient kernel instantiation. The key finding is that nncase outperforms mainstream frameworks like MLC LLM and Intel IPEX on Qwen3 series models and achieves performance comparable to the hand-optimized llama.cpp on CPUs. This demonstrates the viability of automated compilation for high-performance LLM deployment across diverse hardware. This matters to the field because it presents a unified compilation paradigm that decouples the compilation workflow from physical topology, offering a "compile once, adapt everywhere" capability and reducing the engineering costs associated with deploying LLMs on various platforms.
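
To make the phase-ordering problem concrete, below is a minimal Python sketch (not nncase code, and using made-up rewrite rules): a fixed greedy pass order gets stuck after applying strength reduction, while exploring every application order, which an e-graph represents compactly, recovers the fully simplified term.

```python
# A minimal sketch of the phase-ordering problem that e-graph based term
# rewriting avoids. Terms are nested tuples; two hypothetical rules interact
# badly under a greedy order but compose under exhaustive exploration.

MUL, DIV, SHL = "*", "/", "<<"

def rewrite_once(term):
    """Yield every term reachable by applying one rule at one position."""
    if not isinstance(term, tuple):
        return
    op, lhs, rhs = term
    # Recurse into children first (the "inner passes run first" phase order).
    for t in rewrite_once(lhs):
        yield (op, t, rhs)
    for t in rewrite_once(rhs):
        yield (op, lhs, t)
    # Rule 1 (strength reduction): x * 2  ->  x << 1
    if op == MUL and rhs == 2:
        yield (SHL, lhs, 1)
    # Rule 2 (simplification): (x * y) / y  ->  x
    if op == DIV and isinstance(lhs, tuple) and lhs[0] == MUL and lhs[2] == rhs:
        yield lhs[1]

def greedy(term):
    """Always take the first applicable rewrite: a fixed phase order."""
    while True:
        nxt = next(rewrite_once(term), None)
        if nxt is None:
            return term
        term = nxt

def explore_all(term):
    """Explore every rewrite order; an e-graph does this without duplicating sub-terms."""
    seen, frontier = {term}, [term]
    while frontier:
        frontier = [t for cur in frontier for t in rewrite_once(cur) if t not in seen]
        seen.update(frontier)
    return min(seen, key=lambda t: str(t).count("("))  # smallest term found

expr = (DIV, (MUL, "a", 2), 2)        # (a * 2) / 2
print("greedy :", greedy(expr))       # ('/', ('<<', 'a', 1), 2)  -- stuck after strength reduction
print("explore:", explore_all(expr))  # 'a'                       -- the simplification was found
```

The brute-force exploration above is exponential in general; an e-graph keeps it tractable by sharing equivalent sub-terms, which is what allows nncase to explore computation and data-movement strategies globally rather than pass by pass.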

Key Insights

  • E-graph-based term rewriting mitigates the phase ordering problem: Unlike traditional compilers with greedy strategies, nncase's e-graph explores a comprehensive design space without compromising semantic integrity, leading to better optimization.
  • Auto Vectorize adapts to heterogeneous compute units: By introducing MetaPackOperation and FoldNopPack rules, the compiler generates multiple packed layout candidates within the e-graph, dynamically adjusting packing factors to optimize data layout conversion and computing unit saturation.
  • Auto Distribution enables topology-agnostic parallelism: By adopting the SBP (Split, Broadcast, Partial) abstraction and embedding the distributed strategy search space into the e-graph, nncase automatically balances communication costs against computational efficiency (see the communication-cost sketch after this list).
  • Auto Schedule maximizes on-chip cache locality: A hierarchical approach leveraging the nncase Tensor Template (NTT) Library decomposes scheduling into structural optimization (Loop Fusion via Monte Carlo Tree Search) and parametric optimization (Tiling via Mixed-Integer Nonlinear Programming), achieving register-level efficiency (see the tiling sketch after this list).
  • nncase outperforms mainstream frameworks on Qwen3: Evaluations show that nncase achieves significantly better performance than MLC LLM and Intel IPEX on the Qwen3 series models. For example, in a single-core scenario with Qwen3-0.6B (F32), nncase achieves 8.7 tokens/s compared to 5.3 tokens/s for Intel IPEX, a ~64% improvement.
  • Performance comparable to llama.cpp: nncase achieves performance comparable to the hand-optimized llama.cpp on CPUs, demonstrating the effectiveness of automated compilation.
  • Limitations: The evaluation is performed on a single CPU platform (AMD Ryzen 9 5900X) and does not cover a wide range of hardware targets. The communication cost model is based on the Roofline model and the Alpha-Beta model, which are approximations and may not accurately reflect real-world communication costs in complex distributed systems.
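
As referenced in the Auto Distribution insight above, the sketch below shows cost-aware strategy selection in miniature: it enumerates three classic SBP signatures for a matmul and ranks them with an Alpha-Beta communication model. The constants, shapes, strategy names, and collective-cost formulas are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of SBP (Split / Broadcast / Partial) strategy selection
# under an Alpha-Beta communication model. All numbers are assumptions.

ALPHA = 5e-6        # assumed per-message latency (s)
BETA  = 1.0 / 12e9  # assumed per-byte transfer time (s/B), ~12 GB/s link
DTYPE_BYTES = 4     # fp32

def allreduce_cost(num_bytes, devices):
    """Ring all-reduce: 2*(D-1) steps, each moving a 1/D shard."""
    return 2 * (devices - 1) * (ALPHA + (num_bytes / devices) * BETA)

def allgather_cost(num_bytes, devices):
    """Ring all-gather: (D-1) steps, each moving a 1/D shard."""
    return (devices - 1) * (ALPHA + (num_bytes / devices) * BETA)

def matmul_strategies(m, k, n, devices):
    """Per-layer communication cost of three SBP signatures for Y[m,n] = X[m,k] @ W[k,n]."""
    out_bytes = m * n * DTYPE_BYTES
    return {
        # X split on rows (S0), W broadcast (B): no communication for this layer.
        "S(0) x B  -> S(0)": 0.0,
        # X broadcast, W split on columns (S1): all-gather the column-split output
        # if the next operator needs a replicated input.
        "B x S(1)  -> S(1), then all-gather": allgather_cost(out_bytes, devices),
        # Both split along the reduction dim k: each device holds a Partial sum.
        "S(1) x S(0) -> P, then all-reduce": allreduce_cost(out_bytes, devices),
    }

if __name__ == "__main__":
    # An example projection-layer shape, chosen purely for illustration.
    costs = matmul_strategies(m=1, k=1024, n=1024, devices=4)
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
        print(f"{name:40s} {cost * 1e6:8.1f} us")
```

In nncase, the analogous trade-off is resolved during e-graph extraction, so communication cost is weighed against computation across the whole graph rather than greedily per layer; the per-layer ranking above is only meant to show why a cost model is needed at all.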

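The Auto Schedule insight above splits scheduling into structural optimization (loop fusion via MCTS) and parametric optimization (tiling via MINLP). The sketch below illustrates only the parametric half, replacing a MINLP solver with a small exhaustive search over power-of-two tile sizes under a cache-capacity constraint; the cache size, shapes, and DRAM-traffic model are assumptions for illustration, not the nncase implementation.

```python
# A minimal sketch of tile-size selection for a tiled matmul: keep the working
# set within cache while minimizing an estimated DRAM traffic objective.

from itertools import product

L1_BYTES = 32 * 1024   # assumed per-core L1 data cache
DTYPE_BYTES = 4        # fp32

def dram_traffic(M, K, N, tm, tk, tn):
    """Estimated bytes moved from DRAM for C[M,N] += A[M,K] @ B[K,N] with (tm, tk, tn) tiles.

    A is re-read once per column block of B, B once per row block of A,
    and C is read and written once per reduction block.
    """
    a_reads = M * K * (N // tn)
    b_reads = K * N * (M // tm)
    c_rw    = 2 * M * N * (K // tk)
    return (a_reads + b_reads + c_rw) * DTYPE_BYTES

def working_set(tm, tk, tn):
    """Bytes of the three tiles that must stay resident in cache simultaneously."""
    return (tm * tk + tk * tn + tm * tn) * DTYPE_BYTES

def best_tiling(M, K, N):
    candidates = [2 ** i for i in range(2, 10)]   # tile sizes 4 .. 512
    best = None
    for tm, tk, tn in product(candidates, repeat=3):
        if tm > M or tk > K or tn > N:
            continue
        if working_set(tm, tk, tn) > L1_BYTES:
            continue
        traffic = dram_traffic(M, K, N, tm, tk, tn)
        if best is None or traffic < best[0]:
            best = (traffic, (tm, tk, tn))
    return best

if __name__ == "__main__":
    traffic, tiles = best_tiling(M=1024, K=1024, N=1024)
    print(f"tile (m,k,n) = {tiles}, est. DRAM traffic = {traffic / 1e6:.1f} MB")
```
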
Practical Implications

  • Efficient LLM deployment across diverse hardware: nncase enables efficient deployment of LLMs on various hardware platforms, including single-node setups and multi-device clusters, by unifying optimization across heterogeneous memory architectures.
  • Reduced engineering costs: By automating the optimization process, nncase reduces the engineering costs associated with manually tuning LLMs for specific hardware targets.
  • Framework developers and researchers: Framework developers can adopt nncase's e-graph-based optimization techniques and modular design to enhance their compilers' capabilities. Researchers can build upon nncase to explore new optimization strategies and distributed execution paradigms for LLMs.
  • Future research directions: The work opens up future research directions in areas such as exploring more sophisticated cost models for Auto Distribution, extending the NTT Library to support a wider range of hardware primitives, and investigating the application of nncase to other deep learning models and tasks.

Links & Resources

Authors