Zhenyu Bai
ARTIC Fellow @ School of Computing, National University of Singapore
Moore’s Law is over; memory is the enemy.
Don’t repeat the VLIW mistake: compiler–hardware co-design is the key to making dataflow architectures the winner.
My current research focuses on dataflow architecture and compilers. We develop CGRA-style dataflow architectures and polyhedral-based compilation for systems with explicit data movement, distributed memories, and parallel compute units.
On the software side, we build compiler support for tile-based DSLs (e.g., Triton and Helion) targeting commercial dataflow platforms, including Tenstorrent, IBM AIU, and AMD NPU/AIE, as well as classical NPU/TPU-like architectures. Our prototype end-to-end flow (Helion/Triton → MLIR → TT-Metal) on Tenstorrent Wormhole achieves performance comparable to vendor libraries on tensor kernels and fused AI operators.
selected publications
- [ASPLOS26] A data-driven dynamic execution orchestration architecture. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2026
- [arXiv25] TL: Automatic End-to-End Compiler of Tile-Based Languages for Spatial Dataflow Architectures. arXiv preprint arXiv:2512.22168, 2025
- [DAC24] Swat: Scalable and efficient window attention-based transformers acceleration on FPGAs. In Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024
- [PACT24] Zed: A generalized accelerator for variably sparse matrix computations in ML. In Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 2024
- [arXiv24] Reconsidering the energy efficiency of spiking neural networks. arXiv preprint arXiv:2409.08290, 2024