research | Ivan R. Ivanov

Publications

2026

CGO ’26

Thinking Fast and Correct: Automated Rewriting of Numerical Code through Compiler Augmentation

Siyuan Brant Qian, Vimarsh Sathia, Ivan R. Ivanov, Jan Huckelheim, Paul Hovland, and William S. Moses

In 2026 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Feb 2026

Floating-point numbers are finite-precision approximations to real numbers and are ubiquitous in computer applications in nearly every field. Selecting the right floating-point representation that balances performance and numerical accuracy is a difficult task – one that has become even more critical as hardware trends toward high-performance, low-precision operations. Although the common wisdom around changing floating-point precision implies that accuracy and performance are inversely correlated, more advanced techniques can often circumvent this tradeoff. Applying complex numerical optimizations to real-world code, however, is an arduous engineering task that requires expertise in numerical analysis and performance engineering, and application-specific numerical context. While there is a plethora of existing tools that partially automate this process, they are limited in the scope of optimization techniques or still require substantial human intervention. We present Poseidon, a modular and extensible framework that fully automates floating-point optimizations for real-world applications within a production compiler. Our key insight is that a small surrogate profile often reveals sufficient numerical context to drive effective rewrites. Poseidon operates as a two-phase compiler: the first compilation instruments the program to capture numerical context; the second compilation consumes profiled data, generates and evaluates candidate rewrites, and solves for optimal performance/accuracy tradeoffs. Poseidon’s interoperability with standard compiler analyses and optimizations grants it analysis and optimization advantages unavailable to existing source- and binary-level approaches. On multiple large-scale applications, Poseidon leads to outsized benefits in performance without substantially changing accuracy, and outsized accuracy benefits without diminishing performance. On a quaternion differentiator, Poseidon enables a 1.46× speedup with a relative error of 10⁻⁷. On DOE’s LULESH hydrodynamics application, Poseidon improves program accuracy to exactly match a 512-bit simulation run without substantially reducing performance.

CGO ’26

FRUGAL: Pushing GPU Applications beyond Memory Limits

Lingqi Zhang, Tengfei Wang, Jiajun Huang, Chen Zhuang, Ivan R. Ivanov, Peng Chen, Toshio Endo, and Mohamed Wahib

In 2026 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Feb 2026

GPUs power modern scientific and AI applications, but their limited memory capacity restricts scalability. Buying GPUs with larger HBM is prohibitively expensive and still bounded by market limits. Existing solutions either exploit application-specific knowledge through out-of-core techniques, which lack generality, or rely on system-level page faulting, which is transparent but inefficient. We propose FRUGAL, an application-agnostic framework and methodology that reduces GPU memory footprint while sustaining high performance. FRUGAL formulates memory management as an optimization over an application’s execution graph, encompassing prefetching, kernel execution, and offloading. Using static analysis and profiling, FRUGAL applies a two-phase scheduling and migration strategy, solving an otherwise intractable optimization efficiently. Evaluations on Tiled Cholesky Decomposition, Tiled LU Decomposition, Tiny-CUDA-NN, and QuEST show that FRUGAL significantly reduces maximum GPU memory usage by 80.21%, 80.20%, 64.75%, and 60.86%, respectively, with a geometric-mean slowdown of only 28.31%. FRUGAL allows applications to exceed hardware-imposed limits, and maintains strong performance scalability beyond existing GPU memory constraints, without additional hardware cost.

2025

SC ’25

RAPTOR: Practical Numerical Profiling of Scientific Applications

Faveo Hoerold, Ivan R. Ivanov, Akash Dhruv, William S. Moses, Anshu Dubey, Mohamed Wahib, and Jens Domke

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Feb 2025

The proliferation of low-precision units in modern high-performance architectures increasingly burdens domain scientists. Historically, the choice in HPC was easy: can we get away with 32-bit floating-point operations and lower bandwidth requirements, or is FP64 necessary? Driven by Artificial Intelligence, vendors introduce novel low-precision units for vector and tensor operations, and FP64 capabilities stagnate or are reduced. This forces scientists to re-evaluate their codes, but a trivial search-and-replace approach to go from FP64 to FP16 will not suffice. We introduce RAPTOR: a numerical profiling tool to guide scientists in their search for code regions where precision lowering is feasible. Using LLVM, we transparently replace high-precision computations using low-precision units, or emulate a user-defined precision. RAPTOR is a novel, feature-rich approach, with a focus on ease of use, to change, profile, and reason about numerical requirements and instabilities, which we demonstrate with four real-world multi-physics Flash-X applications.

LLVM-HPC @ SC ’25

Dynamic Thread Coarsening for CPU and GPU OpenMP Code

Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert

In Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Feb 2025

Thread coarsening is a well-known optimization technique for GPUs. It enables instruction-level parallelism, reduces redundant computation, and can provide better memory access patterns. However, the presence of divergent control flow – cases where uniformity of branch conditions among threads cannot be proven at compile time – diminishes its effectiveness. In this work, we implement multi-level thread coarsening for CPU and GPU OpenMP code by implementing a generic thread coarsening transformation on LLVM IR. We introduce dynamic convergence – a new technique that generates both coarsened and non-coarsened versions of divergent regions in the code and allows the uniformity check to happen at runtime instead of compile time. We performed the evaluation on HecBench for GPU and LULESH for CPU. We found that the best-case speedup without dynamic convergence was 4.6% for GPUs and 2.9% for CPUs, while our approach achieved 7.5% for GPUs and 4.3% for CPUs.

2024

arXiv

Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness

Emil Vatai, Aleksandr Drozd, Ivan R. Ivanov, Yinghao Ren, and Mohamed Wahib

Feb 2024

IWOMP ’24

Automatic Parallelization and OpenMP Offloading of Fortran Array Notation

Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert

In Advancing OpenMP for Future Accelerators, Feb 2024

The Fortran programming language is prevalent in the scientific computing community with a wealth of existing software written in it. It is still being developed, with the latest standard released in 2023. However, due to its long history, many old code bases are in need of modernization for new HPC systems. One advantage Fortran has over C and C++, which are other languages broadly used in scientific computing, is the easy syntax for manipulating entire arrays or subarrays. However, this feature is underused, as there has been no way of offloading such array operations to accelerators and support for parallelization has been unsatisfactory. The new OpenMP 6.0 standard introduces the workdistribute directive, which enables parallelization and/or offloading automatically by just annotating the region the programmer wishes to speed up. We implement workdistribute in the LLVM project’s Fortran compiler, called Flang. Flang uses MLIR – Multi-Level Intermediate Representation – which allows for a structured representation that captures the high-level semantics of array manipulation and OpenMP. This allows us to build an implementation that performs on par with more verbose manually parallelized OpenMP code. By offloading linear algebra operations to vendor libraries, we also enable software developers to easily unlock the full potential of their hardware without needing to write verbose, vendor-specific source code.

EuroMPI ’24

SPMD IR: Unifying SPMD and Multi-value IR Showcased for Static Verification of Collectives

Semih Burak, Ivan R. Ivanov, Jens Domke, and Matthias Müller

In Recent Advances in the Message Passing Interface, Feb 2024

To effectively utilize modern HPC clusters, inter-node communication and related single program, multiple data (SPMD) parallel programming models such as MPI are inevitable. Current tools and compilers that employ analyses of SPMD models often have the limitation of only supporting one model or implementing the necessary abstraction internally. As a result, neither the analysis nor the effort invested in the abstraction is reusable, and the tool cannot be extended to other models without extensive changes to the tool itself.

arXiv

Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training

Ivan R. Ivanov, Joachim Meyer, Aiden Grossman, William S. Moses, and Johannes Doerfert

Jun 2024

The size and complexity of software applications are increasing at an accelerating pace. Source code repositories (along with their dependencies) require vast amounts of labor to keep them tested, maintained, and up to date. As the discipline now begins to also incorporate automatically generated programs, automation in testing and tuning is required to keep up with the pace – let alone reduce the present level of complexity. While machine learning has been used to understand and generate code in various contexts, machine learning models themselves are trained almost exclusively on static code without inputs, traces, or other execution time information. This lack of training data limits the ability of these models to understand real-world problems in software. In this work we show that inputs, like code, can be generated automatically at scale. Our generated inputs are stateful, and appear to faithfully reproduce the arbitrary data structures and system calls required to rerun a program function. By building our tool within the compiler, it can both be applied to arbitrary programming languages and architectures and leverage static analysis and transformations for improved performance. Our approach is able to produce valid inputs, including initial memory states, for 90% of the ComPile dataset modules we explored, for a total of 21.4 million executable functions. Further, we find that a single generated input results in an average block coverage of 37%, whereas guided generation of five inputs improves it to 45%.

CGO ’24

Retargeting and Respecializing GPU Workloads for Performance Portability

I. R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses

In 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Mar 2024

In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that accounts for the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs by performing automatic translation from CUDA and simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 27% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.

2023

PPoPP ’23

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, and Oleksandr Zinenko

In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, Mar 2023

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer making use of transpiled CUDA PyTorch kernels outperforms the PyTorch CPU native backend by 2.7×.

2019

SC ’19

HyperX Topology: First at-Scale Implementation and Comparison to the Fat-Tree

Jens Domke, Satoshi Matsuoka, Ivan R. Ivanov, Yuki Tsushima, Tomoya Yuki, Akihiro Nomura, Shin’ichi Miura, Nie McDonald, Dennis L. Floyd, and Nicolas Dubé

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, Mar 2019

The de-facto standard topology for modern HPC systems and data-centers is the Folded Clos network, commonly known as the Fat-Tree. The number of network endpoints in these systems is steadily increasing. The switch radix increase is not keeping up, forcing an increased path length in these multi-level trees that will limit gains for latency-sensitive applications. Additionally, today’s Fat-Trees force the extensive use of active optical cables, which carry a prohibitive cost-structure at scale. To tackle these issues, researchers proposed various low-diameter topologies, such as Dragonfly. Another novel, but only theoretically studied, option is the HyperX. We built the world’s first 3 Pflop/s supercomputer with two separate networks, a 3-level Fat-Tree and a 12×8 HyperX. This dual-plane system allows us to perform a side-by-side comparison using a broad set of benchmarks. We show that the HyperX, together with our novel communication pattern-aware routing, can challenge the performance of, or even outperform, traditional Fat-Trees.

Talks

2025

RAPTOR: Practical Numerical Profiling of Scientific Applications

Faveo Hoerold, Ivan R. Ivanov, Akash Dhruv, William S. Moses, Anshu Dubey, Mohamed Wahib, and Jens Domke

In SC ’25, Nov 2025

Dynamic Thread Coarsening for CPU and GPU OpenMP Code

Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert

In LLVM-HPC @ SC ’25, Nov 2025

Automatic Minimal and Relocatable Proxy App Generation

Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert

In Student Research Competition at CGO 2025, Mar 2025

Polyhedral Rescheduling of GPU Kernels To Exploit Async Memory Movement

Ivan R. Ivanov, William Moses, Emil Vatai, Toshio Endo, Jens Domke, and Alex Zinenko

In Ninth LLVM Performance Workshop at CGO 2025, Mar 2025

2024

Automatic Parallelization and OpenMP Offloading of Fortran Array Notation

Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert

In 20th International Workshop on OpenMP, Sep 2024

Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training

Ivan R. Ivanov

In Monthly LLVM ML Guided Compiler Optimizations Meeting, Aug 2024

Retargeting and Respecializing GPU Workloads for Performance Portability

Ivan R. Ivanov

In R-CCS Cafe, Jun 2024

Automatic Retuning of Floating-Point Precision

Ivan R. Ivanov and W. S. Moses

In 2024 Euro LLVM Developers’ Meeting, Apr 2024

Automatic Proxy App Generation through Input Capture and Generation

Ivan R. Ivanov, Aiden Grossman, Ludger Paehler, William S. Moses, and Johannes Doerfert

In 2024 Euro LLVM Developers’ Meeting, Apr 2024

Retargeting and Respecializing GPU Workloads for Performance Portability

Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses

In CGO ’24, Mar 2024

Automatic Parallelization and OpenMP Offloading of Fortran

Ivan R. Ivanov, J. Domke, T. Endo, and J. Doerfert

In CGO ’24 LLVM Performance Workshop, Mar 2024

2023

Optimization of CUDA GPU Kernels and Translation to AMDGPU in Polygeist/MLIR

Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses

In 2023 LLVM Developers’ Meeting. Student Talk, Oct 2023

GPU Kernel Compilation in Polygeist/MLIR

Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, J. Doerfert, and W. S. Moses

In 2023 LLVM Developers’ Meeting GPU Offloading Workshop. Lightning Talk, Oct 2023

2022

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs in Polygeist/MLIR

W. S. Moses, Ivan R. Ivanov, J. Domke, T. Endo, J. Doerfert, and O. Zinenko

In 2022 LLVM Developers’ Meeting. Lightning Talk, Nov 2022

Automatic translation of CUDA code into high performance CPU code using LLVM IR transformations

Ivan R. Ivanov, J. Domke, and T. Endo

In The 4th R-CCS International Symposium. Lightning Talk, Feb 2022

2021

Improved failover for HPC interconnects through localised routing restoration

Ivan R. Ivanov, J. Domke, A. Nomura, and T. Endo

In The 3rd R-CCS International Symposium. Lightning Talk, Feb 2021

Posters

2026

Bridge Over Troubled Water: Offloading OpenMP Regions to XLA via StableHLO

Muyao Xiao, Ivan R. Ivanov, Jens Domke, and Toshio Endo

In SCA/HPCAsia 2026, Jan 2026

2025

Automatic Minimal and Relocatable Proxy App Generation

Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert

In Student Research Competition at CGO 2025, Mar 2025

2024

Dynamic Thread Coarsening for OpenMP Offloading.

Ivan R. Ivanov, J. Domke, T. Endo, and J. Doerfert.

In CGO ’24 Student Research Competition, Mar 2024

Unifying SPMD and Multi-Value IR - Use Case: Static Verification of Collective Communication.

S. Burak, Ivan R. Ivanov, J. Domke, and M. Mueller.

In CGO ’24 Student Research Competition, Mar 2024

2023

Performance Portability of C/C++ CUDA Code via High-Level Intermediate Representation

Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses.

In 2023 RIKEN Summer School, Sep 2023

BITFLEX - An HPC User-Driven Automatic Toolchain for Precision Manipulation and Approximate Computing.

Ryan Barton, Mohamed Wahib, Jens Domke, Ivan R. Ivanov, Toshio Endo, and Satoshi Matsuoka.

In ISC High Performance 2023, May 2023

Parallel Optimizations and Transformations of GPU Kernels Using a High-Level representation in MLIR/Polygeist.

Ivan R. Ivanov, William S. Moses, Jens Domke, and Toshio Endo.

In CGO ’23 Student Research Competition, Feb 2023

2022

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs in Polygeist/MLIR.

W. S. Moses, Ivan R. Ivanov, J. Domke, T. Endo, J. Doerfert, and O. Zinenko.

In 2022 LLVM Developers’ Meeting, Nov 2022

Awards

2026

RIKEN OHBU Award for “Innovative contributions to research and development of MLIR and compilers”

CGO 2026 Distinguished Paper Award

2025

SC25 Best Reproducibility Advancement Award

CGO 2025 ACM Student Research Competition 3rd place

Theses

Mar 2026 - PhD Thesis

Compiler-Driven Performance Portability for Heterogeneous High-Performance Computing

Mar 2023 - Master’s Thesis

Optimizations and Transformations of Parallel Code via High Level Intermediate Representation

Mar 2021 - Bachelor’s Thesis

Improved failover for HPC interconnects through localised routing restoration

Conference Service

CGO 2026 Workshop & Tutorial Chair

EuroPar 2025 Program Committee Member.

SC 24 Reproducibility Committee Member.

LLVM-GPU: First International Workshop on LLVM for GPUs at EuroPar 24 Program Committee Member

CGO25 Artifact Evaluation Committee Member.

CGO24 LLVM Performance Workshop Moderation.

2024 Euro LLVM Session Moderation.

Review for LLMxHPC 2024 at Cluster ’24

Review for IPDPS25.

Review for HPC Asia 2025.