Publications
2025
-
RAPTOR: Practical Numerical Profiling of Scientific Applications
Faveo Hoerold, Ivan R. Ivanov, Akash Dhruv, William S. Moses, Anshu Dubey, Mohamed Wahib, and Jens Domke
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025
The proliferation of low-precision units in modern high-performance architectures increasingly burdens domain scientists. Historically, the choice in HPC was easy: can we get away with 32-bit floating-point operations and lower bandwidth requirements, or is FP64 necessary? Driven by artificial intelligence, vendors introduce novel low-precision units for vector and tensor operations, while FP64 capabilities stagnate or are reduced. This forces scientists to re-evaluate their codes, but a trivial search-and-replace approach to go from FP64 to FP16 will not suffice. We introduce RAPTOR: a numerical profiling tool to guide scientists in their search for code regions where precision lowering is feasible. Using LLVM, we transparently replace high-precision computations with low-precision units, or emulate a user-defined precision. RAPTOR is a novel, feature-rich approach, with a focus on ease of use, to change, profile, and reason about numerical requirements and instabilities, which we demonstrate with four real-world multi-physics Flash-X applications.
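As a quick illustration of why such lowering needs tool support, consider the following self-contained C++ sketch (not RAPTOR code; the harmonic-sum kernel is a hypothetical example):

    // Hypothetical sketch, not RAPTOR itself: a blind FP64 -> FP32
    // search-and-replace can silently change results. In FP32 the sum
    // stagnates once 1/i falls below the rounding threshold of the
    // running total.
    #include <cstdio>

    template <typename T>
    T harmonic(int n) {
      T s = 0;
      for (int i = 1; i <= n; ++i)
        s += T(1) / T(i);
      return s;
    }

    int main() {
      std::printf("FP64: %.7f\n", harmonic<double>(10000000));
      std::printf("FP32: %.7f\n", (double)harmonic<float>(10000000));
      // The two results differ well beyond FP32 rounding error; a
      // numerical profiler helps locate regions with such sensitivity.
    }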
-
Dynamic Thread Coarsening for CPU and GPU OpenMP Code
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025
Thread coarsening is a well-known optimization technique for GPUs. It enables instruction-level parallelism, reduces redundant computation, and can provide better memory access patterns. However, the presence of divergent control flow, i.e., cases where uniformity of branch conditions among threads cannot be proven at compile time, diminishes its effectiveness. In this work, we implement multi-level thread coarsening for CPU and GPU OpenMP code via a generic thread coarsening transformation on LLVM IR. We introduce dynamic convergence, a new technique that generates both coarsened and non-coarsened versions of divergent regions in the code and allows the uniformity check to happen at runtime instead of compile time. We evaluated on HeCBench for GPU and LULESH for CPU, and found that the best-case speedup without dynamic convergence was 4.6% for GPUs and 2.9% for CPUs, while our approach achieved 7.5% for GPUs and 4.3% for CPUs.
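A rough sketch of the transformation in CUDA terms (illustrative only; the paper implements coarsening generically on LLVM IR, not by hand-editing kernels):

    // Original kernel: one element per thread.
    __global__ void scale(float *x, float a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        x[i] *= a;
    }

    // Coarsened by a factor of 2: each thread handles two elements, so
    // half as many threads are launched. The back-to-back updates expose
    // instruction-level parallelism and amortize the index arithmetic.
    __global__ void scale_coarse2(float *x, float a, int n) {
      int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
      if (i < n)
        x[i] *= a;
      if (i + 1 < n)
        x[i + 1] *= a;
    }

Dynamic convergence, as described above, would keep both variants and select between them at runtime once the uniformity of the divergent branches is known.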
2024
-
Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness
Emil Vatai, Aleksandr Drozd, Ivan R. Ivanov, Yinghao Ren, and Mohamed Wahib
2024
-
Automatic Parallelization and OpenMP Offloading of Fortran Array Notation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Advancing OpenMP for Future Accelerators, 2024
The Fortran programming language is prevalent in the scientific computing community, with a wealth of existing software written in it. It is still being developed, with the latest standard released in 2023. However, due to its long history, many old code bases are in need of modernization for new HPC systems. One advantage Fortran has over C and C++, the other languages broadly used in scientific computing, is its concise syntax for manipulating entire arrays or subarrays. However, this feature is underused, as there was no way of offloading such operations to accelerators, and support for parallelization has been unsatisfactory. The new OpenMP 6.0 standard introduces the workdistribute directive, which enables parallelization and/or offloading automatically by just annotating the region the programmer wishes to speed up. We implement workdistribute in the LLVM project’s Fortran compiler, Flang. Flang uses MLIR (Multi-Level Intermediate Representation), which allows for a structured representation that captures the high-level semantics of array manipulation and OpenMP. This allows us to build an implementation that performs on par with more verbose, manually parallelized OpenMP code. By offloading linear algebra operations to vendor libraries, we also enable software developers to easily unlock the full potential of their hardware without needing to write verbose, vendor-specific source code.
-
SPMD IR: Unifying SPMD and Multi-value IR Showcased for Static Verification of Collectives
Semih Burak, Ivan R. Ivanov, Jens Domke, and Matthias Müller
In Recent Advances in the Message Passing Interface, 2024
To effectively utilize modern HPC clusters, inter-node communication and related single program, multiple data (SPMD) parallel programming models such as MPI are inevitable. Current tools and compilers that employ analyses of SPMD models often have the limitation of only supporting one model or implementing the necessary abstraction internally. This makes neither the analysis nor the abstraction effort reusable, and the tool cannot be extended to other models without extensive changes to the tool itself.
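The class of defect such static verification targets can be shown with a small hypothetical MPI program (not taken from the paper) in which ranks disagree on which collective they enter:

    #include <mpi.h>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, send, recv = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      send = rank;
      // Mismatched collectives: rank 0 enters a broadcast while every
      // other rank enters a reduction, so the program deadlocks at
      // runtime. A verifier over a unified SPMD IR can flag the
      // mismatch statically.
      if (rank == 0)
        MPI_Bcast(&send, 1, MPI_INT, 0, MPI_COMM_WORLD);
      else
        MPI_Reduce(&send, &recv, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
    }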
-
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, Oleksandr Zinenko, Jens Domke, Toshio Endo, and William S. Moses
In 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Mar 2024
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that accounts for the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to not being sized appropriately for the available hardware resources, such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by performing automatic translation from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
2023
-
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs
William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, and Oleksandr Zinenko
In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, Mar 2023
While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads), based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification, and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU, achieving a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, which makes use of transpiled CUDA PyTorch kernels, outperforms the PyTorch CPU native backend by 2.7×.
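The flavor of this mapping can be sketched by hand (illustrative C++ with OpenMP, not actual Polygeist output; saxpy_cpu is a hypothetical name): a CUDA launch such as saxpy<<<gridDim, blockDim>>>(...) becomes nested loops in which blocks run in parallel and the threads of a block form a serial, vectorizable inner loop.

    // Hand-written sketch of the GPU-to-CPU mapping, not Polygeist output.
    void saxpy_cpu(float *y, const float *x, float a, int n,
                   int gridDim, int blockDim) {
      #pragma omp parallel for
      for (int bid = 0; bid < gridDim; ++bid)       // blocks -> parallel loop
        for (int tid = 0; tid < blockDim; ++tid) {  // threads -> inner loop
          int i = bid * blockDim + tid;
          if (i < n)
            y[i] = a * x[i] + y[i];
        }
    }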
2019
-
HyperX Topology: First at-Scale Implementation and Comparison to the Fat-Tree
Jens Domke, Satoshi Matsuoka, Ivan R. Ivanov, Yuki Tsushima, Tomoya Yuki, Akihiro Nomura, Shin’ichi Miura, Nic McDonald, Dennis L. Floyd, and Nicolas Dubé
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, Nov 2019
The de facto standard topology for modern HPC systems and data centers is the Folded Clos network, commonly known as the Fat-Tree. The number of network endpoints in these systems is steadily increasing. The switch radix increase is not keeping up, forcing an increased path length in these multi-level trees that will limit gains for latency-sensitive applications. Additionally, today’s Fat-Trees force the extensive use of active optical cables, which carries a prohibitive cost structure at scale. To tackle these issues, researchers proposed various low-diameter topologies, such as Dragonfly. Another novel, but only theoretically studied, option is the HyperX. We built the world’s first 3 Pflop/s supercomputer with two separate networks, a 3-level Fat-Tree and a 12×8 HyperX. This dual-plane system allows us to perform a side-by-side comparison using a broad set of benchmarks. We show that the HyperX, together with our novel communication pattern-aware routing, can challenge the performance of, or even outperform, traditional Fat-Trees.
Talks
2025
-
RAPTOR: Practical Numerical Profiling of Scientific Applications
Faveo Hoerold, Ivan R. Ivanov, Akash Dhruv, William S. Moses, Anshu Dubey, Mohamed Wahib, and Jens Domke
In SC ’25, Nov 2025
-
Dynamic Thread Coarsening for CPU and GPU OpenMP Code
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In LLVM-HPC @ SC ’25, Nov 2025
-
Automatic Minimal and Relocatable Proxy App Generation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Student Research Competition at CGO 2025, Mar 2025
-
Polyhedral Rescheduling of GPU Kernels To Exploit Async Memory Movement
Ivan R. Ivanov, William Moses, Emil Vatai, Toshio Endo, Jens Domke, and Alex Zinenko
In Ninth LLVM Performance Workshop at CGO 2025, Mar 2025
2024
-
Automatic Parallelization and OpenMP Offloading of Fortran Array Notation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In 20th International Workshop on OpenMP, Sep 2024
-
Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training
Ivan R. Ivanov
In Monthly LLVM ML Guided Compiler Optimizations Meeting, Aug 2024
-
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov
In R-CCS Cafe, Jun 2024
-
Automatic Retuning of Floating-Point Precision
Ivan R. Ivanov and W. S. Moses
In 2024 Euro LLVM Developers’ Meeting, Apr 2024
-
Automatic Proxy App Generation through Input Capture and Generation
Ivan R. Ivanov, Aiden Grossman, Ludger Paehler, William S. Moses, and Johannes Doerfert
In 2024 Euro LLVM Developers’ Meeting, Apr 2024
-
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses
In CGO ’24, Mar 2024
-
Automatic Parallelization and OpenMP Offloading of Fortran
Ivan R. Ivanov, J. Domke, T. Endo, and J. Doerfert
In CGO ’24 LLVM Performance Workshop, Mar 2024
2023
-
Optimization of CUDA GPU Kernels and Translation to AMDGPU in Polygeist/MLIR
Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses
In 2023 LLVM Developers’ Meeting. Student Talk, Oct 2023
-
GPU Kernel Compilation in Polygeist/MLIR
Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, J. Doerfert, and W. S. Moses
In 2023 LLVM Developers’ Meeting GPU Offloading Workshop. Lightning Talk, Oct 2023
2022
-
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs in Polygeist/MLIR
W. S. Moses, Ivan R. Ivanov, J. Domke, T. Endo, J. Doerfert, and O. Zinenko
In 2022 LLVM Developers’ Meeting. Lightning Talk, Nov 2022
-
Automatic Translation of CUDA Code into High Performance CPU Code Using LLVM IR Transformations
Ivan R. Ivanov, J. Domke, and T. Endo
In The 4th R-CCS International Symposium. Lightning Talk, Feb 2022
2021
-
Improved failover for HPC interconnects through localised routing restoration
Ivan R. Ivanov, J. Domke, A. Nomura, and T. Endo
In The 3rd R-CCS International Symposium. Lightning Talk, Feb 2021
Posters
2025
-
Automatic Minimal and Relocatable Proxy App Generation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Student Research Competition at CGO 2025, Mar 2025
2024
-
Dynamic Thread Coarsening for OpenMP Offloading
Ivan R. Ivanov, J. Domke, T. Endo, and J. Doerfert
In CGO ’24 Student Research Competition, Mar 2024
-
Unifying SPMD and Multi-Value IR - Use Case: Static Verification of Collective Communication
S. Burak, Ivan R. Ivanov, J. Domke, and M. Müller
In CGO ’24 Student Research Competition, Mar 2024
2023
-
Performance Portability of C/C++ CUDA Code via High-Level Intermediate Representation
Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses
In 2023 RIKEN Summer School, Sep 2023
-
BITFLEX - An HPC User-Driven Automatic Toolchain for Precision Manipulation and Approximate Computing
Ryan Barton, Mohamed Wahib, Jens Domke, Ivan R. Ivanov, Toshio Endo, and Satoshi Matsuoka
In ISC High Performance 2023, May 2023
-
Parallel Optimizations and Transformations of GPU Kernels Using a High-Level Representation in MLIR/Polygeist
Ivan R. Ivanov, William S. Moses, Jens Domke, and Toshio Endo
In CGO ’23 Student Research Competition, Feb 2023
2022
-
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs in Polygeist/MLIR
W. S. Moses, Ivan R. Ivanov, J. Domke, T. Endo, J. Doerfert, and O. Zinenko
In 2022 LLVM Developers’ Meeting, Nov 2022
Awards
SC25 Best Reproducibility Advancement Award
CGO 2025 ACM Student Research Competition, 3rd place
Theses
Mar 2023 - Master’s Thesis
Optimizations and Transformations of Parallel Code via High Level Intermediate Representation
Mar 2021 - Bachelor’s Thesis
Improved failover for HPC interconnects through localised routing restoration
Conference Service
CGO 2026 Workshop & Tutorial Chair
Euro-Par 2025 Program Committee Member
SC ’24 Reproducibility Committee Member
LLVM-GPU: First International Workshop on LLVM for GPUs at Euro-Par ’24, Program Committee Member
CGO 2025 Artifact Evaluation Committee Member
CGO 2024 LLVM Performance Workshop Moderation
2024 Euro LLVM Developers’ Meeting Session Moderation
Reviewer for LLMxHPC 2024 at Cluster ’24
Reviewer for IPDPS 2025
Reviewer for HPC Asia 2025