Publications
2025
-
RAPTOR: Practical Numerical Profiling of Scientific Applications
Faveo Hoerold, Ivan R. Ivanov, Akash Dhruv, William S. Moses, Anshu Dubey, Mohamed Wahib, and Jens Domke
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025
The proliferation of low-precision units in modern high-performance architectures increasingly burdens domain scientists. Historically, the choice in HPC was easy: can we get away with 32-bit floating-point operations and lower bandwidth requirements, or is FP64 necessary? Driven by artificial intelligence, vendors introduce novel low-precision units for vector and tensor operations, while FP64 capabilities stagnate or are reduced. This forces scientists to re-evaluate their codes, but a trivial search-and-replace approach to go from FP64 to FP16 will not suffice. We introduce RAPTOR: a numerical profiling tool to guide scientists in their search for code regions where precision lowering is feasible. Using LLVM, we transparently replace high-precision computations with low-precision units, or emulate a user-defined precision. RAPTOR is a novel, feature-rich approach, with a focus on ease of use, to change, profile, and reason about numerical requirements and instabilities, which we demonstrate with four real-world multi-physics Flash-X applications.
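As a quick illustration of why such lowering needs tool support, consider the following self-contained C++ sketch (not RAPTOR code; the harmonic-sum kernel is a hypothetical example):

    // Hypothetical sketch, not RAPTOR itself: a blind FP64 -> FP32
    // search-and-replace can silently change results. In FP32 the sum
    // stagnates once 1/i falls below the rounding threshold of the
    // running total.
    #include <cstdio>

    template <typename T>
    T harmonic(int n) {
      T s = 0;
      for (int i = 1; i <= n; ++i)
        s += T(1) / T(i);
      return s;
    }

    int main() {
      std::printf("FP64: %.7f\n", harmonic<double>(10000000));
      std::printf("FP32: %.7f\n", (double)harmonic<float>(10000000));
      // The two results differ well beyond FP32 rounding error; a
      // numerical profiler helps locate regions with such sensitivity.
    }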
-
Dynamic Thread Coarsening for CPU and GPU OpenMP Code
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025
Thread coarsening is a well-known optimization technique for GPUs. It enables instruction-level parallelism, reduces redundant computation, and can provide better memory access patterns. However, the presence of divergent control flow, i.e., cases where uniformity of branch conditions among threads cannot be proven at compile time, diminishes its effectiveness. In this work, we implement multi-level thread coarsening for CPU and GPU OpenMP code via a generic thread coarsening transformation on LLVM IR. We introduce dynamic convergence, a new technique that generates both coarsened and non-coarsened versions of divergent regions in the code and allows the uniformity check to happen at runtime instead of compile time. We evaluated on HeCBench for GPU and LULESH for CPU, and found that the best-case speedup without dynamic convergence was 4.6% for GPUs and 2.9% for CPUs, while our approach achieved 7.5% for GPUs and 4.3% for CPUs.
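A rough sketch of the transformation in CUDA terms (illustrative only; the paper implements coarsening generically on LLVM IR, not by hand-editing kernels):

    // Original kernel: one element per thread.
    __global__ void scale(float *x, float a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        x[i] *= a;
    }

    // Coarsened by a factor of 2: each thread handles two elements, so
    // half as many threads are launched. The back-to-back updates expose
    // instruction-level parallelism and amortize the index arithmetic.
    __global__ void scale_coarse2(float *x, float a, int n) {
      int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
      if (i < n)
        x[i] *= a;
      if (i + 1 < n)
        x[i + 1] *= a;
    }

Dynamic convergence, as described above, would keep both variants and select between them at runtime once the uniformity of the divergent branches is known.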
2024
-
Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness
Emil Vatai, Aleksandr Drozd, Ivan R. Ivanov, Yinghao Ren, and Mohamed Wahib
2024
-
Automatic Parallelization and OpenMP Offloading of Fortran Array Notation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Advancing OpenMP for Future Accelerators, 2024
The Fortran programming language is prevalent in the scientific computing community, with a wealth of existing software written in it. It is still being developed, with the latest standard released in 2023. However, due to its long history, many old code bases are in need of modernization for new HPC systems. One advantage Fortran has over C and C++, the other languages broadly used in scientific computing, is its concise syntax for manipulating entire arrays or subarrays. However, this feature is underused, as there was no way of offloading such operations to accelerators, and support for parallelization has been unsatisfactory. The new OpenMP 6.0 standard introduces the workdistribute directive, which enables parallelization and/or offloading automatically by just annotating the region the programmer wishes to speed up. We implement workdistribute in the LLVM project’s Fortran compiler, Flang. Flang uses MLIR (Multi-Level Intermediate Representation), which allows for a structured representation that captures the high-level semantics of array manipulation and OpenMP. This allows us to build an implementation that performs on par with more verbose, manually parallelized OpenMP code. By offloading linear algebra operations to vendor libraries, we also enable software developers to easily unlock the full potential of their hardware without needing to write verbose, vendor-specific source code.
-
SPMD IR: Unifying SPMD and Multi-value IR Showcased for Static Verification of Collectives
Semih Burak, Ivan R. Ivanov, Jens Domke, and Matthias Müller
In Recent Advances in the Message Passing Interface, 2024
To effectively utilize modern HPC clusters, inter-node communication and related single program, multiple data (SPMD) parallel programming models such as MPI are inevitable. Current tools and compilers that employ analyses of SPMD models often have the limitation of only supporting one model or implementing the necessary abstraction internally. This makes neither the analysis nor the abstraction effort reusable, and the tool cannot be extended to other models without extensive changes to the tool itself.
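The class of defect such static verification targets can be shown with a small hypothetical MPI program (not taken from the paper) in which ranks disagree on which collective they enter:

    #include <mpi.h>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, send, recv = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      send = rank;
      // Mismatched collectives: rank 0 enters a broadcast while every
      // other rank enters a reduction, so the program deadlocks at
      // runtime. A verifier over a unified SPMD IR can flag the
      // mismatch statically.
      if (rank == 0)
        MPI_Bcast(&send, 1, MPI_INT, 0, MPI_COMM_WORLD);
      else
        MPI_Reduce(&send, &recv, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
    }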
-
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, Oleksandr Zinenko, Jens Domke, Toshio Endo, and William S. Moses
In 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Mar 2024
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that accounts for the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to not being sized appropriately for the available hardware resources, such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by performing automatic translation from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
2023
-
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs
William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, and Oleksandr Zinenko
In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, Mar 2023
While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads), based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification, and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU, achieving a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, which makes use of transpiled CUDA PyTorch kernels, outperforms the PyTorch CPU native backend by 2.7×.
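The flavor of this mapping can be sketched by hand (illustrative C++ with OpenMP, not actual Polygeist output; saxpy_cpu is a hypothetical name): a CUDA launch such as saxpy<<<gridDim, blockDim>>>(...) becomes nested loops in which blocks run in parallel and the threads of a block form a serial, vectorizable inner loop.

    // Hand-written sketch of the GPU-to-CPU mapping, not Polygeist output.
    void saxpy_cpu(float *y, const float *x, float a, int n,
                   int gridDim, int blockDim) {
      #pragma omp parallel for
      for (int bid = 0; bid < gridDim; ++bid)       // blocks -> parallel loop
        for (int tid = 0; tid < blockDim; ++tid) {  // threads -> inner loop
          int i = bid * blockDim + tid;
          if (i < n)
            y[i] = a * x[i] + y[i];
        }
    }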
2019
-
HyperX Topology: First at-Scale Implementation and Comparison to the Fat-Tree
Jens Domke, Satoshi Matsuoka, Ivan R. Ivanov, Yuki Tsushima, Tomoya Yuki, Akihiro Nomura, Shin’ichi Miura, Nic McDonald, Dennis L. Floyd, and Nicolas Dubé
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, Nov 2019
The de facto standard topology for modern HPC systems and data centers is the Folded Clos network, commonly known as the Fat-Tree. The number of network endpoints in these systems is steadily increasing. The switch radix increase is not keeping up, forcing an increased path length in these multi-level trees that will limit gains for latency-sensitive applications. Additionally, today’s Fat-Trees force the extensive use of active optical cables, which carries a prohibitive cost structure at scale. To tackle these issues, researchers proposed various low-diameter topologies, such as Dragonfly. Another novel, but only theoretically studied, option is the HyperX. We built the world’s first 3 Pflop/s supercomputer with two separate networks, a 3-level Fat-Tree and a 12×8 HyperX. This dual-plane system allows us to perform a side-by-side comparison using a broad set of benchmarks. We show that the HyperX, together with our novel communication pattern-aware routing, can challenge the performance of, or even outperform, traditional Fat-Trees.
Talks
2025
-
RAPTOR: Practical Numerical Profiling of Scientific Applications
Faveo Hoerold, Ivan R. Ivanov, Akash Dhruv, William S. Moses, Anshu Dubey, Mohamed Wahib, and Jens Domke
In SC ’25, Nov 2025
-
Dynamic Thread Coarsening for CPU and GPU OpenMP Code
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In LLVM-HPC @ SC ’25, Nov 2025
-
Automatic Minimal and Relocatable Proxy App Generation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Student Research Competition at CGO 2025, Mar 2025
-
Polyhedral Rescheduling of GPU Kernels To Exploit Async Memory Movement
Ivan R. Ivanov, William Moses, Emil Vatai, Toshio Endo, Jens Domke, and Alex Zinenko
In Ninth LLVM Performance Workshop at CGO 2025, Mar 2025
2024
-
Automatic Parallelization and OpenMP Offloading of Fortran Array Notation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In 20th International Workshop on OpenMP, Sep 2024
-
Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training
Ivan R. Ivanov
In Monthly LLVM ML Guided Compiler Optimizations Meeting, Aug 2024
-
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov
In R-CCS Cafe, Jun 2024
-
Automatic Retuning of Floating-Point Precision
Ivan R. Ivanov and W. S. Moses
In 2024 Euro LLVM Developers’ Meeting, Apr 2024
-
Automatic Proxy App Generation through Input Capture and Generation
Ivan R. Ivanov, Aiden Grossman, Ludger Paehler, William S. Moses, and Johannes Doerfert
In 2024 Euro LLVM Developers’ Meeting, Apr 2024
-
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses
In CGO ’24, Mar 2024
-
Automatic Parallelization and OpenMP Offloading of Fortran
Ivan R. Ivanov, J. Domke, T. Endo, and J. Doerfert
In CGO ’24 LLVM Performance Workshop, Mar 2024
2023
-
Optimization of CUDA GPU Kernels and Translation to AMDGPU in Polygeist/MLIR
Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses
In 2023 LLVM Developers’ Meeting. Student Talk, Oct 2023
-
GPU Kernel Compilation in Polygeist/MLIR
Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, J. Doerfert, and W. S. Moses
In 2023 LLVM Developers’ Meeting GPU Offloading Workshop. Lightning Talk, Oct 2023
2022
-
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs in Polygeist/MLIR
W. S. Moses, Ivan R. Ivanov, J. Domke, T. Endo, J. Doerfert, and O. Zinenko
In 2022 LLVM Developers’ Meeting. Lightning Talk, Nov 2022
-
Automatic Translation of CUDA Code into High Performance CPU Code Using LLVM IR Transformations
Ivan R. Ivanov, J. Domke, and T. Endo
In The 4th R-CCS International Symposium. Lightning Talk, Feb 2022
2021
-
Improved failover for HPC interconnects through localised routing restoration
Ivan R. Ivanov, J. Domke, A. Nomura, and T. Endo
In The 3rd R-CCS International Symposium. Lightning Talk, Feb 2021
Posters
2025
-
Automatic Minimal and Relocatable Proxy App Generation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Student Research Competition at CGO 2025, Mar 2025
2024
-
Dynamic Thread Coarsening for OpenMP Offloading
Ivan R. Ivanov, J. Domke, T. Endo, and J. Doerfert
In CGO ’24 Student Research Competition, Mar 2024
-
Unifying SPMD and Multi-Value IR - Use Case: Static Verification of Collective Communication
S. Burak, Ivan R. Ivanov, J. Domke, and M. Müller
In CGO ’24 Student Research Competition, Mar 2024
2023
-
Performance Portability of C/C++ CUDA Code via High-Level Intermediate Representation
Ivan R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses
In 2023 RIKEN Summer School, Sep 2023
-
BITFLEX - An HPC User-Driven Automatic Toolchain for Precision Manipulation and Approximate Computing
Ryan Barton, Mohamed Wahib, Jens Domke, Ivan R. Ivanov, Toshio Endo, and Satoshi Matsuoka
In ISC High Performance 2023, May 2023
-
Parallel Optimizations and Transformations of GPU Kernels Using a High-Level Representation in MLIR/Polygeist
Ivan R. Ivanov, William S. Moses, Jens Domke, and Toshio Endo
In CGO ’23 Student Research Competition, Feb 2023
2022
-
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs in Polygeist/MLIR
W. S. Moses, Ivan R. Ivanov, J. Domke, T. Endo, J. Doerfert, and O. Zinenko
In 2022 LLVM Developers’ Meeting, Nov 2022
Awards
SC25 Best Reproducibility Advancement Award
CGO 2025 ACM Student Research Competition, 3rd place
Theses
Mar 2023 - Master’s Thesis
Optimizations and Transformations of Parallel Code via High Level Intermediate Representation
Mar 2021 - Bachelor’s Thesis
Improved failover for HPC interconnects through localised routing restoration
Conference Service
CGO 2026 Workshop & Tutorial Chair
Euro-Par 2025 Program Committee Member
SC ’24 Reproducibility Committee Member
LLVM-GPU: First International Workshop on LLVM for GPUs at Euro-Par ’24, Program Committee Member
CGO 2025 Artifact Evaluation Committee Member
CGO 2024 LLVM Performance Workshop Moderation
2024 Euro LLVM Developers’ Meeting Session Moderation
Reviewer for LLMxHPC 2024 at Cluster ’24
Reviewer for IPDPS 2025
Reviewer for HPC Asia 2025