Publications
2024
-
Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness
Emil Vatai, Aleksandr Drozd, Ivan R. Ivanov, Yinghao Ren, and Mohamed Wahib
2024
-
Automatic Parallelization and OpenMP Offloading of Fortran Array Notation
Ivan R. Ivanov, Jens Domke, Toshio Endo, and Johannes Doerfert
In Advancing OpenMP for Future Accelerators, 2024
The Fortran programming language is prevalent in the scientific computing community, with a wealth of existing software written in it. It is still actively developed, with the latest standard released in 2023. However, due to its long history, many old code bases are in need of modernization for new HPC systems. One advantage Fortran has over C and C++, two other languages broadly used in scientific computing, is its concise syntax for manipulating entire arrays or subarrays. However, this feature is underused, as there has been no way of offloading such array operations to accelerators, and support for parallelizing them has been unsatisfactory. The new OpenMP 6.0 standard introduces the workdistribute directive, which enables parallelization and/or offloading automatically by simply annotating the region the programmer wishes to speed up. We implement workdistribute in the LLVM project’s Fortran compiler, Flang. Flang uses MLIR (Multi-Level Intermediate Representation), which allows for a structured representation that captures the high-level semantics of array manipulation and OpenMP. This allows us to build an implementation that performs on par with more verbose, manually parallelized OpenMP code. By offloading linear algebra operations to vendor libraries, we also enable software developers to easily unlock the full potential of their hardware without needing to write verbose, vendor-specific source code.
-
SPMD IR: Unifying SPMD and Multi-value IR Showcased for Static Verification of Collectives
Semih Burak, Ivan R. Ivanov, Jens Domke, and Matthias Müller
In Recent Advances in the Message Passing Interface, 2024
To effectively utilize modern HPC clusters, inter-node communication and related single program, multiple data (SPMD) parallel programming models such as MPI are inevitable. Current tools and compilers that employ analyses of SPMD models often have the limitation of only supporting one model, or of implementing the necessary abstraction internally. This makes neither the analysis nor the effort spent on the abstraction reusable, and the tool cannot be extended to other models without extensive changes to the tool itself.
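To make concrete the kind of defect that static verification of collectives targets, the snippet below is a minimal, hand-written MPI sketch (an assumed example for illustration only, not code from the paper): rank 0 enters a rooted reduction while every other rank enters an all-reduce, so the per-rank sequences of collectives diverge.

```cpp
// Illustrative sketch only (not from the paper): a collective mismatch of the
// kind an SPMD-aware static verifier aims to flag at compile time.
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, value = 1, result = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    // Rank 0 participates in a reduction rooted at rank 0...
    MPI_Reduce(&value, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  } else {
    // ...while all other ranks call a different collective, so the SPMD
    // program's per-rank collective sequences no longer match.
    MPI_Allreduce(&value, &result, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}
```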
-
Retargeting and Respecializing GPU Workloads for Performance Portability
I. R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses
In 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Mar 2024
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understands the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to not being sized appropriately for the available hardware resources such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by performing automatic translation from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
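To give a concrete, if simplified, picture of the granularity adjustment described above, the CUDA C++ sketch below is a hand-written illustration (the kernel, names, and coarsening factor are assumptions, not the paper's code): a one-element-per-thread kernel is coarsened so that each thread handles several consecutive elements, which shrinks the launched grid and shifts per-thread register and memory pressure to better match a differently sized GPU.

```cpp
// Illustrative sketch of thread/work coarsening (not the paper's implementation).
// Original kernel: one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

// Coarsened kernel: each thread processes COARSEN consecutive elements, so the
// grid shrinks by COARSEN while the work (and register use) per thread grows.
template <int COARSEN>
__global__ void saxpy_coarsened(int n, float a, const float *x, float *y) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
#pragma unroll
  for (int k = 0; k < COARSEN; ++k) {
    int i = base + k;
    if (i < n)
      y[i] = a * x[i] + y[i];
  }
}

// Launch side: the same problem size now needs a quarter of the blocks.
// saxpy<<<(n + 255) / 256, 256>>>(n, a, x, y);
// saxpy_coarsened<4><<<(n + 1023) / 1024, 256>>>(n, a, x, y);
```

The coarsening factor here is the kind of per-target knob that the autotuning mentioned in the abstract would explore.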
2023
-
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs
William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, and Oleksandr Zinenko
In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, Mar 2023
While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads), based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification, and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, which makes use of transpiled CUDA PyTorch kernels, outperforms the PyTorch native CPU backend by 2.7×.
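As a rough hand-written analogue of the CUDA-to-CPU-threads translation described above (a sketch under assumed names, not actual Polygeist/MLIR output), the snippet below shows how a kernel's grid and block structure can map to nested CPU loops, with the outer block loop parallelized so that conventional loop optimizations can then be applied.

```cpp
// Illustrative sketch of mapping a CUDA kernel to CPU parallel loops
// (hand-written analogue, not actual Polygeist/MLIR output).
__global__ void scale_kernel(int n, float a, float *x) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i] = a * x[i];
}

// CPU analogue: the grid becomes an outer parallel loop over "blocks" and an
// inner serial loop over the "threads" of each block; ordinary loop
// transformations (fusion, interchange, vectorization) can then apply.
void scale_cpu(int n, float a, float *x, int gridDimX, int blockDimX) {
  #pragma omp parallel for
  for (int block = 0; block < gridDimX; ++block) {
    for (int thread = 0; thread < blockDimX; ++thread) {
      int i = block * blockDimX + thread;
      if (i < n)
        x[i] = a * x[i];
    }
  }
}
```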
2019
-
HyperX Topology: First at-Scale Implementation and Comparison to the Fat-Tree
Jens Domke, Satoshi Matsuoka, Ivan R. Ivanov, Yuki Tsushima, Tomoya Yuki, Akihiro Nomura, Shin’ichi Miura, Nic McDonald, Dennis L. Floyd, and Nicolas Dubé
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, Nov 2019
The de-facto standard topology for modern HPC systems and data centers is the Folded Clos network, commonly known as the Fat-Tree. The number of network endpoints in these systems is steadily increasing. The switch radix increase is not keeping up, forcing an increased path length in these multi-level trees that will limit gains for latency-sensitive applications. Additionally, today’s Fat-Trees force the extensive use of active optical cables, which carries a prohibitive cost structure at scale. To tackle these issues, researchers proposed various low-diameter topologies, such as Dragonfly. Another novel, but only theoretically studied, option is the HyperX. We built the world’s first 3 Pflop/s supercomputer with two separate networks, a 3-level Fat-Tree and a 12×8 HyperX. This dual-plane system allows us to perform a side-by-side comparison using a broad set of benchmarks. We show that the HyperX, together with our novel communication-pattern-aware routing, can challenge the performance of, or even outperform, traditional Fat-Trees.
Talks
Sep. 2024
I. R. Ivanov, J. Domke, T. Endo, and J. Doerfert. Automatic Parallelization and OpenMP Offloading of Fortran Array Notation. 20th International Workshop on OpenMP
Aug. 2024
I. R. Ivanov. Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training. Monthly LLVM ML Guided Compiler Optimizations Meeting
Jun. 2024
I. R. Ivanov. Retargeting and Respecializing GPU Workloads for Performance Portability. R-CCS Cafe
Apr. 2024
I. R. Ivanov and W. S. Moses. Automatic Retuning of Floating-Point Precision. 2024 Euro LLVM Developers’ Meeting
Apr. 2024
I. R. Ivanov, A. Grossman, L. Paehler, W. S. Moses, and J. Doerfert. Automatic Proxy App Generation through Input Capture and Generation. 2024 Euro LLVM Developers’ Meeting
Mar. 2024
I. R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses. Retargeting and Respecializing GPU Workloads for Performance Portability. CGO ’24
Mar. 2024
I. R. Ivanov, J. Domke, T. Endo, and J. Doerfert. Automatic Parallelization and OpenMP Offloading of Fortran. CGO ’24 LLVM Performance Workshop
Oct. 2023
I. R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses. Optimization of CUDA GPU Kernels and Translation to AMDGPU in Polygeist/MLIR. 2023 LLVM Developers’ Meeting. Student Talk
Oct. 2023
I. R. Ivanov, O. Zinenko, J. Domke, T. Endo, J. Doerfert, and W. S. Moses. GPU Kernel Compilation in Polygeist/MLIR. 2023 LLVM Developers’ Meeting GPU Offloading Workshop. Lightning Talk
Nov. 2022
W. S. Moses, I. R. Ivanov, J. Domke, T. Endo, J. Doerfert, and O. Zinenko. High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs in Polygeist/MLIR. 2022 LLVM Developers’ Meeting. Lightning Talk
Feb. 2022
I. R. Ivanov, J. Domke, and T. Endo. Automatic translation of CUDA code into high performance CPU code using LLVM IR transformations. The 4th R-CCS International Symposium. Lightning Talk
Feb. 2021
I. R. Ivanov, J. Domke, A. Nomura, and T. Endo. Improved failover for HPC interconnects through localised routing restoration. The 3rd R-CCS International Symposium. Lightning Talk
Posters
Apr. 2024
I. R. Ivanov, J. Domke, T. Endo, and J. Doerfert. Automatic Parallelization and OpenMP Offloading of Fortran. JLESC 16
Mar. 2024
I. R. Ivanov, J. Domke, T. Endo, and J. Doerfert. Dynamic Thread Coarsening for OpenMP Offloading. CGO ’24 Student Research Competition.
Mar. 2024
S. Burak, I. R. Ivanov, J. Domke, and M. Müller. Unifying SPMD and Multi-Value IR - Use Case: Static Verification of Collective Communication. CGO ’24 Student Research Competition.
Sep. 2023
I. R. Ivanov, O. Zinenko, J. Domke, T. Endo, and W. S. Moses. Performance Portability of C/C++ CUDA Code via High-Level Intermediate Representation. 2023 RIKEN Summer School
May 2023
R. Barton, M. Wahib, J. Domke, I. R. Ivanov, T. Endo, and S. Matsuoka. BITFLEX - An HPC User-Driven Automatic Toolchain for Precision Manipulation and Approximate Computing. ISC High Performance 2023
Feb. 2023
I. R. Ivanov, W. S. Moses, J. Domke, and T. Endo. Parallel Optimizations and Transformations of GPU Kernels Using a High-Level Representation in MLIR/Polygeist. CGO ’23 Student Research Competition
Nov. 2022
W. S. Moses, I. R. Ivanov, J. Domke, T. Endo, J. Doerfert, and O. Zinenko. High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs in Polygeist/MLIR. 2022 LLVM Developers’ Meeting.
Theses
Mar 2023 - Master’s Thesis
Optimizations and Transformations of Parallel Code via High Level Intermediate Representation
Mar 2021 - Bachelor’s Thesis
Improved failover for HPC interconnects through localised routing restoration
Conference Service
CGO ’24 LLVM Performance Workshop Moderation
2024 Euro LLVM Developers’ Meeting Session Moderation
SC ’24 Reproducibility Committee Member
Reviewer for LLMxHPC 2024 @ Cluster ’24