Abstract
The demand for efficient linear solvers in large-scale reservoir simulation has driven the development of graphics processing unit (GPU)-accelerated preconditioners. However, existing GPU preconditioners often face limitations in memory efficiency, portability across architectures, and numerical consistency with central processing unit (CPU)-based solvers. We developed autotuned GPU implementations of the incomplete lower-upper (LU) factorization with zero fill-in (ILU(0)) and diagonal ILU(0) (DILU) preconditioners and integrated them into the open-source Open Porous Media (OPM) Flow simulator. The preconditioners were conservatively parallelized to ensure numerical consistency with serial computation (up to instruction-set equivalence). Our approach incorporated level-set scheduling, post-factorization row reordering, matrix splitting, execution graphs from the Compute Unified Device Architecture (CUDA) and Heterogeneous-computing Interface for Portability (HIP) programming models, mixed-precision schemes, and autotuning to improve runtime performance. We evaluated our implementations on both NVIDIA® and AMD architectures, including consumer-grade and high-performance computing (HPC)-grade GPUs. ILU(0) and DILU exhibited similar numerical behavior and performance, although DILU used significantly less memory. Both implementations outperformed reference ILU(0) solvers based on hipSPARSE or cuSPARSE. The triangular solve achieved speedups of up to 5.25X and the factorization of up to 2.5X, yielding an overall linear solver speedup of 2.92X. Profiling showed that our solvers utilized more than half of the available memory bandwidth on consumer GPUs and nearly half on HPC GPUs. Additionally, they outperformed a more sophisticated constrained pressure residual (CPR) preconditioner running on a 16-core CPU for medium and large realizations of Case C from the 11th SPE Comparative Solution Project, with speedups of up to 5.12X per linear solve. While the implementations do not fully exploit the capabilities of HPC-grade GPUs, further gains may be achieved through optimized inter-kernel synchronization, improved cache usage, and better memory bandwidth utilization. Future work includes extending the approach to GPU-based CPR-algebraic multigrid (AMG) preconditioners and assembling the linear system directly on the GPU to reduce data transfer overhead.
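As a minimal sketch of the level-set scheduling mentioned above, the following CUDA kernel and host loop solve the unit-diagonal lower triangular system L x = b that arises when applying an ILU(0) preconditioner. All identifiers (solveLowerLevel, levelRowLists, and so on) are hypothetical, and the code assumes a CSR matrix whose rows have been pre-grouped into dependency levels; it illustrates the technique under those assumptions and is not OPM Flow's actual implementation.

```cuda
#include <cuda_runtime.h>

// Solve all rows belonging to one level set in parallel. Rows within a
// level are mutually independent: they depend only on unknowns computed
// in earlier levels.
__global__ void solveLowerLevel(const int* __restrict__ rowPtr,
                                const int* __restrict__ colIdx,
                                const double* __restrict__ vals,
                                const int* __restrict__ levelRows, // rows of this level
                                int numRowsInLevel,
                                const double* __restrict__ b,
                                double* __restrict__ x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRowsInLevel) return;
    int row = levelRows[i];
    double sum = b[row];
    // Accumulate the strictly lower part of the row; x[col] is already
    // final because col belongs to an earlier level.
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k) {
        int col = colIdx[k];
        if (col < row) sum -= vals[k] * x[col];
    }
    x[row] = sum; // unit diagonal in the ILU(0) L factor
}

// Host side: one launch per level. Consecutive launches on the same stream
// execute in order, which provides the required inter-level synchronization.
void solveLower(const int* rowPtr, const int* colIdx, const double* vals,
                const int* const* levelRowLists, // device row lists, one per level
                const int* levelSizes, int numLevels,
                const double* b, double* x, cudaStream_t stream)
{
    for (int lvl = 0; lvl < numLevels; ++lvl) {
        int n = levelSizes[lvl];
        int block = 256;
        int grid = (n + block - 1) / block;
        solveLowerLevel<<<grid, block, 0, stream>>>(rowPtr, colIdx, vals,
                                                    levelRowLists[lvl], n, b, x);
    }
}
```

Because the per-level launches serialize on the stream, the kernels synchronize implicitly; capturing this fixed launch sequence into a CUDA or HIP execution graph, as the abstract describes, is one way to amortize per-launch overhead across the many preconditioner applications in a linear solve.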