Exact State Reconstruction of Linear Iterative Solvers with NVRAM

This project includes the implementation of In-NVRAM Exact State Reconstruction (ESR) for the preconditioned conjugate gradient (PCG) solver, as described in https://arxiv.org/pdf/2204.11584.pdf. We also plan to research the recoverability (with NVRAM) of concurrent applications that use OpenMP.

Project status: Under Development

HPC

Intel Technologies
Optane

Code Samples [1]

Overview / Usage

Iterative linear solvers are core kernels in scientific applications. Exact State Reconstruction (ESR) techniques, proposed over the last decade, rely on RAM to create redundancies for certain state variables of the solver. Our research investigates how NVRAM can be used to reduce the memory and time overheads of what we call in-RAM ESR, and to eliminate its main problems: an extended memory footprint and a constant surge of network traffic.

The result is the enhanced in-NVRAM ESR. Instead of populating RAM with many redundancies for fault tolerance, it sends just one copy of the recovery data per persistence cycle, using direct access (DAX) over RDMA, straight to the persistent NVRAM. Accessing byte-addressable NVRAM directly, without the latency of moving data over the I/O bus, with performance comparable to RAM and only a small overhead, yields a more advanced ESR mechanism without compromising data and recovery consistency.
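The persistence-cycle idea can be illustrated with a minimal, single-process sketch. It is not the implementation from the repository and it does not reproduce the ESR reconstruction described in the paper: it simply keeps one copy of the PCG state every `period` iterations and rolls back to that copy after a simulated failure. The function name `pcg_with_persistence`, the parameters `period` and `fail_at`, the Jacobi preconditioner, and the use of an in-memory tuple as a stand-in for NVRAM are all assumptions made for illustration.

```python
def matvec(A, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg_with_persistence(A, b, period=3, fail_at=None, tol=1e-10, max_iter=500):
    """Jacobi-preconditioned CG that keeps ONE persisted copy of its state.

    Every `period` iterations the state (k, x, r, p, r.z) overwrites the
    previous persisted copy. If `fail_at` is given, a failure is simulated
    once at that iteration and the solver rolls back to the persisted copy.
    Returns the solution and the number of iteration bodies executed.
    """
    n = len(b)
    m_inv = [1.0 / A[i][i] for i in range(n)]      # Jacobi preconditioner
    x = [0.0] * n
    r = list(b)                                    # r = b - A*0
    z = [m * ri for m, ri in zip(m_inv, r)]
    p = list(z)
    rz = dot(r, z)
    k, work = 0, 0
    persisted = (k, list(x), list(r), list(p), rz)  # stand-in for NVRAM
    failure_pending = fail_at is not None

    while k < max_iter and dot(r, r) ** 0.5 > tol:
        if failure_pending and k == fail_at:
            # Simulated failure: discard the volatile state and roll back.
            failure_pending = False
            k, x, r, p, rz = (persisted[0], list(persisted[1]),
                              list(persisted[2]), list(persisted[3]),
                              persisted[4])
            continue
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        z = [m * ri for m, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
        k += 1
        work += 1
        if k % period == 0:
            # Persistence cycle: a single copy replaces the previous one.
            persisted = (k, list(x), list(r), list(p), rz)
    return x, work


# Usage: 1D Laplacian system, with and without a simulated failure.
n = 20
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x_ref, work_ref = pcg_with_persistence(A, b)
x_rec, work_rec = pcg_with_persistence(A, b, period=3, fail_at=5)
```

Because the rollback restores the state exactly and the iteration is deterministic, the recovered run reaches the same solution as the failure-free run; it only repeats the iterations between the last persistence cycle and the failure.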

We implement in-NVRAM ESR with our new library for MPI One-Sided Communication (OSC) over RDMA in an NVRAM setting, and study two possible NVRAM placement architectures:

  • Homogeneous NVRAM cluster, in which each compute node is equipped with its own NVRAM module, enabling the persistence of ESR state variables to local NVRAM by using either the Persistent Memory Development Kit (PMDK) libraries or a local MPI window.
  • NVRAM persistent recovery data (PRD) sub-cluster, in which recovery data is persisted in dedicated PRD sub-cluster nodes via remote MPI one-sided communication implemented using RDMA.
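For the homogeneous case, local persistence of a state vector can be sketched as follows. This is an illustration under stated assumptions, not the project's code: an ordinary memory-mapped file stands in for a DAX-mapped NVRAM region, `mmap.flush()` stands in for the persist call that PMDK would provide, and the class name `PersistentVector`, the two-slot layout, and the 8-byte commit word are hypothetical choices. The sketch writes the new copy into the inactive slot, flushes it, and only then switches the commit word, so a crash in the middle of a write leaves the previous copy readable.

```python
import mmap
import os
import struct
import tempfile

COMMIT = struct.Struct("<q")      # index of the valid slot, -1 if none


class PersistentVector:
    """Two-slot persistent store for (iteration, vector of n doubles)."""

    def __init__(self, path, n):
        self.n = n
        self.slot = struct.Struct(f"<q{n}d")        # iteration + n doubles
        size = COMMIT.size + 2 * self.slot.size
        fresh = not os.path.exists(path) or os.path.getsize(path) != size
        if fresh:
            with open(path, "wb") as f:
                f.write(b"\0" * size)
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), size)
        if fresh:
            COMMIT.pack_into(self._map, 0, -1)      # no valid copy yet
            self._map.flush()

    def _offset(self, slot_index):
        return COMMIT.size + slot_index * self.slot.size

    def persist(self, iteration, values):
        (active,) = COMMIT.unpack_from(self._map, 0)
        target = 1 if active == 0 else 0            # write the inactive slot
        self.slot.pack_into(self._map, self._offset(target), iteration, *values)
        self._map.flush()                           # data first ...
        COMMIT.pack_into(self._map, 0, target)      # ... then the commit word
        self._map.flush()

    def load(self):
        (active,) = COMMIT.unpack_from(self._map, 0)
        if active < 0:
            return None
        iteration, *values = self.slot.unpack_from(self._map, self._offset(active))
        return iteration, values

    def close(self):
        self._map.close()
        self._file.close()


# Usage: persist two cycles, then reopen as a restarted process would.
path = os.path.join(tempfile.mkdtemp(), "esr_state.bin")
store = PersistentVector(path, 3)
empty = store.load()
store.persist(3, [1.0, 2.0, 3.0])
store.persist(6, [4.0, 5.0, 6.0])
store.close()

reopened = PersistentVector(path, 3)
recovered = reopened.load()
reopened.close()
```

Only one committed copy is ever needed for recovery; the second slot exists solely so that the copy being overwritten is never the one the commit word points to.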

In the PRD sub-cluster architecture, we assume RAID between the sub-cluster nodes to provide tolerance to failures within the sub-cluster. Otherwise, each node of the sub-cluster would be a single point of failure. We stress that while the data transfer of in-RAM ESR grows quadratically with the cluster size, the additional writes incurred by RAID grow only linearly and depend on the RAID level.

Methodology / Approach

This project is a result of our research into how to substantially improve the performance of in-RAM Exact State Reconstruction (ESR), given the technological changes in the decade since it was first introduced, and how to eliminate its main problems: an extended memory footprint and a constant surge of network traffic. Our work rests on three pillars: (1) the recently enabled capability of direct access (DAX) to NVRAM, (2) access to such memory with MPI One-Sided Communication (OSC) over RDMA, and (3) the observation that these two capabilities make it possible to retain all of the qualities of the original in-RAM ESR while persisting just one copy of the recovery data every persistence cycle instead of many redundancies.

Technologies Used

The project builds on direct access (DAX) to NVRAM, specifically through the PMDK library, and on MPI One-Sided Communication (OSC) over RDMA for accessing that memory.

Repository

https://github.com/Scientific-Computing-Lab-NRCN/In-NVRAM-ESR.git
