Efficiency and productivity for decision-making on mobile SoCs
Denisa Constantinescu
Andalusia
Markov decision processes provide a formal framework for a computer to make decisions autonomously and intelligently when the effects of its actions are not deterministic. To solve this computationally complex problem, we experiment with scheduling on a low-power heterogeneous CPU+GPU SoC platform.
Project status: Published/In Market
oneAPI, Robotics, HPC, Artificial Intelligence
Intel Technologies
oneAPI,
DPC++,
Intel Integrated Graphics,
Intel VTune,
Intel CPU
Overview / Usage
Why do we care about solving large MDPs and POMDPs efficiently? Because they are convenient frameworks for modeling intelligent agents, with many practical applications:
- Autonomous robots for cleaning, deep-space navigation, inspection and repair, and toxic-waste cleanup;
- Structural inspection and maintenance of paved roads, bridges, aircraft parts, and buildings; search and rescue;
- Medical diagnostics, among many others.
Unfortunately, computing optimal solutions for them is intractable for medium- to large-sized problems, which cover the majority of the interesting ones. Therefore, MDP- and POMDP-based solutions are not even considered for mobile and low-power platforms. Our aim is to make it feasible and easy to implement intelligent agents and autonomous decision-making applications on mobile platforms.
In this paper, we have tested and analyzed the feasibility of solving large-scale Markov decision processes exactly with Value Iteration (VI) on low-power heterogeneous computing platforms. We seek to widen the applicability of decision-making methods that embed VI as an inner component to real-world problems, so we study the impact on performance and energy efficiency of implementing VI with different heterogeneous programming approaches and scheduling strategies. Of the three programming models evaluated, OpenCL (OCL), oneAPI with SYCL-style buffers (BUFF), and oneAPI with the USM feature (USM), we learned that the oneAPI implementations can be up to 5x easier to program than OCL and incur only 3–8% overhead if the scheduling strategy is selected carefully. Of the three scheduling strategies evaluated, static (HO), dynamic (HD), and adaptive (HL), static scheduling performs best in terms of both performance and energy efficiency, though it requires an exhaustive off-line search. Adaptive scheduling provides good results with no previous training, in particular when using the USM approach to code the kernels and scheduler, and for large problem sizes. As future work, we are striving to extend this research line by looking into more complex decision-making procedures, namely partially observable Markov decision processes (POMDPs).
Acknowledgements: This work is a result of the research project TIN2016-80920-R, funded by the Spanish Government. It has also been supported by Junta de Andalucía under research projects UMA18-FEDERJA-108, UMA18-FEDERJA-113, and TEP-2279.
Methodology / Approach
The Value Iteration algorithm is composed of three kernels with sequential dependencies; therefore, they must be executed one after the other. We have identified the Evaluate policy kernel as the most computationally intensive, followed by the Improve policy kernel.
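To make the three-kernel structure concrete, here is a minimal sequential sketch of VI on a made-up 2-state, 2-action MDP (the function name, transition probabilities, and rewards are purely illustrative, not taken from the project):

```cpp
#include <algorithm>
#include <cmath>

// Toy 2-state, 2-action MDP, mirroring the three-kernel structure:
// Evaluate policy, Improve policy, Check convergence & update.
inline void run_vi(int policy[2]) {
    const int S = 2, A = 2;
    const double gamma = 0.9, eps = 1e-6;
    const double P[2][2][2] = {{{0.8, 0.2}, {0.1, 0.9}},   // P[s][a][s']
                               {{0.5, 0.5}, {0.3, 0.7}}};
    const double R[2][2] = {{0.0, 0.0}, {0.0, 2.0}};       // R[s][a]
    double V[2] = {0.0, 0.0};
    double delta;
    do {
        double Q[2][2], Vnew[2];
        // Kernel 1: Evaluate policy -- Q(s,a) = R(s,a) + gamma * sum_s' P*V(s').
        for (int s = 0; s < S; ++s)
            for (int a = 0; a < A; ++a) {
                Q[s][a] = R[s][a];
                for (int sp = 0; sp < S; ++sp)
                    Q[s][a] += gamma * P[s][a][sp] * V[sp];
            }
        // Kernel 2: Improve policy -- keep the greedy action per state.
        for (int s = 0; s < S; ++s) {
            int best = 0;
            for (int a = 1; a < A; ++a)
                if (Q[s][a] > Q[s][best]) best = a;
            policy[s] = best;
            Vnew[s] = Q[s][best];
        }
        // Kernel 3: Check convergence & update.
        delta = 0.0;
        for (int s = 0; s < S; ++s) {
            delta = std::max(delta, std::fabs(Vnew[s] - V[s]));
            V[s] = Vnew[s];
        }
    } while (delta > eps);
}
```

The sequential dependencies are visible here: Improve needs the Q values from Evaluate, and the convergence check needs both value vectors, so the kernels cannot be fused or reordered.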
The Evaluate policy kernel uses up to 80% of the computing resources, so in this work we focus our attention on optimizing its execution. The main data structure traversed by this kernel, a 3D sparse matrix stored in CSR format, raises two challenges: (1) the memory access pattern is not coalesced, and (2) the computational load differs across iterations of the parallel iteration space. As a result, the CPU and GPU threads unavoidably process unbalanced workloads regardless of the workload distribution strategy.
In total, we evaluate twelve implementations of VI, incrementally optimized for runtime, energy efficiency, and ease of use. The first version is sequential (SEQ) and serves as a reference for correctness. The second basic implementation is programmed with **OpenCL** (OCL) and is used to run VI only on the GPU. The third variant is optimized for multicore CPU execution; we call it TBB because it uses the Threading Building Blocks framework. It runs the three kernels of VI (Evaluate policy, Improve policy, and Check convergence & update) using the parallel_for and parallel_reduce TBB function templates.
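The Check convergence & update kernel is a max-reduction over the residuals. As a dependency-free sketch of that reduction shape, here it is written with std::transform_reduce from C++17 (the project itself uses TBB's parallel_reduce; this stand-in only illustrates the same transform-then-reduce pattern):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Convergence check: max over states of |Vnew[s] - Vold[s]|.
// transform maps each pair of values to a residual; reduce takes the max.
inline double max_residual(const std::vector<double>& v_old,
                           const std::vector<double>& v_new) {
    return std::transform_reduce(
        v_old.begin(), v_old.end(), v_new.begin(), 0.0,
        [](double a, double b) { return std::max(a, b); },     // reduce op
        [](double o, double n) { return std::fabs(n - o); });  // transform op
}
```

VI stops iterating once this residual falls below the tolerance, so the reduction runs once per VI sweep.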
The nine remaining implementations are all heterogeneous, simultaneously exploiting the CPU and the GPU. They are the result of combining three different coding styles for the GPU (OCL, BUFF, and USM) and three different heterogeneous schedulers (HO, HD, and HL). Therefore, these implementations are: HO-OCL, HO-BUFF, HO-USM; HD-OCL, HD-BUFF, HD-USM; and HL-OCL, HL-BUFF, and HL-USM.
Regarding the three coding styles: we first have OpenCL, a low-level programming model in which HW aspects of the GPU are exposed to the user. This implies that the developer has to learn and manage data types such as the OpenCL platform, device, context, queue, and kernel, and deal with low-level functions to compile the GPU kernel, move data to and from the GPU, pass the kernel arguments, enqueue the GPU kernel, etc. These low-level details are hidden if we use oneAPI, which offers two coding models:
- BUFF: explicitly declare buffers (that encapsulate the data across the CPU and the GPU) and accessors (that give access to the data and define data dependencies);
- USM (Unified Shared Memory): a pointer-based alternative to BUFF, where the data is allocated using malloc_shared() and accessed as a regular array from both the CPU and the GPU. This particular data allocation is an Intel extension only available in Intel's DPC++ compiler (not part of the SYCL standard) and requires HW support for Shared Virtual Memory (a.k.a. SVM, or a unified virtual address space). The clear benefit is that explicit data movement between the CPU and the GPU is avoided, although cache coherency can be a source of overhead.
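As a rough sketch of how the two styles differ for a single kernel (here, scaling the value vector by the discount factor, a purely illustrative operation with hypothetical function names; requires a DPC++/SYCL toolchain and uses SYCL 2020 spellings, so it is not compiled here):

```cpp
#include <CL/sycl.hpp>
#include <vector>

void scale_buff(sycl::queue& q, std::vector<double>& V, double gamma) {
    // BUFF: the buffer encapsulates V; the accessor declares the dependency.
    sycl::buffer<double> buf(V.data(), sycl::range<1>(V.size()));
    q.submit([&](sycl::handler& h) {
        sycl::accessor a(buf, h, sycl::read_write);
        h.parallel_for(sycl::range<1>(V.size()),
                       [=](sycl::id<1> i) { a[i] *= gamma; });
    });
}   // the buffer destructor waits and copies the data back into V

void scale_usm(sycl::queue& q, double* V, size_t n, double gamma) {
    // USM: V came from sycl::malloc_shared(), so host and device dereference
    // the same pointer; no buffers, no accessors, no explicit data movement.
    q.parallel_for(sycl::range<1>(n),
                   [=](sycl::id<1> i) { V[i] *= gamma; }).wait();
}
```

The USM version reads like ordinary pointer code, which is a large part of the productivity gain we measured, while the BUFF version makes the data dependencies explicit to the runtime.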
Our three scheduler implementations (HO, HD, and HL) have been devised to improve overall heterogeneous performance. They extend the TBB implementation with increasingly complex scheduling strategies for CPU+GPU heterogeneous computing.
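As a simplified, hypothetical sketch of the dynamic (HD-style) strategy: workers (stand-ins for the CPU threads and the GPU host thread) repeatedly claim fixed-size chunks of the iteration space from a shared atomic counter. HO would instead fix the CPU/GPU split off-line, and HL would additionally adapt the chunk size to the measured throughput of each device:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Each worker grabs the next chunk of [0, n) until the space is exhausted.
// Faster devices naturally claim more chunks, balancing irregular loads.
inline int dynamic_schedule(int n, int chunk, int workers) {
    std::atomic<int> next{0};
    std::atomic<int> processed{0};
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            for (;;) {
                int begin = next.fetch_add(chunk);
                if (begin >= n) break;
                int end = std::min(begin + chunk, n);
                processed += end - begin;  // stand-in for running the kernel
            }
        });
    for (auto& t : pool) t.join();
    return processed.load();  // every iteration claimed exactly once
}
```

The trade-off is the usual one: dynamic chunking tolerates irregular per-row loads without off-line tuning, at the price of contention on the shared counter and per-chunk launch overhead on the GPU side.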
So what does this application do, more precisely? In reinforcement learning terminology, it performs the "exploration" part of on-line decision making, exploring the state space and computing an optimal solution for a navigation problem (modeled as an MDP). We do not perform the "exploitation" part, i.e., using the solution to actually navigate the robot. Broadly, the application receives as input the robot's navigation history (experience) from a navigation log file and a set of discretization parameters. This input is used to obtain a discrete MDP model for navigation. Additionally, we provide a number of options for how the MDP solver (Value Iteration) is executed, including different CPU-GPU heterogeneous scheduling strategies. For the parallel implementations, the number of CPU cores (and, for some of them, whether to use the GPU) can be configured. By default, all computing resources are used. The output of Value Iteration is a plan for navigation (each state of the MDP navigation problem has an action assigned).
The source code is publicly available on Bitbucket: https://bitbucket.org/corbera/vi-mdp/src/oneAPI. In the README you can find more details on the parallel implementations of VI and on how to run the code.
We are currently extending the work on a slightly different line of development, which involves solving more complex decision-making problems. We don't foresee any further updates to this project.
Technologies Used
HW:
- Intel(R) Core(TM) i5-8250U @ 1.60GHz, quad-core
- Intel UHD Graphics 620 @ 300MHz, 24 EUs
- 8GB DDR4
SW:
- Ubuntu 18.04 LTS OS
- gcc 6.2.0 C/C++ compiler (C++14)
- OpenCL 1.2
- oneAPI 2021.1-beta03
- Intel(R) NEO (Gen9) Graphics Driver
- Performance Counter Monitor (PCM) library (to monitor performance and energy metrics)
- Intel VTune
Repository
https://bitbucket.org/corbera/vi-mdp/src/oneAPI/
Other links
- Published Article, Springer: Efficiency and productivity for decision making on low-power heterogeneous CPU+GPU SoCs
- Published Article, Elsevier: Performance evaluation of decision making under uncertainty for low power heterogeneous platforms