Evaluating the Suitability of Intel oneAPI for Fine-Grained Parallelism

Owen McGrath

Chicago, Illinois

We explored the Intel oneAPI programming framework, focusing on DPC++ and oneTBB, and evaluated how suitable it is for very fine-grained parallelism. We benchmarked oneTBB and compared the results with other parallel programming solutions such as OpenMP and XQueue.

Project status: Under Development

oneAPI, HPC

Groups
Student Developers for oneAPI

Intel Technologies
oneAPI, Intel VTune

Overview / Usage

The goal of the project was to become familiar with Intel oneAPI, specifically DPC++ and oneTBB, and then evaluate their suitability for very fine-grained parallelism. An example of a very fine-grained workload is recursive Fibonacci, which creates a task for each step of the computation, and each task may consist of only a few instructions. Some parallel programming libraries perform poorly with tasks of this size, and we wanted to determine whether oneTBB falls into this category.
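To make the workload concrete, here is a minimal sketch of task-parallel recursive Fibonacci. It is written with OpenMP tasks, one of the models we compared against, and is not the code from our benchmarks; the function names `fib_task` and `fib` are illustrative. An analogous oneTBB version would spawn the two recursive calls with `task_group::run` and join them with `task_group::wait`.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch only: every call spawns two child tasks, so each task
// does just a comparison and an addition on top of the scheduling overhead.
std::uint64_t fib_task(int n) {
    assert(n >= 0);  // precondition: Fibonacci index must be non-negative
    if (n < 2) return static_cast<std::uint64_t>(n);

    std::uint64_t a = 0, b = 0;
    #pragma omp task shared(a) firstprivate(n)
    a = fib_task(n - 1);
    #pragma omp task shared(b) firstprivate(n)
    b = fib_task(n - 2);
    #pragma omp taskwait  // wait for both children before combining
    return a + b;
}

// Entry point: one thread creates the root task, the rest of the team
// picks up the child tasks it generates.
std::uint64_t fib(int n) {
    std::uint64_t result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}
```

Compiled with OpenMP enabled (for example `-fopenmp` on GCC or Clang), the task count grows exponentially with `n` while the work per task stays constant, which is what makes scheduler overhead dominate. Without OpenMP the pragmas are ignored and the same code runs serially.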

We concluded that DPC++ and oneTBB are both well suited to very fine-grained workloads, and specifically that oneTBB outperformed similar parallel programming libraries in many of our tests. We have written a technical report detailing our test results and findings.

Methodology / Approach

For our testing, we wrote a series of benchmark programs and used them for three purposes: to measure the maximum theoretical throughput in tasks per second, using no-op tasks; to assess the scheduler's load balancing by tracking how many tasks each logical thread processed during execution; and to compare the performance of oneTBB with that of other parallel programming libraries. Specifically, we compared against the GNU and LLVM implementations of OpenMP, as well as XQueue, an OpenMP implementation that uses a completely lock-free scheduler. We also converted several OpenMP benchmarking suites to use oneTBB.
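As a rough illustration of how such measurements can be reduced to numbers, the sketch below takes the per-thread task counts and the elapsed wall-clock time of one run and computes a throughput and a simple imbalance ratio. The `RunSummary` type, the `summarize` function, and the choice of "busiest thread divided by mean" as the imbalance metric are assumptions made for this example, not the definitions used in our report.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical summary of one benchmark run; field names are illustrative.
struct RunSummary {
    double tasks_per_second;  // total tasks / elapsed wall-clock seconds
    double imbalance;         // busiest thread's count / mean count (1.0 = even)
};

// tasks_per_thread[i] is the number of tasks executed by logical thread i.
RunSummary summarize(const std::vector<std::uint64_t>& tasks_per_thread,
                     double elapsed_seconds) {
    assert(!tasks_per_thread.empty() && elapsed_seconds > 0.0);

    const std::uint64_t total = std::accumulate(
        tasks_per_thread.begin(), tasks_per_thread.end(), std::uint64_t{0});
    const std::uint64_t busiest =
        *std::max_element(tasks_per_thread.begin(), tasks_per_thread.end());

    const double mean =
        static_cast<double>(total) / static_cast<double>(tasks_per_thread.size());
    // With no tasks at all there is nothing to balance; report 1.0.
    const double imbalance =
        (total == 0) ? 1.0 : static_cast<double>(busiest) / mean;

    return {static_cast<double>(total) / elapsed_seconds, imbalance};
}
```

With no-op tasks, `tasks_per_second` reflects scheduling overhead alone, and an `imbalance` close to 1.0 indicates that the scheduler spread the tasks evenly across threads.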

We also studied oneTBB's source code to learn how it achieves its level of performance. This proved difficult, but after the report was written we met with Intel employees, who gave us more details about oneTBB's implementation. Those details explained its performance and why the changes we had tried did not have the expected effect.

Technologies Used

As mentioned previously, we used DPC++ and oneTBB. We also used Intel VTune during our investigation of oneTBB's source code. We ran all of our benchmarks on a 192-core, 384-thread, Xeon-powered supercomputing node at our university, which allowed us to test at very high levels of parallelism.
