An implementation of a parallel, high-performance random forest based on oneAPI

Yihua Chen

Parallel random forest implementation based on oneAPI technology, aiming to improve the performance of the random forest algorithm.

Project status: Under Development

Artificial Intelligence, HPC, Cloud

Intel Technologies
oneAPI, DPC++, DevCloud


Overview / Usage

The decision tree is one of the most widely used machine learning models because it needs little feature preprocessing, integrates easily into ensembles, fits data well, and is interpretable. Unlike a linear model (mathematically, f(W·X + b)), which learns a weight for each feature from the training samples and decides from the weighted sum, a decision tree selects suitable features and partitions the feature space before making a decision, so its decision boundary is nonlinear, which improves the model's nonlinear capacity. In practice, shallow trees tend to underfit, while deepening the tree improves training accuracy but easily overfits. From this perspective a single tree is a relatively weak model, yet the diversity of individual trees is easy to control through hyperparameters such as feature selection and depth. These characteristics make decision trees well suited to being combined into ensembles that further improve model quality.

Ensemble learning combines several base learners so that, together, they achieve better generalization and robustness. There are two mainstream approaches:

Parallel approach (bagging): train a number of base learners independently and then average (possibly with weights) their predictions. Representative model: Random Forest.

Serial approach (boosting): train base learners one after another, each mainly correcting the errors of the previous ones. Representative models: AdaBoost, GBDT, and XGBoost. (As an extension, GBDT built on CART regression trees replaces the residual with the gradient of the loss function, a process that is also similar to a local gradient descent algorithm.)

Since the goal is better performance through parallelism, the random forest model is the natural choice, and we implement it based on oneAPI. To compare the project against an industry-standard solution, we first referred to the source code of the open-source machine learning library scikit-learn. We use the same tree model, CART, and implement the random forest ensemble on top of it in C++. A single-machine, single-threaded version was developed first and achieved almost the same performance in our experiments. On this basis we implemented a multi-compute-unit version based on oneAPI, which delivers a large performance improvement. For the dataset we adopted the common MNIST dataset and reserved an interface for replacing it, which gives the project strong reuse value.
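
To make the bagging idea above concrete, the following minimal C++ sketch shows how a forest can combine per-tree class predictions by majority vote at inference time. The function name and signature are illustrative and are not taken from the repository.

    #include <algorithm>
    #include <vector>

    // Hypothetical sketch of bagging at prediction time: every tree in the
    // forest votes a class label and the forest returns the majority class.
    int MajorityVote(const std::vector<int>& tree_predictions, int num_classes) {
      std::vector<int> votes(num_classes, 0);
      for (int label : tree_predictions) ++votes[label];
      return static_cast<int>(
          std::max_element(votes.begin(), votes.end()) - votes.begin());
    }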

Methodology / Approach

The main idea is object-oriented programming: data loading, the dataset, decision tree nodes, decision trees, and the random forest are each abstracted and encapsulated as components at their own level. The random forest maintains a collection of trees and holds the dataset; during training and inference it is responsible for distributing tasks to the underlying decision trees, so it is accelerated through thread-level parallelism. Each decision tree samples the dataset and repeatedly splits its nodes by computing criteria such as the Gini index, which is the part that oneAPI is used to parallelize. Passing datasets between components involves memory management, so the STL is used carefully: raw pointer-style arrays are avoided, and data is passed by pointer or reference rather than copied.
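
As a rough sketch of the thread-level parallelism described above, the code below trains one tree per std::thread on its own bootstrap sample. The DecisionTree, Dataset, and helper names are hypothetical stand-ins and do not reflect the repository's actual classes.

    #include <cstddef>
    #include <random>
    #include <thread>
    #include <vector>

    // Minimal sketch, assuming placeholder DecisionTree and Dataset types.
    struct Dataset { /* feature matrix and labels */ };

    struct DecisionTree {
      void Fit(const Dataset& bootstrap_sample) { /* grow the tree with CART splits */ }
    };

    Dataset BootstrapSample(const Dataset& data, std::mt19937& rng) {
      // Sample rows with replacement so every tree trains on a different subset.
      return data;  // placeholder body
    }

    void FitForest(std::vector<DecisionTree>& trees, const Dataset& data) {
      std::vector<std::thread> workers;
      workers.reserve(trees.size());
      for (std::size_t i = 0; i < trees.size(); ++i) {
        workers.emplace_back([&trees, &data, i] {
          std::mt19937 rng(static_cast<unsigned>(i));   // independent per-thread RNG
          Dataset sample = BootstrapSample(data, rng);  // each thread gets its own copy
          trees[i].Fit(sample);                         // trees train independently
        });
      }
      for (auto& worker : workers) worker.join();
    }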

Technologies Used

  1. Multi-thread parallelism: A random forest consists of multiple trees, but each tree's training process is independent, so each tree is trained on its own thread. To avoid data races, synchronization overhead, and other issues that would slow execution, each thread keeps its own copy of the data it needs.

  2. Clear structure: The project follows object-oriented design, abstracting and encapsulating components such as tree nodes, decision trees, the random forest, and datasets. Naming conventions and variable usage strictly follow the Google C++ Style Guide, and memory is managed carefully, which makes the project convenient for secondary development.

  3. oneAPI parallelism: For compute-intensive steps such as the information gain calculation on decision tree nodes, the project uses the DPC++ interface of oneAPI. The computation is parallelized, and the device selector in the API picks an appropriate compute unit, making full use of heterogeneous computing power and effectively improving efficiency (see the sketch after this list).

  4. Excellent performance: The project uses multithreading to train and run inference on the decision trees in the random forest, which substantially improves the efficiency of the random forest algorithm. With the help of DPC++ in oneAPI, the performance bottleneck caused by the many split criteria that must be evaluated during decision tree training is further relieved.
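
As a hedged illustration of the DPC++ usage described in item 3, the sketch below offloads one piece of a split-quality computation: a parallel reduction over per-class sample counts that yields the Gini impurity of a candidate split. The data layout and values are illustrative only and are not taken from the repository.

    #include <sycl/sycl.hpp>
    #include <cstdio>
    #include <vector>

    int main() {
      std::vector<int> class_counts = {30, 12, 8};  // samples per class in a node
      const float total = 50.0f;

      sycl::queue q;  // default selector: the runtime picks an available device

      float sum_sq = 0.0f;
      {
        sycl::buffer<int> counts_buf(class_counts.data(),
                                     sycl::range<1>(class_counts.size()));
        sycl::buffer<float> sum_buf(&sum_sq, sycl::range<1>(1));

        q.submit([&](sycl::handler& h) {
          sycl::accessor counts(counts_buf, h, sycl::read_only);
          auto sum = sycl::reduction(sum_buf, h, sycl::plus<float>());
          h.parallel_for(sycl::range<1>(class_counts.size()), sum,
                         [=](sycl::id<1> i, auto& acc) {
            float p = counts[i] / total;
            acc += p * p;  // accumulate the squared class probability
          });
        });
      }  // buffer destructors wait for the kernel and copy results back to the host

      std::printf("Gini impurity: %f\n", 1.0f - sum_sq);  // Gini = 1 - sum(p_i^2)
      return 0;
    }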

Repository

https://github.com/BigWhitePolarBear/random-forest

