Gene Sequence De-redundancy
zhen ju
A novel greedy incremental alignment-based algorithm, nGIA, is proposed for sequence clustering with high efficiency and precision. nGIA consists of a pre-filter, a modified short-word filter, a new data packing strategy, and a modified greedy incremental method, and it is parallelized on the GPU.
Project status: Published/In Market
Overview / Usage
Non-redundant sequence datasets are of utmost importance in bioinformatics. Redundant sequences add no new information but increase the cost of every downstream analysis. Various de-redundancy tools have therefore been developed, such as CD-HIT, USEARCH, and VSEARCH. However, these tools are all CPU-based. To keep the running time acceptable, they rely on approximate algorithms, and as a result they cannot produce accurate results.
We implemented a new tool. By taking advantage of the GPU, our tool produces accurate results and runs fastest on hardware at the same price. Our tool supports CUDA and oneAPI.
Methodology / Approach
The core algorithm of sequence de-redundancy can be summarized in two steps. First, an algorithm with low time complexity filters out obviously dissimilar sequence pairs. Second, a dynamic programming algorithm calculates the similarity of each remaining pair.
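The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the nGIA implementation: the function names are hypothetical, the cheap filter is a plain k-mer count bound (the q-gram lemma), similarity is measured by edit distance, and the threshold is interpreted as a maximum number of edits relative to the longer sequence.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmers(a, b, k):
    """Number of k-mers common to a and b, counted with multiplicity."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(min(n, cb[w]) for w, n in ca.items())

def edit_distance(a, b):
    """Step 2: classic dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]

def max_edits(length, threshold):
    """Assumption: identity >= threshold means at most (1 - threshold) * length edits."""
    return int((1.0 - threshold) * length + 1e-9)

def greedy_incremental(seqs, threshold=0.9, k=3):
    """Return the representatives kept after removing redundant sequences."""
    representatives = []
    # Longest first, so every representative is at least as long as the query.
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in representatives:
            e = max_edits(len(rep), threshold)
            # Step 1: q-gram lemma. Within e edits, the pair must share
            # at least len(rep) - k + 1 - k * e k-mers.
            if shared_kmers(seq, rep, k) < len(rep) - k + 1 - k * e:
                continue  # obviously dissimilar, skip the expensive step
            if edit_distance(seq, rep) <= e:
                break     # redundant: seq is covered by rep
        else:
            representatives.append(seq)  # no representative covers seq
    return representatives

if __name__ == "__main__":
    data = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTGGGGG", "ACGTACGTAC"]
    print(greedy_incremental(data))
```

Because the filter is a lower bound rather than a heuristic, it never rejects a pair that the dynamic programming step would accept, so skipping filtered pairs does not change the result.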
We improved the filter algorithm based on the pigeonhole principle and improved the performance of the dynamic programming algorithm by compressing data. All of the above algorithms are accelerated on heterogeneous hardware through CUDA and oneAPI.
Our application was originally developed with CUDA and then migrated to oneAPI with the dpct tool (the Intel DPC++ Compatibility Tool). The automatic migration left some errors in the code, which we debugged manually.
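To make the two ideas concrete, here is a small Python sketch of one way a pigeonhole-based filter and a compressed sequence layout can work. Both parts are assumptions for illustration: the project description does not specify the exact filter or packing scheme, and the function names are hypothetical. The filter relies on the fact that if a query is cut into e + 1 pieces and at most e edits are allowed, at least one piece must appear unchanged in the target. The packing stores each nucleotide in 2 bits.

```python
def split_into_pieces(seq, n):
    """Split seq into n contiguous, non-overlapping pieces of near-equal length."""
    base, extra = divmod(len(seq), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        pieces.append(seq[start:end])
        start = end
    return pieces

def pigeonhole_pass(query, target, max_edits):
    """Return False only when the edit distance must exceed max_edits.

    With at most max_edits edits spread over max_edits + 1 pieces,
    at least one piece is untouched and occurs verbatim in target.
    """
    pieces = split_into_pieces(query, max_edits + 1)
    return any(piece in target for piece in pieces)

# 2-bit encoding of the four nucleotides (assumed alphabet: A, C, G, T).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack a nucleotide string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def unpack(value, length):
    """Inverse of pack; length is needed because leading 'A's encode as zeros."""
    out = []
    for shift in range(2 * (length - 1), -2, -2):
        out.append(BASES[(value >> shift) & 3])
    return "".join(out)

if __name__ == "__main__":
    print(pigeonhole_pass("ACGTACGTAC", "ACGTACGTAA", 1))
    print(pigeonhole_pass("ACGTACGTAC", "TTTTTGGGGG", 1))
    print(pack("ACGT"), unpack(pack("ACGT"), 4))
```

With 2 bits per base, a 32-bit word holds 16 nucleotides instead of the 4 it holds as 8-bit characters, which reduces the memory traffic of the dynamic programming step.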
Technologies Used
We completed software development on the Intel DevCloud platform. The oneAPI Base Toolkit and the oneAPI HPC Toolkit are used, and the software runs on Xeon CPU and GPU. We used VTune to improve performance.