oneOLIGO
Eugenio Marinelli
In project OneOligo, we use oneAPI to implement scalable, heterogeneous parallel-processing algorithms that quickly and accurately decode digital data stored in synthetic DNA generated by project OligoArchive.
Project status: Published/In Market
Overview / Usage
The demand for data-driven decision making, coupled with the need to retain data to meet regulatory compliance requirements, has resulted in a rapid increase in the amount of archival data stored by enterprises. As the rate of data generation far outpaces the rate of improvement in the storage density of media such as HDD and tape, researchers have started investigating new architectures and media types that can store such “cold”, infrequently accessed data at very low cost.
Synthetic DNA is one such storage medium that has received attention recently because of its high density and durability. DNA possesses three key properties that make it relevant for archival storage. First, it is an extremely dense three-dimensional storage medium with the theoretical ability to store 455 Exabytes in 1 gram; in contrast, a 3.5” hard disk drive today can store 10 Terabytes and weighs 600 grams. Second, DNA can last several centuries even in harsh storage environments; hard disk drives and tape have lifetimes of five and thirty years, respectively. Third, in-vitro replication of DNA is easy, quick, and cheap; tape and hard disk drives have bandwidth limitations that make copying large Exabyte-sized archives take hours or days.
OligoArchive (www.oligoarchive.eu) is a prestigious EU-funded Future and Emerging Technologies (FET) initiative that brings together a consortium of six partners across three countries (UK, France, Ireland) to research various aspects of using DNA as a digital data storage medium. At EURECOM, we are focusing on efficient encoding and decoding algorithms for storing and retrieving structured databases in DNA.
Methodology / Approach
The general workflow of using DNA as a digital storage medium starts with an encoding method, built on coding theory, that converts databases from their digital binary form into a quaternary code (A, C, G, T). Each encoded sequence represents a short DNA strand, also called an oligo. The oligos are then chemically synthesized to manufacture actual DNA. When data needs to be retrieved from DNA, we use modern Next-Generation Sequencing (NGS) technology to read the nucleotide sequence. NGS sequencers can read DNA with very high coverage, meaning that each DNA strand generated during synthesis is covered by multiple reads. However, the sequencing reads do not always correspond one-to-one with the original oligos because of errors introduced during sequencing. Thus, the first stage of decoding data stored in DNA is to cluster these reads and identify the original source oligos.
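To make the binary-to-quaternary step concrete, here is a minimal sketch that maps every two bits to one nucleotide. It is only an illustration: the mapping 00=A, 01=C, 10=G, 11=T and the function names `encode_bytes` and `decode_oligo` are assumptions made for this example, and a real encoder built on coding theory would also add redundancy for error correction, which this sketch does not.

```python
# Naive 2-bits-per-nucleotide mapping (an illustrative assumption,
# not the OligoArchive encoder): 00 -> A, 01 -> C, 10 -> G, 11 -> T.
BASES = "ACGT"


def encode_bytes(data: bytes) -> str:
    """Map each byte to four nucleotides, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode_oligo(oligo: str) -> bytes:
    """Invert encode_bytes; the oligo length must be a multiple of four."""
    if len(oligo) % 4 != 0:
        raise ValueError("oligo length must be a multiple of 4")
    out = bytearray()
    for start in range(0, len(oligo), 4):
        byte = 0
        for base in oligo[start:start + 4]:
            # BASES.index raises ValueError on any symbol outside A, C, G, T.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)
```

For example, `encode_bytes(b"Hi")` yields an eight-nucleotide string, and `decode_oligo` applied to that string returns `b"Hi"` again. A single sequencing error in such a strand would silently corrupt a byte, which is why practical schemes rely on coding theory and on clustering multiple reads.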
In project OneOligo, we are developing hardware-accelerated read clustering methods for rapidly reading data back from synthetic DNA. Hardware acceleration is needed because these clustering algorithms are very computationally intensive: they use edit distance as the metric for grouping reads into clusters. A typical sequencing run at 30x coverage can easily produce billions of strings that need to be clustered. Today, state-of-the-art techniques focus on scaling out, using a distributed cluster of machines to perform this task. OneOligo focuses instead on scaling up, using a server equipped with heterogeneous processing units.
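The cost of edit-distance clustering can be seen in a small sketch. The code below computes the Levenshtein edit distance with the standard dynamic program and then groups reads greedily around representatives. The greedy scheme, the `threshold` parameter, and the function names are assumptions chosen for illustration; this is not the clustering algorithm used in OneOligo.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def greedy_cluster(reads: list[str], threshold: int) -> list[list[str]]:
    """Put each read in the first cluster whose representative is close enough."""
    clusters: list[tuple[str, list[str]]] = []  # (representative, members)
    for read in reads:
        for representative, members in clusters:
            if edit_distance(read, representative) <= threshold:
                members.append(read)
                break
        else:
            # No representative within the threshold: start a new cluster.
            clusters.append((read, [read]))
    return [members for _, members in clusters]
```

Each comparison is quadratic in the read length, and in the worst case every read is compared against every cluster representative. With billions of reads, that cost is what motivates both hardware acceleration and cheaper approximations of edit distance.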
Technologies Used
We are developing a new clustering algorithm that uses low-distortion embeddings for the task of string clustering. We already use Intel TBB to parallelize our code across multiple CPUs. We also use VTune heavily for performance profiling, both at the application level to detect hotspots and at the microarchitecture level to improve CPU utilization. Recently, we have started rewriting our code using DPC++ and oneAPI to extend our work across CPUs, GPUs, and possibly FPGAs. We are also investigating novel ways of using persistent memory to accelerate string similarity matching.
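The text above does not say which embedding is used. As one well-known example of the general idea, the sketch below implements a CGK-style randomized embedding (after Chakraborty, Goldenberg, and Koucký), which maps strings so that a small edit distance tends to become a small Hamming distance. Hamming distance is much cheaper to compute and parallelizes well. The padding symbol `N`, the helper names, and the choice of this particular embedding are all assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"
PAD = "N"  # padding symbol, assumed never to occur in a read


def make_random_bits(out_len: int, seed: int) -> list[dict[str, int]]:
    """One random bit per output position and per alphabet symbol."""
    rng = random.Random(seed)
    return [{c: rng.randint(0, 1) for c in ALPHABET} for _ in range(out_len)]


def cgk_embed(read: str, bits: list[dict[str, int]]) -> str:
    """Copy the current symbol at every step; advance only when its bit is 1."""
    out = []
    i = 0
    for step in bits:
        if i < len(read):
            out.append(read[i])
            i += step[read[i]]  # stay on this symbol (0) or move on (1)
        else:
            out.append(PAD)     # the read is consumed: pad to fixed length
    return "".join(out)


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))
```

In this sketch, all reads must be embedded with the same `bits`, typically of length three times the longest read, so that the outputs have equal length and can be compared position by position with `hamming`. Because the embedding is randomized, the distance it reports is only an approximation, so candidate pairs found this way would normally be verified with exact edit distance.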
Repository
https://github.com/Eug9/oneoligo.git