CLustClosed
Edian Franco
The finalization step of genome assembly, which includes ordering and orienting contigs and closing gaps, is one of the most important steps. Gaps can arise from low coverage in certain regions of the genome, which causes errors during assembly and makes annotation difficult. CLustClosed uses machine learning (ML) methods, which could improve accuracy in closing gaps. Owing to their versatility, ML techniques can help solve several biological problems, such as improving the assembly process by detecting assembly errors and by finishing and closing gaps in the genome.
Project status: Under Development
Overview / Usage
Since the advent of next-generation sequencing (NGS) technologies, released around 2005, the cost and time of sequencing have fallen considerably, leading to an increase in projects such as Whole Genome Sequencing (WGS) and Whole Transcriptome Sequencing (WTS). NGS platforms produce a large amount of data compared with previous technologies. However, several features can still make data assembly difficult: moderate sequencing error rates on some platforms, short reads that hamper the assembly process, low complexity, low coverage, and difficulty in resolving repetitive regions. These may compromise the representation of certain regions of the genome and generate gaps, which makes genomes difficult to finalize and is reflected in the high number of draft genomes deposited in public databases.
Methodology / Approach
The CLustGClosed pipeline is based on two steps. The first step defines clusters of reads based on their GC content. Next, the best k-mer is calculated for each cluster file. Finally, each group of reads is assembled, and the resulting contigs are used to close the gaps in the traditional genome assembly produced with all reads.
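The first step, grouping reads by GC content, can be illustrated with a minimal Python sketch. The project description does not say which clustering method the pipeline uses, so the equal-width binning below is only a simple stand-in, and the names gc_content, cluster_reads_by_gc and n_bins are hypothetical, chosen for this illustration.

```python
from typing import Dict, List


def gc_content(read: str) -> float:
    """Return the fraction of G and C bases among the A/C/G/T bases of a read."""
    seq = read.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    # Reads with no A/C/G/T bases (empty or all N) are given a GC content of 0.0
    return gc / total if total else 0.0


def cluster_reads_by_gc(reads: List[str], n_bins: int = 4) -> Dict[int, List[str]]:
    """Group reads into equal-width GC-content bins.

    Assumption: equal-width binning is a stand-in for the pipeline's
    actual clustering method, which the description does not specify.
    """
    clusters: Dict[int, List[str]] = {i: [] for i in range(n_bins)}
    for read in reads:
        # min() keeps a read with GC content exactly 1.0 inside the last bin
        idx = min(int(gc_content(read) * n_bins), n_bins - 1)
        clusters[idx].append(read)
    return clusters


# Example: three short reads with low, medium and high GC content
reads = ["ATATATAT", "ATGCATGC", "GCGCGCGC"]
clusters = cluster_reads_by_gc(reads, n_bins=4)
```

In a pipeline of this kind, each resulting group would then be written to its own file and passed on to the k-mer selection and assembly stages described above.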
Technologies Used
Python
thread library
multiprocessing library