Using advanced indexing strategies for genome reconstruction from metagenomic data

Sivakumar, Srinivas

Please use this identifier to cite or link to this item: http://dspace.cityu.edu.hk/handle/2031/9351

Title:	Using advanced indexing strategies for genome reconstruction from metagenomic data
Authors:	Sivakumar, Srinivas
Department:	Department of Electrical Engineering
Issue Date:	2020
Supervisor:	Supervisor: Dr. Sun, Yanni; Assessor: Prof. Chow, Tommy W S
Abstract:	DNA sequencing machines cannot sequence entire genomes, but they can sequence short fragments of DNA strands, known as 'reads.' Next Generation Sequencing (NGS) provides large numbers of reads from a sample. The broad aim of this project is to detect the presence of a genome within metagenomic data, for example, detecting HIV in a human sample. The two approaches to finding genomes from a set of reads are read alignment and de novo assembly. Alignment is the process of matching each read against a reference genome. This is inherently a string-matching problem, where the query is the read and the text is the genome. De novo assembly, or overlap assembly, on the other hand, is the process of assembling a genome without a reference. In this process, reads are joined together wherever the suffix of one read matches the prefix of another. The specific objective of this project is to identify and 'remove' all bacterial genomes from viral metagenomic data. Once the bacterial reads have been removed from the sequencing data, both new and known viruses can then be recognized. The approach used in this project combines the two strategies, because most bacteria do not have complete reference genomes. First, the reads are mapped to 16S rRNA sequences, which are present in all types of bacteria, using Bowtie2 (Langmead B, 2012) or BWA (Li & Durbin, 2010), both efficient read-alignment tools. The output of the alignment is the subset of reads that mapped to the reference, referred to as the 'seed.' The seed is then extended iteratively using overlap extension with the remaining reads. Serial and multi-threaded implementations were developed using base code from SGA (Jared T. Simpson, June 2010), as it is an open-source project. The backbone of both implementations is the Burrows-Wheeler Transform (Burrows & Wheeler, 1994) and the FM index (Ferragina & Manzini, 2000). The effectiveness and correctness of the program were checked on simulated sequencing data. A scalable and memory-efficient implementation was developed using SGA’s codebase.
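
The iterative seed extension described in the abstract can be illustrated with a minimal Python sketch. It is not the project's implementation: the project builds on SGA and finds overlaps through the BWT and FM index, whereas this sketch swaps in a brute-force suffix-prefix comparison for readability. The function names (`overlap_len`, `extend_seed`), the `min_overlap` threshold, and the toy reads are all hypothetical, and reverse complements and sequencing errors are ignored.

```python
def overlap_len(a: str, b: str, min_overlap: int) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap reaches `min_overlap` characters."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def extend_seed(seed, remaining, min_overlap=4):
    """Grow the seed read set by repeatedly recruiting unused reads that
    overlap (in either direction) with a read recruited in the last round.

    Assumption: brute-force comparison stands in for an FM-index overlap
    query; the cost is quadratic and only suitable for toy inputs.
    Returns (recruited_reads, leftover_reads).
    """
    recruited = list(seed)
    frontier = list(seed)   # reads recruited in the previous round
    pool = list(remaining)  # reads not yet recruited
    while frontier:
        next_frontier, still_unused = [], []
        for read in pool:
            hit = any(
                overlap_len(f, read, min_overlap) or overlap_len(read, f, min_overlap)
                for f in frontier
            )
            (next_frontier if hit else still_unused).append(read)
        recruited.extend(next_frontier)
        pool = still_unused
        frontier = next_frontier  # an empty frontier ends the loop
    return recruited, pool


# Toy data: one seed read (as if it had aligned to a 16S rRNA reference)
# plus unaligned reads, one of which shares no overlap with the others.
seed_reads = ["ACGTTGCA"]
other_reads = ["TGCAAGGC", "AGGCTTAC", "GGGGCCCC", "TTTTACGT"]
recruited, leftover = extend_seed(seed_reads, other_reads, min_overlap=4)
```

In this toy run the recruited set plays the role of the bacterial reads to be removed, and the leftover reads are what would remain for downstream viral analysis. An index-based overlap query would replace the inner `any(...)` scan so that each round does not have to compare every unused read against every frontier read.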
Appears in Collections:	Electrical Engineering - Undergraduate Final Year Projects

Files in This Item:

File	Size	Format
fulltext.html	147 B	HTML
