A Hash-Based Fast Mapping Algorithm Optimized for Low Coverage Bisulfite Treated DNA Sequence

Yu, Yonghan

Please use this identifier to cite or link to this item: http://dspace.cityu.edu.hk/handle/2031/8953

Title:	A Hash-Based Fast Mapping Algorithm Optimized for Low Coverage Bisulfite Treated DNA Sequence
Authors:	Yu, Yonghan
Department:	Department of Electronic Engineering
Issue Date:	2018
Supervisor:	Supervisor: Dr. Chan, Rosa H M; Assessor: Prof. Chiang, K S
Abstract:	Background - The advent of Next-Generation Sequencing technology has enabled a wide range of genomic studies, which involve mapping vast numbers of sequenced reads onto an already available reference genome. As the cost of high-throughput whole-genome sequencing remains a bottleneck for many research projects, some studies that do not require high-resolution genomic features focus on low-coverage sequenced reads. For instance, clinical non-invasive prenatal detection of trisomy 21 utilizes low-coverage whole-genome sequencing data to detect copy number variants. However, currently widely adopted mapping software, whether designed for bisulfite sequences or not, cannot process large amounts of low-coverage sequence data efficiently, especially in the presence of mismatches and indels. Objective - This project is dedicated to developing a novel hash-based mapping algorithm, optimized for low-coverage whole-genome bisulfite sequencing reads, that can handle vast amounts of data efficiently. To reduce the cost of mapping low-coverage reads, the algorithm should use fewer computational resources while also reducing computation time. As studies involving low-coverage sequencing data focus less on high-resolution genomic features, the algorithm may tolerate a small loss of accuracy. Methodology - Our algorithm adopts sliding-window k-mer encoding to encode reads into bit-vector signatures, such that a read with several mismatches or indels has a signature similar to that of the sequence it is supposed to align with. Then, locality-sensitive hashing based on Hamming distance is used to hash similar signatures to the same hash code. To solve the asymmetric mapping problem with bisulfite reads, every 'C' in both the reference and the reads is treated as 'T' during the hashing process. All hashed files are kept on disk and sorted by hash code. The reads file is also sorted by hash code, so that the large hash file needs to be accessed only once. Result - The project develops a highly efficient mapping algorithm for low-coverage bisulfite reads with feasible CPU and memory usage. The sparseness of our hashing scheme ensures mapping performance while keeping computational resource utilization low. The algorithm can also be applied to non-bisulfite reads, as the sparseness of the hash table can be maintained after reducing the k-mer space.
Appears in Collections:	Electrical Engineering - Undergraduate Final Year Projects
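
The methodology summarized in the abstract can be illustrated with a minimal Python sketch. It is not the project's implementation: the abstract does not give the k-mer length, the signature layout, or the hash family, so the sketch assumes a presence bitmap with one bit per k-mer, bit sampling as the locality-sensitive hash for Hamming distance, and an in-memory merge in place of the on-disk sorted files. All function names (collapse, signature, make_hash, merge_join) and the parameters k, n_bits and seed are hypothetical.

```python
import random

# Assumption: once every 'C' is read as 'T', only three letters remain.
ALPHABET = "AGT"


def collapse(seq):
    """Treat every 'C' as 'T', as the abstract describes for reference and reads."""
    return seq.upper().replace("C", "T")


def kmer_index(kmer):
    """Map a k-mer over ALPHABET to an integer in [0, 3**k)."""
    idx = 0
    for ch in kmer:
        idx = idx * len(ALPHABET) + ALPHABET.index(ch)
    return idx


def signature(seq, k=3):
    """Slide a window of length k and set one bit per k-mer that occurs.

    Assumed encoding: a presence bitmap stored in a Python int.
    Windows containing other symbols (for example 'N') are skipped.
    """
    seq = collapse(seq)
    bits = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(ch in ALPHABET for ch in kmer):
            bits |= 1 << kmer_index(kmer)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def make_hash(k=3, n_bits=8, seed=0):
    """Bit-sampling LSH: the code is the signature read at n_bits fixed random positions."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(ALPHABET) ** k), n_bits)

    def lsh(sig):
        code = 0
        for p in positions:
            code = (code << 1) | ((sig >> p) & 1)
        return code

    return lsh


def merge_join(ref_entries, read_entries):
    """Pair reads with reference entries that share a hash code.

    Both inputs are lists of (hash_code, label) sorted by hash_code.
    The reference cursor i only moves forward, so the larger reference
    list is traversed once.
    """
    matches = []
    i = 0
    for code, read_id in read_entries:
        while i < len(ref_entries) and ref_entries[i][0] < code:
            i += 1
        j = i
        while j < len(ref_entries) and ref_entries[j][0] == code:
            matches.append((read_id, ref_entries[j][1]))
            j += 1
    return matches
```

Under these assumptions, a single substitution alters at most k windows, so two signatures differ in at most 2k bits, and bit sampling then gives near-identical signatures a high chance of receiving the same code. Because both lists are sorted by hash code, candidate pairs are found without random access into the reference entries.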

Files in This Item:

File	Size	Format
fulltext.html	147 B	HTML
