|Department:||Department of Computer Science|
|Supervisor:||Supervisor: Dr. Li, Shuaicheng; First Reader: Dr. Wong, Hau San Raymond; Second Reader: Prof. Zhang, Qingfu|
|Abstract:||The advent of metagenomics in the past few years is one of the most notable developments in bioinformatics. Metagenomics is the study of biological samples taken from the environment without cultivation, so it can capture detailed information that traditional cultivation-based analysis methods might miss. However, most current metagenomic analysis processes are limited in accuracy and in their further applications. For instance, many analyses focus only on the basic part of metagenomics instead of delving deeply into the genetic data. Even for the basic part, efficiency and usability still need improvement. With the explosion of metagenomic analysis, there is an urgent need for a standard workflow, or pipeline, that integrates both basic analysis of microbial data and advanced analysis. In this project, a metagenomic analysis pipeline (MGAP) is implemented. MGAP features two layers of analysis: basic analysis and network analysis. Starting from metagenome data, the basic analysis filters out low-quality data, removes host genes using a host reference, and performs assembly, coding sequence prediction, alignment against reference databases, and finally statistical calculation of abundance. The first layer of MGAP can process hundreds of samples in parallel, which could significantly improve efficiency. During processing, intermediate products are continuously visualized and summarized, serving as supplementary files that let the user monitor overall progress. The second layer conducts network analysis. Genes are clustered according to certain criteria, and the relations between gene clusters and certain phenotypes are then measured using either Pearson correlation or Context Likelihood of Relatedness. 
Finally, for the clusters most closely related to phenotypes, influence analysis is applied to locate the key genes that drive the whole subnetwork of genes. To better mine the information behind the abundance data, this layer implements several algorithms for genetic data, for example key driver detection and Principal Component Analysis rotation, and has obtained good results. After MGAP was implemented, its mechanism was modified with usability and user-friendliness in mind to better suit metagenomics researchers. MGAP now comes in two versions: semi-automated and fully automated. The former runs in a Linux environment and requires supervision. Once the list of samples is prepared, MGAP is executed step by step. The advantage of this version is its greater flexibility: users can try different parameters for different steps according to the actual situation. The latter is now available online. Built on FlowSmart, an automatic backend pipeline scheduling and monitoring framework developed in a previous Final Year Project, MGAP became more efficient because of parallel processing and shortened system idle time.|
|Appears in Collections:||Computer Science - Undergraduate Final Year Projects |
Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.