
Please use this identifier to cite or link to this item:
http://hdl.handle.net/2031/5249

Title:  Pseudoperiodic repeats, biclustering algorithms and quasibicliques 
Other Titles:  Ni chong fu xu lie, shuang ju lei suan fa he zhun wan quan er fen tu 擬重複序列, 雙聚類演算法和準完全二分圖 
Authors:  Liu, Xiaowen (劉曉文) 
Department:  Department of Computer Science 
Degree:  Doctor of Philosophy 
Issue Date:  2007 
Publisher:  City University of Hong Kong 
Subjects:  Computational biology. Bioinformatics. 
Notes:  xi, 106 leaves : ill. 30 cm. Thesis (Ph.D.)--City University of Hong Kong, 2007. Includes bibliographical references (leaves 98-105). CityU Call Number: QH324.2 .L583 2007 
Type:  thesis 
Abstract:  In this thesis, we study several important problems in computational biology and
bioinformatics. These problems are the pseudoperiodic repeat discovery problem,
the biclustering problem and the quasibiclique problem.
The genomes of many species are dominated by short sequences repeated consecutively.
It is estimated that over 10% of the human genome consists of tandemly
repeated sequences. Finding repeated regions in long sequences is important in
sequence analysis. In Chapter 2, we develop a software tool, LocRepeat, that finds
regions of pseudoperiodic repeats in a long sequence. We use the definition of the
pseudoperiodic partition of a region and design an algorithm that can select the
repeated region from a given long sequence and give the pseudoperiodic partition
of the region.
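As a rough illustration of what a repeated region looks like, the sketch below scans a sequence for exact tandem repeats of one fixed period. It is not the LocRepeat algorithm: LocRepeat handles pseudoperiodic (inexact) repeats and outputs a pseudoperiodic partition, neither of which is attempted here. The function name and its parameters are hypothetical.

```python
def find_exact_tandem_repeats(seq, period, min_copies=2):
    """Return half-open (start, end) spans that are exactly periodic.

    Simplified stand-in, NOT the thesis algorithm: only exact repeats
    of a single, user-supplied period are detected.
    """
    regions = []
    n = len(seq)
    j = 0
    while j + period < n:
        if seq[j] != seq[j + period]:
            j += 1
            continue
        start = j
        # Extend the run while each symbol equals the one `period` ahead.
        while j + period < n and seq[j] == seq[j + period]:
            j += 1
        end = j + period  # the last matched symbol sits at index end - 1
        if end - start >= min_copies * period:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    s = "GGACACACACTT"
    for start, end in find_exact_tandem_repeats(s, period=2):
        print(start, end, s[start:end])
```

A practical tool would also have to search over periods and tolerate substitutions, insertions and deletions between copies.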
One of the main goals in the analysis of microarray data is to identify groups
of genes and groups of experimental conditions (including environments, individuals
and tissues) that exhibit similar expression patterns. This is the so-called
biclustering problem. In Chapter 3, we consider two variants of the biclustering
problem: the consensus submatrix problem and the bottleneck submatrix problem.
The input to both problems consists of an m × n matrix A and integers l and k. The consensus submatrix problem is to find an l × k submatrix with l < m and
k < n and a consensus vector such that the sum of distances between the rows in
the submatrix and the consensus vector is minimized. The bottleneck submatrix
problem is to find an l × k submatrix with l < m and k < n, an integer d and
a center vector such that the distance between every row in the submatrix and
the vector is at most d and d is minimized. We show that both problems are NP-hard
and give randomized approximation algorithms for special cases of the two
problems. Using standard techniques, we can derandomize the algorithms to get
polynomial time approximation schemes for the two problems. To our knowledge,
this is the first time that approximation algorithms with guaranteed ratios are
presented for the biclustering problem.
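The two objectives can be made concrete with a small sketch. It assumes Hamming distance between a row and the consensus or center vector (the abstract does not name the distance), and it uses an exhaustive search that is exponential in the matrix size, so it only illustrates the problem statements and is not one of the approximation algorithms of the thesis. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(u, v):
    """Number of positions where two equal-length vectors differ (assumed distance)."""
    return sum(a != b for a, b in zip(u, v))


def submatrix(A, rows, cols):
    """Entries of A restricted to the chosen row and column indices."""
    return [[A[i][j] for j in cols] for i in rows]


def consensus_cost(A, rows, cols):
    """Column-wise majority vector and the sum of row distances to it.

    For Hamming distance and a fixed submatrix, the majority vector
    minimizes the sum of distances.
    """
    sub = submatrix(A, rows, cols)
    consensus = [Counter(col).most_common(1)[0][0] for col in zip(*sub)]
    return consensus, sum(hamming(row, consensus) for row in sub)


def bottleneck_cost(A, rows, cols, center):
    """Largest distance d from any row of the submatrix to a given center vector."""
    return max(hamming(row, center) for row in submatrix(A, rows, cols))


def best_consensus_submatrix(A, l, k):
    """Exhaustive search over all l x k submatrices (tiny inputs only)."""
    m, n = len(A), len(A[0])
    best = None
    for rows in combinations(range(m), l):
        for cols in combinations(range(n), k):
            consensus, cost = consensus_cost(A, rows, cols)
            if best is None or cost < best[0]:
                best = (cost, rows, cols, consensus)
    return best


if __name__ == "__main__":
    A = [
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 0, 0, 1],
    ]
    print(consensus_cost(A, [0, 1, 2], [0, 1, 2]))
    print(bottleneck_cost(A, [0, 1, 2], [0, 1, 2], [0, 1, 1]))
    print(best_consensus_submatrix(A, 2, 2))
```

The exhaustive search examines every choice of l rows and k columns, which is why guaranteed-ratio approximation algorithms are of interest for these NP-hard problems.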
We take another approach to the biclustering problem in Chapter 4. We define
the maximum similarity score for a bicluster and design a polynomial time algorithm
to find an optimal bicluster with the maximum similarity score. To our
knowledge, this is the first formulation for biclustering problems that admits a
polynomial time exact algorithm. The algorithm works for a special case, where
the biclusters are approximately square. We then extend the algorithm to handle
various other cases. Experiments on simulated data and real data show
that the new algorithms outperform most of the existing methods in many cases.
Our new algorithms have the following advantages: (1) no discretization procedure
is required, (2) they perform well on overlapping biclusters, and (3) they work well
on additive biclusters.
Protein-protein interactions (PPIs) are one of the most important mechanisms
in cellular processes. To model protein interaction sites, recent studies have suggested
first finding interacting protein group pairs in large PPI networks
and then searching for conserved motifs within the protein groups to form interacting
motif pairs. To account for noise and incompleteness in biological data,
we propose to use quasibicliques for finding interacting protein group pairs. In
Chapter 5, we investigate two new problems which arise from finding interacting
protein group pairs: the maximum vertex quasibiclique problem and the maximum
balanced quasibiclique problem. We prove that both problems are NP-hard.
This is a surprising result as the widely known maximum vertex biclique problem
is polynomial time solvable [75]. We then propose a heuristic algorithm which uses
the greedy method to find the quasibicliques from PPI networks. Our experimental
results on real data show that this algorithm has a better performance than a
benchmark algorithm for identifying highly matched BLOCKS and PRINTS motifs.
We also report the results of two case studies on interacting motif pairs that
map well to two interacting domain pairs in iPfam. 
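To illustrate how a quasibiclique relaxes a biclique, the sketch below checks one commonly used definition, which is an assumption here because the abstract does not state the one used in the thesis: a pair of vertex sets (X, Y) is a delta-quasibiclique if every vertex in X is adjacent to at least (1 - delta)|Y| vertices of Y and every vertex in Y is adjacent to at least (1 - delta)|X| vertices of X. The helper names and the toy protein labels are hypothetical, and this is only a membership check, not the greedy heuristic of Chapter 5.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) interaction pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_quasi_biclique(adj, X, Y, delta):
    """Check the assumed delta-quasibiclique condition for vertex sets X and Y.

    delta = 0 reduces to an ordinary biclique (every X-Y pair is an edge).
    """
    X, Y = set(X), set(Y)
    if not X or not Y:
        return False
    need_in_y = (1 - delta) * len(Y)  # minimum neighbours each x must have in Y
    need_in_x = (1 - delta) * len(X)  # minimum neighbours each y must have in X
    x_ok = all(len(adj.get(x, set()) & Y) >= need_in_y for x in X)
    y_ok = all(len(adj.get(y, set()) & X) >= need_in_x for y in Y)
    return x_ok and y_ok


if __name__ == "__main__":
    # Toy network: complete between {p1, p2, p3} and {q1, q2, q3}
    # except for one missing (possibly unobserved) interaction p1-q3.
    edges = [
        ("p1", "q1"), ("p1", "q2"),
        ("p2", "q1"), ("p2", "q2"), ("p2", "q3"),
        ("p3", "q1"), ("p3", "q2"), ("p3", "q3"),
    ]
    adj = build_adjacency(edges)
    X, Y = ["p1", "p2", "p3"], ["q1", "q2", "q3"]
    print(is_quasi_biclique(adj, X, Y, delta=0.0))
    print(is_quasi_biclique(adj, X, Y, delta=0.5))
```

Tolerating a bounded fraction of missing edges per vertex is what lets the model absorb noisy or incomplete interaction data.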
Online Catalog Link:  http://lib.cityu.edu.hk/record=b2268760 
Appears in Collections:  CS - Doctor of Philosophy

