City University of Hong Kong

CityU Institutional Repository >
3_CityU Electronic Theses and Dissertations >
ETD - Dept. of Computer Science  >
CS - Doctor of Philosophy  >

Please use this identifier to cite or link to this item:

Title: Class discovery and its application
Other Titles: Mo shi fa xian ji qi ying yong
Authors: Yu, Zhiwen (余志文)
Department: Department of Computer Science
Degree: Doctor of Philosophy
Issue Date: 2008
Publisher: City University of Hong Kong
Subjects: Pattern recognition systems.
Notes: xiii, 166 leaves : ill. 30 cm.
Thesis (Ph.D.)--City University of Hong Kong, 2008.
Includes bibliographical references (leaves 145-166)
CityU Call Number: TK7882.P3 Y28 2008
Type: thesis
Abstract: Class discovery is an important task in pattern recognition whose goal is to classify the data points into a number of classes. Given a data set P, the objectives of class discovery are (i) to classify the data points into the corresponding clusters, which maximizes the similarity between the points in one cluster, and maximizes the dissimilarity between the points in the different clusters, (ii) to identify the number of clusters. In order to achieve these goals, the clustering algorithms and the cluster validity indices are necessary. As a result, both the clustering algorithms and cluster validity indices are addressed in this thesis. In practice, there exist two categories of datasets: one category of datasets is fea- tured by a large number of low dimensional data points, such as pixels in a images, while the other category of datasets is characterized by a small number of high dimen- sional data points, such as the samples in microarray. The first category of datasets pays more attention to the efficiency of the class discovery approach, while the second category of datasets focuses on the accuracy of the class discovery approach. In order to perform class discovery in these categories of datasets, we propose a fast cluster algorithm combining with the traditional cluster validity index to discovery the un- derlying structure of the first category of datasets as efficient as possible, and design a general framework which integrates the data perturbation technique, the cluster ensemble approach and the new cluster validity index to discover the classes from the second category of datasets as accurate as possible. First, we present a hashing based clustering algorithm (HBCA) which is a variant of the K-means algorithm to classify a large number of low dimensional data points in the first category of datasets efficiently. Unlike previous clustering algorithms, HBCA places more emphasis on the running time of the algorithm. Specifically, HBCA first hashes the data points into a set of histogram bins by a hashing function. Then, it determines the initial centers of the clusters (regions) according to this point distribution. Finally, HBCA performs clustering at the histogram bin level, rather than the data point level. Two approaches are proposed to improve the performance of HBCA further: (i) a shrinking process is performed on the histogram bins to reduce the number of distance computations, and (ii) a hierarchical structure is constructed to perform efficient indexing on the histogram bins. Finally, the performance of HBCA is analyzed theoretically and experimentally. The experimental results show that HBCA (i) can be easily implemented, (ii) identifies the clusters effectively and (iii) outperforms most of the current state-of-the-art clustering approaches in terms of efficiency when applied to image segmentation. HBCA is also applied to solve the problem of the detection of visual attention re- gions (VAR). We propose a rule based technique for the extraction of visual attention regions at the object level based on HBCA, such that VAR detection can be per- formed in a very efficient way. The proposed technique consists of four stages: (i) a fast segmentation technique based on HBCA; (ii) a refined specification of VAR which is known as the hierarchical visual attention regions (HVAR); (iii) a new algorithm known as the rule based detection algorithm (RADA) to obtain the set of HVARs in real time, and (iv) a new adaptive image display module and the corresponding adaptation operations using HVAR. We also define a new background measure which combines both feature contrast and the geometric property of the region to identify the background region, and a confidence factor which is used to extract the set of hierarchical visual attention regions. Compared with existing techniques, RADA has two advantages: (i) The approach detects the visual attention region at the object level, which bridges the gap between traditional visual attention regions and high-level semantics; (ii) Our approach is efficient and easy to implement. Second, we design a general framework for class discovery from the second category of datasets which is featured by a small number of a very high dimensional data points. The new framework integrates the data perturbation technique, the cluster ensemble approach and the cluster validity index. Based on the new framework, we propose two cluster validation approaches. The first approach integrates the random subspace technique, the cluster ensemble approach based on the normalized cut and the Mod- ified Rand Index, while the second approach integrates the noise injection technique, the cluster ensemble based on Neural Gas and the Disagreement/Agreement (DA) index. The experiments in the synthetic datasets and real datasets show that (i) Both the Modified Rand Index and the Disagreement/Agreement index successfully discovers the underlying structure from most of the synthetic datasets and the real datasets, and (ii) both new indices outperform most of the state-of-the-art cluster validity indices when applied to gene expression data.
Online Catalog Link:
Appears in Collections:CS - Doctor of Philosophy

Files in This Item:

File Description SizeFormat
abstract.html132 BHTMLView/Open
fulltext.html132 BHTMLView/Open

Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.


Valid XHTML 1.0!
DSpace Software © 2013 CityU Library - Send feedback to Library Systems
Privacy Policy · Copyright · Disclaimer