|Title: ||Effective document clustering approach based on phrase feature extraction|
|Other Titles: ||Ji yu duan yu te zheng ti qu de wen jian fen qun fang fa|
|Authors: ||Chim, Hung (詹洪)|
|Department: ||Department of Computer Science|
|Degree: ||Doctor of Philosophy|
|Issue Date: ||2008|
|Publisher: ||City University of Hong Kong|
|Subjects: ||Cluster analysis.|
Cluster analysis -- Data processing.
Phrase structure grammar.
|Notes: ||CityU Call Number: QA278 .C487 2008|
xiii, 158 leaves : ill. 30 cm.
Thesis (Ph.D.)--City University of Hong Kong, 2008.
Includes bibliographical references (leaves 146-158)
|Abstract: ||This thesis proposes the use of phrases in document clustering, develops algorithms
for this purpose, evaluates their performance with extensive data analysis, and
discusses applications that would benefit from the new technology.
First, the suffix tree model is studied as an efficient tool to identify and extract
the repeated phrases in documents. Then a new phrase-based document similarity
is derived from the tf-idf weights of phrases and used in the group-average
hierarchical clustering algorithm. The effectiveness and efficiency of the new
phrase-based document clustering approach are validated by extensive experiments.
The quality of the clustering results significantly surpasses that of conventional
clustering algorithms that are based on the idea of "bag of words".
Inspired by the success of the new phrase-based document similarity, we
conducted an empirical study to investigate the properties of the suffix tree document
model and to explore the roles of different phrases in reflecting the intra-similarity
of a document class or cluster. The experimental results indicate
that some special phrases appear frequently in the documents of
one class but rarely in the documents of other classes. Significantly, some
phrases are highly relevant to the predefined topic of the document class. This
provides strong evidence to explain why the new phrase-based document similarity
is effective in clustering documents. Consequently, we propose a novel cluster
interpretation approach, in which the significant phrases in a document cluster are
selected to generate a topic hierarchy to present the corresponding cluster.
To sum up, the work described in this thesis presents a complete document
clustering solution: it starts by using a suffix tree to build a data model for
document representation, then develops an effective phrase-based document
clustering approach, and finally generates a topic phrase hierarchy automatically
to expose the main topic of each cluster and present its documents. We apply the
new clustering technique to a popular information system on the Web, the BBS
forum system, and develop an information recommender system to help people find
valuable information more efficiently. In addition, a group decision approach is
proposed to enable and encourage forum members to take part in forum
information assessment and knowledge collaboration.|
|Online Catalog Link: ||http://lib.cityu.edu.hk/record=b2340762|
|Appears in Collections:||CS - Doctor of Philosophy |
Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.