City University of Hong Kong


Title: Effective document clustering approach based on phrase feature extraction
Other Titles: Ji yu duan yu te zheng ti qu de wen jian fen qun fang fa
Authors: Chim, Hung (詹洪)
Department: Department of Computer Science
Degree: Doctor of Philosophy
Issue Date: 2008
Publisher: City University of Hong Kong
Subjects: Cluster analysis.
Cluster analysis -- Data processing.
Phrase structure grammar.
Notes: CityU Call Number: QA278 .C487 2008
xiii, 158 leaves : ill. 30 cm.
Thesis (Ph.D.)--City University of Hong Kong, 2008.
Includes bibliographical references (leaves 146-158)
Type: thesis
Abstract: This thesis proposes using phrases in document clustering, develops algorithms for this purpose, evaluates their performance through extensive data analysis, and discusses applications that would benefit from the new technology. First, the suffix tree model is studied as an efficient tool to identify and extract the repeated phrases in documents. Then a new phrase-based document similarity is derived from the tf-idf weights of phrases and used in the group-average hierarchical clustering algorithm. The effectiveness and efficiency of the new phrase-based document clustering approach are validated by extensive experiments. The quality of the clustering results significantly surpasses that of conventional clustering algorithms based on the "bag of words" idea.
Inspired by the success of the new phrase-based document similarity, we conducted an empirical study to investigate the properties of the suffix tree document model and to explore the roles that different phrases play in reflecting the intra-similarity of a document class or cluster. The experimental results of the study indicate that some special phrases appear frequently in the documents of one class but rarely in the documents of other classes. Significantly, some phrases are highly relevant to the predefined topic of the document class. This provides strong evidence to explain why the new phrase-based document similarity is effective in clustering documents. Consequently, we propose a novel cluster interpretation approach, in which the significant phrases in a document cluster are selected to generate a topic hierarchy that presents the corresponding cluster.
To sum up, the work described in this thesis presents a complete document clustering solution: it starts by using a suffix tree to build a data model for document representation, then develops an effective phrase-based document clustering approach, and finally generates a topic phrase hierarchy automatically to expose the main topic of each cluster and present its documents. We apply the new clustering technique to a popular information system on the Web, the BBS forum system, and develop an information recommender system to help people find valuable information more efficiently. In addition, a group decision approach is proposed to enable and encourage forum members to take part in forum information assessment and knowledge collaboration.
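The pipeline summarized in the abstract (extract shared phrases, weight them with tf-idf, cluster with group-average linkage) can be illustrated with a minimal Python sketch. This is not the thesis implementation. It replaces the suffix tree with brute-force word n-grams, assumes a particular tf-idf formula, and uses average linkage over cross-cluster pairs. All function names and the `max_len` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def phrases(text, max_len=3):
    """Return every word n-gram (1..max_len words) of a document.

    Assumption: brute-force n-grams stand in for the phrases that the
    thesis extracts with a suffix tree.
    """
    words = text.lower().split()
    return [tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(docs, max_len=3):
    """Build one sparse tf-idf vector per document over shared phrases."""
    counts = [Counter(phrases(d, max_len)) for d in docs]
    # Document frequency: number of documents containing each phrase.
    df = Counter(p for c in counts for p in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Keep only phrases that occur in at least two documents.
        # The weighting formula below is an assumed, common variant.
        vectors.append({
            p: (1 + math.log(tf)) * math.log(1 + n_docs / df[p])
            for p, tf in c.items() if df[p] >= 2
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_average_clustering(docs, k, max_len=3):
    """Merge the two most similar clusters until only k remain.

    Cluster similarity is the mean pairwise document similarity across
    the two clusters (average linkage).
    """
    vecs = tfidf_vectors(docs, max_len)
    n = len(docs)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > max(k, 1):
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
            score = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or score > best[0]:
                best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    sample = [
        "the suffix tree model extracts repeated phrases",
        "a suffix tree model finds repeated phrases quickly",
        "forum members rate each post online",
        "forum members rate every thread online",
    ]
    print(group_average_clustering(sample, 2))
```

The sketch recomputes every cluster-pair average on each pass, which is adequate for a handful of documents but not for the corpus sizes a thesis-scale evaluation would need. A suffix tree would also find shared phrases of any length without enumerating n-grams.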
Appears in Collections:CS - Doctor of Philosophy


Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.

