City University of Hong Kong
Please use this identifier to cite or link to this item: http://hdl.handle.net/2031/4413

Title: The identification of stop words and keywords : a study of automatic term weighting in natural language text processing
Other Titles: Ting dun ci yu guan jian ci de jian bie : guan yu zi ran yu yan wen ben chu li zhong zi dong shu yu jia quan de yan jiu
停頓詞與關鍵詞的鑑別 : 關於自然語言文本處理中自動術語加權的研究
Authors: Zou, Feng (鄒鳳)
Department: Dept. of Computer Science
Degree: Master of Philosophy
Issue Date: 2006
Publisher: City University of Hong Kong
Subjects: Natural language processing (Computer science)
Text processing (Computer science)
Notes: CityU Call Number: QA76.9.N38 Z68 2006
Includes bibliographical references (leaves 80-88)
Thesis (M.Phil.)--City University of Hong Kong, 2006
x, 88 leaves : ill. ; 30 cm.
Type: Thesis
Abstract: This thesis addresses two important problems related to term weighting in natural language text processing. The first is the identification of stop words in Chinese text processing: automatically constructing a complete Chinese stop word list, which saves the time and effort of selecting stop words manually. The second is the identification of keywords, which can be regarded as the opposite of stop word identification. Both problems matter in many fields related to text processing, such as information retrieval, text categorization and summarization, because they can greatly affect experimental performance. Although Chinese is one of the most widely used languages in the world, and in contrast to English, no method for identifying Chinese stop words has existed until now. The lack of spaces or other word delimiters in Chinese, together with the small variation in word length, makes stop words harder to extract. In this thesis, we first investigate the Chinese segmentation problem, an unavoidable step before stop word identification. We propose a unified segmentation algorithm for Chinese that uses web mining. Experiments show that this algorithm outperforms traditional segmentation algorithms. Building on this better understanding of Chinese segmentation, we then develop an efficient method for automatically extracting Chinese stop word lists. In our experiments, we construct a complete Chinese stop word list from a large corpus. We also present several novel methodologies for evaluating the effectiveness of our Chinese stop word list, including applications of stop words in automatic abstract extraction and word segmentation. Based on the study of Chinese stop word weighting, we closely examine several important keyword weighting schemes currently used in natural language text processing.
Taking into account an information factor, "entropy", which describes the special characteristics of the distribution of keywords, we propose a new term weighting scheme based on TF*IDF. Comprehensive comparisons between traditional schemes and ours show that this scheme outperforms the TF*IDF scheme in some circumstances.
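The abstract does not state how the entropy factor is combined with TF*IDF, so the following is only a minimal sketch of the two ingredients it names, not the scheme proposed in the thesis. It assumes raw term counts for TF, log(N/df) for IDF, and Shannon entropy over a term's per-document counts; the function names `tf_idf` and `distribution_entropy` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: TF is the raw count and IDF is log(N / df).
    Each document is a list of already-segmented tokens.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def distribution_entropy(term, docs):
    """Shannon entropy (in bits) of a term's occurrences across documents.

    A term spread evenly over many documents has high entropy;
    a term concentrated in one document has entropy 0.
    """
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


if __name__ == "__main__":
    docs = [["a", "b"], ["a", "c"]]
    print(tf_idf(docs))
    print(distribution_entropy("a", docs), distribution_entropy("b", docs))
```

In this toy corpus the term "a" appears in every document, so its IDF (and therefore its TF*IDF weight) is zero and its entropy is at the maximum for two documents, which is the behaviour one would expect of a stop word candidate. The term "b" is concentrated in a single document and has zero entropy.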
Online Catalog Link: http://lib.cityu.edu.hk/record=b2147171
Appears in Collections: CS - Master of Philosophy


Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 
