City University of Hong Kong


Title: Effectiveness of phrase in information retrieval
Other Titles: Pian yu zai xin xi jian suo de cheng xiao ying xiang
Authors: Chang, Chor Ming Matthew (鄭礎明)
Department: Department of Computer Science
Degree: Doctor of Philosophy
Issue Date: 2008
Publisher: City University of Hong Kong
Subjects: Indexing.
Information retrieval.
Notes: CityU Call Number: Z695.9 .C43 2008
xii, 136 leaves : ill. 30 cm.
Thesis (Ph.D.)--City University of Hong Kong, 2008.
Includes bibliographical references (leaves 116-136)
Type: thesis
Abstract: With the advent of the Internet and advances in computer technology, document retrieval faces two notable challenges: retrieving huge amounts of data efficiently, and correctly identifying the most useful documents among many relevant pages. Although various phrase-finding and indexing methods have been proposed in the past, conclusions on the usefulness of phrases as indexing units have been generally inconsistent. Nevertheless, a number of recent research groups, including leading groups that have participated in TREC campaigns, have used phrases as indexing units and obtained some improvement. Because phrases have traditionally been regarded as precision-enhancing tools, recent research continues to apply the concept of the phrase to different IR problems. Following this tradition, this thesis studies the concept of the phrase in information retrieval, especially in document retrieval. To address the two challenges, we first propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure extends previous index structures for phrases and achieves better query efficiency at a modest extra storage cost. Further efficiency gains can be attained by implementing our index in a way that reflects our observation that common word sets are dynamic in nature. In experimental evaluation, a common phrase index using 255 common words improves query time over an auxiliary nextword index by about 11% for all queries and about 62% for large queries (queries containing long phrases), while needing only about 19% extra storage space. Compared with an inverted index, the improvement is about 72% for all queries and about 87% for large queries. We also propose implementing a common phrase index with a dynamic update feature; our experiments show that this yields a further improvement in time efficiency.
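As background for the index structures named above, the following minimal Python sketch contrasts phrase lookup in a positional inverted index with lookup in a simple nextword-style index keyed by word pairs. It is a generic illustration only, not the common phrase index proposed in the thesis, whose design is not detailed in this abstract. The sample documents `DOCS` and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy collection: doc id -> text.
DOCS = {
    1: "the cat sat on the mat",
    2: "the mat sat on the cat",
    3: "a cat on the mat",
}

def build_positional_index(docs):
    """Inverted index with positions: word -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def build_nextword_index(docs):
    """Nextword-style index: (word, next word) -> {doc_id: [position of word]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        words = text.lower().split()
        for pos in range(len(words) - 1):
            index[(words[pos], words[pos + 1])][doc_id].append(pos)
    return index

def _match(postings):
    """Given one {doc_id: [positions]} map per phrase slot, return the docs
    that have a start position p with p + i present in slot i for every i."""
    if not postings or any(not p for p in postings):
        return set()
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    hits = set()
    for doc_id in candidates:
        later = [set(p[doc_id]) for p in postings[1:]]
        for start in postings[0][doc_id]:
            if all(start + i + 1 in s for i, s in enumerate(later)):
                hits.add(doc_id)
                break
    return hits

def phrase_query_positional(index, phrase):
    """Exact phrase query: one posting list per word."""
    words = phrase.lower().split()
    return _match([index.get(w, {}) for w in words])

def phrase_query_nextword(index, phrase):
    """Exact phrase query: one posting list per adjacent word pair."""
    words = phrase.lower().split()
    if len(words) < 2:
        raise ValueError("a nextword lookup needs at least two words")
    pairs = list(zip(words, words[1:]))
    return _match([index.get(pair, {}) for pair in pairs])
```

In this sketch the pair-keyed lists are much shorter than the list for a common word such as "the", which is the general reason pair-based indexes can speed up phrase queries at the cost of extra storage.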
To improve the quality of retrieval results, we devise a proximity-based ranking function that combines an “ordered loose phrase” score with the state-of-the-art Okapi probabilistic model (BM25). We say that a phrase occurs in a document in ordered loose phrase form when the words of the phrase appear sufficiently close to each other and in the same order as in the query. Such an occurrence of the words of a query phrase in a document may indicate that the document is highly relevant to the query. Our experiments use the query sets from TREC-11, TREC-12 and TREC-13 and the .GOV document collection. The results show that our method compares favorably with pure BM25 and with three recent methods based on term proximity and co-occurrence on most of the performance measures. For TREC-12 and TREC-13, our results demonstrate that our method can significantly improve the quality of search results.
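The idea of combining BM25 with an ordered loose phrase score can be sketched in Python as follows. This is an illustration under stated assumptions, not the ranking function from the thesis: the abstract does not quantify "sufficiently close", so the `window` parameter (maximum span in tokens), the logarithmic damping, and the additive `weight` are all hypothetical choices, and the BM25 formula uses a common non-negative IDF variant.

```python
import math
from bisect import bisect_right
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """BM25 for one tokenized document against a list of tokenized documents."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in docs if term in d)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

def count_ordered_loose(query, doc, window):
    """Count start positions where the query words occur in query order
    with first-to-last span smaller than `window` tokens (an assumption)."""
    positions = {}
    for i, word in enumerate(doc):
        positions.setdefault(word, []).append(i)
    count = 0
    for start in positions.get(query[0], []):
        prev, ok = start, True
        for word in query[1:]:
            plist = positions.get(word, [])
            j = bisect_right(plist, prev)  # earliest occurrence after prev
            if j == len(plist) or plist[j] - start >= window:
                ok = False
                break
            prev = plist[j]
        if ok:
            count += 1
    return count

def combined_score(query, doc, docs, window=5, weight=1.0):
    """Hypothetical combination: BM25 plus a damped loose-phrase bonus."""
    loose = count_ordered_loose(query, doc, window)
    return bm25_score(query, doc, docs) + weight * math.log(1 + loose)
```

Taking the earliest next occurrence at each step is sufficient here because the only constraint is the total span, so the earliest choice never rules out a match that a later choice would allow.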
Appears in Collections: CS - Doctor of Philosophy


Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.

