Please use this identifier to cite or link to this item:
http://hdl.handle.net/2031/5505
| Title: | Effectiveness of phrase in information retrieval |
| Other Titles: | Pian yu zai xin xi jian suo de cheng xiao ying xiang 片語在信息檢索的成效影響 |
| Authors: | Chang, Chor Ming Matthew (鄭礎明) |
| Department: | Department of Computer Science |
| Degree: | Doctor of Philosophy |
| Issue Date: | 2008 |
| Publisher: | City University of Hong Kong |
| Subjects: | Indexing. Information retrieval. |
| Notes: | CityU Call Number: Z695.9 .C43 2008 xii, 136 leaves : ill. 30 cm. Thesis (Ph.D.)--City University of Hong Kong, 2008. Includes bibliographical references (leaves 116-136) |
| Type: | thesis |
| Abstract: | With the advent of the Internet and advances in computer technology, there
are two notable challenges in document retrieval: retrieving huge amounts of
data efficiently, and correctly identifying the most useful documents out of
many relevant pages.
Although various phrase-finding and indexing methods have been proposed
in the past, conclusions on the usefulness of phrases as indexing units have been
generally inconsistent. Nevertheless, a number of recent research groups, including
the leading groups who have participated in TREC campaigns, have
used phrases as indexing units and have been able to obtain some improvement.
As phrases have traditionally been regarded as precision-enhancing tools, recent
research continues to apply the concept of phrase to different IR problems.
In this thesis, following this tradition, we are interested in the concept of phrase
in information retrieval, especially for document retrieval.
To address the two challenges, we first propose a common phrase index
as an efficient index structure to support phrase queries in a very large text
database. Our structure is an extension of previous index structures for phrases
and achieves better query efficiency with modest extra storage cost. Further
efficiency gains can be attained by implementing the index so that it exploits
our observation that common word sets are dynamic. In experimental
evaluation, a common phrase index using 255 common words yields
an improvement in query time of about 11% for all queries and about 62% for
large queries (queries of long phrases) over an auxiliary nextword
index, while needing only about 19% extra storage space. Compared
with an inverted index, the improvement is about 72% and 87% for
all and large queries respectively. We also propose implementing a common
phrase index with a dynamic update feature; our experiments show that this
achieves a further improvement in time efficiency.
For improving the quality of retrieval results, we devise a proximity-based
ranking function that combines an “ordered loose phrase” scoring with the
state-of-the-art Okapi probabilistic model (BM25). We say that a phrase occurs
in a document in ordered loose phrase form when the words of the
phrase appear sufficiently close to each other and in the same order as in the
query. The occurrence of an ordered loose phrase, formed by the words of a
query phrase in a document, may indicate that the document is highly relevant
to the query. Our experiments use the query sets from TREC-11,
TREC-12 and TREC-13 and the .GOV document collection. The results show
that our method compares favorably with pure BM25 and with three recent
works based on term proximity and co-occurrence on most of the performance
measures. For TREC-12 and TREC-13, our results demonstrate
that our method can improve the quality of search results significantly. |
| Online Catalog Link: | http://lib.cityu.edu.hk/record=b2340707 |
| Appears in Collections: | CS - Doctor of Philosophy |
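
The abstract's definition of an ordered loose phrase (query words appearing close together and in query order) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `ordered_loose_matches` is hypothetical, and modelling "sufficiently close" as a fixed token window `max_span` is an assumption, since the abstract does not state the actual closeness criterion or how matches are combined with BM25.

```python
from typing import List, Sequence, Tuple


def ordered_loose_matches(
    doc_tokens: Sequence[str],
    phrase: Sequence[str],
    max_span: int,
) -> List[Tuple[int, int]]:
    """Find (start, end) token positions where the words of `phrase`
    occur in query order within a window of `max_span` tokens.

    `max_span` is an assumed stand-in for "sufficiently close".
    """
    if not phrase:
        return []
    matches = []
    for start, token in enumerate(doc_tokens):
        if token != phrase[0]:
            continue
        # The window covers positions start .. start + max_span - 1.
        limit = min(start + max_span, len(doc_tokens))
        pos = start
        complete = True
        for word in phrase[1:]:
            # Greedily take the earliest occurrence after the previous word;
            # this gives the smallest possible end position for this start.
            nxt = next(
                (i for i in range(pos + 1, limit) if doc_tokens[i] == word),
                None,
            )
            if nxt is None:
                complete = False
                break
            pos = nxt
        if complete:
            matches.append((start, pos))
    return matches


doc = "the effectiveness of a common phrase index for phrase queries".split()
print(ordered_loose_matches(doc, ["common", "index"], max_span=3))  # prints [(4, 6)]
```

With `max_span=2` the same query finds no match, because "phrase" sits between the two query words, and reversing the query to `["index", "common"]` also finds none, because word order is enforced.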
Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.