|Title: ||Short text clustering for question answering systems|
|Other Titles: ||Yong hu jiao hu shi wen da xi tong zhong duan wen ben ju lei de yan jiu yu ying yong|
|Authors: ||Ni, Xingliang ( 倪興良)|
|Department: ||Department of Computer Science|
|Degree: ||Doctor of Philosophy|
|Issue Date: ||2011|
|Publisher: ||City University of Hong Kong|
|Subjects: ||Question-answering systems; Text processing (Computer science)|
|Notes: ||CityU Call Number: QA76.9.Q4 N54 2011. 86 leaves : ill. 30 cm. Thesis (Ph.D.)--City University of Hong Kong, 2011. Includes bibliographical references (leaves 78-86).|
|Abstract: ||With the rapid development of Web 2.0, User-Interactive Question Answering (UIQA) systems have attracted increasing attention. UIQA systems connect askers with answerers and encourage the answerers in the QA community to solve questions. However, UIQA systems are also filled with duplicate or similar questions. This redundancy prevents users from obtaining knowledge quickly.
We investigate short text clustering algorithms for grouping the questions in a UIQA system. A new clustering strategy, TermCut, is presented to cluster short text snippets by finding core terms in the corpus. To find the core terms, we model the collection of short text snippets as a graph in which each vertex represents a short text snippet and each weighted edge between two vertices measures the relationship between them. Each term can bisect the graph such that the snippets in one part of the graph contain the term, whereas the snippets in the other part do not. The term that minimizes the inter-class similarity and maximizes the inner-class similarity is selected as the core term. TermCut then bisects the short text collection into two clusters: the snippets in one cluster contain the core term, whereas the snippets in the other cluster do not. We bisect the collection iteratively until a final set of clusters is formed.
Based on the TermCut strategy, we propose two clustering algorithms: Cluster Number based TermCut (CNTC) and Threshold based TermCut (TTC). CNTC uses prior knowledge of the target cluster number as its stop condition: the bisection terminates when the target number of clusters is reached. However, in some cases it is difficult to obtain the target cluster number in advance. Unlike CNTC, TTC uses a similarity threshold to decide whether to stop bisecting. The clustering process of TTC stops when a bisection no longer leads to any improvement of the inter-class similarity and the inner-class dissimilarity.
We design a prototype that applies the proposed short text clustering algorithm to question recommendation. A topic-based user interest model is proposed to capture different user interests. Based on this model, we rank the questions according to each user's interests. The top-ranked questions are clustered and recommended to the user. A demonstration of the clustering algorithm is then given.|
|Online Catalog Link: ||http://lib.cityu.edu.hk/record=b4086660|
|Appears in Collections:||CS - Doctor of Philosophy |
Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.
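
The bisection strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the edge weights or the exact objective, so the Jaccard similarity in `jaccard`, the scoring rule in `split_score` (mean within-part similarity minus mean cross-part similarity), and the names `termcut`, `target_clusters`, and `threshold` are all assumptions made for illustration. Passing `target_clusters` mimics a CNTC-style stop condition, and passing `threshold` mimics a TTC-style one.

```python
from itertools import combinations


def jaccard(a, b):
    """Assumed edge weight: Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_sim(pairs, terms):
    """Mean similarity over index pairs; 0.0 when there are no pairs."""
    sims = [jaccard(terms[i], terms[j]) for i, j in pairs]
    return sum(sims) / len(sims) if sims else 0.0


def split_score(with_term, without_term, terms):
    """Assumed objective: within-part similarity minus cross-part similarity."""
    inner = list(combinations(with_term, 2)) + list(combinations(without_term, 2))
    inter = [(i, j) for i in with_term for j in without_term]
    return mean_sim(inner, terms) - mean_sim(inter, terms)


def best_bisection(cluster, terms):
    """Try every term as a cut; return (score, term, with, without) or None."""
    best = None
    for term in sorted(set().union(*(terms[i] for i in cluster))):
        with_term = [i for i in cluster if term in terms[i]]
        without_term = [i for i in cluster if term not in terms[i]]
        if not with_term or not without_term:
            continue  # this term does not actually split the cluster
        score = split_score(with_term, without_term, terms)
        if best is None or score > best[0]:
            best = (score, term, with_term, without_term)
    return best


def termcut(snippets, target_clusters=None, threshold=None):
    """Iteratively bisect snippets; returns clusters as lists of snippet indices."""
    terms = [frozenset(s.lower().split()) for s in snippets]
    clusters = [list(range(len(snippets)))]
    while True:
        # CNTC-style stop: the requested number of clusters has been reached.
        if target_clusters is not None and len(clusters) >= target_clusters:
            break
        candidates = []
        for pos, cluster in enumerate(clusters):
            found = best_bisection(cluster, terms) if len(cluster) > 1 else None
            if found is not None:
                candidates.append((found[0], pos, found))
        if not candidates:
            break  # nothing left that can be split
        score, pos, (_, _, with_term, without_term) = max(
            candidates, key=lambda c: c[0]
        )
        # TTC-style stop: the best available cut is not good enough.
        if threshold is not None and score <= threshold:
            break
        clusters[pos:pos + 1] = [with_term, without_term]
    return clusters


questions = [
    "python list sort",
    "python dict sort",
    "java thread pool",
    "java thread lock",
]
by_count = termcut(questions, target_clusters=2)
by_threshold = termcut(questions, threshold=0.0)
```

In this toy input, both stop conditions separate the two Python questions from the two Java questions. A real system would need a proper tokenizer, term weighting, and the similarity measures defined in the thesis itself.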