City University of Hong Kong


Title: Enhanced term extraction based on probabilistic estimation from syntactic parse trees
Other Titles: Ji yu ju fa shu gai lü ce suan de shu yu you hua ti qu
Authors: Zhang, Xing (張杏)
Department: Department of Chinese, Translation and Linguistics
Degree: Doctor of Philosophy
Issue Date: 2011
Publisher: City University of Hong Kong
Subjects: Terms and phrases -- Data processing.
Notes: CityU Call Number: P305.18.D38 Z46 2011
xii, 200 leaves 30 cm.
Thesis (Ph.D.)--City University of Hong Kong, 2011.
Includes bibliographical references (leaves 165-175)
Type: thesis
Abstract: This research explores how the syntactic information of term candidates can be exploited for the task of term extraction. It proposes an approach that represents a novel, linguistically motivated perspective in the area of terminological processing. The hypothesis of this work is that terms tend to perform certain types of syntactic functions more prominently than others. This syntactic behaviour of terms can be captured as termhood by estimating term probabilities from their occurrences in different syntactic paths. Based on a large corpus of parse trees, this feature allows for highly reliable statistics on properties of term occurrences. In essence, the method is a weighting scheme that measures probabilistic relations between term occurrence patterns and syntactic paths; it is discussed in this thesis as Syntactic Function Value (SF-Value) and implemented in a term extraction system. Experiments conducted in this study begin by building an automatic term extraction system that integrates this weighting scheme. The purpose of these experiments is not to design a term extraction system with the best performance but to investigate the contributions of syntactic information to the extraction of single-word terms, multi-word terms, and new terms. Specifically, the experiments aim to answer several research questions, including the following: whether linguistic knowledge in the form of term rates in syntactic paths is useful for recognising candidate terms in medical texts; to what extent single-word terms can be extracted by this linguistic indicator; how this linguistically based metric can be used to improve the ranking of multi-word terms; and whether term rates in syntactic paths can be used effectively for new term extraction.
Finally, to investigate whether this linguistic metric can be used as an effective feature within a machine learning framework, a series of experiments is conducted on general term extraction and new term extraction using Conditional Random Fields (CRF). Empirical results indicate that the term extraction approach proposed in this study outperforms two existing term extractors. The key technique of this term extraction system, SF-Value, proves to be especially useful in selecting single-word terms and is also effective in enhancing the ranking of multi-word term candidates after their initial ranking by a statistical measure, C-Value. With regard to new term extraction, results show that SF-Value does not perform as well, which suggests that more features are needed to distinguish new terms from known terms. The CRF framework is subsequently applied to the extraction of new terms, with SF-Value and term rate as added features. Results show that this machine learning framework performs quite well in general term extraction. However, for the task of generating a list of new term candidates, the framework does not perform as well as expected. This result indicates that, for new term extraction, more features related to new term candidates should be taken into consideration in addition to syntactic function information. In conclusion, this study devises an innovative, linguistically motivated measure for term extraction and implements it in a software system. Comprehensive experiments are conducted to evaluate its performance, and empirical results demonstrate its superior performance in comparison with existing term extraction systems.
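Illustration (not part of the thesis record): the abstract describes estimating term probabilities from occurrences in syntactic paths but does not give the SF-Value formula. The sketch below is only one plausible reading, under stated assumptions: a "term rate" is taken to be the fraction of occurrences on a path that are known terms, and a candidate is scored by averaging the rates of the paths it occurs on. The function names, the path labels such as "S/NP", and the toy counts are all hypothetical.

```python
from collections import defaultdict


def term_rates(observations):
    """Estimate, per syntactic path, the share of occurrences that are known terms.

    observations: iterable of (path, is_known_term) pairs (hypothetical input format).
    """
    total = defaultdict(int)
    terms = defaultdict(int)
    for path, is_term in observations:
        total[path] += 1
        if is_term:
            terms[path] += 1
    return {path: terms[path] / total[path] for path in total}


def candidate_score(candidate_paths, rates):
    """Assumed scoring rule: mean term rate over the paths a candidate occurs on.

    Paths never seen in the training observations are ignored.
    """
    known = [rates[p] for p in candidate_paths if p in rates]
    return sum(known) / len(known) if known else 0.0


# Toy data, invented purely for illustration.
observations = [
    ("S/NP", True), ("S/NP", True), ("S/NP", False), ("S/NP", True),
    ("VP/PP/NP", True), ("VP/PP/NP", False),
    ("VP/ADVP", False), ("VP/ADVP", False),
]
rates = term_rates(observations)
score = candidate_score(["S/NP", "VP/PP/NP"], rates)
```

Under this reading, a candidate seen mostly on paths with high term rates receives a higher score; the weighting actually used in the thesis may differ.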
Appears in Collections: CTL - Doctor of Philosophy


Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.

