Please use this identifier to cite or link to this item:
http://hdl.handle.net/2031/6250
| Title: | Developing effective algorithms for document data mining |
| Other Titles: | Wen ben shu ju wa jue de gao xiao suan fa yan jiu 文本數據挖掘的高效算法研究 |
| Authors: | Zhang, Haijun (張海軍) |
| Department: | Department of Electronic Engineering |
| Degree: | Doctor of Philosophy |
| Issue Date: | 2010 |
| Publisher: | City University of Hong Kong |
| Subjects: | Data mining. Text processing (Computer science) |
| Notes: | CityU Call Number: QA76.9.D343 Z38 2010. xiv, 164 leaves : ill. 30 cm. Thesis (Ph.D.)--City University of Hong Kong, 2010. Includes bibliographical references (leaves 156-163). |
| Type: | thesis |
| Abstract: | In this thesis, novel models, effective algorithms, and their corresponding applications
for document data mining are studied. The Internet has undoubtedly become an
indispensable component of daily life, with uses ranging from restaurant booking to technology
research. Data mining techniques for Internet-related documents influence
many fields. This thesis focuses on the development of new models and efficient
algorithms for document retrieval, plagiarism detection, and anti-phishing applications.
Internet access, such as through the World Wide Web (WWW), has made document retrieval
increasingly demanding, as collecting and searching documents has become an integral
part of many people's lives. Accuracy and speed are two key measures of an effective retrieval
methodology. Existing document retrieval systems use statistical methods and natural
language processing approaches combined with different document representations and query
structures. Document retrieval has attracted considerable interest in the information retrieval
community. Document retrieval refers to finding documents similar to a given user's query.
A user's query can range from a full description of a document to a few keywords. Most
of the widely used retrieval approaches are keyword-based search methods, e.g.,
www.google.com, in which untrained users provide a few keywords to the search engine
and find the relevant documents in a returned list. Another type of document retrieval uses
a query document to search for similar ones. Using an entire document as a query improves
retrieval accuracy, but it is more computationally demanding than
the keyword-based method. In addition to the retrieval task, document classification and
clustering have also become important in organizing the massive amount of document data;
they use similar feature extraction approaches to facilitate the classification and
clustering process. Until now, most conventional models have used coarse document features, such
as the terms in documents, as feature units. The connections among terms are usually overlooked,
which results in the loss of important semantic information. Thus, there is a need
to develop more effective document representation schemes to enhance the performance of
the relevant document data mining tasks. In this thesis, first we develop a graph model for document
representation that allows more semantic information to be included; second, we develop two
statistical models, i.e., dual wing harmonium models, to generate distributed latent
representations of documents by modeling multiple features jointly; third, we
introduce a new document similarity measure based on the concept of the Earth Mover's
Distance.
The online environment, however, poses a severe challenge to textual intellectual
property because the Internet and computer technology have made it easy to disseminate knowledge
across the world. People can search, copy, save, and reuse online sources with ease.
Detecting cut-and-paste plagiarism has become a growing concern in the education
system. One of the difficulties in detecting plagiarism efficiently is finding the source with
a fast query response, because people may copy from any one of millions of documents on the
Internet, each of which usually contains thousands of words. In this thesis, we
propose a coarse-to-fine framework to efficiently thwart plagiarism. Each document is
represented by a multilevel structure, i.e., document-paragraph-sentence. Different signatures
are constructed to represent components at different levels. Document retrieval
approaches that add, or rely solely on, local information to exploit the rich semantics of
documents are introduced to retrieve the suspected sources. Plagiarism detection algorithms
based on further sentence matching are designed to identify the plagiarized sources.
Detection of phishing web pages, i.e., forged pages that mimic the pages of real web
sites, is another major concern in the current information technology (IT) world. Malicious
people create phishing web pages to steal individuals' personal information, such as bank
account details, passwords, credit card numbers, and other financial data. Recognition of these
phishing web pages has attracted much attention from security and software providers,
financial institutions, and academic researchers, as phishing is a severe security and privacy problem
that has had a large negative impact on the Internet world. In this thesis, we present a
hybrid anti-phishing framework. This framework synthesizes multiple cues, i.e., textual
content and visual content, from a given web page and automatically reports a phishing
web page by using a text classifier, an image classifier, and a data fusion process that combines the
classifiers. A Bayesian model is proposed to estimate the threshold that the
classifiers require to determine the class of a given web page. We also develop a Bayesian approach,
i.e., a fusion algorithm, to fuse the classification results from the textual and visual content.
To sum up, new models and effective algorithms for document data mining are
proposed in this thesis. A graph model, dual wing harmonium models, and a multilevel structured
document model are used to model the semantics of documents, especially large ones.
Corresponding algorithms are developed for document retrieval. A coarse-to-fine
framework is reported to efficiently thwart plagiarism. A textual and visual content-based
methodology is introduced for anti-phishing applications. |
| Online Catalog Link: | http://lib.cityu.edu.hk/record=b3947866 |
| Appears in Collections: | EE - Doctor of Philosophy |