City University of Hong Kong

Title: Developing effective algorithms for document data mining
Other Titles: Wen ben shu ju wa jue de gao xiao suan fa yan jiu
Authors: Zhang, Haijun (張海軍)
Department: Department of Electronic Engineering
Degree: Doctor of Philosophy
Issue Date: 2010
Publisher: City University of Hong Kong
Subjects: Data mining.
Text processing (Computer science)
Notes: CityU Call Number: QA76.9.D343 Z38 2010
xiv, 164 leaves : ill. 30 cm.
Thesis (Ph.D.)--City University of Hong Kong, 2010.
Includes bibliographical references (leaves 156-163)
Type: thesis
Abstract: This thesis studies novel models, effective algorithms, and their applications for document data mining. The Internet has undoubtedly become an indispensable part of daily life, with uses ranging from restaurant booking to technology research, and data mining techniques for Internet-related documents influence many fields. The thesis focuses on developing new models and efficient algorithms for document retrieval, plagiarism detection, and anti-phishing applications.
Internet access, for example through the World Wide Web (WWW), has made document retrieval increasingly demanding, as collecting and searching documents has become an integral part of many people's lives. Accuracy and speed are two key measures of an effective retrieval methodology. Existing document retrieval systems use statistical methods and natural language processing approaches combined with different document representations and query structures. Document retrieval has attracted considerable interest in the information retrieval community. It refers to finding documents similar to a given user's query, which can range from a full description of a document to a few keywords. Most widely used retrieval approaches are keyword-based search methods, in which untrained users provide a few keywords to a search engine and find the relevant documents in a returned list. Another type of document retrieval uses a query document to search for similar ones. Using an entire document as the query improves retrieval accuracy, but it is more computationally demanding than the keyword-based method. In addition to retrieval, document classification and clustering have become important for organizing massive amounts of document data, and they use similar feature extraction approaches.
Until now, most conventional models have used coarse document features, such as the terms in a document, as feature units. The connections among terms are usually overlooked, which loses important semantic information. There is therefore a need for more effective document representation schemes to improve the performance of document data mining. In this thesis, we first develop a graph model for document representation that captures more semantic information; second, we develop two statistical models, namely dual wing harmonium models, that generate distributed latent representations of documents by modeling multiple features jointly; third, we introduce a new document similarity measure based on the Earth Mover's Distance.
The online environment, however, poses a severe challenge to textual intellectual property, because the Internet and computer technology have made it easy to disseminate knowledge across the world. People can search, copy, save, and reuse online sources with ease. Cut-and-paste plagiarism has become a growing concern in the education system. One difficulty in detecting plagiarism efficiently is finding the source with a fast query response, because people may copy from any one of millions of documents on the Internet, each of which usually contains thousands of words. In this thesis, we propose a coarse-to-fine framework to thwart plagiarism efficiently. Each document is represented by a multilevel structure, i.e., document-paragraph-sentence, and different signatures are constructed to represent the components at each level. Relevant document retrieval approaches that add, or rely solely on, local information to exploit rich semantics are introduced to retrieve the suspected sources. Plagiarism detection algorithms based on further sentence matching are then designed to identify the plagiarized sources.
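The coarse-to-fine idea described above can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not specify the signatures or matching rules, so this sketch assumes plain term-frequency vectors with cosine similarity for the coarse (document-level) stage and Jaccard overlap of word sets for the fine (sentence-level) stage. The function names, `top_k`, and `sentence_threshold` are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; an assumed stand-in for the thesis's signatures."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_levels(doc):
    """Document -> paragraphs (blank-line separated) -> sentences."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]
    return [[s.strip() for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()] for p in paragraphs]


def jaccard(s1, s2):
    """Jaccard overlap of the word sets of two sentences."""
    a, b = set(tokens(s1)), set(tokens(s2))
    return len(a & b) / len(a | b) if a | b else 0.0


def detect(suspect, corpus, top_k=2, sentence_threshold=0.8):
    """Coarse stage: keep the top_k most similar sources.
    Fine stage: report sentence pairs whose overlap reaches the threshold."""
    suspect_tf = Counter(tokens(suspect))
    ranked = sorted(
        corpus,
        key=lambda name: cosine(suspect_tf, Counter(tokens(corpus[name]))),
        reverse=True,
    )
    # Paragraph structure is used only for splitting here; sentences are flattened.
    suspect_sentences = [s for p in split_levels(suspect) for s in p]
    hits = []
    for name in ranked[:top_k]:
        source_sentences = [s for p in split_levels(corpus[name]) for s in p]
        for s in suspect_sentences:
            for t in source_sentences:
                if jaccard(s, t) >= sentence_threshold:
                    hits.append((name, s, t))
    return hits
```

The coarse stage keeps the expensive sentence-by-sentence comparison confined to a handful of candidate sources, which is the motivation the abstract gives for the multilevel design.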
Detection of phishing web pages, which are forged pages that mimic the pages of real web sites, is another major concern for the current information technology (IT) world. Malicious parties create phishing web pages to steal personal information such as bank account details, passwords, credit card numbers, and other financial data. Recognition of these pages has attracted much attention from security and software providers, financial institutions, and academic researchers, as phishing is a severe security and privacy problem that has had a large negative impact on the Internet world. In this thesis, we present a hybrid anti-phishing framework. The framework synthesizes multiple cues, i.e., textual and visual content, from a given web page and automatically reports a phishing web page using a text classifier, an image classifier, and a data fusion process that combines the two classifiers. A Bayesian model is proposed to estimate the threshold that the classifiers require to determine the class of a given web page. We also develop a Bayesian approach, i.e., a fusion algorithm, to fuse the classification results from the textual and visual content.
To sum up, this thesis proposes new models and effective algorithms for document data mining. A graph model, dual wing harmonium models, and a multilevel structured document model are used to model the semantics of documents, especially large ones. Corresponding algorithms are developed for document retrieval. A coarse-to-fine framework is presented to thwart plagiarism efficiently, and a methodology based on textual and visual content is introduced for anti-phishing.
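The fusion of textual and visual classification results mentioned in the abstract can be sketched as follows. The abstract does not give the fusion algorithm, so this is only one common Bayesian formulation, not necessarily the thesis's: it assumes each classifier outputs a posterior probability of phishing under a shared prior and that the two cues are conditionally independent given the class. The names `fuse_posteriors` and `is_phishing` are hypothetical, and the fixed `threshold` stands in for the threshold that the thesis estimates with a Bayesian model.

```python
def fuse_posteriors(p_text, p_image, prior=0.5):
    """Fuse two per-cue posteriors P(phishing | cue).

    Assumption (not from the thesis): the cues are conditionally independent
    given the class, so the fused posterior odds are the product of the
    per-cue odds divided by the prior odds.
    """
    for p in (p_text, p_image, prior):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")

    def odds(p):
        return p / (1.0 - p)

    fused_odds = odds(p_text) * odds(p_image) / odds(prior)
    return fused_odds / (1.0 + fused_odds)


def is_phishing(p_text, p_image, prior=0.5, threshold=0.5):
    """Report a page as phishing when the fused posterior reaches the threshold.
    The fixed threshold is a placeholder for an estimated one."""
    return fuse_posteriors(p_text, p_image, prior) >= threshold
```

Under these assumptions, two classifiers that agree reinforce each other: with a prior of 0.5, two scores of 0.8 fuse to 16/17, which is higher than either score alone.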
Appears in Collections: EE - Doctor of Philosophy

Items in CityU IR are protected by copyright, with all rights reserved, unless otherwise indicated.