Please use this identifier to cite or link to this item:
http://dspace.cityu.edu.hk/handle/2031/9247
Title: | Forum Data Parsing System |
Authors: | Yum, Tsz Ho |
Department: | Department of Computer Science |
Issue Date: | 2019 |
Supervisor: | Supervisor: Dr. Chow, Chi Yin Ted; First Reader: Dr. Chan, Mang Tang; Second Reader: Prof. Tan, Kay Chen |
Abstract: | Data analysis has become very important in various fields, such as marketing, supply chain, insurance and retail. Web forums are a large source of data thanks to today's advanced technology. Many people express their viewpoints in forums, so forums can be a research target for data analysts. However, there is no efficient method or tool to extract and manage such forum data, and downloading posts manually is time-consuming and troublesome. The aim of this study is to introduce a new method that helps data analysts extract large numbers of posts for analysis. The proposed system needs to be adaptable to different types of forum systems, minimize extraction time, maintain the extracted posts, and detect and deal with changes in the target forum. This study examined feature-based and learning-based data extraction methods to determine the best way to maximize extraction accuracy and minimize human effort. Algorithms such as a supervised wrapper generation algorithm are used to parse the HTML document and generate the learning outcome. Parallel computing and incremental update techniques are applied to reduce the time needed to extract forum posts. Webpage rendering engines are also used to handle documents that are generated dynamically by client-side scripts. Six popular forums were selected for testing the system; the tests showed that 99% of their posts can be completely extracted, and the remaining 1% contains valueless information such as embedded advertisements. In conclusion, the proposed system performs well and generates satisfactory results in web forum post extraction. Detection of embedded advertisements will be required in further development. |
Appears in Collections: | Computer Science - Undergraduate Final Year Projects |
Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.