Run Run Shaw Library, City University of Hong Kong

Please use this identifier to cite or link to this item: http://dspace.cityu.edu.hk/handle/2031/8219
Full metadata record
DC Field | Value | Language
dc.contributor.author | Chu, Yilei | en_US
dc.date.accessioned | 2016-01-07T01:24:09Z |
dc.date.accessioned | 2017-09-19T09:14:52Z |
dc.date.accessioned | 2019-02-12T07:33:20Z | -
dc.date.available | 2016-01-07T01:24:09Z |
dc.date.available | 2017-09-19T09:14:52Z |
dc.date.available | 2019-02-12T07:33:20Z | -
dc.date.issued | 2015 | en_US
dc.identifier.other | 2015eecy303 | en_US
dc.identifier.uri | http://144.214.8.231/handle/2031/8219 | -
dc.description.abstract | Algorithms for document similarity comparison are widely applied in areas such as (1) plagiarism checking in academic libraries, (2) redundancy elimination in large collections of web pages, and (3) web search engines like Google. However, past research relies on huge databases consisting of millions or billions of web pages, so the recall reported in those experiments usually cannot be justified. This final year project explores three of the best-known English text similarity detection techniques: (1) cosine distance, (2) shingling and (3) SimHash. The database consists of short news articles crawled from BBC News. Both accuracy and efficiency are evaluated in order to find the most suitable algorithm for a short-text search engine. All experiments were conducted using Python and Java, with support from open-source libraries such as NLTK, the Stanford POS Tagger, and Guava. | en_US
dc.rights | This work is protected by copyright. Reproduction or distribution of the work in any format is prohibited without written permission of the copyright owner. | en_US
dc.rights | Access is restricted to CityU users. | en_US
dc.title | Document Similarity Comparison | en_US
dc.contributor.department | Department of Electronic Engineering | en_US
dc.description.supervisor | Supervisor: Prof. CHOW, Tommy W S; Assessor: Prof. CHEN, Guanrong | en_US
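
For readers unfamiliar with the three techniques named in the abstract, the following is a minimal Python sketch of each. It is not the project's implementation, whose full text is restricted. The function names, the regex tokenizer, the 3-word shingle size, the 64-bit fingerprint width and the use of MD5 as the per-token hash are all assumptions chosen for illustration; the abstract specifies none of them. Cosine distance is taken here as 1 minus the cosine similarity of raw term-frequency vectors.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens.
    (Assumed tokenizer; the project used NLTK, details unknown.)"""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of a and b."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def shingles(text, k=3):
    """Set of k-word shingles (contiguous word k-grams). k=3 is an assumption."""
    words = tokenize(text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard resemblance of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def simhash(text, bits=64):
    """SimHash fingerprint: every token votes +weight or -weight on each bit.
    MD5 is an assumed token hash; bits must be a multiple of 8, at most 128."""
    votes = [0] * bits
    for token, weight in Counter(tokenize(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its votes are positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the project's BBC corpus.
    d1 = "The cat sat on the mat"
    d2 = "The cat sat on the rug"
    print(cosine_distance(d1, d2))
    print(jaccard(d1, d2))
    print(hamming_distance(simhash(d1), simhash(d2)))
```

Higher cosine similarity and Jaccard values, and a lower Hamming distance between SimHash fingerprints, indicate more similar texts.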
Appears in Collections: Electrical Engineering - Undergraduate Final Year Projects

Files in This Item:
File | Size | Format
fulltext.html | 145 B | HTML


Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.
