Please use this identifier to cite or link to this item:
http://dspace.cityu.edu.hk/handle/2031/8976
Title: | Design and Evaluation of improved approximate matching algorithm for string similarity join |
Authors: | Xiao, Zhen |
Department: | Department of Electronic Engineering |
Issue Date: | 2018 |
Supervisor: | Supervisor: Dr. Pao, Derek C W; Assessor: Mr. Ng, Kai Tat |
Abstract: | Data cleaning systems often need to compute a string similarity join over a large data set, which returns all pairs of similar strings in the collections. Computing this join efficiently is a significant challenge. Two strings are considered similar if the edit distance between them is less than a given threshold that depends on the string length. A well-known tabulation method computes the edit distance using a matrix built from the two input strings. Existing work improves on this algorithm based on the observation that only certain elements on some diagonals are essential to the calculation. Going one step further, this project refines the algorithm by exploiting the fact that we only need to determine whether two strings are similar, not their exact edit distance. A candidate string pair can therefore be rejected during the calculation as soon as its distance is found to exceed the threshold. This improvement reduces the amount of computation and so speeds up building the similarity join. Experiments compare the improved algorithm with other methods of computing edit distance and show that it significantly reduces running time. |
Appears in Collections: | Electrical Engineering - Undergraduate Final Year Projects |
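
The early-rejection idea described in the abstract can be illustrated with a minimal sketch. This is not the project's own algorithm, which is not reproduced in this record; it is a standard banded dynamic-programming check, written under the assumption that "similar" means an edit distance of at most `k` (the abstract says "less than" a threshold, so a caller would pass the threshold minus one). The function name `within_edit_distance` and the parameter `k` are hypothetical and chosen only for illustration.

```python
def within_edit_distance(a: str, b: str, k: int) -> bool:
    """Return True if the edit distance between a and b is at most k.

    Illustrative sketch only: a banded tabulation with early rejection,
    not the algorithm evaluated in the project.
    """
    if k < 0:
        return False
    n, m = len(a), len(b)
    # The length difference is a lower bound on the edit distance.
    if abs(n - m) > k:
        return False

    INF = k + 1  # any value above k is treated as "too far"
    # prev[j] is the distance between a[:i-1] and b[:j], capped at INF.
    prev = [j if j <= k else INF for j in range(m + 1)]

    for i in range(1, n + 1):
        # Only cells with |i - j| <= k can hold a value <= k.
        lo = max(1, i - k)
        hi = min(m, i + k)
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # deletion from a
                cur[j - 1] + 1,      # insertion into a
                INF,                 # cap so values never grow past k + 1
            )
        # Early rejection: if every cell in the band exceeds k,
        # the final distance must exceed k as well.
        if min(cur[max(0, i - k):hi + 1], default=INF) > k:
            return False
        prev = cur

    return prev[m] <= k
```

For example, `within_edit_distance("kitten", "sitting", 3)` accepts the pair, while the same call with `k = 2` rejects it without computing the exact distance. Restricting work to the diagonals within `k` of the main diagonal is what keeps the per-pair cost low when the threshold is small relative to the string length.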
Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.