|Title:||Machine Learning Application: Classification and Summarization of Legal Documents|
|Department:||Department of Computer Science|
|Supervisor:||Supervisor: Dr. Chun, Hon Wai Andy; First Reader: Dr. Hou, Junhui; Second Reader: Dr. Yu, Yuen Tak|
|Abstract:||Text mining is a widely researched area of machine learning. With models such as text categorization (Moulinier, 1996), the Naïve Bayes model (McCallum and Nigam, 1998) and word-to-vector models (Mikolov, Chen, Corrado and Dean, 2013), text mining has become a versatile solution to many real-world problems. The wide range of available models offers broad opportunities for application, and numerous studies have addressed the processing of legal case documents. Each model incorporates different algorithms and approaches, which gives it its own particular strengths. Although many models have been established, their acceptance in real-life situations remains relatively low (Remus and Levy, 2015). One of the reasons these models are rejected is the limited ability of machine learning to process highly contextual information. This research project aims to use text mining and machine learning to address that concern. Legal documents are often complicated and difficult for laypeople to understand (Howe and Wogalter, 1994). The project sets out to create a machine learning system that produces categorized and summarized information derived from the original legal documents, so that the simplified output makes those documents easier to understand. The research aims to build a predictive machine learning model that chains a series of algorithms into a comprehensive automatic summarization system. The Latent Dirichlet Allocation algorithm (Blei, Ng and Jordan, 2003) is used to identify the major topics of the legal documents. The word2vec technique (Mikolov et al., 2013) is then applied to convert sentences into vector matrices, generating a feature space in which the LexRank algorithm (Erkan and Radev, 2004) computes a connectivity matrix between sentences, based on the IDF-modified cosine formula, in order to summarize the corpus. In the final stage, the extracted information is consolidated into a single coherent document.|
|Appears in Collections:||Computer Science - Undergraduate Final Year Projects |
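The LexRank step named in the abstract can be illustrated with a minimal sketch. This is not the project's implementation. It assumes plain whitespace tokenization, uses bag-of-words term frequencies rather than word2vec vectors, and treats each sentence as a document when computing IDF. The function names (`idf_modified_cosine`, `lexrank`, `summarize`) and the `threshold` and `damping` defaults are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def idf_modified_cosine(tf_x, tf_y, idf):
    """IDF-modified cosine similarity between two term-frequency maps."""
    shared = set(tf_x) & set(tf_y)
    numerator = sum(tf_x[w] * tf_y[w] * idf[w] ** 2 for w in shared)
    norm_x = math.sqrt(sum((tf_x[w] * idf[w]) ** 2 for w in tf_x))
    norm_y = math.sqrt(sum((tf_y[w] * idf[w]) ** 2 for w in tf_y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return numerator / (norm_x * norm_y)


def lexrank(sentences, threshold=0.1, damping=0.15, iterations=50):
    """Score sentences by centrality in a thresholded similarity graph."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Assumption: each sentence counts as one document for IDF.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log(n / df) for w, df in doc_freq.items()}
    tfs = [Counter(tokens) for tokens in tokenized]

    # Connectivity matrix: 1 where similarity exceeds the threshold.
    # Self-links are kept so every sentence has degree >= 1.
    adjacency = [
        [
            1.0 if i == j
            or idf_modified_cosine(tfs[i], tfs[j], idf) > threshold
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    degree = [sum(row) for row in adjacency]

    # Power iteration on the damped, row-normalized graph.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            damping / n
            + (1 - damping)
            * sum(adjacency[v][u] / degree[v] * scores[v] for v in range(n))
            for u in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = lexrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    clauses = [
        "the tenant shall pay rent monthly",
        "the tenant shall pay rent on time",
        "the landlord may inspect the premises",
        "rent paid by the tenant is due monthly",
    ]
    for sentence in summarize(clauses, k=2):
        print(sentence)
```

The scores form a probability distribution over sentences, so they can be compared directly across the corpus. Replacing the term-frequency maps with sentence vectors, as the abstract describes, would change only the similarity function.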
Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.