Textual Similarity

Aqeel Hussain

Abstract: The purpose of this thesis is to identify methods for measuring textual similarity. Many solutions to this problem have been proposed in the literature. Three of these proposals are discussed in depth and implemented. Two focus on syntactic similarity and one focuses on semantic similarity. The two syntactic algorithms represent the edit distance and vector space model approaches; they use Levenshtein distance and n-grams, respectively. The semantic algorithm is an ontology-based algorithm that looks up words in WordNet and uses this tool to estimate the relatedness between two given texts. The performance of these implementations is tested and discussed.
The thesis concludes that performance differs considerably between the algorithms and that each performs well in its respective field. No single algorithm can be singled out as outperforming the others, so an implementation has to be chosen based on the task at hand.
Type: Bachelor thesis [Academic thesis]
Year: 2012
Publisher: Technical University of Denmark, DTU Informatics, E-mail: reception@imm.dtu.dk
Address: Asmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark
Series: IMM-B.Sc.-2012-16
Note: Supervised by Professor Robin Sharp, ris@imm.dtu.dk, DTU Informatics
Electronic version(s): [pdf]
Publication link: http://www.imm.dtu.dk/English.aspx
IMM Group(s): Computer Science & Engineering