Automatic Recognition of Blog Entries |
Casper Ahrensberg
|
Abstract | The goal of the thesis is to determine whether a page can be recognized as a specific type of page based on the structure of its HTML elements. It does so by using Tree Edit Distance to generate a matching structure from the structures of such pages, which can then be tested against an arbitrary page to answer whether that page is a WordPress blog or not. The algorithm used is Restricted Top Down Mapping, which imposes restrictions enforcing the DocType of HTML while mapping from one tree to another. A series of tests will be run on the algorithm to determine its precision when answering whether a site is a blog or not. |
Type | Bachelor thesis [Academic thesis] |
Year | 2012 |
Publisher | Technical University of Denmark, DTU Informatics, E-mail: reception@imm.dtu.dk |
Address | Asmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark |
Series | IMM-B.Sc.-2010-23 |
Note | |
Electronic version(s) | [pdf] |
Publication link | http://www.imm.dtu.dk/English.aspx |
BibTeX data | [bibtex] |
IMM Group(s) | Computer Science & Engineering |
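
The idea in the abstract, comparing the HTML element tree of a page against a reference structure with a top-down tree edit distance, can be illustrated with a minimal sketch. This is not the thesis's implementation and not the Restricted Top Down Mapping algorithm itself: it is a plain top-down edit distance in the style of Selkow, in which a node may only be mapped if its parent is mapped, and inserting or deleting a node costs the size of its whole subtree. The names `Node`, `size`, `top_down_distance`, and `normalized_distance` are hypothetical, as are the unit costs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One HTML element, reduced to its tag name and child elements."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Top-down edit distance with unit costs (an assumption).

    The two roots are always mapped to each other. Children are aligned
    in order with a sequence edit distance in which deleting or inserting
    a child costs its full subtree size, and matching two children costs
    their recursive distance.
    """
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),  # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),  # insert subtree
                dp[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


def normalized_distance(a: Node, b: Node) -> float:
    """Distance scaled by the combined tree size, so it lies in [0, 1]."""
    return top_down_distance(a, b) / (size(a) + size(b))


# A toy reference structure and a page that lacks one <p> element.
template = Node("html", [
    Node("head"),
    Node("body", [Node("div", [Node("h1"), Node("p")])]),
])
page = Node("html", [
    Node("head"),
    Node("body", [Node("div", [Node("h1")])]),
])
```

A page could then be labelled as matching the reference structure when `normalized_distance(page, template)` falls below some threshold. The threshold, like the cost model, would have to be chosen empirically and is not taken from the thesis.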