Automatic Recognition of Blog Entries |
Casper Ahrensberg
|
| Abstract | The goal of this thesis is to determine whether a page can be recognized as a specific type of page based on the structure of its HTML elements. It does so by using Tree Edit Distance to generate a matching structure from those pages' structures, which can in turn be tested against when an arbitrary page is presented, thus answering whether the page is a WordPress blog or not. The algorithm used is Restricted Top-Down Mapping, which imposes restrictions enforcing the DocType of HTML while mapping from one tree to another. A series of tests is run on the algorithm to determine its precision when answering whether a site is a blog or not. |
| Type | Bachelor thesis [Academic thesis] |
| Year | 2012 |
| Publisher | Technical University of Denmark, DTU Informatics, E-mail: reception@imm.dtu.dk |
| Address | Asmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark |
| Series | IMM-B.Sc.-2010-23 |
| Electronic version(s) | [pdf] |
| Publication link | http://www.imm.dtu.dk/English.aspx |
| BibTeX data | [bibtex] |
| IMM Group(s) | Computer Science & Engineering |
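
The abstract describes comparing HTML trees with a top-down tree edit distance. As a rough illustration only, the sketch below computes a classic Selkow-style top-down edit distance with unit costs, in which a node can be matched only if its parent is matched and whole subtrees are inserted or deleted. It is not the thesis's Restricted Top-Down Mapping implementation: the `Node` class, the function names, the unit costs, and the example trees are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal HTML element: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree; used as the cost of inserting or deleting it."""
    return 1 + sum(size(child) for child in node.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Selkow-style top-down edit distance with unit costs (an assumption).

    The roots are always mapped to each other; children are aligned with a
    sequence-edit dynamic program in which a child subtree is either deleted
    whole, inserted whole, or mapped recursively.
    """
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree from a
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree from b
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def post() -> Node:
    """Hypothetical blog-entry fragment: a heading followed by a paragraph."""
    return Node("div", [Node("h2"), Node("p")])


# Hypothetical "blog" template and two candidate pages.
template = Node("html", [Node("head"), Node("body", [post(), post()])])
same_page = Node("html", [Node("head"), Node("body", [post(), post()])])
extra_post = Node("html", [Node("head"), Node("body", [post(), post(), post()])])
table_page = Node(
    "html",
    [Node("head"), Node("body", [Node("table", [Node("tr", [Node("td")])])])],
)

dist_same = top_down_distance(template, same_page)
dist_extra = top_down_distance(template, extra_post)
dist_table = top_down_distance(template, table_page)
```

A classifier along the lines the abstract suggests could compare such a distance against a threshold, with a structurally similar page (here `extra_post`) scoring lower than an unrelated layout (here `table_page`). How the thesis derives its matching structure and what threshold it uses are not stated in this record.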