Automated Indexing of Danish Online Shops

Søren Lisby Schmidt

Abstract: This thesis presents an approach to the automated discovery of online shops on the Internet. A prototype system with web crawling and web page classification functionality, based on supervised machine learning, is implemented in the Python programming language.
The Navne & Numre Erhverv (NNE) corporate register is proposed as a source of company websites that could potentially be online shops, and therefore should be considered for crawling and classification.
The positive and negative training datasets come from different sources. Existing e-mærket member shops serve as the positively labeled data, while the negative dataset (non-shop websites) is obtained from the Open Directory Project.
Two separate supervised classifiers are used: naïve Bayes with the Bernoulli event model and a decision tree. Both rely on the implementations in the Scikit-learn Python package. The prototype system showed promising results in shop classification, with average F1-scores of 0.94 for the Bernoulli naïve Bayes classifier and 0.95 for the decision tree classifier.
A number of recommendations for further work and improvements are given, which e-mærket should consider if it pursues this approach further.
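
To make the Bernoulli event model and the F1-score mentioned in the abstract concrete, the following is a minimal standard-library sketch, not the thesis implementation (which, per the abstract, relied on Scikit-learn). Each page is reduced to the set of terms it contains, and both present and absent terms contribute to the class score. The function names, the smoothing parameter `alpha`, and the toy token sets are assumptions made purely for illustration.

```python
import math


def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model; each doc is a set of terms."""
    vocab = sorted(set().union(*docs))
    model = {"vocab": vocab, "log_prior": {}, "p": {}}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n = len(class_docs)
        model["log_prior"][c] = math.log(n / len(docs))
        # Smoothed probability that a term occurs in a page of class c.
        model["p"][c] = {
            t: (sum(t in d for d in class_docs) + alpha) / (n + 2 * alpha)
            for t in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest log posterior for one page."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        score = log_prior
        for t in model["vocab"]:
            p = model["p"][c][t]
            # Bernoulli event model: absent terms count as evidence too.
            score += math.log(p if t in doc else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 1 = shop, 0 = non-shop.
train_docs = [
    {"cart", "checkout", "price"},
    {"cart", "shipping", "price"},
    {"news", "article", "contact"},
    {"about", "article", "contact"},
]
train_labels = [1, 1, 0, 0]
model = train_bernoulli_nb(train_docs, train_labels)
```

Calling `predict(model, {"cart", "price"})` classifies a new page, and `f1_score` compares predicted labels with true labels on a held-out set. A real system would need far more data and a proper feature extraction step.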
Type: Master's thesis [Industrial collaboration]
Year: 2013
Publisher: Technical University of Denmark, DTU Compute, E-mail: compute@compute.dtu.dk
Address: Matematiktorvet, Building 303-B, DK-2800 Kgs. Lyngby, Denmark
Series: M.Sc.-2013-25
Note: DTU supervisor: Finn Årup Nielsen, faan@dtu.dk, DTU Compute. Thesis not publicly available.
Publication link: http://www.compute.dtu.dk/English.aspx
IMM Group(s): Intelligent Signal Processing

