Automated Indexing of Danish Online Shops
|Søren Lisby Schmidt|
|Abstract||This thesis presents an approach to automated discovery of online shops on the Internet. A prototype system with web crawling and web page classification functionality based on supervised machine learning is implemented in the Python programming language.|
The Navne & Numre Erhverv (NNE) corporate register is proposed as a source of company websites that could potentially be online shops, and therefore should be considered for crawling and classification.
The positive and negative training datasets for classification come from different sources. Existing e-mærket member shops are used as the positively labeled data. The negative dataset (non-shop websites) is obtained from the Open Directory Project.
Two separate supervised classifiers are used: naïve Bayes with a Bernoulli event model, and a decision tree. Both rely on the implementations in the Scikit-learn Python package. The implemented prototype system showed promising results in shop classification, with average F1-scores of 0.94 for the Bernoulli naïve Bayes classifier and 0.95 for the decision tree classifier.
A number of recommendations for further work and improvements are given, which e-mærket should consider if they are to pursue this approach further.
|Type||Master's thesis [Industrial collaboration]|
|Publisher||Technical University of Denmark, DTU Compute, E-mail: email@example.com|
|Address||Matematiktorvet, Building 303-B, DK-2800 Kgs. Lyngby, Denmark|
|Note||DTU supervisor: Finn Årup Nielsen, firstname.lastname@example.org, DTU Compute. Thesis not publicly available.|
|IMM Group(s)||Intelligent Signal Processing|
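The abstract names a naïve Bayes classifier with a Bernoulli event model and reports F1-scores. As a minimal sketch of what those two terms mean, the following pure-Python example scores each page by which vocabulary words are present or absent and computes an F1-score. It is not the thesis implementation, which used Scikit-learn; the class `BernoulliNaiveBayes`, the helpers `tokenize` and `f1_score`, the smoothing parameter `alpha`, and the toy page texts are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Bernoulli event model: only presence or absence of a word matters,
    # so a page is reduced to the set of its lowercase tokens.
    return set(text.lower().split())


class BernoulliNaiveBayes:
    """Illustrative from-scratch classifier (hypothetical, not the thesis code)."""

    def fit(self, docs, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = sorted(set().union(*(tokenize(d) for d in docs)))
        n_docs = Counter(labels)
        self.log_prior = {c: math.log(n_docs[c] / len(docs)) for c in self.classes}
        # Count, per class, how many documents contain each word.
        doc_freq = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            doc_freq[label].update(tokenize(doc))
        # Laplace-smoothed estimate of P(word present | class).
        self.p = {
            c: {w: (doc_freq[c][w] + alpha) / (n_docs[c] + 2 * alpha) for w in self.vocab}
            for c in self.classes
        }
        return self

    def predict(self, doc):
        present = tokenize(doc)  # words outside the training vocabulary are ignored

        def score(c):
            total = self.log_prior[c]
            for w in self.vocab:
                p = self.p[c][w]
                # Absent words also contribute, via log(1 - p).
                total += math.log(p if w in present else 1.0 - p)
            return total

        return max(self.classes, key=score)


def f1_score(y_true, y_pred, positive):
    # F1 is the harmonic mean of precision and recall for the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for page text; the thesis used real shop and non-shop websites.
train_docs = [
    "add to cart checkout price",
    "buy now free shipping cart",
    "price basket checkout payment",
    "news article about weather",
    "club members meeting schedule",
    "about our history and staff",
]
train_labels = ["shop", "shop", "shop", "other", "other", "other"]

model = BernoulliNaiveBayes().fit(train_docs, train_labels)
test_docs = ["cart price buy", "weather news staff"]
test_labels = ["shop", "other"]
predictions = [model.predict(d) for d in test_docs]
print(predictions, f1_score(test_labels, predictions, positive="shop"))
```

The distinguishing feature of the Bernoulli event model is visible in `predict`: every vocabulary word contributes to the score, whether it occurs on the page or not, whereas a multinomial model would only count the words that occur.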