@MASTERSTHESIS{IMM2013-06549,
  author = "S. L. Schmidt",
  title = "Automated Indexing of {Danish} Online Shops",
  year = "2013",
  school = "Technical University of Denmark, {DTU} Compute, {E-}mail: compute@compute.dtu.dk",
  address = "Matematiktorvet, Building 303{-B,} {DK-}2800 Kgs. Lyngby, Denmark",
  type = "",
  note = "{DTU} supervisor: Finn {\AA}rup Nielsen, faan@dtu.dk, {DTU} Compute. Thesis not publicly available.",
  url = "http://www.compute.dtu.dk/English.aspx",
  abstract = "This thesis presents an approach to automated discovery of online shops on the Internet. A prototype system with web crawling and web page classification functionality based on supervised machine learning is implemented in the Python programming language. The Navne \& Numre Erhverv (NNE) corporate register is proposed as a source of company websites that could potentially be online shops and should therefore be considered for crawling and classification. The positive and negative training datasets for classification are obtained from different sources. Existing e-m{\ae}rket member shops are used as the positively labeled data. The negative dataset (non-shop websites) is obtained from the Open Directory Project. Two separate supervised classifiers are used: na{\"\i}ve Bayes with the Bernoulli event model, and a decision tree. Both classifiers use the implementations in the Scikit-learn Python package. The implemented prototype system showed promising results in shop classification, with average F1-scores of 0.94 for the Bernoulli na{\"\i}ve Bayes classifier and 0.95 for the decision tree classifier. A number of recommendations for further work and improvements are given, which should be considered by e-m{\ae}rket if they are to pursue this approach further."
}