@MASTERSTHESIS{IMM2003-02799, author = "S. Birch", title = "Statistical text modelling - Towards modelling of matching problems", year = "2003", keywords = "Statistical text modelling, Probabilistic Latent Semantic Indexing, {PLSI}, {PLINK}, Supervised {PLSI}, Job matching", school = "Informatics and Mathematical Modelling, Technical University of Denmark, {DTU}", address = "Richard Petersens Plads, Building 321, {DK}-2800 Kgs. Lyngby", type = "", url = "http://www2.compute.dtu.dk/pubdb/pubs/2799-full.html", abstract = "Intelligente computer-baserede systemer, som behandler tekstdokumenter er et aktivt forskningsomr{\aa}de. S{\o}gemaskiner er h{\o}jst sandsynligt det bedst kendte eksempel p{\aa}, at anvendelsen af statistisk analyse af ustruktureret tekst kan v{\ae}re s{\ae}rdeles effektiv. Men der findes et v{\ae}ld af andre mulige anvendelser, hvor statistisk tekstmodellering vil v{\ae}re velegnet. Et andet eksempel er matching af jobans{\o}gere og job. Arbejdsgivere bruger mange ressourcer p{\aa} at finde den rette ans{\o}ger, og jobans{\o}gere bruger tilsvarende mange ressourcer i s{\o}gningen efter det rette job. Intelligente systemer, som automatisk og p{\aa}lideligt kunne matche ans{\o}gere og job ville spare mange kr{\ae}fter. Dette eksamensprojekt er blevet inspireret af jobmatching-problemet og lignende matching-problemer. M{\aa}let er at finde frem til velegnede statistiske modeller for matching-problemer og at udf{\o}re en grundig analyse af disse modeller - b{\aa}de teoretisk og eksperimentelt. S{\o}gemaskiner h{\o}rer til indenfor Information Retrieval. Vores studier af dette omr{\aa}de viser, at der har v{\ae}ret en del fokus p{\aa} modeller, som besk{\ae}ftiger sig med linkstrukturer. Med linkstrukturer menes dokumenter, som er indbyrdes forbundet med links. I forbindelse med matching-problemer kan det vise sig meget interessant, fordi et link minder meget om et match. 
Probabilistic Latent Semantic Indexing (PLSI) er en model som kan udvides til, udover ord, ogs{\aa} at repr{\ae}sentere links. I dette eksamensprojekt bliver det vist, at {PLSI} er i stand til at {''}forst{\aa}'' semantik i tekstdokumenter. Udvides {PLSI} dern{\ae}st med links, er modellen ogs{\aa} til en vis grad i stand til at forudsige hvilke links et dokument burde have ud fra dets tekstindhold. Dette virker bedst, hvis linkstrukturen er rimelig t{\ae}t. Intentionen er, at den indsigt i statistisk tekstmodellering, som formidles i denne rapport vil vise sig nyttig under design af intelligente computersystemer for matching-problemer. In English: Automatic systems dealing with text are an active area of research. Search engines are probably the most successful and well-known area where statistical analysis of unstructured text documents has proven very useful. But there is a vast number of other potential applications that are suitable for statistical text modelling. One such application is the matching of job applicants with job offers. Large amounts of human resources are spent today on the search for future employees and - from the applicant's perspective - on the search for the right job. Successful automatic systems would be very welcome in this area. This master's thesis has been motivated by the job matching problem and related matching problems. The objective is to find suitable models for matching problems and to perform a thorough analysis of these models from both a theoretical and an experimental perspective. Search engines belong to the area of Information Retrieval. In Information Retrieval there has been a lot of focus on models dealing with link structures - that is, documents interconnected by links - and links are interesting because a link is very similar to a match. Probabilistic Latent Semantic Indexing (PLSI) is a model that can be extended to incorporate link information. 
This thesis shows that {PLSI} is capable of capturing semantics in text documents. Furthermore, when extended with link information, it is capable of predicting links to some degree in environments where link information is not too sparse. The hope is that the insight and the capabilities of the models presented here will prove useful when building automatic systems for matching problems or related problems." }