@PHDTHESIS{IMM2003-02821,
  author = "A. S. Have",
  title = "Datamining on distributed medical databases",
  year = "2003",
  school = "Informatics and Mathematical Modelling, Technical University of Denmark, {DTU}",
  address = "Richard Petersens Plads, Building 321, {DK-}2800 Kgs. Lyngby",
  type = "",
  note = "Supervisor: Lars Kai Hansen",
  url = "http://www2.compute.dtu.dk/pubdb/pubs/2821-full.html",
  abstract = "This Ph.D. thesis focuses on clustering techniques for Knowledge Discovery in Databases. Various data mining tasks relevant to medical applications are described and discussed. A general framework that combines data projection, data mining, and interpretation is presented. An overview of data projection techniques is given, with the main emphasis on applied Principal Component Analysis. For clustering, various Generalized Gaussian Mixture models are presented. Furthermore, the aggregated Markov model, which provides the cluster structure via the probabilistic decomposition of the Gram matrix, is proposed. Other data mining tasks described in this thesis are outlier detection and the imputation of missing data. The thesis presents two outlier detection methods: one based on the cumulative distribution and one based on a specially designated outlier cluster in connection with the Generalized Gaussian Mixture model. Two models for imputation of missing data, namely {K-}nearest neighbor and a Gaussian model, are suggested. Two techniques are developed for interpreting a cluster structure. If cluster labels are available, the clusters can be understood via the confusion matrix. If the data is unlabeled, it is possible to generate keywords (in the case of textual data) or key patterns as an informative representation of the obtained clusters. The methods are applied to simple artificial data sets as well as collections of textual and medical data.
Translated from the Danish abstract: This Ph.D. thesis focuses on cluster analysis techniques for extracting knowledge from databases. The thesis presents and discusses various data mining problems relevant to medical applications. In particular, a general framework is presented that combines data projection, data mining, and automatic interpretation. Within data projection, a number of techniques are reviewed, with special emphasis on applied Principal Component Analysis. A number of generalized Gaussian mixture models are proposed for cluster analysis. In addition, an aggregated Markov model is proposed, which estimates the cluster structure via decomposition of a probability-based Gram matrix. The thesis further describes two other data mining problems, namely outlier detection and imputation of missing data. The thesis presents outlier detection methods based partly on cumulative distributions and partly on the introduction of a special outlier cluster in connection with the generalized Gaussian mixture model. For imputation of missing data, two methods are presented, based on either a {K-}nearest-neighbor or a Gaussian model assumption. Two methods have been developed for automatic interpretation of the cluster structure. When cluster annotations (labels) are available, the confusion matrix forms the basis for the interpretation. If such annotations are not available, it is possible to generate keywords (in the case of text data) or, more generally, key patterns, which thereby contribute to the interpretation of the clusters. The proposed methods are tested on simple artificial data sets as well as collections of text and medical data."
}