Machine Learning for Tagging of Biomedical Literature

Caroline Persson
Abstract	Introduction: Named entity recognition plays a major role in the growing challenge of extracting gene terms from the literature. Gene terms exist in many variants, and the number of gene terms is growing continuously. The goal of this project is to understand how the tagging of gene terms works, in particular the algorithms behind the recognition systems. A good understanding of the learning mechanisms is necessary for improving existing methods. Methods: The steps for training a Naive Bayes classifier are explained in detail throughout the report. Examples of how training computes the different probabilities, and of how the classifier handles raw unlabelled text, are shown and evaluated. Furthermore, a Naive Bayes classifier is implemented in Python, and its performance is compared with results on similar tasks. Conclusion: A Naive Bayes classifier is a useful tool for named entity recognition of gene terms. Its performance depends on the selection of features, and the final Python implementation achieves an F-measure of 0.58. This is comparable to, though at the lower end of, the results from BioCreative I challenge task 1.A.
Type	Bachelor thesis [Academic thesis]
Year	2012
Publisher	Technical University of Denmark, DTU Informatics, E-mail: reception@imm.dtu.dk
Address	Asmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark
Series	IMM-B.Sc.-2012-33
Electronic version(s)	[pdf]
Publication link	http://www.imm.dtu.dk/English.aspx
IMM Group(s)	Intelligent Signal Processing