Machine Learning for Tagging of Biomedical Literature

Caroline Persson

Abstract

Introduction:
Named entity recognition of gene terms plays a major role in meeting the growing challenge of extracting gene terms from the literature. Gene terms exist in many variants, and the number of gene terms is growing continuously. The goal of this project is to understand how the tagging of gene terms works, in particular the algorithms behind the recognition systems. A good understanding of the learning mechanisms is necessary for improving existing methods.

Methods:
The steps for training a Naive Bayes classifier are explained in detail throughout the report. Examples of how the training computes the different probabilities, and of how the classifier handles raw unlabelled text, are shown and evaluated. Furthermore, a Naive Bayes classifier is implemented in Python, and its performance is compared to results on similar tasks.
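
To make the training and classification steps concrete, the following is a minimal sketch of a Naive Bayes token tagger. It is not the implementation from the thesis: the feature set in `features`, the labels `GENE` and `O`, the add-one smoothing, and the toy training data are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def features(token):
    # Hypothetical feature set for illustration; the thesis may use other features.
    return {
        "lower": token.lower(),
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
        "suffix2": token[-2:].lower(),
    }


def train(labelled_tokens):
    """Count labels and (label, feature, value) occurrences in labelled data."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values = defaultdict(set)              # feature name -> values seen in training
    for token, label in labelled_tokens:
        label_counts[label] += 1
        for name, value in features(token).items():
            feature_counts[(label, name)][value] += 1
            values[name].add(value)
    return label_counts, feature_counts, values


def classify(token, model):
    """Return the label with the highest log posterior for one token."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior P(label)
        for name, value in features(token).items():
            seen = feature_counts[(label, name)][value]
            # Add-one smoothing (an assumption); the extra +1 in the
            # denominator reserves probability mass for unseen values.
            score += math.log((seen + 1) / (count + len(values[name]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example only.
toy_data = [
    ("BRCA1", "GENE"), ("p53", "GENE"), ("TP53", "GENE"), ("IL-2", "GENE"),
    ("the", "O"), ("protein", "O"), ("binds", "O"),
    ("to", "O"), ("of", "O"), ("expression", "O"),
]
model = train(toy_data)
print(classify("CDK2", model))  # token not seen during training
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied, and the smoothing keeps a single unseen feature value from forcing a label's probability to zero.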

Conclusion:
A Naive Bayes classifier is a useful tool for named entity recognition of gene terms. The performance depends on the selection of features, and the final Python implementation achieves an F-measure of 0.58. This is comparable to, though at the lower end of, the results from the BioCreative I challenge, task 1.A.
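
For reference, the sketch below shows how an F-measure can be computed from counts of correct and incorrect tags, assuming the balanced F-measure (the harmonic mean of precision and recall). The function name and the counts in the example call are illustrative and are not taken from the thesis.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # terms the tagger marked
    actual = true_positives + false_negatives     # terms in the gold standard
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only, not results from the thesis.
print(f_measure(true_positives=50, false_positives=50, false_negatives=50))
```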
Type: Bachelor thesis [Academic thesis]
Year: 2012
Publisher: Technical University of Denmark, DTU Informatics, E-mail: reception@imm.dtu.dk
Address: Asmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark
Series: IMM-B.Sc.-2012-33
Publication link: http://www.imm.dtu.dk/English.aspx
IMM Group(s): Intelligent Signal Processing