Classification of Protein Sequences using Markov Models

Claus Thomsen, Simon Larsen

Abstract: This project addresses a specific classification problem in bioinformatics and biology. The problem, typically referred to as secondary structure prediction, concerns how the structure of protein sequences may be classified into a number of predefined structure classes.

This project analyses the possible use of Markov models for this classification problem. Markov models are statistical models which may be used to infer the different structure classes for protein sequences based on some training data.
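As a rough illustration of the idea, the sketch below trains one first-order Markov chain per structure class and assigns a sequence window to the class whose chain gives it the highest log-likelihood. This is not the thesis's actual model: the class labels (`H`, `E`, `C`), the add-alpha smoothing, the function names, and the toy training segments are all assumptions made up for this example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def train_chain(segments, alpha=1.0):
    """Estimate first-order transition log-probabilities.

    Add-alpha smoothing (an assumption of this sketch) keeps unseen
    transitions from getting zero probability.
    """
    counts = {a: {b: alpha for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for seg in segments:
        for a, b in zip(seg, seg[1:]):
            counts[a][b] += 1
    log_p = {}
    for a, row in counts.items():
        total = sum(row.values())
        log_p[a] = {b: math.log(c / total) for b, c in row.items()}
    return log_p


def log_likelihood(chain, window):
    """Sum the transition log-probabilities along the window."""
    return sum(chain[a][b] for a, b in zip(window, window[1:]))


def classify(chains, window):
    """Return the class label whose chain best explains the window."""
    return max(chains, key=lambda label: log_likelihood(chains[label], window))


# Hypothetical toy segments, invented for illustration only.
training = {
    "H": ["AEELLKKA", "EELLKKAE", "LLKKAEEL"],
    "E": ["VIVTVYVI", "TVYVIVTV", "YVIVTVYV"],
    "C": ["GPSGNGPS", "SGNGPSGP", "NGPSGPGS"],
}
chains = {label: train_chain(segs) for label, segs in training.items()}
print(classify(chains, "AEELLKK"))
```

A real predictor would be trained on annotated protein data and would typically classify each residue from a window around it; the toy segments here only show the mechanics of training and comparing the per-class chains.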

The performance of the developed models is compared to other known models in the area, specifically the GOR models, which resemble Markov models in that both are statistical models.

The results show that Markov models may be used for secondary structure prediction, achieving better performance than simply guessing the most frequent structure class. Starting from a simple Markov model that predicts around 51% of the structures correctly, the model was extended and combined with other methods, resulting in a prediction accuracy of 57.2% (an increase of around 6 percentage points). The resulting model may be characterized as a first-generation secondary structure predictor.

Given sufficient time, several of the weaknesses found in the Markov models may be removed or at least minimized, possibly resulting in better performance. The models proposed in this project are not directly competitive with some of the best predictors currently available (which have prediction accuracies of around 80%). However, there may be room for further development by incorporating biological background knowledge into the proposed Markov models.

Type: Master's thesis [Academic thesis]
Year: 2004
Publisher: Informatics and Mathematical Modelling, Technical University of Denmark, DTU
Address: Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby
Series: IMM-Thesis-2004-15
Note: Supervised by ass. professor Paul Fischer
IMM Group(s): Computer Science & Engineering