Detection of protein catalytic sites in the biomedical literature

Pac Symp Biocomput. 2013:433-44.

Abstract

This paper explores the application of text mining to the problem of detecting protein functional sites in the biomedical literature, and specifically considers the task of identifying catalytic sites in that literature. Through an analysis of two corpora in terms of their coverage of curated data sources, we provide strong evidence for the need for text mining techniques that address residue-level protein function annotation. We also explore the viability of building a text-based classifier for identifying protein functional sites, and identify the low coverage of curated data sources and the potential ambiguity of information about protein functional sites as challenges that must be addressed. Nevertheless, we produce a simple classifier that achieves a reasonable F-score of ∼69% on our full-text silver corpus on the first attempt to address this classification task. The work has application in computational prediction of the functional significance of protein sites as well as in curation workflows for databases that capture this information.
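
For readers unfamiliar with the metric, the F-score reported above combines precision and recall into a single number. The sketch below is a minimal illustration, assuming the balanced F1 variant (the abstract does not say which variant was used); the function name `f_score` and the counts passed to it are hypothetical and are not taken from the paper.

```python
def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Return the F-beta score computed from raw classification counts.

    tp: true positives, fp: false positives, fn: false negatives.
    beta=1.0 gives the balanced F1 score (an assumption here; the
    abstract does not state which F-score variant was used).
    """
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts chosen only for illustration, not reported results.
print(round(f_score(tp=69, fp=31, fn=31), 2))
```

With equal numbers of false positives and false negatives, precision and recall coincide, and the F-score equals that shared value.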

MeSH terms

  • Amino Acids / chemistry
  • Artificial Intelligence
  • Binding Sites
  • Catalytic Domain
  • Computational Biology
  • Data Mining / statistics & numerical data
  • Databases, Protein / statistics & numerical data
  • Ligands
  • Natural Language Processing
  • Proteins / chemistry*
  • Proteins / classification
  • Proteins / metabolism

Substances

  • Amino Acids
  • Ligands
  • Proteins