  • This is a repository of biomedical corpora that can be visualized using the Stav online visualization tool.
  • The datasets contain semantic annotations ranging from named entities (e.g., genes and drugs) and binary relationships (e.g., protein-protein interactions) to biomedical events (e.g., phosphorylation).
  • It is a useful resource for visualizing datasets and for obtaining examples (screenshots) for publications and presentations.
  • We welcome your feedback and suggestions for new corpora to include here.


This repository uses Stav to display the datasets. Stav is a tool developed at the University of Tokyo (Japan) for visualizing annotations on textual documents. It works best in the Chrome, Safari, and Opera browsers and does not work in Internet Explorer.

  1. In the table below, choose a dataset you are interested in.
  2. Clicking on the link in the second column will open the dataset in a new tab.
  3. A dialog window will open with a list of documents; double-click a document to view it.
  4. To get back to the dataset's list of documents, use your browser's *[back]* button or the [Collection] button at the top left.

Stav allows searching for a particular text, entity, relationship or event.

  1. Click on the [Login] button at the top right to log in to Stav.
  2. Enter "corpora" as the user name and "stav" as the password.
  3. Click on the [Search] button at the top left to run a search.

The table below lists the corpora included here, with links to the corresponding datasets in Stav. For more information about a corpus, click on its name.





protein-protein interactions *
corpus BibTeX BioC
Anatomical Entity Mention corpus
corpus (available from NACTEM) publication BioC
BioCreative 2 Gene Mention task
gene and protein mentions
the training corpus was split into groups of 40 sentences BibTeX [license] (soon)
protein-protein interactions *
corpus BibTeX BioC
BioNLP 2009 Shared Task on Event Extraction
biological events, such as gene expression, regulation, phosphorylation, etc.
training and development BibTeX [license] (soon)
BioNLP 2011 Shared Task
[GE] GENIA: training and development BibTeX [license] BioC
[EPI] Epigenetics and Post-translational Modifications: training and development BibTeX [license] BioC
[ID] Infectious Diseases: training and development BibTeX [license] BioC
[REL] Entity relations: training and development BibTeX [license] (soon)
disease and treatment relationships
the corpus was split into groups of 40 sentences BibTeX (soon)
CellFinder 1.0
entities related to the stem cell domain
full text and split by sections BibTeX [license] BioC
Data Deposition
statements of data deposition
training corpus split into groups of 20 sentences BibTeX (soon)
Drug-Drug Interaction Extraction 2011
First Challenge
training corpus BibTeX BioC
Drug-Drug Interaction Extraction 2013
Second challenge (tasks 9.1 and 9.2)
Medline training corpus BibTeX (soon)
DrugBank training corpus (soon)
regulation of gene expression
corpus BibTeX [license] BioC
GENIA term annotation
corpus BibTeX BioC
gene expression in anatomical locations
corpus BibTeX [license] BioC
gene regulation
E. coli BibTeX [license] BioC
Human BioC
protein-protein interactions *
corpus BibTeX BioC
protein-protein interactions *
corpus BibTeX BioC
protein-protein interactions *
corpus BibTeX BioC
human variations
corpus BibTeX BioC
protein-protein interactions
corpus split into groups of 20 sentences BibTeX (soon)
chemical compounds
chemicals BibTeX
IUPAC chemicals training
IUPAC chemicals test
corpus BibTeX [license] BioC
Variome Corpus
genetic variation
corpus BibTeX BioC

* For the five protein-protein interaction corpora (AIMed, BioInfer, HPRD50, IEPA, LLL), we used the derived versions from the University of Turku.

All corpora were converted or adapted to the Stav format by our team, except the BioNLP 2011 datasets, which are included in the Stav distribution.

Citing us

Please cite us whenever using this resource:

