Corpora in Stav

  • This is a repository of biomedical corpora that can be visualized with the Stav online visualization tool.
  • The datasets contain semantic annotations ranging from named entities (e.g., genes and drugs) and binary relationships (e.g., protein-protein interactions) to biomedical events (e.g., phosphorylation).
  • It is a useful resource for visualizing datasets and for obtaining example screenshots for publications and presentations.
  • We welcome your feedback and suggestions of new corpora to include here.


Screenshot: Gene and protein mentions from the BioCreative 2 Gene Mention task corpus
Screenshot: Diseases and treatments from the BioText corpus
Screenshot: Interactions between proteins from the BioInfer corpus
Screenshot: Biological events from the BioNLP 2011 GENIA shared task corpus

Using Stav

This repository uses Stav to display the datasets. Stav is a tool developed at the University of Tokyo (Japan) that visualizes annotations on textual documents. It works best in the Chrome, Safari and Opera browsers and does not work in Internet Explorer.

  1. In the table below, choose a dataset you are interested in.
  2. Click on the link in the second column to open the dataset in a new tab.
  3. A dialog window will open with a list of documents; double-click a document to view it.
  4. To get back to the dataset's list of documents, use your browser's [back] button or the [Collection] button at the top left.

Stav allows searching for a particular text, entity, relationship or event.

  1. Click on the [Login] button at the top right to log in to Stav.
  2. Enter "corpora" for the user and "stav" for the password.
  3. Click on the [Search] button at the top left to run a search.

What is included here?

The table below lists the corpora included here, with links to the corresponding datasets in Stav. For more information about a corpus, click on its name.





protein-protein interactions *
corpus BibTeX BioC
Anatomical Entity Mention corpus
corpus (available from NaCTeM) publication BioC
BioCreative 2 Gene Mention task
gene and protein mentions
the training corpus was split into groups of 40 sentences BibTeX [license] (soon)
protein-protein interactions *
corpus BibTeX BioC
BioNLP 2009 Shared Task on Event Extraction
biological events, such as gene expression, regulation, phosphorylation, etc.
training and development BibTeX [license] (soon)
BioNLP 2011 Shared Task
[GE] GENIA: training and development BibTeX [license] BioC
[EPI] Epigenetics and Post-translational Modifications: training and development BibTeX [license] BioC
[ID] Infectious Diseases: training and development BibTeX [license] BioC
[REL] Entity relations: training and development BibTeX [license] (soon)
disease and treatment relationships
the corpus was split into groups of 40 sentences BibTeX (soon)
CellFinder 1.0
entities related to the stem cell domain
full text and split by sections BibTeX [license] BioC
Data Deposition
statements of data deposition
training corpus split into groups of 20 sentences BibTeX (soon)
Drug-Drug Interaction Extraction 2011
First Challenge
training corpus BibTeX BioC
Drug-Drug Interaction Extraction 2013
Second challenge (tasks 9.1 and 9.2)
Medline training corpus BibTeX (soon)
DrugBank training corpus (soon)
regulation of gene expression
corpus BibTeX [license] BioC
GENIA term annotation
corpus BibTeX BioC
gene expression in anatomical locations
corpus BibTeX [license] BioC
gene regulation
E. coli BibTeX [license] BioC
Human BioC
protein-protein interactions *
corpus BibTeX BioC
protein-protein interactions *
corpus BibTeX BioC
protein-protein interactions *
corpus BibTeX BioC
human variations
corpus BibTeX BioC
protein-protein interactions
corpus split into groups of 20 sentences BibTeX (soon)
chemical compounds
chemicals BibTeX
IUPAC chemicals training
IUPAC chemicals test
corpus BibTeX [license] BioC
Variome Corpus
genetic variation
corpus BibTeX BioC

* For the five protein-protein interaction corpora (AIMed, BioInfer, HPRD50, IEPA, LLL), we used the derived versions from the University of Turku.

All corpora were converted or adapted to the Stav format by our team, except the BioNLP 2011 datasets, which are included within the Stav distribution.
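For readers who want to process the data outside the browser, here is a minimal sketch of reading such annotations. It assumes that "Stav format" means brat-style standoff files, in which each line of an `.ann` file is tab-separated: `T` lines hold text-bound annotations (type, character offsets, covered text) and `R` lines hold binary relations. The function name `parse_standoff`, the type labels `Protein` and `Interaction`, and the sample sentence are illustrative assumptions, not taken from any corpus listed here. Event (`E`) lines and other line types are skipped.

```python
from dataclasses import dataclass


@dataclass
class TextBound:
    """A text-bound annotation ('T' line): an entity or event trigger."""
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    """A binary relation ('R' line) between two other annotations."""
    id: str
    type: str
    args: dict  # role name -> id of the referenced annotation


def parse_standoff(ann: str):
    """Parse 'T' and 'R' lines of a standoff string; skip all other lines."""
    entities, relations = {}, {}
    for line in ann.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 3:
            # Example: "T1<TAB>Protein 0 5<TAB>BRCA1"
            type_, offsets = fields[1].split(" ", 1)
            # Discontinuous spans look like "0 5;10 15"; keep the outer bounds.
            numbers = offsets.replace(";", " ").split()
            entities[ann_id] = TextBound(
                ann_id, type_, int(numbers[0]), int(numbers[-1]), fields[2]
            )
        elif ann_id.startswith("R") and len(fields) >= 2:
            # Example: "R1<TAB>Interaction Arg1:T1 Arg2:T2"
            type_, *pairs = fields[1].split()
            args = dict(pair.split(":", 1) for pair in pairs)
            relations[ann_id] = Relation(ann_id, type_, args)
    return entities, relations


# Hypothetical sample: annotations over "BRCA1 interacts with RAD51".
sample = (
    "T1\tProtein 0 5\tBRCA1\n"
    "T2\tProtein 21 26\tRAD51\n"
    "R1\tInteraction Arg1:T1 Arg2:T2\n"
)
entities, relations = parse_standoff(sample)
for relation in relations.values():
    first = entities[relation.args["Arg1"]]
    second = entities[relation.args["Arg2"]]
    print(relation.type, first.text, second.text)
```

The offsets index into the accompanying plain-text file, so `text[start:end]` should reproduce the covered text for a continuous span.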

Citing us

Please cite us whenever using this resource:

...the WBI corpora repository\footnote{\url{http://}...}

As well as Stav:

...the stav text annotation visualiser\footnote{\url{}}
  author    = {Stenetorp, Pontus and Topi\'{c}, Goran and Pyysalo, Sampo
      and Ohta, Tomoko and Kim, Jin-Dong and Tsujii, Jun'ichi},
  title     = {BioNLP Shared Task 2011: Supporting Resources},
  booktitle = {Proceedings of BioNLP Shared Task 2011 Workshop},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {112--120},
  url       = {}


  • 13-March-2015: Anatomical Entity Mention (AnEM) corpus
  • 19-September-2013: Download of BioC XML format files for many of the corpora. Conversion was carried out using the Brat2BioC library.
  • 21-August-2013: Variome corpus
  • 15-March-2013: DDI 2013 corpus
  • 26-March-2012: Data Deposition and Picad corpora
  • 23-March-2012: CellFinder 1.0 corpus
  • 01-December-2011: First version of the repository
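The BioC XML files mentioned in the list above can be read with a standard XML parser. The following is a minimal sketch that assumes the usual BioC layout (`collection` > `document` > `passage` > `annotation`, with `infon`, `location` and `text` children). The function name `read_bioc_annotations`, the sample document and the infon key `type` are assumptions; the files produced by Brat2BioC may use different infon keys, so check a downloaded file first.

```python
import xml.etree.ElementTree as ET


def read_bioc_annotations(xml_text: str):
    """Return one dict per annotation found in a BioC XML string."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for annotation in document.iter("annotation"):
            location = annotation.find("location")
            if location is None:
                continue  # skip annotations without a text span
            # Collect all <infon key="...">value</infon> pairs.
            infons = {i.get("key"): i.text for i in annotation.findall("infon")}
            rows.append({
                "document": doc_id,
                "id": annotation.get("id"),
                "type": infons.get("type"),  # assumed infon key
                "offset": int(location.get("offset")),
                "length": int(location.get("length")),
                "text": annotation.findtext("text"),
            })
    return rows


# Hypothetical sample, not taken from any corpus in this repository.
sample = """<collection>
  <source>example</source>
  <date></date>
  <key></key>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 interacts with RAD51</text>
      <annotation id="T1">
        <infon key="type">Protein</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""

for row in read_bioc_annotations(sample):
    print(row)
```

To read a downloaded file instead of a string, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.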


  • Thanks to everyone who was involved in creating these corpora in the first place.
  • Thanks to Sampo Pyysalo (NaCTeM and University of Manchester) and Pontus Stenetorp (University of Tokyo) for the support with Stav.
  • Thanks to the University of Turku for reformatting the five protein-protein interaction corpora (AIMed, BioInfer, HPRD50, IEPA, LLL).
  • Thanks to Philippe Thomas, Tim Rocktäschel and Sebastian Arzt.
  • Thanks to Thomas Stoltmann and Norbert Herold for the technical support.
  • This repository is powered by Stav, developed at the University of Tokyo, Japan.


  • Mariana Neves: format conversion and Stav installation and configuration
  • Illés Solt: format conversion
  • Ulf Leser: project leader and coordinator

Please send any comments, questions or suggestions to Mariana Neves (neves (youknowwhat)