[BioNLP] New data set for biomedical word sense disambiguation (MSH WSD data set)

Antonio Jimeno antonio.jimeno at gmail.com
Mon Nov 21 11:21:51 EST 2011

We have prepared a data set for Word Sense Disambiguation (WSD) based
on a method that can be used to automatically develop a WSD test
collection using the Unified Medical Language System (UMLS)
Metathesaurus and the manual MeSH indexing of MEDLINE. Our work was
recently published in BMC Bioinformatics:

        title={Exploiting MeSH indexing in MEDLINE to generate a data
set for word sense disambiguation},
        author={Jimeno-Yepes, A.J. and McInnes, B.T. and Aronson, A.R.},
        journal={BMC Bioinformatics},
        publisher={BioMed Central}

The resulting data set is called MSH WSD and consists of 106 ambiguous
abbreviations, 88 ambiguous terms and 9 which are a combination of
both, for a total of 203 ambiguous words. Each instance containing the
ambiguous word was assigned a CUI from the 2009AB version of the UMLS.
For each ambiguous term/abbreviation, the data set contains a maximum
of 100 instances per sense obtained from MEDLINE, totaling 37,888
ambiguity cases in 37,090 MEDLINE citations.

The corpus is available at:


There are two available formats.

* The "Small MSH WSD Data Set" contains three files. The first file,
benchmark_mesh.txt, lists each ambiguous word and its candidate CUIs.
The second file, term_pmid_cui, contains one line per instance, giving
the ambiguous word, the PMID, and the disambiguated CUI. The third
file, README.txt, explains the files in more detail.

* The "Full MSH WSD Data Set" contains two files and a directory. The
first file is the benchmark_mesh.txt file as above. The second file,
README.txt, explains the files in more detail. The directory contains
one file for each of the 203 ambiguous words; each file contains the
PMID, the citation text (title and abstract only), and the sense label
(M1, M2, ...) derived from the benchmark file. In the citation text,
the instance of the ambiguous word considered for disambiguation is
denoted by the e tag (e.g., <e>AA</e>).
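
As an illustration of working with the e tag and the sense labels
described above, here is a minimal Python sketch. It is not part of the
distributed data set. The function names, the context window, and the
sample sentence are made up for this example, and the code assumes the
PMID, citation text, and sense have already been split into separate
fields (see the README.txt for the actual file layout).

```python
import re
from collections import Counter

# Assumption: the instance to disambiguate is wrapped in a literal
# <e>...</e> pair, as in the example "<e>AA</e>" above.
E_TAG = re.compile(r"<e>(.*?)</e>")


def marked_instance(citation_text, window=40):
    """Return (word, left_context, right_context) for the first <e> tag,
    or None if the text contains no <e> tag."""
    match = E_TAG.search(citation_text)
    if match is None:
        return None
    left = citation_text[max(0, match.start() - window):match.start()]
    right = citation_text[match.end():match.end() + window]
    return match.group(1), left, right


def strip_e_tags(citation_text):
    """Remove the <e> markup but keep the tagged word itself."""
    return E_TAG.sub(r"\1", citation_text)


def sense_counts(records):
    """Count instances per sense label (M1, M2, ...).

    `records` is an iterable of (pmid, citation_text, sense) tuples;
    how those fields are separated on disk is not assumed here.
    """
    return Counter(sense for _pmid, _text, sense in records)


# Hypothetical sentence, not taken from the data set.
sample = "Levels of <e>AA</e> were measured in plasma."
print(marked_instance(sample))
print(strip_e_tags(sample))
```

Counting instances per sense in this way is one simple check that a
word's file respects the limit of 100 instances per sense.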

Please note: The 37,090 MEDLINE citations included in this "Full MSH
WSD Data Set" are for exclusive use with the MSH WSD Data Set and
cannot be redistributed. In addition, the citations were retrieved in
July 2010 and represent a static view of MEDLINE at that time. The
data set has been reformatted such that none of the MEDLINE ASCII
element labels (e.g., "PMID-" or "TI -") remain, and only the Title
(TI) and Abstract (AB) elements were used.

Contact information:

Antonio Jimeno-Yepes, U.S. National Library of Medicine,
antonio.jimeno at gmail.com
Bridget T. McInnes, University of Minnesota Twin Cities, btmcinnes at gmail.com

