[BioNLP] New Paper on Recognition of Chemical Entities

Roman Klinger roman.klinger at scai.fraunhofer.de
Fri Apr 20 06:49:39 EDT 2012

Hi Phil, all,

On 20.04.12 12:32, Phil Gooch wrote:
> The training and test corpora do seem very different. The 463 training
> abstracts focus on IUPAC-type mentions, but also contain frequently
> occurring terms that were not manually annotated, such as amino acids
> and simple amines, ketones etc., which a standard tagger would pick up.
> The test corpus, by contrast, seems to have fewer long IUPAC-type mentions
> but more inorganic compounds, amino acids, hormones, generic drug names etc.

Keep in mind the original use of the corpora: the IUPAC corpora (training
set of 463 and test set of 1000) focused on IUPAC-like entities. The
corpus that Tim called the SCAI corpus consists of 100 abstracts with
different entities. Its purpose was to check how many non-IUPAC entities
could be found with dictionaries. That is the reason for the
differences. I do not expect a system trained on the IUPAC training corpus
to work really well on other classes ;-).

> I've been interested in how far you can get with NER with just morphemes
> in other areas (e.g. anatomy), so I put together a quick reg-ex based
> chemical tagger that uses ~500 morphemes categorised into 10 types and a
> few rules based on the wikipedia IUPAC entries. Just out of interest, I
> ran it blind on the corpora used in Tim's paper and it gave 55/84/66
> p/r/f on the training set and 42/76/54 on the test set. Still a long way
> off, otoh it's less than 100K and runs pretty fast. Probably not
> reportable/worth writing up but I'll upload it to my web site next week
> (it's a GATE plugin).

I am interested :-).
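
For readers who want a concrete picture of the morpheme approach Phil
describes, here is a minimal sketch in Python. It is not Phil's GATE plugin:
the morpheme list, the category names, and the functions build_pattern, tag
and f_score are all assumptions made up for illustration, and the inventory
is a tiny fraction of the ~500 morphemes he mentions.

```python
import re

# Hypothetical morpheme inventory (illustrative subset, not Phil's list).
MORPHEMES = {
    "multiplier": ["di", "tri", "tetra"],
    "stem": ["meth", "eth", "prop", "but", "pent", "hex"],
    "substituent": ["chloro", "bromo", "fluoro", "hydroxy", "amino", "nitro"],
    "suffix": ["ane", "ene", "yne", "anol", "anone", "amine", "yl"],
}


def build_pattern(morphemes):
    """Compile one regex that matches runs of two or more known morphemes."""
    # Longest alternatives first, so "amine" is tried before "ane" etc.
    alts = sorted(
        (m for group in morphemes.values() for m in group),
        key=len,
        reverse=True,
    )
    unit = "(?:" + "|".join(re.escape(m) for m in alts) + ")"
    # Optional locant such as "2-" or "1,2-" in front of a morpheme.
    locant = r"(?:\d+(?:,\d+)*-)"
    return re.compile(
        rf"\b{locant}?{unit}(?:-?{locant}?{unit})+\b",
        re.IGNORECASE,
    )


PATTERN = build_pattern(MORPHEMES)


def tag(text):
    """Return (start, end, mention) for every candidate chemical mention."""
    return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(text)]


def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = "We added 2-chloroethanol and dimethylamine to the buffer."
    for start, end, mention in tag(sentence):
        print(start, end, mention)
    print(round(f_score(55, 84), 1), round(f_score(42, 76), 1))
```

On the example sentence the sketch tags "2-chloroethanol" and
"dimethylamine". The f_score helper gives about 66.5 for 55/84 and about
54.1 for 42/76, which is consistent with the f values quoted above if they
are balanced F-measures computed from the rounded precision and recall.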


Dr. Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fraunhofer.de
