[BioNLP] New Paper on Recognition of Chemical Entities
roman.klinger at scai.fraunhofer.de
Fri Apr 20 06:49:39 EDT 2012
Hi Phil, all,
On 20.04.12 12:32, Phil Gooch wrote:
> The training and test corpora do seem very different. The 463 training
> abstracts focus on IUPAC-type mentions, but also contain frequently
> occurring terms that were not manually annotated, such as amino acids
> and simple amines, ketones etc., which a standard tagger would pick up.
> The test corpora, by contrast, seem to have fewer long IUPAC-type mentions
> but more inorganic compounds, amino acids, hormones, generic drug names etc.
Keep in mind the original purpose of the corpora: the IUPAC corpora (a
training set of 463 and a test set of 1000) focus on IUPAC-like
entities. The corpus that Tim called the SCAI corpus consists of 100
abstracts containing different entity classes. It was built to check how
many non-IUPAC entities could be found with dictionaries. That is the
reason for the differences. I do not expect a system trained on the
IUPAC training corpus to work really well on other classes ;-).
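To make the dictionary check concrete, here is a minimal sketch of how
such a coverage figure could be computed. It is only an illustration:
the function name, the normalisation and the toy data are assumptions,
not the procedure or the dictionaries actually used for the SCAI corpus.

```python
import re


def dictionary_recall(gold_mentions, dictionary):
    """Fraction of gold mentions found in the dictionary.

    Matching is exact after a simple normalisation (lower-casing and
    collapsing whitespace); this normalisation is an assumption.
    """
    def norm(s):
        return re.sub(r"\s+", " ", s.strip().lower())

    lexicon = {norm(term) for term in dictionary}
    if not gold_mentions:
        return 0.0
    found = sum(1 for mention in gold_mentions if norm(mention) in lexicon)
    return found / len(gold_mentions)


# Toy data, invented for illustration only.
toy_dictionary = ["Aspirin", "insulin", "sodium chloride", "glycine"]
toy_gold = ["aspirin", "Sodium  chloride", "2-acetoxybenzoic acid", "glycine"]

coverage = dictionary_recall(toy_gold, toy_dictionary)
print(f"dictionary recall: {coverage:.2f}")
```

In this toy case the systematic name is the one mention the dictionary
misses, which is the kind of gap the IUPAC-oriented corpora were meant
to address.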
> I've been interested in how far you can get with NER with just morphemes
> in other areas (e.g. anatomy), so I put together a quick reg-ex based
> chemical tagger that uses ~500 morphemes categorised into 10 types and a
> few rules based on the Wikipedia IUPAC entries. Just out of interest, I
> ran it blind on the corpora used in Tim's paper and it gave 55/84/66
> p/r/f on the training set and 42/76/54 on the test set. Still a long way
> off, otoh it's less than 100K and runs pretty fast. Probably not
> reportable/worth writing up but I'll upload it to my web site next week
> (it's a GATE plugin).
I am interested :-).
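For readers who want a feel for the morpheme idea, the following is a
rough sketch of a regex tagger built from a handful of morphemes, plus
exact-match precision/recall/F scoring. It is not Phil's GATE plugin:
the morpheme lists, the four categories, the single rule and the example
sentence are all assumptions made for illustration (the tagger described
above uses ~500 morphemes in 10 types).

```python
import re

# Hypothetical morpheme inventory, far smaller than a real one.
SUBSTITUENTS = ["chloro", "bromo", "hydroxy", "amino", "methyl"]
ROOTS = ["meth", "eth", "prop", "but"]
BONDS = ["an", "en", "yn"]
SUFFIXES = ["e", "ol", "one", "al", "amine"]


def alt(items):
    # Longest alternatives first, so longer morphemes are tried first.
    return "|".join(re.escape(s) for s in sorted(items, key=len, reverse=True))


LOCANT = r"(?:\d+(?:,\d+)*-)?"  # optional locants such as "2-" or "1,2-"

# One assumed rule: [locant] substituent* root bond suffix
CHEM = re.compile(
    rf"\b{LOCANT}(?:{alt(SUBSTITUENTS)})*(?:{alt(ROOTS)})"
    rf"(?:{alt(BONDS)})(?:{alt(SUFFIXES)})\b",
    re.IGNORECASE,
)


def tag(text):
    """Return the set of (start, end) character spans tagged as chemical."""
    return {(m.start(), m.end()) for m in CHEM.finditer(text)}


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def prf(predicted, gold):
    """Exact-span precision, recall and F for two sets of spans."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, f1(p, r)


def span_of(text, mention):
    """Span of the first occurrence of mention in text."""
    start = text.index(mention)
    return (start, start + len(mention))


# Invented example sentence with invented gold annotations.
TEXT = "2-chloroethanol and butanone were added, but the methods omit aspirin."
GOLD = {span_of(TEXT, m) for m in ["2-chloroethanol", "butanone", "aspirin"]}

predicted = tag(TEXT)
p, r, f = prf(predicted, GOLD)
print(sorted(TEXT[s:e] for s, e in predicted))
print(f"p={p:.2f} r={r:.2f} f={f:.2f}")
```

The rule skips "but" and "methods" because a root must be followed by a
bond and a suffix, and it misses "aspirin" because trivial names carry no
systematic morphemes. The same `f1` formula reproduces the F values
quoted above: 55/84 gives 66 and 42/76 gives 54 after rounding.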
Dr. Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
D-53754 Sankt Augustin
email: roman.klinger at scai.fraunhofer.de