[BioNLP] New Paper on Recognition of Chemical Entities
philgooch at gmail.com
Fri Apr 20 06:32:21 EDT 2012
The training and test corpora do seem very different. The 463 training
abstracts focus on IUPAC-type mentions, but also contain frequently
occurring terms that were not manually annotated, such as amino acids and
simple terms like amine, ketone etc., which a standard tagger would pick up.
The test corpus, by contrast, seems to have fewer long IUPAC-type mentions
but more inorganic compounds, amino acids, hormones, generic drug names etc.
I've been interested in how far you can get with NER using just morphemes in
other areas (e.g. anatomy), so I put together a quick regex-based chemical
tagger that uses ~500 morphemes categorised into 10 types, plus a few rules
based on the Wikipedia IUPAC entries. Just out of interest, I ran it blind
on the corpora used in Tim's paper and it gave 55/84/66 P/R/F on the
training set and 42/76/54 on the test set. Still a long way off; on the
other hand, it's less than 100K and runs pretty fast. Probably not
reportable or worth writing up, but I'll upload it to my web site next week
(it's a GATE plugin).
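
To give a flavour of the morpheme approach and of the strict vs. lenient
scoring asked about below, here is a minimal Python sketch. It is not the
GATE plugin itself: the morpheme lists are a tiny made-up sample, and the
names MORPHEMES, tag, f1 and prf are all hypothetical, chosen for
illustration only.

```python
import re

# Hypothetical morpheme inventory: a tiny illustrative sample, NOT the
# ~500 morphemes / 10 categories of the tagger described in this post.
MORPHEMES = {
    "stem": ["meth", "eth", "prop", "but", "chloro", "hydroxy", "amino"],
    "multiplier": ["di", "tri", "tetra"],
    "suffix": ["ane", "ene", "yne", "anol", "one", "amine", "ate", "ide"],
}


def _alternation(words):
    # Longest first, so "anol" is tried before "ane".
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


_LOCANT = r"\d+(?:,\d+)*-"  # e.g. "2-" or "1,2-"
_PIECE = _alternation(MORPHEMES["stem"] + MORPHEMES["multiplier"])
_SUFFIX = _alternation(MORPHEMES["suffix"])

# A candidate mention: one or more (optional locant + stem/multiplier)
# pieces followed by a chemical suffix, as a whole token.
CHEM_RE = re.compile(
    rf"\b(?:(?:{_LOCANT})?(?:{_PIECE})-?)+(?:{_SUFFIX})\b", re.IGNORECASE
)


def tag(text):
    """Return (start, end) character spans of candidate chemical mentions."""
    return [m.span() for m in CHEM_RE.finditer(text)]


def f1(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(gold, pred, lenient=False):
    """Precision, recall and F over (start, end) spans.

    strict:  a prediction counts only if its boundaries equal a gold span.
    lenient: any character overlap with a gold span counts.
    """
    def hit(a, b):
        if lenient:
            return a[0] < b[1] and b[0] < a[1]
        return a == b

    matched_pred = sum(any(hit(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(hit(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall, f1(precision, recall)


if __name__ == "__main__":
    text = "We added 2-chloropropane and ethanol to the mixture."
    pred = tag(text)
    # Made-up gold annotation that leaves the locant "2-" out of the span.
    start = text.index("chloropropane")
    gold = [(start, start + len("chloropropane")), pred[1]]
    print([text[s:e] for s, e in pred])
    print("strict: ", prf(gold, pred))
    print("lenient:", prf(gold, pred, lenient=True))
```

In the made-up example the tagger includes the locant "2-" while the gold
span does not, so that mention is a miss under strict matching but a hit
under lenient matching. The f1 helper also shows that the figures quoted
above are internally consistent: P=55, R=84 gives F of about 66, and P=42,
R=76 gives F of about 54.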
On Fri, Apr 20, 2012 at 6:00 PM, Peter Corbett <peter.corbett at linguamatics.com> wrote:
>> On 19/04/12 18:19, Phil Gooch wrote:
>> > Hi Ulf
>> > Thanks for this. Unfortunately I don't have access to the full paper.
>> > Can I ask: is the 68.1% F1 measure calculated using strict (exact
>> > boundary match) or lenient (some overlap allowed) criteria?
>> No access here either.
>> I think there's a bigger issue with evaluation here. I've reported F
>> scores as high as 83.2% on chemistry before (strict boundary match):
>> http://www.biomedcentral.com/1471-2105/9/S11/S4/ - I think a lot depends on:
>> a) What the source text for the evaluation corpus was.
>> b) Exactly which chemical named entities were being annotated.
>> c) How well-defined the annotation task was; i.e. how extensive the
>> guidelines were.
>> d) How good the inter-annotator agreement was.
>> e) Whether the software was developed for the corpus - i.e. whether
>> development sets were annotated with the same guidelines as the test data.
>> f) Whether the training set was annotated with the same guidelines as
>> the test set (e.g. by cross validation).
>> Given all of these, it's not hard to see how F scores might go up or
>> down by 20% or so depending on evaluation conditions. Really, we need a
>> BioCreative for chemical NER.
>> (Incidentally, F is a perverse metric, as precision-recall curves are
>> typically the mirror image of F score contours, so another point is: g)
>> Whether the software tried to balance precision and recall. But that's
>> just a pet peeve of mine.)
>> Peter Corbett
>> BioNLP mailing list
>> BioNLP at lists.ccs.neu.edu
>> The BioNLP website: http://www.bionlp.org