[BioNLP] New Paper on Recognition of Chemical Entities

BalaKrishna Kolluru balakkvj at gmail.com
Fri Apr 20 09:43:30 EDT 2012


On 20 April 2012 13:02, Peter Corbett <peter.corbett at linguamatics.com> wrote:

> On 20/04/12 11:08, Sampo Pyysalo wrote:
> > Peter, I realize you're not working on this anymore, but would you
> > happen to have any information about possible release of these corpora?
> >  From what I've read on Peter Murray-Rust's blog, I understand that the
> > publishers have agreed to the release of the relevant annotated
> > sentences from the full-text publications (e.g. comments at
> > http://blogs.ch.cam.ac.uk/pmr/2011/11/29/scientists-should-never-use-cc-nc-this-explains-why/
> > ),
> > which would be enough to allow retraining and direct comparison. These
> > resources would be very valuable for future work on chemical NER.
>
>
>
>
Having worked on chemistry NER (for U-Compare) for a decent amount of time,
I think we will need a more project-independent corpus. I have worked with
the Sciborg corpus (Peter Corbett et al. used it for Oscar3) and had a good
look at SCAI.

They both seem to be very project-specific corpora, and any work that is
outside the scope of these projects will involve tinkering with the corpus,
which kind of defeats the purpose of having a corpus for chemical NER, just
chemical NER.

To illustrate, there are roughly 3797 IUPAC names in SCAI and ~4102
Chemical compounds in Sciborg.
All other classes of chemicals, namely drugs, reactions, enzymes and
chemical words, are so few in number that most modelling techniques will
not converge; even if they did, the model is likely to be ordinary at best.
I am not sure why these other classes are defenestrated during annotation.

If we do go for a shared task or anything like BioCreative, I reckon it is
better to take fresh stock rather than dwell on the past and keep re-jigging
the same old corpora again and again.

That's my two pence anyway ...