[BioNLP] dictionaries are never sufficient

Bob Carpenter carp at alias-i.com
Mon Apr 23 14:20:32 EDT 2012

Entities other than chemicals have the same problem
with insufficiency of dictionaries.  For instance,
new people are being born and named all the time.
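
As a minimal sketch of that failure mode (the names and the
dictionary_tag helper below are made up for illustration and are
not from any real system), a pure dictionary lookup simply misses
any name coined after the dictionary was built:

```python
# Hypothetical toy dictionary of known person names (illustration only).
KNOWN_PEOPLE = {"Ada Lovelace", "Alan Turing"}


def dictionary_tag(text, dictionary):
    """Return the dictionary entries that occur verbatim in text."""
    return sorted(entry for entry in dictionary if entry in text)


old_text = "Alan Turing corresponded with colleagues."
# Invented name that is assumed not to be in the dictionary.
new_text = "Zyx Qwerty was born last week."

print(dictionary_tag(old_text, KNOWN_PEOPLE))  # the known name is found
print(dictionary_tag(new_text, KNOWN_PEOPLE))  # the new name is missed
```

Growing the dictionary only moves the cutoff date; it does not
remove the gap.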

For (terrestrial) locations, one could argue that
latitude/longitude are sufficient, but I don't think
that's right.  For example, "London" doesn't refer
to a point (even if we've resolved the referent to
the largest city in the UK).

Won't chemicals have the same problem in practice?
Is everything you deal with pure enough that the
chemical description is right?  Certainly the cup
of water on my desk isn't entirely H_{2}O.

The same problem arises for genes, because even if you know
the species and chromosomal location, people may have
slightly different variants.  Sometimes there are dictionaries
that cover some known variants, but usage in text is often
generic across instances (and sometimes across species).

- Bob Carpenter
   LingPipe, Inc.
   Columbia Uni, Dept. of Statistics

>> On 20.04.12 11:00, Peter Corbett wrote:

> Chemistry is in an interesting situation with regards to normalization,
> in that dictionary IDs are neither necessary nor sufficient for much of
> it. Not necessary, because you can use chemical structure to say what an
> entity is (for example, using a line notation such as SMILES or InChI),
> not sufficient because new compounds are invented all of the time, and
> the dictionaries and databases (especially those which are publicly
> available) may not contain it. For me, this is the thing that makes
> Chemical NER different from all of the other NER tasks, and I think that
> a shared task that missed this would be missing an opportunity.
