[BioNLP] dictionaries are never sufficient

David States dstates at stateslab.org
Mon Apr 23 16:55:06 EDT 2012

There is a level at which covalent compounds can be uniquely and
specifically described by a SMILES or InChI string.  Integers and real
numbers would be other examples.  The encoding is extensible to represent
new entities that have not previously been observed.

I agree that there are lots of examples where chemists are referring to a
substance or mixture, as opposed to an abstract covalent compound, and where
this representation is insufficient (e.g. your glass of water + chlorine +
salts + ...).  Nevertheless, normalization to covalent compound is an
interesting task where novel entities could be scored independent of a
dictionary.
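
To make the scoring idea concrete, here is a minimal sketch, not a
proposal for the actual shared-task metric.  It assumes each mention has
already been normalized to a standard InChI string by some upstream tool,
and it scores a system by exact string match against the gold structure, so
a compound missing from every dictionary can still be scored.  The function
name score_normalization and the mention ids m1..m3 are hypothetical.

```python
def score_normalization(gold, predicted):
    """Exact-match precision, recall and F1 over (mention id, structure) pairs.

    Both arguments map a mention id to a structure string. The strings are
    assumed to be canonical already (e.g. standard InChI), so plain string
    equality stands in for "same covalent compound". No dictionary lookup
    is involved at any point.
    """
    gold_pairs = set(gold.items())
    pred_pairs = set(predicted.items())
    true_pos = len(gold_pairs & pred_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold annotations for three mentions.
gold = {
    "m1": "InChI=1S/H2O/h1H2",                   # water
    "m2": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
    "m3": "InChI=1S/CH4/h1H4",                   # methane
}

# Hypothetical system output: m1 correct, m2 wrong, m3 not attempted.
predicted = {
    "m1": "InChI=1S/H2O/h1H2",
    "m2": "InChI=1S/CH4/h1H4",
}

precision, recall, f1 = score_normalization(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact match is the strictest choice; it gives no partial credit for a
near-miss structure, and it does nothing for the substance/mixture cases
mentioned above.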

The problem with genes is that the definition requires a population biology
context.  At the point when chimps and humans diverged as species, their
albumin genes diverged as well, even if nothing was altered at a molecular
level.  Just the fact that the two populations were no longer exchanging
genetic information requires that new species be defined each with their own
set of genes.


David J. States MD PhD FACMI
Chief Scientific Officer, OncProTech LLC

-----Original Message-----
From: bionlp-bounces at lists.ccs.neu.edu
[mailto:bionlp-bounces at lists.ccs.neu.edu] On Behalf Of Bob Carpenter
Sent: Monday, April 23, 2012 2:21 PM
To: bionlp at lists.ccs.neu.edu
Subject: [BioNLP] dictionaries are never sufficient

Entities other than chemicals have the same problem with insufficiency of
dictionaries.  For instance, new people are being born and named all the
time.

For (terrestrial) locations, one could argue that latitude/longitude are
sufficient, but I don't think that's right.  For example, "London" doesn't
refer to a point (even if we've resolved the referent to the largest city in
the UK).

Won't chemicals have the same problem in practice?
Is everything you deal with pure enough that the chemical description is
right?  Certainly the cup of water on my desk isn't entirely H_{2}O.

The problem arises for genes, because even if you know the species and
chromosomal location, people may have slightly different variants.
Sometimes there are dictionaries that cover some known variants, but the
usage is often generic across instances (and sometimes across species).

- Bob Carpenter
   LingPipe, Inc.
   Columbia Uni, Dept. of Statistics

>> On 20.04.12 11:00, Peter Corbett wrote:

> Chemistry is in an interesting situation with regards to
> normalization, in that dictionary IDs are neither necessary nor
> sufficient for much of it. Not necessary, because you can use chemical
> structure to say what an entity is (for example, using a line notation
> such as SMILES or InChI), not sufficient because new compounds are
> invented all of the time, and the dictionaries and databases
> (especially those which are publicly available) may not contain it.
> For me, this is the thing that makes Chemical NER different from all
> of the other NER tasks, and I think that a shared task that missed
> this would be missing an opportunity.

BioNLP mailing list
BioNLP at lists.ccs.neu.edu
The BioNLP website: http://www.bionlp.org
