Rule-Based Human Gene Normalization in Biomedical Text with Confidence Estimation

William W. Lau, Calvin A. Johnson*, Kevin G. Becker

Center for Information Technology, National Institutes of Health, Bethesda, MD 20892-5624, USA. johnson@mail.nih.gov

Proc LSS Comput Syst Bioinform Conf. August, 2007. Vol. 6, p. 371-379.

*To whom correspondence should be addressed.


The ability to identify gene mentions in text and normalize them to the proper unique identifiers is crucial for downstream text mining applications in bioinformatics. We have developed a rule-based algorithm that divides the normalization task into two steps. The first step uses pattern matching for gene symbols and an approximate term searching technique for gene names. In the second step, the algorithm measures several features based on morphological, statistical, and contextual information to estimate the level of confidence that the correct identifier has been selected for a potential mention. Uniqueness, inverse distance, and coverage are three novel features we quantified. The algorithm was evaluated against the BioCreAtIvE datasets. The feature weights were tuned by the Nelder-Mead simplex method. An F-score of 0.7622 and an AUC (area under the recall-precision curve) of 0.7461 were achieved on the test data using the set of weights optimized on the training data.
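As a rough illustration of the confidence-estimation step and the F-score used for evaluation, the following minimal Python sketch may help. It rests on several assumptions that the abstract does not confirm: that the confidence is a weighted sum of feature values, that an identifier is accepted when its confidence meets a threshold, and that the three named features take values between 0 and 1. The function names, feature values, weights, and identifiers are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Feature names come from the abstract; their exact definitions are not
# given there, so every numeric value used with them here is made up.
FEATURES = ("uniqueness", "inverse_distance", "coverage")


def confidence(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine feature values into one score (assumed rule: weighted sum)."""
    return sum(weights[name] * features[name] for name in FEATURES)


def select(
    candidates: List[Tuple[str, Dict[str, float]]],
    weights: Dict[str, float],
    threshold: float,
) -> Set[str]:
    """Keep the identifiers whose confidence meets the (assumed) threshold."""
    return {
        gene_id
        for gene_id, feats in candidates
        if confidence(feats, weights) >= threshold
    }


def f_score(predicted: Set[str], gold: Set[str]) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical weights and candidate identifiers, for illustration only.
WEIGHTS = {"uniqueness": 0.5, "inverse_distance": 0.25, "coverage": 0.25}
CANDIDATES = [
    ("ID_A", {"uniqueness": 1.0, "inverse_distance": 1.0, "coverage": 1.0}),
    ("ID_B", {"uniqueness": 0.0, "inverse_distance": 0.5, "coverage": 0.5}),
    ("ID_C", {"uniqueness": 1.0, "inverse_distance": 0.0, "coverage": 1.0}),
]
GOLD = {"ID_A", "ID_B"}

predicted = select(CANDIDATES, WEIGHTS, threshold=0.5)
score = f_score(predicted, GOLD)
```

In a setup like this, weight tuning could be framed as minimizing `1 - f_score` over the weight vector on training data, for example with `scipy.optimize.minimize(..., method="Nelder-Mead")`; whether the authors used that objective or that library is not stated in the abstract.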

