|Antonio Zamora, Consultant|
|Specialized Information Services Division,|
National Library of Medicine
The Specialized Information Services Division (SIS) of the National Library of Medicine (NLM) provides Web access to more than a dozen scientific databases on toxicology and the environment on TOXNET1. Search queries on TOXNET often include misspelled or variant English words, medical and scientific jargon and chemical names. Following the example of search engines like Google and ClinicalTrials.gov, we set out to develop a spelling "suggestion" system for increased recall and precision in TOXNET searching.
This paper describes development of dictionary technology that can be used in a variety of applications such as orthographic verification, writing aid, natural language processing, and information storage and retrieval. The design of the technology allows building complex applications using the components developed in the earlier phases of the work in a modular fashion without extensive rewriting of computer code. Since many of the potential applications envisioned for this work have on-line or web-based interfaces, the dictionaries and other computer components must have fast response, and must be adaptable to open-ended database vocabularies, including chemical nomenclature.
The dictionary vocabulary for this work was derived from SIS and other databases and specialized resources, such as NLM's Unified Medical Language Systems (UMLS). The resulting technology, A-Z Dictionary (AZdict), has three major constituents: 1) the vocabulary list, 2) the word attributes that define part of speech and morphological relationships between words in the list, and 3) a set of programs that implements the retrieval of words and their attributes, and determines similarity between words (ChemSpell). These three components can be used in various applications such as spelling verification, spelling aid, part-of-speech tagging, paraphrasing, and many other natural language processing functions.
Creating a word list is a tedious manual procedure. Words must be selected from lists of words collated from databases. Words with low frequencies must be carefully scrutinized because they could be misspellings or infrequent terms that should not be added to the dictionary. The list of words may be checked against existing computer-readable dictionaries and verified against commercial dictionaries by importing the word lists into word processors and using their spelling checking functions. Words flagged by the spelling checkers may be outside the scope of the commercial dictionaries and are not necessarily errors. At this point begins the arduous task of adding word attributes and verifying the words against printed specialized dictionaries. This is discussed in more detail below.
The goal of the vocabulary building stage is to create word lists that will meet the requirements of the applications that they are to serve. For this project, it was necessary to build a large medical dictionary containing approximately 66,000 terms, and a fairly comprehensive dictionary for English (US-UK) with approximately 100,000 words which combined British and American spellings. The medical dictionary is used only when a search in the English dictionary fails, so the dictionaries complement each other and do not need to contain overlapping vocabulary.
The entries of the dictionaries contain single alphabetic strings, but they also include compound words (e.g., bullfrog), hyphenated nouns (e.g., mother-in-law), some words with apostrophes (e.g., O'Hare), and capitalized proper names or acronyms. Hyphenated adjectival strings such as "fast-running" are not included in the dictionaries. Chemical compounds are excluded from the dictionary, except for frequently occurring one-word trivial names. The US-UK dictionary may contain common chemicals such as "aspirin", "heroin", and "uranium", while the medical dictionary includes less common substances such as "benzene", "toluene", "dopamine", etc. Similarly, common diseases and medical terminology such as "meningitis" are included in the US-UK dictionary, but the less common term "meningitic" is found in the medical dictionary.
Building the dictionaries required evaluating lists of unique author terms found in selected NLM/SIS databases. Terms that were questionable were eliminated. Terms that were non-idiomatic or judged to have a very small chance of recurring were also excluded. The words were examined in isolation, without recourse to any clarifying context. Given the word "biasness", the dictionary editor had to decide if this was a real word or the erroneous use of "bias" as a noun. Is the word "Valparafso" a correct proper noun or is it a typographical error for "Valparaiso"? Is "dieselize" a valid word? It is not unusual to have to consult several dictionaries to resolve such problems. On average, it may take 10 to 15 seconds to resolve an easy problem. A complex problem may require a lot more time. For the sake of accuracy, it is better to omit a questionable term from the dictionary than to include it. However, the editor also has to keep in mind that a new term in an active area of research may not be in printed reference dictionaries, in which case the term should be added to the dictionary.
Word morphology is usually categorized into "inflectional" and "derivational" morphology. Each word in the A-Z Dictionary is tagged with one or more inflectional codes that define its part of speech and inflectional morphology. Inflectional morphology describes the transformation of words depending on syntactic constraints. The verb "reduce", for example, has three inflectional verb forms: "reduces", "reduced" and "reducing". Derivational morphology deals with the transformations of the stem of a word to generate other words that retain the same concept but may have different syntactic roles. Thus, "reduce" and "reduction" refer to the concept of "making less" or "making smaller", but one is a verb and the other one a noun. Prefixes and suffixes may create new words whose meaning would still be constrained by the concept of the root, e.g., unreduced, multireduction, antireduction, polyreducibility, etc. Each of the derivational variants will have the inflectional forms appropriate for its part of speech or role in a sentence. Derivational variants are defined by grouping the related words in an auxiliary file.
The inflectional morphology codes are based on the part of speech and suffixes that apply to a word. The word "stop", for example, may be a verb, a noun, or an adjective. As a verb, the last letter is doubled for the past tense and past participle ("stopped", "stopping"), so "stop" shares the same inflectional morphology as the verbs "bag", "blot", etc. All these words are given the same verb code. Similarly, adding an "s" forms the plural of "stop" and the word is assigned the corresponding noun inflectional code. Homographs with different characteristics require multiple classifications, e.g., "lie" as a verb with two different conjugations requires two different verb inflectional codes. British and American spellings sometimes share the same codes because they have the same inflectional morphologies. For example, the plural of "theater" and "theatre" are both formed by adding an "s", and the words may use the same noun inflectional code.
The dictionary access code is a set of programs that retrieves words and their attributes from dictionaries and can be invoked using Visual Basic, C-Language, or Java interfaces. Additional subroutines may be called while the dictionary is being scanned to determine the degree of similarity between words. This functionality makes it possible to find exact matches for spelling verification and similar matches for spelling aid.
The speed of the program depends on how fast a dictionary can be searched. The search time can be reduced significantly using heuristics to search only the relevant subset of the dictionary. Thus, if a word starts with the letter "t", searching only words starting with "t" of length equal to the target word will produce the desired results faster. For spelling aid, it is necessary to search portions of the dictionary that have alternate initial letters based on phonetic or typographic criteria. For example, when searching for words phonetically similar to "ceiling", spelling aid should also search words starting with "s" and suggest "sealing". For spelling aid, it is also necessary to consider words whose lengths are shorter or longer than the target word.
The programs allow searching the English dictionary or the English dictionary plus the medical dictionary. The English dictionary is always searched first because it will match over 95% of normal text. When two dictionaries are specified, the search stops as soon as the word is found. The medical dictionary is only searched when the word is not found in the English dictionary.
A string similarity program and several key generation programs are basic constituents of the software. The string similarity program determines the lexical distance between two strings2. The lexical distance is defined as the number of error operations required to convert one string to another, the error operations being: insertion, omission, substitution, and transposition of two adjacent characters. Application of the string similarity program to two words is generally satisfactory for minor typographical errors3. However, it is also possible to apply the string similarity to mappings of two words. A mapping is a string derived by an arbitrary transformation of a word. The transformation may involve application of phonetic rules to produce a phonetic key of the word. A phonetic key may be constructed by removing duplicate consecutive letters and applying some transformation rules such as "ce" → "se", "ph" → "f", etc. The transformation may also involve separating chemical locants from a chemical name to create a mapping in which the locants become insignificant. The programs that produce these mappings are called "key generation programs" because they produce strings that can be used as keys for sorting or indexing files. These key generation programs complement the string similarity program by providing different perspectives of the vocabulary and database index entries.
ChemSpell is an adaptation of standard spelling aid technology4,5,6 to large, unbounded chemical nomenclature databases. Chemical names differ substantially from normal English vocabulary. They can exceed 700 characters in length and may include imbedded spaces and punctuation. Unlike traditional dictionaries that have several hundred thousand words, chemical databases may have several million chemical names.
Previous approaches for spelling aid of chemical substances have used n-grams7. N-grams are bit maps created by hashing every two or three adjacent characters (bigram, trigram) of the chemical name. Bit-by-bit comparison with n-grams of other chemical names is used to establish similarity. This approach, although adequate, requires specialized bit-mapping software that is not part of standard spelling aid systems.
The ChemSpell approach emphasizes the use of common components and involves generation of keys that can be compared using the standard string similarity program. Instead of mapping the chemical names into n-grams, the ChemSpell keys reduce the names to basic forms that are relatively invariant for sets of names with related characteristics. Some keys are stored in the database index and others are not. Phonetic keys are generated every time that a dictionary is scanned, whereas chemical nomenclature keys are stored as an index to the file. Every time that a new chemical name is added to the file, its nomenclature key is generated and updated in the index. This makes it possible to scale up to the limits supported by the database software.
The chemical keys have a maximum length of 100 characters and consist only of lower case alphabetic characters. The first character of the key is the first character of the chemical name ignoring punctuation, numbers, single non-alphabetic characters isolated by punctuation, Greek letters, and stereo descriptors. The next characters of the key are 1) the consonants, except "y", listed in the order in which they occur in the chemical name, 2) the vowels or the letter "y", also listed in the order in which they occur in the chemical name, and 3) single letters isolated by punctuation. Adjacent duplicate letters in the chemical name are excluded from the chemical key. Examples: The chemical key for "p-Nitrobenzoic acid" is "ntrbnzccdioeoiaip" and the key for "N-Aminopyridine" is "amnprdnioyiien". Chemical keys are designed to be insensitive to consonant/vowel transpositions, letter doubling, case differences, and punctuation differences. The fact that the consonants and the vowels are separated makes the chemical key insensitive to consonant/vowel transpositions, thus "nitro" and "ntiro" have the same key. Ignoring duplicate adjacent letters and capitalization means that "Niitro" and "nitrro" have the same key as "nitro". Finally, by placing isolated single letters at the end of the key (which are generally locants) it is possible to sort together related substances in the chemical key index. Searching can be optimized by selecting subsets of the database based on the initial characters of the search keys and reducing the subsets through length and lexical distance constraints.
The AZdict and ChemSpell components have been tested in several applications. The compressed dictionaries are available with a Visual Basic interface. This package provides dictionary lookup with inflectional morphology support, spelling aid with phonetic proximity, and derivational morphology. The program also includes a help file containing a formal description of English. This grammar is the basis of a part-of-speech tagger that has also been coded using these components and is available with a Visual Basic interface . The tagger uses the inflectional morphology information in the dictionaries as the basis for deducing parts of speech using contextual syntactic criteria.
The dictionary technology has also been applied to selected NLM/SIS web-searchable databases. The primary goal was to provide spelling aid for queries with no results. Frequently, the terms in a database differ only slightly in punctuation or spelling from query terms specified the users. The ChemSpell technology makes it possible to find all similar terms in the database, including misspellings. A test interface for NLM's Hazardous Substance Data Bank (HSDB) verifies user query terms against the US-UK and medical dictionaries in addition to the vocabulary of the Hazardous Substance Data Bank. An interface was also created in the "Name/Synonym" search of ChemIDPlus, the main repository of chemical nomenclature for NLM.
The ChemSpell component for ChemIDPlus was designed as an add-on feature consisting of additional index files and a search component to facilitate integration into the existing operational systems. Two index files are used as sources of spelling aid candidates for the ChemIDPlus database. The first index file consists of the main chemical name index. The second index file consists of chemical keys which are derived from each chemical name.
The search component of ChemSpell is the mechanism for selecting database records and for determining the lexical distance between each record and the query term. Programs to calculate this distance normally compare complete strings2, but the program used in this project was designed to terminate as soon as the number of error operations exceeds a specified threshold. In principle, the smaller the lexical distance, the greater similarity between the terms. The number of identical initial characters of the terms, and the difference in length between the query term and the database term are used as additional criteria for ranking the final results.
Ideally, the best spelling aid candidates can be found when all the entries of the database are examined for similarity to the search term. This exhaustive approach can be prohibitively slow and expensive. A practical solution is to selectively search subsets of the database that are likely to contain the best spelling aid candidates. ChemSpell accomplishes this by using the two initial letters of the chemical key and of the name key. In addition, the initial letters of the most frequently encountered phonetic transformations for the name key are also used for retrieval, when applicable. Using two or more initial letters of a search term retrieves a small and manageable subset of the database that may be further reduced by length constraints. No database terms are retrieved whose key length differs from the search term key by more than 4 characters. Such words would be far too different from the search term. A search term such as "clorine" will be searched in the subset of the database starting with "cl", plus at the subsets that start with the phonetically equivalent "chl" and "kl". Figure 1 is a summary of the phonetic targeting rules:
ce → se, sce, ke, ch|
ci → si, sci, ki, ch
cy → sy, scy, ky, ch
ch* → c*, k*
cl → chl, kl
cr → chr, kr
cu → qu, ku
c* → k*, ch* (where * is not e,i,y,h,l,r or u)
f* → ph*
kl → chl, cl
kr → chr, cr
n* → gn*, kn*, mn*, pn* (where * is a,e,i,o,u,y)
ph* → f*, th
pn* → n*
ps* → s*
s* → c*, ps (where * is e,i,y)
s* → ps (where * is a,o,u)
t* → pt* (where * is not h)
th → ph
The phonetic targeting rules make it possible to use at least two letters of the query term to obtain manageable subsets of the database in the search for spelling aid candidates. A question that comes up frequently is: "What happens if the first two letters are misspelled?". Obviously, the name key retrieval will probably fail, unless one of the phonetic targeting rules takes care of the problem. However, the chemical key might still be able to provide useful candidates since it indexes the database from a different perspective. It would be possible to build a system that is unaffected by initial misspellings by generating an additional index consisting of keys created by reversing the words. In such a system, the word "claim" would be indexed at both "claim" and "mialc". Even if the word were misspelled as "alcim" the database access using the two letters "mi" from the reverse key would retrieve the relevant word. This is a process of diminishing returns that avoids searching the whole database, but which requires evaluation of the costs versus the benefits. The chemical keys, the name keys, and the phonetic targeting rules for ChemIDPlus have been very effective at retrieving relevant spelling aid candidates while maintaining fast response time. The spelling aid design for ChemIDPlus has always kept in mind that it is an interactive system with intelligent users who may need some basic spelling help but who can rephrase queries easily if the system does not provide the appropriate results. It is better to require a user who grossly misspells words to re-type them than to punish all users with the slower response time that would result from building a fractionally better system that uses greater computer resources.
The list of aid candidates is built dynamically as the database retrievals are evaluated. The ChemIDPlus name index is searched first and the chemical key index is searched in a second pass. As the name index retrievals are being examined, phonetic keys are generated for each retrieved entry. The minimum distance between the name keys or the phonetic keys is saved. The chemical name from the database and its distance are placed on the spelling aid list. The retrievals from searching the chemical key index are merged with those from the name index, but using the distance of the chemical keys. Entries with distances greater than 4 are never added to the aid candidate list. If a duplicate chemical name is found on the aid list, only its distance is replaced if the distance of the current entry is lower than the one already in the list. When the aid list is full with the maximum number of candidates, an entry with a lower distance will displace one with a higher distance. This has the effect of storing the chemical names with the lowest phonetic, name, or chemical distances in the spelling aid list. As an example, consider the search term "octadeine" and the database entry "Octa-2,3-diene" whose name key distance is 6. This entry has a chemical key distance of 1 since the two numeric locants and the three punctuation characters are ignored. The name key would reject "Octa-2,3-diene" as being too different from "octadeine", but the chemical key indicates that it is a good spelling aid candidate because there is only one error operation (the "ei/ie" transposition).
Once the name index and the chemical key index files have been scanned, the aid list is sorted in ascending distance order to place the best candidates at the top of the list. Figures 2, 3, and 4 show some examples of chemical spelling aid for ChemIDPlus:
|Input term: octadeine|
Octadiene [UN2309] [Flammable liquid]
|Input term: tronitro toleuene|
|Input term: dimethyl octadienol|
One final aspect that needs to be discussed is the ease with which the spelling aid mechanisms can be maintained. Basically, when a database is updated, keys must be generated for each chemical name and the corresponding spelling aid indexes need to be updated with the new entries. This is easily supported by databases such as Oracle and makes it possible for ChemSpell to provide the newest entries in the database as spelling aid candidates when appropriate.
The dictionary components are easily portable and can be included in many applications. We anticipate improvements in user interfaces for NLM/SIS database queries through the use of the spellchecking, spelling aid, and parsing technologies that have already been developed. Future work using other data sources will deal with semantic relationships, paraphrasing and nominalization based on derivational morphology and deep morphology, i.e., morphosemantic analysis8, to extract the meaning of word substrings for enriching queries by mapping medical terminology into English. For example, "hepatitis" would generate the phrase "inflammation of the liver". Other practical enhancements will include spellchecking multiple databases to display only spelling suggestions for terms that actually exist in a given database, bootstrapping the A-Z Dictionary technology to identify frequent misspellings in NLM/SIS databases and assisting writers and reviewers in the creation of the TOXNET Hazardous Substances Data Bank.