Index ScientificPsychic.com - Expand your mind, improve your body, have fun

Modifications to the Lancaster Stemming Algorithm
(Paice/Husk Stemmer)

Antonio Zamora

Stemming is a well-known technique for information retrieval.[1]  The use of stems for searching has the advantage of increasing recall by retrieving terms that have the same roots but different endings. A major disadvantage of stemming is a decrease of precision as compared to the use of untruncated terms. When searching with stems, it is not uncommon to retrieve many irrelevant terms that have similar roots but which are not related to the object of the search. For accurate retrieval, the search stems should be as long as necessary to achieve precision, but short enough to increase recall. Several commonly-used stemming programs and algorithms were evaluated to try to select a stemmer suitable for information retrieval of large databases. The evaluation was narrowed down to two stemmers: 1) the Paice/Husk stemmer developed at Lancaster University (Paice 1990)[2] which features a rule execution mechanism and externally stored rules, and 2) the Porter stemmer (Porter 1980)[3] which uses algorithmic rules rather than externally stored rules. Neither of these stemmers could be used in their original form because some of the stems generated were not substrings of actual words or the resulting stems were too short. Both of these are important requirements for accurately searching existing large databases.

The flexibility of being able to specify a new set of rules without extensive programming changes made the Paice/Husk stemmer more attractive than the Porter stemmer. The Paice/Husk stemmer is basically a rewrite rule interpreter which may be configured as a finite state automaton by using the appropriate rules. The C-language implementation by Andrew Stark of the Paice/Husk stemmer works adequately, but is not well suited for developing and experimenting with a new set of rules. Consequently, the program was modified to improve the handling of errors in the rules, allow interactive testing, provide more precise stems, and add some flexibility for implementing finite state automata. Fewer than 50 lines of code were added or altered without counting the replacement of the driver. The new driver and debugging options make it possible to test the execution of the rules interactively. This is important because it is possible for the execution of the rules to get in an infinite loop! For example, the rule "e,e,continue" will loop forever when a word ending in "e" is input. The interaction of several rules may also result in infinite loops when they all use the continue flag. The code was modified to prevent infinite loops by stopping when the number of rules executed exceeds twice the number of characters in the input word. The new debugging options helped to solve the mystery of why the original rules generated the stem "abud" from "abusively":

<abusively> 100->abusive 13->abusiv 94->abuj 27->abud

The Code and Test Programs

The source code and test programs for Windows are available in a ZIP file:
Click here to download Paice.zip (31K)

The source files are also separately viewable:
Stemmer driver: paicedrv.c
Stemmer: paice.c
Include file: rule.h
Stemming rules: azrules.txt

Web-based interactive test temporarily out of service.

Summary of the modifications:

  1. Developed a new driver that allows interactive or batch input. Increased the maximum size of the input words to 60 characters to allow for medical terminology.
  2. Added a debugging option to display the rules applied to the transformation of a word. This made it possible to determine the interaction of the rules and to identify rule problems. The option is turned on and off by using "debugon" and "debugoff" as words in the input stream. When the first word input is "debugon" or "printrules", the rule numbers and the rules are displayed. The "printrules" is used to obtain a reference printout of the rules and rule numbers for large batch files when it is not necessary to display debug information for the words.
  3. Added checks to prevent the program from crashing for rules with incorrect format. Generated error diagnostics for bad rules. Processed all rules before stopping when there were errors. To prevent infinite loops caused by carelessly constructed rules, the execution of rules is stopped when the number of rules executed exceeds twice the number of characters in the input word.
  4. Skipped blank lines and treated as comments lines starting with a semicolon in the rule file.
  5. Restricted valid stems to a minimum of 3 characters and a maximum of 10 characters. Stems shorter than three characters do not have enough precision for searching large databases and stems over 10 characters do not increase precision significantly.
  6. Increased the allowable length of the suffix from 7 to 11 characters.
  7. Modified the applyrule() module to compare the length of the suffix against the word as a pre-screen to classify the rules as 'not applicable' when:
    1. the suffix is longer than the word.
    2. application of the rule would result in unacceptably short stems.
    This change makes it possible to apply rules of smaller length that are substrings of another rule, e.g, given the word "actions", minimum stem size = 3, and the rules:
    (tions)->() stop
    (ions)->() stop
    (s)->() stop
    The suffix 'tions' will not apply because the resulting length would be too small, but 'ions' will apply and generate "act". For the word "lions", 'ions' will not apply, but 's' applies to generate 'lion'.
  8. Introduced the use of two-digit strings as special markers in rules. The markers may be considered to define states of a finite state automaton and are only allowed at the ends of strings. These special numeric markers are considered to have zero length when determining the minimum stem size. The need for such markers was recognized by Paice who used the letter "j" as a dummy marker. The current implementation provides 100 two-digit combinations from 00 to 99. In the accompanying rules, for example, "14" is used as a marker for endings which may be preceded by letters that double.
  9. Developed a new rule set of approximately 300 rules with limited use of the "continue" option. This was achieved by listing compound endings in the rules. The continue option was used primarily for the suffixes "less", "ness", and "ly" which may occur in combination with a large number of other endings, and for suffixes where the left context needs to be checked. The rules were tested and refined against a 160,000 English dictionary containing approximately 60,000 medical words (Doszkocs and Zamora 2003)[4]. Rule development is a process that is never finished. Rules to handle a small number of exceptions may be placed ahead of a more general rule, but the process becomes a slower lexical approach if too many exception rules are added.

Batch testing

The TEST.BAT file invokes the stemmer in batch mode. The command
paice.exe azrules.txt data.txt stems1.txt >>msg.txt

invokes the stemmer(paice.exe) and specifies the rules file(azrules.txt), the input file(data.txt), and the output file(stems1.txt). The sysout stream is redirected to the file "msg.txt" which will contain diagnostics and debugging output.

These are some examples of stemmed output using the AZRULES with the modified Paice/Husk stemmer. A space separates the stem from the ending. The number of the last rule applied follows the ending. This list was sorted alphabetically after stemming.

determin ativeness 115
determin e 49
determin ed 133
determin edly 133
determin edness 133
determin er 133
determin ers 133
determin es 180
determin ing 133
determin ism 77
determin ist 244
determin istic 9
determin istically 58
determin ists 215

This is the listing of a portion of the AZRULES as displayed when the first word input is "debugon" or "printrules". Debug output is sent to the sysout data stream, rather than to the output file.

; 14 marker is used to undouble some doubled letters
125 (bb14)->(b) stop
126 (dd14)->(d) stop
127 (ff14)->(f) stop
128 (gg14)->(g) stop
129 (mm14)->(m) stop
130 (nn14)->(n) stop
131 (pp14)->(p) stop
132 (rr14)->(r) stop
133 (tt14)->(t) stop
134 (14)->() stop
; === R ===
135 (ar)->() stop
136 (eer)->(eer) stop
137 (lier)->() stop
138 (ier)->(14) cont.
139 (ener)->() stop
140 (iser)->() stop
141 (izer)->() stop
142 (yzer)->(y) stop
143 (er)->(14) cont.
144 (ator)->(a10) cont.
145 (or)->(14) cont.
146 (eur)->() stop

The following listing shows the results from using the "debugon" option in a batch test. The rule number applied and the changes to the word are indicated in each line. The first line shows how the "er" ending of rule 143 matched against the word "stopper" and replaced it with the special "14" marker. Rule 131 matched "pp14" and stripped "p14" to generate the final stem "stop". Contrast this with the second line for the word "filler"

<stopper> 143->stopp14 131->stop
<filler> 143->fill14 134->fill
<determined> 20->determin14 134->determin
<determinedly> 265->determined 20->determin14 134->determin
<determinedness> 221->determined 20->determin14 134->determin
<determiner> 143->determin14 134->determin
<determiners> 212->determin14 134->determin
<determines> 181->determin
<determinants> 227->determina10 116->determin
<determinate> 41->determina10 116->determin
<determinateness> 221->determinate 41->determina10 116->determin
<determination> 99->determina10 116->determin
<determinations> 203->determina10 116->determin
<determinative> 44->determina10 116->determin
<determinatively> 262->determina10 116->determin
<determinativeness> 221->determinative 44->determina10 116->determin
<determinables> 170->determin
<determinately> 261->determina10 116->determin
<deterministically> 265->deterministical 58->determin

Interactive Testing

The interactive option is invoked by typing "paice azrules.txt" at an MSDOS prompt in the directory containing the program and the rules, or by clicking on the Windows "Start"->"run" and browsing for the "paice.exe" file. Once the executable file is selected, modify the line to be executed by appending a space and "azrules.txt", then click "OK". The interactive option is terminated by typing CTRL+C. The following listing shows the interactive dialog and the specification of the "debugon" option.

>paice azrules.txt

Input line of text to be stemmed:
refinement
refin ement 49

Input line of text to be stemmed:
debugon
<debugon>
debugon

Input line of text to be stemmed:
refinement
<refinement> 236->refine 49->refin
refin ement 49

Input line of text to be stemmed:
terminate with CTRL+C

Acknowledgement

This work was accomplished with funding from the Specialized Information Services Division (SIS) of the National Library of Medicine (NLM).

References:

  1. Ulmschneider, J.E. and Doszkocs, T.E., "A practical stemming algorithm for online search assistance", Online Review, 7(4), 301-318, 1983.
  2. Paice, C.D., 1990: "Another stemmer", SIGIR Forum, 24(3), 56-61, 1990.
  3. Porter, M.F., "An algorithm for suffix stripping", Program, 14(3), 130-137, 1980.
  4. Doszkocs, T. E. and Zamora, A., "Dictionary Services and Spelling Aid for Web-based Information Systems", National Library of Medicine, 2003. Click Here
  5. The Lancaster Stemming Algorithm

© Copyright  - Antonio Zamora