Stemming is a well-known technique for information retrieval.[1] The use of stems for searching has the advantage of increasing recall by retrieving terms that have the same roots but different endings. A major disadvantage of stemming is a decrease of precision as compared to the use of untruncated terms. When searching with stems, it is not uncommon to retrieve many irrelevant terms that have similar roots but which are not related to the object of the search. For accurate retrieval, the search stems should be as long as necessary to achieve precision, but short enough to increase recall. Several commonly-used stemming programs and algorithms were evaluated to try to select a stemmer suitable for information retrieval of large databases. The evaluation was narrowed down to two stemmers: 1) the Paice/Husk stemmer developed at Lancaster University (Paice 1990)[2] which features a rule execution mechanism and externally stored rules, and 2) the Porter stemmer (Porter 1980)[3] which uses algorithmic rules rather than externally stored rules. Neither of these stemmers could be used in their original form because some of the stems generated were not substrings of actual words or the resulting stems were too short. Both of these are important requirements for accurately searching existing large databases.
The flexibility of being able to specify a new set of rules without extensive programming changes made the Paice/Husk stemmer more attractive than the Porter stemmer. The Paice/Husk stemmer is basically a rewrite rule interpreter which may be configured as a finite state automaton by using the appropriate rules. The C-language implementation by Andrew Stark of the Paice/Husk stemmer works adequately, but is not well suited for developing and experimenting with a new set of rules. Consequently, the program was modified to improve the handling of errors in the rules, allow interactive testing, provide more precise stems, and add some flexibility for implementing finite state automata. Fewer than 50 lines of code were added or altered without counting the replacement of the driver. The new driver and debugging options make it possible to test the execution of the rules interactively. This is important because it is possible for the execution of the rules to get in an infinite loop! For example, the rule "e,e,continue" will loop forever when a word ending in "e" is input. The interaction of several rules may also result in infinite loops when they all use the continue flag. The code was modified to prevent infinite loops by stopping when the number of rules executed exceeds twice the number of characters in the input word. The new debugging options helped to solve the mystery of why the original rules generated the stem "abud" from "abusively":
Stemmer driver: | paicedrv.c |
Stemmer: | paice.c |
Include file: | rule.h |
Stemming rules: | azrules.txt |
(tions)->() stopThe suffix 'tions' will not apply because the resulting length would be too small, but 'ions' will apply and generate "act". For the word "lions", 'ions' will not apply, but 's' applies to generate 'lion'.
(ions)->() stop
(s)->() stop
paice.exe azrules.txt data.txt stems1.txt >>msg.txt
invokes the stemmer(paice.exe) and specifies the rules file(azrules.txt), the input file(data.txt), and the output file(stems1.txt). The sysout stream is redirected to the file "msg.txt" which will contain diagnostics and debugging output.
These are some examples of stemmed output using the AZRULES with the modified Paice/Husk stemmer. A space separates the stem from the ending. The number of the last rule applied follows the ending. This list was sorted alphabetically after stemming.
determin ativeness 115
determin e 49
determin ed 133
determin edly 133
determin edness 133
determin er 133
determin ers 133
determin es 180
determin ing 133
determin ism 77
determin ist 244
determin istic 9
determin istically 58
determin ists 215
This is the listing of a portion of the AZRULES as displayed when the first word input is "debugon" or "printrules". Debug output is sent to the sysout data stream, rather than to the output file.
; 14 marker is used to undouble some doubled letters
125 (bb14)->(b) stop
126 (dd14)->(d) stop
127 (ff14)->(f) stop
128 (gg14)->(g) stop
129 (mm14)->(m) stop
130 (nn14)->(n) stop
131 (pp14)->(p) stop
132 (rr14)->(r) stop
133 (tt14)->(t) stop
134 (14)->() stop
; === R ===
135 (ar)->() stop
136 (eer)->(eer) stop
137 (lier)->() stop
138 (ier)->(14) cont.
139 (ener)->() stop
140 (iser)->() stop
141 (izer)->() stop
142 (yzer)->(y) stop
143 (er)->(14) cont.
144 (ator)->(a10) cont.
145 (or)->(14) cont.
146 (eur)->() stop
The following listing shows the results from using the "debugon" option in a batch test. The rule number applied and the changes to the word are indicated in each line. The first line shows how the "er" ending of rule 143 matched against the word "stopper" and replaced it with the special "14" marker. Rule 131 matched "pp14" and stripped "p14" to generate the final stem "stop". Contrast this with the second line for the word "filler"
<stopper> 143->stopp14 131->stop
<filler> 143->fill14 134->fill
<determined> 20->determin14 134->determin
<determinedly> 265->determined 20->determin14 134->determin
<determinedness> 221->determined 20->determin14 134->determin
<determiner> 143->determin14 134->determin
<determiners> 212->determin14 134->determin
<determines> 181->determin
<determinants> 227->determina10 116->determin
<determinate> 41->determina10 116->determin
<determinateness> 221->determinate 41->determina10 116->determin
<determination> 99->determina10 116->determin
<determinations> 203->determina10 116->determin
<determinative> 44->determina10 116->determin
<determinatively> 262->determina10 116->determin
<determinativeness> 221->determinative 44->determina10 116->determin
<determinables> 170->determin
<determinately> 261->determina10 116->determin
<deterministically> 265->deterministical 58->determin
The interactive option is invoked by typing "paice azrules.txt" at an MSDOS prompt in the directory containing the program and the rules, or by clicking on the Windows "Start"->"run" and browsing for the "paice.exe" file. Once the executable file is selected, modify the line to be executed by appending a space and "azrules.txt", then click "OK". The interactive option is terminated by typing CTRL+C. The following listing shows the interactive dialog and the specification of the "debugon" option.
>paice azrules.txt
Input line of text to be stemmed:
refinement
refin ement 49
Input line of text to be stemmed:
debugon
<debugon>
debugon
Input line of text to be stemmed:
refinement
<refinement> 236->refine 49->refin
refin ement 49
Input line of text to be stemmed:
terminate with CTRL+C
This work was accomplished with funding from the Specialized Information Services Division (SIS) of the National Library of Medicine (NLM).