ScientificPsychic.com - Expand your mind, improve your body, have fun

2007-05-29
 

Database of Misspellings

On August 03, 2006, Google made available a database of words and adjacent word combinations, called N-grams, obtained by scanning 1,024,908,267,229 words of running text. The file of single words, or unigrams, contains 13,588,391 unique words.

On the surface, this looks like a mountain of gold, but alas, the correct words are just as scarce as real gold. My unofficial estimate from looking at the list of single words is that up to 95% of the unique words are misspellings. Although the file has useful linguistic applications, not much can be done with it without a great expenditure of manual effort. To give you an idea of what you can find in the unigram file, here is the word "BUSINESS" and some of its misspelled forms with their frequencies. The frequencies are shown after consolidating differences in case; for example, businnes 2771, Businnes 1692, and BUSINNES 556 are combined into businnes 5019.

business 637134177, busines 475319, buisness 414822, busnes 325022, bussiness 267372, bussines 62980, buisiness 57730, busness 50547, busnesau 41934, businss 41188, bisiness 35434, businness 31919, busienss 29023, busniess 28329, businees 27374, businesse 26663, bsuiness 17297, busineess 16524, busuness 15509, busiiness 15095, businese 14907, buziness 14568, ubsiness 14367, bbusiness 14319, businesa 14126, busibess 13950, busineas 13819, busoness 13815, buseness 12346, busiuness 11329, businessa 11106, busieness 11081, buesiness 11048, busineass 11026, businiss 10846, buissness 10832, busioness 10817, busainess 10608, busuiness 10590, busoiness 10548, buseiness 10521, busibness 10475, businiess 10420, businessi 10330, buseeness 10234, businesso 10224, businessu 9845, businoess 9814, businoss 9813, busineiss 9732, busineoss 9704, bisnes 8466, bisness 7784, busnois 5288, businnes 5019, busseness 3596, buiseness 3252, buisnes 3186
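The case consolidation behind these figures can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the list: the function name consolidate_case and the (word, count) pair format are assumptions, and the only data used are the three "businnes" variants quoted above.

```python
from collections import Counter


def consolidate_case(entries):
    """Sum the counts of entries that differ only in letter case.

    `entries` is an iterable of (word, count) pairs; this input format
    is an assumption made for the sketch, not the layout of the
    unigram file itself.
    """
    totals = Counter()
    for word, count in entries:
        # Lowercasing maps "Businnes" and "BUSINNES" onto "businnes".
        totals[word.lower()] += count
    return totals


# The three case variants quoted in the post.
raw_entries = [("businnes", 2771), ("Businnes", 1692), ("BUSINNES", 556)]
totals = consolidate_case(raw_entries)
print(totals["businnes"])  # 5019, the figure listed for "businnes"
```

Lowercasing is the simplest possible key. It also merges proper nouns and acronyms with ordinary words, which may or may not be wanted for other entries.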

Comments »

Josh Rubin said,
2007-09-07 @ 23:24:09

I don't see the problem, and I'm not surprised by the result. 95% of the words are misspelled (I almost spelled that wrong!), but that statistic is based on unweighted data: there are half a billion correct spellings, and a few million incorrect ones.
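The difference between the unweighted and the weighted view can be checked against the "business" counts listed in the post. The sketch below is a rough illustration under stated assumptions: it uses only the correct spelling and the five most frequent misspellings from that list, not the whole unigram file, and the function name misspelling_shares is made up for the example.

```python
# Subset of the consolidated counts from the post: the correct spelling
# plus the five most frequent misspellings (not the full list).
counts = {
    "business": 637134177,
    "busines": 475319,
    "buisness": 414822,
    "busnes": 325022,
    "bussiness": 267372,
    "bussines": 62980,
}
CORRECT = {"business"}


def misspelling_shares(counts, correct):
    """Return (share of distinct forms, share of occurrences) that are misspelled."""
    wrong_forms = [word for word in counts if word not in correct]
    wrong_occurrences = sum(counts[word] for word in wrong_forms)
    by_form = len(wrong_forms) / len(counts)               # unweighted
    by_occurrence = wrong_occurrences / sum(counts.values())  # weighted by frequency
    return by_form, by_occurrence


by_form, by_occurrence = misspelling_shares(counts, CORRECT)
print(f"unweighted: {by_form:.1%}, weighted: {by_occurrence:.3%}")
```

In this subset most of the distinct forms are misspellings, yet they account for well under 1% of the occurrences, which is the point of the comment above.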




© Copyright  - Antonio Zamora