I’ve recently been reading An Introduction to Information Retrieval by Manning, Raghavan, and Schütze which has a lot of cool concepts in it. I had taken a class on data mining last year, but subsequently forgot most of the IR section. One algorithm I came by today was soundex and is used for indexing words based upon how it sounds(Its a phonetic algorithm). This allows users to search for the same words with similar spellings or misspellings.
The process of converting a word to its soundex code is based on a few simple rules and is very easy to implement in any language. Each word is coded into a letter followed by three numbers which represent the soundex code of the word. The first letter of the word is kept and all other letters are replaced or removed. All vowels, the letter H, and the letter W are removed from the word because they are easily interchangeable or ignorable. If adjacent letters are duplicates then one of them is removed. The rest of the letters are grouped together by how they sound. The groups of letters are as follows
1: B, F, P, V
2: C, G, J, K, Q, S, X, Z
3: D, T
4: L
5: M, N
6: R
After all the letters are replaced with numbers we take the first letter and the first three numbers. If its shorter than four characters long then pad the end with 0′s. This creates the soundex code for a word.
This allows words to be easily compared especially for common misspellings. Misspelled is often misspelled as mispelled. Mispelled is mapped to M214 and Misspelled is mapped to M214 which would be treated as the same word.
When used for indexing or for retrieval from an index. The soundex mapping will help retrieve up words that phonetically are similar and can be used for a rudimentary spell checker.