If you compare to Norvig's first corrector you should also compare to the much faster http://norvig.com/ngrams (which follows an algorithm I sent him).
Added: I'd guess the newer Norvig corrector is still slower, though it is more accurate (taking into account probabilities of different typos). There's a curiosity not mentioned in his article, that the new algorithm computes a very slightly different set of candidates. I only noticed the difference when I got around to checking these sets against each other for all the words in a sample document -- I'd meant them to be the same. Might be fun to check the OP's code this way too.
This is really cool! It reminds me of BiDirectional BFS for path finding. In this case searching from both directions cuts down the search space by three orders of magnitude. Genius.
It looks like what they've done is more along the lines of finding a clever trick to improve the runtime performance of one of the common building blocks of a spell correction system (the part that given an input computes a set of candidate words that are close to it in terms of edit distance). It doesn't even seem to say anything about how to pick the best candidate once you've generated this set, so you can't really evaluate its accuracy as if it were a complete spell correction system. Norvig originally just used the straightforward "pick the most common word that has the smallest edit distance", but I didn't see them mention anything about any kind of strategy on their blog post.
They seem to be comparing to Peter Norvig's spell corrector.
Compared to that, they have the exact same results. What they present is a faster way to find all dictionary words with edit distance <= 2 from a given word.
Isn't Norvig's spell corrector a really bare-bones implementation that he wrote while bored on a plane? Like, not really industrial-strength or anything. It's a very cool demo, but I wouldn't consider it worth benchmarking against, or anything... unless I've misunderstood what's going on here.
Plus they're benchmarking C# against Python. Python is generally 50-100x slower than C++, while C# is <2x slower. (based on computer languages shootout)
> I added a gist here with some reformatting because I found it hard to read the code on the author's site:
Thank you. Funny enough, that gist has the same problem on my screen (too much whitespace on either side of a box that I need to horrizontally scroll) - it's just not as bad.
This approach is far from been perfect. A simple recursive implementation of Levenshtein on a Trie representing the dictionary would produce better performance!
Can you elaborate? I don't see how you'd be able to come up with something 'simple' that still has a decent performance. Especially if the number of acceptable errors is configurable and 3 (or higher).
I think you mean a trie representing the dictionary plus edit-distance-2 (like the OP's algorithm does). There's no easy way to "skip" characters in a trie.
You can skip characters in a trie, at least towards the right-hand-end of a word, by recursing down every path. I've done so to good (and performant) effect, before.
I'm really intrigued. Working in the OCR field, this is something that we do a lot, using inhouse libraries or even dedicated 3rd party solutions. I'll definitely give this a spin in the next couple of days.
I do wonder what happened to normal C# style here though. Had a hard time reading this, because 'suggestItem' or 'editItem' etc. doesn't _look_ like a type/class. A single uppercase/lowercase change and I stumbled a couple of times.
I'm not quite sure what you're saying here. Swap letters as in transpositions? Yes, correct. Usually the errors are simple replaces or deletes/inserts (which arguable is might include 'swapping' an 1 for an l).
But the greater field I'm working in doesn't hand you random OCR and that's it. Most projects here contain a way for typists to correct recognition mistakes or complete the missing pieces of information on a document. For that (-> human typist, often you have a database with valid/expected values for fields) transpositions aren't rare at all.
Ugg, in the first comment block one line is 128 characters long. The whole thing is abhorrent to read in the nicely-columnized blog post; you have to scroll to the right constantly.
Not directly related, but does any algorithm takes into account the layout of the keyboard? E.g., alphabets closer together may be more likely candidates for replacement...
Is there such a thing as 'typo checking'? I'd like something that is able to correct accidentally swapped adjacent letters. I'm imagining something that looks at the sequence and timing of keydown and keyup events to make a correction rather than just the finished word.
That's interesting from an academic point of view, but in my opinion spell checkers that don't take into account the context in the sentence (words before, syntax, context) are so 1990...
I'd prefer to see research in this direction rather than speed improvement on a method we all know doesn't really work. Because your typo is close to one entry (an existing word) in a list (the dictionary) that you maybe never heard of before doesn't mean you intended to type this.
I have a toy project (In Java and Dart) doing spell checking. In a 10.000 word test dictionary, for 1 distance it finds around 11.000 corrections per second. For 2 distance 1000 check per second. It does not use Language models yet. And there are room for performance improvement.
Added: I'd guess the newer Norvig corrector is still slower, though it is more accurate (taking into account probabilities of different typos). There's a curiosity not mentioned in his article, that the new algorithm computes a very slightly different set of candidates. I only noticed the difference when I got around to checking these sets against each other for all the words in a sample document -- I'd meant them to be the same. Might be fun to check the OP's code this way too.