1000x Faster Spelling Correction: Source Code released

abecedarius · on Jan 13, 2014

If you compare to Norvig's first corrector you should also compare to the much faster http://norvig.com/ngrams (which follows an algorithm I sent him).

Added: I'd guess the newer Norvig corrector is still slower, though it is more accurate (taking into account probabilities of different typos). There's a curiosity not mentioned in his article, that the new algorithm computes a very slightly different set of candidates. I only noticed the difference when I got around to checking these sets against each other for all the words in a sample document -- I'd meant them to be the same. Might be fun to check the OP's code this way too.

haberman · on Jan 13, 2014

The more interesting blog post that actually describes the algorithm is here: http://blog.faroo.com/2012/06/07/improved-edit-distance-base...

somethingnew · on Jan 13, 2014

This is really cool! It reminds me of BiDirectional BFS for path finding. In this case searching from both directions cuts down the search space by three orders of magnitude. Genius.

chatman · on Jan 13, 2014

This is the same as http://link.springer.com/chapter/10.1007%2F978-3-642-36973-5...

jxf · on Jan 13, 2014

I notice the authors don't actually talk about accuracy. It's 1,000x faster, but what is gained or lost in terms of accuracy?

We can build a spelling corrector which is accurate most of the time and much faster than the Faroo implementation:

    def corrections_for(word)
      return [word]
    end

P.S. I added a gist here with some reformatting because I found it hard to read the code on the author's site:

https://gist.github.com/fj/8393399

The raw source is here:

https://gist.github.com/fj/8393399/raw/7d4acd14db21dd0c7d802...

pyduan · on Jan 13, 2014

It looks like what they've done is more along the lines of finding a clever trick to improve the runtime performance of one of the common building blocks of a spell correction system (the part that given an input computes a set of candidate words that are close to it in terms of edit distance). It doesn't even seem to say anything about how to pick the best candidate once you've generated this set, so you can't really evaluate its accuracy as if it were a complete spell correction system. Norvig originally just used the straightforward "pick the most common word that has the smallest edit distance", but I didn't see them mention anything about any kind of strategy on their blog post.

IanCal · on Jan 13, 2014

In essence it appears to be a pre-filtering step before you look at the actual edit distance.

I'm also not sure how they benchmarked the other solution, as that clearly isn't optimised for runtime but for readability (it's a nice bit of code).

kenferry · on Jan 13, 2014

They seem to be comparing to Peter Norvig's spell corrector.

Compared to that, they have the exact same results. What they present is a faster way to find all dictionary words with edit distance <= 2 from a given word.

Kluny · on Jan 13, 2014

Isn't Norvig's spell corrector a really bare-bones implementation that he wrote while bored on a plane? Like, not really industrial-strength or anything. It's a very cool demo, but I wouldn't consider it worth benchmarking against, or anything... unless I've misunderstood what's going on here.

Scaevolus · on Jan 13, 2014

Plus they're benchmarking C# against Python. Python is generally 50-100x slower than C++, while C# is <2x slower. (based on computer languages shootout)

WizzleKake · on Jan 13, 2014

> I added a gist here with some reformatting because I found it hard to read the code on the author's site:

Thank you. Funny enough, that gist has the same problem on my screen (too much whitespace on either side of a box that I need to horrizontally scroll) - it's just not as bad.

Here's how the site shows up for me: http://imgur.com/CjXRHBC

Grumble.

jxf · on Jan 13, 2014

Try using the raw link I posted afterwards. That'll give you as much room as your browser window has.

pygy_ · on Jan 13, 2014

They are trading space for time.

Roughly speaking, they pre-compute some of the character substitutions done in traditional algorithm, and add it to the dictionary.

jlemoine · on Jan 13, 2014

This approach is far from been perfect. A simple recursive implementation of Levenshtein on a Trie representing the dictionary would produce better performance!

darklajid · on Jan 13, 2014

Can you elaborate? I don't see how you'd be able to come up with something 'simple' that still has a decent performance. Especially if the number of acceptable errors is configurable and 3 (or higher).

colanderman · on Jan 13, 2014

I think you mean a trie representing the dictionary plus edit-distance-2 (like the OP's algorithm does). There's no easy way to "skip" characters in a trie.

ghusbands · on Jan 13, 2014

You can skip characters in a trie, at least towards the right-hand-end of a word, by recursing down every path. I've done so to good (and performant) effect, before.

cbsmith · on Jan 13, 2014

You wouldn't have to. As described in the article, the dictionary maps from possible entry texts to the corrected terms.

cbsmith · on Jan 13, 2014

In general a dynamic programming/DFA style approach would provide so much more win than what is discussed in this article.

darklajid · on Jan 13, 2014

I'm really intrigued. Working in the OCR field, this is something that we do a lot, using inhouse libraries or even dedicated 3rd party solutions. I'll definitely give this a spin in the next couple of days.

I do wonder what happened to normal C# style here though. Had a hard time reading this, because 'suggestItem' or 'editItem' etc. doesn't _look_ like a type/class. A single uppercase/lowercase change and I stumbled a couple of times.

jeltz · on Jan 13, 2014

This algorithm seems to be based on edit distance so should be a poor fit for the OCR field, since OCR rarely swaps letters.

darklajid · on Jan 13, 2014

I'm not quite sure what you're saying here. Swap letters as in transpositions? Yes, correct. Usually the errors are simple replaces or deletes/inserts (which arguable is might include 'swapping' an 1 for an l).

But the greater field I'm working in doesn't hand you random OCR and that's it. Most projects here contain a way for typists to correct recognition mistakes or complete the missing pieces of information on a document. For that (-> human typist, often you have a database with valid/expected values for fields) transpositions aren't rare at all.

chatman · on Jan 13, 2014

This is exactly the approach described in: http://link.springer.com/chapter/10.1007%2F978-3-642-36973-5...

This is not new or not previously known. Sensational title.

humbledrone · on Jan 13, 2014

Ugg, in the first comment block one line is 128 characters long. The whole thing is abhorrent to read in the nicely-columnized blog post; you have to scroll to the right constantly.

joosters · on Jan 13, 2014

Agreed. Bring back line-wrap, it's just stupid web design otherwise.

nullc · on Jan 13, 2014

Has anyone explored using a covering code for spelling correction? (e.g. https://eprint.iacr.org/2012/731.pdf‎ )

alok-g · on Jan 13, 2014

Not directly related, but does any algorithm takes into account the layout of the keyboard? E.g., alphabets closer together may be more likely candidates for replacement...

_r5wf · on Jan 13, 2014

Mine has but only for Turkish so far:

https://github.com/ahmetaa/leblebi/blob/master/src/main/java...

muxxa · on Jan 13, 2014

Is there such a thing as 'typo checking'? I'd like something that is able to correct accidentally swapped adjacent letters. I'm imagining something that looks at the sequence and timing of keydown and keyup events to make a correction rather than just the finished word.

gaius · on Jan 13, 2014

Every phone for the last 10 years has had this.

ar7hur · on Jan 13, 2014

That's interesting from an academic point of view, but in my opinion spell checkers that don't take into account the context in the sentence (words before, syntax, context) are so 1990...

I'd prefer to see research in this direction rather than speed improvement on a method we all know doesn't really work. Because your typo is close to one entry (an existing word) in a list (the dictionary) that you maybe never heard of before doesn't mean you intended to type this.

_r5wf · on Jan 13, 2014

I have a toy project (In Java and Dart) doing spell checking. In a 10.000 word test dictionary, for 1 distance it finds around 11.000 corrections per second. For 2 distance 1000 check per second. It does not use Language models yet. And there are room for performance improvement.

This is the Dart version: https://github.com/ahmetaa/dart-spell

jheriko · on Jan 13, 2014

I can't help but wonder how this would fare in a fast language with some optimisation and against real world spell check implementations...

Would we see the same performance increase in practice?

mrcactu5 · on Jan 13, 2014

where is the bottle neck in such code? why is this faster than say Norvig spell checker? http://norvig.com/spell-correct.html