Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
1000x Faster Spelling Correction: Source Code released (faroo.com)
113 points by boyter on Jan 13, 2014 | hide | past | favorite | 34 comments


If you compare to Norvig's first corrector you should also compare to the much faster http://norvig.com/ngrams (which follows an algorithm I sent him).

Added: I'd guess the newer Norvig corrector is still slower, though it is more accurate (taking into account probabilities of different typos). There's a curiosity not mentioned in his article, that the new algorithm computes a very slightly different set of candidates. I only noticed the difference when I got around to checking these sets against each other for all the words in a sample document -- I'd meant them to be the same. Might be fun to check the OP's code this way too.


The more interesting blog post that actually describes the algorithm is here: http://blog.faroo.com/2012/06/07/improved-edit-distance-base...


This is really cool! It reminds me of BiDirectional BFS for path finding. In this case searching from both directions cuts down the search space by three orders of magnitude. Genius.



I notice the authors don't actually talk about accuracy. It's 1,000x faster, but what is gained or lost in terms of accuracy?

We can build a spelling corrector which is accurate most of the time and much faster than the Faroo implementation:

    def corrections_for(word)
      return [word]
    end
P.S. I added a gist here with some reformatting because I found it hard to read the code on the author's site:

https://gist.github.com/fj/8393399

The raw source is here:

https://gist.github.com/fj/8393399/raw/7d4acd14db21dd0c7d802...


It looks like what they've done is more along the lines of finding a clever trick to improve the runtime performance of one of the common building blocks of a spell correction system (the part that given an input computes a set of candidate words that are close to it in terms of edit distance). It doesn't even seem to say anything about how to pick the best candidate once you've generated this set, so you can't really evaluate its accuracy as if it were a complete spell correction system. Norvig originally just used the straightforward "pick the most common word that has the smallest edit distance", but I didn't see them mention anything about any kind of strategy on their blog post.


In essence it appears to be a pre-filtering step before you look at the actual edit distance.

I'm also not sure how they benchmarked the other solution, as that clearly isn't optimised for runtime but for readability (it's a nice bit of code).


They seem to be comparing to Peter Norvig's spell corrector.

Compared to that, they have the exact same results. What they present is a faster way to find all dictionary words with edit distance <= 2 from a given word.


Isn't Norvig's spell corrector a really bare-bones implementation that he wrote while bored on a plane? Like, not really industrial-strength or anything. It's a very cool demo, but I wouldn't consider it worth benchmarking against, or anything... unless I've misunderstood what's going on here.


Plus they're benchmarking C# against Python. Python is generally 50-100x slower than C++, while C# is <2x slower. (based on computer languages shootout)


> I added a gist here with some reformatting because I found it hard to read the code on the author's site:

Thank you. Funny enough, that gist has the same problem on my screen (too much whitespace on either side of a box that I need to horrizontally scroll) - it's just not as bad.

Here's how the site shows up for me: http://imgur.com/CjXRHBC

Grumble.


Try using the raw link I posted afterwards. That'll give you as much room as your browser window has.


They are trading space for time.

Roughly speaking, they pre-compute some of the character substitutions done in traditional algorithm, and add it to the dictionary.


This approach is far from been perfect. A simple recursive implementation of Levenshtein on a Trie representing the dictionary would produce better performance!


Can you elaborate? I don't see how you'd be able to come up with something 'simple' that still has a decent performance. Especially if the number of acceptable errors is configurable and 3 (or higher).


I think you mean a trie representing the dictionary plus edit-distance-2 (like the OP's algorithm does). There's no easy way to "skip" characters in a trie.


You can skip characters in a trie, at least towards the right-hand-end of a word, by recursing down every path. I've done so to good (and performant) effect, before.


You wouldn't have to. As described in the article, the dictionary maps from possible entry texts to the corrected terms.


In general a dynamic programming/DFA style approach would provide so much more win than what is discussed in this article.


I'm really intrigued. Working in the OCR field, this is something that we do a lot, using inhouse libraries or even dedicated 3rd party solutions. I'll definitely give this a spin in the next couple of days.

I do wonder what happened to normal C# style here though. Had a hard time reading this, because 'suggestItem' or 'editItem' etc. doesn't _look_ like a type/class. A single uppercase/lowercase change and I stumbled a couple of times.


This algorithm seems to be based on edit distance so should be a poor fit for the OCR field, since OCR rarely swaps letters.


I'm not quite sure what you're saying here. Swap letters as in transpositions? Yes, correct. Usually the errors are simple replaces or deletes/inserts (which arguable is might include 'swapping' an 1 for an l).

But the greater field I'm working in doesn't hand you random OCR and that's it. Most projects here contain a way for typists to correct recognition mistakes or complete the missing pieces of information on a document. For that (-> human typist, often you have a database with valid/expected values for fields) transpositions aren't rare at all.


This is exactly the approach described in: http://link.springer.com/chapter/10.1007%2F978-3-642-36973-5...

This is not new or not previously known. Sensational title.


Ugg, in the first comment block one line is 128 characters long. The whole thing is abhorrent to read in the nicely-columnized blog post; you have to scroll to the right constantly.


Agreed. Bring back line-wrap, it's just stupid web design otherwise.


Has anyone explored using a covering code for spelling correction? (e.g. https://eprint.iacr.org/2012/731.pdf‎ )


Not directly related, but does any algorithm takes into account the layout of the keyboard? E.g., alphabets closer together may be more likely candidates for replacement...



Is there such a thing as 'typo checking'? I'd like something that is able to correct accidentally swapped adjacent letters. I'm imagining something that looks at the sequence and timing of keydown and keyup events to make a correction rather than just the finished word.


Every phone for the last 10 years has had this.


That's interesting from an academic point of view, but in my opinion spell checkers that don't take into account the context in the sentence (words before, syntax, context) are so 1990...

I'd prefer to see research in this direction rather than speed improvement on a method we all know doesn't really work. Because your typo is close to one entry (an existing word) in a list (the dictionary) that you maybe never heard of before doesn't mean you intended to type this.


I have a toy project (In Java and Dart) doing spell checking. In a 10.000 word test dictionary, for 1 distance it finds around 11.000 corrections per second. For 2 distance 1000 check per second. It does not use Language models yet. And there are room for performance improvement.

This is the Dart version: https://github.com/ahmetaa/dart-spell


I can't help but wonder how this would fare in a fast language with some optimisation and against real world spell check implementations...

Would we see the same performance increase in practice?


where is the bottle neck in such code? why is this faster than say Norvig spell checker? http://norvig.com/spell-correct.html




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: