Wednesday, June 06, 2007

Crowdsourcing the digitization of books

You may be familiar with the Open Content Alliance, which is digitizing books and hosting them at the Internet Archive. As you can imagine, when books are scanned, some text on the page can't be properly recognized or converted by the scanning software (specifically, the optical character recognition software). The result is a backlog of messily scanned text that humans must look over and correct before it can be added to the library of digital books.

You may also be familiar with CAPTCHAs, which are those squiggly letters you are asked to type on the web before you can register at some sites. CAPTCHAs are put on web sites to ensure that humans, and not spam-producing computer programs, are filling out forms on the web.

Recently, some ingenious folks at the School of Computer Science at Carnegie Mellon University have come up with a new CAPTCHA tool that asks users to type in words that come from the Open Content Alliance's backlog of unreadable scanned text. Designers of web sites can now add this CAPTCHA tool, called reCAPTCHA, to a page that needs to be protected from spammers. As users of that page type in the words required in the CAPTCHA, the results are passed on to the Open Content Alliance, thereby chipping away at the backlog of messy text scans. You can read more about reCAPTCHA on the project web site. This is a fascinating example of crowdsourcing (for a definition of this term, see this BusinessWeek article from July 2006).
