OCR Summit Meeting
The IDHMC at Texas A&M University held an OCR Summit meeting on October 17-18, 2011, bringing together experts to work on the problem of Optical Character Recognition (OCR) for early modern texts, whose printing techniques make it difficult for machines to transcribe text by “reading” page images. Though computer scientists have solved many of the problems that arise in attempts to read and transcribe pages mechanically, these solutions have not yet been implemented on a large scale, except by the IMPACT group headquartered at the National Library of the Netherlands. Google and other text providers have done the best that they can, but now is precisely the moment when scholarly understanding of book history can help them train their OCR engines more successfully. Users are typically not aware of how flawed mechanically transcribed text can be for works published before 1820, and our cultural heritage is at risk of being lost.
Saving page images alone is not sufficient for finding and using the works of literature and documents important to understanding history, so this group is working on how to create better OCR engines directed at specific types of texts for maximum correctness. Once that is accomplished, the resulting texts can be used for data-mining: for determining, for instance, when a word was first used, as the Oxford English Dictionary has attempted to do through anecdotal information. They can be used to analyze history and literature quantitatively, to locate the texts that contribute to those numbers, and then to zoom back into each text to read it. We may find out some things about literature and history that surprise us!
You will find the details of the meeting on our CommentPress blog. The PowerPoint slides presented by participants follow.
Screencast (without sound) of the Cobre tool as used in the Primeros Libros Project.