Officer’s Grant “18thConnect and Open Access Full-Text”
Reference Number: 31000125
January 29, 2011
Interim Report to the Mellon Foundation
(Final Report to be submitted August 2011)
We have learned a great deal so far, more than I could have hoped, about the process of cleaning up the textual output of OCR (Optical Character Recognition programs) that have been run on page images of eighteenth-century texts in order to make these texts findable and usable by scholars of the future, those to whom we leave this archive as our legacy. We knew in applying for the grant that the solution would need to be some combination of human and computer interventions, and we now have a very clear idea of how to proceed in eliciting optimum amounts of each. Before detailing our progress and outlining our next steps, I offer a narrative about our work of the last 4 months in building the Scholarly Editing Tool.
I hired designer Britt Carr to design the tool based on Annolex, as originally planned in the grant:
We were a little unhappy with this design because we did not believe that isolating the image of one word would necessarily help people to make corrections. The words are sometimes unreadable, and sometimes not segmented properly, and the context of other electronically-typed words may not help, especially if they too are indecipherable. I therefore requested from Gale permission to show more than one word of their page images, and we received consent.
Having gained the capacity to use the whole page image, I said to designer Britt Carr, “Go to the Australian Newspaper Digitisation Program (ANDP), use their tool for correcting OCR of newspapers, and design me something as simple and as intuitive.” This program first offered its tool in 2007, and 30 million articles have been corrected since then.
One reason for its immense success is its simple-to-use interface. You can see here that the page image appears on the right hand side, and then an open text box appears in which users can do free typing, with only the instruction: “lines are to be observed.”
The feedback from users described in articles about the ANDP indicates that they valued the way the library trusted them to correct things properly. We wanted, however, to combine the best of both worlds, freedom and constraint, and so we worked with Britt to develop a tool that resembles Annolex in some ways and the ANDP tool in others.
Annolex, the scholarly editing tool created by Martin Mueller and Craig Berry with funding from Robert Taylor of Northwestern University, goes through a text word by word, offering users the capacity to correct each word as well as grammatical information about it, whereas the Australian Newspaper interface provides an open box containing text for the whole page.
In an email (attached), Ray Bankoski, Gale’s VP for Technology, consented to let us use whole page images but asked us not to do so in a way that would make the ECCO product unnecessary to buy. We know that scholars would go through and download a highly coveted text, page by page. We therefore knew that we had to deliver only sections of a page at a time. Using only a partial page image, Britt’s final wireframe design combines the best of both:
The Australian Newspaper Digitisation Program can offer text for correction in a text box a) because they can show a whole page image, and b) because the documents being corrected are relatively small: a “whole” text might only be an article of a few paragraphs. In contrast, a) we cannot reveal whole page images but only thumbnails and half-inch snippets, and b) our texts range from pamphlets to whole books, and so it is easy to lose one’s place in the text. Martin Mueller designed Annolex so that editors would know exactly what they are correcting and see it in a flash, and, thanks to his guidance, we have imitated Annolex in that way. We have divided the text pages into lines (three at a time) so that users can immediately orient themselves, it being very clear which word-images appearing in the half-inch, partial page image are connected to the box in which their cursor can move or type. Newspaper articles are short, but no one could edit a novel in a single sitting, even one that is broken into three volumes. Single-line correction boxes let us send our users back to exactly where they left off correcting their texts the next time they come to 18thConnect to work on them again.
We have modeled our tool on other features of Annolex as well. Whereas the ANDP tool involves no editorial oversight, allowing for massive amounts of data correction, the report “Many Hands” reveals a persistent worry about vandalism. In contrast, modeling our Scholarly Editing Tool on Annolex showed us how to set up editorial oversight of text correction by 18thConnect editors, to be streamlined if massive numbers of correctors sign up. We have implemented a “Review History” button that, when clicked, presents a list of corrections made by multiple users to any given line. Because everyone can see this view, and not just those monitoring the corrections for 18thConnect as a whole, teams of people can work together on correcting difficult texts.
We believe that our tool combines the best of freedom and information control:
One can move around the document in numerous ways, by clicking on places in the thumbnail, moving the larger image up and down, or moving through the lines of text. We ask for correction not one word at a time, but one line at a time, nonetheless allowing people the freedom to insert and delete lines, either through keystrokes or clickable buttons. Right now, red boxes on the thumbnail image and the section of the page image indicate where you are in any given text. Later, we hope to be able to add red squiggly lines in the page images under words that are questionable, likely to have been misread by the OCR. In that way users will be reminded of spelling errors as they appear in Microsoft Word—it is the same iconography—and can, if they wish, leap from squiggly line to squiggly line in order to make corrections. There is a find/replace button that will allow them to correct automatically any error that is the same universally throughout the text, saving them any manual labor that is irritating because too repetitive.
Right now, as is hinted at the very bottom of the above image, Performant Software has set up a “Debugging Info.” table (partly obscured). We are using this utility to see a) whether word counts alone can help us determine relative OCR correctness, so that we can know when to use Gale’s OCR rather than our own, and b) whether measures such as the spread of punctuation marks throughout n-grams as well as statistics about word occurrence can help us suggest to users that a word might be wrong (underline it with red squiggles).
Thus the interface for our Scholarly Editing Tool has been built in Ruby on Rails; its full functioning is currently being programmed. (Right now you can use it, but nothing is reported to editors or saved.) Because we are now in the business of delivering parts of page images as part of the crowdsourcing tool, Annolex will no longer work as a back end. For two weeks, we looked into using the Medici Content Management System developed at the NCSA at the University of Illinois (https://opensource.ncsa.illinois.edu/confluence/display/MMDB/Home). There are in fact open-source correction tools that have been developed for it. Another reason Medici seemed so valuable is that it builds image pyramids for image management and study, which again would protect Gale’s page images from being downloaded.
We came up with two possible scenarios about how to integrate the CMS with 18thConnect and NINES:
Ultimately, though, we decided that the Scholarly Editing Tool needed to be built as a standalone tool for our interface, designed with the understanding that we might be moving it to another system in the future, rather than attempting at this stage to hardwire two disparate systems together.
Meanwhile, computer scientists at Miami were working with the Project Directors and participants at Illinois to run the OCR. In coordination with Gale-Cengage Learning and the National Center for Supercomputing Applications (NCSA), Michael Simeone has transferred images of the 182,000 texts and their accompanying metadata (approximately 5 terabytes) from hard disks shipped from Gale to hardware housed at NCSA. The data exists both on a main laboratory server and in long-term storage in the form of magnetic tape. While stored on the lab server, the data is ready to be served to the supercomputing project space allotted to our research team.
Michael Behrens, doctoral candidate at the University of Illinois, has successfully trained Gamera for every Latinate character to be found in English words printed in Baskerville and Caslon fonts. He has spent countless hours working on the long “s” alone, again with great success; this is something Gale has not solved except in post-processing, and something Google has not solved at all. If one runs “presumption” as a searchable n-gram, one might think that the word was used during the nineteenth century, but not at all during the eighteenth, until one realizes that typefaces drop the long “s” completely around 1820, so searching for “prefumption” is necessary:
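The search expansion described here can be sketched in a few lines of Python. This is only an illustration, not code from 18thConnect: the function name `long_s_variants` is hypothetical, and the sketch assumes that any "s" except the last letter of a word may have been read by the OCR as "f" (printers of the period normally kept the round "s" at the end of a word).

```python
from itertools import product


def long_s_variants(word):
    """Return every spelling of `word` in which a non-final "s" may appear as "f".

    Hypothetical helper for illustration: it expands a modern search term into
    the forms that OCR of long-s typefaces is likely to have produced.
    """
    last = len(word) - 1
    options = []
    for i, ch in enumerate(word):
        if ch == "s" and i != last:
            options.append(("s", "f"))  # a long s may have been read as f
        else:
            options.append((ch,))       # every other letter stays as it is
    return sorted({"".join(combo) for combo in product(*options)})


print(long_s_variants("presumption"))  # ['prefumption', 'presumption']
```

A search interface could run all returned variants as one query, so that a scholar typing the modern spelling still finds the eighteenth-century occurrences.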
However, we have encountered a set of problems with Gamera so that, when its OCR output is good, it is very, very good, and when it is bad, it is horrid. First, there are the many texts that either are in foreign languages or contain passages in these languages, and texts or passages that use alphabets other than Roman. This problem requires moving Gamera’s output from ASCII to Unicode, not yet accomplished by Gamera developers, and more training would be needed for recognizing non-Latinate characters in new fonts. Other fonts needed are primarily Blackletter, and we might be able to benefit from work done for other projects especially in Germany.
The worst problem we are having with Gamera, however, involves line segmentation. When Gamera cannot find the lines, it returns gobbledy-gook:
You can see here from the red box drawn by Gamera that it is trying to read three lines as if they were one line, and it cannot recognize the characters as a result. In the line that Gamera did identify properly, “The Wound received,” the problems come from improper character segmentation: it has been trained to identify ligatures, but, in these images, microfilmed during the 1980s, non-ligatured letters blur together. Because the images are black and white and not gray-scale, there are not enough pixels in our digital historical record to allow for advanced methods of image filtering and cleanup. Per an email from Ray Bankoski (VP of Technology at Gale Cengage), no gray-scale images are available.
We have looked most recently at OCRopus, the OCR engine that has come from Google’s release of Tesseract to open-source developers. One problem with OCRopus is that it has no documentation yet; another is that it does not output XML at all: we are required by our contract with Gale to deliver word coordinates, so that, even if we were willing to abandon TEI encoding, we could not give up XML output.
But OCRopus holds some promise for us nonetheless, and we are currently researching it: it is very good at line recognition. Because Gamera was made originally for music in which blocks of text would be very distinct, and line recognition within those blocks was not wanted, its methods for distinguishing lines are extremely simple: it tries to draw a straight line without hitting any ink, and, wherever it can do that, that position is identified as the bottom or top of a line. We may be able to import OCRopus’s line recognition routines into Gamera, and even the simple adjustments that we plan to make before running the 182,000 texts will help.
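The simple method just described (look for a straight horizontal path that touches no ink) can be sketched as follows. This is not Gamera's code; it is a minimal Python illustration under stated assumptions: the page is a small black-and-white image held as a list of rows, with 1 for ink and 0 for blank, and the function name `find_text_lines` is hypothetical.

```python
def find_text_lines(image):
    """Split a binary page image into text lines at ink-free rows.

    `image` is a list of rows; each row is a list of 0 (blank) or 1 (ink).
    Returns (top, bottom) row indices, inclusive, for each band of inked rows.
    """
    lines = []
    top = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and top is None:
            top = y                      # first inked row: a line begins
        elif not has_ink and top is not None:
            lines.append((top, y - 1))   # a clear row: the line ended above it
            top = None
    if top is not None:                  # page ends while still inside a line
        lines.append((top, len(image) - 1))
    return lines


clean_page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(find_text_lines(clean_page))  # [(1, 2), (4, 4)]

# One stray speck in the gap (row 3) removes the only ink-free row,
# so the two lines are reported as a single band.
noisy_page = [row[:] for row in clean_page]
noisy_page[3][2] = 1
print(find_text_lines(noisy_page))  # [(1, 4)]
```

The second call shows why the approach is fragile on blurred microfilm: any speck, descender, or skew that bridges the gap between lines makes several lines come back as one.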
To launch the crowdsourcing tool, we have verbal agreement from Gale to use their OCR, which is much better than Gamera’s worst, though still heavy on long-s errors. Through post-processing dictionary look-ups, or even the debugging utility developed by Paul Rosen of Performant Software (visible at the bottom of the crowd-sourcing tool), we will be able to rate the level of correctness for all texts output by Gamera, and we will throw out those that fall below the level of Gale’s, using the Gale texts instead (contract pending).
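One way such a dictionary-based rating might work is sketched below in Python. Everything in it is an assumption made for illustration: the function names, the tiny lexicon, and the two sample strings are invented, and a real run would need a period-appropriate word list and would probably normalize long-s spellings first.

```python
import re


def dictionary_score(text, lexicon):
    """Fraction of alphabetic tokens in `text` that appear in `lexicon`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def choose_ocr(gamera_text, gale_text, lexicon):
    """Keep Gamera's output unless it scores below Gale's for the same text."""
    gamera = dictionary_score(gamera_text, lexicon)
    gale = dictionary_score(gale_text, lexicon)
    return ("gamera", gamera) if gamera >= gale else ("gale", gale)


# Invented sample data, for illustration only.
lexicon = {"the", "wound", "received", "was", "not", "mortal"}
gamera_text = "The Wound received wss n0t mortal"
gale_text = "The Wound receivcd was not mortal"

source, score = choose_ocr(gamera_text, gale_text, lexicon)
print(source, round(score, 2))
```

Note that a digit inside a word ("n0t") splits it into two unknown tokens, so character-level garbling is penalized more heavily than a single misspelling; whether that weighting is desirable is a design choice to test against the keyed texts.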
Brad Pasanek, Associate Director of 18thConnect, and I will be meeting with the ASECS (American Society for Eighteenth-Century Studies) President and Executive Council from 12:00 to 2:00 on March 16, 2011, in Vancouver in order to demonstrate the crowd-sourced correction tool which will by then be fully functional, in the form displayed above, and installed on the “My18” page of 18thConnect.org. We will ask the ASECS Executive Council to let us issue accounts and send an email to all ASECS members introducing the tool. Ideally, we will begin getting users interested in correcting particular documents for their research purposes right away, in August 2011, after all our work with Gamera is complete.
- Brian Pytlik Zillig created the tool “18mbda” for transforming OCR output into TEI-A.
- Britt Carr designed our crowd-sourced correction tool.
- Michael Simeone has made the page images available for running through the supercomputer.
- Michael Behrens has fully trained Gamera according to specific constraints, described above.
- Performant Software has built the Interface to the tool and determined the programming necessary to complete it. The backend programming will be complete by March 1.
- Gamera has been run on 2,188 texts to compare its output to the 2,188 keyed texts that we have been given by the Text Creation Partnership. This will help us understand how well it is working, develop look-ups and algorithms for determining text correctness, and formulate a plan for future development.
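The comparison of Gamera output with keyed texts mentioned in the last item could be measured in many ways. The Python sketch below shows one simple possibility, word-level agreement computed with the standard library's `difflib`. It is an assumption about how such a comparison might be done, not a description of the project's method, and the function name and sample strings are invented.

```python
from difflib import SequenceMatcher


def word_accuracy(keyed_text, ocr_text):
    """Share of words in the keyed transcription that the OCR output matches in order."""
    keyed = keyed_text.split()
    ocr = ocr_text.split()
    if not keyed:
        return 0.0
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = SequenceMatcher(a=keyed, b=ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(keyed)


# Invented sample data, for illustration only.
keyed = "the wound received was not mortal"
ocr = "the wound receivcd was not mortal"
print(round(word_accuracy(keyed, ocr), 2))
```

Scores computed this way for the keyed texts could then be set beside a dictionary-based score for the same texts, to see how well the cheaper measure predicts true accuracy.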
3. In Progress
- Laura is composing a new contract with Gale Cengage (Feb-March).
- Laura and Brad will meet with the Bamboo Corpora Space group Feb. 12-13.
- Laura and Brad will meet with ASECS, March 16, 2011.
- Between March and April, Laura will meet with Martin Mueller and Phil Burns to discuss her comparisons of the keyed and Gamera-produced texts, and the best ways of proceeding.
- On behalf of Martin Mueller and Laura Mandell, Ray Siemens is discussing collaboration with Els Van Eyck (a lead administrator in the National Library of the Netherlands) and Adriaan van der Weel (Leiden), both of whose institutions are partnering with ProQuest to digitize their early modern texts.
- Laura will meet with people at the NCSA and I-CHASS (Institute for Computing in the Humanities, Arts, and Social Sciences) at the Univ. of Illinois to discuss future development.
- February: Members of the 18thConnect Boards will test our “Scholarly Editing Tool” before it is released.
- February: Performant Software is finishing under-the-hood programming to link crowd-sourced correction tool to MySQL tables and our SOLR index.
- March through August: The whole set of ECCO images will be run through the Supercomputer “Abe” (400,000 hours; with other jobs running as well, we now estimate that it could take as long as 6 months, which is why we must optimize the training at the outset).
- The tool will be released by August 2011.
There are three main things we need to do:
- A. Further work at 18thConnect to enhance our crowd-sourcing correction capacities so that we can address multiple audiences, multiple potential contributors, and not expert scholars alone;
- B. Figure out a permanent place for the images that are delivered to the correction tool;
- C. Conduct further research into how automatic image reading, statistical methods of correction, and human correction should optimally interact.
I will discuss each need individually, and then offer a possible future scenario involving the Bamboo Corpora Space, Northwestern University, and the HathiTrust.
A. Addressing Multiple Audiences: The Scholarly Editing Tool that we have built will live on the “My18” page, available to anyone who registers at 18thConnect by providing a user name and an email address. Even though the bar for registration is quite low, those who register will probably be people who wish to do research using our online finding aid, expert scholars and other interested intellectuals (lawyers may be among them). On the “My18” page, one can implement the “correct a text” functionality that will allow selecting a text to correct and loading it into the tool.
1) Casual Users:
Right now, when people search 18thConnect, the search returns are listed by author/title with a “Collect” and “Discuss” button next to each return as well as a snippet of text following the title. We would like to return the dirty OCR in those snippets and then have a “Correct” button next to the item. Low-attention users would click on it and be given the opportunity to correct any errors they see without signing in. This would require building a little “Correctlet,” as it might be called, so that a word or two can be fixed in the manner of “Captcha.” At the bottom of its display, the Correctlet will present an offer to “do more,” in which case the user will be invited to register and correct full texts of their choosing. Ideally, we could eventually offer our Correctlet as an exportable thing that could be put onto other sites such as those we have peer-reviewed through 18thConnect and NINES; each of our members could then feed us text-correction information. Also, the IBM World Community Grid has offered to make such a tool available loaded with our data: I will be in contact with them about any Correctlet that we develop.
We would like to hook our Scholarly Editing Tool up to the “Groups” pages in NINES and 18thConnect. If we could do the programming necessary to hook up the Scholarly Editing Tool to Groups, we could offer group editing as a teaching tool. Amy Earhart has already written a ProfHacker article about using NINES exhibits to teach, and so we think that a Group Editing capacity for those teaching college or high-school classes would garner much attention as well. NINES and 18thConnect will thus be able to offer digital editing experience not only to students specializing in Book History and Digital Humanities, but also to students who are taking digital or information literacy classes, as a way of instructing them in how searching works and what it misses and excludes.
Per the attached email from Nick Laiacona, this more complicated and thorough integration of the Scholarly Editing Tool, itself to be completed shortly, would cost an additional $26,400. These funds are needed in addition to what has already been allotted through this officer’s grant about which I am reporting now, the one awarded in July.
B. Permanent Place for Images: As we have begun reducing and slicing up the page images to make them deliverable in our Scholarly Editing Tool, we have discovered that we will need about 10 Terabytes of space to keep those thumbnails and bits, and they need to be served up quickly and efficiently to our users. David Woods of Miami University is looking into the cost of such space at the Ohio Supercomputer Center, and I am discussing with people at the NCSA the possibility of storing them there. But it seems to me that our image thumbs and bits might collectively be a perfect resident of the future Bamboo Corpora Space, to be discussed at a meeting at the University of Maryland February 11-12. We could store page images and texts together, as well as the Scholarly Editing and Correctlet tools in Bamboo. The open-source tools could be checked out and modified or rebuilt in other code-bases; the image pieces and textual data could be used (with Gale’s permission) in other tools.
C. Future Research: I can imagine a way forward in studying OCR and its correction, as a means for preserving the archive that would involve teams of people working together.
1) Research on text images: Benjamin Pauley of Eastern Connecticut State Univ. and Brian Geiger of UC Riverside and the ESTC (English Short Title Catalog) have received a Google grant in order to put ESTC numbers into the metadata for Google-Books page-images. These images have been made much more recently than most of the page images contained in ECCO. The ECCO texts are already tied to ESTC numbers. We could conceivably offer a computer scientist interested in research on images, on OCR, and in high performance computing for Humanities data (Peter Bajscy of the Image Spatial Data Analysis [ISDA] Group at the NCSA, for instance) the opportunity to generate textual data by comparing two images of exactly the same text. It would be ideal, since we already have Gale’s 182,000 page images at the NCSA, to work with ISDA, Brian Geiger and Benjamin Pauley, and the HathiTrust to get these images together with Google images in one place for OCR research. We could investigate how running diff tools on images of varying quality but picturing the same object might improve the textual output of an OCR engine.
2) Research on Images and Interactions: Martin Mueller and Phil Burns of Northwestern University have been offering time and expertise all along; I have even visited them while working on this project in order to get help and advice. It seems to me, after much deliberation with them, that there are statistical methods for determining OCR errors, including the long-s problem, methods that, if successful, would make further improvement of OCR engines unnecessary. As a computer scientist and statistician, Phil Burns would be the perfect person to develop post-processing routines for automatically correcting typical OCR errors. Ideally as well, we could enlist the help of Doug Downey, an assistant professor in computer science at Northwestern University, to determine optimum ways of getting the crowd-sourced correction tool to interact with post-processing algorithms, fine-tuning their interactions for optimum results. These results include not only improved, correct texts, but also happy user-correctors who are not frustrated by being asked to perform too many purely repetitive mechanical tasks. Being asked to use their minds in ways that machines cannot will, I believe, keep them engaged.
Conversations with Gale about its work on the British Library’s Burney Newspaper Collection (http://www.bl.uk/reshelp/findhelprestype/news/burneynews/index.html, http://gale.cengage.co.uk/product-highlights/history/19th-century-british-library-newspapers.aspx), as well as with non-profits such as the NCSE (Nineteenth-Century Electronic Serials Edition, positively peer-reviewed by NINES and supported by King’s College Centre for Computing in the Humanities: http://www.ncse.ac.uk/index.html) have convinced me that much work like this is needed if we are to be able to archive our cultural heritage in anything like usable form.