Honing transcriptions with algorithms and acumen

A question I often hear from paleographers who contribute transcriptions to Early Modern Manuscripts Online (or EMMO) is: What are you going to do with all these transcriptions? It’s a good question—central to the whole project, actually—but it’s also a complicated one. The short answer I usually give goes something like this: We aim to gather multiple independent transcriptions for each digitized page and compare them to create an aggregate transcription which an expert paleographer then checks over for accuracy. This explanation tends to satisfy most inquiries, and it has the added benefit of being true, but how exactly does it all happen?

Just as the broad question contains lots of smaller questions, the short answer leaves out lots of steps. To start, the multiple independent transcriptions may come from many sources: transcribers in the Folger network, OCR versions of printed transcriptions, crowd-sourced output from Shakespeare’s World, and participants in EMMO-sponsored courses and events. But what is used to compare transcriptions, and how is an aggregate transcription created? For that matter, what is an aggregate transcription? How are transcriptions checked, and by whom (or what)? What happens after the transcription is checked?

Whew, that’s a lot to cover. Let’s relax and go back to comparing transcriptions and work our way through things from there.

We have had the general comparing piece in place for a while. Dromio, the online transcription tool developed at the Folger, has an excellent collation feature that overlays user transcriptions (entered using Dromio) so that one can see where they match up and where they don’t. The example below shows how this works with a poem from a miscellany (V.a.245) transcribed during one of our Practical Paleography (or PracPaleo) sessions earlier this year.

Dromio collation view with transcriptions overlaying one another

Dromio collation view of fol. 38v

Five transcriptions of the poem’s first stanza appear on the screen along with part of the digitized manuscript image at the top. Text in black shows agreement among the transcriptions while text in red shows disagreement.
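For readers who like to see the logic spelled out, here is a minimal sketch of the black-versus-red idea. It is not Dromio's code: the function name, the word-by-word alignment on whitespace, and the sample line are all assumptions made up for illustration.

```python
from itertools import zip_longest


def collate(transcriptions):
    """Mark each word position as agreed ("black") or disputed ("red")."""
    # Assumption: splitting on whitespace lines the words up position by
    # position. Real transcriptions would need a proper alignment step.
    rows = [t.split() for t in transcriptions]
    result = []
    for readings in zip_longest(*rows, fillvalue=""):
        distinct = sorted(set(readings))
        colour = "black" if len(distinct) == 1 else "red"
        result.append((colour, distinct))
    return result


# Invented sample line, not the text of the manuscript.
versions = [
    "sweet straynes of musick",
    "sweet straynes of musick",
    "sweet straines of musick",
]
for colour, readings in collate(versions):
    print(colour, readings)
```

Note that in this sketch a single dissenting reading is enough to turn a position red, however many transcribers agree with one another.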

This component of Dromio is especially useful as a learning aid for paleography, and indeed, that was the main reason for the development of the collation feature. An instructor and participants can quickly identify words or sections that were difficult to transcribe, the group can discuss these, and users can correct their individual transcriptions in the application. That’s great as far as it goes. To assemble a corpus of encoded transcriptions on the web for EMMO, however, we needed to go further.

First of all, the red text of dissent appears even when a majority of transcriptions are in accord. We realized it would be helpful to know if a version of a word had greater agreement while still noting any disagreement (it would be easier to read, too). The collation view also doesn’t show the various encoding tags of words or letters, e.g., deletion (for text struck through or cancelled by the scribe) or heading (for titles of sections). Finally and perhaps most significantly, the collation view is only a view. One cannot make edits to it or save an edited version, and such capabilities would clearly have benefits for generating a more polished transcription.

After several discussions with Mike Poston, the programmer for EMMO (and original developer of Dromio), plus a fair amount of contemplation and computing work on his part, the Vetting feature was added this past spring. It marks an important step forward in helping us arrive at one collective transcription from many—an aggregate transcription.

Dromio Vetting feature showing a transcription with the most agreement and tags

Dromio Vetting feature for fol. 38v.

Drawing on the same five transcriptions for this page, this time including encoding tags, the vetting feature displays the version with the most agreement among transcribers. Disagreement is indicated by red underlining (similar to the collation view), but the majority opinion of a word or tag appears on the screen, rather than as an overlay of transcriptions. An expert paleographer may now check this one enhanced version rather than working through all of the individual transcriptions with manual copying and pasting. By clicking on the underlined words, he or she may review any alternates and choose to leave the choice of the majority, select a minority interpretation, or enter something else.
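The majority-opinion idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the Vetting feature's implementation: the `aggregate` function, the whitespace alignment, the arbitrary handling of ties, and the five sample readings are all hypothetical.

```python
from collections import Counter
from itertools import zip_longest


def aggregate(transcriptions):
    """Pick the most common reading at each word position."""
    # Assumption: words line up by position after splitting on whitespace.
    rows = [t.split() for t in transcriptions]
    result = []
    for readings in zip_longest(*rows, fillvalue=""):
        counts = Counter(readings)
        # Ties are not handled specially in this sketch.
        majority, _ = counts.most_common(1)[0]
        alternates = [r for r in counts if r != majority]
        result.append({
            "word": majority,
            "alternates": alternates,
            "disputed": bool(alternates),  # would get a red underline
        })
    return result


# Invented sample readings, not the text of the manuscript.
five = ["sweet straynes"] * 4 + ["sweet straines"]
for entry in aggregate(five):
    print(entry)
```

The alternates are kept alongside the majority reading so that a vetter could still review them and overrule the majority where it is wrong.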

Word edit feature in Vetting: "straynes" vs. "straines"

Dromio Vetting view: word edit feature

In the above example, the choices “straynes” and “straines” are shown for the underlined third word “straynes” in the second line. Someone checking would know that more people transcribed this word as “straynes” but at least one entered “straines” in his or her transcription. In this case, it is clear that the majority is right, so the paleographer vetting this page can leave “straynes” as-is or click to confirm it, if desired, and move on through the rest. If a change is made, whether a majority opinion is confirmed or an alternate is judged to be correct instead (majorities are wrong sometimes), the edited word appears with a green underline to indicate the update. Any word or tag may be edited at the vetting stage, not only underlined ones. Pressing “Save” ensures such edits are stored in the vetted collective transcription.
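One way to picture the state of a single word during vetting is sketched below. The `VettedWord` class and its method names are hypothetical, chosen only to mirror the behavior described above; they are not taken from Dromio.

```python
from dataclasses import dataclass, field


@dataclass
class VettedWord:
    """Hypothetical record for one word in the collective transcription."""
    majority: str
    alternates: list = field(default_factory=list)
    final: str = ""
    edited: bool = False

    def __post_init__(self):
        # Until a vetter acts, the majority reading stands.
        if not self.final:
            self.final = self.majority

    def choose(self, reading):
        """Confirm the majority, pick an alternate, or enter something new."""
        self.final = reading
        self.edited = True

    def underline(self):
        # Green once the vetter has acted, red if still disputed, else none.
        if self.edited:
            return "green"
        return "red" if self.alternates else None


word = VettedWord("straynes", ["straines"])
print(word.underline())   # disputed and not yet reviewed
word.choose("straynes")   # the vetter confirms the majority
print(word.final, word.underline())
```

Because `choose` accepts any reading, the same call covers confirming the majority, selecting a minority reading, and typing in something new, and it works on words that were never underlined.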

Encoding tags are another aspect of the new vetting feature. As seen in the example above, small brackets indicate the presence of encoding tags applied to letters or words in the underlying transcriptions that form the collective or aggregate one. The light green outline around a pair of brackets (it must be a pair) signals that more than one transcriber agreed on marking a particular tag. For example, “<In commendation of Musick./>” has the heading tag applied to it, and “<can scarce>” has the deletion tag. These instances of mark-up would not be discernible in the Dromio collation view, but do show in the Vetting feature. Moreover, someone checking through this page can confirm or reject these tags easily with one click in much the same way spelling is affirmed.
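The "more than one transcriber agreed" rule for tags can be sketched as a simple count. Everything here is an assumption for illustration: the `tag_support` function, the representation of a tag as a (tag name, text) pair, and the vote counts in the sample data, which are invented even though the two tagged phrases come from the example above.

```python
from collections import Counter


def tag_support(tagged_by_transcriber):
    """Count how many transcribers applied each (tag, text) pair."""
    counts = Counter()
    for tags in tagged_by_transcriber:
        counts.update(set(tags))  # each transcriber counts once per tag
    # "agreed" mirrors the green outline: more than one transcriber.
    return {tag: {"votes": n, "agreed": n >= 2} for tag, n in counts.items()}


# Invented vote pattern for five transcribers.
marks = [
    [("heading", "In commendation of Musick."), ("deletion", "can scarce")],
    [("heading", "In commendation of Musick.")],
    [("heading", "In commendation of Musick."), ("deletion", "can scarce")],
    [],
    [("deletion", "scarce")],
]
for tag, support in tag_support(marks).items():
    print(tag, support)
```

A tag applied by only one transcriber is still reported, so a vetter could confirm or reject it rather than having it silently dropped.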

Once the entire collective transcription has been checked, the reviewer, i.e., the vetter, presses the “Done” button to indicate that he or she is finished with the transcription stage. Additional tagging may still be performed if needed, and XML attributes will be inserted as part of final coding. However, by leveraging several transcriptions in one screen for an expert to check, the transcription review process is streamlined, and accuracy is heightened for the aggregate transcription.

When all the encoding is done for XML validation and TEI compliance, a finished XML transcription will be loaded into the online EMMO database. Style sheets for browsing and a robust search feature will enable scholars or anyone interested in early modern history to learn more from exploring these fascinating texts. Look for announcements later this year about the launch of this exciting new resource.

So, that’s what we do with all the transcriptions.

Of course, some may well want to ask a follow-up question or two, such as: what about the transcriptions from Shakespeare’s World? Another good question. Zooniverse has an aggregation algorithm of its own. Describing their approach and how it will integrate with Dromio and the EMMO database when the finished transcriptions start flowing in will have to wait for another post. Since we worked together on the development of Shakespeare’s World, though, we don’t expect major hurdles. The crowd-sourced transcriptions continue along at a lively pace on Shakespeare’s World. Sarah Powell, the EMMO Paleographer, spoke about this exciting partnership at the Digital Public Library of America Festival (or DPLAFest) this past April at the Library of Congress. For the latest news on the Shakespeare’s World project, follow us at @Shaxworld.

One last follow-up question: who are these expert paleographers checking transcriptions in Dromio? Fortunately, this question is a little easier to tackle. Vetting paleographers for EMMO are ones who have had advanced training in early modern paleography, are interested in the project, and are familiar with Dromio and the EMMO tag set (generally the EMMO team and scholars who have attended advanced paleography workshops or the equivalent).

We had two interns this summer who became special additions to our vetting team: Taryn Dollings and Breanne Weber. Both of them studied paleography as part of their recent master’s degrees, and they spent the month of June transcribing at the Folger with Sarah. Besides vetting, they worked on deciphering and tagging manuscript pages in the Papers of the Goodricke Family of Ribston Hall (V.b.333) and the Poetical Miscellany of Anne Campbell, Countess of Argyll (V.a.89). We were happy to welcome Caitlin Rizzo as a third summer intern who worked with Mike Poston in July on advanced XML encoding, spelling regularization, preliminary search features, and browsing options for the EMMO database. And the amazing Stella Gitelman-Willoughby, a younger visiting student, returned to work with Sarah on transcriptions for a couple of days in May. A hearty thanks from the entire EMMO team to all four of these talented people!

Happily, what we do with transcriptions goes even further than encoding and vetting sometimes, as a group of dedicated paleographers found this summer, under the skillful eye of Marissa Nicosia. They set out to make one of the early modern recipes in a manuscript of cookery and medicinal recipes (V.a.429).

Section of fol. 52v showing recipe

You can read about the pursuit of Almond Jumballs (or is it Iumballs?) in Marissa’s blog here. By all reports, the group enjoyed a tasty reward for paleographical effort!

Tempting image of the freshly baked almond treats!

Paul Dingman was the Project Manager for Early Modern Manuscripts Online (EMMO) at the Folger Shakespeare Library. He earned a PhD in History with a concentration in medieval/early modern Europe and is especially interested in cultural history and the digital humanities. Paul also worked in the field of information technology for years before pursuing his doctorate.


Comments

During a long lunch with the expert paleographer, Alan Nelson, I asked him how 16th-century secretary hand compares with modern longhand, in terms of legibility. He smiled, and said he gives his paleography students a copy of a letter that none of them can decipher. He then tells them it was written by his mother.

Richard M. Waugaman — August 9, 2016
