Text Data Processing for Humanists

Speaker

Hannah Jacobs

Humanities researchers can amass a considerable number of primary and secondary text-based sources for their research. These may include scans of archival documents such as manuscripts, newspapers, books, and other materials. They may also include varying-quality scans of secondary sources on loan from their own or other libraries. While close reading of this material remains key for many humanities researchers, computation can help make so much data usable: once computational tools have transcribed handwritten and printed text, scholars can query their text data to find information quickly. These transcription processes, optical character recognition (OCR) for printed text and handwritten text recognition (HTR) for handwritten text, have improved significantly in recent years with advances in machine learning and generative artificial intelligence. In this workshop, we will examine how these technologies work, practice using several tools for OCR and HTR, and consider the opportunities and challenges that can arise when using these technologies with different page layouts, languages, and scripts. Participants are encouraged to bring a laptop.

By the end of this workshop, you will be able to

  • describe how OCR and HTR work in general terms;
  • identify possible opportunities and challenges when applying OCR and HTR technologies to different page layouts, languages, and scripts;
  • implement several OCR and HTR technologies in your research; and
  • assess accuracy, clean up processed text, and document workflows for transparency.
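
As one illustration of what assessing accuracy can involve, the sketch below computes a character error rate (CER): the edit distance between a hand-checked reference transcription and the OCR or HTR output, divided by the length of the reference. This is a minimal sketch, not necessarily the method used in the workshop; the function names and the sample strings are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length (0.0 means a perfect match)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical sample: a hand-checked line and an OCR result for the same line.
reference_line = "archive"
ocr_line = "archlve"
print(f"CER: {character_error_rate(reference_line, ocr_line):.3f}")
```

A lower value means fewer corrections are needed. Recording the CER for a small hand-checked sample alongside the tool and settings used is one way to document a workflow for transparency.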

This workshop will be facilitated by Hannah Jacobs, Digital Humanities Consultant with Duke Libraries.

Location: East Campus Music Library Seminar Room

Participation: General discussion, structured activity, and time for questions.

Audience: Humanities faculty & graduate students. RCR credit is available for both.


Categories

Digital Humanities, RCR Workshop