Using AI for Humanities Research Software | Software Sustainability Institute

AI is increasingly used in creating and deploying research software, including in the digital humanities (DH). As digitised archives, cultural collections, and large textual datasets continue to grow, researchers are turning to computational tools to analyse and manage these materials.

This guide explores several practical applications of AI within humanities research software, focusing on how these tools can assist with analysing textual corpora and processing cultural and archival materials. The aim is to help researchers understand where AI can be useful in research software environments, how it is currently being applied in humanities projects, and how these tools can be used responsibly while maintaining critical scholarly interpretation.

AI for Text Mining

One of the most common applications of AI in humanities research software is text analysis across large document collections, or ‘text mining’. Natural language processing (NLP) techniques allow researchers to analyse thousands of texts at once, enabling methods such as:

topic modelling;
sentiment analysis;
named entity recognition (NER);
language pattern detection.

Large Language Models (LLMs) can perform these functions in ways that open up NLP to less experienced digital humanists, for example by writing and running Python code with libraries such as spaCy, NLTK, or transformers. ChatGPT will also let you see the code being run when it processes a prompt. Because a specific program is being run, the results are less likely to contain hallucinations, but outputs should still be checked carefully. LLMs like ChatGPT are often black boxes which do not allow users to see the training data, so understanding the results requires a high level of knowledge of the domain you’re working with. Smaller models are also available; these can be more easily adapted to specific research needs and can be made to focus on specific training data.

As well as Python libraries, there is existing software which makes use of AI, such as NVivo. A key consideration, however, is whether it is necessary to use resource-intensive tools like LLMs, or tools with AI-powered functionality, at all. Existing software such as Voyant Tools can provide insights into text without the need for an AI model, and without the risk of hallucinations. It is vital for researchers to understand their texts well, and to understand how LLMs arrive at their results, before incorporating such findings into their research.
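
To illustrate the kind of insight that needs no AI model, the following is a minimal sketch of a term-frequency count using only the Python standard library. It is not how Voyant Tools works internally; the function name, the short stopword list, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}

def term_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stopword terms in text."""
    # Lowercase the text and keep runs of letters (and apostrophes) as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

# Invented sample text, standing in for a document from a corpus.
sample = "The archive holds letters. The letters describe the archive and the city."
print(term_frequencies(sample, top_n=2))
```

A count like this is fully reproducible: the same text always gives the same result, and every step can be inspected.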

AI for Processing and Structuring Different Kinds of Data

Another major application of AI in humanities research software is processing digitised cultural materials such as scanned page images or manuscripts. Tools like Voyant (mentioned above) can work with unstructured text (i.e. text without machine-readable markup, where features like chapter headings, page numbers, and sections are not encoded), but some text mining and visualisation tools require the text to be structured.

Tools for processing and structuring data include:

Optical Character Recognition (OCR): AI-powered OCR tools convert scanned books or manuscripts into searchable text.
Handwriting recognition: Machine learning models can help transcribe handwritten archival documents (Transkribus is a major project in this area).
Computer vision (CV) for visual materials: AI can identify visual features in artworks, photographs, and manuscripts (the Visual Geometry Group at the University of Oxford has a suite of tools for this).
LLM-assisted text encoding: LLMs can be used to create, check and validate XML files. For those new to text encoding, LLMs can be used to work with specific guidelines to process text quickly and create structure. For example, TEImodeler provides recommendations, generates examples, and can problem-solve encoding decisions.
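
For XML produced with the help of an LLM, one basic check that does not rely on the model itself is whether the file is well-formed. The sketch below uses Python's standard xml.etree.ElementTree parser; the function name and the two sample fragments are hypothetical. Note that well-formedness is only a first step: it does not show that the file follows the TEI guidelines, which would need validation against a schema with a separate tool.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)

# Invented fragments for illustration: the second never closes <persName>.
good = "<text><body><p>Dear <persName>Ada</persName>,</p></body></text>"
bad = "<text><body><p>Dear <persName>Ada</p></body></text>"

print(check_well_formed(good))
print(check_well_formed(bad))
```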

As noted above, a key concern here is to check accuracy before working with AI-generated material, and to bear in mind the pros and cons of using such tools. For example, handwriting recognition is very challenging, and the output will not be a perfect reproduction of the handwritten text. OCR software often provides a confidence score when producing text from images, and errors can derive from the quality of the image, the typeface used, and the page layout (e.g. the use of features like columns). If working with LLMs to create markup or metadata in TEI, the user needs to consider the kinds of decisions that are being relinquished: the balance between the speed of such tools and academic rigour needs to be decided and documented in any project. LLMs may produce functional code, but it often falls short of best practice, even while tools like AI Code propose to help improve human-written code.
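
Confidence scores can be used to direct manual checking. The following is a minimal sketch, assuming the OCR output has already been reduced to (word, confidence) pairs with confidence between 0 and 1. Real OCR tools report confidence in their own formats and scales, so the data, the function name, and the 0.8 threshold here are all hypothetical.

```python
# Hypothetical OCR output: (word, confidence between 0 and 1) pairs.
ocr_words = [("The", 0.98), ("qnick", 0.41), ("brown", 0.93), ("f0x", 0.55)]

def flag_for_review(words, threshold=0.8):
    """Return the words whose confidence falls below threshold."""
    return [word for word, conf in words if conf < threshold]

# Words listed here would be checked against the page image by a person.
print(flag_for_review(ocr_words))
```

A high confidence score does not guarantee a correct reading, so a threshold like this helps prioritise checking rather than replace it.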

Risks and Biases

When using AI in any of these ways, researchers need to bear in mind that it may produce inaccurate or biased results. Outputs must be checked carefully, and documenting how AI was used should be part of the workflow. Researchers also need to ensure that copyrighted material is not uploaded to AI systems without permission. Checking results, documenting use, and respecting copyright are central to ethical AI use.
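
One lightweight way to build documentation into a workflow is to write a short structured record each time an AI tool is used. The sketch below is only one possible approach: the field names and example values are hypothetical, and a project should choose fields that match its own documentation needs.

```python
import json
from datetime import date

def ai_usage_record(tool, task, checked_by, notes=""):
    """Build one JSON line describing a single use of an AI tool."""
    record = {
        "date": date.today().isoformat(),
        "tool": tool,
        "task": task,
        "checked_by": checked_by,
        "notes": notes,
    }
    return json.dumps(record, ensure_ascii=False)

# Invented example values; each line could be appended to a project log file.
line = ai_usage_record(
    tool="(name and version of the model or software)",
    task="Draft TEI markup for letter 12",
    checked_by="Project editor",
    notes="Corrected two persName tags by hand",
)
print(line)
```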

It should also be noted that AI models learn from existing data, which may reflect historical biases and unequal representation in archives. Major digitisation projects have focused on particular authors, topics, countries and materials, which reinforces cultural hegemony and bias. This has an impact on everything from the predictions software might make when attempting OCR to the kinds of elements in an image that computer vision identifies.

Conclusion

AI-powered research software is transforming humanities research by enabling scholars to analyse large cultural datasets, process digitised archives, and explore new research questions.

Key takeaways:

AI can facilitate the use and creation of research software for DH research.
Techniques such as NLP, OCR, and CV enable large-scale analysis of texts and images.
Generative AI can support research workflows but must be used critically.
Humanities expertise remains essential for interpreting results and ensuring ethical research practices.

Further Resources

Acknowledgements

This guide was written by Emily Middleton, Lecturer in Digital Humanities at the University of Leeds, and reviewed by Peace Ayegba.

Emily's ORCID