PhD Candidate, Department of History, King's College London
I am interested in how access to large bodies of historical sources in an electronic format can change the way historians are able to analyse and understand the past. A billion words tell a different story than a thousand. But just how do we distill a billion words into something meaningful?
Digitization initiatives in the humanities have increased the number of textual sources humanists have readily available to unmanageable levels. This new abundance of material means humanists now face the problem of identifying the most relevant materials for their research amidst an ever-growing digital archive, a tricky challenge for many scholars. My own research looks at a small corner of this larger problem through a case study: which texts in a large collection are about an Irish person? And how can we determine this without reading the whole collection?
Many of us have been trained by search engines to believe that keywords are the best way to find relevant material. In practice, the limited detail that historical sources typically contain makes keyword searching unreliable. Often a source will contain little more than a person’s name and a few scant details. What seemed like a simple information retrieval challenge turns out to be much more complex.
My research has been able to solve this challenge by deriving an algorithm that estimates from a person’s name the likelihood of an Irish connection. Coupled with keyword searching and a process known as nominal record linkage, this three-pronged approach has allowed me to efficiently and effectively classify tens of thousands of historical records into piles of relevant or not relevant. For me, this is exciting, as it opens up a future in which academics working with historical sources can ask for relevant records, rather than having to keyword search an (often arbitrary) set of terms that may or may not have been used by the original authors. I believe this shift can put the historical community more firmly on the path towards reproducible research, and towards more comprehensive sets of relevant sources with which to work.
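To make the three-pronged idea concrete, here is a minimal Python sketch of how a name-based score, a keyword check, and a simple form of nominal record linkage might be combined. It is an illustration only, not the algorithm from my research: the surname weights, keyword list, threshold, and the names `Record`, `surname_score`, `has_keyword`, `link_key`, and `classify` are all hypothetical, and real record linkage would compare more fields than a normalised name.

```python
import re
from dataclasses import dataclass

# Hypothetical surname weights: illustrative numbers, not derived from real data.
SURNAME_WEIGHTS = {"murphy": 0.9, "kelly": 0.8, "smith": 0.1}
DEFAULT_WEIGHT = 0.2  # assumed score for surnames not in the table
KEYWORDS = ("ireland", "irish", "dublin", "cork")  # assumed keyword list


@dataclass(frozen=True)
class Record:
    record_id: str
    name: str
    text: str


def surname_score(name: str) -> float:
    """Return the assumed likelihood of an Irish connection for a name."""
    parts = name.lower().split()
    if not parts:
        return DEFAULT_WEIGHT
    return SURNAME_WEIGHTS.get(parts[-1], DEFAULT_WEIGHT)


def has_keyword(text: str) -> bool:
    """Return True if any keyword appears as a whole word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(keyword in words for keyword in KEYWORDS)


def link_key(name: str) -> str:
    """Normalise a name so that records naming the same person can be linked."""
    return " ".join(name.lower().split())


def classify(records: list[Record], threshold: float = 0.5) -> dict[str, bool]:
    """Label each record as relevant (True) or not relevant (False)."""
    # Prongs 1 and 2: name-based score or keyword match.
    relevant = {
        r.record_id
        for r in records
        if surname_score(r.name) >= threshold or has_keyword(r.text)
    }
    # Prong 3: a record is also relevant if it shares a normalised name
    # with a record already judged relevant.
    linked_names = {link_key(r.name) for r in records if r.record_id in relevant}
    return {
        r.record_id: r.record_id in relevant or link_key(r.name) in linked_names
        for r in records
    }


records = [
    Record("a", "Patrick Murphy", "theft of a watch"),
    Record("b", "John Smith", "born in Dublin"),
    Record("c", "john  smith", "stole a coat"),
    Record("d", "Mary Jones", "assault"),
]
result = classify(records)
```

In this toy data, record "a" is picked up by its surname, "b" by a keyword, and "c" only because it links to "b" by name, which shows why the three methods together can find records that keyword searching alone would miss.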
Check out contributions by and mentions of Adam Crymble on www.software.ac.uk