Sage Ocean launch new software tool to clean text

Posted by j.laird on 15 December 2020 - 9:30am

Photo by Moritz Kindler on Unsplash

By Daniela Duca from Sage Ocean.

We are building Texti to reduce the amount of time digital scholars spend cleaning and preparing text for mining and other forms of automated analysis. Texti is a free and simple tool. It is now in beta, and we are looking for feedback, comments, reactions, ideas, and any help in testing it and discovering how to make it more useful.

We started with the researcher use case in mind: a researcher who works with modest to large amounts of text, who may have some coding experience, and who is excited to pick up tips and tricks to run a more automated analysis of their corpus. We found that these researchers are frustrated when it comes to getting their corpus ready for the type of analysis they want to carry out. They discover they are spending most of their time getting the format right, even for simple word clouds or sentiment analysis. Perhaps they started with PDFs and the extracted text contained too much junk, or they didn’t have the time or capacity to figure out how to extract monetary information from long-form text.

We wanted to find a way to help these researchers, by collating and writing blocks of code in Python that use a variety of packages and help transform the text in different ways (think extract text, reformat, remove custom words, remove short words, remove punctuation, remove custom patterns, and other variations). Texti is just the interface to access and work with these blocks of code, mix and match them, and build a workflow.
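To make the idea of mix-and-match blocks a little more concrete, here is a minimal sketch of how a few such cleaning steps could be chained into a workflow using only the Python standard library. This is not Texti's actual code: the function names, the `run_workflow` helper and the sample text are all hypothetical and chosen purely for illustration.

```python
import re
import string


def remove_punctuation(text):
    # Replace every ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_short_words(text, min_length=3):
    # Keep only words with at least `min_length` characters.
    return " ".join(w for w in text.split() if len(w) >= min_length)


def remove_custom_words(text, words):
    # Drop the given words, ignoring case.
    banned = {w.lower() for w in words}
    return " ".join(w for w in text.split() if w.lower() not in banned)


def remove_custom_pattern(text, pattern):
    # Replace anything matching the regular expression with a space.
    return re.sub(pattern, " ", text)


def run_workflow(text, steps):
    # Apply each (name, function) step in order and record the text
    # after every step, so the effect of each block can be previewed.
    previews = []
    for name, step in steps:
        text = step(text)
        previews.append((name, text))
    return text, previews


# Hypothetical sample: a page marker left over from PDF extraction,
# followed by one sentence of body text.
sample = "Page 3 of 10\nThe cost of the study, in 2020, was high."

steps = [
    ("remove page markers", lambda t: remove_custom_pattern(t, r"Page \d+ of \d+")),
    ("remove punctuation", remove_punctuation),
    ("remove custom words", lambda t: remove_custom_words(t, ["the"])),
    ("remove short words", lambda t: remove_short_words(t, 3)),
]

cleaned, previews = run_workflow(sample, steps)
print(cleaned)  # cost study 2020 was high
```

Because each block is a plain function that takes text and returns text, the steps can be reordered, dropped or swapped without touching the others, and the `previews` list holds the intermediate result after every step.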

For the first version, we kept it simple: the interface allows researchers to load a PDF file and create a workflow that extracts, filters and pre-processes the document without inputting any code. The user can preview the changes each block of code applies to the document, and should be able to export the code behind their workflow into a Python notebook. We are hoping to build more features for the next, more stable release, such as batch processing and interoperability with data repositories and other sources. We are looking for any feedback, especially if you’ve been wrangling text data or have worked with researchers who needed that type of support.

To give it a try, you can register for a free account: https://sagetextipocapp.azurewebsites.net/

Or, if you think this might have potential and you’d like to hear more about it later, you can register your interest at https://ocean.sagepub.com/texti or get in touch with me directly at Daniela.Duca@sagepub.co.uk.