A journey of reproducibility from Excel to Pandas
By Simon Hettrick, Deputy Director.
This is a story about reproducibility. It’s about the first study I conducted at the Institute, the difficulties I’ve faced in reproducing analysis that was originally conducted in Excel, and it’s testament to the power of a tweet that’s haunted me for three years.
The good news is that the results from my original analysis still stand, although I found a mistake in a subdivision of a question when conducting the new analysis. This miscalculation was caused by a missing “a” in a sheet containing 3763 cells. This is as good a reason as any for moving to a more reproducible platform.
Data and citation
Before getting to the nitty gritty, here's where you can find the data and analysis. The data collected during this survey is available for download from Zenodo. Please cite "S.J. Hettrick et al, UK Research Software Survey 2014", DOI:10.5281/zenodo.1183562.
The data is licensed under a Creative Commons by Attribution licence and the analysis is licensed under a BSD 3-clause licence (in both cases attribution to "The University of Edinburgh on behalf of the Software Sustainability Institute").
2014: a survey odyssey
Back In 2014, I surveyed a group of UK universities to see how their researchers used software. We believed that an inexorable link existed between software and research, but we had yet to prove it. I designed the study, but I never intended to perform the analysis. This was a job better suited to someone who could write code, and I could not. Unfortunately, things didn’t go to plan and I found myself in the disquieting situation of having an imminent deadline and no one available to do the coding. Under these circumstances, few people have the fortitude to take some time out to learn how to program. Instead they turn to what they know and - like so many researchers - I knew Excel.
The analysis took a few days of fairly laborious Excel bashing. The results looked good, they proved the connection we were looking for, and they were well received. It was all working out! Then I received a tweet from Lorena Barba:December 14, 2014
I really wanted to be annoyed by this tweet. The circumstances that had forced me to conduct the analysis were beyond my control, yet I’d picked up the slack when I was already working crazy hours. It felt like a victory had been snatched from me. But I just couldn’t get angry, because it was clear that the tweet was right. It was worse than ironic that I, as a representative of an organisation that extolls the virtues of reproducibility, had relied on Excel, which is known for making it difficult. In a fit of melodrama, I vowed to reproduce the analysis as soon as I could.
I’m sorry, Simon. I'm afraid I can't do that
“As soon as I could” turned out to be about three years later. Why so long? Aside from a feeling of righteousness - which is great, but doesn’t win you a promotion - there’s simply no incentive to invest time into reproducing results, but there are many, many disincentives.
Whilst I was wringing my hands over my lack of coding skills, people were quite happily citing the study and, after the original flurry of interest, no one was picking through the analysis. There’s a weird phenomenon where citations that people have seen used elsewhere pick up an authority that seems to inoculate them against further enquiry. (It’s almost like some people stop looking at the original document and simply copy the headline result.) Whilst there’s no credit for repeating an analysis, there is a huge cost in terms of time that could have been spent conducting new studies that would generate credit. And what happens if I find that my original analysis was wrong? Who would invest time for no reward, just to open themselves to potential criticism? You’d have to be insane to attempt to reproduce your own work.
But that tweet kept niggling away at me.
The first step was learning to code. No small task, but I wanted to learn and, with my position at the Institute, it always seemed a bit odd that I couldn’t (it's like a being vegetarian who's employed as a butcher). I chose to learn Python with the Pandas library, because it’s popular, gave me the kind of statistical analysis tools I needed, and I’d heard a lot of praise for its ease of use. I started learning in January this year and, whilst I’m certainly no expert, eight months later I felt that I was ready to reproduce the survey analysis in what I hope is a more reproducible manner.
My god! It’s full of crap
I recently taught the “Data Organisation in Spreadsheets” course at a Data Carpentry workshop, where I discussed 12 major mistakes that people make with spreadsheets. The original analysis of my survey included every single one of them.
There was a glorious proliferation of tabs, the analysis for one question alone spanned 40 columns of arcane transformations, and the summary page looked like it was designed by Mondrian. It’s not much of a defence, but I did some things right too: I included the raw data, I tried to include basic tests for errors, I broadly listed the operations I had conducted in a README, and I stored everything in Zenodo under a permissive licence.
The main problem on returning to the Excel analysis was simply determining the steps used to transform the data. It’s almost impossible to reconstruct the logic behind spreadsheet-based analysis, and I soon realised that I was wasting my time. Rather than excavating the original logic, I took the bolder step of starting the analysis fresh and simply hoping that the results aligned with the old analysis. If they did, then this would save me a lot of time and frustration. If they did not, well... there are other careers.
Just what do you think you're doing, Simon?
Back in 2014, I was new to the surveying game and I showed this through my frequent use of free-text questions. I have since learned that people who run surveys tend to use free-text responses in the same way as doctors use emetics: only when you're prepared to clean up the mess. One of the free-text questions had generated a particularly messy response, because I'd relied on respondents to comma separate their answers. This had resulted in text containing a bewildering array of different separators. I wrote a Python script to rationalise these answers and produce a reliable, comma-separated response. I then turned to OpenRefine for the bulk data-cleaning. Its clustering function saved me a huge amount of time in dealing with the other problem with free-text answers: people's ability to say the same thing in many different ways.
Once the data was clean, I wrote a second Python script to conduct the univariate analysis, store the results and plot them. This would be the end of the analysis usually, but I now had the additional step of comparing the results from the new and the old study. This required a third Python script that compared the results and saved the comparisons as a series of csv files.
Since I had decided to repeat the analysis blind rather than trying to replicate the steps used in Excel, I ended up with some differences in terms after cleaning the data. For example, in 2014 I was fond of the term ‘postdoc’, but in 2017 I favoured the more formal ‘postdoctoral researcher’. For the purposes of the comparison only, I edited my comparison script so that it swapped out the terms used in the old analysis with my new (and currently preferred) terms.
It can only be attributable to human error
There are a lot of comparison files in the Github repository, but for the purposes of this post I’ll use the following table. It shows the average difference between the old and the new analysis (or, in long form: for each question, I took the percentages generated for each answer by the new analysis and old analysis, calculated the absolute difference between these, and averaged the absolute difference on a per-question basis).
Everything looks rosy until you get to question 9. Whilst the headline figure is unchanged in that only 54% or researchers reported that they had received training in software development, the balance between the types of training received had changed (“self taught” and “taught course only” increased, and “both” decreased).
This mistake occurred because I was using a SEARCH() function to differentiate between possible answers. I had searched for “taught”, which is not unique across the potential answers, rather than “a taught” which is. Here lies one of the biggest problems with Excel: a mistake can throw out a tiny part of the overall analysis, which is hard to detect. By contrast, programming forces you to abstract your analysis so that it can be applied more generally, and this means that mistakes generally have wide-ranging effects, which makes them easier to detect.
The only other significant difference is with extra question 1: which software do you use in your research? This was a free-text question which redefined the term “wildly anarchic”. With such a variety of ways of describing the software people used, I’m actually surprised that such a small variation was found between the two analyses.
It’s difficult to clean messy data. This became apparent when I looked at the difference in number of responses that I deleted because they were inadequate (e.g. they were from a university not included in the survey). The old analysis was based on 417 responses, but the newly cleaned data excluded a further 26 responses. I believe that this is simply down to becoming a little battle-hardened when it comes to surveys. I’m far less forgiving of departures from the chosen path now that I’ve had years of experience of dealing with the consequences (and cleaning) of those departures. Importantly, the difference had no effect on the results.
I've still got the greatest enthusiasm and confidence in the mission
This has been a fascinating project. I’ve seen at first hand the horror of reproducing a very simple project - one that I planned and conducted only three years ago - and found it both difficult and time-consuming to repeat. It’s certainly shown the incredible value of things like good documentation. Even something as simple as listing the steps in the original analysis became invaluable when trying to fathom the logic in the original spreadsheet. On that front, I can conclusively say that whilst it’s not impossible to reconstruct the logic behind “40 columns of arcane transformations”, it is so, so much simpler to review a couple of lines of well-commented code.
Equally, I don't want to be drawn into Excel bashing. There's a reason why spreadsheets are so prevalent in research: they're accessible, quick and rather powerful. They're not reproducible, but we don't incentivise reproducibility so how can we expect researchers to care about it?
It's taken me three years to find the time to conduct this analysis, and eight months to learn how to use the tools. I conducted the majority of the work in my own time, because I couldn't justify spending work time on a project that doesn't win credit. I'm a senior representative of a national software organisation that cares deeply about reproducibility. If I find it difficult to make the case for spending time on a project like this, imagine how impossible it must seem to a new postdoc working in a field where everyone uses spreadsheets. It is a brave person indeed who chooses to learn reproducible methods rather than just adopting the tool used by their colleagues. In other words, we are right to campaign for researchers to use reproducible methods, but we must do so whilst respecting the circumstances under which they work.
To finish, I would like to say a huge thank you to Lorena Barba for the tweet that convinced me to begin this odyssey, and I hope that this re-analysis has answered those original concerns.