By Daisie Huang, Software Engineer, Dryad Digital Repository.
What differentiates scientific coders from research software engineers? Scientists tend to be data-discoverers: they view data as a monolithic chunk to be examined and explore it at a fairly fine scale. Research Software Engineers (and software engineers in general) tend to figure out the goal first and then build a machine to do it well. In order for scientists to fully leverage the discoveries of their predecessors, software engineers are needed to automate and simplify the tasks that scientists already know how to do.
Scientists want to explore. Engineers want to build
I've been thinking a lot about the role of coding in science. As a software engineer turned scientist, my research is extremely computational in nature: I work with genomes, which are really just long character strings with biological properties. My work depends on software developed by myself and many, many other scientists. Scientists are, by and large, inquisitive and intelligent people who are fast learners and can quickly pick up new skills, so it seems natural that many would teach themselves programming. When I first started talking to scientist-coders, I thought that perhaps I could relate to them from a programming perspective, and maybe bring some experience in formal software design practices to teaching scientists about coding. I started working with Software Carpentry and organisations of computational scientists in my field (Phylotastic, Open Tree of Life, Mesquite) and getting more involved in figuring out what motivates scientists to take time out of their research and learn to code.
Unexpectedly, I've found it difficult to talk about programming with scientists who code. I've noticed that scientists gravitate towards platforms and languages that most software engineers don't favour, and my rationale for why I don't use those platforms myself doesn't make a lot of sense to scientists. I think that the reasons for this difference between myself and other scientists lie in my background as a software engineer, which predates and informs my approach towards science and computational biology in general. I've come to some conclusions about what differentiates scientific coders from software engineers, and why the two camps tend not to see eye-to-eye, and can be almost antagonistic.
The scientific programming community loves interactive programming environments such as R, IPython, and, to a lesser extent, Ruby. Scientists, when given a large chunk of data, want to query, explore, and visualise the data, first and foremost. Programming environments like R are ideal for this sort of exploration: you can load in a chunk of data and query it to your heart's content, and the history of what you've done is recorded as you go, making it even easier to reproduce! Perfect. Software Carpentry understands that this is what scientists do, and teaches to those inclinations. The curriculum emphasises learning how to automate discovery and how to reproduce and retrieve the results you've discovered, through lessons on shell scripting, version control, databases, and basic programming skills in R and Python.
While this approach is fine for exploration, it is not at all the approach I use as a software engineer. It's probably to my detriment as a scientist, but I have a very hard time approaching programming in this highly interactive way. Because I learned from a curriculum that was based on Structure and Interpretation of Computer Programs, I tend to visualise the blocks of code that I want to create first, as black boxes that will take inputs and give me outputs (abstraction), and consider the mechanisms within the boxes later. I use abstract data structures to regularise the input and output data so that I can package other data to put into the box and get the same sort of output.
I think that these two modes are not easy to reconcile: the data-discovery model tends to view data as a monolithic chunk to be examined and explores it at a fairly fine scale. Scientists want to repeatedly tweak queries and analyses on a single, primary data set and get immediate feedback. As long as you can clearly explain what you did to get the result published in a scientific paper, you don't need to pursue the code any further. If this works for what you need, it's hard to understand why you'd need to spend a lot of time learning about computer programming theory outside of possibly making your code faster. The code is entangled with the specific data that drove the analysis, and there is little incentive to disentangle the two.
The software engineering model tends to view generalised functionality as paramount and central to the process: one figures out what one wants to achieve first, blocks out the boxes and widgets that need to feed in and out of the machine, and only then builds the machine. A software engineer would see a scientific paper written by the data scientist and want to design a nice box around it that could then be used by others on other data sets. Not only that, but a software engineer would see that the scientific coder has clearly performed the same computational task over and over again within their code and would want to abstract a modular function for it, even after the code is published. A box that can only work on a single dataset isn't a very good box at all. To design a really elegant box that is easy for others to use is ultimately the goal of the software engineer as craftsperson.
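To make the "box" idea concrete, here is a minimal sketch in Python of what abstracting a repeated task into a reusable function might look like. The task (computing GC content of DNA sequences), the SequenceRecord container, and the function names are all hypothetical choices made for illustration; they are not taken from any particular paper or library.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SequenceRecord:
    """Hypothetical container that regularises the input: a name plus a DNA string."""
    name: str
    sequence: str


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a sequence (0.0 for an empty one)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def summarise(records: Iterable[SequenceRecord]) -> List[Tuple[str, float]]:
    """The 'box': any collection of records goes in, (name, GC fraction) pairs come out."""
    return [(record.name, gc_content(record.sequence)) for record in records]


if __name__ == "__main__":
    # Illustrative toy data, not real sequences.
    records = [SequenceRecord("seq1", "ATGC"), SequenceRecord("seq2", "ggcc")]
    print(summarise(records))  # [('seq1', 0.5), ('seq2', 1.0)]
```

In an exploratory session, the same calculation might be retyped inline for each data set. Wrapping it in a function with a regular input type is one way to let the same box be reused on data it was never written for.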
The software engineer and the biologist
A quick analogy: imagine if biologists still had to do the polymerase chain reaction by hand instead of using tested, reliable machines to do it! Originally, scientists had to add polymerases to their reactions manually and then transfer the tubes between heat baths of different temperatures, over and over again. While this is a perfectly functional and reproducible way of performing PCR, it's not how we do it today. First came robots that would move the tubes between the different temperatures for us, and later came nicely engineered machines that use the Peltier effect to do it all in one closed block. Because of these further refinements to the PCR functional box, PCR is now a basic tool that we can use to ask further, deeper biological questions.
What does this mean for scientists who code? Do they all have to become software engineers in order to be real coders? I think not. I think that scientists should use computer programming as an exploratory tool to drive discovery in their fields, the way they use other methods and tools, but scientist-coders can benefit from learning about modularity, abstraction, and data structures, if they want to leverage their past discoveries into further data discoveries without having to reinvent the wheel every time. Software engineers can then take that code and make it reusable by a broader audience and enable more discoveries by other scientists. Even that first modularisation of PCR from a human moving the tubes by hand to a robot moving them was a huge innovation in making science more replicable by others, but without the initial scientific discovery of Taq polymerase and how to use it in the polymerase chain reaction, engineers would have no box to build.
As science becomes increasingly computational in nature, it will become more important that scientific code does not end its development cycle on publication of the paper. Data exploration drives primary scientific discovery, but in order for future scientists to fully leverage the work of their predecessors, robust, reproducible, and sustainable software is needed to automate the parts we already know how to do. The role of the research software engineer in taking the original code published in a scientific paper and forming it into a modular and reusable box is vital to drive science forward. Making scientific software more open, reproducible, and sustainable needs to be recognised as a separate endeavour from the original scientific publication. Software engineers play a critical role in the future of data-driven discovery, and recognising and compensating their labour is crucial to accelerating future scientific advancements.
Are you a research software engineer or a scientist-coder? What do you think? Do these models ring true to you, or did I miss something? Please comment below!