Scientific coding and software engineering: what's the difference?

By Daisie Huang, Software Engineer, Dryad Digital Repository.

What differentiates scientific coders from research software engineers? Scientists tend to be data-discoverers: they view data as a monolithic chunk to be examined and explore it at a fairly fine scale. Research Software Engineers (and software engineers in general) tend to figure out the goal first and then build a machine to do it well. In order for scientists to fully leverage the discoveries of their predecessors, software engineers are needed to automate and simplify the tasks that scientists already know how to do.

Scientists want to explore. Engineers want to build.

I've been thinking a lot about the role of coding in science. As a software engineer turned scientist, my research is extremely computational in nature: I work with genomes, which are really just long character strings with biological properties. My work depends on software developed by myself and many, many other scientists. Scientists are, by and large, inquisitive and intelligent people who are fast learners and can quickly pick up new skills, so it seems natural that many would teach themselves programming. When I first started talking to scientist-coders, I thought that perhaps I could relate to them from a programming perspective, and maybe bring some experience in formal software design practices to teaching scientists about coding. I started working with Software Carpentry and organisations of computational scientists in my field (Phylotastic, Open Tree of Life, Mesquite) and getting more involved in figuring out what motivates scientists to take time out of their research and learn to code.

Unexpectedly, I've found it difficult to talk about programming with scientists who code. I've noticed that scientists gravitate towards platforms and languages that most software engineers don't favour, and my rationale for why I don't use those platforms myself doesn't make a lot of sense to scientists. I think that the reasons for this difference between myself and other scientists lie in my background as a software engineer, which predates and informs my approach towards science and computational biology in general. I've come to some conclusions about what differentiates scientific coders from software engineers, and why the two camps tend not to see eye-to-eye and can sometimes be almost antagonistic.

Different approaches

The scientific programming community loves interactive programming environments such as R, IPython, and, to a lesser extent, Ruby. Scientists, when given a large chunk of data, want to query, explore, and visualise the data, first and foremost. Programming environments like R are ideal for this sort of exploration: you can load in a chunk of data and query it to your heart's content, and the history of what you've done is recorded as you go, making it even easier to reproduce! Perfect. Software Carpentry understands that this is what scientists do, and teaches to those inclinations. The curriculum emphasises learning how to automate discovery and how to reproduce and retrieve the results you've discovered, through lessons on shell scripting, version control, databases, and basic programming skills in R and Python.
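The flavour of that interactive workflow can be sketched in a few lines of plain Python (the data and questions here are invented for illustration; in practice this would be typed line by line into an R or IPython session against a real dataset):

```python
# A made-up exploratory session: load some data, poke at it, look, repeat.
import statistics

read_lengths = [151, 149, 150, 75, 151, 148]  # pretend this came from a file

print(len(read_lengths))                      # how many records do I have?
print(statistics.mean(read_lengths))          # what's a typical value?
print(max(read_lengths) - min(read_lengths))  # how spread out are they?
```

Each line answers one ad hoc question about this particular dataset, and the session history itself becomes the record of the analysis.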

While this approach is fine for exploration, this is not at all the approach I use as a software engineer. It's probably to my detriment as a scientist, but I have a very hard time approaching programming in this highly interactive way. Because I learned from a curriculum based on Structure and Interpretation of Computer Programs, I tend to visualise the blocks of code I want to create first, as black boxes that take inputs and give me outputs (abstraction), and consider the mechanisms within the boxes later. I use abstract data structures to regularise the input and output data, so that I can package other data to put into the box and get the same sort of output.
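A minimal Python sketch of that black-box style (the GC-content task and the function name are my own illustration, not from the post): decide the interface first, then fill in the mechanism.

```python
def gc_content(sequence: str) -> float:
    """Black box: any DNA string in, one number (the GC fraction) out."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)

# Because the interface is regularised (string in, float out), the same
# box works unchanged on any sequence, from a single read to a genome:
print(gc_content("gattaca"))
print(gc_content("GGCC" * 1000))
```

The caller never needs to know how the counting happens inside the box; only the shape of the inputs and outputs matters.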


I think that these two modes are not easy to reconcile: the data-discovery model views data as a monolithic chunk to be examined and explored at a fairly fine scale. Scientists want to repeatedly tweak queries and analyses on a single, primary data set and get immediate feedback. As long as you can clearly explain what you did to get the result published in a scientific paper, you don't need to pursue the code any further. If this works for what you need, it's hard to see why you'd spend a lot of time learning computer programming theory, beyond possibly making your code faster. The code is entangled with the specific data that drove the analysis, and there is little incentive to disentangle the two.

The software engineering model tends to view generalised functionality as paramount and central to the process: one figures out what one wants to achieve first, blocks out the boxes and widgets that need to feed in and out of the machine, and only then builds the machine. A software engineer would see a scientific paper written by the data scientist and want to design a nice box around it that could then be used by others on other data sets. Not only that, but a software engineer would see that the scientific coder has clearly performed the same computational task over and over again within their code and would want to abstract a modular function for it, even after the code is published. A box that can only work on a single dataset isn't a very good box at all. To design a really elegant box that is easy for others to use is ultimately the goal of the software engineer as craftsperson.
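To make that refactoring instinct concrete, here is a hypothetical before-and-after in Python (the normalisation step is invented; it stands in for whatever computation the scientific coder repeated):

```python
# Before: the same steps appear inline, copied once per dataset.
raw_a = [10, 30, 60]
freqs_a = [x / sum(raw_a) for x in raw_a]
raw_b = [2, 2, 4]
freqs_b = [x / sum(raw_b) for x in raw_b]

# After: the repetition is pulled into one box with a clear interface,
# so the next dataset (or the next scientist) can reuse it.
def normalise(raw):
    """Scale a list of counts so they sum to 1."""
    total = sum(raw)
    return [x / total for x in raw]

# The box reproduces the inline results exactly:
assert normalise(raw_a) == freqs_a
assert normalise(raw_b) == freqs_b
```

The behaviour is unchanged; what the engineer has added is a reusable interface that no longer cares which dataset it is fed.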

The software engineer and the biologist

A quick analogy: imagine if biologists still had to do the polymerase chain reaction by hand instead of using tested, reliable machines to do it! Originally, scientists had to add polymerases to their reactions manually and then transfer the tubes between heat baths at different temperatures, over and over again. While this is a perfectly functional and reproducible way of performing PCR, it's not how we do it today. First came robots that would move the tubes between the different temperatures for us, and later came nicely engineered machines that use the Peltier effect to do it all in one closed block. Because of these further refinements to the PCR functional box, PCR is now a basic tool that we can use to ask further, deeper biological questions.

What does this mean for scientists who code? Do they all have to become software engineers in order to be real coders? I think not. I think that scientists should use computer programming as an exploratory tool to drive discovery in their fields, the way they use other methods and tools, but scientist-coders can benefit from learning about modularity, abstraction, and data structures, if they want to leverage their past discoveries into further data discoveries without having to reinvent the wheel every time. Software engineers can then take that code and make it reusable by a broader audience and enable more discoveries by other scientists. Even that first modularisation of PCR from a human moving the tubes by hand to a robot moving them was a huge innovation in making science more replicable by others, but without the initial scientific discovery of Taq polymerase and how to use it in the polymerase chain reaction, engineers would have no box to build.

As science becomes increasingly computational in nature, it will become more important that scientific code does not end its development cycle on publication of the paper. Data exploration drives primary scientific discovery, but in order for future scientists to fully leverage the work of their predecessors, robust, reproducible, and sustainable software is needed to automate the parts we already know how to do. The role of the research software engineer in taking the original code published in a scientific paper and forming it into a modular and reusable box is vital to drive science forward. Making scientific software more open, reproducible, and sustainable needs to be recognised as a separate endeavour from the original scientific publication. Software engineers play a critical role in the future of data-driven discovery, and recognising and compensating their labour is crucial to accelerating future scientific advancements.

Are you a research software engineer or a scientist-coder? What do you think? Do these models ring true to you, or did I miss something? Please comment below!

Posted by s.hettrick on 6 February 2015 - 2:06pm

Submitted by Anonymous on 6 February 2015 - 3:16pm


Thanks for the nice post. I have a similar question: What is the difference between a scientist conducting an experiment (Experimentalist) and a scientific hardware engineer? This analogy, where "Experimentalist" replaces "scientific coder" and "scientific hardware engineer" replaces "software engineer", makes clear that software engineering not only will play a central role but will be indispensable for the advance of science.

Submitted by Marco on 6 February 2015 - 4:25pm


I'm a software engineer, coming from a family of car mechanics. I'm pretty sure they're much smarter at fixing cars, or even making them go faster, than any car factory director. However, when the latter has to organise the job of a whole plant, interact with other directors and unions, understand the market, and know which technologies are available to build cars and their pros/cons, well, that's the job of someone working on cars on a large scale. Building software on a large scale is the job of software engineers; crafting smart Perl scripts that encode brilliant ideas is more the job of data scientists who are proficient in programming. Please don't pretend that programmers are software engineers, and don't put them in charge of building large-scale software that is going to affect thousands of users around the world. Most importantly, don't think that this way you'll save money. I won't add 'and vice versa', because I have never seen that happening :-)

Submitted by Carl on 6 February 2015 - 9:42pm


Daisie, Thanks for this; I think you've done a great job of concisely illustrating the differences here. The only thing I would push back on is that I believe this is primarily an incentives issue, not an educational one. It isn't that scientists don't _understand_ the value of modularity and abstraction here, but rather that they see it as the job of an engineer. As you say in the beginning, scientists want to explore, engineers want to build. I suspect most researchers have no conceptual problem appreciating the advantages in the evolution of PCR, but do not aspire to bring about those advantages themselves or celebrate those advances as scientific discoveries, any more than they do other feats of engineering from which they regularly benefit. The cynical scientist might reply: "sure, we invented PCR and the engineers came along and made it faster and cheaper, just as they did after we invented the computer, the laser and everything else. No need for science to change. The engineers just need to catch up". As you know, developing the right abstractions and modules isn't a simple mechanical process, but a creative and complex problem itself that is closely tied to the research. Thus, I think the challenge today is not convincing scientists that some abstraction could save them time and money -- they already know this. The challenge is persuading scientists that developing the appropriate abstractions is an interesting scientific challenge in its own right, and not somebody else's problem that will just go away with the inevitable progress of society -- past the view that 'building something' is fundamentally different from 'discovering' something. Anyway, great piece.

Submitted by Rachel on 8 February 2015 - 10:25pm


I agree with some of what you said but I take issue with some of it as well. A good scientist _does not_ get data and then try to make something out of it. That's what people have been doing with genomes and it's not good science. The data that are gathered should reflect the goal of the project. Also, most scientific coders use code to speed up organisation and data analysis for large datasets, so your PCR analogy isn't quite right - coders are creating the PCR machine equivalent for each project, i.e. speeding up results without direct work. OTOH you're correct that PCR is a general tool, whereas scientific coders tend to write specific code for each project. This is where it's great to work with a software engineer to think about how to generalise the tool and make it useful to a large audience.

This is the computing point of view of "science vs. engineering", very well explained and illustrated. I don't think it is fundamentally different from other aspects of this entangled pair of activities. The distinction between discovering the principle of a laser and building an efficient laser is very much the same. The main difference I see is that computing technology has developed much faster than other technologies in the past, so the research community hasn't had the time to restructure itself properly to adapt to the change. We could probably learn a lot from studying how science and engineering have interacted in the past.

From a philosophical point of view, I think the best description of the relation of science and engineering is a yin-yang pattern, science being the yin process and engineering its yang complement. The main conclusion is that any attempt to divide the two into neatly separated activities is bound to fail. Translated to computing, this means we will have an eternal cycle of scientists doing exploratory computing, thus figuring out the concepts that permit engineers to write new software tools that are then picked up by scientists in the next round. The cyclic point of view also suggests that we should not have only "scientists" and "software engineers", but also people with mixed competences and experience. Instead of two activities with a clear borderline, we have a continuum of processes, with the possibility of people specialising at each point of the circle.

Also interesting in this context: various essays and presentations by Richard Gabriel, in particular "Science is Not Enough: On the Creation of Software". He adds art to science and engineering and explains why real-life problems often require a mixture of these three approaches.

Submitted by Kai on 9 February 2015 - 2:06pm


Hi Daisie, thank you for this piece, I can really relate to a lot of this. I agree with Carl that the differences in reusability are likely caused by different incentives here rather than knowledge. I've also got a background in more traditional software engineering (though I dodged the lambda-calculus bullet and instead have more experience with code actually used in production), but my scientific code looks different. The whole "get data, extract interesting insights, publish, move on" mindset that permeates science makes it really hard to justify the time needed for a properly engineered piece of code. Basically, the first time you get a chunk of data, you have no clue what boxes you might need, so you can't build them at that step. That's why it's called exploratory analysis, after all. A bit like prototyping and using tracer bullets in software engineering. Unlike in software engineering, though, once you have your prototype, you have the results needed for publication. After publication, when you now could go back and build the thing properly, what's your incentive to do so? Sure, if you wanted to repeat a similar analysis in the future. But very likely you now need to move on to the next project instead. A lot of scientific code I come across looks exactly like that, stuck in prototype stage. But how do you convince your funders that properly engineering your code now is time/money well spent? After all, it's not "science" anymore, so why bother? As a research software engineer, that really, really irks me, but as a scientist who wants to keep getting funded, well, that's just how the game is being played at the moment.

Submitted by James on 9 February 2015 - 10:22pm


What about big scientific models? Interactive programming environments are an appropriate tool for the job of exploring large datasets *when you do not know in advance exactly what you will find in the data* (which is the essence of "exploring", after all!). Typically these tools have a lot of good software engineering abstraction under the hood, of course, so you can order up any number of complicated analyses very quickly and easily (it's not as if scientists are re-inventing those wheels interactively with every new dataset). However, there is a whole other class of scientist-programmers in the world: those who work on massive scientific number-crunching codes (e.g. atmospheric models and the like). This activity was largely confined to the physical sciences in the past, but I imagine it is happening increasingly in biology too now. These programs are almost always created from the ground up by scientists, not software engineers. They are typically written in Fortran, C, or C++, for maximum numerical efficiency, and execute on massively parallel supercomputers. These scientists are definitely aware of the benefits of abstraction and proper (non-interactive!) programming, although many of them have limited knowledge of really modern software engineering skills that their codes could probably benefit from (e.g. Fortran only got object orientation in the mid-2000s?); it's only the numerical core of big scientific models that needs to be super-efficient, after all.