Scientific coding and software engineering: what's the difference?

By Daisie Huang, Software Engineer, Dryad Digital Repository.

What differentiates scientific coders from research software engineers? Scientists tend to be data-discoverers: they view data as a monolithic chunk to be examined and explore it at a fairly fine scale. Research Software Engineers (and software engineers in general) tend to figure out the goal first and then build a machine to do it well. In order for scientists to fully leverage the discoveries of their predecessors, software engineers are needed to automate and simplify the tasks that scientists already know how to do.

Scientists want to explore; engineers want to build

I've been thinking a lot about the role of coding in science. As a software engineer turned scientist, my research is extremely computational in nature: I work with genomes, which are really just long character strings with biological properties. My work depends on software developed by myself and many, many other scientists. Scientists are, by and large, inquisitive and intelligent people who are fast learners and can quickly pick up new skills, so it seems natural that many would teach themselves programming. When I first started talking to scientist-coders, I thought that perhaps I could relate to them from a programming perspective, and maybe bring some experience in formal software design practices to teaching scientists about coding. I started working with Software Carpentry and organisations of computational scientists in my field (Phylotastic, Open Tree of Life, Mesquite) and getting more involved in figuring out what motivates scientists to take time out of their research and learn to code.

Unexpectedly, I've found it difficult to talk about programming with scientists who code. I've noticed that scientists gravitate towards platforms and languages that most software engineers don't favour, and my rationale for why I don't use those platforms myself doesn't make a lot of sense to scientists. I think that the reasons for this difference between myself and other scientists lie in my background as a software engineer, which predates and informs my approach towards science and computational biology in general. I've come to some conclusions about what differentiates scientific coders from software engineers, and why the two camps tend not to see eye to eye and can even be almost antagonistic.

Different approaches

The scientific programming community loves interactive programming environments such as R, IPython, and, to a lesser extent, Ruby. Scientists, when given a large chunk of data, want to query, explore, and visualise the data, first and foremost. Programming environments like R are ideal for this sort of exploration: you can load in a chunk of data and query it to your heart's content, and the history of what you've done is recorded as you go, making it even easier to reproduce! Perfect. Software Carpentry understands that this is what scientists do, and teaches to those inclinations. The curriculum emphasises learning how to automate discovery and how to reproduce and retrieve the results you've discovered, through lessons on shell scripting, version control, databases, and basic programming skills in R and Python.
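To make that loop concrete, here is a minimal sketch in plain Python, with made-up gene-length numbers (the data and questions are my own invention, not from any real analysis): load a chunk of data, ask a question, look at the answer, tweak, and ask again.

```python
# Hypothetical data just read in: lengths of five genes, in base pairs.
lengths = [532, 1204, 88, 964, 410]

# First question: how many genes, and how many are "long" (> 500 bp)?
total = len(lengths)
long_genes = [n for n in lengths if n > 500]
print(total, len(long_genes))   # 5 3

# Tweak and ask again: what's the mean length?
mean_length = sum(lengths) / total
print(round(mean_length, 1))    # 639.6
```

In an interactive session, each of these lines would be typed, inspected, and revised on the spot, with the session history doubling as a record of the exploration.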

While this approach is fine for exploration, this is not at all the approach I use as a software engineer. It's probably to my detriment as a scientist, but I have a very hard time approaching programming in this highly interactive way. Because I learned from a curriculum based on Structure and Interpretation of Computer Programs, I tend to visualise the blocks of code that I want to create first, as black boxes that will take inputs and give me outputs (abstraction), and consider the mechanisms within the boxes later. I use abstract data structures to regularise the input and output data, so that I can package other data to put into the box and get the same sort of output.
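As a hypothetical illustration of that black-box habit in Python (the function and data are invented for this example, not taken from any particular project): the interface is decided first, a DNA string goes in and a fraction comes out, and the mechanism inside is filled in later. Any sequence of the same shape can be packaged into the same box.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)

# The same box accepts any input of the right shape:
print(gc_content("ATGCGC"))   # 0.6666666666666666
print(gc_content("attta"))    # 0.0
```

The caller never needs to know how the counting is done inside; if the mechanism is later replaced with something faster, every use of the box keeps working.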

Reconciliation

I think that these two modes are not easy to reconcile: the data-discovery model tends to view data as a monolithic chunk to be examined and explores it at a fairly fine scale. Scientists want to repeatedly tweak queries and analyses on a single, primary data set and get immediate feedback. As long as you can clearly explain what you did to get the result published in a scientific paper, you don't need to pursue the code any further. If this works for what you need, it's hard to understand why you'd need to spend a lot of time learning about computer programming theory outside of possibly making your code faster. The code is entangled with the specific data that drove the analysis, and there is little incentive to disentangle the two.

The software engineering model tends to view generalised functionality as paramount and central to the process: one figures out what one wants to achieve first, blocks out the boxes and widgets that need to feed in and out of the machine, and only then builds the machine. A software engineer would see a scientific paper written by the data scientist and want to design a nice box around it that could then be used by others on other data sets. Not only that, but a software engineer would see that the scientific coder has clearly performed the same computational task over and over again within their code and would want to abstract a modular function for it, even after the code is published. A box that can only work on a single dataset isn't a very good box at all. To design a really elegant box that is easy for others to use is ultimately the goal of the software engineer as craftsperson.
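A toy Python sketch of the refactoring the engineer itches to do (the variables and the min-max rescaling task are invented for illustration): the same computation, copy-pasted once per dataset, becomes one box that works on any dataset.

```python
# The scientific-coder version: the same rescaling, repeated inline
# for each dataset in this particular analysis.
heights = [1.2, 3.4, 2.2]
widths = [0.5, 0.9]
h_scaled = [(x - min(heights)) / (max(heights) - min(heights)) for x in heights]
w_scaled = [(x - min(widths)) / (max(widths) - min(widths)) for x in widths]

# The engineer's version: one modular box, usable on any future dataset.
def rescale(values):
    """Linearly rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# The box reproduces the inline results exactly:
assert rescale(heights) == h_scaled
assert rescale(widths) == w_scaled
```

Nothing about the published result changes; what changes is that the next dataset, and the next scientist, can reuse the box instead of re-typing the formula.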

The software engineer and the biologist

A quick analogy: imagine if biologists still had to do the polymerase chain reaction by hand instead of using tested, reliable machines to do it! Originally, scientists had to add polymerases to their reactions manually and then transfer the tubes from heat baths of different temperatures, over and over again. While this is a perfectly functional and reproducible way of performing PCR, it's not how we do it today. First came robots that would move the tubes between the different temperatures for us, and later came nicely engineered machines that use the Peltier effect to do it all in one closed block. Because of these further refinements to the PCR functional box, PCR is now a basic tool that we can use to ask further, deeper biological questions.

What does this mean for scientists who code? Do they all have to become software engineers in order to be real coders? I think not. I think that scientists should use computer programming as an exploratory tool to drive discovery in their fields, the way they use other methods and tools, but scientist-coders can benefit from learning about modularity, abstraction, and data structures, if they want to leverage their past discoveries into further data discoveries without having to reinvent the wheel every time. Software engineers can then take that code and make it reusable by a broader audience and enable more discoveries by other scientists. Even that first modularisation of PCR from a human moving the tubes by hand to a robot moving them was a huge innovation in making science more replicable by others, but without the initial scientific discovery of Taq polymerase and how to use it in the polymerase chain reaction, engineers would have no box to build.

As science becomes increasingly computational in nature, it will become more important that scientific code does not end its development cycle on publication of the paper. Data exploration drives primary scientific discovery, but in order for future scientists to fully leverage the work of their predecessors, robust, reproducible, and sustainable software is needed to automate the parts we already know how to do. The role of the research software engineer in taking the original code published in a scientific paper and forming it into a modular and reusable box is vital to drive science forward. Making scientific software more open, reproducible, and sustainable needs to be recognised as a separate endeavour from the original scientific publication. Software engineers play a critical role in the future of data-driven discovery, and recognising and compensating their labour is crucial to accelerating future scientific advancements.

Are you a research software engineer or a scientist-coder? What do you think? Do these models ring true to you, or did I miss something? Please comment below!

Thanks for the nice post. I have a similar question: what is the difference between a scientist conducting an experiment (an experimentalist) and a scientific hardware engineer? This analogy, where "experimentalist" replaces "scientific coder" and "scientific hardware engineer" replaces "software engineer", makes it clear that software engineering will not only play a central role but will be indispensable for the advance of science.

I'm a software engineer, coming from a family of car mechanics. I'm pretty sure they're much smarter at fixing cars, or even making them go faster, than any car factory director. However, when the latter has to organise the job of a whole plant, interact with other directors and unions, understand the market, and know which technologies are available to build cars and their pros and cons, well, that's the job of someone working on cars on a large scale. Building software on a large scale is the job of software engineers; crafting smart Perl scripts that encode brilliant ideas is more the job of data scientists who are proficient in programming. Please don't pretend that programmers are software engineers, and don't put them in charge of building large-scale software that is going to affect thousands of users around the world. Most importantly, don't think that this way you'll save money. I won't add 'and vice versa', because I have never seen that happening :-)

I am not pretending that programmers are software engineers: I am trying to separate out the sorts of people who get called "programmers" into two ends of a spectrum: software engineers, who start out planning to scale projects up and out, and scientific coders, who make smart Perl/Python/R scripts to encode brilliant ideas. I'm saying that we need to distinguish between these two extremes so that we don't conflate (or as you say, pretend) that these are the same thing.

As a software engineer, I view the code written by scientists as not really brilliant (generalizing here, sorry). To put it shortly, most of the time it's a huge mess. I wonder how they make it work, because the program will surely break on any new data set, outlier, or special case. I'm saying that because I had to fix flooding simulation software written by physicists. It was all written procedurally in a handful of classes. After weeks of painfully analysing that pile, I basically rewrote it from scratch, utilizing CUDA. The simulation could now be done in minutes instead of hours... However, it was the last time I did that kind of work.

Daisie, thanks for this; I think you've done a great job of concisely illustrating the differences here. The only thing I would push back on is that I believe this is primarily an incentives issue, not an educational one. It isn't that scientists don't _understand_ the value of modularity and abstraction here, but rather that they see it as the job of an engineer. As you say in the beginning, scientists want to explore, engineers want to build. I suspect most researchers have no conceptual problem appreciating the advantages in the evolution of PCR, but they do not aspire to bring about those advantages themselves or celebrate those advances as scientific discoveries, any more than they do other feats of engineering from which they regularly benefit. The cynical scientist might reply: "sure, we invented PCR and the engineers came along and made it faster and cheaper, just as they did after we invented the computer, the laser and everything else. No need for science to change. The engineers just need to catch up". As you know, developing the right abstractions and modules isn't a simple mechanical process, but a creative and complex problem itself that is closely tied to the research. Thus, I think the challenge today is not convincing scientists that some abstraction could save them time and money -- they already know this. The challenge is persuading scientists that developing the appropriate abstractions is an interesting scientific challenge in its own right, and not somebody else's problem that will just go away with the inevitable progress of society. That means challenging the view that 'building' something is fundamentally different from 'discovering' something. Anyway, great piece.

Thanks, Carl! I totally agree with you. A few points: it is not at all clear, in my experience, that we have figured out a clear, goal-oriented way to teach the importance of modularity and abstraction to scientific coders. Secondly, I think that one of the challenges we have in research software engineering is to convince scientists that it is in their interest, as users of the potential future software products, to help design it to make it work well for them. UI/UX is a problem for all software development, of course, but as long as scientists see the abstraction of their software as "someone else's problem," they won't be very pleased with the usability of the resulting software.

I agree with some of what you said, but I take issue with some of it as well. A good scientist _does not_ get data and then try to make something out of it. That's what people have been doing with genomes, and it's not good science. The data that are gathered should reflect the goal of the project. Also, most scientific coders use code to speed up organisation and data analysis for large datasets, so your PCR analogy isn't quite right: coders are creating the PCR-machine equivalent for each project, i.e. speeding up results without direct work. OTOH, you're correct that PCR is a general tool, whereas scientific coders tend to write specific code for each project. This is where it's great to work with a software engineer to think about how to generalise the tool and make it useful to a large audience.

Hi Rachel, Thanks for your comment! I think we can all agree that scientists should consider what the goals of their projects are before they start, but let's face it, there are two problems here: 1) not everyone plans out their goals well, and 2) not all data collected can actually answer the questions it was meant to answer, so you've gotta do something with all that data. This has been true for nearly every project I've seen or been associated with. Maybe I'm just working with a lot of bad scientists, but I don't think so: I think querying a mass of data to find interesting results is the reality of a lot of science. I think that scientific coders are creating the equivalent of the mechanical robot that moved tubes between water baths for PCR. Software engineers might be able to make those nice Peltier-effect cyclers we use now.

This is the computing point of view of "science vs. engineering", very well explained and illustrated. I don't think it is fundamentally different from other aspects of this entangled pair of activities. The distinction between discovering the principle of a laser and building an efficient laser is very much the same. The main difference I see is that computing technology has developed much faster than other technologies in the past, so the research community didn't have the time to restructure itself properly to adapt to the change. We could probably learn a lot from studying how science and engineering have interacted in the past. From a philosophical point of view, I think the best description of the relation of science and engineering is a yin-yang pattern, science being the yin process and engineering its yang complement. The main conclusion is that any attempt to divide the two into neatly separated activities is bound to fail. Translated to computing, this means we will have an eternal cycle of scientists doing exploratory computing, thus figuring out the concepts that permit engineers to write new software tools that are then picked up by scientists in the next round. The cyclic point of view also suggests that we should not have only "scientists" and "software engineers", but also people with mixed competences and experience. Instead of two activities with a clear borderline, we have a continuum of processes, with the possibility of people specialising at each point of the circle.

I absolutely agree that we need people who can perform on all points on the continuum. I think the main point I'd like to drive home is that without recognizing the benefits of both ends of the continuum, we will continue to have a system where most science is only incentivized as far as exploratory code that ends in publication, with only small, monetizable (or fundable) bits being engineered into reusable software. Groups like the SSI are key to advocating for embedding software engineering thinking and skills deeper into the scientific process.

Also interesting in this context: various essays and presentations by Richard Gabriel (https://www.dreamsongs.com/Essays.html), in particular "Science is Not Enough: On the Creation of Software". He adds art to science and engineering and explains why real-life problems often require a mixture of these three approaches.

Hi Daisie, thank you for this piece, I can really relate to a lot of this. I agree with Carl that the differences in reusability are likely caused by different incentives here rather than knowledge. I've also got a background in more traditional software engineering (though I dodged the lambda-calculus bullet and instead have more experience with code actually used in production), but my scientific code looks different. The whole "get data, extract interesting insights, publish, move on" mindset that permeates science makes it really hard to justify the time needed for a properly engineered piece of code. Basically, the first time you get a chunk of data, you have no clue what boxes you might need, so you can't build them at that step. That's why it's called exploratory analysis, after all. A bit like prototyping and using tracer bullets in software engineering. Unlike in software engineering, though, once you have your prototype, you have the results needed for publication. After publication, when you now could go back and build the thing properly, what's your incentive to do so? Sure, if you wanted to repeat a similar analysis in the future. But very likely you now need to move on to the next project instead. A lot of scientific code I come across looks exactly like that: stuck in the prototype stage. But how do you convince your funders that properly engineering your code now is time and money well spent? After all, it's not "science" anymore, so why bother? As a research software engineer, that really, really irks me, but as a scientist who wants to keep getting funded, well, that's just how the game is being played at the moment.

What about big scientific models? Interactive programming environments are an appropriate tool for the job of exploring large datasets *when you do not know in advance exactly what you will find in the data* (which is the essence of "exploring" after all!). Typically these tools have a lot of good software engineering abstraction under the hood, of course, so you can order up any number of complicated analyses very quickly and easily (it's not as if scientists are re-inventing those wheels interactively with every new dataset). However, there is a whole other class of scientist-programmers in the world: those who work on massive scientific number-crunching codes (e.g. atmospheric models and the like). This activity was largely confined to the physical sciences in the past, but I imagine it is happening increasingly in biology too now. These programs are almost always created from the ground up by scientists, not software engineers. They are typically written in Fortran, C, or C++, for maximum numerical efficiency, and execute on massively parallel supercomputers. These scientists are definitely aware of the benefits of abstraction and proper (non-interactive!) programming, although many of them have limited knowledge of really modern software engineering skills that their codes could probably benefit from (e.g. Fortran only got object orientation in the mid-2000s?) - it's only the numerical core of big scientific models that need to be super-efficient, after all.

I really can relate to this post. I am primarily a scientist, but I did a double major in computer science and whenever I am in data exploration mode there is this little voice in my head that says - "Make it elegant, make it generalizable, abstract out the functions, use Clean Code". I usually tend to ignore it at first, but once I have a good idea of what I want to achieve with my analysis, I end up refactoring the "quick-and-dirty" code at least a little bit so that I (let alone others) can understand it in a few months' time. This way I find it easier to reuse chunks of the code later on for similar or related analyses. I don't go all the way designing fully reusable modules and libraries, but I kind of stop half-way.

Daisie, great post, but I'm not sure I completely agree with what I understood you to be saying. If the point is to create structure from the get-go, I disagree. You can't optimize something before you know it's working. And since we are talking about computer scientists, perhaps Knuth's statement might apply here: "Premature optimization is the root of all evil (or at least most of it) in programming". To say it differently, an early hack is better when you're starting, not least because there might not be anything in there. All that being said, I completely agree that the vast majority of data scientists (myself included) simply don't know enough about architecture to start automating or "devoping" things when the time comes, and that's to our detriment. So, once we can detect a signal, then I completely agree that it would be fantastic to start getting some structure around it.

I agree with both you and Daisie, so maybe I don't understand either. I'm an SE working with scientists; I come from the school of "make it work (correct), make it right (robust, understandable), make it fast - in that order". For me, that model reconciles your concerns with Daisie's sentiment in that the activities and practices of exploration are related to, but distinct from, those of engineering. The value is in realizing that exploration and engineering are easier to understand and manage as discrete activities (if not discrete skill-sets and phases of a project lifecycle).

Perhaps we are expecting too much of "computer science." As a _computational_ scientist, I am perhaps less "pure" than true computer scientists, who struggle with intellectual problems which are truly the domain of pure mathematics (I hope that no reader will find this a pejorative description, or feel that I mischaracterize them or their discipline). I would suggest that the difference between them and me - less important than the similarities, but illuminating, nevertheless - is that their goal is to prove what is computable, while mine is to find out how to compute that which is known to be computable. As I sometimes put it when talking to kids I want to tempt towards the intellectual world with challenging ideas, my job is to work out how to work things out (yes, the metadiscussions can also be fun when the kids have left). So I sort of cringe when I hear a teacher or a politician cranking out a prepared line that says we need more computer science. It's not that I disagree with the proposition; it's just that I think that at this time in history, of all times, we need to be able to find pragmatic solutions to our problems that don't burden future generations with debt, whether technical debt (not taking care of what needs taking care of for sustainability) or fiscal debt (and don't get me started about that). In other words, while I cherish the long-term value of intellectual curiosity, pure research, and academic endeavor, at present the lunatics appear to have not only taken charge of the asylum but set the place on fire, the more easily to be able to cook one last, fabulous dinner. So while I am a scientist, I am prepared, for the duration of our crisis, to work on important issues like how to (try and) ensure a more equitable division of the world's resources without regard to accidents of birth and ancestry.
If I may summarize it in a way which probably lays bare my ultimate interest in the question, do you want philosopher robots or garbage-disposal robots?
