Learnings from a software sustainability health check

Posted by j.laird on 18 August 2021 - 10:00am

The Institute's Research Software Healthcheck allows researchers to apply for a lightweight review of their research software and how they develop it, with a particular focus on identifying ways to improve software reproducibility and community engagement. Miquel Duran-Frigola, Chief Scientific Officer and Co-Founder of the Ersilia Open Source Initiative, which develops a platform for collecting and sharing computational pharmacology AI models, had some good things to say about the healthcheck Steve Crouch and James Graham conducted for them back in May this year.

By Miquel Duran-Frigola, originally posted on Ersilia's blog.

A scene from the 1931 movie version of Mary Shelley’s Frankenstein

I am an organic chemist by training. I was 22 when I wrote my first line of code in Visual Basic as part of a University assignment. Then it was not until I was 25, bored in a studio in San Salvador, that I discovered something called Project Euler, a website proposing mathematical problems of increasing complexity to be solved with a programming language of choice. I think the first task was: “find the sum of all the multiples of 3 or 5 below 1000”. That one was easy, but difficulty rapidly escalated and I soon gave up. My language of choice was Python (a friend’s recommendation), run from the default IDE in Windows. I tried to install Linux on my computer, but I didn’t succeed, really. Almost two years later, in 2012, I enrolled for a Master’s in Bioinformatics for Health Sciences and I got some formal education on variables, functions, objects and so on. I didn’t fully understand objects, but at least I was able to install Linux now, and I collected the necessary jargon to be admitted as a PhD student in the Structural Bioinformatics and Network Biology lab, at the Institute for Research in Biomedicine in Barcelona.

I was never too good with computers, but the PhD went well. I don’t think it is necessary to be good with computers if you are a computational biologist working in a world-class research centre. We had access to splendid resources, including the MareNostrum supercomputer, a beautiful, glass-cased machine placed inside a church. If you ask me how much storage, CPU or RAM I used back in the day, I will not know what to answer. But the PhD went well, as I said: we wrote several computational papers and I met some outstanding people. We never published a single line of code along with those papers. That was not common practice five years ago. Only monolithic supplementary tables in Excel format and well-intentioned but vague explanations in the Methods section. No maintenance of code and databases; it’s all about novelty, and who funds and cares about maintenance? I wrote all of my PhD scripts with a rudimentary knowledge of the Vim editor and only moved to Jupyter Notebooks in my postdoc. I never wrote docs, never built a package, and nobody has ever been able to run my legacy code in the lab without an error. In sum, I am not a computer scientist and I will never be one. Not a data manager, not a software engineer; at best a chemist with some data science literacy.

This doesn’t mean that I don’t care deeply about quality when it comes to scientific software. I actually do; this is why we launched the Ersilia Open Source Initiative in the first place. It’s funny if you think about it: the overarching goal of Ersilia is to distribute (‘democratize’, we like to say) machine learning and data science tools to low- and middle-income countries, but none of us is a natural software developer or engineer. We are new to GitHub stars, forks, pull requests and these things. I won’t lie: as lead scientist at Ersilia, I often wonder whether we will make an impact in the end. Perhaps we’ve been naïve and overoptimistic. Perhaps it happens to every developer: code grows slowly and erratically, and at times it feels like there’s nobody waiting for it out there.

Early this year, we learned about a free software ‘health check’ evaluation offered by the Software Sustainability Institute (SSI). We thought it would be good to apply to get some feedback on the Ersilia Model Hub, our core repository of machine learning models. You need to fill in an online questionnaire. “Does your website and documentation provide a clear, high-level overview of your software?”, “Is your project/software name free from trademark violations?”, “Does your project have a ticketing system to manage bug reports and feature requests?”, etc. That was a useful checklist. We would have been happy with only that.

The service offered by the SSI is as follows. First, they set up an online meeting where you have the chance to explain your software and goals. I remember that meeting vividly. I was particularly insecure: I had been working on the Ersilia Model Hub for countless hours, I had put all of my passion and capacity into it, and now I would be in front of a team of professionals who were going to inspect my product in search of imperfections. That was a ‘health check’, I thought, the kind of health check a doctor would do on an unfinished Frankenstein’s creature.

That first meeting went well, as you may expect, mainly thanks to Steve Crouch (Software Architect and Research Software Group Lead) and James Graham (Research Software Engineer) who, at the other end of the screen, acted with extreme caution and empathy. In fact, I ended up talking quite emphatically about the Ersilia Model Hub. I claimed it would contain hundreds, thousands of machine learning models, first for drug discovery but then for all of biomedicine. Models would be up and running on the cloud, free, open-source and ready for wet-lab researchers, physicians and undergrad students. How far along were we, Steve or James asked. Very far from done, I said. Six months and 20,000 lines of code later, nothing was working yet. We agreed on the following: focus on a command-line interface (CLI) and have a case example that works. One model, no matter how simple it was: for instance, a tool that counts the number of atoms found in a given molecule.
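
To give a flavour of how small such a case example can be, here is a minimal Python sketch of an atom-counting command-line tool. It is not the Ersilia Model Hub's actual CLI or model: the names `count_atoms` and `main` are hypothetical, and the sketch assumes the molecule is given as a flat molecular formula such as C6H12O6, which may well differ from the input format the hub really uses.

```python
import argparse
import re

# An element-like symbol (one capital letter, optional lowercase letter)
# followed by an optional count. This checks the shape of a symbol only;
# it does not verify that the symbol is a real chemical element.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula: str) -> int:
    """Return the total number of atoms in a flat molecular formula.

    Hypothetical helper for illustration. Parentheses, charges and
    hydrates are out of scope and raise ValueError.
    """
    total = 0
    position = 0
    for match in _TOKEN.finditer(formula):
        # Reject any characters the pattern had to skip over.
        if match.start() != position:
            raise ValueError(f"cannot parse formula: {formula!r}")
        # A missing count means a single atom of that element.
        total += int(match.group(2) or 1)
        position = match.end()
    # Reject empty input and unparsed trailing characters.
    if not formula or position != len(formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    return total


def main(argv=None) -> int:
    """Parse one positional argument and print the atom count."""
    parser = argparse.ArgumentParser(
        description="Count the atoms in a flat molecular formula."
    )
    parser.add_argument("formula", help='for example "C6H12O6"')
    args = parser.parse_args(argv)
    print(count_atoms(args.formula))
    return 0
```

Calling `main(["C6H12O6"])` prints 24. Keeping the logic in a plain function and the argument parsing in a thin wrapper means the same function can be tested directly and reused later behind another interface.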

That first meeting took place in March and, if you check the activity of our GitHub repository, you’ll see a spike of commits in April. That is us rushing to get a CLI done and a minimal working example. A few weeks later, I got an email with an 8-page report by Steve and James in the attachment. From a technical standpoint, the report was excellent, ranging from installation recommendations to package versioning, from privacy and security issues to error tracing. There were many takeaways. For example: apply continuous integration (CI) pipelines and, if you do so, make sure you use a configuration matrix with multiple operating systems. Or: do not introduce untested code on the main branch of the repository. It may all sound obvious, but it isn't when you are still unable to figure out Git's logic and you have never written a README file before.

However, we quickly realised that the report was not about the technicalities. It was much more profound than this. I believe that, in almost every item, Steve and James were trying to tell me: think about your users, think about your community. Ask yourself who is going to care about the Ersilia Model Hub, wonder who is going to contribute to it eventually. What are the technical skills of your users? How many users do you expect? Do you want to communicate with them? How often? Do you need feedback? What are their specific obstacles? No easy answer to any of these questions. Some days I think we should aim at ten key users and some days I think we will have thousands. Some days I think we should target computer scientists and some days I think we’ll achieve a graphical user interface in a beautiful, responsive web app for anyone to use.

In their report, Steve and James put significant effort into explaining how to engage contributors. They suggested splitting them into two groups: model developers and platform developers. The first group would be, most probably, computational biologists and chemists who train machine learning models themselves and would use the Ersilia Model Hub as a means of dissemination. It would be great to have such contributors. If we ever do, we’ll be very close to our horizon. In the report, there were very specific ‘must’, ‘should’ and ‘could have’ guidelines for engaging this group of contributors. The second group would be software developers who will help us build a more professional platform. There is not a day that I don’t wonder how to attract software developers. What do they need? A technical challenge or a purposeful project?

I have to admit that I was surprised by the emphasis that Steve and James put on the contributors’ side of Ersilia. This didn’t seem very down-to-earth to me. But when we met again in a follow-up session, I perceived that they were actually positive about our technology. There was a lot to be done, true, but there was potential for building a community of users, contributors and developers. It would be necessary to first hand over the code to friends for testing, but spontaneous users would eventually show up. We would need a professional website, something more than a mere landing page, but there are good-enough templates out there and we should focus on content for the moment. We would need to write better docs, but a quick-start guide is sufficient for an alpha release. I believe that the follow-up meeting was encouraging as well as realistic because Steve and James truly understood and cared about our code. They were not there to judge it as I feared before the kick-off meeting. I gathered very clear answers, much-needed enthusiasm and a roadmap that for once seemed feasible. I am deeply thankful for that.

If you are a scientist with an academic background as I am, and if you struggle with developing professional software as I do, I encourage you to reach out to the SSI. This is a fantastic institute offering essential services and materials to the open-source community.