Managing computational notebooks - an overview of ‘chopportunities’
By Software Sustainability Institute Fellow Patricia Herterich
This blog post is also posted on the Digital Curation Centre (DCC) blog.
There are over five million Jupyter notebooks on GitHub, and they are increasingly used in teaching because their combination of code, results and documentation makes them a good resource to interact with and learn from. However, they are rarely covered in conversations about teaching materials or seen as a standalone research output.
This is a reason to bring the conversation around computational notebooks to the Research Data Alliance (RDA), a community-driven initiative that brings together a variety of stakeholders to discuss issues around research data and related outputs such as research software. The RDA hosts several working and interest groups and their activities, as well as two plenary meetings each year. With a mix of infrastructure providers, researchers, librarians and others attending the RDA Plenary, we proposed a birds of a feather session on computational notebooks to take advantage of the range of expertise present at the latest plenary meeting in Helsinki, Finland in October. For the session, we identified four major challenges for the management of computational notebooks; each was introduced by a lightning talk and then discussed in detail in a breakout group.
Notebooks are most commonly shared to get credit for the code they contain and to allow others to find and re-use the work (addressing some of the FAIR principles). This raises challenges around how notebooks are cited. Furthermore, computational notebooks can be published as a form of documentation to provide a more detailed explanation of code or a workflow and to increase reproducibility. This is especially applicable to notebooks used in teaching.
Preservation - do we keep them and if so, how?
I led the breakout group discussing challenges around the long-term preservation of notebooks. We started by trying to define “long term”: the preservation of interactive notebooks currently happens with a time horizon of about two years in mind, whereas librarians and archivists usually think of a time span of ten years when talking about preservation. Furthermore, we discussed the level at which notebooks can be preserved. They can be saved as text files, and the outputs created by running parts of the code can be archived as separate files, but this loses some of the context and interactivity of the notebook. Containers such as Docker can be used to preserve notebooks in an interactive way, but the long-term availability of these services is unclear.
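To make the text-file option more concrete, here is a minimal sketch in Python of what separating a notebook's code from its stored outputs could look like. It is not something the breakout group produced. It assumes the notebook follows the Jupyter notebook JSON format (version 4), and the function name `split_notebook` and the tiny example notebook are made up for illustration.

```python
import json


def split_notebook(notebook_json):
    """Separate the code in a notebook from its stored outputs.

    Hypothetical helper for illustration only. Assumes the version 4
    Jupyter notebook JSON layout: a top-level "cells" list in which code
    cells carry a "source" and a list of "outputs".
    """
    notebook = json.loads(notebook_json)
    code_parts = []
    outputs = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            # Markdown and raw cells are skipped, so the documentation
            # around the code is not carried over.
            continue
        source = cell.get("source", "")
        # The format allows "source" to be a string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        code_parts.append(source)
        for output in cell.get("outputs", []):
            # Remember which cell each output came from.
            outputs.append({"cell": index, "output": output})
    return "\n\n".join(code_parts), outputs


# A tiny hand-written notebook, invented for this example.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["x = 1 + 1\n", "print(x)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
})

code_text, archived_outputs = split_notebook(example)
```

The code survives as plain text and the outputs as separate records, but the markdown cell and the interactive link between a cell and its result are gone, which is the kind of context loss we discussed.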
Not all notebooks consist of code, outputs and documentation; some might just store scripts that could easily be preserved as a text file. However, we currently do not have the terminology to classify notebooks and to decide on preservation actions and trade-offs based on that classification.
Notebooks and big data
Another area of interest is scaling computational notebooks to use them with large datasets and major e-infrastructures. This requires setting up dedicated services and infrastructure that can handle the processing of large datasets and having notebooks interact with them. The EGI, a computing infrastructure provider, has created a use case that provides Jupyter notebooks as a cloud computing resource that can access data stored in its DataHub. Outstanding challenges are providing options for batch computing and setting up a service that allows us to scale to running several notebooks in parallel. In addition to providing the infrastructure, training will be needed to ensure the service is used in the most effective way.
‘Chopportunities’ for managing computational notebooks
A fourth topic was discussed that cuts across publishing notebooks, notebooks and long-term preservation, and using computational notebooks with big data and cloud resources: notebooks as digital objects that align with the FAIR (Findable - Accessible - Interoperable - Re-usable) principles. This is clearly a popular topic as you can also see from the discussions on FAIR at the 2019 eScience Symposium and recent publications on FAIR software. The lightning talk provided an overview of the FAIR principles that can be applied to computational notebooks as well as raising issues where more clarification might be needed. As with the other breakout discussions, the group did not identify solutions but a list of ‘chopportunities’ - challenges that are also opportunities for the community to develop new services or initiatives.
The session identified a variety of steps that could be taken by the RDA community. Members of the group submitted a proposal to follow up with a session at the upcoming RDA Plenary in Melbourne and will continue discussions at other suitable events. If you have ideas and want to get in touch, you can do so on the group’s GitHub repository, where you can also find a write-up.
Notes from all four breakout discussions can be found on Google Drive.
Tweets from the session can be found via the hashtag #RDACompNotebooks
My lightning talk slides on notebooks and long-term preservation are available on Zenodo.
Thanks to Hugh Shanahan for organising the session and my co-presenters Martin Fenner, Christine Kirkpatrick, and Gergely Sipos.