CW20 - Mini-workshops and demo sessions

Mini-workshops and demo sessions will give an in-depth look at a particular tool or approach, and a chance to query developers and experts about how it might apply to participants’ areas of work.

Here is the list of mini-workshops and demo sessions that will take place at CW20 (they are subject to minor changes up until the event).

Participants are encouraged to ask questions and to enquire about how they could use the approaches and tools presented; these sessions are designed to be interactive.


On Tuesday, 31 March, 16:00 - 16:50

1.1 Embedding open research and reproducibility in the UG and PGT curricula

Facilitators: Andrew Stewart, University of Manchester; Phil McAleer, University of Glasgow; and Pablo Bernabeu, Lancaster University

Room: 2

Arguably, one of the most effective ways to enable researchers to adopt open and reproducible research practices is to introduce the underlying computational skills needed to create reproducible workflows early in their training. Traditionally, undergraduate degree programmes in the sciences and social sciences provide students with key skills in research methods and statistical modelling. By teaching these skills via open and reproducible statistical programming languages such as R, students develop the ability to engage in reproducible research practices alongside acquiring the underlying conceptual knowledge in statistics, coding, and research design. These skills are also valuable for future employability. The University of Glasgow has led the way in the large-scale introduction of R and reproducibility to undergraduate Psychology students from the very first semester of their first year, and other universities are now following suit.

In this workshop we will talk about our personal experiences of introducing reproducible research practices into the undergraduate (UG) and postgraduate taught (PGT) curricula, including:

  • the challenges that can occur at an institutional level and require risks to be taken
  • the challenges around colleagues needing to learn a new set of skills in order to supervise reproducible dissertations
  • pedagogical issues around introducing coding
  • practical challenges around teaching groups of students using software that they may need to install locally

 

1.2 Research Data Discovery Workshop

Facilitator: Phil Hesketh, Consent Kit

Room: 2

The Research Data Discovery Workshop is for researchers and their teams.

It’s designed to help us understand what data we produce in our research, and how we manage and take care of it, from capture to deletion.

Reflecting on the practicalities of how data flows through our processes can help us understand and prevent the risks of information getting into the wrong hands and of our participants being identified.

We're interested in adopting more human and open practices to bridge the gap between compliance and practice. The workshop itself is open source and is intended for researchers to take back and run within their own teams and organisations.

With the outcomes from the session, you’ll be able to:

  • Map out the flow of data, from capture to deletion, for a particular research method
  • Understand any risks in your process that might inform a Data Protection Impact Assessment (DPIA) or data policies
  • Think about how we might improve existing processes from a practical starting point
  • Write more accurate informed consent forms

 

1.3 Make Your Tools, Scripts and Analyses Open and more FAIR

Facilitators: Mateusz Kuzak and Jurriaan Spaaks, the Netherlands eScience Center

Room: 3

Open science has rapidly gained interest and importance across the academic world and society. It requires researchers to be open and transparent in sharing their methods, analyses, and raw and published data, so these can be reused, verified or reproduced by a wider audience. In the humanities and social sciences, research software (including code, scripts, tools and algorithms) is often an integral part of the methodological process, so there is a need for guidelines on making it open as well. The Netherlands eScience Center and DANS have launched a website (fair-software.nl) with five practical recommendations that help researchers make their software more FAIR (Findable, Accessible, Interoperable, Reusable). The site serves as a signpost for researchers seeking actionable advice on how to get started. The general idea of the session is that participants bring their own code, and the session organisers will help them improve the openness and FAIRness of their software or scripts using the new website as guidance.
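
To give a flavour of a first step in this direction, here is a minimal, illustrative Python sketch (ours, not the fair-software.nl tooling) that checks a local repository clone for a few of the files the recommendations touch on, such as a license and citation metadata; the file list is an assumption for illustration only:

```python
from pathlib import Path

# Toy checks inspired by the fair-software.nl recommendations.
# This is an illustrative sketch, not the official checklist tooling.
RECOMMENDED_FILES = {
    "LICENSE": "Add an OSI-approved license so others can reuse the code",
    "CITATION.cff": "Add citation metadata so the software can be cited",
    "README.md": "Describe what the software does and how to install it",
}

def check_repo(repo_dir):
    """Report which recommended files are missing from a local clone."""
    repo = Path(repo_dir)
    for name, advice in RECOMMENDED_FILES.items():
        status = "found" if (repo / name).exists() else "MISSING"
        print(f"{name:14s} {status:8s} {advice}")

if __name__ == "__main__":
    check_repo(".")  # run from the root of your own repository
```

Passing such a check is only a starting point; the website's recommendations also cover version-controlled public repositories, registries and software quality checklists.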

 

1.4 Hypergraph: open "as-you-go" research communication

Facilitator: Chris Hartgerink, Liberate Science GmbH

Room: 4

Hypergraph is the first open-source software implementation of an "as-you-go" iterative scholarly communication system based on the peer-to-peer knowledge commons (aka p2p commons). In this session, I give an introduction to how "as-you-go" communication (instead of "after-the-fact" publication) may help to start addressing scholarly research's core issues around access, archival, provenance, incentives, and quality by embedding chronology and new peer-to-peer technologies in the communication design. This also aims to make research management more effective and to facilitate the reproducibility of outputs. I will explain how the p2p commons and Hypergraph relate and what potential they hold for future developments, and I will give the first live demonstration of Hypergraph, showing that this parallel form of scholarly communication is available today without disrupting current ways of working. Researchers are invited to give feedback on the tooling they would like to see to make their everyday research life easier.

On Tuesday, 31 March, 17:00 - 17:50

2.1 The OpenAIRE Research Graph

Facilitator: Andrea Mannocci, ISTI-CNR

Room: 1

In the last decade, we have experienced an urgent need for a flexible, context-sensitive, fine-grained, and machine-actionable representation of scholarly knowledge, and for corresponding infrastructures for knowledge curation, publishing and processing. Such technical infrastructures are becoming increasingly popular for representing every facet of scholarly knowledge as structured, interlinked, and semantically rich Scholarly Knowledge Graphs (SKGs).

To this end, the OpenAIRE project has worked extensively on the OpenAIRE Research Graph: a massive Open Access (CC-BY) collection of interlinked metadata connecting research entities, such as articles, datasets, software and other research products, with contextual entities such as organisations, funders, funding streams, projects, research communities, and data sources.

As of today, the OpenAIRE Research Graph aggregates around 350 million interlinked metadata records collected from 10,000 scholarly data sources “trusted” by scientists.

After cleaning and deduplicating equivalent records, the graph counts ~110 million publications, ~10 million datasets, ~180,000 software research products, and ~7 million other products, with ~480 million (bidirectional) links between them. These products are in turn linked to seven research communities, and to organisations and projects from about 29 funders worldwide.

Conceived as a public and transparent good, populated from data sources trusted by scientists, the OpenAIRE Research Graph aims to put the discovery, monitoring, and assessment of science back in the hands of the scientific community, free of charge.

In this session, we will present the OpenAIRE Research Graph, describe its information model, and show its full potential by demonstrating how it fuels key services in the OpenAIRE ecosystem.
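
As a taste of what the graph makes possible, the sketch below queries the OpenAIRE public search API from Python. The endpoint and parameter names are assumptions based on the public API documentation, and may differ from the exact services demonstrated in the session:

```python
import requests

# A minimal sketch of querying the OpenAIRE public search API from Python.
# Endpoint and parameter names are assumptions based on the public API docs
# and may differ from the services demonstrated in the session.
BASE_URL = "https://api.openaire.eu/search/publications"

def search_publications(keywords, size=5):
    """Fetch publication metadata records matching the given keywords."""
    params = {"keywords": keywords, "size": size, "format": "json"}
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    results = search_publications("research software sustainability")
    print(results)  # inspect the raw record structure before drilling in
```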

 

2.2 Towards continuous research: Leveraging continuous integration and delivery through GitHub Actions

Facilitator: Tania Allard, Microsoft

Room: 2

Reproducible computational research is still hard to achieve. Worse, reaching this gold standard seems to demand mastering hundreds of techniques, tools and platforms, and keeping on top of all the novel developments in the area.

Though this is a rather complex problem, we can leverage software engineering practices such as continuous integration to keep track of our experiments and outputs in a single platform: GitHub, through GitHub Actions. Why GitHub? Because it is commonly used by researchers and developers alike to share and find software.

This workshop will be an introductory session on setting up continuous research workflows with GitHub Actions, from experimentation to publication-ready stages.

Attendees will need to bring their own laptop, have a GitHub account, and have some experience with coding and using the terminal.
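
As a flavour of what a continuous research workflow might run, here is a minimal, hypothetical Python check that a GitHub Actions workflow (for instance, a .github/workflows/ci.yml that executes it on every push) could use to fail the build when an analysis stops reproducing its own results; the analysis function is a stand-in of our invention:

```python
"""Reproducibility smoke test for a hypothetical continuous research workflow.

A GitHub Actions workflow (e.g. .github/workflows/ci.yml) could run
`python check_reproducibility.py` on every push, failing the build if the
analysis no longer reproduces identical results. The analysis below is a
deterministic stand-in for a real computation.
"""
import hashlib
import json
import random

def run_analysis(seed=42):
    """Stand-in analysis: any seeded, deterministic computation."""
    rng = random.Random(seed)
    return {"mean": sum(rng.random() for _ in range(1000)) / 1000}

def digest(result):
    """Hash the serialised result so any drift is easy to detect."""
    return hashlib.sha256(json.dumps(result, sort_keys=True).encode()).hexdigest()

if __name__ == "__main__":
    # Re-running with the same seed must give byte-identical output.
    assert digest(run_analysis()) == digest(run_analysis()), "analysis is not deterministic"
    print("reproducibility check passed:", digest(run_analysis())[:12])
```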

 

2.3 Towards software REF submissions with Code4REF

Facilitators: Olexandr Konovalov, University of St Andrews; Diego Alonso Alvarez, Imperial College London; Louise Brown, University of Nottingham; Patricia Herterich, University of Edinburgh; and Patrick McCann, University of St Andrews

Room: 3

We will present the Code4REF project (https://code4ref.github.io/), which aims to provide guidelines on recording research software in CRIS (Current Research Information Systems). Many universities use systems such as Pure, Symplectic and RIS, to name a few, to record research publications, e.g. to display them on their web pages and, in the UK, to prepare submissions for the REF (Research Excellence Framework) and report research outputs to funding bodies; software outputs, however, are much less commonly recorded there. We believe that scientific code needs to be treated as a primary research output, and should be equally well covered by CRIS.

We called the project Code4REF because recording research software in a university's CRIS is the first step towards submitting it to the REF. However, doing so will not serve only the REF: it will make it possible to get an overview of all the research software developed at an institution, in a research group, or by an individual developer, using CRIS or their public views. This will provide further evidence that software is vital for research, and will contribute to the campaign for the recognition of the RSE role within academia.

In this workshop, we will outline the current state of the project, explain how you can contribute by providing guidance for further CRIS and promoting Code4REF in your institutions, and outline our vision of a wider grassroots campaign starting from Code4REF. We will also demonstrate how one could use Python and the Pure API to extract and analyse software records (sketched below), and suggest possible activities for the CW20 hackday projects.
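
As a hint of what that demo might look like, here is a hedged Python sketch of pulling research-output records from a Pure instance and filtering for software. The host, endpoint path, authentication header and response fields are placeholders: they vary between Pure versions, so consult your institution's Pure API documentation:

```python
import requests

# A hedged sketch of pulling research outputs from a Pure CRIS via its REST
# API and filtering for software. The host, path, auth header and response
# fields below are placeholders; they vary between Pure versions.
PURE_HOST = "https://example.university.ac.uk"  # hypothetical Pure host
API_KEY = "YOUR-API-KEY"

def fetch_research_outputs(size=25):
    """Fetch one page of research-output records from the Pure API."""
    response = requests.get(
        f"{PURE_HOST}/ws/api/research-outputs",
        headers={"api-key": API_KEY, "Accept": "application/json"},
        params={"size": size},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("items", [])

def software_only(items):
    """Keep records whose output type mentions software (type labels differ
    between Pure configurations, so adjust the filter for your instance)."""
    return [item for item in items if "software" in str(item.get("type", "")).lower()]

if __name__ == "__main__":
    outputs = fetch_research_outputs()
    print(f"{len(software_only(outputs))} software records in the first page")
```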

 

2.4 Open Life Science: Empowering communities with open* principles

Facilitators: Malvika Sharan, The Alan Turing Institute and Yo Yehudi, University of Manchester

Room: 4

Many journals require that scientific and research outputs, such as data, protocols and code, are openly available so that they can be reviewed and reused by other researchers. Sharing source code and research papers alone isn’t usually enough to draw in new users and contributors who can collaboratively advance the field. This session will teach researchers and coders the basic principles for making their projects (e.g. scientific code repositories) not only open, but inclusive and welcoming to contributors, allowing scientific and research software to be more sustainable, reproducible, and accessible both to users and to other software developers. Experienced community managers are also welcome to attend and help pass their knowledge on to others.

The session will be run by the Open Life Science team (https://openlifesci.org/), who collectively have experience working openly, mentoring, and training others in open (science) practice. Rather than lecture-style presentations, there will be hands-on discussion groups, allowing participants to build on and share existing knowledge.

On Wednesday, 1 April, 13:45 - 14:35

3.1 Enrich a research paper with code and data

Facilitator: Emmy Tsang, eLife

Room: 1

In February 2018, eLife published its first executable article (elifesci.org/reprodoc). Executable articles enrich the traditional narrative of a research article with code, data and interactive figures that can be executed in the browser, downloaded and explored, giving readers a direct insight into the methods, algorithms and key data behind the published research, and an opportunity to remix and build upon it.

Over the past year, we’ve been working closely with our collaborators Stencila and Substance to build an open tool stack that would enable us to publish these computationally executable articles at scale. In this session, we will demo how authors can enrich their traditional eLife paper using Stencila Hub, through:

  • Starting a Stencila Hub project linked to their eLife paper
  • Converting the article to a computational notebook format
  • Uploading data required to reproduce tables and figures in the article
  • Replacing static tables and figures with the code chunks that produce them (a flavour of such a chunk is sketched after this list)
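
To illustrate that last step, here is a hypothetical code chunk of the kind that might replace a static figure. In a real article the data would come from the source-data file uploaded with the paper; this stand-alone version builds a small stand-in table so it runs on its own:

```python
# A hypothetical executable-article code chunk. In a real article the data
# would come from an uploaded source-data file, e.g.:
#   data = pd.read_csv("figure1_source_data.csv")
# Here we build a small stand-in DataFrame so the chunk runs on its own.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({
    "condition": ["control"] * 4 + ["treated"] * 4,
    "dose": [0, 1, 2, 3] * 2,
    "response": [0.1, 0.2, 0.35, 0.5, 0.1, 0.4, 0.7, 0.95],
})

fig, ax = plt.subplots(figsize=(5, 3))
for condition, group in data.groupby("condition"):
    ax.plot(group["dose"], group["response"], marker="o", label=condition)
ax.set_xlabel("dose")
ax.set_ylabel("response")
ax.legend(title="condition")
plt.show()  # in the executable article, rendered in place of the static figure
```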

We will share our current vision of how executable articles will be integrated into our production workflow and collect feedback. We also hope to engage participants in exploring potential functionalities for the tool stack and building a community-driven roadmap.

 

3.2 Productive research on sensitive data using cloud-based secure research environments

Facilitators: Martin O'Reilly and James Robinson, The Alan Turing Institute

Room: 2

Many of the important questions we want to answer for society require the use of sensitive data. In order to effectively answer these questions, the right balance must be struck between ensuring the security of the data and enabling effective research using the data.

Imposing insufficient security measures results in unacceptable risks of data exposure. However, excessively stringent security measures will often result in significant detrimental impact on the effective use of the data.

Agreeing the right balance between security and productivity is a challenging and nuanced process. Doing this in an ad-hoc way for each research project and dataset is time-consuming, and requires a shared understanding between data provider and researcher of both the data and the available security measures.

Additionally, most existing secure research environments (Safe Havens) are configured as organisational or regional facilities. These are complex, risky and expensive to set up and run, and are often configured with a fixed set of security measures, meaning that all projects are subject to the most restrictive measures required for the most sensitive data the Safe Haven supports.

In consultation with the community, we have been developing recommended policies and controls for performing productive research on sensitive data, as well as a cloud-based reference implementation, in order to address some of the above challenges. We have developed:

  • A shared model for classifying data sets and projects into common sensitivity tiers, with recommended security measures for each tier and a web-based tool to support this process (a toy sketch of such tiering logic follows this list).
  • A cloud-based Safe Haven implementation using software defined infrastructure to support the reliable, efficient and safe deployment of project specific secure research environments tailored to the agreed sensitivity tier for the project.
  • A productive environment for curiosity-driven research, including access to a wide range of data science software packages and community provided code.
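
For illustration only, here is a toy Python sketch of what tier-based classification logic could look like. This is not the Institute's actual classification model or web tool; the questions and tier boundaries are invented for the example:

```python
# A toy illustration of tier-based classification: NOT the actual model or
# web tool; the questions and tier boundaries are invented for the example.
def classify_tier(personal_data: bool, identifiable: bool,
                  commercially_sensitive: bool) -> int:
    """Return a coarse sensitivity tier; higher tiers mean stricter controls."""
    if personal_data and identifiable:
        return 3  # identifiable personal data: strongest measures
    if personal_data:
        return 2  # pseudonymised personal data
    if commercially_sensitive:
        return 1  # e.g. commercial-in-confidence material
    return 0      # open data: no special measures

if __name__ == "__main__":
    # A pseudonymised health dataset would land in tier 2 here.
    print(classify_tier(personal_data=True, identifiable=False,
                        commercially_sensitive=False))
```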

In the first half of this session we will present an overview of our recommended policies and controls, and demonstrate the systems we have developed to support their effective implementation in practice. The second half of the session will be an open discussion on how these fit the needs of the participants, and on how we can work together as a community to develop them further to best meet our collective needs.

 

3.3 Sustainability of Scientific Software: Experience from Several Projects

Facilitators: Vahid Garousi and David Cutting, Queen’s University Belfast

Room: 3

The mini-workshop will be run by two speakers. They will review the importance of the sustainability of scientific software, and will share experience from several projects in which they have used software engineering best practices to develop high-quality scientific software. Their aim is to network with the scientists attending the event, with a view to possible joint collaborations.

David Cutting will explore just how easy it is to write terrible, unsustainable code for all the right reasons, using real-life examples from research software and open-source development. He will highlight his own learning, and how small changes to working practice and lessons from industry can be effective in aiding the sustainability of scientific and research software.

Vahid Garousi will review the state of the art in engineering scientific software, and will present his experience from two projects: (1) engineering scientific software in collaboration with mechanical engineers for the optimisation of oil pipelines; and (2) collaborating with a team of reservoir (chemical) engineers to develop a large Fortran code base (~120 KLOC) to simulate oil reservoirs, involving a great deal of modelling and simulation.

 

3.4 Our Raging Planet

Facilitators: Phil Weir and Martin Naughton, Flax & Teal Ltd

Room: 4

How can volcanology reimagine a volcano erupting from Cave Hill, fluid dynamics illustrate a tsunami hitting Belfast Harbour, or forestry science a forest fire in Ormeau Park? Through open research and open source, we have streamlined the journey from complex research algorithms to user-friendly testing. OurRagingPlanet is a community-driven, open-source platform that uses available open data to simulate and contextualise natural disasters in Belfast and around the world.

Join us for a walk-through of the platform and how it was built, funded and sustained. Help us by engaging with the team on how best to improve the experience for all, especially researchers, open-source developers and educators.

We will use the opportunity to show how to work with open source, open data and open access.

On Wednesday, 1 April, 15:05 - 15:55

4.1 Using GraphQL to connect software with authors, publications and other scholarly resources

Facilitators: Martin Fenner, DataCite; Neil Chue Hong, Software Sustainability Institute; and Frances Madden, British Library

Room: 1

The EC-funded FREYA project is using persistent identifiers (PIDs), their metadata, and GraphQL to build the PID Graph: a graph of connected scholarly resources that can be queried, visualised and reused in other ways. For software, we use the more than 100,000 software packages described with DataCite DOIs and metadata. For easy access to this PID Graph, we have started to write and share Jupyter notebooks that can be reused and extended.

In this demo session we will give an introduction to the GraphQL query language, explain how GraphQL offers many advantages over REST APIs, and demonstrate how it can be used to explore connections between scholarly resources, including software. We will use the GraphQL API from DataCite, available at https://api.datacite.org/graphql, and Jupyter notebooks running on MyBinder. To follow along with the examples you need only a computer and a web browser; no local installation of software is required. We will demo a notebook that starts with the ORCID iD of a researcher and software author, looks at software authored by this person, and then finds other connected resources, including other versions of the same software as well as associated publications and datasets.
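
The sketch below shows the general shape of such a query, sent to the DataCite GraphQL endpoint from Python: starting from an ORCID iD and counting the person's connected works. The query fields follow published DataCite examples but should be treated as assumptions, since the schema may have changed:

```python
import requests

# A sketch of a PID Graph query sent to the DataCite GraphQL endpoint:
# start from an ORCID iD and count the person's connected works. The query
# fields follow published DataCite/FREYA examples but are assumptions here.
GRAPHQL_URL = "https://api.datacite.org/graphql"

QUERY = """
query ($orcid: ID!) {
  person(id: $orcid) {
    name
    works {
      totalCount
    }
  }
}
"""

def connected_works(orcid_url):
    """Return the person record, including a count of connected works."""
    response = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY, "variables": {"orcid": orcid_url}},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]["person"]

if __name__ == "__main__":
    person = connected_works("https://orcid.org/0000-0003-1419-2405")
    print(person["name"], "->", person["works"]["totalCount"], "connected works")
```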

 

4.2 Professionalising support for Open Research

Facilitator: Patricia Herterich, Digital Curation Centre, University of Edinburgh

Room: 1

Open Research encourages making research more transparent, and making outputs from every stage of the research workflow - including data, software and publications - freely available and accessible to all. Achieving this shift in research culture requires additional skills that are not covered by traditional professional development offerings for researchers. To support researchers in these new practices, a range of professional roles, such as research software engineers and data stewards, has developed over the last few years to provide training, support and guidance, address new funder requirements, and facilitate cultural change.

These support roles, however, often do not have formal career paths and rarely get rewarded for their contributions. Professional associations and support networks are therefore being built to campaign for recognition and to share knowledge and experiences. While sharing the same goals and facing similar challenges, these professional communities have so far had little exchange with one another to join forces and exploit synergies.

This workshop will bring data stewards and research software engineers together, and will provide an insight into case studies of initiatives from a range of countries to professionalise roles supporting open research. Group discussions will address the challenges these professions still face in getting recognised, and will explore opportunities for collaboration on policy changes, professional development, and hands-on cooperation in institutions.

 

4.3 Generating synthetic datasets using the QUiPP pipeline

Facilitators: Louise Bowler, Oliver Strickson, Camila Rangel Smith, Greg Mingas, Kasra Hosseini and Martin O'Reilly, The Alan Turing Institute

Room: 3

The proliferation of individual-level data sets has opened up new research opportunities; however, this individual information is tightly restricted in domains such as health and census records. This creates difficulties in working openly and reproducibly, since full analyses cannot then be shared. Methods exist for creating synthetic populations that are representative of the relationships and attributes in the original data. However, understanding the utility of the synthetic data while simultaneously protecting individuals' privacy, such that these data can be released more openly, is challenging.

The QUiPP (Quantifying Utility and Preserving Privacy in synthetic datasets) project aims to produce a framework that facilitates the creation of synthetic population data in which the privacy of individuals is quantified. In addition, QUiPP can assess utility in a variety of contexts: for example, does a model trained on the synthetic data generalise to the population as well as the same model trained on the confidential sample?

In this mini-workshop, we will introduce synthetic record generation. We will illustrate the problems of assessing the privacy impact on individuals contained in the original dataset, and of measuring the utility of the synthetic data. We will demonstrate QUiPP applied to several example datasets, showing how it can be used to assess the privacy impact and utility of synthetic data generation. The session will conclude with an open discussion around participants' confidential data and data-sharing needs.
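
As a toy illustration of the utility question above (not the QUiPP pipeline itself), the sketch below fits a naive per-class Gaussian model to a stand-in "confidential" sample, draws a synthetic sample from it, and compares classifiers trained on each against fresh data from the same population:

```python
# A toy sketch of the utility question: does a model trained on synthetic
# data generalise as well as one trained on the confidential sample?
# This naive per-class Gaussian synthesiser is NOT the QUiPP pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_population(n_per_class=500):
    """Stand-in 'population': two Gaussian classes in two dimensions."""
    X = np.vstack([rng.normal(0, 1, (n_per_class, 2)),
                   rng.normal(2, 1, (n_per_class, 2))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

def synthesise(X, y):
    """Refit per-class Gaussians to the sample and resample from them."""
    parts, labels = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        parts.append(rng.normal(Xc.mean(axis=0), Xc.std(axis=0), Xc.shape))
        labels.append(np.full(len(Xc), cls))
    return np.vstack(parts), np.concatenate(labels)

X_real, y_real = sample_population()     # the "confidential" sample
X_syn, y_syn = synthesise(X_real, y_real)
X_test, y_test = sample_population()     # fresh data from the population

for name, (X, y) in {"real": (X_real, y_real), "synthetic": (X_syn, y_syn)}.items():
    acc = LogisticRegression().fit(X, y).score(X_test, y_test)
    print(f"model trained on {name} data: accuracy {acc:.3f}")
```

Comparable accuracies suggest the synthetic data preserves utility for this task; the harder part, which QUiPP addresses, is quantifying the accompanying privacy risk.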

 

4.4 Exploring Communities of Practice

Facilitator: Aleksandra Nenadic, Software Sustainability Institute

Room: 4

Building open, diverse, inclusive and functional communities is key to the themes of CW20. In this breakout session, we will first briefly discuss what Communities of Practice (CoPs) are (and the different terminology used around them) and what constitutes a CoP. This will be followed by examples of different CoPs from the research software space, and by personal experiences from the UK’s Software Sustainability Institute and North-West University, South Africa.

The rest of the session will engage the audience in exploring their own experiences of starting, managing and sustaining CoPs. We will discuss the questions raised to explore dos and don’ts: the challenges, what people tried, what worked, what didn’t, and why. This will help the group learn how to establish and maintain communities in different areas, and how to save on start-up time by following best practices and considering aspects of community building that might otherwise be missed. We plan to collate the discussions in some published form after the conference, in order to share them with a wider audience - either as a blog post, a mini guide, or a set of simple rules that could also be published as an article.

 
