Improvements for man and machine in research software and data publishing

Posted by j.laird on 12 October 2022 - 10:00am

By Scott Edmunds, Chief Editor of GigaByte.

Frictionless Data not only improves the machine readability of scientific articles, but also enables humans to interact directly with the data within the article itself. A new article just published in GigaByte Journal demonstrates a workflow in which Frictionless Data helps bring papers to life with interactive figures.

The need for information from research outputs to be more findable, accessible, interoperable, and reusable (FAIR) has spurred researchers, database managers, and publishers to continually look for new and better ways to make information machine-readable. An equally important area is creating articles that readers can actively engage with, rather than passively taking in information from reading a published article. One tool that easily improves the machine readability of data is a data standard called Frictionless Data, developed by the Open Knowledge Foundation. A Frictionless Data package is a JSON file enclosing a list of local or remote resources (the data) and the meta-information of the package and each resource (for example, author and licence). A new Technical Release paper published in GigaByte demonstrates that Frictionless Data not only drastically improves machine readability, but can also turn normally static figures within the article into dynamic entities that allow readers to interact directly with the data. The use of Frictionless Data can therefore tackle two important activities at once: allowing both man and machine to use and directly engage with scientific outputs in a dynamic fashion.
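To make the package structure described above concrete, here is a minimal sketch in Python that builds such a JSON descriptor using only the standard library. The field names (name, title, licenses, contributors, resources, path) follow the published Data Package specification as we understand it; the package name, author, file path, URL, and the helper functions build_data_package and is_remote are hypothetical and chosen purely for illustration. They are not taken from the GigaByte article.

```python
import json


def build_data_package(name, title, licence_name, authors, resources):
    """Return a dict shaped like a Frictionless Data Package descriptor.

    Illustrative helper only: it does not validate against the
    specification; it just assembles the package-level meta-information
    and the list of resources.
    """
    return {
        "name": name,
        "title": title,
        "licenses": [{"name": licence_name}],
        "contributors": [{"title": a, "role": "author"} for a in authors],
        "resources": resources,
    }


def is_remote(path):
    """Treat a resource as remote when its path is an HTTP(S) URL."""
    return path.startswith(("http://", "https://"))


# All values below are placeholders, not data from the published paper.
descriptor = build_data_package(
    name="example-figure-data",
    title="Data behind an example figure",
    licence_name="CC-BY-4.0",
    authors=["A. Researcher"],
    resources=[
        # A local resource, referenced relative to the descriptor file.
        {"name": "counts", "path": "data/counts.csv", "format": "csv"},
        # A remote resource, referenced by URL.
        {
            "name": "summary",
            "path": "https://example.org/data/summary.csv",
            "format": "csv",
        },
    ],
)

# Serialise the descriptor; by convention it is saved as datapackage.json.
descriptor_json = json.dumps(descriptor, indent=2)

for resource in descriptor["resources"]:
    kind = "remote" if is_remote(resource["path"]) else "local"
    print(f"{resource['name']}: {kind} resource at {resource['path']}")
```

Because the descriptor is plain JSON, any tool can read the same file to discover where the data lives and under what licence it may be reused, without parsing the article text.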

Integration of Frictionless Data was carried out in a new article by a team of researchers from the University of Melbourne led by Professor Anthony Papenfuss, whose lab has long been an advocate of open and reproducible research, making sure that the data, source code, and every other sharable component of their research is openly available to the community. This makes their work especially amenable to new tools being layered on top of their articles to make the published work dynamic and actively usable. The new article presents two new open source tools, svaRetro and svaNUMT, for interpreting difficult-to-understand structural variation in genome analysis. These help annotate novel genomic events that are missed in most genome assembly pipelines, such as retrotransposition events and the insertion of DNA fragments from the mitochondria into the nuclear DNA, which contribute to the complexity of genome sequences and to the understanding of gene function and genome evolution.

The openness and availability of all of the research components behind these tools and analyses created a perfect opportunity to implement Frictionless Data to make the article far more machine readable. While adding this to the article as part of a FAIR data internship at GigaScience Press, Raniere Silva, now at City University of Hong Kong and formerly Community Officer at the SSI, made the fortuitous discovery that Frictionless Data could also play a role in improving human interaction with the article. You can get a more detailed, behind-the-scenes look at his work this summer in a recent end-of-project blog. The figures were, for the first time, regenerated in an interactive manner. In the example here, readers can not only view the summary information presented in the figure, but can also hover over data points to see the exact numbers and information behind them, and manipulate the figure itself to view specific components of interest.

Raniere says:

My biggest surprise was that the Frictionless Data Package specifications in conjunction with the popular Plotly tool has functions to convert a static visualisation into a dynamic one. This massively reduces the barrier for many researchers to produce dynamic data visualisation as they only need to add a line or two to their code. GigaByte made a huge leap by publishing the dynamic data visualisation and I hope it inspires other journals to publish dynamic data visualisation.

When asked what they found most useful from this process, the authors stated: “The interactive figures are a great addition to the paper. We found the interactive functions made reading labels easier, especially for label-rich figures, and liked that the figures were accessible in SVG format, allowing viewing and editing without losing information from the figures.”

To promote the use of Frictionless Data in more published articles, Raniere wrote a detailed handbook that includes an introduction to the use of Frictionless Data, an introduction to the specifications, short working examples for creating an author’s own data package, and long examples, based on published articles in GigaScience Press journals, illustrating the creation and use of Frictionless Data. The goal is for the handbook to serve as the start of a conversation within the scientific community on how to embrace Frictionless Data. The handbook also provides a resource and guidance that make it easier for data producers to submit articles with these packages to data publishers, such as GigaScience Press.

Of added interest, in addition to the inclusion of Frictionless Data, is that for the first time the process of regenerating the figures in an interactive manner was combined with a CODECHECK certificate of reproducible computation. You can read more on our experiences of using CODECHECK in software review and its part in confirming the reproducibility of the (in)famous Imperial COVID-19 model results, but this is the first time the CODECHECKing process has been worked into the process of generating interactive figures for publication.

The use of Frictionless Data, and all the downstream elements it enables, are transformative steps in scientific publishing: they improve machine readability and reproducibility, and turn scientific articles from their old-fashioned static format into 21st century living documents. These types of novel, data-literate additions to the publication process are part of the reason GigaByte was the winner of the 2022 ALPSP Innovation in Publishing Award presented last month.

Zenodo Codecheck certificate

Further Reading:

Dong R, Cameron D, Bedo J, Papenfuss AT. (2022). svaRetro and svaNUMT: modular packages for annotating retrotransposed transcripts and nuclear integration of mitochondrial DNA in genome sequencing data. GigaByte. https://doi.org/10.46471/gigabyte.70

Raniere Gaia Costa da Silva. (2022). Frictionless Data Handbook for Researchers. http://dx.doi.org/10.5524/102316

Raniere Gaia Costa da Silva. (2022). CODECHECK Certificate 2022-018. Zenodo. https://doi.org/10.5281/zenodo.7084333