Software and research: the Institute's Blog

Data Carpentry goes to the Netherlands

By Aleksandra Pawlik, Training Lead.

Last week the Institute helped to run a Data Carpentry hackathon and workshop at the University of Utrecht in the Netherlands. Both events were part of an ELIXIR Pilot Project aiming to develop Data and Software Carpentry training across the ELIXIR Nodes. The project is coordinated by ELIXIR UK, and a number of other Nodes are partnering up, including ELIXIR Netherlands, ELIXIR Finland and ELIXIR Switzerland.

The hackathon consisted of two days during which the participants, representing ten ELIXIR Nodes, worked on Data Carpentry training materials. Day one started with an introduction to the Data and Software Carpentry teaching model, followed by a review of and discussion on the existing materials. The participants suggested possible improvements to the existing materials and new topics to be developed. The overall theme of the hackathon was genomics, so the participants could base their work on the existing content for teaching genomics in Data Carpentry. Eventually three groups were formed:

  • Group 1, which worked on creating training materials on using ELIXIR Cloud resources.
  • Group 2, which worked on a decision tree for using cloud computing.
  • Group 3, which worked on different aspects of understanding how to use one's data for genomics. In particular, the group worked on describing file formats, file manipulation, pipeline integration, post-assembly de novo RNA transcriptome analysis, handling BLAST annotation output and verifying data.

Revealing the magic of field theory algebra

By Paul Graham, EPCC and Software Sustainability Institute.

We have a new project working with Dr Kasper Peeters of Durham University and his software, Cadabra: a computer algebra system which can perform symbolic algebraic computations in classical and quantum field theory. In contrast to other software packages, Cadabra is written with this specific application area in mind, and addresses points where more general-purpose systems are unsuitable or require excessive amounts of additional programming to solve the problems at hand.

Cadabra has extensive functionality for tensor computer algebra: tensor polynomial simplification including multi-term symmetries, fermions and anti-commuting variables, Clifford algebras and Fierz transformations, implicit coordinate dependence, multiple index types, and much more. The input format is a subset of TeX, and it supports both a command-line and a graphical interface.
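
To make "multi-term symmetries" concrete: a mono-term symmetry (such as antisymmetry in an index pair) maps a single term onto plus or minus itself, whereas a multi-term symmetry only holds for a sum of terms, so it cannot be exploited by simple index canonicalisation. The classic example, and the kind of relation a system like Cadabra has to take into account when simplifying tensor polynomials, is the first Bianchi identity of the Riemann tensor:

```latex
% First Bianchi identity: no single index permutation maps one term
% onto another; only the three-term sum vanishes, which is what
% "multi-term symmetry" means here.
R_{abcd} + R_{acdb} + R_{adbc} = 0
```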

The ultimate brain-dump: unifying Neuroscience software development

By Dr Robyn Grant, Lecturer in Comparative Physiology and Behaviour at Manchester Metropolitan University.

In the neurosciences, we produce tons of data, in many different forms – images, electrophysiology and video, to name but a few. If we are really going to work together to answer big picture questions about the brain, then the different data types really need to start interacting and building on information from each other. I understand this; however, it is quite complex in practice, and it raises questions about how best to specify data types, annotations and formats to make sure researchers can develop the software and hardware to interface efficiently.

A first step towards a unifying concept for data, software and modelling is the Human Brain Project (HBP). This is a European funding initiative that has received a lot of criticism globally for its big-picture thinking and its focus on human brain experiments. At the British Neuroscience Association meeting 2015 this spring, I attended a session on the HBP, interested to see what they might say.

The Light Source Fantastic: a bright future for DAWN

By Steve Crouch, Research Software Group Leader, talking with Matt Gerring, Senior Software Developer at Diamond Light Source and Mark Basham, Software Sustainability Institute Fellow and Senior Software Scientist at Diamond Light Source.

This article is part of our series: Breaking Software Barriers, in which we investigate how our Research Software Group has helped projects improve their research software. If you would like help with your software, let us know.

Building a vibrant user and developer community around research software is often a challenge. But managing a large, successful community collaboration that is looking to grow presents its own challenges. The DAWN software supports a community of scientists who analyse and visualise experimental data from the Diamond Light Source. An assessment by the Institute has helped the team to not only attract new users and developers, but also increase DAWN’s standing within the Eclipse community.

The Diamond Light Source is the UK’s national synchrotron facility, based at the Harwell Campus in Oxfordshire. By accelerating electrons to near light speed, the synchrotron generates light that is 10 billion times brighter than the sun. Over 3000 scientists have used this light to study all kinds of matter, from new medicines and disease treatments to structural stresses in aircraft components and fragments of ancient paintings.

How to avoid having to retract your genomics analysis

By Yannick Wurm, Lecturer in Bioinformatics, Queen Mary University of London.

Biology is a data science

The dramatic plunge in DNA sequencing costs means that a single MSc or PhD student can now generate data that would have cost $15,000,000 only ten years ago. We are thus leaping from lab-notebook-scale science to research that requires extensive programming, statistics and high-performance computing.

This is exciting and empowering - in particular for small teams working on emerging model organisms that previously lacked genomic resources. But with great power comes great responsibility... and risks that things could go wrong.

These risks are far greater for genome biologists than for, say, physicists or astronomers, who have strong traditions of working with large datasets. This is because:

  • Biologist researchers generally learn data handling skills ad hoc and have little opportunity to learn best practices.
  • Biologist Principal Investigators - having never themselves handled huge datasets - have difficulty in critically evaluating the data and approaches.
  • New data are often messy, with no standard analysis approach; even so-called standard analysis methodologies generally remain young or approximate.
  • Analyses that are intended to pull out interesting patterns (e.g. genome scans for positive selection, GO/gene set enrichment analyses) will also enrich for mistakes or biases in the data.
  • Data generation protocols are immature and include hidden biases, leading to confounding factors (when the things you are comparing differ not only in the trait of interest but also in how they were prepared) or pseudoreplication (when one independent measurement is counted as multiple measurements).
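
Pseudoreplication is easiest to see with a toy example. The following minimal Python sketch (ours, purely illustrative, not from Yannick's analyses) simulates two groups with no true difference between them; treating technical replicates as independent observations inflates the effective sample size and can make a chance difference between groups look highly significant, whereas testing on per-sample means does not.

```python
# Illustrative sketch of pseudoreplication: technical replicates of the
# same biological sample are not independent measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_samples = 3     # biological samples per group
n_tech_reps = 10  # technical replicates per biological sample

def simulate_group(group_mean):
    """Per-sample biological variation plus small technical noise."""
    sample_means = rng.normal(group_mean, 1.0, size=n_samples)
    values = np.repeat(sample_means, n_tech_reps)
    return values + rng.normal(0.0, 0.1, size=values.size)

# Both groups are drawn from the SAME distribution: no real difference.
group_a = simulate_group(10.0)
group_b = simulate_group(10.0)

# Wrong: every technical replicate treated as an independent observation.
_, p_pseudo = stats.ttest_ind(group_a, group_b)

# Better: average technical replicates so each biological sample
# contributes a single value.
a_means = group_a.reshape(n_samples, n_tech_reps).mean(axis=1)
b_means = group_b.reshape(n_samples, n_tech_reps).mean(axis=1)
_, p_correct = stats.ttest_ind(a_means, b_means)

print(f"p-value with pseudoreplication: {p_pseudo:.3g}")
print(f"p-value on per-sample means:    {p_correct:.3g}")
```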

Scanning for peace of mind

By Simon Choppin, Software Sustainability Institute Fellow and Research Fellow, Sheffield Hallam University.

Developments in technology and software are often accompanied by cohorts of enthusiastic developers and engineers who are convinced of their revolutionary potential. However, new technology often requires careful nurturing before it truly changes the way we do things – a number of things have to be working in harmony before the revolution can take place. If you push too hard, and in the wrong direction, things can veer off course and never recover. I wrestled with the nuances of technology and its potential at a seminar on breast surface scanning at Ulster University, Belfast.

Inspiring confidence in computational drug discovery - automated testing for ProtoMS simulations

By Devasena Inupakutika, Steve Crouch and Richard Bradshaw

Computer simulations are a good way to study molecular interactions. There have been striking advances in the application of computer simulation to complex systems, shedding light on phenomena across the breadth of modern chemistry, biology, physics and drug design. ProtoMS (short for Prototype Molecular Simulation) is one such major piece of Monte Carlo biomolecular simulation software. The Software Sustainability Institute is working with the ProtoMS developers to review and evaluate the software and its code, assessing its usability, ease of installation and long-term sustainability, and collating areas for improvement.

Simulating biomolecules is a particularly challenging problem and requires the use of specific computational techniques to perform experiments or study processes that cannot be investigated by any other methodology. The Essex Research Group of the School of Chemistry at the University of Southampton research innovative applications of simulations to biological systems and focus on the development of new methods and software for biomolecular simulation.

ProtoMS is used to develop new methods that enhance the accuracy and precision of biomolecular simulations, to apply these ideas in the calculation of relative protein/ligand binding free energies, and to add new functionality to computational chemistry programs. The software package consists of a Python interface on top of Fortran subroutines. Both GNU and Intel C/C++ and Fortran compilers are supported. Richard Bradshaw and his colleagues from the Essex Research Group have been using and improving ProtoMS to develop new science, with the aim of replacing time-consuming and expensive wet-lab screening of putative drugs with fast and inexpensive computational tools.
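
To give a flavour of what automated testing for a simulation code can look like, here is a minimal, hypothetical sketch: run a short simulation and compare its numerical output against stored reference results within a tolerance. The entry point, file names and tolerance below are illustrative assumptions of ours, not a description of the actual ProtoMS test suite.

```python
# Hypothetical regression test for a Monte Carlo simulation code.
# The script name, input/output files and tolerance are illustrative.
import subprocess
import unittest

import numpy as np


def run_simulation(input_file, output_file):
    """Run a short simulation and return its numerical results."""
    # Assumed command line; a fixed random seed in the input keeps the
    # Monte Carlo run reproducible.
    subprocess.run(["python", "protoms.py", input_file], check=True)
    return np.loadtxt(output_file)


class TestShortSimulation(unittest.TestCase):
    def test_results_match_reference(self):
        results = run_simulation("short_test.cmd", "results.dat")
        reference = np.loadtxt("reference_results.dat")
        # Compare within a tolerance rather than bit-for-bit, since
        # compiler and platform differences can change the last digits.
        np.testing.assert_allclose(results, reference, rtol=1e-6)


if __name__ == "__main__":
    unittest.main()
```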

Open-access publishing: trials of the transition

By Dr Robyn A Grant, Lecturer in Comparative Physiology and Behaviour at Manchester Metropolitan University.

I have mixed opinions about open-access publishing. Finding the money to cover open-access publishing is not easy, especially for early career researchers during this transitional period as open access becomes the norm. Despite the costs, I really believe in open-access publishing. We want our science to be read, surely! Especially in this interdisciplinary era, it is important for non-academic stakeholders (such as patients, consultants, managers, developers, etc.) to have access to our outputs. And, of course, as academics, we are publicly funded, so outputs should be published for all to see.

We do not receive much money to cover the costs of open-access publishing. In fact, my university receives only enough to fund around two open-access publications each year. Don’t worry, I hear you cry: in this open-access era the library's journal subscription costs will cover your publication costs. However, in this transition period, with subscription fees being replaced by publishing fees, universities are still subscribing to journals while also trying to publish open access - in effect paying twice. If you have a Wellcome Trust or Research Council grant, this will cover your publishing costs, but of course if you are just starting out like me, you might not have a large grant yet. I guess I am just left counting my pennies to try to cover the thousands of pounds it costs to publish my papers under open access - amid rumours that only open-access papers will count in future research assessment exercises.

Something for everyone - resources from the CW15 demo sessions

By Shoaib Sufi, Community Leader

The Collaborations Workshop 2015 (CW15) and Hackday lasted three short - but highly charged - days, and attracted over 85 people to work on interdisciplinary research. As the rich set of resources created at the event is written up, we will be releasing these short posts to share the outcomes and tools with the community.

The CW15 demos covered a vast array of subjects: from systems for creating data management plans (DMPonline) to new ways of packaging software, data, papers and other associated resources, making Research Objects the new index of your research work. Other demos showed off tools to help visualise data sets using Python (DAWN Science), workflow-oriented systems that make it easier to integrate the vast array of web-based datasets using a visual programming paradigm (such as Apache Taverna), and systems for cataloguing data-driven experiments (SEEK).

Bioinformatics tools, services and know-how had a huge showing, with strong representation from the Wurmlab. Important approaches to software development for researchers were covered, such as the user-centric design approach used to develop SequenceServer. The need to identify and fill gaps in Bioinformatics software provision was highlighted with an exemplar: the need to validate gene predictions, which led to the GeneValidator software. Crowdsourcing applied to improving the quality of research data offered a novel example of how techniques from other areas can be used to improve research tools, as in the gene prediction system Afra.

Round-trip testing for Provenance Tool Suite


By Mike Jackson, Software Architect.

Provenance is a well-established concept in arts and archaeology. It is a record of ownership of a work of art or an antique, used as a guide to authenticity or quality. In the digital world, data too can have provenance: information about the people, activities, processes and components, for example software or sensors, that produced the data. This information can be used to assess the quality, reliability and trustworthiness of the data.

Trung Dong Huynh, Luc Moreau and Danius Michaelides of Electronics and Computer Science at the University of Southampton research all aspects of the “provenance cycle”: capture, management, storage, analytics, and representations for end users. As part of their research, they have developed the Southampton Provenance Tool Suite, a suite of software, libraries and services to capture, store and visualise provenance compliant with the World Wide Web Consortium (W3C) PROV standards.
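
The "round trip" of the project title refers to writing provenance out in one of the PROV serialisations and reading it back without losing information. As a rough illustration only (the example document is invented, and this is not the project's actual test code), a round-trip check with the prov Python package might look like this:

```python
# Rough sketch of a PROV round-trip check using the prov Python package.
# The namespace, entity and agent below are invented for illustration.
from prov.model import ProvDocument

# Build a small PROV document: a dataset attributed to a researcher.
doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")
doc.entity("ex:dataset")
doc.agent("ex:researcher")
doc.wasAttributedTo("ex:dataset", "ex:researcher")

# Serialise to PROV-JSON and read it back in.
serialised = doc.serialize(format="json")
round_tripped = ProvDocument.deserialize(content=serialised, format="json")

# The round trip should preserve every record (assumes the library's
# document equality compares the contained records).
assert round_tripped == doc, "round trip changed the document"
```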