Demystifying big data, machine learning and all that

Posted by s.aragon on 7 November 2018 - 9:46am
Image courtesy of Samuel Zeller

By Becky Arnold, Software Sustainability Institute Fellow

This is part of a series of talks on good coding practice and related topics Becky Arnold has organised as part of her Fellowship plan.  

Data is getting bigger all the time. This presents many scientific opportunities as well as many technical difficulties. On the 12th of September, Rob Baxter of EPCC came to the University of Sheffield and gave the talk “Demystifying big data, machine learning and all that” about tactics for handling this shift and advice on how to avoid pitfalls.

The problem

Science is generating a lot of data and it ain’t slowing down any time soon. We’re still within a generation of the floppy disk, but as soon as the 2020s we’re going to have to deal with hundreds and even thousands of petabytes from multiple sources such as the Square Kilometre Array and the High-Luminosity Large Hadron Collider.


To spice things up further, scientific data is often messy and comes in a dizzying array of types and formats which are sometimes only understood by the researcher who generated it. This damages science by making experiments less easily reproducible and data more frustrating to share (see the data sharing panda from the New York University (NYU) Health Sciences Library for a painfully accurate depiction of this).


There will never be a one-size-fits-all solution, but Rob offered his thoughts and advice on how to handle it when data gets big.

  • Have a plan. If you know you’re going to be generating a lot of data, then before you start dumping it into files think about how it’s going to be used and how it’s going to be read. Rob suggested using DMPonline.

  • Work in teams of specialists. There’s an additional cost involved with hiring more people, but at a certain point it becomes cost-effective to employ multiple people to do specific things well rather than one person to do multiple things less skilfully. Rob broke this down into three main roles:

    • Data scientists: make sense of the data, handle analytics and statistics

    • Data engineers: wrangle the data, handle pipelines, data prep and cleaning

    • Data managers: store the data, handle file formats, backups, and metadata management

  • When your number of files becomes big, use a database. Rob suggested the possibility of using NoSQL (Not Only SQL) over SQL.

  • Use standard formats, whatever the standard is in your field.

  • Save metadata. The best data in the world is useless if no one knows what it actually is.

  • Think about what makes your data big. Is it raw volume? Is it the variety of data you’ll need to cope with (e.g. in the medical field)? Is it the velocity at which you’ll receive chunks of data? Thinking about this will help guide your approach.
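As a minimal sketch of two of the suggestions above — saving metadata alongside your data, and moving from loose files to a database once they pile up — here is what that might look like in Python using the standard library. The readings, filenames, and metadata fields are all hypothetical, invented for illustration; the point is the pattern, not the specifics.

```python
import json
import sqlite3

# Hypothetical measurements; in practice these would come from your
# instrument, simulation, or pipeline.
readings = [("sample_a", 0.52), ("sample_b", 0.61), ("sample_c", 0.49)]

# Save metadata alongside the data so someone else (or future you)
# can tell what the numbers actually are.
metadata = {
    "description": "Example absorbance readings (illustrative only)",
    "units": "arbitrary",
    "creator": "your name here",
}
with open("readings.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)

# Once loose files become unmanageable, a database keeps the data
# queryable in one place. SQLite ships with Python and needs no server.
conn = sqlite3.connect("readings.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (sample TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
conn.commit()

# Query instead of grepping through a directory of files.
rows = conn.execute(
    "SELECT sample, value FROM readings WHERE value > 0.5 ORDER BY sample"
).fetchall()
print(rows)
conn.close()
```

For genuinely big or highly varied data you would reach for something heavier than SQLite — a NoSQL store, as Rob suggested — but the habit of pairing data with metadata and querying rather than file-wrangling carries over directly.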

To conclude, we’re moving into a new era of stupendously big data. We need to think carefully about how to make the most of it.