Bigger Data, Bigger Challenges — A review of Advances in Data Science 2017

Posted by s.aragon on 20 June 2017 - 9:00am

By Raniere Silva, Software Sustainability Institute.

Manchester hosted the Advances in Data Science 2017 meeting, organised by the Data Science Institute, on 15-16 May 2017. It was an eye-opening meeting on privacy, and an inspiring one for the ways that researchers can analyse and visualise their data.

The meeting started with a talk by Mark Girolami covering inference and prediction applied to London's retail development. It was interesting to discover how important retail development is when planning the future of any city, since it's one of the main reasons why people travel across their cities. The introduction of drones and self-driving cars would completely change why and when we travel across our cities, which will create many opportunities for researchers in this area. Following Mark's talk, Raia Hadsell presented ways to overcome catastrophic forgetting; that is, when a neural network forgets what it has learnt once it starts learning a new problem. Raia used Atari games played by an artificial intelligence (AI) agent to demonstrate the catastrophic forgetting effect. The attendees watched the AI play Breakout with superhuman performance but lose badly when it played Pong. Fortunately, as explained by Raia, there are ways to minimise the catastrophic forgetting effect.

The rest of the first day was mostly filled with talks related to data privacy. Yves-Alexandre de Montjoye talked about a misconception that some researchers have: that by removing obvious identifiers from the spreadsheet holding the answers to their latest survey, they protect the privacy of the participants in their experiment. For example, the movie ratings of Netflix subscribers can be used to identify them, as demonstrated by Arvind Narayanan and Vitaly Shmatikov. As scary as this is, researchers continue to investigate new ways to anonymise users' data. Two other great talks about privacy were delivered by Jean-Pierre Hubaux, who focused on the development of a system to enable data sharing between Swiss hospitals, and by StJohn Deakins, who talked about a mobile app that helps users to control their data.
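The misconception about removing obvious identifiers can be illustrated with a small check. The sketch below is not the method from any of the talks, nor the Narayanan and Shmatikov attack; it only counts how many rows of a made-up survey remain unique on a few remaining columns. The records, the column names and the `unique_fraction` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical survey rows: names were already removed, but a few
# seemingly harmless columns remain alongside the answers.
records = [
    {"postcode": "M13", "birth_year": 1985, "gender": "F", "answer": "yes"},
    {"postcode": "M13", "birth_year": 1985, "gender": "F", "answer": "no"},
    {"postcode": "M14", "birth_year": 1990, "gender": "M", "answer": "yes"},
    {"postcode": "M15", "birth_year": 1972, "gender": "F", "answer": "no"},
    {"postcode": "M13", "birth_year": 1985, "gender": "M", "answer": "yes"},
]

# Columns assumed (for illustration) to be known to an outsider.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def unique_fraction(rows, keys):
    """Return the share of rows whose combination of `keys` appears only once."""
    if not rows:
        return 0.0
    # Count how often each combination of values occurs.
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    # A row is singled out when nobody else shares its combination.
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


if __name__ == "__main__":
    share = unique_fraction(records, QUASI_IDENTIFIERS)
    print(f"{share:.0%} of rows are unique on {QUASI_IDENTIFIERS}")
```

In this toy data, three of the five rows are the only ones with their combination of postcode, birth year and gender, so anyone who knows those three facts about a participant could find that participant's answer even though no name is stored.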

On the second day, Gaël Varoquaux shared his experience leading the development of scikit-learn, a community-driven open source Python package for machine learning. He mentioned the importance of code review and tests (you can read more about them in the Institute's guides) before new code is added to scikit-learn. He also warned the audience about how difficult it currently is to keep open source projects sustainable. Even though open source projects are used in many research papers (for example, R has been cited 67,248 times and scikit-learn 5,525 times), it is challenging to fund "infrastructure" projects. If you want to discuss support for digital infrastructure further, look into the Sustain meeting in San Francisco on 19th June 2017.

In addition to Gaël's talk, attendees also learned from Caroline Jay, Graham McNeill and Niall Robinson. Caroline Jay showed the results of her research on the perceptual processes involved in reading images and how machine learning has helped in this investigation. Graham McNeill explained why tile maps can be a great tool to visualise some data (for example, when visualising an election result per electoral district, the tiles help the reader see the districts with a small territory), although preserving the limits of each tile can be quite a challenge. Niall Robinson showcased some of the Met Office's work, which includes a daily analysis of terabytes of weather data.

All in all, the Advances in Data Science 2017 meeting captured many of the exciting and challenging aspects of the research enabled by the terabytes of new information that IoT sensors and other sources provide every day.