By Eilis Hannon, University of Exeter, Martin Callaghan, University of Leeds, James Baldwin, Sheffield Hallam University, Mario Antonioletti, Software Sustainability Institute, David Pérez-Suárez, University College London.
This post is part of the Collaborations Workshops 2017 speed blogging series.
In our daily work we may, at some point, need to access data from third parties that we wish to merge or compare with data that we have generated or obtained ourselves. Often we turn to Google first to find pertinent data sources. Domain experts may be able to refer us to data sources, and knowing the right keywords can unlock what we are trying to find on the web. We can also filter results using advanced Boolean operators. To make sense of the results, we can consider a number of factors, such as which top links and domains are most relevant to the topic. For specific domains, there will be known and trusted data providers, e.g. the Gene Expression Omnibus (GEO) or the Sequence Read Archive (SRA) for genomics. For other data, one might start by consulting the Office for National Statistics, public records such as those published by local authorities and licensing authorities, the Police, the Met Office, the Home Office, the Environment Agency, social networks and datasets published on Figshare, etc. The list is almost endless, although, for a non-expert in a particular domain, a conversation with someone established in that area will often be the quickest and easiest way to start identifying the most appropriate data sources.
We must evaluate the content and be critical of what we read, checking it for reliability and accuracy. This is what we call ‘verification’. It’s important to understand what a website stands for and why it has been written: some websites give a biased or one-sided view that presents opinion as fact. Learning how search engines work (e.g. how they filter data from the web) and about the subject domain can help with verifying the data.
We also need to be careful about the provenance of data:
Do we understand how the data was collected?
Is it compatible with your existing data, so that combining the two does not introduce biases or inaccuracies?
Is it licensed in a way that allows you to use it?
Is it fit for purpose?
Does it have a data dictionary and adequate supporting metadata for you to be able to understand and use the data correctly?
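As a rough illustration of the last question, the check for a data dictionary can be partly automated. The sketch below is only one possible approach: the function name `undocumented_columns`, the dictionary layout (a `description` and a `units` entry per column) and the sample data are all hypothetical and not taken from any particular data provider.

```python
import csv
import io


def undocumented_columns(csv_text, data_dictionary):
    """Return the column names that lack a description or a units entry.

    `data_dictionary` is assumed (hypothetically) to map each column name
    to a dict with a "description" and a "units" key; "units" may be None
    for columns, such as identifiers, that have no unit.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    problems = []
    for name in header:
        entry = data_dictionary.get(name)
        # A column is flagged if it is missing from the dictionary, has an
        # empty description, or never states its units at all.
        if entry is None or not entry.get("description") or "units" not in entry:
            problems.append(name)
    return problems


# Made-up sample data: a temperature column whose units are not documented.
sample = "station_id,temperature,recorded_at\nA1,14.2,2017-03-27T09:00\n"
dictionary = {
    "station_id": {"description": "Identifier of the weather station", "units": None},
    "temperature": {"description": "Air temperature"},  # units missing
}

print(undocumented_columns(sample, dictionary))
```

Here `temperature` is flagged because its units are not stated, and `recorded_at` because it does not appear in the dictionary at all. A check like this only shows that documentation exists, not that it is correct, so it complements rather than replaces reading the metadata.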
In some instances, we are becoming data sources ourselves, and there are lots of services that collect and serve this data. Many of us use wearable fitness devices that collect biometric information about ourselves, such as steps walked, heart rate, location and "tracks", sleep duration, etc., and we mostly give away this data without being aware of the rights we hold over it once it has been collected, or of who is going to be given access to our records.
We may even be tricked into providing access when we believe the data is going to be used for helping a council to make better decisions (e.g. Strava and the commute button). Some services provide additional features, e.g. statistics on how you compare with your peers or the ability to download your data, only if you pay for a subscription.
When you want to analyse available data, whether from one of these devices linked to a company or from networks (research, amateurs, citizen science, ...), besides finding the data or the API you also need to know the rights attached to the data and what the data means. Undocumented data is pretty much useless (temperature without units, a dataset without metadata, ...). Some data archives only offer the data to a selected group of people while it is being used in papers, so it is not open data in the commonly understood sense.
If you want to find data by means other than googling, you can also use the Open Data Network, Wikidata and Wikipedia.