AI models - algorithms that learn statistical patterns from data to perform a given task - need large datasets to infer the right patterns between their input (what is known) and their outputs (the task’s targets). AI development often focuses on designing novel, larger or more complex models to improve task performance. However, the data used for AI development is also of crucial importance, not only because it can impact the quality of the results obtained, but also for transparency and accountability purposes.
Getting the best out of a dataset involves proceeding carefully at all stages of data design, starting from planning. It is important to consider not only how the data will be collected, but also what data will be collected. Here, researchers may want to think about how to balance data minimisation with data coverage - can we achieve the same goal with less data? For example, if working with texts: do they need to include different writing styles (coverage)? Do they contain unnecessary, and therefore removable, personal data (minimisation)? Indeed, minimising the amount of personal data collected is also explicitly a GDPR (General Data Protection Regulation) requirement. That said, data minimisation can be relevant more broadly, given that both collecting and storing data can be expensive in terms of computational resources, funding and time. This is also why the collection process is typically iterative, starting from a pilot dataset, then refining the requirements and expanding, or reducing, the dataset.
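As a minimal sketch of data minimisation at the record level, the snippet below keeps only the fields needed for the task and masks email addresses in free text. The field names, the `REQUIRED_FIELDS` set and the regular expression are illustrative assumptions; a simple pattern like this will miss many forms of personal data and is not a substitute for a proper review.

```python
import re

# Simplified pattern for email addresses (an assumption for illustration;
# it will not catch every valid address or any other kind of personal data).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical set of fields that the task actually needs.
REQUIRED_FIELDS = {"text", "language"}


def minimise(record):
    """Drop fields that are not required and mask email addresses in the text."""
    kept = {key: value for key, value in record.items() if key in REQUIRED_FIELDS}
    if "text" in kept:
        kept["text"] = EMAIL_PATTERN.sub("[EMAIL]", kept["text"])
    return kept


raw_record = {
    "text": "Contact me at jane.doe@example.org for details.",
    "language": "en",
    "author_name": "Jane Doe",
    "ip_address": "203.0.113.7",
}
minimised_record = minimise(raw_record)
print(minimised_record)
```

Deciding which fields are required before collection starts, rather than filtering afterwards, avoids storing the extra data in the first place.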
There are multiple data collection methods that may be employed, including:
- Web scraping, or, generally, automated collection from web sources. This practice is common yet debated, for example because of the risk of ingesting personally identifiable information, and therefore best approached with care.
- Simulations and sensors, where data is generated from computer programmes (for example, weather forecast models) or instruments (for example, microscopy data).
- Data collection from participants, whether targeted (directly recruiting participants) or crowdsourced (soliciting contributions from volunteers).
Different collection methods suit different goals. For example, training a model typically requires a larger dataset, and therefore tends to involve some element of automated collection from web sources, or simulations. Data collection from participants is often used for smaller or benchmark datasets, for example to evaluate existing models. However, this distinction is not absolute: Mozilla Common Voice is an example of an open-source dataset created via crowdsourcing from volunteer contributors, yet able to support model training too.
Irrespective of the collection method, a recommended practice for responsible data collection is to maintain extensive and rigorous metadata documentation. It can be helpful to use documentation standards like Datasheets for Datasets or Data Cards. The metadata itself can be stored in a machine-readable format.
Documentation includes keeping track of the provenance of data sources and, where relevant, the licences and terms under which they were released. Different portions of the data could also come from sources with varied licensing requirements - for example, some might allow commercial use while others do not - which increases the need for precise documentation. Furthermore, even permissive licences or terms of service do not necessarily imply sufficient awareness or consent from the people represented in the data (for datasets involving human subjects). Implementing opt-in or, depending on the circumstances, opt-out mechanisms can help increase the level of transparency and accountability in the data collection process (the BigCode project is an example of opt-out data governance).
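One possible way to keep provenance and licensing information machine readable is sketched below. The schema (`SourceRecord`, `DatasetMetadata` and their fields) is a hypothetical illustration rather than the format of Datasheets for Datasets or Data Cards, and the sources listed are invented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SourceRecord:
    """Provenance entry for one portion of the dataset (hypothetical schema)."""
    name: str
    origin: str             # where the data came from
    licence: str            # licence or terms under which it was released
    commercial_use: bool    # whether those terms allow commercial use
    consent_mechanism: str  # for example "opt-in" or "opt-out"


@dataclass
class DatasetMetadata:
    """Top-level metadata record (hypothetical schema)."""
    title: str
    purpose: str
    out_of_scope_uses: list[str]
    sources: list[SourceRecord] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the whole record, including nested sources, to JSON."""
        return json.dumps(asdict(self), indent=2)

    def commercially_usable_sources(self) -> list[str]:
        """Names of the sources whose terms allow commercial use."""
        return [source.name for source in self.sources if source.commercial_use]


metadata = DatasetMetadata(
    title="Example text corpus",
    purpose="Evaluate summarisation models on short texts.",
    out_of_scope_uses=["Identifying individual authors"],
    sources=[
        SourceRecord("forum_posts", "https://example.org/forum",
                     "CC BY-NC 4.0", False, "opt-out"),
        SourceRecord("volunteer_texts", "crowdsourcing campaign",
                     "CC0 1.0", True, "opt-in"),
    ],
)
print(metadata.to_json())
print(metadata.commercially_usable_sources())
```

Recording the licence per source, rather than once for the whole dataset, makes it possible to select the subset of data that is compatible with a given use.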
Metadata documentation can also describe why the dataset was created, and what it should and should not be used for. Similarly, it can specify a training/test split that dataset users should apply. A training/test split is a non-overlapping partition of the data: the model learns the desired patterns from the training subset and its generalisation capabilities are evaluated on the test subset. The test subset should be as indicative as possible of the model’s performance on new data, otherwise we might obtain an optimistic, rather than realistic, measure of generalisation capability. For example, if a medical dataset contains multiple samples from the same patient, the training and test sets could be split by patient, ensuring that data from a single patient does not appear in both subsets.
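The patient-level split described above can be sketched as follows, using only the Python standard library. The function name `split_by_group`, the `patient_id` field and the toy data are assumptions for illustration. Note that `test_fraction` here refers to the fraction of groups, not of individual samples.

```python
import random
from collections import defaultdict


def split_by_group(samples, group_key, test_fraction=0.2, seed=0):
    """Split samples so that every group ends up in exactly one subset."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_key]].append(sample)

    # Sort before shuffling so that the split is reproducible for a given seed.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test_groups = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test_groups])

    train = [s for g in group_ids if g not in test_ids for s in groups[g]]
    test = [s for g in group_ids if g in test_ids for s in groups[g]]
    return train, test


# Toy data: 10 hypothetical patients with 3 samples each.
samples = [
    {"patient_id": f"p{p}", "measurement": p * 10 + i}
    for p in range(10)
    for i in range(3)
]
train, test = split_by_group(samples, group_key="patient_id")

train_patients = {s["patient_id"] for s in train}
test_patients = {s["patient_id"] for s in test}
print(len(train), len(test))
print(train_patients & test_patients)  # empty set: no patient is in both subsets
```

Storing the seed, or the resulting list of group identifiers, in the metadata lets other users reproduce the same split.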
After creating the dataset, exploring it is necessary to better understand its content and structure. Here, important steps include calculating descriptive statistics, checking for duplicates and outliers, detecting (and removing) potentially harmful or illegal content, and identifying potential gaps in representativeness and coverage. Personally identifiable information included in the dataset should be removed.
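A few of these exploration steps can be sketched with the standard library alone. The helper names, the normalisation used to detect duplicates and the interquartile-range rule for outliers are illustrative choices, and the toy data is invented; real datasets will usually call for checks tailored to their content.

```python
import statistics
from collections import Counter


def describe(values):
    """Basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def find_duplicates(texts):
    """Texts occurring more than once after lowercasing and collapsing whitespace."""
    counts = Counter(" ".join(text.lower().split()) for text in texts)
    return {text: n for text, n in counts.items() if n > 1}


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


texts = ["The cat sat.", "the  cat sat.", "A dog barked.", "Rain is expected."]
word_counts = [12, 15, 14, 13, 16, 15, 14, 250]  # hypothetical document lengths

print(describe(word_counts))
print(find_duplicates(texts))
print(iqr_outliers(word_counts))
```

Flagged items are candidates for manual inspection rather than automatic removal: an outlier may be an error, or a rare but legitimate case that matters for coverage.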
Exploring the distribution of the target labels or values is also crucial to check for gaps, imbalances or errors. This is because labels, as the desired output of the AI algorithm, are typically the main signal from which the model learns how to perform a task. Datasets may not always include labels when collected. Various data points might therefore need to be annotated with such labels, for example by domain experts or volunteers - a process that is complex but crucial, and therefore worth resourcing appropriately. When annotations are performed by multiple labellers, inter-annotator agreement measures can indicate how reliably each sample can be labelled. Disagreements between annotators may reflect not just labelling errors, but also intrinsic uncertainty in what the right label should be. Indeed, the categories that we use to classify data points rely on human discretion, both in determining which category should be assigned to a data point and in deciding which categories should be used in the first place. This can lead to bias entering the dataset and then being propagated further by an algorithm trained to replicate the same biased labels.
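As an illustration of one inter-annotator agreement measure, the sketch below computes Cohen's kappa, which compares the observed agreement between two annotators with the agreement expected by chance given each annotator's label frequencies. The labels are toy data. Cohen's kappa covers two annotators only; measures such as Fleiss' kappa or Krippendorff's alpha exist for larger groups.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty set of samples.")
    n = len(labels_a)

    # Fraction of samples on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

print(Counter(annotator_a))  # label distribution for one annotator
print(cohen_kappa(annotator_a, annotator_b))
```

In this toy example the annotators agree on 6 of 8 samples (0.75) and chance agreement is 0.5, giving a kappa of 0.5. Samples on which annotators disagree are worth inspecting individually, since they may point to ambiguous categories rather than mistakes.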
The data exploration work can also continue after a (partial) model is trained, for example by measuring which data points a model is more or less familiar with. This can help identify learning gaps, which may be due to noisy or poorly represented subsets of the data. The Gender Shades project, for instance, was a seminal work demonstrating how AI models can underperform for specific groups, likely due to under-representation of those groups in the training data. Characteristics of the data, including its representativeness, can also change over time (“data drift”). Continuously monitoring the statistics of your datasets can help detect when such a drift has occurred, potentially indicating the need to update the dataset and retrain the AI model.
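Two simple checks in this spirit are sketched below: accuracy computed separately for each group, and the total variation distance between the category distribution of a reference dataset and that of newly collected data. The record fields, the toy data and the `DRIFT_THRESHOLD` value are assumptions for illustration; this is not the method used by the Gender Shades project, and suitable drift measures and thresholds depend on the application.

```python
from collections import Counter


def accuracy_by_group(records):
    """Fraction of correct predictions within each group."""
    correct, total = Counter(), Counter()
    for record in records:
        total[record["group"]] += 1
        correct[record["group"]] += record["label"] == record["prediction"]
    return {group: correct[group] / total[group] for group in total}


def category_proportions(values):
    """Share of each category in a non-empty list of values."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


def total_variation_distance(reference, current):
    """Half the sum of absolute differences between two category distributions."""
    p, q = category_proportions(reference), category_proportions(current)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy predictions for two hypothetical groups.
records = (
    [{"group": "A", "label": 1, "prediction": 1}] * 4
    + [{"group": "B", "label": 1, "prediction": 1}]
    + [{"group": "B", "label": 1, "prediction": 0}]
)
print(accuracy_by_group(records))

# Toy label distributions at collection time and at a later date.
reference_labels = ["cat"] * 50 + ["dog"] * 50
current_labels = ["cat"] * 80 + ["dog"] * 20

DRIFT_THRESHOLD = 0.1  # hypothetical value; choose one suited to the application
distance = total_variation_distance(reference_labels, current_labels)
print(distance, distance > DRIFT_THRESHOLD)
```

A large gap in per-group accuracy, or a distance above the chosen threshold, is a prompt to investigate the data further before deciding whether to update the dataset and retrain.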
To conclude, when creating datasets for AI development it is important to:
- Plan the data collection process, considering data relevance, coverage, consent and minimisation.
- Create and maintain accurate metadata documentation.
- Explore and monitor the dataset characteristics, updating (and retraining) when necessary.
Further resources
- HuggingFace’s Data Measurements Tool.
- ODI’s framework for AI-ready data.
- UK Data Services’ UK Web Scraping Compliance Guide 2026.
- Towards Best Practices for Open Datasets for LLM Training.
- Towards accountability for machine learning datasets: Practices from software engineering and infrastructure.
- Advances, challenges and opportunities in creating data for trustworthy AI.
Acknowledgements
This guide was written by Stef Garasto and reviewed by Fil Garciano.