We are pleased to announce the Creating and Evaluating Datasets in AI guide, written by Stef Garasto, which offers a practical introduction to responsible data collection, evaluation and documentation for artificial intelligence (AI) algorithms.
AI models rely on data to learn patterns and perform tasks. While AI development often focuses on building larger or more complex models, the quality, suitability and transparency of the datasets used are just as important. Poorly documented or unrepresentative datasets can affect model performance, introduce bias and make it harder to understand whether an AI system is reliable or appropriate for its intended use.
The guide introduces key considerations for researchers creating or evaluating datasets for AI, including planning data collection, balancing data minimisation with data coverage, documenting data sources and licences, and understanding how different collection methods can shape the resulting dataset.
It also highlights the importance of exploring and monitoring datasets once they have been created. This includes checking for duplicates, outliers, harmful content, missing information and imbalances, as well as assessing annotation quality and potential gaps in representativeness. These steps can help researchers better understand the limitations of their data and the trade-offs involved in using it to train or evaluate AI models.
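To make a few of these checks concrete, here is a minimal sketch in Python. It is an illustration, not code taken from the guide. The function name `audit_dataset` and the `text` and `label` field names are hypothetical, and it assumes a small dataset held in memory as a list of dictionaries. It covers only the checks that are easy to automate (missing information, exact duplicates, label imbalance and length outliers). Harmful content, annotation quality and representativeness usually need human review or dedicated tooling.

```python
from collections import Counter
from statistics import quantiles


def audit_dataset(records, text_key="text", label_key="label"):
    """Run a few basic checks on a list of dict records (hypothetical schema)."""
    # Missing information: records whose text or label is absent or empty.
    missing = [i for i, r in enumerate(records)
               if not r.get(text_key) or not r.get(label_key)]

    # Duplicates: identical text after trimming whitespace and lowercasing.
    # The first occurrence is kept; later ones are reported.
    seen = set()
    duplicates = []
    for i, r in enumerate(records):
        text = (r.get(text_key) or "").strip().lower()
        if not text:
            continue
        if text in seen:
            duplicates.append(i)
        else:
            seen.add(text)

    # Imbalance: how often each label occurs.
    label_counts = Counter(r[label_key] for r in records if r.get(label_key))

    # Outliers: text lengths more than 1.5 * IQR outside the quartiles.
    # This is a rough rule of thumb and is unreliable on very small datasets.
    lengths = [len(r[text_key]) for r in records if r.get(text_key)]
    outliers = []
    if len(lengths) >= 4:
        q1, _, q3 = quantiles(lengths, n=4)
        spread = 1.5 * (q3 - q1)
        low, high = q1 - spread, q3 + spread
        outliers = [i for i, r in enumerate(records)
                    if r.get(text_key) and not low <= len(r[text_key]) <= high]

    return {
        "missing": missing,
        "duplicates": duplicates,
        "label_counts": label_counts,
        "outliers": outliers,
    }


if __name__ == "__main__":
    sample = [
        {"text": "good film", "label": "pos"},
        {"text": "Good film ", "label": "pos"},
        {"text": "bad film", "label": "neg"},
        {"text": "", "label": "pos"},
    ]
    print(audit_dataset(sample))
```

A report like this is a starting point for inspection, not a verdict: a flagged record may be perfectly valid, and deciding what to remove or rebalance remains one of the trade-offs the guide discusses.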
The guide is aimed at researchers at the beginning of their AI journey, as well as those who want to strengthen their understanding of responsible data practices. It also includes links to further resources for more experienced practitioners. After reading the guide, readers should be able to describe relevant data collection, evaluation and documentation practices, and understand potential biases, limitations and trade-offs involved in data collection for AI models.