The Practice of Reproducible Research

Posted by s.aragon on 24 March 2017 - 9:00am

By Justin Kitzes, University of California, Berkeley

We are very happy to announce the launch of our open, online book The Practice of Reproducible Research, to be published in print by the University of California Press later this year. In short, this book is designed to demonstrate and teach how research in the data-intensive sciences can be made more reproducible. The book centres on a collection of 31 contributed case studies, in which experienced researchers provide examples of how they combined specific tools, ideas, and practices in order to improve the reproducibility of a real-world research project. These case studies are accompanied by a set of synthesis chapters that introduce and summarise best practices for data-intensive reproducible research.

Within the overall context of reproducibility, our book focuses specifically on the goal of achieving computational reproducibility in individual research projects. We define a research project as computationally reproducible if a second investigator can recreate the final reported results of the project, including key quantitative findings, tables, and figures, given only a set of files and written instructions. This focus reflects our belief that computational reproducibility is the first and most fundamental goal for individual investigators interested in the broad goals of reproducible research.
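As a rough illustration of this definition, a second investigator could re-run a project from its files and instructions and then compare the regenerated outputs against the originally reported ones. The sketch below is only one possible check and is not taken from the book: the function names and the choice of SHA-256 checksums are assumptions made for illustration, and byte-for-byte equality is stricter than the definition requires (a figure file may, for example, embed a timestamp even when the plotted result is unchanged).

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(results_dir: Path) -> dict[str, str]:
    """Map each file under results_dir (by relative path) to its digest."""
    return {
        p.relative_to(results_dir).as_posix(): sha256_of(p)
        for p in sorted(results_dir.rglob("*"))
        if p.is_file()
    }


def differing_files(original: dict[str, str], rerun: dict[str, str]) -> list[str]:
    """List files that are missing from one run or whose contents differ."""
    names = set(original) | set(rerun)
    return sorted(n for n in names if original.get(n) != rerun.get(n))
```

An empty list from `differing_files(build_manifest(original_dir), build_manifest(rerun_dir))` would suggest that the second run recreated every output file exactly; a non-empty list points to the tables or figures worth inspecting by hand.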

Each of the 31 case study chapters in our book presents the specific approach that an author used to attempt to achieve reproducibility in a real-world research project, including a discussion of the overall project workflow, key tools and techniques, and major challenges. We divided these case studies into two groups: Part II of the book contains case studies that take a high-level view of an entire research workflow, including data input, processing, and analysis, while Part III of the book contains low-level case studies that focus in more detail on individual aspects of reproducible research.

These case studies were written by a diverse group of scholars, ranging from graduate students to full professors, who work in disciplines spanning the natural sciences and engineering. All of these authors, however, work in what we describe as the data-intensive sciences, fields in which researchers are routinely expected to collect, manipulate, and analyse large, heterogeneous, uncertain data sets. Many of the authors are affiliates of one of three Moore-Sloan Data Science Environments, housed at the University of California, Berkeley, the University of Washington, and New York University, and have deep experience both in domain-specific research and in software development, scientific computing, and data science.

To accompany these diverse case studies, we have also written a set of overview chapters, found in Part I of the book, that introduce several important concepts and practices in computational reproducibility and report on lessons learned from the case studies themselves. This section of the book contains chapters that

outline the factors that determine the extent to which a research project is computationally reproducible (Chapter 2),
provide a step-by-step illustration of a core, cross-disciplinary reproducible workflow, suitable as a standalone first lesson for beginners (Chapter 3),
describe the format of the contributed case study chapters and summarise some of their key features (Chapter 4),
summarise common themes across the case studies, focusing on identifying the tools and practices that brought the authors the most reproducibility benefit per unit effort and the universal challenges in achieving reproducibility (Chapter 5),
discuss reproducible research in modern science, highlighting the gaps, challenges, and opportunities going forward (Chapter 6), and
define key concepts, techniques, and tools used in reproducible research and mentioned throughout the case studies (Chapter 7).

Part I of the book can be read as a standalone introduction to reproducible research practices in the data-intensive sciences. After reviewing Part I, readers can then progress to the individual case study chapters themselves, contained in Parts II and III.

We hope that you enjoy reading the book, and please feel free to get in touch with the editors and authors to share your comments and ideas.

Finally, we're still collecting additional case studies of reproducible research workflows. If you're interested in contributing a case study to our growing online collection, you can head over to our GitHub repository for instructions on writing and submitting a case study.