Dr Rad Suchecki¹, Dr Alex Whan²
¹CSIRO, Urrbrae, Australia; ²CSIRO, Black Mountain, Australia
Reproducibility is widely regarded as a key part of computational scientific processes. However, it remains challenging and is rarely implemented completely or routinely. The nature of computational analyses, combined with techniques such as version control and literate programming, should, in principle, facilitate full reproducibility. In practice, however, reproducibility often relies on the accuracy and completeness of plain-language descriptions of the steps undertaken and the software used in a given study. Additional challenges arise for heterogeneous analytical workflows, which rely on a multitude of dependencies, scripting languages and distributed computing. A variety of systems and tools support reproducibility, but orchestrating them can be challenging and can distract from completing the analytical task at hand.
In this talk, we present repset – an approach we have developed for benchmarking computational tools (in this case, sequence aligners). In repset, we refined a suite of techniques which facilitate not only the reproducibility of analyses but also their scalability, portability and extensibility. Repset relies on Nextflow and its support for local, HPC and cloud execution environments, as well as its integration with container registries such as Docker Hub. We also harness container build automation to easily update the multiple software environments used by repset. Generated results are linked with the input definitions, the specific revision of the source code, and the definitions of the versioned software containers used for the analysis. Along with other run metadata, these can be automatically deposited on GitHub and Zenodo.
Rad Suchecki obtained his BSc and PhD from the School of Computing Sciences, University of East Anglia, Norwich, UK. During his postdoc at The University of Adelaide, he developed high-performance computational pipelines and web applications for the integration and visualisation of biological data. He continues this work in CSIRO’s Aginformatics group, where he applies and develops frameworks and software to drive reproducibility in crop informatics and data science.