Dr Rad Suchecki1, Dr Nathan Watson-Haigh2, Dr Stuart Stephen3, Dr Alex Whan4
1CSIRO Agriculture and Food, Urrbrae, Australia
2School of Agriculture, Food and Wine, The University of Adelaide, Urrbrae, Australia
3CSIRO Agriculture and Food, St. Lucia, Australia
4CSIRO Agriculture and Food, Black Mountain, Australia
Reproducibility of experiments is an essential part of the scientific method. Many scientific studies are difficult or impossible to replicate, so the correctness of their results cannot always be verified. In data science, the widespread use of graphical, often proprietary, interfaces is generally detrimental to the reproducibility of data processing and analyses. Replication attempts are hampered by descriptions of the steps undertaken that are usually limited in accuracy and completeness. In contrast, the rise of free and open source software, coupled with advances in computational process management, version control, containerisation and automation, brings reproducibility within reach.
BioKanga is a suite of bioinformatics tools developed at CSIRO. To evaluate BioKanga’s sequence alignment module against other state-of-the-art tools, we developed workflows using two popular automation frameworks, Snakemake and Nextflow. Individual tasks are executed in a cloud or HPC environment using a dedicated, lightweight container (Docker or Singularity) for each tool being evaluated. Containers are also used for other tasks, including extensive quality control of the input, the interim data and the final results. The use of containers provides a reproducible software environment, which contributes to the replicability of the results. Given a pre-defined experimental set-up, raw data is acquired from a public source, processed and analysed. The final document (research paper or project report), including dynamically generated figures and tables, is compiled from R Markdown or LaTeX/knitr.
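The idea of running each task in its own container, in dependency order, can be illustrated with a minimal Python sketch. This is not the Snakemake or Nextflow workflow described above: the task names, the image name and the /data mount point are hypothetical, chosen only for illustration, and the commands are built but never executed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks it depends on.
TASK_DEPENDENCIES = {
    "download": set(),
    "qc": {"download"},
    "align": {"qc"},
    "report": {"align"},
}


def container_command(engine, image, tool_args, workdir):
    """Build (without running) the command that executes one task in a container.

    The image name and the /data mount point are illustrative assumptions.
    """
    if engine == "docker":
        # --rm removes the container on exit, -v mounts the working
        # directory, -w sets the working directory inside the container.
        return ["docker", "run", "--rm",
                "-v", f"{workdir}:/data", "-w", "/data",
                image, *tool_args]
    if engine == "singularity":
        # Singularity can execute Docker images via the docker:// URI scheme.
        return ["singularity", "exec",
                "--bind", f"{workdir}:/data", "--pwd", "/data",
                f"docker://{image}", *tool_args]
    raise ValueError(f"unsupported container engine: {engine}")


def execution_order(dependencies):
    """Return the tasks in an order that respects their dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for task in execution_order(TASK_DEPENDENCIES):
        # Hypothetical image and tool invocation, one container per task.
        cmd = container_command("docker", f"example/{task}:1.0",
                                [task, "--help"], "/tmp/work")
        print(" ".join(cmd))
```

Workflow frameworks such as Snakemake and Nextflow handle this ordering and container invocation themselves, along with scheduling and caching of completed tasks; the sketch only makes the underlying pattern explicit.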
This work is a proof of concept and provides an open source template for automated generation of a fully
Rad Suchecki obtained his BSc and PhD from the School of Computing Sciences, University of East Anglia, Norwich, UK. In his PhD work, he focused on developing algorithms and distance measures for phylogenetic networks and multi-labelled trees. For his postdoc, Rad joined the Australian Centre for Plant Functional Genomics at The University of Adelaide, where he explored the genomics and transcriptomics of bread wheat. For this work he developed high-performance computational pipelines along with web applications for the integration and visualisation of biological data. Rad recently joined CSIRO Agriculture and Food, where he applies and develops frameworks and software to drive reproducibility in crop informatics and data science.