A reproducible and reusable publication and analysis workflow

Dr Rad Suchecki1, Dr Nathan Watson-Haigh2, Dr Stuart Stephen3, Dr Alex Whan4

1CSIRO Agriculture and Food, Urrbrae, Australia

2School of Agriculture, Food and Wine, The University of Adelaide, Urrbrae, Australia

3CSIRO Agriculture and Food, St. Lucia, Australia

4CSIRO Agriculture and Food, Black Mountain, Australia

 

Introduction

Reproducibility of experiments is an essential part of the scientific method, yet many scientific studies are difficult or impossible to replicate, so the correctness of their results cannot always be verified. In data science, the widespread use of graphical interfaces, many of them proprietary, is generally detrimental to the reproducibility of data processing and analyses. Replication attempts are hampered when the steps undertaken are described inaccurately or incompletely. In contrast, the rise of free and open source software, coupled with advances in computational process management, version control, containerisation and automation, brings reproducibility within reach.

Methods

BioKanga is a suite of bioinformatics tools developed at CSIRO. To evaluate BioKanga’s sequence alignment module against other state-of-the-art tools, we developed workflows using two popular automation frameworks, Snakemake and Nextflow. Individual tasks are executed in a cloud or HPC environment using a dedicated, lightweight container (Docker or Singularity) for each tool being evaluated. Containers are also used for other tasks, including extensive quality control of the input, interim data and final results. The use of containers provides a reproducible software environment, which contributes to the replicability of the results. Given a pre-defined experimental set-up, raw data are acquired from a public source, processed and analysed. The final document (research paper or project report), including dynamically generated figures and tables, is compiled from R Markdown or LaTeX/knitr.
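The idea of one dedicated container per task can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow described above: the image names, task names and tool commands are hypothetical placeholders, and in practice Snakemake or Nextflow would handle the container invocation. The sketch only shows how each step is bound to its own image and how a `docker run` command line could be assembled for it.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """One workflow step bound to its own container image."""
    name: str
    image: str           # hypothetical image tag, one per tool
    command: List[str]   # placeholder command, not a real tool invocation


def docker_command(task: Task, workdir: str) -> List[str]:
    """Build a `docker run` invocation that mounts workdir at /data."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",   # share input and output files with the host
        "-w", "/data",              # run the tool inside the mounted directory
        task.image,
        *task.command,
    ]


# Hypothetical tasks: image names and commands are illustrative only.
TASKS = [
    Task("qc_input", "example/qc:1.0", ["run-qc", "reads.fastq"]),
    Task("align_tool_a", "example/aligner-a:1.0", ["align-a", "ref.fa", "reads.fastq"]),
    Task("align_tool_b", "example/aligner-b:1.0", ["align-b", "ref.fa", "reads.fastq"]),
]

if __name__ == "__main__":
    # Print the command lines instead of executing them.
    for task in TASKS:
        print(shlex.join(docker_command(task, "/tmp/experiment")))
```

Pinning an explicit image tag for every task is what keeps the software environment fixed between runs; swapping Docker for Singularity would change only the command prefix, not the task definitions.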

Conclusion

This work is a proof of concept and provides an open source template for automated generation of a fully


Biography:

Rad Suchecki obtained his BSc and PhD from the School of Computing Sciences, University of East Anglia, Norwich, UK. In his PhD work, he focused on developing algorithms and distance measures for phylogenetic networks and multi-labelled trees. For his postdoc, Rad joined the Australian Centre for Plant Functional Genomics at The University of Adelaide, where he explored the genomics and transcriptomics of bread wheat. For this work, he developed high-performance computational pipelines along with web applications for the integration and visualisation of biological data. Rad recently joined CSIRO Agriculture and Food, where he applies and develops frameworks and software to drive reproducibility in crop informatics and data science.
