From a Data Rivulet to a River: Lessons learnt from upgrading the Deterministic Seven-Day Streamflow Forecast System to provide Probabilistic Flow Ensembles at the Bureau of Meteorology

Patrick Sunter1, Daehyok Shin1, Prasantha Hapuarachchi1, Maree Carroll1, Sophie Zhang1

1Australian Bureau of Meteorology, Melbourne, VIC, Australia

 

This presentation will discuss the challenges faced in a multi-year project to upgrade the Australian Bureau of Meteorology’s (BoM) Seven-Day Streamflow Forecasting service to provide ensemble probabilistic forecasts, and how we addressed them.

The project involved integrating many new statistical approaches, algorithms and data sources – several of which originated in collaborative research with the CSIRO and Australian universities – into a production-ready system able to publish results daily for several hundred locations on the Bureau’s website.

We will discuss how this work challenged our existing systems and how we addressed those challenges, including:
• data management and provenance: requiring new approaches to handle version control of much larger data artefacts and model representations, including moving to Git Large File Storage (LFS) for managing hydrological model configuration and verification data;
• performance and scalability: including updating our Python software, which previously worked effectively on deterministic Numerical Weather Prediction (NWP) grids, to handle higher-resolution ensemble forecasts (see the sketch after this list);
• system integration: the challenge of integrating new R&D software into production architectures, including dealing with legacy systems;
• redesigning outputs for better scientific communication: including updating graphical plots to convey the extra information in probabilistic forecasts without overwhelming general audiences with too much detail.
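To illustrate the kind of ensemble post-processing involved, here is a minimal Python/NumPy sketch (not the Bureau's production code) that turns a synthetic ensemble of seven-day streamflow traces into percentile bands and threshold-exceedance probabilities; the ensemble size, distribution and threshold are purely illustrative assumptions.

import numpy as np

# Hypothetical ensemble of daily streamflow forecasts:
# shape = (n_members, n_lead_days), units ML/day (synthetic data).
rng = np.random.default_rng(42)
ensemble = rng.lognormal(mean=5.0, sigma=0.8, size=(100, 7))

# Percentile bands of the kind shown on probabilistic forecast plots.
p10, p50, p90 = np.percentile(ensemble, [10, 50, 90], axis=0)

# Probability of exceeding an illustrative threshold (300 ML/day) at each
# lead day, estimated as the fraction of ensemble members above it.
threshold = 300.0
prob_exceed = (ensemble > threshold).mean(axis=0)

for day in range(ensemble.shape[1]):
    print(f"Day {day + 1}: median={p50[day]:.0f} ML/day, "
          f"10-90% band=[{p10[day]:.0f}, {p90[day]:.0f}], "
          f"P(>{threshold:.0f} ML/day)={prob_exceed[day]:.2f}")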
Finally, we will attempt to draw out the most relevant lessons learnt from this project for eResearch practitioners and scientific software engineers more broadly.


Biography:

Patrick Sunter has worked in the field of software engineering of scientific computing applications for more than a decade, participating in multiple collaborative projects in research and industry. Building on a base of software engineering post-graduate training, he has worked across the domains of geophysics, materials science, and spatial information to develop software to support modelling and analysis of complex problems.

Patrick joined the Australian Bureau of Meteorology’s Water Forecasting Services section in 2016, and since then has worked on upgrades to the software and information systems that underpin the Bureau’s seasonal and short-term streamflow forecasting services.

Scaling Agile for SKA: Adoption of SAFe as the large-scale agile methodology for the construction of the SKA software systems

Juan Carlos Guzman1

1CSIRO, Bentley, WA, Australia

 

The Square Kilometre Array (SKA) project has completed most of the design work and started to prepare for construction, due to commence at the end of 2020. A large fraction of the construction effort will be dedicated to software development, estimated at more than 600 FTE over the 6 years of construction and distributed across multiple teams around the globe. To tackle this large-scale, distributed development effort, the SKA Office decided to adopt the Scaled Agile Framework (SAFe). SAFe is a proven, publicly-facing framework for applying Lean and Agile practices at enterprise scale, and one of the most popular large-scale agile methodologies on the market.

To gain experience with this new methodology, the SKA has started to use it in the context of the “bridging” activities, that is, the period between the end of design and the start of construction. We are currently in the second increment, with 12 active agile teams continuing prototyping work and addressing key areas for the upcoming System Critical Design Review (CDR) scheduled for the end of this year.

This talk will introduce SAFe, explain why it was chosen for the SKA, and share some early results from applying this large-scale agile software methodology to a big research project.


Biography:

Juan Carlos (JC) Guzman is the Head of the Software and Computing Group at CSIRO Astronomy and Space Science. He joined CSIRO in 2007 and has worked mainly on the ASKAP project, fulfilling many roles including developer, architect, and team and group leadership. He has also been contributing to the SKA project since 2012. Before joining CSIRO he worked at the European Southern Observatory (ESO) in Chile, developing monitoring and control system software for several optical telescopes located in the Chilean desert.

How to make research software a sustainable activity? Lessons from a year of planning out the US Research Software Sustainability Institute

Professor Karthik Ram1

1University of California, Berkeley, California, United States

 

Many scientific advances have been possible thanks to the use of software. This software, also known as “research software”, has become essential to progress in science and engineering. The scientists who develop the software are experts in their discipline, but often do not have sufficient understanding of the practices that make software development easier and the software more robust, reliable, maintainable and sustainable. This is an unfortunate state of affairs: 90-95% of surveyed researchers in the UK and the US report relying on research software for their work, and 63-70% of these researchers believe that their work would not be possible if such software were to become unavailable.
Through a grant funded by the US National Science Foundation (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1743188), we have been engaged in a series of activities to understand the specific challenges that make research software unsustainable and why researchers who develop software face uncertain career paths. In this talk I’d like to discuss some solutions based on surveys, ethnographic studies, and workshops that we carried out over an 18-month period in 2018-2019.


Biography:

Karthik Ram is a research scientist at the Berkeley Institute for Data Science and the University of California Museum of Paleontology at the University of California, Berkeley. Karthik is also a co-founder of the rOpenSci project, lead of the US Research Software Sustainability Institute, and a founding editor of the Journal of Open Source Software, and has served on the boards of various organizations in this space, including Data Carpentry, Many Labs and Libraries.io, among others.

Scheduling, deploying and monitoring 100 million tasks

Professor Andreas Wicenec1

1University of Western Australia, Crawley/Perth, Australia

 

The SKA will enable the production of full-polarisation spectral line cubes at very high spatial and spectral resolution. A back-of-the-envelope estimate gives the incredible figure of around 75-100 million tasks to run in parallel to perform a state-of-the-art faceting algorithm (assuming it would spawn just one task per facet, which is not the case). This simple estimate formed the basis of the development of a prototype, which had scalability as THE primary requirement. In this talk I will present the current status of the DALiuGE system, including some exciting computer science research.


Biography:

Andreas Wicenec has been a Professor at the University of Western Australia since 2010, leading the Data Intensive Astronomy Program of the International Centre for Radio Astronomy Research, where he designs and implements data flows and high-performance scientific computing for large-scale astronomical facilities and surveys. During his career he has had the privilege of being involved in the software development, data management, data reduction and operation of several large-scale astronomical facilities, including the ESA cornerstone HIPPARCOS satellite, the Very Large Telescope (VLT) and the Atacama Large Millimetre and Submillimetre Array (ALMA) in Chile, the Murchison Widefield Array (MWA) and the Square Kilometre Array (SKA). His scientific interests in astronomy include precision global astrometry, optical background radiation, stellar photometry, the dynamics and evolution of planetary nebulae, and observational survey astronomy. In computer science he researches workflow construction and execution, as well as scheduling and related computational concepts.

QuickThermo: A software package for performing ab initio thermodynamic calculations

Dr Benyamin Motevalli1, Dr Amanda Barnard1

1CSIRO Data61, Docklands, Australia

 

The energies obtained using first-principles methods can be used to study the thermodynamic stability of complex systems, particularly where experimental measurements are limited by practical difficulties. However, the calculated energies only account for the ground-state (temperature T ≈ 0 K, pressure P = 0 Pa) electronic energies, E. One practical way to extend the ground-state energies to finite temperatures and pressures is the first-principles (ab initio) thermodynamics method, which combines results calculated from first principles at the ground state with the extensive thermochemical data measured at standard state. This method can also serve as an effective technique to expand datasets in a physically sensible way, by calculating the probabilities of structures as a function of temperature, pressure, and other environmental conditions such as humidity.
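As a hedged illustration of this approach (QuickThermo itself is implemented in C#, as described below), the following Python sketch combines hypothetical ground-state DFT energies with an ideal-gas chemical potential to compute Boltzmann probabilities of competing structures at a given temperature and pressure; the energies, stoichiometries and reference chemical potential are illustrative assumptions, not QuickThermo's internals.

import math

K_B = 8.617333262e-5  # Boltzmann constant, eV/K


def gas_chemical_potential(mu0_eV, T, p, p0=1.0e5):
    """Ideal-gas chemical potential mu(T, p) = mu0(T) + kT ln(p/p0)."""
    return mu0_eV + K_B * T * math.log(p / p0)


def boltzmann_probabilities(E_dft, n_gas, mu0_eV, T, p):
    """Relative probabilities of structures at temperature T (K) and pressure p (Pa).

    G_i(T, p) ~ E_i - n_i * mu_gas(T, p), where n_i is the (hypothetical)
    number of gas molecules each structure exchanges with the reservoir.
    """
    mu = gas_chemical_potential(mu0_eV, T, p)
    G = [E - n * mu for E, n in zip(E_dft, n_gas)]
    G_min = min(G)
    weights = [math.exp(-(g - G_min) / (K_B * T)) for g in G]
    Z = sum(weights)
    return [w / Z for w in weights]


# Hypothetical example: three competing structures / terminations.
E_dft = [-101.20, -100.95, -100.40]   # ground-state DFT energies (eV)
n_gas = [0, 1, 2]                     # adsorbed gas molecules per cell
probs = boltzmann_probabilities(E_dft, n_gas, mu0_eV=-0.30, T=600.0, p=2.0e4)
print([round(p, 3) for p in probs])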

QuickThermo is a software package that enables such calculations. It has a user-friendly interface developed in C# using WPF, backed by a database built on SQLite. The database includes a number of predefined elements with corresponding measured thermochemical data, and is flexible enough to be extended by users. The interface provides convenient tools to define elements and structures and to calculate thermodynamic probabilities under various environmental conditions such as temperature, pressure, and humidity. Further, the software provides batch-run capabilities, where users can load any number of structures and perform the calculations across a range of environmental conditions. A range of interactive plots is also embedded in the software to display the results.


Biography:

Dr. Benyamin Motevalli is a Postdoctoral Fellow in Data61 at CSIRO. He has years of experience in developing and employing computational and numerical analysis techniques to establish a fundamental understanding of novel intelligent nanomaterials. His current research focuses on the rational design of materials through innovative data-driven models that offer the advantage of fusing complex experimental and computational data for a higher-level understanding of structure-processing-property relationships.

DevOps in eResearch

Mr Sven Dowideit1

1CSIRO, Brisbane, Australia

There are many useful technologies and practices, emerging or established, that make delivering software, data and information to researchers and users easier and faster.

This workshop continues to provide a platform to highlight and discuss these topics, and to bring eResearch people together.

We’ll organize a set of short talks to spark discussion and learning.

The topics of interest are broadly:

  •    security, as a service?
  •    teaching, learning, and moving forward together
  •    continuous integration and delivery
  •    monitoring, alerting, and logging
  •    moving to the cloud
  •    containerisation, orchestration, Kubernetes, CaaS systems
  •    keeping your research service running for the next 10 years

If you’ve recently done a review of new or old systems in a space and found pitfalls, limitations, or positive surprises, we’d love to hear about it.


Biography:

Sven Dowideit has been working in the application container startup space since 2013, having led both the Boot2Docker project and the RancherOS container Linux project.

A reproducible and reusable publication and analysis workflow

Dr Rad Suchecki1, Dr Nathan Watson-Haigh2, Dr Stuart Stephen3, Dr Alex Whan4

1CSIRO Agriculture and Food, Urrbrae, Australia

2School of Agriculture, Food and Wine, The University of Adelaide, Urrbrae, Australia

3CSIRO Agriculture and Food, St. Lucia, Australia

4CSIRO Agriculture and Food, Black Mountain, Australia

 

Introduction

Reproducibility of experiments is an essential part of the scientific method. Many scientific studies are difficult or impossible to replicate, so the correctness of their results cannot always be verified. In data science, the widespread use of often proprietary graphical interfaces is generally detrimental to the reproducibility of data processing and analyses. Replication attempts are further hampered by descriptions of the steps undertaken that are often limited in accuracy and completeness. In contrast, the rise of free and open source software, coupled with advances in computational process management, version control, containerisation and automation, brings reproducibility within reach.

Methods

BioKanga is a suite of bioinformatics tools developed at CSIRO. To evaluate BioKanga’s sequence alignment module against other state-of-the-art tools, we developed workflows using two popular automation frameworks, Snakemake and Nextflow. Individual tasks are executed in cloud or HPC environments using dedicated, lightweight (Docker/Singularity) containers for each tool being evaluated. Containers are also used for other tasks, including extensive quality control of the input as well as the interim data and final results. The use of containers provides a reproducible software environment, which contributes to the replicability of the results. Given a pre-defined experimental set-up, raw data is acquired from a public source, processed and analysed. The final document (research paper or project report), including dynamically generated figures and tables, is compiled from R Markdown or LaTeX/knitr.
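For example, a single containerised workflow step in Snakemake's Python-based rule syntax might look like the sketch below; the rule, file paths and container tag are hypothetical and are shown only to illustrate how pinning each tool to a container fixes the software environment (this is not the authors' actual workflow).

rule quality_control:
    # Run FastQC inside a pinned container so the exact tool version is
    # reproducible on any cloud or HPC host (illustrative image tag).
    container: "docker://quay.io/biocontainers/fastqc:0.11.9--0"
    input:
        "data/sample_R1.fastq.gz"
    output:
        "qc/sample_R1_fastqc.html",
        "qc/sample_R1_fastqc.zip"
    shell:
        "fastqc --outdir qc {input}"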

Conclusion

This work is a proof of concept and provides an open source template for the automated generation of a fully reproducible document.


Biography:

Rad Suchecki obtained his BSc and PhD from the School of Computing Sciences, University of East Anglia, Norwich, UK. In his PhD work, he focused on developing algorithms and distance measures for phylogenetic networks and multi-labelled trees. For his postdoc, Rad joined the Australian Centre for Plant Functional Genomics at The University of Adelaide, where he explored the genomics and transcriptomics of bread wheat, developing high-performance computational pipelines along with web applications for the integration and visualisation of biological data. Rad recently joined CSIRO Agriculture and Food, where he applies and develops frameworks and software to drive reproducibility in crop informatics and data science.

Building applications for scientific communities

Mr Anders Savill1

1Pawsey Supercomputing Centre, Kensington, Australia

 

In the digital age, scientists are facing a number of new challenges. For example, the emergence of high-throughput sequencing has pushed the study of some fields of biology into the computer science realm; however, the skills and software required to deploy and manage these new, complex workflows take much longer to develop. In this presentation we explore the relationship between the investigator, the enabling technology and their supportive communities, and outline how researchers can best be supported to use technology to enable their research. Our focus is on leveraging containerisation technology, which allows complex workflows to be deployed on a diverse range of platforms, including cloud services and HPC. Building on the existing Docker API gives us the ability to transparently deploy containers to any configured host. On top of the container stack we can deploy web services for users to start, stop and manage their own workflows and services. Through collaboration with research partners we can develop a wide range of easy-to-use applications backed by well-documented scientific methods. It is apparent that where there is a strong and supportive relationship between the investigators, the provider of the enabling technology and the community, researchers have greater opportunities to create impact. Accessible and easy-to-use containerised workflows have the potential to accelerate diagnostic testing, simplify the transition to parallel computing and lower the bar of entry to complex computing.
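As a minimal sketch of how a service might build on the Docker API mentioned above, the snippet below uses the Docker SDK for Python to launch and track a user's containerised task; the image, command, paths and labels are placeholders rather than Pawsey's actual implementation.

import docker

# Connect to the Docker engine for this host (local socket, or a remote
# daemon selected via the usual DOCKER_HOST environment variable).
client = docker.from_env()

# Launch a containerised workflow step on behalf of a user; labels let the
# web service find and manage that user's containers later.
container = client.containers.run(
    "ubuntu:22.04",                                 # placeholder image
    ["bash", "-c", "echo running workflow step"],   # placeholder command
    volumes={"/scratch/user1": {"bind": "/data", "mode": "rw"}},
    labels={"owner": "user1", "workflow": "demo"},
    detach=True,
)

# The service can later list, stop or remove containers owned by the user.
for c in client.containers.list(all=True, filters={"label": "owner=user1"}):
    print(c.name, c.status)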


Biography:

Anders is a Web Applications Developer at the Pawsey Supercomputing Centre, where he works on computing and HPC-related problems within the wider research community and with Pawsey Partners.

Good practices in scientific computing

Dr Paulus Lahur1

1CSIRO (IMT), Clayton South, Australia

 

This talk presents a list of around ten major practices that require relatively little effort to implement but promise a big reward for the success of a scientific code. The list is drawn from the literature and from experience. The motivation is to help researchers who might not have the time or expertise to learn and implement every piece of wisdom in software engineering.


Biography:

Paulus Lahur joined CSIRO in his current role in 2015.

Virgil enhancements – Planned improvements for a distributed processing scheduling software system

Mr Lance Holden1, Dr Antonio Giardina2, Mr Denis Shine1

1DSTG, Edinburgh, Australia

2Deakin University, Melbourne, Australia

 

When the Defence Science and Technology Group (DSTG) analyse close combat via simulations, they are required to generate a large amount of interaction data from many small processing tasks. This data is used to execute simulations with enough replications to provide a sample set large enough to yield statistically significant insights. Initially, DSTG had a slow, manually controlled execution pipeline for all required processing and simulation tools, running across ad-hoc and poorly controlled computing resources. The tools used often lacked any built-in support for distribution, and DSTG managed this process manually.

A collaboration with Deakin University produced a new software system to manage the distribution of these tasks across a fixed network of known processing capability. This new system, called Virgil, used a mix of available open source components and custom scheduling modules. DSTG’s experience using Virgil exposed further areas where improvements could be made, such as the need for better methods of creating dependencies between task executions and the need to easily configure the required processing and simulation tools across all remote clients.

A new collaboration between DSTG and Deakin University has started to explore creating a task specification language and a reconfigurable network environment. These new features will provide a scalable and adjustable distributed processing platform that will improve the speed and reliability of data generation and simulation execution.
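To make the idea of explicit dependencies between task executions concrete, the following is a purely illustrative Python sketch (not Virgil's design or its planned task specification language) of a minimal task specification with dependencies, resolved into a dispatch order by topological sorting; the task names are hypothetical.

from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical task specification: each task names the tasks it depends on.
tasks = {
    "generate_interactions": set(),
    "run_simulation_rep_1": {"generate_interactions"},
    "run_simulation_rep_2": {"generate_interactions"},
    "aggregate_results": {"run_simulation_rep_1", "run_simulation_rep_2"},
}

# TopologicalSorter yields tasks only after all of their dependencies,
# which is the ordering a scheduler needs to dispatch work to clients.
ts = TopologicalSorter(tasks)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())      # tasks whose dependencies are satisfied
    print("dispatch in parallel:", ready)
    ts.done(*ready)                   # mark them finished for this sketch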


Biography:

Mr Denis Shine is a researcher working for the Defence Science and Technology Group. He specialises in the application of Land Combat Simulation to support Army decision making.

