Archetype and Prototype Analysis of Graphene-Oxide Nanoflakes

Dr Benyamin Motevalli1, Dr Amanda Barnard1, Dr Baichuan Sun1

1CSIRO Data61, Docklands, Australia


Since its discovery in 2004, graphene has attracted massive attention in a range of industries. Graphene is among the lightest, strongest and most electrically conductive materials known. However, after over a decade of intense research and development, graphene products are still almost non-existent. In this respect, the computational materials science community has also devoted substantial effort, especially to establishing the structure-property relationships of graphene oxides (GOs). To date, most available studies are limited to a few cases of specific topologies, sizes and oxygen contents.

Here, using intensive electronic structure calculations, more than 60,000 virtual experiments were conducted to develop a large dataset of GO nanoflakes. To characterise the GOs, a range of post-processing techniques were employed, yielding more than 500 descriptors, from which 90 features were identified that effectively describe the dataset. Clustering and archetypal analysis were then used to investigate the GO dataset and identify the most significant structures. At least 20 archetypes are required to explain more than 60% of the variance, while 45 prototypes are needed to reach a similar value. Since the archetypes are the pure types on the convex hull, their combinations can describe a larger number of members of the ensemble more efficiently than the prototypes can. To visualise the 90-dimensional dataset, all structures are mapped to the convex hull of the archetypes and coloured by different outputs as well as by cluster. The profiles of the archetypes and prototypes are also extracted and compared with their closest matches in the dataset.
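
For readers unfamiliar with archetypal analysis, the textbook formulation of Cutler and Breiman is sketched below. The abstract does not state which variant or solver was used, so the symbols (feature vectors x_i, archetypes z_k, weights alpha and beta) and the centred definition of explained variance are assumptions made for illustration only.

```latex
\begin{aligned}
&\min_{\alpha,\beta}\ \mathrm{RSS}(K)
  = \sum_{i=1}^{N} \Bigl\| x_i - \sum_{k=1}^{K} \alpha_{ik} z_k \Bigr\|^2,
  \qquad z_k = \sum_{j=1}^{N} \beta_{kj} x_j, \\
&\alpha_{ik} \ge 0,\quad \sum_{k=1}^{K} \alpha_{ik} = 1,
  \qquad
  \beta_{kj} \ge 0,\quad \sum_{j=1}^{N} \beta_{kj} = 1, \\
&\text{explained variance}(K)
  = 1 - \frac{\mathrm{RSS}(K)}{\sum_{i=1}^{N} \| x_i - \bar{x} \|^2}.
\end{aligned}
```

Each archetype z_k is a convex combination of data points, and each data point is approximated by a convex combination of archetypes, which is why the archetypes sit on the convex hull of the data.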


Dr Benyamin Motevalli is a Postdoctoral Fellow in Data61 at CSIRO. He has years of experience in developing and employing computational and numerical analysis techniques to establish a fundamental understanding of novel intelligent nanomaterials. His current research focuses on the rational design of materials through innovative data-driven models that offer the advantage of fusing complex experimental and computational data for a higher-level understanding of structure-processing-property relationships.

Online time series sensor data cleaning system: A case study in water quality

Dr Yifan Zhang1, Dr Peter Thorburn1, Mr Peter Fitch2

1CSIRO, Brisbane, Australia
2CSIRO, Canberra, Australia


High-frequency water quality monitoring offers comprehensive and improved insight into the temporal and spatial variability of the target ecosystem. However, most monitoring systems lack sensor data quality control. Missing sensor data, background noise and signal interference have long been major obstacles for users trying to understand and analyse sensor data, and they make the use of that data inefficient.

We therefore present an online data cleaning system for water quality sensor data. After collecting the raw sensor data, the system applies different data filters to the corresponding water quality sensor streams. In this approach, the specific environmental effects can be considered separately. Cleaned data streams are then sent to web-based frontend interfaces for end users.

There are two main tasks in this system: detecting and removing water quality outliers, and recovering missing sensor data. For the first task, the water quality filters are built on variable-specific thresholds, rates of change and statistical distributions. For the second, machine learning-based algorithms such as k-nearest neighbours (KNN) are applied to fill the gaps in the monitoring streams.
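
The two tasks can be illustrated with a minimal sketch. It is not the system described in the abstract: the function names, the pH-like limits, and the choice to find neighbours by distance in time are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_outliers(values, low, high, max_step, z_max=3.0):
    """Return a copy of `values` with suspected outliers replaced by None.

    Hypothetical filters: a variable-specific threshold (low/high), a
    rate-of-change limit (max_step) and a z-score limit (z_max).
    """
    valid = [v for v in values if v is not None]
    mu = mean(valid) if valid else 0.0
    sigma = stdev(valid) if len(valid) > 1 else 0.0
    cleaned = []
    previous = None  # last reading that passed every filter
    for v in values:
        keep = v is not None
        if keep and not (low <= v <= high):  # threshold filter
            keep = False
        if keep and previous is not None and abs(v - previous) > max_step:
            keep = False  # rate-of-change filter
        if keep and sigma > 0 and abs(v - mu) / sigma > z_max:
            keep = False  # statistical-distribution filter
        cleaned.append(v if keep else None)
        if keep:
            previous = v
    return cleaned


def knn_fill(values, k=2):
    """Fill None gaps with the mean of the k readings nearest in time."""
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)  # nothing to learn from
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            nearest = sorted(known, key=lambda item: abs(item[0] - i))[:k]
            filled[i] = mean(val for _, val in nearest)
    return filled


# Illustrative pH-like stream with one spike and one missing reading.
raw = [7.1, 7.2, 55.0, 7.3, None, 7.4, 7.5]
cleaned = flag_outliers(raw, low=0.0, high=14.0, max_step=1.0)
print(knn_fill(cleaned, k=2))
```

Running the filters before the gap filling means that a removed outlier is treated like any other missing value, so one imputation step handles both cases.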

The prototype system relieves end users of tedious data cleaning work and shows a significant improvement in the readability of the water quality sensor data. In the next stage, more neural network-based algorithms will be tested and integrated to provide more reliable and accurate data cleaning results.


Yi-Fan Zhang is a Postdoctoral Fellow in Agriculture & Food, CSIRO. He received a PhD in data science from Queensland University of Technology in 2016. His work focuses on deep learning for agricultural decision making and management, with an emphasis on time series modelling and forecasting.

Building a Data Intensive Infrastructure for evaluating the role of Universities in Society

Dr Richard Hosking1

1Curtin University, Perth, Australia


It is distinctly a human response to optimise for, or even game, any system of measurement. In this work we begin to question whether our current metrics of academic scholarship tell incomplete stories or align with what we truly value. We consider the idea of an Open Knowledge Institution, and how open access rates, the diversity of hiring, policy direction and research impacts are tracking across Australia and globally.

This talk focuses on our methods, infrastructure, code and practices for data collection to begin addressing these questions. These range from the use of cloud computing to dynamically scale and schedule our harvesting, to our partnerships for obtaining large data collections, to the ongoing challenges of working with incomplete, disjointed and often disagreeing data.

To date we have collected a multi-billion point dataset covering multiple decades, which extends well beyond Australian shores and links funding, publications, citations and other forms of alternative impact. To contextualise this big data we have also collected deeper qualitative metrics of universities.

An indisputable conclusion of the work thus far is the wide scope of questions we have yet to answer, and the desire for further information to tell the full story. There is scope to extend this framework: to collect more data, link more sources, and contextualise what we collect through further qualitative enquiry. However, not all of this can be done without reaching out and working openly. We end by posing the question: what could an open digital scholarly observatory in the 21st century look like?


Richard is a Senior Data Scientist working at the Curtin Institute for Computation. He holds a PhD in Computer Science from the University of Auckland. He has worked in both academia and industry in the areas of machine learning and data-intensive systems. This project is a collaboration with the Centre for Culture and Technology exploring the place of universities in society.

Geoscience data storage and management for advanced data analytics

Mrs Chitra Viswanathan1, Dr Irina Emelyanova1, Dr Ben Clennell1, Ms Stacey Maslin1

1CSIRO, Kensington, Australia


Due to recent advancements in sensor and computer technologies, the volume of geoscience data is constantly increasing. Development of a unified data repository that is readily accessible for performing data analytics is one of the research objectives of the Geoscience Data Analytics team within the CSIRO Energy business unit. A prototype of a well log database (WLD) was developed to demonstrate the concept of a repository for sanitized well log data downloaded from open source databases. Furthermore, a software package is being developed to implement the well log data analytics workflows developed by the team.

The WLD was built using the Pawsey supercomputing facility for storing the raw data. PostgreSQL was used to store the sanitized well log data from the Bonaparte, Carnarvon and Browse basins, together with metadata and links to the raw data. A web-based graphical user interface was developed to view, upload and download data.

One of the well log data analytics workflows, the unsupervised ensemble learning of electrofacies, was implemented in Python. A tool was developed to select data from the WLD, run the unsupervised clustering algorithms, view the results and combine the outputs of the different clustering methods into a single solution (ensemble clustering).
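
The abstract does not say how the clustering outputs are combined. One common approach, shown here purely as an assumed illustration, is a co-association matrix: samples are linked when most of the base clusterings place them in the same cluster, and the linked groups become the consensus clusters. The function names and the 0.5 threshold are hypothetical.

```python
def co_association(labelings):
    """Fraction of base clusterings in which each pair of samples co-occurs."""
    n = len(labelings[0])
    m = len(labelings)
    return [
        [sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
        for i in range(n)
    ]


def consensus_labels(labelings, threshold=0.5):
    """Merge samples whose co-association exceeds `threshold` (union-find)."""
    coassoc = co_association(labelings)
    n = len(coassoc)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if coassoc[i][j] > threshold:
                parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three hypothetical clusterings of six depth samples; label names differ
# between methods, which the co-association approach tolerates.
method_a = [0, 0, 0, 1, 1, 1]
method_b = [1, 1, 0, 0, 0, 0]
method_c = [2, 2, 2, 5, 5, 5]
print(consensus_labels([method_a, method_b, method_c]))
```

Because only pairwise agreement is counted, the base methods do not need to use matching label names or even the same number of clusters.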

Developing the WLD with the latest data management technologies has addressed data integrity issues. The software package will replace the different software platforms previously used to perform data analytics. In the next step of the project, this platform will be used for supervised and semi-supervised approaches to predicting rock properties.


Mrs Chitra Viswanathan is an experienced software developer working in the Energy business unit. She specialises in geological and geomechanical software applications. Her primary programming languages are VB.Net and ASP.Net. She has also worked in C and C++ in the past and is proficient in Python.

The first suite of software that Mrs Viswanathan developed while working for CSIRO was the Drillers’ Wellbore Stability Tool (DWST), which was used in numerous externally funded projects. Her latest software applications include three major pieces of software, Mechanical Earth Model (MEM), Sand Onset Prediction (SOP) and Volumetric Sand Production Prediction (VSPP), which were developed for a commercial sand management project with PETRONAS.

Mrs Viswanathan is currently in the Data Analytics team in the Exploration Geo-science group and is the principal developer of novel Geo-science data storage and sharing tools and data analytics workflows.

Introduction to Machine Learning

Mr Chris Watkins1

1CSIRO, Clayton South, Australia


As CSIRO embraces the transition into the technological age, it has spawned a variety of digital initiatives designed to accelerate researchers’ application of modern digital advances to their technical domains. One such initiative is the CSIRO Data School program, which has been designed to equip scientists with the tools necessary to apply defensible, reproducible data analytics to unique scientific datasets. This workshop will be built around a small part of the Data School program, designed to introduce participants to the opportunities and challenges offered by the application of modern Machine Learning (ML) techniques.

Our C3DIS offering will first introduce ML and demystify the associated hype, provide a light overview of some useful ML approaches and, most importantly, equip attendees with the ability to verify and validate the results produced by their ML pipeline. We will highlight some common difficulties with real world datasets, how to identify these problems and how to rectify them.
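
As one illustration of what verifying and validating a pipeline can involve, the sketch below estimates out-of-sample error with k-fold cross-validation. It is not taken from the workshop material: the straight-line model, the interleaved folds and all function names are assumptions chosen to keep the example self-contained.

```python
from statistics import mean


def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)  # assumes xs are not all identical
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def cross_validated_mae(xs, ys, k=5):
    """Mean absolute error measured only on data the model was not fitted to."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.extend(abs(ys[i] - (slope * xs[i] + intercept)) for i in test)
    return mean(errors)


# Synthetic data with a small alternating disturbance, for illustration only.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]
print(cross_validated_mae(xs, ys, k=5))
```

Interleaved folds suit independent samples. For time series data, ordered splits that never train on the future are usually the safer choice.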

The focus of the workshop will be on applications to scientific datasets, with examples including image data, time series data and regression problems. The workshop uses Python as its delivery vehicle, so some familiarity with the language will be assumed. We will be using Google Colaboratory as the compute environment, so attendees are only required to bring a laptop with an internet connection. Limited support will be on offer should attendees wish to set up their own local environments.


Chris Watkins is a research software engineer with the Scientific Computing team at CSIRO, working mostly with machine learning applications to scientific problems.

Building Modern Infrastructure for National Positioning Capability

Mr Jonathan Mettes1

1Geoscience Australia, Canberra, Australia


Geoscience Australia (GA) plays a coordinating role across government to provide accurate positioning information and data, and maintains a national network of Global Navigation Satellite System (GNSS) ground stations which forms part of a global observatory network.

Positioning systems typically provide accuracy to within 5-10 metres. GA is developing the capability to provide positioning information with better than 5 cm accuracy to the public at a national scale.

Precise positioning at this level opens a range of new and innovative applications, including major productivity improvements for agriculture, mining, engineering, logistics, transportation and location-based services.

In order to overcome the current gaps in coverage from mobile and radio communications, GA is deploying a test project of a Satellite-Based Augmentation System (SBAS). This will ensure that accurate positioning information can be received anytime and anywhere within Australia and New Zealand.

Vision for the future:

– Transition to real-time operations

– Transition from static to kinematic applications support

– Transition from science to industrial and environmental applications

– Scaling from 1,000 users to 100,000+ users

To scale for future needs, GA has migrated large parts of its data centre into the cloud (Amazon Web Services), and is currently in the process of modernising the infrastructure for data delivery. Jonathan will cover the challenges and successes in rolling out this infrastructure.


Jonathan is a software developer at Geoscience Australia (GA), working in a team of software developers and geodesists to build network infrastructure that provides centimetre-level positioning capability at a national scale. He studied software development at the Australian National University, and he joined the Geodesy team at GA in March.

Fault detection in FinFans – physics based modeling vs. machine learning

Daniel Marrable1, Mr Shiv Meka1, Mr Amir Najafi Amin2, Dr Kristoffer McKee2, Mr Nathan Jombwe3, Prof Ian Howard2

1Curtin Institute for Computation, Curtin University, Bentley, Australia 6102
2School of Civil and Mechanical Engineering, Curtin University, Bentley, Australia 6102
3Cisco Innovation Center, Bentley, Australia 6102


Air-cooled heat exchangers are commonly used in fractional distillation rigs. Finfans are a major component of these exchangers. A typical oil and gas refinery employs several tens of arrays of finfans, fans that are used to control the temperature of the distillation column. Operating conditions, mechanical wear and anomalies in operational procedures make finfans prone to faults. Given the complexity of detecting finfan faults, such as bearing faults and corrosion, several physics-based diagnostic approaches have been proposed in the past. Although rigorous, these methods may have shortcomings owing to the challenges posed by the overall heat-exchanger setup.

This talk presents work from a recently concluded project carried out in collaboration between the Curtin Institute for Computation, Woodside, and the Faculty of Science and Engineering at Curtin University. The research involved architecting methods to detect finfan faults and comparing two design paradigms: one physics based and the other data driven. The talk also elaborates on the bottom-up technicalities and intricacies involved in the choice of sensors, micro-controllers and communication protocols, the design of energy-efficient workflows, code structuring, machine learning architectures suited to embedded controllers, and testing.


Shiv Meka works as an HPC Specialist for the Curtin Institute for Computation, a data and computational science institute set up to help Curtin researchers with computational research problems. He has a bachelor’s degree in electrical engineering from India and received his master’s in Materials Science and Engineering from Texas A&M University, College Station, in 2009. He has since been involved in research relating to computational materials science, quantum transport (especially transport in nanoscale junctions) and multiscale modeling. Although not part of his formal training, computational “experiments” have become his passion. As an HPC/“Catalyzing” Specialist, he will collaborate with faculty in Science and Engineering and seek avenues to accelerate the discovery process. He also periodically reviews journal articles for IEEE Transactions on Nanotechnology, Journal of Nanotechnology, and Journal of Chemical Physics.

Time Series Analytics with Simple Relational Database Paradigms

Mr Ben Leighton1, Ms Julia Anticev1, Mr Alex Khassapov1

1CSIRO, Clayton, Australia


The aim of the Energy Use Data Model (EUDM) project led by CSIRO Energy is to make Australian energy-use data accessible to the wider research community. A subset of this energy data, sensor readings from substations, has been provided by electricity distributors from across Australia. The EUDM project has harmonized this data to a standard format. The current set of data constitutes around 500 million observations. A goal of the project is to add further value to these harmonized datasets by generating select pre-processed analytics products. Here we describe our initial work on “Medium Size Data Analytics” and show that, with no assumptions about time series alignment, large time series joins can be generated in reasonable time within a familiar relational database paradigm, using simple infrastructure and a minimum of Python code.
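
A minimal sketch of joining two unaligned time series in a relational database follows. The abstract does not name the database or the join strategy, so the in-memory SQLite database, the table layout, the 30-minute bucket width and the function name are all assumptions for illustration.

```python
import sqlite3


def join_on_buckets(series_a, series_b, bucket_seconds=1800):
    """Join two unaligned series by averaging each into fixed time buckets.

    Each series is a list of (unix_timestamp, value) pairs; timestamps are
    integers and do not need to line up between the two series.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (series TEXT, ts INTEGER, value REAL)")
    conn.executemany("INSERT INTO obs VALUES ('a', ?, ?)", series_a)
    conn.executemany("INSERT INTO obs VALUES ('b', ?, ?)", series_b)
    rows = conn.execute(
        """
        WITH bucketed AS (
            -- Integer division maps every timestamp to its bucket number.
            SELECT series, ts / :w AS bucket, AVG(value) AS mean_value
            FROM obs
            GROUP BY series, ts / :w
        )
        SELECT a.bucket * :w, a.mean_value, b.mean_value
        FROM bucketed AS a
        JOIN bucketed AS b ON a.bucket = b.bucket
        WHERE a.series = 'a' AND b.series = 'b'
        ORDER BY a.bucket
        """,
        {"w": bucket_seconds},
    ).fetchall()
    conn.close()
    return rows


# Two hypothetical substation streams sampled at different, unaligned times.
substation_a = [(0, 1.0), (900, 3.0), (1800, 5.0)]
substation_b = [(60, 10.0), (2000, 20.0), (4000, 30.0)]
for row in join_on_buckets(substation_a, substation_b):
    print(row)
```

Bucketing turns an inexact "nearest timestamp" match into an ordinary equality join, which is the kind of operation relational databases are built to index and optimise.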


Ben Leighton is a Software Engineer and Data Scientist working at CSIRO Land and Water. His work includes engineering collaborative technologies for data and code. His research interests are reusable, reproducible and portable environmental science and science systems.

Fast time series analysis of wave hindcast data

Mr Robert Davy1, Dr Ron Hoeke2, Ms Claire Trenham2, Dr Julian O’Grady2, Dr Mark Hemer2, Ms Rebecca Gregory2

1CSIRO Information Management & Technology, Canberra, Australia
2CSIRO Oceans and Atmosphere, Aspendale, Australia


A CSIRO – Bureau of Meteorology partnership has been running gridded wave hindcast models at hourly time steps to produce estimates for historical ocean wave heights, fluxes and energy. Like many other gridded models, this output is optimised for spatial extracts at a given time step. Using data in this native form, constructing a 30+ year hourly time series at a grid point can take around 90 minutes, and large scale spatial analysis of time series extreme values is not practical.

An eResearch Collaboration Project was initiated with the aim of streamlining access to this data for time series analysis. Large speedups were achieved by reorganising the data into spatial tiles, concatenating in time, then performing NetCDF chunking in the time dimension. Due to the large memory requirements, processing is performed using a number of Bash and Python scripts on Ruby, CSIRO’s large-memory multiprocessor, with the capability to update as new data comes in. As a result, retrieval of the time series at a random grid point now takes around 0.1 seconds. Extreme value analysis of the entire Australian coastal domain can be done on the Pearcey cluster (using job parallelism) in around 10 minutes.
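
The effect of chunking in the time dimension can be illustrated by counting how many storage chunks a single-point time series read has to touch. The grid size and both chunk shapes below are hypothetical, since the abstract does not give the actual tiling, and the helper function is an assumption written for this illustration.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapped by a hyper-rectangular read.

    `selection` holds one (start, stop) index pair per dimension, with stop
    exclusive; `chunk_shape` holds the chunk length along each dimension.
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # chunk holding the first index
        last = (stop - 1) // size    # chunk holding the last index
        count *= last - first + 1
    return count


# Hypothetical hindcast: about 30 years of hourly steps at one grid point.
hours = 30 * 8760
point_series = ((0, hours), (42, 43), (17, 18))  # (time, lat, lon)

# Layout optimised for spatial extracts: one chunk per time step.
print(chunks_touched(point_series, chunk_shape=(1, 100, 100)))

# Layout chunked along time within small spatial tiles.
print(chunks_touched(point_series, chunk_shape=(8760, 10, 10)))
```

With these assumed shapes, the spatial layout needs one chunk read per hourly step, while the time-chunked layout needs one per year of data. That difference in read count is the kind of saving the reorganisation targets.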

An example is presented showing the science that this has enabled. We examine two historical storm-wave events, one at Sydney’s northern beaches (Collaroy-Narrabeen), Australia, and the other along the Coral Coast on the southern coastline of Viti Levu, Fiji. Both events resulted in significant damage to coastal structures.


Robert Davy is a scientific software engineer at CSIRO Information Management & Technology. He is a member of the Scientific Computing Data Processing Services team. His focus is on use of data processing pipelines, DevOps tools and statistical analysis to unlock the latent value contained in large datasets. He has provided short and medium term support to a number of science teams via the IMT eResearch Collaboration Projects. He also has a background in quantitative analysis for renewable energy applications, and has co-authored a number of journal publications in this area.


