High Performance Data – Efficient Interoperability for Scientific Data

Alex Ip, Andrew Turner, Dr. David Lescinsky¹

¹Geoscience Australia, Canberra, Australia


Geoscience Australia (GA) is the steward of large volumes of geoscientific and environmental data extending over the entire Australasian region and spanning many decades. The volume and variety of data that must be managed, coupled with the increasing need to support machine-to-machine data access, mean that the old “click-and-ship” model of delivering data as downloadable files for local analysis is rapidly becoming unviable – a “big data” problem not unique to geoscience. Text-based formats (e.g. CSV) have been used to provide interoperability, but at the cost of efficiency, and they do not provide standardised metadata handling.

Historically, GA’s datasets have been stored in a range of formats (some proprietary), with metadata of varying quality and accessibility, and without standardised vocabularies. To address these issues, GA has elected to use the Network Common Data Form (NetCDF) 4, which is built on the Hierarchical Data Format (HDF) 5, to support standards-based scientific data delivery via web services. This flexible approach supports not only large-scale HPC processing and access from commercial cloud environments, but also interactive access to a wide range of data in user-friendly environments such as IPython notebooks and more sophisticated cloud-enabled portals such as the Virtual Geophysics Laboratory (VGL).

When combined with the use of standards-based data services and APIs, a coordinated, systematic modernisation of data formats will result in vastly improved accessibility to, and usability of, scientific data in a wide range of computational environments both within and beyond the geoscientific community.


