The CSIRO Data Access Portal: 1 PB and beyond

Katie Hannan1

1CSIRO, Clayton, Australia

 

The CSIRO Data Access Portal (DAP) provides access to data collections published by CSIRO across a wide range of research domains. The DAP was publicly released in 2011, and its weekly storage growth rose from an average of 18 GB during Sept 2012-Feb 2013 to an average of 8 TB during Sept 2017-Feb 2018.

During the first week of March 2018 the DAP passed the 1 Petabyte (PB) mark. This significant milestone means that 1 PB of CSIRO data and software files are preserved together with their metadata for long-term discovery.

This poster will provide information about the nature of the data collections that are hosted on the Data Access Portal and the data access habits of our user community.


Biography:

Katie Hannan started work at CSIRO as a Data Librarian during the week that the Data Access Portal passed the 1 petabyte storage milestone. She is passionate about storytelling, cultural history projects, linking people with information and helping to facilitate learning experiences. Katie has a background working in higher education, eResearch and projects. Her own research interests are in the areas of human-computer interaction, digital legacy and the information society.

Web standards support science data

Dr Simon Cox1

1CSIRO Land and Water, Clayton South, Australia

 

The science community has developed many models for representation of scientific data and knowledge. For example, the biomedical community’s OBO Foundry federates ontologies covering various aspects of the life sciences, united through reference to a common foundational ontology (BFO). The SWEET ontology, originally developed at NASA and now governed through ESIP, is a single large unified ontology for earth and environmental sciences. On a smaller scale, GeoSciML provides a UML model and corresponding XML representation of geological mapping and observation data.

Key concepts related to scientific data and observations have now been incorporated into domain-neutral ontologies developed by the World Wide Web consortium. OWL-Time has been enhanced to support temporal reference systems needed for science, and deployed in a linked data representation of the geologic timescale. The Semantic Sensor Network ontology (SSN) has been extended to cover samples and sampling, including relationships between samples. Specific extensions for science are being added to the Data Catalog vocabulary (DCAT) used by data repositories such as RDA and CSIRO-DAP.

These standard vocabularies can be used directly for science data, or can provide a bridge to specialized domain ontologies. The W3C vocabularies directly support cross-disciplinary applications, are aligned with the core ontologies that are the building blocks of the semantic web, are hosted on well-known, reliable infrastructure, and are being selectively adopted by the general schema.org discovery framework.
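
As an illustration (not drawn from the abstract), the sample-to-sample relationships that the extended SSN ontology now covers can be modelled as plain subject-predicate-object triples. The sosa: namespace below is the published one; the sample identifiers (a borehole, a core, a subsample) are hypothetical:

```python
# Illustrative sketch: SSN/SOSA-style sampling relations as plain triples.
SOSA = "http://www.w3.org/ns/sosa/"
EX = "http://example.org/sample/"  # hypothetical identifiers

triples = [
    (EX + "core42", SOSA + "isSampleOf", EX + "borehole7"),
    (EX + "subsample3", SOSA + "isSampleOf", EX + "core42"),
]

def samples_of(feature, graph):
    """Return every sample related, directly or transitively, to a feature."""
    direct = {s for (s, p, o) in graph
              if p == SOSA + "isSampleOf" and o == feature}
    result = set(direct)
    for s in direct:
        result |= samples_of(s, graph)
    return result
```

Traversing `isSampleOf` transitively, as here, is what lets a query recover that a subsample of a core is ultimately a sample of the borehole.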


Biography:

Simon has been researching standards for publication and transfer of earth and environmental science data since the emergence of the world wide web. Starting in geophysics and mineral exploration, he has engaged with most areas of environmental science, including water resources, marine data, meteorology, soil, ecology and biodiversity. He is principal- or co-author of a number of international standards, including Geography Markup Language and Observations & Measurements, which have been broadly adopted in Australia and internationally. The value of these is in enabling data from multiple origins and disciplines to be combined more effectively, which is essential in tackling most contemporary problems in science and society. His current work focuses on aligning science information with semantic web technologies and linked open data principles, and on the formalization, publication and maintenance of controlled vocabularies and similar reference data.

Persistent URIs for Australian government: the Australian Government Linked Data Working Group’s governance and management of linked.data.gov.au

Mr Nicholas Car1, Dr Simon Cox3, Mr Ben Leighton2

1CSIRO, Dutton Park, Australia

2CSIRO, Clayton, Australia

3CSIRO Land and Water

 

The Australian Government Linked Data Working Group (AGLDWG, http://linked.data.gov.au) is a group consisting of Australian government agency representatives interested in Linked Data (LD). Recognising that one of the foundational aspects of operational LD is the provision and maintenance of persistent HTTP URIs (PIDs), the AGLDWG has run both an operational PID redirection service and established some governance procedures for URI allocation.

In the first quarter of 2018, several AGLDWG government agencies, represented collectively through the AGLDWG, signed a Memorandum of Understanding with the Digital Transformation Agency (DTA), the agency tasked with maintaining the Australian government’s data catalogue and data space at data.gov.au. The MoU seeks to ensure consultative governance of the subdomain linked.data.gov.au, and was the first, but certainly not the last, step in shoring up strong governance of a domain for LD PIDs.

The AGLDWG has now proposed a governance and technical maintenance regime for PID allocation and maintenance that builds on the group’s experience managing PIDs over the past four years, and on PID management regimes elsewhere, such as those implemented within agencies like Geoscience Australia and CSIRO and internet organisations such as the W3C and the Internet Archive. This regime is designed to scale and to be maintained for 25 years (the life of the current oldest digital PIDs).

In this presentation we will give background to the AGLDWG’s PID regime, detail the new regime and discuss challenges.


Biography:

Simon has been researching standards for publication and transfer of earth and environmental science data since the emergence of the world wide web. Starting in geophysics and mineral exploration, he has engaged with most areas of environmental science, including water resources, marine data, meteorology, soil, ecology and biodiversity. He is principal- or co-author of a number of international standards, including Geography Markup Language and Observations & Measurements, which have been broadly adopted in Australia and internationally. The value of these is in enabling data from multiple origins and disciplines to be combined more effectively, which is essential in tackling most contemporary problems in science and society. His current work focuses on aligning science information with semantic web technologies and linked open data principles, and on the formalization, publication and maintenance of controlled vocabularies and similar reference data.

Scratch Management and Scalable Flushing

Dr Robert Bell1, Mr Jeroen van den Muyzenberg2, Mr Steve McMahon3, Mr Peter Edwards1

1CSIRO IMT SC, Clayton, Australia

2CSIRO IMT SC, now Griffith University

3CSIRO IMT SC

 

High Performance Computing centres provide storage to complement compute services.  Typically, they configure their highest-performing filesystem as a ‘scratch’ area, providing space for the temporary storage of data, and shared among all the users.

HPC service providers use scheduling to ensure compute resources are allocated to the stakeholders and users in some way reflecting need, entitlement and fairness.  The same criteria need to apply to shared storage.

Quotas are typically used to control storage usage, but to support large problems over-allocation is needed, along with some mechanism to clear out old data to make way for the new.  This has proved to be a difficult problem as filesystems have grown to meet the compute needs, storing hundreds of millions of files.

This paper canvasses ways to manage shared filesystems for temporary storage, and then provides a new algorithm for flushing old files that is highly scalable and responsive.
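
The abstract does not disclose the new algorithm itself. For contrast, a naive age-based flush of the conventional kind, which struggles at the hundreds-of-millions-of-files scale described above, might look like this sketch (all names and thresholds are illustrative):

```python
import os
import time

def flush_scratch(root, max_age_days=90, now=None):
    """Delete regular files under root not accessed within max_age_days.
    A single-pass sketch for illustration only: a production flusher must
    scale to hundreds of millions of files, the problem this paper tackles."""
    now = now if now is not None else time.time()
    cutoff = now - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                pass  # file vanished or is inaccessible; skip it
    return removed
```

The weakness of this approach is the full metadata scan on every pass; a scalable flusher needs to avoid re-walking the entire namespace each time.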


Biography:

Robert Bell first worked for CSIRO as a vacation student at CSIRO Division of Meteorological Physics in November 1967.

From 1974, he worked for about 15 years in the CSIRO Division of Atmospheric Research, in programming various models of the ocean and atmosphere, and latterly in managing the computing group.

From 1990, he moved into providing support and services for CSIRO scientific computing (including a joint centre with the Bureau of Meteorology).  He is currently responsible for the administration of CSIRO’s HPC National Partnerships.

He has majored on data storage facilities for science, having nurtured the CSIRO SC Data Store for over 26 years.

Since September 2015, he has been seconded part-time to the Bureau of Meteorology’s Scientific Computing Services group.

He is driven to provide services for science, particularly in computing and storage services, and in user support, having been a user himself of such services in the past.

The CSIRO ASKAP Science Data Archive

Dr Minh Huynh1, Mr James Dempsey2, Ms Margaret Ophel2, Dr Matthew Whiting3

1CSIRO Astronomy and Space Science, Kensington, Australia

2CSIRO IM&T, Canberra, Australia

3CSIRO Astronomy and Space Science, Marsfield, Australia

 

Astronomy is moving into the era of “big data”, a paradigm where data volumes will reach tens of PB (ASKAP) and, in the near future, hundreds of PB (SKA1). Science from the next generation of radio telescopes requires long-term storage of the data and tools for querying and accessing it. CSIRO IM&T and CASS have addressed this by building the CSIRO ASKAP Science Data Archive (CASDA) to provide long-term storage for Australian SKA Pathfinder (ASKAP) data products, together with the hardware and software facilities that enable astronomers to access the data. CASDA will store ~5 PB per year from ASKAP and serve that to astronomers around the world using both virtual observatory (VO) and web-based portal services. This paper will present the current status of CASDA and future development plans.


Biography:

Minh Huynh graduated from ANU with a PhD in Astronomy and Astrophysics in 2005. She worked in NASA’s Spitzer Space Telescope and Planck Observatory teams at Caltech in Los Angeles, before moving back to Perth as Deputy International Project Scientist for the Square Kilometre Array at the University of Western Australia. CSIRO is currently commissioning the Australian SKA Pathfinder (ASKAP), a next generation radio telescope. Now at CSIRO in Perth, Minh continues astronomical research on radio galaxies while also preparing CSIRO’s ASKAP Science Data Archive for the big data deluge from ASKAP.

Data management at Geoscience Australia and the use of persistent identifiers

Dr David Lescinsky1, Alex Ip1, Nicholas Car2

1Geoscience Australia, Canberra, Australia

2CSIRO Land & Water, Dutton Park, Australia

 

Persistent identifiers are a critical component of Geoscience Australia’s (GA) data management systems and underpin GA’s linked data efforts. Currently, GA generates persistent identifiers for products (reports, data sets, maps, data services, software, applications, etc.), physical samples, surveys and vocabularies with additional objects planned (features, organisations, etc.). The goal of the persistent identifiers is to enable users to identify the official, definitive metadata for these collections in perpetuity and provide access to the data. Additional benefits include the improved citation of GA data and tracking of usage.

GA’s persistent identifiers are supported by a shared technology stack consisting of minting tools, catalogues/databases, and a Linked Data API. Under the current implementation, calls on the persistent identifiers are handled by the Linked Data API to generate a Landing Page following the designated pattern for the type of object, or to provide specific information related to the requested object.
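
GA’s Linked Data API internals are not detailed here. The general resolution pattern, mapping a persistent identifier plus an Accept header to either an HTML landing page or a machine-readable representation, can be sketched as follows (the registry entries and endpoints below are hypothetical, standing in for the catalogue lookup):

```python
# Hypothetical PID registry: none of these identifiers or endpoints are
# GA's actual ones; the dict stands in for a catalogue/database lookup.
REGISTRY = {
    "pid.example.org/dataset/abc123": {
        "text/html": "https://portal.example.org/dataset/abc123",
        "text/turtle": "https://api.example.org/dataset/abc123?_format=ttl",
    },
}

def resolve(pid, accept="text/html"):
    """Map a persistent identifier and an Accept media type to a redirect
    target, falling back to the HTML landing page for unknown types."""
    entry = REGISTRY.get(pid)
    if entry is None:
        return None  # would surface as an HTTP 404
    return entry.get(accept, entry["text/html"])
```

The key property this pattern preserves is that the persistent identifier itself never changes: only the registry entries behind it are updated as systems move.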

As persistent identifiers become more common there is a notable increase in their usage, however, the way in which the persistent identifier is used varies. The correct usage needs to be encouraged, as highlighted by cases where products are being “republished” using new persistent identifiers and references to new metadata catalogues. This practice undermines GA’s ability to present a “single point of truth” and to track usage of its products.


Biography:

David Lescinsky has an M.Sc. and Ph.D. in Earth Sciences and more than 20 years of experience working as a geologist. David trained as a volcanologist, working on active volcanoes around the world and simulating eruptive processes in the lab. During his career, David has moved across sectors, working in academia as assistant professor of environmental science at the University of Western Ontario, Canada, and in industry as a project geologist/hydrologist for Geolex, Inc. in Albuquerque, NM. David now works in government at Geoscience Australia, where he has transitioned from geological modelling for geothermal exploration, basin analysis and CO2 sequestration assessment projects to a role as a Senior eResearch Strategist.

David is currently the team lead of GA’s Informatics Team, which covers data governance, catalogues, Linked Data, Virtual Laboratories and support for GA’s High Performance Computing and Data.

The CSIRO 5-Star FAIR Data Rating system

Dr Simon Cox1, Dr Jonathan Yu1

1CSIRO, Clayton, Australia

 

Finding, using and trusting high-quality datasets in any discipline is a “grand challenge”. Often datasets are not curated in a way that allows users or machines to decide whether they are fit for purpose. Conversely, data providers lack tooling and guidance for assessing the quality, or lack thereof, of the datasets they are publishing.

To address this gap, we have developed a 5-star data rating system that considers data quality criteria based on the FAIR data principles. Our rating system includes specific metrics (or examples) of how the FAIR principles could be met, to serve as concrete goals for data providers to aim for. This is particularly useful for FAIR’s Interoperable and Reusable principles (broken down into loadable, useable and comprehensible). We make suggestions around formats and technologies, drawn from our experience with geospatial data. All of the FAIR principles are covered by the CSIRO 5-star Data Rating criteria, and we add three further qualities, published, updated/maintained and trusted, which the FAIR principles do not cover.

We have also developed a companion 5-star Data Rating tool to allow self-assessment of a dataset against the above qualities, with a rating for each quality. The tool allows users to rate their data according to its current state. The questions presented to users also serve as tangible targets showing how one can improve a data publication. See http://oznome.csiro.au/5star/
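
The tool’s actual scoring formula is not given in the abstract; purely as an illustration, a self-assessment might aggregate per-quality scores into an overall rating like this (the quality names are taken from the abstract, but the averaging is an assumption, not the tool’s method):

```python
QUALITIES = [
    "findable", "accessible", "loadable", "useable", "comprehensible",
    "published", "updated", "trusted",
]  # qualities named in the abstract; the scoring below is illustrative

def star_rating(scores):
    """Average per-quality scores (each 0-5) into an overall star rating.
    This simple mean is an assumed aggregation for illustration only."""
    missing = [q for q in QUALITIES if q not in scores]
    if missing:
        raise ValueError(f"unscored qualities: {missing}")
    return round(sum(scores[q] for q in QUALITIES) / len(QUALITIES), 1)
```

Forcing every quality to be scored before a rating is produced mirrors the tool’s role as a checklist: the unanswered questions are themselves the improvement targets.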

We present examples in the context of Australian government and research data showing how we can use this tool to assess data quality and suggest improvements.


Biography:

Simon has been researching standards for publication and transfer of earth and environmental science data since the emergence of the world wide web. Starting in geophysics and mineral exploration, he has engaged with most areas of environmental science, including water resources, marine data, meteorology, soil, ecology and biodiversity. He is principal- or co-author of a number of international standards, including Geography Markup Language and Observations & Measurements, which have been broadly adopted in Australia and internationally. The value of these is in enabling data from multiple origins and disciplines to be combined more effectively, which is essential in tackling most contemporary problems in science and society. His current work focuses on aligning science information with semantic web technologies and linked open data principles, and on the formalization, publication and maintenance of controlled vocabularies and similar reference data.

Excel2LDR: Lowering the bar to entry for defining vocabularies as Linked Data

Dr Jonathan Yu1, Dr Simon Cox1

1CSIRO, Clayton, Australia

 

Web technologies are changing the way scientific data is shared. For efficient sharing of data, the content of datasets must use descriptors that are also shared by both humans and machines at scale. Linked Data supports this by leveraging web principles to allow links between and within datasets. Individual definitions about fields within datasets may be assembled into vocabularies, and published at standard web locations for use in multiple datasets. For example, a soil profile dataset should use a standard soil classification, in which each soil type is denoted by a web identifier (http URI), which can be de-referenced to get a description of the type, formatted using web standards like OWL or SKOS.

We have technologies and tools to publish vocabularies as Linked Data, such as SISSVoc or the LDR service. These have been used to manage and publish reference data, including codelists, units of measure, substances, organisations and general ‘vocabulary’ elements. However, preparation of vocabulary content for these has required specialist RDF-based tools like TopBraid and PoolParty. Domain scientists are more familiar with standard desktop productivity tools like Excel.

We have developed an Excel2LDR tool that enables users to define vocabulary content in an Excel template, and publish it directly into a Linked Data Registry (LDR) without leaving the Excel application. We present examples of Excel2LDR use and publication of vocabulary content to CSIRO LDR instance. We also compare this with other Excel-based implementations.
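
Excel2LDR itself reads an Excel template from within the Excel application; the core transformation it performs, from tabular vocabulary rows to SKOS, can be sketched in pure Python as below. The rows are given as plain dicts standing in for spreadsheet rows, and the column names ('id', 'label', 'definition') and example namespace are assumptions, not the actual template:

```python
def rows_to_skos(rows, base_uri):
    """Render tabular vocabulary rows as SKOS concepts in Turtle.
    Each row has an 'id', a 'label' and an optional 'definition'
    (assumed column names, standing in for an Excel template)."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."]
    for row in rows:
        lines.append("")
        lines.append(f"<{base_uri}{row['id']}> a skos:Concept ;")
        has_def = bool(row.get("definition"))
        lines.append(f'    skos:prefLabel "{row["label"]}"@en' +
                     (" ;" if has_def else " ."))
        if has_def:
            lines.append(f'    skos:definition "{row["definition"]}"@en .')
    return "\n".join(lines)

ttl = rows_to_skos(
    [{"id": "clay", "label": "Clay", "definition": "Fine-grained soil."}],
    "http://example.org/soil/")
```

Keeping the tabular-to-RDF mapping this mechanical is what lets domain scientists stay in a familiar spreadsheet while still publishing well-formed Linked Data.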


Biography:

Dr Jonathan Yu is a data scientist researching information and web architectures, data integration, Linked Data, data analytics and visualisation and applies his work in the environmental and earth sciences domain. He is part of the Environmental Informatics group in CSIRO Land and Water. He currently leads a number of initiatives to develop new approaches, architectures, methods and tools for transforming and connecting information flows across the environmental domain and the broader digital economy within Australia and internationally.

Developing a Linked Data convention for standardising scientific data encoded using netCDF

Dr Jonathan Yu1, Mr Mark Hedley2, Mr James Biard4, Dr Adam Leadbetter3

1CSIRO, Clayton, Australia, 2UK Met Office, Exeter, United Kingdom, 3Marine Institute, Galway, Ireland, 4NOAA, USA

 

netCDF is a format for encoding array-oriented scientific data, with adoption in many domains including climatology, oceanography and hydrology. The format is both an open standard and an Open Geospatial Consortium (OGC) international standard.

A common practice is to use the Climate and Forecasting (CF) convention (http://cfconventions.org/) and the Attribute Convention for Data Discovery (ACDD) (http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery_1-3). Several communities are defining additional netCDF conventions to describe semantics relating to their domains which are outside the scope of CF and ACDD. As the concurrent use of multiple, possibly clashing, conventions spreads, we are faced with the challenge of finding a common mechanism to validate and interpret metadata embedded inside netCDF files.

Linked Data (LD) describes a method for encoding and publishing metadata so that data within and across datasets can be exposed, connected and made more useful through semantic queries. LD approaches present an opportunity for the netCDF community to address the above challenge and, beyond that, to enhance data discovery and use.

We present some early work in a current OGC activity to define the netCDF-Classic-LD convention for constructing and interpreting metadata and structures found in netCDF files as LD. The aim is to enable netCDF data to be unambiguously linked with published conventions and controlled vocabularies, and to allow the semantics across files and datasets to be described. Specifically, netCDF-Classic-LD can be applied to support standardisation, validation and interpretation of netCDF content in relation to the conventions defined by different science disciplines and communities.
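
The netCDF-Classic-LD convention is still being defined in the OGC activity, so the following is only a sketch of the general idea: interpreting prefixed netCDF attribute names as Linked Data terms by expanding them against a prefix map. The '__' separator and the attribute names are assumptions for illustration, not the finalised convention:

```python
def expand_attrs(attrs, prefixes, sep="__"):
    """Expand prefixed netCDF attribute names (e.g. 'skos__prefLabel')
    into full-URI keys using a prefix map. The '__' separator is an
    assumed encoding for illustration, not the finalised convention."""
    expanded = {}
    for name, value in attrs.items():
        if sep in name:
            prefix, local = name.split(sep, 1)
            if prefix in prefixes:
                expanded[prefixes[prefix] + local] = value
                continue
        expanded[name] = value  # unprefixed attributes pass through as-is
    return expanded
```

Once attribute names resolve to full URIs, the same terms can be validated and interpreted consistently across files and datasets regardless of which convention contributed them.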


Biography:

Dr Jonathan Yu is a data scientist researching information and web architectures, data integration, Linked Data, data analytics and visualisation and applies his work in the environmental and earth sciences domain. He is part of the Environmental Informatics group in CSIRO Land and Water. He currently leads a number of initiatives to develop new approaches, architectures, methods and tools for transforming and connecting information flows across the environmental domain and the broader digital economy within Australia and internationally.

ABOUT AeRO

AeRO is the industry association focused on eResearch in Australasia. We play a critical coordination role for our members, who are actively transforming research via Information Technology. Organisations join AeRO to advance their own capabilities and services, to collaborate and to network with peers. AeRO believes researchers and the sector significantly benefit from greater communication, coordination and sharing among the increasingly diverse and evolving service providers.
