Integration of plot-based ecology data using standard semantic vocabularies

Dr Simon Cox1, Dr Siddeswara Guru2, Edmond Chuc2, Tina Schroeder2, Mosheh Eliyahu2, Yi Sun2, Jenny Mahuika2

1CSIRO, Melbourne, Australia, 2TERN, University of Queensland, Brisbane, Australia

Abstract:

Plot-based ecology data are collected by different agencies across multiple jurisdictions. Data are collected using varying survey methods and procedures, even though the natural systems and observed properties are similar and the underlying methods all derive from a few common survey protocols. Furthermore, data representations and formats vary. As a consequence, use of the data in analysis is usually confined to the jurisdiction in which the data were collected.

Combining datasets would enable their use at different scales for analysis and synthesis. In this paper, we describe an approach to representing plot-based ecology data using standard semantic models, which allows observation data to be integrated into a common data structure. The structure uses the W3C/OGC Semantic Sensor Network (SSN) vocabulary, supplemented by a small number of domain-specific classes. It is compatible with a variety of other observation data systems used internationally, potentially allowing integration with relevant non-ecology data.
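As a rough illustration of this structure (not the authors' actual implementation), the sketch below uses the rdflib Python library to encode a single plot observation against the W3C/OGC SOSA/SSN vocabulary; the example.org identifiers, the vegetation-height property and the protocol name are entirely hypothetical placeholders.

```python
# Minimal sketch: one plot-based observation expressed with the SOSA/SSN vocabulary.
# All example.org identifiers and property names are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

SOSA = Namespace("http://www.w3.org/ns/sosa/")
EX = Namespace("http://example.org/plot-data/")

g = Graph()
g.bind("sosa", SOSA)
g.bind("ex", EX)

obs = EX["obs-0001"]
g.add((obs, RDF.type, SOSA.Observation))
g.add((obs, SOSA.hasFeatureOfInterest, EX["plot-NSW-123"]))   # the survey plot
g.add((obs, SOSA.observedProperty, EX["vegetation-height"]))  # term from a controlled vocabulary
g.add((obs, SOSA.usedProcedure, EX["survey-protocol-1"]))     # the survey protocol applied
g.add((obs, SOSA.hasSimpleResult, Literal(1.8, datatype=XSD.double)))
g.add((obs, SOSA.resultTime, Literal("2019-05-20T10:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```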

The plot model is completed by controlled vocabularies that provide the value spaces for key slots in the model. Currently, the controlled vocabularies are mostly governed by the individual providers, though they are published using a common linked-data platform provided by the ARDC. Full semantic integration requires mappings and harmonization between the controlled vocabularies, which raises some background science questions.

We will discuss some of the initial implementation progress of a system to load data from the different providers into a common datastore underpinned by a graph database.


Biography:

Simon Cox has been researching standards for the publication and transfer of earth and environmental science data since the emergence of the World Wide Web. He is principal or co-author of a number of international standards that have been broadly adopted in Australia and internationally. Their value lies in enabling data from multiple origins and disciplines to be combined more effectively, which is essential in tackling most contemporary problems in science and society. His current work focuses on aligning science information with semantic web technologies and linked open data principles, and on the formalization, publication and maintenance of controlled vocabularies and similar reference data.

Simon was awarded the 2006 Gardels Medal by the Open Geospatial Consortium, and presented the 2013 Leptoukh Lecture for the American Geophysical Union. Simon is currently on the Executive Committee of CODATA, and is a member of the National Committee for Data in Science.

Location-index: A spatial knowledge graph dataset

Dr Jonathan Yu1, Mr Paul Box2, Dr David Lemon3, Mr Ashley Sommer4, Mr Shane Seaton3, Mr Benjamin Leighton1, Dr Simon Cox1

1CSIRO Land and Water, Clayton, Australia, 2CSIRO Land and Water, Eveleigh, Australia, 3CSIRO Land and Water, Black Mountain, Australia, 4CSIRO Land and Water, Brisbane, Australia

Abstract:

The Location Index (Loc-I) project has developed a spatial knowledge graph, along with tools that provide a consistent way to reference location information and support the integration of observational datasets with location embedded within them (e.g. census data, business registers, and environmental observations).

We have developed Loc-I Datasets that feature web-resolvable identifiers using Linked Data. Specifically, we have developed Linked Data versions of a subset of the ASGS 2016 published by the ABS, Geofabric v2.1.1 published by the Bureau of Meteorology, and G-NAF published by PSMA. Web-resolvable identifiers provide a mechanism for referencing a location and its metadata, and for linking between datasets and geographies consistently. We have also developed Loc-I Linksets, which are datasets containing links between datasets, e.g. between ASGS 2016 and Geofabric v2.1.1. The semantics of each location and linkset are captured via ontologies, and additional information is attached as metadata, e.g. the calculated area of intersection between features.

Loc-I datasets and linksets are assembled into a graph database, and APIs are provided for querying. This system provides researchers with tools to carry out analysis across geographies, for example supporting reapportionment of census data captured in ABS Mesh Blocks to Geofabric River Regions, or vice versa, in real time without needing GIS tools for processing. In this presentation, we will provide an overview of the Loc-I system and its source data, and demonstrate the use of the Loc-I spatial knowledge graph to integrate across different datasets.
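As a rough sketch of the reapportionment idea only (not the Loc-I data model or API), the following Python fragment uses pandas to apportion hypothetical Mesh Block counts to River Regions via intersection areas of the kind recorded as linkset metadata; all identifiers, column names and figures are illustrative.

```python
# Area-weighted reapportionment from one geography to another using a linkset.
# Column names and values are illustrative, not the actual Loc-I schema.
import pandas as pd

# Census counts keyed by Mesh Block identifier (hypothetical values).
census = pd.DataFrame({
    "mesh_block": ["MB1", "MB2", "MB3"],
    "population": [320, 150, 410],
})

# Linkset: overlaps between Mesh Blocks and River Regions,
# with the intersection area recorded as link metadata.
linkset = pd.DataFrame({
    "mesh_block":        ["MB1", "MB1", "MB2", "MB3"],
    "river_region":      ["RR-A", "RR-B", "RR-A", "RR-B"],
    "intersection_area": [0.75, 0.25, 1.0, 1.0],
})

# Convert intersection areas into per-Mesh-Block weights, then apportion counts.
linkset["weight"] = (linkset["intersection_area"]
                     / linkset.groupby("mesh_block")["intersection_area"].transform("sum"))
apportioned = linkset.merge(census, on="mesh_block")
apportioned["population_share"] = apportioned["population"] * apportioned["weight"]

print(apportioned.groupby("river_region")["population_share"].sum())
```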


Biography:

Dr Jonathan Yu is a data scientist with the Environmental Informatics group in CSIRO. He has expertise in information and web architectures, data integration (particularly Linked Data), data analytics and visualisation. Jonathan is the technical lead for the Location Index project, which aims to enable and improve location-based integration of data in Australia.

Use of Open-Source software for ingesting, processing, displaying and delivering data from the CosmOz soil moisture sensor network

Mr Ashley Sommer1, Mr Matthew Stenson1, Mr David McJannet1

1CSIRO Land and Water, Dutton Park, Australia

Abstract:

The software that CSIRO's CosmOz soil moisture sensor network uses for ingesting, processing, displaying, and delivering soil moisture data is showing its age, has become slow, and is due for replacement.

In 2010, CSIRO Land and Water installed cosmic-ray sensors in locations around Australia to form the CosmOz network. These novel sensors use cosmic rays from space to measure average soil moisture over an area of 30 hectares to depths of 10 to 50 cm. This technique has a major advantage over conventional on-ground soil moisture sensing technology that can only measure moisture content within small volumes of soil.

The sensors have been generating hourly time-series data for nine years, with raw data transmitted via satellite and delivered by email to the now-legacy CosmOz processing scripts, stored in a proprietary relational database, displayed to consumers via a WordPress site, and made available for download as text files. While that software stack has worked well enough for nine years, it has become very slow as accumulated data volumes have increased.

In late 2019 our team designed and built a new CosmOz Data Processing Pipeline, CosmOz REST API, and CosmOz Web UI using modern open-source software. Components were implemented using MongoDB, InfluxDB, Python 3, EmberJS, amCharts 4, OpenAPI 2.0, and Docker.
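As a minimal sketch of the ingest step, the fragment below writes one processed hourly point to InfluxDB using the influxdb (1.x) Python client; the database, measurement, tag and field names are hypothetical, not the actual CosmOz schema.

```python
# Sketch: writing one processed hourly soil-moisture point to InfluxDB.
# Database, measurement, tag and field names are hypothetical.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="cosmoz")

point = {
    "measurement": "soil_moisture",
    "tags": {"site": "example-site"},
    "time": "2019-11-01T10:00:00Z",
    "fields": {"volumetric_water_content": 0.27, "raw_neutron_count": 1450},
}

client.write_points([point])
```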

The open-source packages selected have delivered high-quality data ingest and processing, easy-to-use data visualisation, and powerful API access options. This presentation will describe the data processing steps and provide examples of the GUI available to users.


Biography:

Ashley is an Informatics Software Engineer working at CSIRO Land and Water in Brisbane.

Ashley has a BA in Information Technology from Griffith University and has been working in the Environmental Informatics Group in L&W since 2015. Ashley is an open-source software enthusiast and active open-source project contributor. These days primarily a Python developer, Ashley is the maintainer of several popular open-source Python libraries and a core contributor to many others.

Value modelling for a research data set pipeline

Dr Chris Jackett1, Ms Pamela Brodie1, Dr Simon Pigot1

1CSIRO, Castray Esplanade, Battery Point, Australia

Abstract:

The acquisition and management of research data is fundamental to the scientific process. However, the underlying value of data sets is rarely understood or analysed. There is a growing view in the research data management community that data sets should be treated as a core asset rather than a means to an end in the production of scientific publications. The inherent value of a research data set comprises many factors, including the operational costs of data acquisition, survey design and sampling methods, publication output and impact factor, as well as the quality of the data management practices used to achieve Findable, Accessible, Interoperable and Reusable (FAIR) data.


This work proposes a mixed data set valuation methodology designed to produce an estimate of the value of research data sets. The initial valuation model is broken down into three components: fiscal, academic and data. A custom algorithm is proposed to combine the various valuation metrics into a single estimate of the value range of a data set. This model could also be used to make future projections, providing an indication of how the value of a data set could be influenced by different data collection and management scenarios.
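The custom algorithm itself is not reproduced here; purely as an illustration of the idea, the sketch below combines three hypothetical component scores (fiscal, academic, data) into a value range using assumed weights and an assumed uncertainty band.

```python
# Illustrative sketch only: the weights, uncertainty band and scores are assumptions,
# not the paper's actual valuation algorithm.
def estimate_value_range(fiscal, academic, data_quality,
                         weights=(0.5, 0.3, 0.2), uncertainty=0.15):
    """Return a (low, high) value estimate from three component scores.

    fiscal       -- acquisition/operational cost component
    academic     -- publication output / impact component
    data_quality -- FAIR-maturity component
    """
    w_f, w_a, w_d = weights
    central = w_f * fiscal + w_a * academic + w_d * data_quality
    return central * (1 - uncertainty), central * (1 + uncertainty)

low, high = estimate_value_range(fiscal=250_000, academic=40_000, data_quality=60_000)
print(f"Estimated value range: {low:,.0f} - {high:,.0f}")
```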


The valuations and projections produced by this model would give a clearer understanding of the value of research data sets. This approach would provide a quantitative framework that could be used to inform scientific decision-making processes.


Biography:

Chris Jackett is a software engineer in the CSIRO Information and Data Centre who specialises in designing and developing software systems for research data management. Chris has previous experience developing remote sensing data acquisition systems, data storage solutions for drone-based multispectral imagery, computational systems for the optimisation of aquacultural planning, and web application development using modern frameworks, tools and techniques. His PhD research investigated a range of mathematical, statistical and computational mechanisms for improving the quality of recorded satellite data, including deconvolution and spatial resolution enhancement. Chris completed a Graduate Diploma in marine science through the joint CSIRO-UTAS Quantitative Marine Science (QMS) program which focused on a range of quantitative approaches to a wide variety of marine applications.


Implementation of active learning directed simulations to avoid biases and maximise efficiency

Dr Amanda J. Parker1, Dr Amanda S. Barnard2

1CSIRO, Docklands, Australia, 2Australian National University, Canberra, Australia

Abstract:

A typical challenge in science is deciding which experiments to run when resources are limited. We are using an active learning (AL) approach to make these decisions. We thereby aim to avoid potential biases and allow for a more efficient sampling of a broad parameter space. This AL implementation has broad cross-domain applicability for directing either simulations or experiments.

We use the AL method to investigate surface binding of small molecules on a complex surface (e.g. a nanoparticle) with density functional tight-binding (DFTB), performing the DFTB calculations within an AL loop. A machine learning model is updated after each site energy is calculated, and the uncertainty in the model is used to choose the next site to simulate. The efficiency of this approach is compared with that of random site selection, and the effects of updating hyperparameters are discussed.
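As a sketch of the loop described above (not the authors' actual code), the following fragment uses a scikit-learn Gaussian process as the surrogate model and a cheap stand-in function in place of the DFTB site-energy calculation; the descriptor, kernel and loop length are all assumptions.

```python
# Uncertainty-driven active learning loop. The DFTB site-energy calculation is
# replaced by a stand-in function; kernel and descriptors are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def site_energy(x):
    # Stand-in for an expensive DFTB calculation of the binding energy at site x.
    return np.sin(3 * x[0]) + 0.5 * x[0]

candidates = np.linspace(0, 2, 200).reshape(-1, 1)   # descriptor for each surface site
labelled_idx = [0, 100, 199]                          # a few initial calculations
energies = [site_energy(candidates[i]) for i in labelled_idx]

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)

for _ in range(20):
    gp.fit(candidates[labelled_idx], energies)         # update the surrogate model
    _, std = gp.predict(candidates, return_std=True)   # model uncertainty per site
    std[labelled_idx] = -np.inf                         # never re-pick calculated sites
    nxt = int(np.argmax(std))                           # most uncertain site goes next
    labelled_idx.append(nxt)
    energies.append(site_energy(candidates[nxt]))
```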


Biography:

Dr Amanda J. Parker is a Commonwealth Scientific and Industrial Research Organisation (CSIRO) Postdoctoral Fellow at Data61 in the Applied Machine Learning group. She completed her M.Sc. (Physics) at VUW in 2011 and immediately afterwards held a Distinguished Visiting Fellowship at IBM Almaden, awarded by the MacDiarmid Institute for Advanced Materials and Nanotechnology. She received her Ph.D. in Physics from the University of British Columbia and now combines her experience in multi-scale materials modelling, statistical physics and machine learning methods to develop artificial intelligence for materials science applications.

Using Web Architectures for Gigascale Metadata Syndication

Dr Jens Klump1, Mr Doug Fils2, Mr Jess Robertson4, Dr Anusuriya Devaraju3, Dr Adam Leadbetter5

1CSIRO, Kensington, Australia, 2Ocean Leadership, Washington, USA, 3University of Bremen, Bremen, Germany, 4Ministry of Business, Innovation and Employment, Wellington, New Zealand, 5Marine Institute, Oranmore, Ireland

Abstract:

Automation in data curation is going to make large volumes of data available. This increase in volume will also bring more variety in metadata. How can we best address the challenge of scaling metadata up to giga-scale while at the same time accommodating more variety?

The technologies used to syndicate metadata and repository catalogues were developed alongside the emergence of the internet, and the present-day mechanisms used for the dissemination of metadata in research data infrastructures are based on harvesting catalogues formatted in dialects of Extensible Markup Language (XML).

Indexing the internet at large led to the development of more lightweight encodings based on JavaScript Object Notation for Linked Data (JSON-LD). Commercial search engine operators introduced schema.org as a lightweight structured data format for metadata syndication, which has now become the basis of services like Google Dataset Search.

A JSON-LD representation of metadata that incorporates formal vocabularies allows machines to understand semantic descriptions of the metadata, giving access to the semantic web and to ways of encoding the context around data. This makes building a multi-domain network far easier, turning it largely into a web architecture exercise. In addition, the use of web architecture approaches means that third parties like Google, Bing, DataONE and others are free to access, use and provide offerings based on the open, well-known architecture.
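For illustration, a minimal schema.org Dataset description of the kind harvested as JSON-LD might look like the following sketch; the dataset name, identifier, URLs and publisher are placeholders, not a real record.

```python
# A minimal, hypothetical schema.org Dataset description of the kind that
# search engines harvest as JSON-LD; all names and URLs are placeholders.
import json

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example groundwater chemistry observations",
    "description": "Hydrogeochemical measurements from an example survey.",
    "identifier": "https://example.org/dataset/1234",
    "keywords": ["groundwater", "geochemistry"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "publisher": {"@type": "Organization", "name": "Example Data Repository"},
}

print(json.dumps(dataset, indent=2))
```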

This presentation will report on work done in a network of activities to make metadata available at giga-scale, and on experiments to test the supporting system architecture and gauge its operational costs in a cloud-native implementation.


Biography:

Jens Klump is a geochemist by training and leads the Geoscience Analytics Team in CSIRO Mineral Resources based in Perth, Western Australia. In his work on data infrastructures, Jens covers the entire chain of digital value creation from data acquisition to data analysis with a focus on data in minerals exploration. This includes automated data and metadata capture, sensor data integration, both in the field and in the laboratory, data processing workflows, and data provenance, but also data analysis by statistical methods, machine learning and numerical modelling.


Challenges in combining heterogeneous materials repositories and datasets into a homogeneous database

Dr Melisande Julia Fischer1, Dr Amanda Barnard1

1CSIRO Data61, Docklands, Australia

Abstract:

In materials science there are many different perspectives on, and sources of information about, specific materials: computationally or experimentally measured properties, experimental or simulated spectra and graphs, and extensive metadata on the software and conditions used for the experiments or on the computational methods, code and parameters. All of this information is stored in separate repositories and datasets across the world, making it challenging to access, combine and use the data. Each repository uses a different storage system, programming language and set of unique identifiers, so a given data point may exist in multiple places under different schemes. For a specific group of materials, namely perovskites, we have identified and merged some of these freely available repositories and stored the data in one database, converting this heterogeneous information into a homogeneous resource. The resulting database, based on JSON, is flexible and suitable for comprehensive analysis and machine learning.
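As a minimal sketch of this kind of normalisation (the source repositories, field names, mapping functions and property values are illustrative only, not the actual database schema), records from two hypothetical sources can be converted to one common JSON structure:

```python
# Sketch of merging records from two hypothetical source repositories into one
# common JSON schema; field names, mappings and values are illustrative only.
import json

def from_repo_a(rec):
    # Hypothetical repository A reports band gaps in eV under the key "Eg".
    return {"formula": rec["formula"], "band_gap_eV": rec["Eg"], "source": "repo_a"}

def from_repo_b(rec):
    # Hypothetical repository B nests the property under different keys.
    return {"formula": rec["material"]["composition"],
            "band_gap_eV": rec["properties"]["band_gap"],
            "source": "repo_b"}

records = [
    from_repo_a({"formula": "CsPbI3", "Eg": 1.73}),
    from_repo_b({"material": {"composition": "MAPbI3"},
                 "properties": {"band_gap": 1.55}}),
]

with open("perovskites.json", "w") as f:
    json.dump(records, f, indent=2)
```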


Biography:

Julia Melisande Fischer is part of the Applied Machine Learning group at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) Data61.

She completed her Bachelor and Master of Science in Chemistry at Ulm University, Germany, and afterwards received an International Postgraduate Scholarship from the University of Queensland (UQ). She graduated from the Australian Institute for Bioengineering and Nanotechnology (AIBN) at UQ with a Doctor of Philosophy in Chemistry in 2018. Her research combines sustainable energy applications, materials science, and data science.


Managing Data from Ingestion to an Institutional Repository with iRODS

Mr. Dave Fellinger1

1iRODS Consortium, Chapel Hill, United States

Abstract:

At the turn of the century, High Performance Computing (HPC) was largely focused on visualization and simulation. In the last few years this computing paradigm has changed a great deal because of the wide proliferation of sensor data that has become available. The required analytic processes demand the tracking and management of data at every step while retaining traceability and provenance. The Integrated Rule-Oriented Data System (iRODS) is open-source middleware designed to greatly simplify and automate these tasks while guaranteeing adherence to data management policies. A primary feature of iRODS is file system virtualization: data can be placed in, and discovered through, any file system, allowing system administrators to write rules that direct the placement of data to the most cost-efficient storage. Data can be migrated in real time as storage technologies evolve, allowing “future-proofing” of entire data infrastructures. Another feature of iRODS is the ability to build a defining metadata index throughout the lifecycle of the data, which can be user defined or extracted from the data or sensor. This index can be used to route data to the appropriate analysis processes. Finally, iRODS can control the entire data analysis process, including the distribution of relevant data products to federated sites, while guaranteeing reproducibility through audited policy enforcement. Manually managing huge quantities of data is a challenging task that can be completely automated through the use of iRODS technology.
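As a rough sketch of the ingest-and-annotate pattern described above (assuming the python-irodsclient library and a local iRODS zone; the connection details, paths and attribute names are hypothetical), a file can be registered and given searchable metadata as follows:

```python
# Minimal sketch using python-irodsclient: put a file into an iRODS zone and
# attach metadata that can later drive discovery and policy-based routing.
# Connection details, paths and attribute names are hypothetical.
from irods.session import iRODSSession

with iRODSSession(host="localhost", port=1247, user="rods",
                  password="rods", zone="tempZone") as session:
    logical_path = "/tempZone/home/rods/survey_2019/site42.dat"

    # Ingest the raw sensor file into the virtualized iRODS namespace.
    session.data_objects.put("site42.dat", logical_path)

    # Attach descriptive metadata to the data object.
    obj = session.data_objects.get(logical_path)
    obj.metadata.add("sensor_id", "site-42")
    obj.metadata.add("acquisition_date", "2019-08-01")
```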


Biography:

Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS Consortium. He has over three decades of engineering experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc. he focused on building an intellectual property portfolio and presenting the technology of the company at conferences with a storage focus worldwide.

In his role at the iRODS Consortium, Dave is working with users in research sites and high performance computer centers to confirm that a broad range of use cases can be fully addressed by the iRODS feature set. He helped to launch the iRODS Consortium and was a member of the founding board.

He attended Carnegie-Mellon University and holds patents in diverse areas of technology.

ABOUT AeRO

AeRO is the industry association focused on eResearch in Australasia. We play a critical coordination role for our members, who are actively transforming research via information technology. Organisations join AeRO to advance their own capabilities and services, to collaborate, and to network with peers. AeRO believes researchers and the sector benefit significantly from greater communication, coordination and sharing among the increasingly diverse and evolving service providers.
