Data Versioning: understanding patterns and developing principles for datasets used in ‘Understanding the Earth’

Dr Jens Klump¹, Dr Mingfang Wu², Mrs Gerry Ryder³, Mrs Julia Martin⁴, Dr Ben Evans⁵, Dr Lesley Wyborn⁵

¹CSIRO Mineral Resources, Perth, Australia; ²Australian Research Data Commons, Melbourne, Australia; ³Australian Research Data Commons, Adelaide, Australia; ⁴Australian Research Data Commons, Canberra, Australia; ⁵NCI, Canberra, Australia

 

To enable reproducibility of research results, a researcher must cite the exact dataset that underpinned their research, yet community-agreed, systematic data versioning practices are either not commonly implemented or, in some cases, impractical.

There is an urgent need to develop common data versioning practices that apply across the spectrum of datasets, from small datasets to large, dynamic ones such as petabyte-sized Earth systems data and Earth observation and geophysics datasets that evolve over time. Further, secondary datasets and data products that process or aggregate these datasets may be periodically updated. For these large and derived datasets, it is simply not feasible to persistently store a copy of the exact dataset that was used in a publication. Small datasets (e.g. CSV files) that undergo numerous changes also need documented, versioned data releases.

Versioning procedures and best practices are well established for scientific software. The code base of large software projects does bear some resemblance to large dynamic datasets. But are versioning practices for code also suitable for datasets or do we need a separate suite of practices for data versioning? How can we apply our knowledge of versioning code to improve data versioning practices?

Over the past two years, the Research Data Alliance Data Versioning Working Group has collected numerous use cases of data versioning practices and extracted data versioning patterns. In this presentation, we will present and discuss Working Group recommendations for data versioning practices.


Biography:

Jens Klump is a geochemist by training and Geoscience Analytics Team Leader in the Mineral Resources unit of CSIRO. His involvement in the development of publication and citation of research data through Digital Object Identifiers (DOI) sparked further work on research data infrastructures, such as enterprise data management systems and long-term digital archives.

Jens’ current work focuses on data in minerals exploration, looking at data capture and data analysis. This includes automated data and metadata capture, sensor data integration, both in the field and in the laboratory, data processing workflows, and data provenance, but also data analysis by statistical methods, machine learning and numerical modelling. Jens is the vice-president of the International Geo Sample Number Implementation Organization (IGSN). The organisation coordinates the development and introduction of persistent identifiers for physical specimens of research materials.

Jens earned degrees in geology and in oceanography from the University of Cape Town (UCT) and received his PhD in marine geology from the University of Bremen, Germany.

