Towards a research data management and analytics strategy

Professor Matt Bellgard1

1Queensland University of Technology, Brisbane QLD

 

Universities generate considerable amounts of research data. Associated with this data are the five ‘Vs’: Volume, Variety, Velocity, Veracity and Value, which must be continuously taken into consideration. Invariably, the combination of the Vs for any given research endeavour determines how best to deliver the research outputs and outcomes, as well as how to manage the data at an institutional level, addressing archiving, compliance, potential reuse for other research or for teaching purposes, and so forth. Institutions are therefore challenged with defining, shaping and refining their research data management strategy across a plethora of research data challenges, to ensure consistent and adequate research data management practices are in place. FAIR data principles are very important for sharing data and for new collaborative open data opportunities, but research data management practices typically need to be established in a robust way first. In addition, new technology options allow institutions to make considered choices between ‘on prem’, private Cloud and public Cloud infrastructure; if a hybrid approach is adopted, its impact on the established institutional research data management strategy needs to be determined. In this talk, I will provide an overview of the approaches being adopted at both an institutional level and at a collaborative research project level. These approaches feature an analytics-centric focus rather than a data capture mindset.


Biography:

Professor Bellgard is the inaugural eResearch Director at Queensland University of Technology and Chairs QUT’s Research Data Management Strategy Implementation Group. He has attracted over AUD$44m in research funding, is co-inventor of 5 patents and has co-authored over 140 articles. He has led the design and development of a diverse range of research data management approaches as well as digital health solutions, particularly for rare diseases, for government, industry and academia across multiple jurisdictions. He is also Chair of the APEC Life Science Innovation Forum Rare Disease Network.

Scaling Physical Sample Identifiers across all Research Domains within the Research Ecosystem

Jens Klump1, Kerstin Lehnert2, Sarah Ramdeen3, Lesley Wyborn4

1CSIRO, Kensington, WA, Australia

2Columbia University, Palisades, New York, USA

3Ronin Institute, Huntsville, Alabama, USA

4Australian National University, Canberra, ACT, Australia

 

Samples taken from nature or produced in laboratory experiments have always been at the heart of scientific research. Over the past two centuries, we have collected hundreds of millions of samples, and we are still collecting more. However, while infrastructures for scientific literature and data have evolved into a networked and searchable research information ecosystem, online access to sample information has lagged far behind, and often we cannot even unambiguously identify which samples were the basis of which dataset and publication.

In the geosciences, the International Geo Sample Number Implementation Organization (IGSN e.V.) has built a persistent identifier and catalogue infrastructure that gives access to millions of sample records. The underlying infrastructure can be used in other science disciplines such as biology, archaeology and materials science. Scaling this infrastructure to billions of samples, interconnected with a comparable number of datasets and their related publications, requires a redesign of both the organisational model and the technical architecture of current persistent identifier infrastructures. Growing the scale of persistent identifier systems also needs coordination across the key identifier systems such as ORCID, DataCite and Crossref. This contribution will present the current state of the discussion on how to link physical samples to the research ecosystem and the record of science, and how to ensure attribution to those who originally collected the samples.
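
For illustration only: the sketch below shows how a persistent sample identifier typically connects a physical sample to its online record, by following the igsn.org resolver to the allocating agent's landing page. The resolver behaviour is an assumption about common practice, not part of the presented work, and the example identifier is a placeholder.

    # Illustrative sketch: resolve an IGSN via the igsn.org resolver and
    # return the landing page it redirects to. Placeholder identifier only.
    import urllib.request

    def resolve_igsn(igsn: str) -> str:
        """Follow the resolver redirect and return the landing page URL."""
        with urllib.request.urlopen(f"http://igsn.org/{igsn}") as response:
            return response.geturl()  # final URL at the allocating agent

    # Example usage (requires a real, resolvable IGSN):
    # print(resolve_igsn("XXXXXXXXX"))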


Biography:

Jens Klump is a geochemist by training and Geoscience Analytics Team Leader in the Mineral Resources unit of CSIRO. Jens holds a PhD in Marine Geology from the University of Bremen in Germany. His involvement in the development of publication and citation of research data through Digital Object Identifiers (DOI) sparked further work on research data infrastructures. Jens’ current work focuses on data in minerals exploration, both from a data analysis and from a data logistics perspective. Jens is the Vice-President of the IGSN e.V. and the Vice-President of the Earth and Space Sciences Division of the European Geosciences Union.

Enabling software to be FAIR and accessible as a reproducible Service: The AuScope Scientific Software Solution Centre

Mr Geoffrey Squire1, Dr Carsten Friedrich1, Dr Jens Klump1, Mr Ryan Fraser1, Mr Nigel Rees2, Dr Lesley Wyborn2, Dr Mingfang Wu3

1CSIRO, Canberra, Australia, 2NCI, Canberra, Australia, 3ARDC, Canberra, Australia

 

Software plays a vital role in today’s research. Increasingly, publishers require that software supporting publications be Findable, Accessible, Interoperable and Reusable (FAIR): it is no longer acceptable for software to be accessible only through version control repositories such as Git and Subversion.

The AuScope Scientific Software Solution Centre (SSSC) provides a sophisticated environment where scientific software can be: 1) published, 2) discovered, 3) shared with collaborators, and 4) described for automated execution. The current process of registering software at the SSSC captures description, license, versioning and citation information in a standardised metadata record and adds a machine-readable description of the software environment required to run it. With these elements, software from the SSSC can be chained together into workflows in virtual research environments, and a workflow can be reused for any scientific problem that requires a similar solution. The SSSC allows researchers to publish descriptions of the scientific problems that the software helps to solve and to publish solutions built with the software. The registry also has a configurable publication process with the option of peer review, and the ability for users to cryptographically sign submissions and establish a level of trust in the published software and solutions.
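
As a rough illustration of what such a machine-readable registration record might contain (the field names and values below are assumptions for illustration, not the SSSC schema), an entry could look like this:

    # Hypothetical software registration record; all fields are illustrative
    # assumptions rather than the actual SSSC metadata schema.
    solution_entry = {
        "name": "example-inversion-code",      # assumed software name
        "version": "1.2.0",
        "license": "Apache-2.0",
        "citation": {"doi": "10.xxxx/example", "authors": ["A. Researcher"]},
        "environment": {                       # machine-readable runtime description
            "container_image": "docker://example.org/inversion:1.2.0",
            "entry_point": "run_inversion.py",
            "inputs":  [{"name": "gravity_grid",  "format": "netCDF"}],
            "outputs": [{"name": "density_model", "format": "netCDF"}],
        },
    }

A record of this kind carries enough information for a virtual research environment to cite the software and launch it as one step of a larger workflow.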

To increase discoverability, attribution and citation of software in the SSSC through standard services and registration in other catalogues and registries, the Geoscience DeVL has worked with the Australian Research Data Commons (ARDC) to adopt software citation principles as recommended by the FORCE11 Software Citation Recommendation and the Software Citation Implementation Group.


Biography:

To be confirmed

Putting the “R” into FAIR. Licensing research data for reuse and recognition

Dr Adrian Burton2, Mr Baden Appleyard3, Dr Gregory Laughlin2, Ms Gerry Ryder1

1Australian Research Data Commons, Glen Osmond, Australia, 2Australian Research Data Commons, Canberra, Australia, 3Consultant, Brisbane, Australia

 

The concept of FAIR (Findable, Accessible, Interoperable & Reusable) has become ubiquitous across the research sector, yet not all aspects are well understood. Issues around copyright and licensing of data can seem confusing, and few people realise that only research data that is licensed may be legally reused.

Copyright and licensing for research data can be complicated. Apart from legal ownership, other factors such as policy and business requirements, relationships and norms can impact on data licensing decisions. For example, grant funding agreements may require a certain licence to be applied to research data outputs, or, in some cases, expectations or norms in a particular field of study will impact on licensing decisions. Equally important, or perhaps more so, is the need to maximise the potential for reuse of data to support innovation and new discoveries.

Are you missing out on opportunities for collaboration and attribution by releasing data without a licence? Are you assigning the most appropriate licence to your data outputs?  Do you know when and how you can reuse data created by others? Does your data facility have policy and procedures to support data licensing?

This presentation will be of interest to those wanting a ‘no jargon’ introduction to copyright and licensing for research data.  Learn about the ARDC Research Data Rights Management Guide that describes decision support tools and licensing frameworks that can be applied by data owners, data re-users and those providing access to data through repositories and e-research facilities.


Biography:

Dr Gregory Laughlin is Principal Policy Advisor with the Australian Research Data Commons.

CSIRO’s Research Data Planner Pre-Release Taste Test

Mr Dominic Hogan1, Ms Katie Hannan2, Ms Sue Cook3

1CSIRO, Brisbane, Australia, 2CSIRO, Adelaide, Australia, 3CSIRO, Perth, Australia

 

Formal research data management plans are a requirement coming to a funding, grant, or project proposal near you! Journals are increasingly requiring the release of data and software with publications. Forewarned is forearmed: don’t get caught in a tangle of issues with licensing, intellectual property, ethics and legalities (just to name a few).

You’re on board, right?  You want this to all run smoothly and avoid any unpleasant surprises, but where do you start?  CSIRO has been working on a software tool to help with just that.  The Research Data Planner will guide you through what you need to address, what you don’t have to worry about, and help you to produce a mouth-watering planning document you can provide with your funding application.

We’re releasing this in two months, and we need your help.  Come and try the tool, tell us if we’ve got the flavour right, tell us what you love, tell us what you hate, don your white hat and tell us what ingredients we are missing!  While mostly of interest to CSIRO researchers who wind up using the software, we’d love to hear from anyone interested.  Come break our hearts (we’ll bring the ice-cream) and then tell us everything will be okay.


Biography:

Dominic Hogan is a Business Analyst at the Commonwealth Scientific and Industrial Research Organisation.  He worked as a data librarian supporting research across CSIRO, being heavily involved in development work for CSIRO’s Data Access Portal (DAP).  He has supported work in various research domains, including terrestrial ecology, marine research, computer visualisation and materials science.  Recently he has worked on implementing a recommender system for research datasets in CSIRO’s DAP, and the DAP’s new user interface.  Currently he works on CSIRO’s Research Data Planner.

Katie Hannan is a Data Librarian at the Commonwealth Scientific and Industrial Research Organisation. She is passionate about storytelling, cultural history projects, linking people with information and helping to facilitate learning experiences. Katie has a background working in higher education, eResearch and projects. Her own research interests are in the areas of human computer interaction, digital legacy and information society. She is currently working on CSIRO’s Research Data Planner.

A data quality framework for high performance datasets

Dr Kelsey Druken1, Dr Ben Evans1, Sean Pringle1, Kashif Gohar1, Dr Nigel Rees1, Clare Richards1, Dr Jingbo Wang1

1NCI Australia, Canberra, Australia

 

In the next few years, there will be a major increase in computational power through upgraded HPC systems and further uptake of cloud-based platforms. At the same time, an enormous amount of new digital data will come online from many science domains. However, the two do not simply come together: in many cases, the data needs to be better organised to make it more tractable to process at scale, and to make it programmatically accessible for a broader range of use cases.

Over the last several years, NCI has focused on improving computational access to some major national reference geospatial datasets. The data at NCI is also used extensively by the wider community via remote access data services, including server-side data processing that utilises the co-location of data and computational processing power. With the data too big or too complex to move, we are now in the post-download era.
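
For illustration, the sketch below shows what such programmatic remote access can look like from a researcher's point of view, assuming an OPeNDAP-style data service; the endpoint URL and variable name are placeholders, not a specific NCI service.

    # Minimal sketch: open a remote dataset lazily so only the requested
    # slice crosses the network (assumes an OPeNDAP-compatible endpoint).
    import xarray as xr

    def monthly_mean(opendap_url: str, variable: str, month: str) -> float:
        """Open a remote dataset lazily and transfer only the requested slice."""
        ds = xr.open_dataset(opendap_url)          # no bulk download
        subset = ds[variable].sel(time=month)      # server-side subsetting
        return float(subset.mean())

    # Example usage against a live endpoint (URL and variable are placeholders):
    # monthly_mean("https://example.org/thredds/dodsC/ref/dataset.nc",
    #              "temperature", "2015-01")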

The challenge is to ensure this data is of sufficient quality to be usable and interoperable across a range of techniques and multiple domains: this necessitates an increased focus on the “FAIR data” principles (Findable, Accessible, Interoperable and Reusable). FAIR is underpinned by concerted international efforts to develop community-agreed standards that enable seamless programmatic access to data in high performance environments across multiple domains.

While this places additional requirements on the suppliers of both the data and the metadata, the result is that the data becomes even more accessible for primary and secondary use, and more readily citable through publication processes.


Biography:

Kelsey Druken manages the petascale data repository at NCI and her interests lie in data management, services and informatics. Prior to joining NCI in 2015, Kelsey was a researcher at the Research School of Earth Sciences at the Australian National University in Canberra and a postdoctoral fellow at the Carnegie Institution for Science in Washington, DC. She holds a PhD in Oceanography from the University of Rhode Island.

Australia’s Marine National Facility: A floating sensor platform for big data

Ms Katherine Tattersall1, Dr Chris Jackett1, Mr Ian Hawkes1

1CSIRO Oceans & Atmosphere, Battery Point, Australia

 

Australia’s Marine National Facility (MNF) is hosted by CSIRO Oceans & Atmosphere and manages the operation of the state-of-the-art RV Investigator, an Australian government blue-water research vessel dedicated to supporting Australia’s atmospheric, oceanographic, biological and geosciences research. The RV Investigator is equipped with a multitude of sensors that map the ocean floor, measure and sample the water column and collect atmospheric measurements. The vessel is also a platform for a vast array of other marine research instruments and equipment. Ship time aboard the MNF is available to researchers Australia-wide through an annual application process.

Managing this large working platform is a complex task. One major challenge is to efficiently, elegantly and robustly handle the many data streams captured by sensors and equipment and to quickly make high quality data available to researchers. We follow streamlined data management processes including on-board data aggregation and metadata capture, secure end-of-voyage data storage and archiving, publicly available data catalogues and portals and a team dedicated to data acquisition and processing. Data management and distribution is the responsibility of the O&A Information and Data Centre (IDC). The MNF follows open data objectives as outlined in the FAIR principles (Findable, Accessible, Interoperable and Reusable) and is an advocate of data and metadata standards and associated software tools and processes.

This poster illustrates important elements of the current MNF/CSIRO data workflow from acquisition to publication. Key processes and data access points are highlighted to convey the broad capabilities of the MNF. More information is online at https://research.csiro.au/oa-idc/marine-national-facility-datasets/.


Biography:

Katherine Tattersall is a research data specialist and data architect with over a decade of experience in the marine data domain and a firm grounding in the elements of meticulous and innovative research data management more broadly. Her background in marine physical, ecosystem, fisheries and geospatial research equipped her with an understanding of what researchers need from data infrastructure and tools.

Chris Jackett is a software engineer with experience developing scientific data processing systems. He has a background in marine science and remote sensing, and has worked on data storage solutions for drone-based multispectral imagery, computational systems for the optimisation of aquacultural planning, and web application development using modern JavaScript frameworks, tools and techniques.

Ian Hawkes manages the Marine National Facility research vessel RV Investigator’s Information and Communications Technology (ICT) systems, providing seagoing computing support, operating the vessel’s various scientific data acquisition systems, and overseeing quality control and processing of key datasets and the delivery of associated data products. Ian has an Honours degree in Physics and three decades of experience as a Systems and Software Engineer. He has worked on all phases of projects, including proposals, identifying and analysing requirements, object-oriented design, coding in C++, testing and documentation.

 

Linking Data, Samples, Software, Texts – Publishing Services at GFZ German Research Centre for Geosciences

Dr Kirsten Elger1, Damian Ulbricht1, Roland Bertelmann1

1GFZ German Research Centre For Geosciences, Potsdam, Germany

 

Research products range from “classical” textual manuscripts to the data, samples and software underlying scholarly publications. Data and software publications with an assigned digital object identifier (DOI), comprehensive metadata, and documentation readable by humans and machines are best practice for FAIR data.

GFZ Data Services is a research data repository for the Geosciences domain, hosted at the GFZ German Research Centre for Geosciences in Potsdam. Datasets published via GFZ Data Services range from large dynamic datasets from data-intensive global monitoring networks and observatories with real-time acquisition, to satellite data, to the full suite of highly heterogeneous datasets collected by individual researchers or small teams (“long-tail data”). Data and software are published with a DOI and complemented by descriptive documents or Data Reports. Furthermore, GFZ is an allocating agent for the International Geo Sample Number (IGSN), the globally unique persistent identifier for physical samples, which provides discovery of digital sample descriptions via the internet.

A major focus of the data curation is to provide comprehensive data descriptions via standardised and machine-readable metadata, including the use of controlled vocabularies, and to carefully cross-reference the different research products and the people and institutions involved via their metadata, using persistent identifiers (DOI, IGSN, ORCID, FundRef). Embedding Schema.org markup in DOI landing pages allows discovery through search engines such as Google Dataset Search, while Scholix allows data publications and scholarly literature to be linked, even when the data were published years after the article.
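
A minimal sketch of such Schema.org dataset markup is shown below; the values are placeholders rather than a real GFZ record, and the snippet simply prints the JSON-LD that a landing page would embed.

    # Minimal, hypothetical schema.org/Dataset description of the kind
    # embedded in a DOI landing page; all values are placeholders.
    import json

    dataset_jsonld = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "Example geophysical monitoring dataset",
        "identifier": "https://doi.org/10.xxxx/example",
        "creator": [{"@type": "Person",
                     "name": "J. Researcher",
                     "identifier": "https://orcid.org/0000-0000-0000-0000"}],
        "license": "https://creativecommons.org/licenses/by/4.0/",
    }

    # Landing pages embed this in a <script type="application/ld+json"> block,
    # which search engines such as Google Dataset Search can harvest.
    print(json.dumps(dataset_jsonld, indent=2))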


Biography:

Kirsten Elger received her Ph.D. degree from the Free University Berlin in 2003, with field-based, interdisciplinary work on the tectonics of the Altiplano Plateau in the Southern Central Andes. Following several years in private companies and periods of maternity leave, she returned to scientific work in 2010 at the Alfred Wegener Institute for Polar and Marine Research. She was responsible for ground-data management for the evaluation of remote sensing products and was data steward for two EU-funded projects (PAGE21, INTERACT). Since 2014 she has been head of the research data repository GFZ Data Services at the GFZ German Research Centre for Geosciences. She has developed the data repository into an internationally recognised place for the publication of citable research data and the unique identification of physical samples with International Geo Sample Numbers (IGSN). Furthermore, she is a hub for research data management at GFZ and represents GFZ in national and international contexts, e.g. within the Research Data Alliance (RDA).

The relational model and category theory as a basis for interoperable heterogeneous data repositories

Dr James Hester1

1ACNS, ANSTO, Kirrawee DC, Australia

 

To achieve smooth data interoperability within a given domain, a machine-actionable and coherent description of the semantic content of the heterogeneous file formats used in that domain is required. Typically, the semantic content associated with a given file format is expressed together with the format specification, and bespoke programming is required to incorporate a new or legacy format into a larger collection. Such programming must not only deal with identification of semantically equivalent quantities, but also handle differences in organisation – for example, one format might distinguish “goniometer axes” and “detector axes” while another has “axes” which have type “goniometer” or “detector”. Such work can be minimised by first expressing arbitrary file contents in relational form (something which is always possible) and then identifying the resultant columns with definitions in a machine- and human-readable community ontology. By leveraging data pullback and pushforward functors from category theory, knowledge of these links is sufficient to computationally transform arbitrary data into the arrangement chosen by the ontology, allowing data spread over heterogeneous files to be merged and presented as a uniform whole to, for example, web-based tools. As a further benefit, the relationally structured data allows the ontology to contain simple machine-actionable pseudo-code describing how to manipulate known information to derive missing items. Preliminary work based on crystallographic raw image standards is discussed.
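
As a toy sketch of the relational first step described above (the layouts and field names are invented for illustration, not a specific file format), two different arrangements of the same axis information can be flattened into one relational shape and then merged:

    # Toy example: two hypothetical file layouts for instrument axes are
    # flattened into the same relation (axis_name, axis_type), so downstream
    # tools can treat them uniformly once the columns are tied to an ontology.
    format_a = {"goniometer_axes": ["omega"], "detector_axes": ["two_theta"]}
    format_b = {"axes": [{"name": "omega",     "type": "goniometer"},
                         {"name": "two_theta", "type": "detector"}]}

    def to_relation_a(record):
        rows = [(name, "goniometer") for name in record["goniometer_axes"]]
        rows += [(name, "detector") for name in record["detector_axes"]]
        return rows

    def to_relation_b(record):
        return [(axis["name"], axis["type"]) for axis in record["axes"]]

    # Both layouts now present identical relations and can be merged.
    assert sorted(to_relation_a(format_a)) == sorted(to_relation_b(format_b))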


Biography:

Dr James Hester has been involved for many years in data standardisation through work with the crystallographic community on the Crystallographic Information Framework (CIF) and serves as the chair of the IUCr committee for the maintenance of the CIF standards. He is the creator and maintainer of the PyCIFRW Python package and more recently a Julia package for handling CIF files. He has worked for many years as a powder diffraction instrument scientist at ANSTO.

 

Data Versioning: understanding patterns and developing principles for datasets used in ‘Understanding the Earth’

Dr Jens Klump1, Dr Mingfang Wu2, Mrs Gerry Ryder3, Mrs Julia Martin4, Dr Ben Evans5, Dr Lesley Wyborn5

1CSIRO Mineral Resources, Perth, Australia, 2Australian Research Data Commons, Melbourne, Australia, 3Australian Research Data Commons, Adelaide, Australia, 4Australian Research Data Commons, Canberra, Australia, 5NCI, Canberra, Australia

 

To enable reproducibility of research results, a researcher must cite the exact dataset that underpinned their research, yet community-agreed, systematic data versioning practices are either not commonly implemented or are impractical in some cases.

There is an urgency to develop common data versioning practices that apply across a spectrum of datasets – from small datasets to large dynamic datasets such as petabyte-sized data from Earth systems, and dynamic Earth observation and geophysics datasets that evolve over time. Further, secondary datasets and data products that either process or aggregate these datasets may be periodically updated. For these, it is simply not feasible to persistently store a copy of the exact dataset that was used in a publication. Small datasets (e.g. CSV files) that undergo numerous changes also need documented and versioned data releases.

Versioning procedures and best practices are well established for scientific software. The code base of large software projects does bear some resemblance to large dynamic datasets. But are versioning practices for code also suitable for datasets or do we need a separate suite of practices for data versioning? How can we apply our knowledge of versioning code to improve data versioning practices?

Over the past two years, the Research Data Alliance Data Versioning Working Group has collected numerous use cases of data versioning practices and extracted data versioning patterns. In this presentation, we will present and discuss Working Group recommendations for data versioning practices.


Biography:

Jens Klump is a geochemist by training and Geoscience Analytics Team Leader in the Mineral Resources unit of CSIRO. His involvement in the development of publication and citation of research data through Digital Object Identifiers (DOI) sparked further work on research data infrastructures, such as enterprise data management systems and long-term digital archives.

Jens’ current work focuses on data in minerals exploration, looking at data capture and data analysis. This includes automated data and metadata capture, sensor data integration, both in the field and in the laboratory, data processing workflows, and data provenance, but also data analysis by statistical methods, machine learning and numerical modelling. Jens is the vice-president of the International Geo Sample Number Implementation Organization (IGSN). The organisation coordinates the development and introduction of persistent identifiers for physical specimens of research materials.

Jens earned degrees in geology and in oceanography from the University of Cape Town (UCT) and received his PhD in marine geology from the University of Bremen, Germany.

ABOUT AeRO

AeRO is the industry association focused on eResearch in Australasia. We play a critical coordination role for our members, who are actively transforming research via information technology. Organisations join AeRO to advance their own capabilities and services, to collaborate and to network with peers. AeRO believes researchers and the sector benefit significantly from greater communication, coordination and sharing among the increasingly diverse and evolving service providers.
