The relational model and category theory as a basis for interoperable heterogeneous data repositories

Dr James Hester¹

¹ACNS, ANSTO, Kirrawee DC, Australia


To achieve smooth data interoperability within a given domain, a coherent, machine-actionable description of the semantic content of the heterogeneous file formats used in that domain is required. Typically, the semantic content associated with a given file format is expressed together with the format specification, and bespoke programming is required to incorporate a new or legacy format into a larger collection. Such programming must not only identify semantically equivalent quantities but also handle differences in organisation: for example, one format might distinguish “goniometer axes” from “detector axes”, while another has “axes” of type “goniometer” or “detector”. Such work can be minimised by first expressing arbitrary file contents in relational form (something which is always possible) and then identifying the resultant columns with definitions in a machine- and human-readable community ontology. By leveraging the data pullback and pushforward functors of category theory, knowledge of these links is sufficient to computationally transform arbitrary data into the arrangement chosen by the ontology, allowing data spread over heterogeneous files to be merged and presented as a uniform whole to, for example, web-based tools. As a further benefit, the relationally structured data allows the ontology to contain simple machine-actionable pseudo-code describing how to derive missing items from known information. Preliminary work based on crystallographic raw image standards is discussed.
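
The axes example above can be illustrated with a minimal Python sketch. It is not the implementation discussed in the talk: the table names, the column names, and the ontology names (`axis.id`, `axis.kind`, `axis.vector`) are all hypothetical, and an ordinary function stands in for the category-theoretic data migration functors. It only aims to show that, once each format is in relational form, a declarative set of column-to-ontology links is enough to bring differently organised data into one arrangement.

```python
# Format A (hypothetical): separate relations for goniometer and detector axes.
format_a = {
    "goniometer_axes": [
        {"id": "omega", "vector": (1, 0, 0)},
        {"id": "phi", "vector": (0, 1, 0)},
    ],
    "detector_axes": [
        {"id": "two_theta", "vector": (1, 0, 0)},
    ],
}

# Format B (hypothetical): a single relation whose rows carry a "type" column.
format_b = {
    "axes": [
        {"name": "omega", "type": "goniometer", "direction": (1, 0, 0)},
        {"name": "phi", "type": "goniometer", "direction": (0, 1, 0)},
        {"name": "two_theta", "type": "detector", "direction": (1, 0, 0)},
    ],
}

# Links from each format's columns to (hypothetical) ontology definitions.
# "constants" records information that a format encodes in its table
# layout rather than in a column.
LINKS_A = {
    "goniometer_axes": {
        "columns": {"id": "axis.id", "vector": "axis.vector"},
        "constants": {"axis.kind": "goniometer"},
    },
    "detector_axes": {
        "columns": {"id": "axis.id", "vector": "axis.vector"},
        "constants": {"axis.kind": "detector"},
    },
}
LINKS_B = {
    "axes": {
        "columns": {
            "name": "axis.id",
            "type": "axis.kind",
            "direction": "axis.vector",
        },
        "constants": {},
    },
}


def to_ontology(data, links):
    """Rearrange relational data into the ontology's single 'axis' relation."""
    rows = []
    for table, spec in links.items():
        for row in data.get(table, []):
            # Rename each source column to its ontology definition.
            out = {target: row[source] for source, target in spec["columns"].items()}
            # Make information implied by the table layout explicit.
            out.update(spec["constants"])
            rows.append(out)
    # Sort on the key column so results do not depend on source ordering.
    return sorted(rows, key=lambda r: r["axis.id"])


axes_a = to_ontology(format_a, LINKS_A)
axes_b = to_ontology(format_b, LINKS_B)
```

Under these assumptions both formats yield the same rows, so downstream tools only need to understand the ontology's arrangement, and supporting a further format means writing another links table rather than new transformation code.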


Dr James Hester has been involved for many years in data standardisation through work with the crystallographic community on the Crystallographic Information Framework (CIF), and serves as chair of the IUCr committee for the maintenance of the CIF standards. He is the creator and maintainer of the PyCIFRW Python package and, more recently, of a Julia package for handling CIF files. He has worked for many years as a powder diffraction instrument scientist at ANSTO.



