IUCr Committee for the Maintenance of the CIF standard

Dictionary-driven aggregation of Data Blocks into Datasets

Version: 1.0

Authors: J Hester, J Bollinger, A Vaitkus

Date: March 2022

Status: Accepted

Introduction

Data described by CIF dictionaries can be spread between multiple data containers. In more complex scenarios, a number of choices exist as to how such data should be distributed between data containers. Choice is generally undesirable in standards, as it complicates the task of aligning reading and writing software. Therefore, these principles have been developed to describe how CIF writing software should distribute and describe data dispersed over multiple data containers.

For the purposes of this document the data container is assumed to be a CIF data block. Note, however, that these principles apply to any data presentation where data are encapsulated.

Definitions

Data blocks: A data block is a collection of data values , each of which is associated with a data name defined in a CIF dictionary such that these data items may be arranged into tables (“loops”) as described by the CIF dictionary.
Data sets: A data set is a collection of data blocks
Conformance: An aggregate conforms to a relational schema (like those described by CIF dictionaries) if the data contained can, in principle, be assigned unambiguously and consistently to cells in that relational schema
Appearance: A category appears in a data block if any data name belonging to it is either present, or is referred to via _name.linked_item_id of a data name that is present
Dictionary compatibility: Two dictionaries describing data blocks are compatible if, for all categories that appear in those data blocks, the dictionaries prescribe the same set of key data names
Data block compatibility: Data blocks are compatible if either (1) their dictionaries are compatible or (2) a unique value can be determined or assigned for any key data names that are absent from one of the incompatible dictionaries
Set categories: A Set category is a table for which only one row may be presented in a data block

Rules

Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive,directory).
Dictionary conformance of individual data blocks is specified either by audit_conform data names in each block, or as-yet-undefined dataset-level audit_dataset datanames in any block. Dataset-level dictionaries override block-level dictionaries.
If a data name appearing in a data block is not defined in the dictionary to which that data block conforms, it is considered absent for the purposes of these principles
The set of “Set” categories for a data block is determined by the dictionary to which that data block conforms
A “Set” category may only be aggregated from multiple data blocks if at least one key data name for the “Set” category has been provided in the dictionary to which those data blocks conform.
Where a “Set” category has been provided with a key data name in the dictionary as per (5), and that key data name is not itself a child data name of some other data name, all child data names must also be provided in the dictionary.
The value of “Set” category key data names for a given data block may be assigned arbitrarily if missing
Values for child data names of “Set” category key data names may always be elided.

Discussion

The guiding principle in designing these rules is “can we uniquely assign the values in a given data block into cells of the relational structure describing the whole dataset?”. Reasons for failure might include:

Not being able to determine a single relational schema for the dataset (contradictory dictionaries)
Not knowing the values of all the key data names in a row
Contradictory attribute values for the same key data name values

The above conditions exclude these failure modes while also allowing maximum leeway in unambiguous situations (e.g. a category has only one value for a single key data name for the whole dataset, so the value is clear even for data blocks that don’t explicitly allow for it in their dictionary).