CIF parsing

The CIF API provides a parser for CIF 2.0, flexible enough to parse CIF 1.1 documents and those complying with earlier CIF conventions as well.

CIF versions

To date there have been two formal specifications for the CIF file format, v1.1 and v2.0, and a body of less formal practice predating both. Each specification so far has introduced incompatibilities with preceding practice, so correct parsing depends on correctly identifying the version of CIF with which a given document is intended to conform.

Early CIF

Before the release of the CIF 1.1 specifications, the CIF format was defined by Hall, Allen, and Brown's 1991 paper (Acta Cryst A47, 655-685). It somewhat loosely defines CIF as being composed of "ASCII text", arranged in lines not exceeding 80 characters, with a vague, mostly implicit, sense of some kind of whitespace separating syntactic units (block headers, data names, etc.). The paper makes a point that CIF files are easily readable and suitable for creation or modification via a text editor. There were no save frames, and there was some confusion about exactly which characters were allowed (in particular, about which of the C0 controls were allowed, and what they meant). The vertical tab and form feed characters were accepted by most CIF processors, and the prevailing practice was to treat the latter as a line terminator.

During this time, follow-up papers refined some of the initial CIF ideas, and ultimately the IUCr attempted to bring a little more order to the early CIF world. A significant step was limiting the allowed characters in a CIF to the printable subset of those defined by (7-bit) US-ASCII, plus the tab, vertical tab, form feed, carriage return, and line feed control characters (albeit not necessarily encoded according to ASCII). This period was also marked by the development of semantic conventions, on top of basic CIF, for expressing information such as a limited set of non-ASCII characters and logical lines longer than the 80-character limit.

CIF 1.1

The CIF 1.1 specifications were initially drafted in 2002, as a more formal treatment of the language and to address some of the issues that the first decade and more of CIF practice had uncovered. This revision deleted the vertical tab and form feed characters from the allowed set and extended the line-length limit to 2048 characters. It also, among other things, imposed a 75-character limit on data name, block code, and frame code lengths; allowed save frames (which meanwhile had been added to STAR, along with global_ sections), but not save frame references; formally excluded global_ sections (and reserved the "global_" keyword); and restricted the characters that may start an unquoted data value. Significantly, it also introduced the leading CIF version comment, though it is optional in CIF 1.1. With few exceptions, CIF 1.1 did not invalidate pre-existing CIFs.
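The character-set and length limits described above are easy to check mechanically. The following is a minimal sketch using only the standard C library; the function names are hypothetical and are not part of the CIF API, and the sketch assumes that each line is passed without its terminator and that the 75-character limit applies to the name exactly as the caller passes it.

```c
#include <stddef.h>
#include <string.h>

#define CIF11_MAX_LINE 2048  /* CIF 1.1 line-length limit */
#define CIF11_MAX_NAME 75    /* limit on data names, block codes, and frame codes */

/* Returns 1 if c belongs to the CIF 1.1 character set: printable ASCII plus
 * tab, carriage return, and line feed (no vertical tab or form feed). */
static int cif11_char_allowed(unsigned char c) {
    return (c >= 0x20 && c <= 0x7E) || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper: returns 0 if the line satisfies the limits, otherwise
 * the 1-based column of the first disallowed or excess character. */
size_t cif11_check_line(const char *line, size_t length) {
    size_t i;

    for (i = 0; i < length; i += 1) {
        if (i >= CIF11_MAX_LINE || !cif11_char_allowed((unsigned char) line[i])) {
            return i + 1;
        }
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the name is within the length limit. */
int cif11_name_length_ok(const char *name) {
    return strlen(name) <= CIF11_MAX_NAME;
}
```

A checker of this kind accepts a form feed in no position, which is one of the few ways in which a pre-1.1 CIF can fail to be a valid CIF 1.1 document.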

CIF 2.0

The primary impetus for another CIF revision arose from the concept of adding "methods" to CIF dictionaries, while continuing to express those dictionaries in CIF format. This necessitated support for new data types (list and table) in CIF, and the opportunity was taken to also extend CIF's character repertoire to the whole Unicode space. CIF 2.0 also solves the longstanding issue that, even with characters restricted to 7-bit ASCII, there are data that cannot be expressed as CIF 1.1 values.

The cost of these changes is a considerably higher level of incompatibility between CIF 2.0 and CIF 1.1 than between CIF 1.1 and previous CIF practice, as reflected by incrementing CIF's major version number. Some data value representations known to appear in well-formed CIF 1.1 documents are not well-formed CIF 2.0; moreover, CIF 2.0 specifies that instance documents must have the form of Unicode text encoded according to UTF-8, which, ironically, makes it a binary format -- albeit one that can do a good impression of a text format on many systems. To assist in sorting this out, CIF 2.0 requires an appropriate version comment to be used in well-formed documents (itself a minor incompatibility).
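The version comment is what makes dialect detection tractable. As an illustration only, the following sketch inspects the start of a byte buffer for the "#\#CIF_2.0" or "#\#CIF_1.1" version comment. It is not the CIF API's own detection logic; the enum and function names are hypothetical, and the sketch assumes that an optional UTF-8 byte-order mark may precede the comment and that the comment must be followed by whitespace or the end of the input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch */
enum sniffed_version { SNIFF_UNMARKED = 0, SNIFF_CIF_1_1 = 1, SNIFF_CIF_2_0 = 2 };

/* Guesses the CIF version from a leading version comment, if there is one. */
enum sniffed_version sniff_cif_version(const char *bytes, size_t length) {
    static const char bom[] = "\xEF\xBB\xBF";     /* UTF-8 encoding of U+FEFF */
    static const char magic20[] = "#\\#CIF_2.0";
    static const char magic11[] = "#\\#CIF_1.1";
    const size_t magic_len = sizeof(magic20) - 1;  /* both comments have the same length */

    /* Skip an optional byte-order mark */
    if (length >= 3 && memcmp(bytes, bom, 3) == 0) {
        bytes += 3;
        length -= 3;
    }
    if (length < magic_len) {
        return SNIFF_UNMARKED;
    }
    /* Anything other than whitespace directly after the comment disqualifies it */
    if (length > magic_len) {
        char next = bytes[magic_len];

        if (next != ' ' && next != '\t' && next != '\r' && next != '\n') {
            return SNIFF_UNMARKED;
        }
    }
    if (memcmp(bytes, magic20, magic_len) == 0) {
        return SNIFF_CIF_2_0;
    }
    if (memcmp(bytes, magic11, magic_len) == 0) {
        return SNIFF_CIF_1_1;
    }
    return SNIFF_UNMARKED;
}
```

An unmarked document might be CIF 1.1 or earlier practice, since the comment is optional there, so a caller would still need a policy for that case.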

The CIF API parsing interface

The main interface provided for CIF parsing is the cif_parse() function, which reads CIF text from a standard C byte stream, checks and interprets the syntax, and optionally records the parsed results in an in-memory CIF object. A variety of options and extension points are available via the options argument to the parse function, falling into two broad categories:

options controlling the CIF dialect to be parsed, and
callback functions by which the caller can obtain dynamic information about the progress of the parse, and exert influence on parsing behavior.

These afford a great deal of parsing flexibility, some dimensions of which are explored in the following sections.

Basic parsing

Historically, a majority of CIF 1.1 parsers have operated by parsing the input into some kind of in-memory representation of the overall CIF, possibly, but not necessarily, independent of the original file. The cif_parse() function operates in this way when its third argument points to a location for a CIF handle:

void traditional(FILE *in) {
    cif_tp *cif = NULL;

    cif_parse(in, NULL, &cif);
    /* results are available via 'cif' if anything was successfully parsed */
    cif_destroy(cif);  /* safe even if 'cif' is still NULL */
}

By default, however, the parser stops at the first error it encounters. Because many CIFs have historically contained at least minor errors, it may be desirable to instruct the parser to attempt to push past all or certain kinds of errors, extracting a best-guess interpretation of the remainder of the input. Such behavior can be obtained by providing an error-handling callback function of type matching cif_parse_error_callback_tp. Such a function serves not only to control which errors are bypassed, but also, if so written, to pass on details of each error to the caller. For example, this code counts the number of CIF syntax and semantic errors in the input CIF:

int record_error(int error_code, size_t line, size_t column, const UChar *text, size_t length, void *data) {
    *((int *) data) += 1;
    return CIF_OK;
}

void count_errors(FILE *in) {
    cif_tp *cif = NULL;
    int num_errors = 0;
    struct cif_parse_opts_s *opts = NULL;

    cif_parse_options_create(&opts);
    opts->error_callback = record_error;
    opts->user_data = &num_errors;
    cif_parse(in, opts, &cif);
    free(opts);
    /*
     * the parsed results are available via 'cif'
     * the number of errors is available in 'num_errors'
     */
    /* ...  */
    cif_destroy(cif);
}

For convenience, the CIF API provides two default error handler callback functions, cif_parse_error_die() and cif_parse_error_ignore(). As their names imply, the former causes the parse to be aborted on any error (the default behavior), whereas the latter causes all errors to silently be ignored (to the extent that is possible).
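Between those two extremes, a custom handler can tolerate a bounded number of errors and keep their locations for later reporting. The following is a sketch only. It assumes that returning CIF_OK asks the parser to continue and that any other return value asks it to abort; the error_log structure and log_error() function are hypothetical names. So that the sketch compiles by itself, it declares local stand-ins for UChar and CIF_OK, which real code would obtain from the ICU and CIF API headers instead.

```c
#include <stddef.h>

/* Stand-ins used only when the CIF API header has not been included */
#ifndef CIF_OK
typedef unsigned short UChar;
#define CIF_OK 0
#endif

#define MAX_LOGGED 16

/* Hypothetical caller-defined record of the errors seen so far */
struct error_log {
    size_t count;   /* total number of errors reported */
    size_t limit;   /* abort once more than this many errors have been seen */
    struct {
        int code;
        size_t line;
        size_t column;
    } first[MAX_LOGGED];  /* details of the first few errors */
};

/* Has the same shape as record_error() above; 'data' must point to a struct error_log */
int log_error(int error_code, size_t line, size_t column, const UChar *text, size_t length, void *data) {
    struct error_log *log = (struct error_log *) data;

    (void) text;    /* the offending text is not used in this sketch */
    (void) length;
    if (log->count < MAX_LOGGED) {
        log->first[log->count].code = error_code;
        log->first[log->count].line = line;
        log->first[log->count].column = column;
    }
    log->count += 1;

    /* Continue while under the limit; otherwise pass the error code back */
    return (log->count > log->limit) ? error_code : CIF_OK;
}
```

Such a function would be installed in the same way as record_error() in the earlier example, with the user data pointing to a zero-initialized struct error_log whose limit has been set.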

Parser callbacks

Parsing a CIF from an external representation is in many ways analogous to performing a depth-first traversal of a pre-parsed instance of the CIF data model, as the cif_walk() function does. In view of this similarity, a cif_handler_tp object of the same kind used with cif_walk() can be provided among the parse options to facilitate caller monitoring and control of the parse process. The handler callbacks can probe and to some extent modify the CIF as it is parsed, including by instructing the parser to suppress (but not altogether skip) some portions of the input. This facility has applications from parse-time data selection to validation and beyond; for example, here is a naive approach to assigning loop categories based on loop data names:

int assign_loop_category(cif_loop_tp *loop, void *context) {
    UChar **names;
    UChar **next;
    UChar *dot_location;
    const UChar unicode_dot = 0x2E;

    /* We can rely on at least one data name, but the retrieval itself could fail */
    if (cif_loop_get_names(loop, &names) != CIF_OK) {
        return CIF_TRAVERSE_CONTINUE;  /* leave the loop uncategorized */
    }
    /*
     * Takes the category as everything preceding the first full stop (Unicode code point U+002E)
     * after the leading underscore.  Leaves the loop uncategorized if the name contains none.
     * Ignores case sensitivity considerations.
     */
    dot_location = u_strchr(names[0] + 1, unicode_dot);
    if (dot_location != NULL) {
        *dot_location = 0;
        cif_loop_set_category(loop, names[0]);
    }
    /* Clean up */
    for (next = names; *next != NULL; next += 1) {
        free(*next);
    }
    free(names);
    /* Tell the parser to carry on normally */
    return CIF_TRAVERSE_CONTINUE;
}

void parse_with_categories(FILE *in) {
    cif_tp *cif = NULL;
    struct cif_parse_opts_s *opts = NULL;
    cif_handler_tp handler = { NULL, NULL, NULL, NULL, NULL, NULL, assign_loop_category, NULL, NULL, NULL, NULL };

    cif_parse_options_create(&opts);
    opts->handler = &handler;
    cif_parse(in, opts, &cif);
    free(opts);
    /* the parsed results are available via 'cif' */
    /* ... */
    cif_destroy(cif);
}

Note that the parser traverses its input and issues callbacks in document order, from start to end, so unlike cif_walk(), it is not guaranteed to traverse all of a data block's save frames before any of its data.

CIF validation

The CIF API does not provide specific support for CIF validation because validation is dependent on the DDL of the dictionary with which a CIF purports to comply, whereas the CIF API is generic, not specific to any particular DDL or dictionary. To the extent that some validations can be performed during parsing, however, callback functions provide a suitable means for interested applications to engage such validation.

Comments

For the most part, the parser ignores CIF comments other than for attempting to identify the CIF version with which its input purports to comply. In the event that the caller wants to be informed of comments and other whitespace, however, there is among the parse options a pointer to a callback function for that purpose. Its use is analogous to the callbacks already discussed.

Syntax-only parsing

For some applications, it might not be desirable or even feasible to collect parsed CIF content into an in-memory representation. The simplest such application performs only a syntax check of the input -- perhaps to test compliance with a particular CIF version and/or parse options. The parser function operates in just that way when its third argument is NULL (that is, when the caller provides no CIF handle location). All provided callbacks are invoked as normal in this mode, including any error callback, and regardless of any callbacks, the parser's return code indicates whether any errors were detected. This syntax-only parsing mode does have a few limitations, however: primarily that because it does not retain an in-memory representation of its input, it cannot check CIF semantic requirements for data name, frame code, and block code uniqueness within their respective scopes.

Event-driven parsing

The availability and scope of callback functions make the syntax-only mode described above a CIF analog of the event-driven "SAX" XML parsing interface. To use the parser in that mode, the caller provides callback functions by which to be informed of parse "events" of interest -- recognition of entities and entity boundaries -- so as to extract the desired information during the parse instead of by analyzing the parsed result afterward. Callbacks can communicate with themselves and each other, and can memorialize data for the caller, via the user_data object provided among the parse options (and demonstrated in the error-counting example). Callbacks can be omitted for events that are not of interest.
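As a sketch of how callbacks might share state through the user data object, the following pair of handler functions tallies data blocks and loops as they are encountered. The parse_stats structure and the function names are hypothetical. The callback signatures are modeled on assign_loop_category() above, and the block-start signature in particular is an assumption. So that the sketch compiles by itself, it declares local stand-ins for the CIF API's opaque types and for CIF_TRAVERSE_CONTINUE, which real code would obtain from the CIF API header instead.

```c
#include <stddef.h>

/* Stand-ins used only when the CIF API header has not been included */
#ifndef CIF_TRAVERSE_CONTINUE
typedef struct cif_container_s cif_container_tp;
typedef struct cif_loop_s cif_loop_tp;
#define CIF_TRAVERSE_CONTINUE 0
#endif

/* Hypothetical shared state, reached by every callback through its context argument */
struct parse_stats {
    size_t blocks;                  /* data blocks seen so far */
    size_t loops;                   /* loops seen so far, over all blocks */
    size_t loops_in_current_block;  /* loops seen since the latest block start */
    size_t max_loops_per_block;     /* largest per-block loop count observed */
};

/* Block-start event: count the block and reset the per-block tally */
int count_block_start(cif_container_tp *block, void *context) {
    struct parse_stats *stats = (struct parse_stats *) context;

    (void) block;
    stats->blocks += 1;
    stats->loops_in_current_block = 0;
    return CIF_TRAVERSE_CONTINUE;
}

/* Loop-start event: update the tallies left by earlier callbacks */
int count_loop_start(cif_loop_tp *loop, void *context) {
    struct parse_stats *stats = (struct parse_stats *) context;

    (void) loop;
    stats->loops += 1;
    stats->loops_in_current_block += 1;
    if (stats->loops_in_current_block > stats->max_loops_per_block) {
        stats->max_loops_per_block = stats->loops_in_current_block;
    }
    return CIF_TRAVERSE_CONTINUE;
}
```

In use, these functions would occupy the block-start and loop-start slots of a cif_handler_tp, the parse options' user_data would point to a zero-initialized struct parse_stats, and the third argument to cif_parse() would be NULL so that nothing is retained beyond the tallies.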