Parse error recovery

Since the built-in parser allows syntax and grammar errors to be ignored, it must provide a mechanism for recovering from such errors to continue the parse. In general, there is no single, clear approach for recovering from any given error, so the following table documents the approach taken by the parser in each case.

Condition

Code

Recovery action

Wrong character encoding

CIF_WRONG_ENCODING

ignore the problem

CIF 2.0 must consist of Unicode character data encoded in UTF-8, and this error is emitted if the input is recognized to be encoded differently. That this error is emitted at all generally means that the parser has identified the signature of a different (known) encoding, so it can recover by reading the data according to the detected encoding, even though such input does not comply with the CIF 2.0 specifications.

Disallowed input character

CIF_DISALLOWED_CHAR

substitute a replacement character

This error indicates that an input character outside the allowed set was read. The parser can recover by accepting the character. Which characters are allowed depends on which version of CIF is being parsed.

Missing whitespace

CIF_MISSING_SPACE

assume the omitted whitespace

Whitespace separation is required between most CIF grammatical units. In some cases, the parser can recognize the omission of such whitespace, and it then reports this error. In particular, this is the error reported when a CIF1-style string with an embedded delimiter is encountered while parsing in CIF 2 mode. If the opening delimiter of a table is omitted, then this error will occur at the trailing colon of each table key.

Invalid block code

CIF_INVALID_BLOCKCODE

use the block code anyway

Although the API's CIF manipulation functions will not allow blocks with invalid codes to be created directly by client programs, the parser can and will create such blocks to accommodate inputs that use such codes. The result is not a valid instance of the CIF data model.

Duplicate block code

CIF_DUP_BLOCKCODE

reopen the specified block

Block codes must be unique within a given CIF. To handle a duplicate block code, the parser reopens the specified block and parses the following contents into it. This may well lead to additional errors being reported.
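
The reopening behavior described for CIF_DUP_BLOCKCODE amounts to a find-or-create lookup. The following is a minimal sketch of that idea only, not the library's implementation: the `container` and `block` types, the fixed capacities, and `open_block()` are all hypothetical, and block codes are compared byte-for-byte as a simplification.

```c
#include <stddef.h>
#include <string.h>

#define MAX_BLOCKS 16     /* hypothetical capacity, for illustration only */
#define MAX_CODE_LEN 64   /* hypothetical limit, for illustration only */

/* Hypothetical stand-ins for a data block and its containing CIF. */
struct block {
    char code[MAX_CODE_LEN + 1];
    size_t item_count;
};

struct container {
    struct block blocks[MAX_BLOCKS];
    size_t block_count;
};

/*
 * Returns the block bearing the given code, creating it if necessary.
 * Sets *duplicate to 1 when an existing block is reopened (the situation
 * reported as CIF_DUP_BLOCKCODE), and to 0 otherwise.  Returns NULL if the
 * code is too long or the container is full.
 */
struct block *open_block(struct container *c, const char *code, int *duplicate)
{
    size_t i;

    for (i = 0; i < c->block_count; i++) {
        if (strcmp(c->blocks[i].code, code) == 0) {
            *duplicate = 1;
            return &c->blocks[i];  /* reopen: later items go into this block */
        }
    }

    *duplicate = 0;
    if (c->block_count == MAX_BLOCKS || strlen(code) > MAX_CODE_LEN) {
        return NULL;
    }
    strcpy(c->blocks[c->block_count].code, code);
    c->blocks[c->block_count].item_count = 0;
    return &c->blocks[c->block_count++];
}
```

Because the reopened block keeps its earlier contents, any item names repeated in the second occurrence would then collide with the existing ones, which is one way the additional errors mentioned above can arise.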

Missing block header

CIF_NO_BLOCK_HEADER

parse into an anonymous block

To handle data that appear prior to any block header, the parser creates a data block with an empty name and parses into that. The data are available via that name, but the result is not a valid instance of the CIF data model.

Invalid frame code

CIF_INVALID_FRAMECODE

use the frame code anyway

Although the API's CIF manipulation functions will not allow save frames with invalid codes to be created directly by client programs, the parser can and will create such frames to accommodate inputs that use such codes. The result is not a valid instance of the CIF data model.

Duplicate frame code

CIF_DUP_FRAMECODE

reopen the specified frame

Save frame codes must be unique within their containing block or save frame. To handle a duplicate frame code, the parser reopens the specified frame and parses the following contents into it. This may well lead to additional errors being reported.

Disallowed save frame

CIF_FRAME_NOT_ALLOWED

accept the save frame

This error occurs when a save frame header is encountered while parsing with save frame support completely disabled. The parser recovers by parsing the frame as if save frame support were enabled at the default level.

Unterminated save frame

CIF_NO_FRAME_TERM

assume the missing terminator

This error occurs when a data block header is encountered while parsing a save frame, or when a save frame header is encountered while parsing a save frame with nested frames disabled (the default). The parser recovers by assuming the missing save frame terminator at the position where the error is detected.

Unterminated save frame at end-of-file

CIF_EOF_IN_FRAME

assume the missing terminator

This is basically the same as the CIF_NO_FRAME_TERM case, but triggered when the end of input occurs while parsing a save frame. This case is distinguished in part because it may indicate a truncated input.

Unexpected save frame terminator

CIF_UNEXPECTED_TERM

ignore

If a save frame terminator is encountered outside the scope of a save frame, the parser recovers by ignoring it. This condition cannot be distinguished from the alternative that a save frame header is given without any frame code.

Duplicate data name

CIF_DUP_ITEMNAME

parse and drop the item

If a duplicate item name is encountered then it and its associated value(s) are dropped, including when the duplicate appears in a loop.

Unexpected value

CIF_UNEXPECTED_VALUE

ignore the value

This error occurs when a value appears outside a list or loop without being paired with a dataname or (in a table) a key.

Unexpected (closing) delimiter

CIF_UNEXPECTED_DELIM

ignore the delimiter

This error occurs when a list or table closing delimiter appears without a matching opening delimiter preceding it. This can happen when such a delimiter appears in the middle of a whitespace-delimited data value.

Missing data value

CIF_MISSING_VALUE

use a synthetic unknown value

This error occurs when a data name or table key appears without a paired value. The parser recovers by synthesizing an unknown-value placeholder.

Empty loop header

CIF_NULL_LOOP

ignore

This occurs when the loop_ keyword appears without at least one subsequent data name. The parser recovers by ignoring it.

Truncated loop packet

CIF_PARTIAL_PACKET

fill out the packet with unknown values

This error occurs when the number of data values in a loop is not an integral multiple of the number of data names. In such cases, the parser can recover by filling out the final packet with unknown-value placeholders.
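
The arithmetic behind the CIF_PARTIAL_PACKET recovery can be sketched as follows. The function name `placeholders_needed()` is hypothetical and is not part of the API; it only computes how many placeholder values would complete the last packet.

```c
#include <stddef.h>

/*
 * Hypothetical helper: the number of unknown-value placeholders needed to
 * complete the final packet of a loop having name_count data names and
 * value_count parsed values.  Returns 0 when the last packet is complete.
 */
size_t placeholders_needed(size_t name_count, size_t value_count)
{
    size_t remainder;

    if (name_count == 0) {
        return 0;  /* a loop with no data names is a different error */
    }
    remainder = value_count % name_count;
    return (remainder == 0) ? 0 : name_count - remainder;
}
```

For a loop with three data names and seven values, for instance, the third packet holds only one value, so two placeholders are needed.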

Empty loop

CIF_EMPTY_LOOP

accept

This occurs when a valid loop header is not followed by any values. The parser recovers by accepting the empty loop, which can be accommodated by the API's internal CIF representation. The result is not a valid instance of the CIF data model, however.

Unterminated list or table

CIF_MISSING_DELIM

assume the missing delimiter

If the closing delimiter of a list or table is omitted, then the parser recovers by assuming the terminator to appear at the point where its absence is recognized.

Missing table key

CIF_MISSING_KEY

drop the value

If a value appears inside a table without an associated key, then the parser recovers by dropping it.

Missing table key

CIF_NULL_KEY

use a NULL key

If a table entry contains a (colon, value) pair without any key representation at all (not even an empty string), then the parser can recover by using a NULL key. The result is not a valid instance of the CIF data model.

Unquoted table key

CIF_UNQUOTED_KEY

accept

This case is distinguished from the CIF_MISSING_KEY case by the presence of a colon in the unquoted value. The key is taken as everything up to the first colon, and the value as everything after it.
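
The split at the first colon described for CIF_UNQUOTED_KEY can be illustrated with a short sketch. The `key_value_split` type and `split_unquoted_key()` function are hypothetical names chosen for this example, and the token is assumed to be a NUL-terminated byte string rather than the library's internal representation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: views into the original token, not copies. */
struct key_value_split {
    const char *key;    /* start of the key text */
    size_t key_len;     /* number of bytes before the first colon */
    const char *value;  /* everything after the first colon */
};

/*
 * Splits an unquoted token at its first colon.  Returns 1 and fills in
 * *out on success, or 0 if the token contains no colon at all (which
 * corresponds to the CIF_MISSING_KEY case instead).
 */
int split_unquoted_key(const char *token, struct key_value_split *out)
{
    const char *colon = strchr(token, ':');

    if (colon == NULL) {
        return 0;
    }
    out->key = token;
    out->key_len = (size_t) (colon - token);
    out->value = colon + 1;
    return 1;
}
```

Only the first colon is significant here, so any later colons remain part of the value.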

Text block as a table key

CIF_MISQUOTED_KEY

accept

CIF 2.0 does not allow text blocks to be used as table keys, but this is a somewhat artificial restriction. If the parser encounters a table key quoted with newline/semicolon delimiters then it can recover by accepting that key as valid.

Missing text prefix

CIF_MISSING_PREFIX

accept

The text prefixing protocol requires every line of a prefixed text field to start with the chosen prefix. If any line fails to do so then the parser can typically recover by simply accepting that line verbatim.
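
The CIF_MISSING_PREFIX recovery can be pictured one line at a time. This is a minimal sketch under the assumption that lines are NUL-terminated byte strings; `strip_line_prefix()` is a hypothetical helper, not an API function.

```c
#include <string.h>

/*
 * Hypothetical helper: removes the prefix from one line of a prefixed text
 * field.  If the line starts with the prefix, returns a pointer just past it
 * and sets *had_prefix to 1.  Otherwise returns the line unchanged and sets
 * *had_prefix to 0, which is the point at which a parser could report
 * CIF_MISSING_PREFIX and accept the line verbatim.
 */
const char *strip_line_prefix(const char *line, const char *prefix,
                              int *had_prefix)
{
    size_t prefix_len = strlen(prefix);

    if (strncmp(line, prefix, prefix_len) == 0) {
        *had_prefix = 1;
        return line + prefix_len;
    }
    *had_prefix = 0;
    return line;
}
```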

Invalid unquoted value

CIF_INVALID_BARE_VALUE

accept

A whitespace-delimited data value has a restricted character repertoire and a more-restricted first character. When the parser recognizes that one of these restrictions has not been obeyed, it can recover by accepting the value as-is.

Unquoted reserved word

CIF_RESERVED_WORD

drop

The strings 'data_' (without a block code), 'stop_', and 'global_' are reserved and must not appear as unquoted complete words in CIFs. If the parser encounters one, it can recover by dropping it.
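
A check for the three reserved words named above might look like the following sketch. The function names are hypothetical, and the case-insensitive comparison is an assumption based on CIF keywords generally being case-insensitive; the text above does not say how the parser compares them.

```c
#include <ctype.h>
#include <stddef.h>

/* Compares two NUL-terminated ASCII strings, ignoring letter case. */
static int equals_ignoring_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b)) {
            return 0;
        }
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/*
 * Hypothetical helper: returns 1 if the complete unquoted word is one of the
 * reserved strings, 0 otherwise.  A word such as "data_x" is a block header
 * with a block code, so it does not match.
 */
int is_reserved_word(const char *word)
{
    static const char *const reserved[] = { "data_", "stop_", "global_" };
    size_t i;

    for (i = 0; i < sizeof reserved / sizeof reserved[0]; i++) {
        if (equals_ignoring_case(word, reserved[i])) {
            return 1;
        }
    }
    return 0;
}
```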

Overlength line

CIF_OVERLENGTH_LINE

drop

If a CIF input line exceeds the allowed number of characters (2048 in CIF 1.1 and CIF 2.0) then the parser can recover by ignoring the problem. Note that the limit is expressed in Unicode characters -- not bytes, nor even UChar code units -- and it does not include line terminators.
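
To illustrate why the limit differs from a byte count, the sketch below counts Unicode characters in a line. It assumes well-formed UTF-8 input for simplicity, whereas the text above refers to UChar code units; the function names and the `MAX_LINE_CHARS` macro are hypothetical.

```c
#include <stddef.h>

#define MAX_LINE_CHARS 2048  /* the limit quoted in the text above */

/*
 * Counts the Unicode characters in a UTF-8 encoded line of len bytes,
 * excluding any trailing line terminator.  Assumes well-formed UTF-8.
 */
size_t line_length_in_chars(const char *line, size_t len)
{
    size_t count = 0;
    size_t i;

    /* The limit does not include line terminators. */
    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r')) {
        len--;
    }
    for (i = 0; i < len; i++) {
        /* Continuation bytes look like 10xxxxxx; count all other bytes. */
        if (((unsigned char) line[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Returns nonzero if the line exceeds the character limit. */
int is_overlength(const char *line, size_t len)
{
    return line_length_in_chars(line, len) > MAX_LINE_CHARS;
}
```

A line consisting largely of multi-byte characters can therefore occupy well over 2048 bytes without being overlength.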

Missing endquote

CIF_MISSING_ENDQUOTE

assume the quote

When a (single-) apostrophe-quoted or quotation-mark-quoted string is not terminated before the end of the line on which it begins, the parser can recover by assuming the missing delimiter at the end of the line.
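
The CIF_MISSING_ENDQUOTE recovery can be sketched as a scan that stops at either the closing delimiter or the end of the line, whichever comes first. The sketch assumes CIF 2.0 quoting, where the first matching quote character closes the string, and `scan_quoted()` is a hypothetical function name.

```c
#include <stddef.h>

/*
 * Scans a quoted string whose opening delimiter is text[0].  Returns the
 * number of bytes of string content.  Sets *terminated to 1 if a closing
 * delimiter was found, or to 0 if the end of the line (or of the input) was
 * reached first, in which case the delimiter is assumed at that position.
 */
size_t scan_quoted(const char *text, int *terminated)
{
    char quote = text[0];
    size_t i = 1;

    while (text[i] != '\0' && text[i] != '\n' && text[i] != '\r') {
        if (text[i] == quote) {
            *terminated = 1;
            return i - 1;
        }
        i++;
    }
    *terminated = 0;  /* a parser would report CIF_MISSING_ENDQUOTE here */
    return i - 1;
}
```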

Unterminated multiline string

CIF_UNCLOSED_TEXT

assume the closing delimiter

When a text block or triple-apostrophe-quoted or triple-quotation-mark-quoted string is not terminated before the end of the input, the parser can recover by assuming the missing delimiter at the end of the input. In such cases, that is often much more text than the value was meant to include, but there is no reliable way to determine where it was supposed to end.

Disallowed first character

CIF_DISALLOWED_INITIAL_CHAR

accept

There are slightly different rules for the first character of a CIF than for others, in that a Unicode byte-order mark (U+FEFF) is allowed there. Moreover, an unexpected character at that position can be an indication of a mis-identified character encoding. The parser can recover by accepting the character, but that will result in at least one subsequent error.

Copyright 2014, 2015 John C. Bollinger