Represents a collection of CIF parsing options.
#include <cif.h>
Data Fields
Type | Field | Description
int | prefer_cif2 | Influences the parser's selection of CIF 1 or CIF 2 format to handle a CIF stream.
const char * | default_encoding_name | If not NULL, names the coded character set with which the parser will attempt to interpret plain CIF 1.1 "text" files that do not bear recognized encoding information.
int | force_default_encoding | If non-zero then the default encoding specified by default_encoding_name will be used to interpret the CIF 1.1 or 2.0 input, regardless of any encoding signature or other appearance to the contrary.
int | line_folding_modifier | Modifies whether line-folded text fields will be recognized and unfolded during parsing.
int | text_prefixing_modifier | Modifies whether prefixed text fields will be recognized and de-prefixed during parsing.
int | max_frame_depth | The maximum save frame depth.
const char * | extra_ws_chars | ASCII characters that should be interpreted as CIF inline whitespace.
const char * | extra_eol_chars | ASCII characters that should be interpreted as CIF end-of-line [whitespace] characters.
cif_handler_tp * | handler | A set of handler functions by which the application can be notified of details of the parse progress as they occur, and through which it can influence the data recorded; may be NULL.
cif_syntax_callback_tp | whitespace_callback | A callback function by which the client application can be notified about whitespace that is not an inherent part of any data value.
cif_syntax_callback_tp | keyword_callback | A callback function by which the client application can be notified about CIF keywords.
cif_syntax_callback_tp | dataname_callback | A callback function by which the client application can be notified about CIF data names as they are encountered.
cif_parse_error_callback_tp | error_callback | A callback function by which the client application can be notified about parse errors, affording it the option to interrupt the parse or allow it to continue.
void * | user_data | A pointer to user data to be forwarded to all callback functions invoked by the parser; opaque to the parser itself, and may be NULL.
Represents a collection of CIF parsing options.
Unlike most data types defined by the CIF API, the parse options are not opaque. This reflects the struct's intended use for collecting (only) user-settable option values. There is nevertheless still an object creation function, cif_parse_options_create(); applications allocating parse options only via that function will thereby insulate themselves against changes to the struct's size arising from additions to the option list in future versions of the API.
cif_syntax_callback_tp cif_parse_opts_s::dataname_callback |
A callback function by which the client application can be notified about CIF data names as they are encountered.
If not NULL, this function will be called by the parser whenever it encounters a data name in the input CIF, before it is reported as part of a loop or individual data item.
const char* cif_parse_opts_s::default_encoding_name |
If not NULL, names the coded character set with which the parser will attempt to interpret plain CIF 1.1 "text" files that do not bear recognized encoding information.
Inasmuch as CIF 1 is a text format and CIF 2.0 is a text-like format, it is essential for the parser to interpret them according to the text encodings with which they are written. Well-formed CIF 2.0 is Unicode text encoded via UTF-8, but the CIF 1 specifications are intentionally vague about the terms "text" and "text file", intending them to be interpreted in a system-dependent manner. Additionally, documents that conform to CIF 2.0 except with respect to encoding can still be parsed successfully (though an error will be flagged) if the correct encoding can be determined.
The parser will recognize UTF-8, UTF-16 (either byte order), and UTF-32 (either of two byte orders) for CIFs that begin with a Unicode byte-order mark (BOM). In most cases it will also recognize UTF-16 and UTF-32 encodings even without a BOM. In some cases it may recognize UTF-8 or other encodings when there is no BOM. Well-formed CIF 2.0 documents will always be recognized as UTF-8, though wrongly-encoded CIF 2.0 documents may also be interpreted as UTF-8.
If no encoding is specified via this option (i.e. it is NULL) and no encoding signature is recognized in the input CIF, then for CIF 1.1 documents the parser will choose a default text encoding that is appropriate to the system on which it is running, and it will attempt to parse according to that encoding. How the default encoding is chosen is implementation dependent in this case.
If the correct encoding of the input CIF is known, however, then that encoding can be specified by its IANA name via this option. Implementations may also recognize aliases and / or unregistered encoding names. If it is supported, the named encoding will be used in the event that no encoding signature is detected, bypassing the library's usual method for choosing a default encoding. This allows a CIF written in a localized encoding to be parsed correctly on a system with a different default locale. It should be noted, however, that such a CIF can reasonably be considered erroneous (on the system where it is being parsed) on account of its encoding.
The names supported by this option are those recognized by ICU's "converter" API as converter / code page names.
cif_parse_error_callback_tp cif_parse_opts_s::error_callback |
A callback function by which the client application can be notified about parse errors, affording it the option to interrupt the parse or allow it to continue.
If not NULL, this function will be called by the parser whenever it encounters an error in the input CIF. The parse will be aborted immediately with an error if this function returns non-zero; otherwise, it serves informational purposes only. Two pre-built error callbacks are provided: cif_parse_error_ignore() and cif_parse_error_die().
If NULL, or if parse options are not specified, then the parser will operate as if the error handler were cif_parse_error_die().
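The continue-or-abort contract described above can be illustrated with a callback that tolerates a fixed number of errors. This is only a sketch: the typedef `error_callback_fn` is a hypothetical, simplified stand-in (the real `cif_parse_error_callback_tp` is declared in cif.h and may take different arguments), and `struct error_budget` is an invented example of state passed through `user_data`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified callback shape for illustration only; the real
 * cif_parse_error_callback_tp in cif.h may take different arguments. */
typedef int (*error_callback_fn)(int code, size_t line, size_t column,
                                 void *user_data);

/* Invented application state, reached through the opaque user_data pointer. */
struct error_budget {
    int seen;     /* errors reported so far */
    int allowed;  /* errors tolerated before the parse is aborted */
};

/* Returns 0 to let the parse continue, non-zero to request an abort. */
static int tolerate_some_errors(int code, size_t line, size_t column,
                                void *user_data) {
    struct error_budget *budget = user_data;

    (void) code;
    (void) line;
    (void) column;
    assert(budget != NULL);

    budget->seen += 1;
    return (budget->seen > budget->allowed) ? 1 : 0;
}
```

A callback of this kind sits between the two pre-built behaviors: it ignores the first few errors and then stops the parse.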
const char* cif_parse_opts_s::extra_eol_chars |
ASCII characters that should be interpreted as CIF end-of-line [whitespace] characters.
If not NULL, specifies additional characters from the 7-bit ASCII set and / or from among the C1 controls that should be accepted as representing CIF end-of-line characters. Ordinarily, only the carriage return and newline characters have this function, and the C1 controls and most of the C0 controls are disallowed altogether. This option is mainly intended to allow successful parsing of pre-v1.1 CIFs, some of which used the form feed character as an end-of-line character, but it is not inherently limited to that use.
Extra end-of-line characters appearing within data values are subject to the same conversion to newlines as carriage return characters and carriage return / line feed pairs.
The space and tab characters cannot be given end-of-line significance via this option; if they appear in the provided string then they are ignored.
The carriage return and newline characters may be specified among these characters, but doing so has no additional effect.
The string is terminated by a null character ('\0'); no mechanism is provided for treating that character as end-of-line.
const char* cif_parse_opts_s::extra_ws_chars |
ASCII characters that should be interpreted as CIF inline whitespace.
If not NULL, specifies additional characters from the 7-bit ASCII set and / or from among the C1 controls that should be accepted as representing CIF inline whitespace. Ordinarily, only the space character and the tab character have this function, and the C1 controls and most of the C0 controls are disallowed altogether. This option is mainly intended to allow successful parsing of pre-v1.1 CIFs, some of which used the vertical tab character as inline whitespace, but it is not inherently limited to that use.
The space or tab may be specified among these characters, but doing so has no additional effect.
The carriage return and newline characters may be specified among these characters, but doing so does not change their role as end-of-line characters.
The string is terminated by a null character ('\0'); no mechanism is provided for treating that character as whitespace.
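The character-classification rules for extra_eol_chars and extra_ws_chars can be modeled with two small predicates. This is a sketch of the documented rules, not the parser's actual code; the function names `listed_in`, `is_eol_char` and `is_inline_ws_char` are hypothetical, and the sketch deliberately does not address a character listed in both strings, which the text above does not specify.

```c
#include <assert.h>
#include <string.h>

/* Nonzero if c appears in the null-terminated string extra (may be NULL).
 * The terminator itself never matches, mirroring the documented limitation.
 * c is expected to be a non-negative character value. */
static int listed_in(int c, const char *extra) {
    assert(c >= 0);
    return extra != NULL && c != '\0' && strchr(extra, c) != NULL;
}

/* Model of end-of-line recognition with extra_eol_chars applied. */
static int is_eol_char(int c, const char *extra_eol_chars) {
    if (c == '\r' || c == '\n') return 1;  /* always end-of-line */
    if (c == ' ' || c == '\t') return 0;   /* ignored if listed */
    return listed_in(c, extra_eol_chars);
}

/* Model of inline whitespace recognition with extra_ws_chars applied. */
static int is_inline_ws_char(int c, const char *extra_ws_chars) {
    if (c == ' ' || c == '\t') return 1;   /* always inline whitespace */
    if (c == '\r' || c == '\n') return 0;  /* role as end-of-line is unchanged */
    return listed_in(c, extra_ws_chars);
}
```

For a pre-v1.1 CIF of the kind mentioned above, an application might pass "\f" as extra_eol_chars and "\v" as extra_ws_chars.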
int cif_parse_opts_s::force_default_encoding |
If non-zero then the default encoding specified by default_encoding_name will be used to interpret the CIF 1.1 or 2.0 input, regardless of any encoding signature or other appearance to the contrary.
If default_encoding_name is NULL then it represents a system-dependent default encoding. That's the norm for CIF 1.1 input anyway, but if force_default_encoding is non-zero then the same system-dependent default will be chosen for CIF 2.0 as well.
This option is dangerous. Enabling it can cause CIF parsing to fail, or in some cases cause CIF contents to silently be misinterpreted, if the specified default encoding is not in fact the correct encoding for the input. On the other hand, use of this option is essential for correctly parsing CIF documents whose encoding cannot be determined or guessed correctly.
This option can be used to parse CIF 2.0 text that is encoded other than via UTF-8. Such a file is not valid CIF 2.0, and therefore will cause an error to be flagged, but if the error is ignored and the specified encoding is in fact correct for the input then parsing will otherwise proceed normally.
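The interplay between default_encoding_name and force_default_encoding can be summarized in a short decision function. This is an illustrative model under stated assumptions, not the library's implementation: `choose_encoding` is a hypothetical name, `detected` stands for whatever encoding the parser recognized from a signature or otherwise (NULL if none), and `system_default` stands for the implementation-dependent default, whose selection is outside the scope of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the encoding name a parser following the documented rules would
 * use. All parameters except system_default may be NULL. */
static const char *choose_encoding(const char *detected,
                                   const char *default_encoding_name,
                                   int force_default_encoding,
                                   const char *system_default) {
    const char *fallback;

    assert(system_default != NULL);

    /* A NULL default_encoding_name represents the system-dependent default. */
    fallback = (default_encoding_name != NULL) ? default_encoding_name
                                               : system_default;

    /* Forcing overrides any signature or other appearance to the contrary. */
    if (force_default_encoding != 0) return fallback;

    /* Otherwise the default applies only when nothing was recognized. */
    return (detected != NULL) ? detected : fallback;
}
```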
cif_handler_tp* cif_parse_opts_s::handler |
A set of handler functions by which the application can be notified of details of the parse progress as they occur, and through which it can influence the data recorded; may be NULL.
If not itself NULL, then any non-null handler functions in the cif_handler_tp to which this option points will be invoked by the parser at appropriate times as it traverses the input CIF text. The handlers' return codes CIF_TRAVERSE_SKIP_CURRENT and CIF_TRAVERSE_SKIP_SIBLINGS are interpreted as directing which data to record in the target CIF, if indeed the user has provided one. (The parser cannot altogether skip parsing any part of the input if it must identify the end of that part and there resume normal operation.) Handler callbacks are not invoked for entities thereby passed over. Handlers may modify the CIF under construction, subject to the limitations inherent in the CIF being incompletely constructed when they are called.
cif_syntax_callback_tp cif_parse_opts_s::keyword_callback |
A callback function by which the client application can be notified about CIF keywords.
If not NULL, this function will be called by the parser whenever it encounters a CIF keyword that does not directly correspond to or trigger a CIF Handler callback. In this version, that is only the 'loop_' keyword, but a future version may also signal other keywords, so callback functions attached to this hook should check which keyword they receive if in fact they care.
int cif_parse_opts_s::line_folding_modifier |
Modifies whether line-folded text fields will be recognized and unfolded during parsing.
The line-folding protocol for text fields is part of the CIF 2.0 specification, but it is only a common convention for CIF 1. By default, therefore, the parser will recognize and unfold line-folded text fields when it operates in CIF 2.0 mode, but it will pass them through as-is when it operates in CIF 1.0 mode. This option influences that behavior: if greater than zero then the parser will unfold line-folded text fields regardless of CIF version, and if less than zero then it will not recognize or unfold line-folded text fields even in CIF 2 mode.
Note that where a text field has been both line-folded and prefixed, the line-folding can be recognized only if line unfolding and text de-prefixing are both enabled.
int cif_parse_opts_s::max_frame_depth |
The maximum save frame depth.
If 1, then one level of save frames will be accepted (i.e. save frames are allowed, but must not be nested). If 0 then all save frames will be rejected as erroneous. That might be used to ensure that CIF data files (as opposed to dictionaries) do not contain save frames. If negative, then save frames are allowed, and may be nested without limit. Values greater than 1 are reserved for possible future use as a bound on save frame nesting depth.
The current version of STAR allows nested save frames, and their use was proposed for CIF 2, especially for DDLm dictionaries. Save frame nesting was not accepted into the CIF 2.0 standard, however, even for dictionary files.
The default is 1.
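The meaning of the max_frame_depth values above can be expressed as a small check. This is a sketch of the documented rules only; the function name `save_frame_accepted` and the `current_depth` parameter (the number of save frames already open when a new one begins) are hypothetical. Because values greater than 1 are reserved, the sketch reports them as unspecified rather than guessing at a behavior.

```c
#include <assert.h>

/* Returns 1 if a new save frame would be accepted, 0 if it would be
 * rejected as erroneous, and -1 for reserved max_frame_depth values. */
static int save_frame_accepted(int max_frame_depth, int current_depth) {
    assert(current_depth >= 0);

    if (max_frame_depth < 0) return 1;   /* unlimited nesting */
    if (max_frame_depth == 0) return 0;  /* no save frames at all */
    if (max_frame_depth == 1) return current_depth == 0;  /* no nesting */
    return -1;                           /* reserved for future use */
}
```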
int cif_parse_opts_s::prefer_cif2 |
Influences the parser's selection of CIF 1 or CIF 2 format to handle a CIF stream.
Because the CIF-version code is required in CIF 2 but optional in CIF 1, it is most correct to assume CIF 1 when there is no version code. Nevertheless, if a CIF is known or assumed to otherwise comply with CIF 2, then it is desirable to parse it that way regardless of the absence of a version code.
CIF 2 streams that erroneously omit the version code will be parsed as CIF 2 if this option has value greater than zero (albeit with an error on account of the missing version code). On the other hand, CIF 1 streams that (allowably) omit the version code may be parsed incorrectly if this option is enabled.
Moreover, CIF streams will be parsed as CIF 2 regardless of an explicit version code to the contrary if this option has a value of 20 or greater, and they will be parsed as CIF 1.1 regardless of an explicit version code to the contrary if this option has a value less than zero.
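The version-selection rules for prefer_cif2 can be collected into one function. This is a model of the behavior described above, not the parser's code; the enum `declared_version` and the function `select_cif_version` are hypothetical names introduced for the sketch.

```c
#include <assert.h>

/* What the input stream itself declares via its version code, if any. */
enum declared_version {
    NO_VERSION_CODE = 0,
    DECLARED_CIF1 = 1,
    DECLARED_CIF2 = 2
};

/* Returns 1 or 2: the CIF major version the stream would be parsed as. */
static int select_cif_version(int prefer_cif2, enum declared_version declared) {
    assert(declared == NO_VERSION_CODE || declared == DECLARED_CIF1
           || declared == DECLARED_CIF2);

    if (prefer_cif2 >= 20) return 2;  /* CIF 2 even against an explicit code */
    if (prefer_cif2 < 0) return 1;    /* CIF 1.1 even against an explicit code */
    if (declared != NO_VERSION_CODE) return (int) declared;

    /* No version code: assume CIF 1 unless CIF 2 is preferred. */
    return (prefer_cif2 > 0) ? 2 : 1;
}
```

Note that values from 1 to 19 affect only streams without a version code, while an explicit code is still honored.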
int cif_parse_opts_s::text_prefixing_modifier |
Modifies whether prefixed text fields will be recognized and de-prefixed during parsing.
The prefix protocol for text fields is part of the CIF 2.0 specification, but for CIF 1 it is only a local convention of certain organizations. By default, therefore, the parser will recognize and de-prefix prefixed text fields when it operates in CIF 2.0 mode, but it will pass them through as-is when it operates in CIF 1.0 mode. This option influences that behavior: if greater than zero then the parser will de-prefix prefixed text fields regardless of CIF version, and if less than zero then it will not recognize or de-prefix prefixed text fields even in CIF 2 mode.
Note that where a text field has been both line-folded and prefixed, the prefixing can be recognized only if line unfolding and text de-prefixing are both enabled.
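Both line_folding_modifier and text_prefixing_modifier follow the same three-way rule, which can be sketched as below. The function names `protocol_enabled` and `can_decode_folded_and_prefixed` are hypothetical, and the code models the documented behavior rather than reproducing the parser.

```c
#include <assert.h>

/* Whether a text-field protocol (line folding or prefixing) is applied,
 * given its modifier option and the CIF major version in effect (1 or 2). */
static int protocol_enabled(int modifier, int cif_version) {
    assert(cif_version == 1 || cif_version == 2);

    if (modifier > 0) return 1;  /* applied regardless of CIF version */
    if (modifier < 0) return 0;  /* not applied, even in CIF 2 mode */
    return cif_version == 2;     /* default: CIF 2 only */
}

/* A field that is both line-folded and prefixed can have either protocol
 * recognized only when both protocols are enabled. */
static int can_decode_folded_and_prefixed(int line_folding_modifier,
                                          int text_prefixing_modifier,
                                          int cif_version) {
    return protocol_enabled(line_folding_modifier, cif_version)
        && protocol_enabled(text_prefixing_modifier, cif_version);
}
```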
cif_syntax_callback_tp cif_parse_opts_s::whitespace_callback |
A callback function by which the client application can be notified about whitespace that is not an inherent part of any data value.
If not NULL, this function will be called by the parser whenever it encounters a run of insignificant whitespace (including comments) in the input CIF. Whitespace is insignificant if it serves only to separate other elements appearing in the CIF.
The parser does not guarantee to collect maximal whitespace runs; it may at times split consecutive whitespace into multiple runs, performing a callback for each one. It does, however, mark each transition from whitespace to non-whitespace by reporting a zero-length whitespace run. It does so, too, for virtual whitespace -- where optional whitespace is omitted, or where the parser detects and recovers from omission of required whitespace.
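Because runs may be split and each transition to non-whitespace is marked by a zero-length run, a client that wants maximal runs can coalesce the pieces itself. The sketch below assumes a hypothetical, simplified callback that receives only the run length and the user data; the real `cif_syntax_callback_tp` is declared in cif.h and may take different arguments. `struct ws_collector` and `MAX_RUNS` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RUNS 16  /* arbitrary capacity chosen for the example */

/* Invented application state, passed via the opaque user data pointer. */
struct ws_collector {
    size_t pending;         /* length accumulated since the last marker */
    size_t runs[MAX_RUNS];  /* lengths of completed, coalesced runs */
    size_t count;           /* number of completed runs recorded */
};

/* Hypothetical simplified whitespace callback: accumulates split runs and
 * closes the current logical run when a zero-length marker arrives. */
static void on_whitespace(size_t length, void *user_data) {
    struct ws_collector *c = user_data;

    assert(c != NULL);

    if (length > 0) {
        c->pending += length;  /* one piece of a possibly split run */
        return;
    }

    /* Zero-length run: transition to non-whitespace (or virtual whitespace). */
    if (c->count < MAX_RUNS) {
        c->runs[c->count++] = c->pending;
    }
    c->pending = 0;
}
```

With this scheme a zero-length marker that is not preceded by any whitespace records a run of length zero, which corresponds to the virtual whitespace case described above.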
Copyright 2014, 2015 John C. Bollinger