SGMLParse
VSGMLParse is a high-reliability, true SGML, state-based parser. The parser takes a generic SGML/HTML/XML text data stream and builds a tree of tags, attributes, and strings. The resulting tag tree may be iterated like any other tree. The content of a tag and its tree may be modified, split, or augmented.
The parser is resilient to missing end tags and to end tags that are present but should not be. The parser has a tag stack to allow start/end tag association, but it also has a set of rules that allow it to process through violations with tenacity. The tag stack is separate from the execution stack, which avoids stack overflows in long SGML documents where end tags are never used.
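The idea of a heap-based tag stack with recovery rules can be sketched as follows. This is a minimal illustration in Python, not VSGMLParse's own code: the `Node` class, the `TOKEN` pattern, and the two recovery rules (a stray end tag is ignored; an end tag implicitly closes any tags still open inside it) are assumptions chosen for the example.

```python
import re

# Hypothetical node type for this sketch; not VSGMLParse's data structure.
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # child Nodes and text strings, in document order

# Simplified tokenizer: start tags, end tags, or runs of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")

def parse(text):
    root = Node("#root")
    stack = [root]  # explicit tag stack on the heap, not the call stack
    for match in TOKEN.finditer(text):
        closing, name, chars = match.groups()
        if chars is not None:
            stack[-1].children.append(chars)
        elif not closing:
            node = Node(name.lower())
            stack[-1].children.append(node)
            stack.append(node)
        else:
            name = name.lower()
            # Search for the nearest matching open tag (never the root).
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == name:
                    del stack[i:]  # also closes unclosed tags nested inside
                    break
            # No match found: the stray end tag is silently ignored.
    return root
```

Because the loop never recurses, a document such as `"<x>" * 20000` with no end tags at all only grows the `stack` list. For brevity the sketch treats every start tag as a container, so it does not model empty elements such as `<br>`.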
The parser annotates the tag tree with the character positions of the text representations that created the tags, attributes, and strings. This annotation can be used for error checking and user reporting. Leading and trailing spaces are annotated in the tag but stripped from the actual tree data. As a user of the tree one may include this whitespace, or not, as one chooses.
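One way to picture this annotation is a record that stores the stripped value alongside the source offsets and the amount of whitespace removed. The sketch below is only an illustration; `TextSpan`, `annotate`, and `original_text` are hypothetical names, and the real parser's annotation layout may differ.

```python
from dataclasses import dataclass

@dataclass
class TextSpan:
    value: str  # stripped text, as stored in the tree
    start: int  # offset of the first source character (whitespace included)
    end: int    # offset one past the last source character
    lead: int   # number of leading whitespace characters stripped
    trail: int  # number of trailing whitespace characters stripped

def annotate(source, start, end):
    """Build an annotated span for source[start:end]."""
    raw = source[start:end]
    value = raw.strip()
    lead = len(raw) - len(raw.lstrip())
    # For all-whitespace text, count everything as leading whitespace.
    trail = len(raw) - len(raw.rstrip()) if value else 0
    return TextSpan(value, start, end, lead, trail)

def original_text(source, span):
    """Recover the text exactly as written, whitespace included."""
    return source[span.start:span.end]
```

A caller that wants the whitespace back can use `original_text`, while error reports can quote `span.start + span.lead` as the position of the first visible character.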
The parser provides a scheme for converting to and from HTML special entities like "&amp;". One can typically expect to work with ASCII strings in the tag tree, safe in the knowledge that they will be correctly encoded in the output.
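The round trip can be illustrated with Python's standard `html` module standing in for the parser's own conversion scheme, which is not shown here. The function names `decode_on_parse` and `encode_on_dump` are hypothetical.

```python
import html

def decode_on_parse(raw_text):
    """Turn entity references from the source into plain characters."""
    return html.unescape(raw_text)

def encode_on_dump(tree_text):
    """Re-encode markup-significant characters when writing text out."""
    return html.escape(tree_text, quote=False)

raw = "Fish &amp; Chips &lt;fresh&gt;"
stored = decode_on_parse(raw)      # what a tree user would work with
written = encode_on_dump(stored)   # what the dump would emit
```

Here `stored` holds the plain string `Fish & Chips <fresh>`, and `written` is identical to `raw`, so code that edits the tree never has to handle entity syntax itself.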
A resultant tag tree can be dumped to text using the parser. The parser supports a variety of dump modes:
- Pretty mode produces an easily human readable output.
- Safe mode preserves leading and trailing space data to ensure that a rendered HTML page will appear the same before and after parsing.
- Compressed mode preserves the space data, but does its best to compress the data onto as few lines as possible.
In all of the output modes, one may specify a guide line length, but the parser will only insert line endings where it is safe to do so.
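The difference between a guide length and a hard limit can be sketched with a greedy wrapper that breaks only at points marked as safe. This is an assumption-laden illustration: the `wrap` function and the `(text, can_break_after)` piece format are hypothetical, and deciding which points are actually safe is left to the caller.

```python
def wrap(pieces, guide):
    """Join pieces into lines, breaking only where a break is allowed.

    pieces: list of (text, can_break_after) pairs, in output order.
    guide:  preferred maximum line length; exceeded when no safe
            break point is available.
    """
    lines = []
    current = ""
    can_break = False  # whether a break is allowed at the end of `current`
    for text, break_after in pieces:
        if current and can_break and len(current) + len(text) > guide:
            lines.append(current)
            current = ""
        current += text
        can_break = break_after
    if current:
        lines.append(current)
    return lines

pieces = [("<p>", False), ("alpha", False), ("</p>", True),
          ("<p>", False), ("beta", False), ("</p>", True)]
short_lines = wrap(pieces, 8)
long_lines = wrap(pieces, 80)
```

With a guide of 8, both resulting lines are longer than the guide because no safe break exists inside either paragraph. With a guide of 80, everything stays on one line. In both cases, joining the lines reproduces the original character sequence.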
The text dump side of the parser does not have a separate tag stack, and it uses the execution stack to iterate the tag tree. It is therefore important that a tag tree is not passed directly from parse to dump. After parse, the tag tree must be inspected to establish its depth before attempting any further operations.
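A depth check of this kind has to avoid recursion itself, otherwise it would overflow on the very trees it is meant to detect. The sketch below measures depth with an explicit work list; `Node`, `tree_depth`, and `safe_to_dump_recursively` are hypothetical names, and Python's recursion limit stands in for whatever execution stack budget applies in practice.

```python
import sys

# Hypothetical node type for this sketch; not VSGMLParse's data structure.
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []  # child Nodes and text strings

def tree_depth(root):
    """Return the number of levels in the tree, without recursing."""
    deepest = 0
    work = [(root, 1)]  # explicit work list instead of the call stack
    while work:
        node, depth = work.pop()
        deepest = max(deepest, depth)
        for child in node.children:
            if isinstance(child, Node):
                work.append((child, depth + 1))
    return deepest

def safe_to_dump_recursively(root, frames_per_level=1, headroom=100):
    """Estimate whether a recursive walk fits the recursion limit.

    frames_per_level and headroom are assumed tuning values.
    """
    budget = sys.getrecursionlimit() - headroom
    return tree_depth(root) * frames_per_level < budget
```

A caller would run the check once after parsing and, if it fails, either reject the document or restructure the tree before handing it to any recursive routine.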