Introduction
Some of our applications that support text editing and manipulation also offer the capability to validate HyperText Markup Language (HTML). HTML validation in our applications depends on our Standard Generalized Markup Language (SGML) parser, which is capable of working with any SGML document, including HTML.
The distinction between SGML and HTML is made by a definition or specification called a Document Type Definition (DTD). The DTD is generally a text document, in a fairly loose but complex format, which defines:
- Name and function of allowable elements (tags)
- Rules about closure of elements (with closing tags)
- Rules about ownership among elements (which are allowed to parent which)
- Name and function of allowable attributes associated with each element
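The four kinds of rule listed above can be pictured as a small lookup table. The following is only a minimal sketch in Python of how such rules might be held in memory. The names ElementRule, RULES and may_contain are hypothetical, the values are illustrative and loosely modelled on HTML 4.01, and none of it is taken from our parser or from any real DTD file.

```python
from dataclasses import dataclass, field


@dataclass
class ElementRule:
    """Hypothetical in-memory form of the rules a DTD gives for one element."""
    name: str                      # allowable element (tag) name
    end_tag_required: bool         # closure rule
    allowed_children: frozenset    # ownership rule: which elements it may parent
    # attribute name -> True if the attribute is deprecated
    attributes: dict = field(default_factory=dict)


# Illustrative values only; a real DTD covers far more elements and attributes.
RULES = {
    "ul": ElementRule("ul", True, frozenset({"li"}), {"type": True, "class": False}),
    "li": ElementRule("li", False, frozenset({"ul", "#text"}), {"class": False}),
    "br": ElementRule("br", False, frozenset(), {"clear": True}),
}


def may_contain(parent: str, child: str) -> bool:
    """Return True if the rule table allows `child` directly inside `parent`."""
    rule = RULES.get(parent)
    return rule is not None and child in rule.allowed_children
```

With a table like this, a check such as may_contain("ul", "li") answers the ownership question directly, and the attributes dictionary can be consulted in the same way to flag unknown or deprecated attributes.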
Strictly speaking, eXtensible Markup Language (XML) is different from SGML. In practice the differences are slight, and XML can be treated as a subset of SGML. With the widespread adoption of XML it is now necessary for a true SGML parser to work with XML. This ensures that it is capable of handling the malformed HTML commonly found on websites. Our parser handles this general problem with aplomb.
At the moment, however, our parser is unable to read DTDs in their common format. In addition, we have no "release quality" method for allowing users to specify their own DTD. Whilst we have, and work with, DTDs that we and others have generated, the HTML validation implementation of our parser will only validate against a single DTD that has been generated pragmatically for the express purpose of validating webpages.
HTML has, from the perspective of a browser, very varied support. We might, for example, support the full W3C DTD for each flavour of HTML, XHTML and so on. We know, however, that most browsers don't work in that way. Why, then, would anyone need to validate that way? Even if we did provide full support (as the W3C do online), we would still be unable to say that a user could use the W3C valid badge on the page. There are three clear reasons for this:
- The W3C valid badge is theirs and they control it, so it would be unethical (and possibly unlawful) for us to permit propagation of their badge.
- We have insufficient influence to have the W3C acknowledge, analyse and approve our validation solution.
- The W3C might not like our approach to this problem. They may deem it necessary for us to make changes that we do not have the manpower to implement.
The DTD that we have defined for this purpose is what we call "HTML 4.01 Transitional & Frameset Merger". It must be made clear that, to the letter of the W3C rulebook, there is no such DTD. In addition, because our SGML parser is able to read and work with XML, it is also happy and effective at validating XHTML. We think that this DTD will spot 99.9% of problems with virtually any internet markup that is presented to it. Despite being non-standard, the DTD is derived carefully from the W3C standard to the best of our ability. The validation engine is implemented to the greatest extent possible, given our software capability at this time.
Overall, those who feel that the "letter of the internet standards law" is important may be concerned by our approach. We feel that they need not be. We aim to raise the quality of markup that may be encountered. Although our parser checks the validity and/or deprecation of attributes, its most useful capability is checking parenthood and missing end tags. We feel that those who just want to produce quality pages in quantity should try our validation parser. It's a viable, quick method to check web pages. We think you will find our validation capability very useful indeed.
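To give a flavour of what checking parenthood and missing end tags involves, here is a minimal sketch in Python built on the standard library's html.parser module. It is not our SGML parser and does not use our DTD. The rule tables ALLOWED_CHILDREN and VOID_ELEMENTS, the NestingChecker class and the check function are all hypothetical and heavily simplified, and a real validator would take its rules from the DTD.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified rule tables (assumptions for illustration).
ALLOWED_CHILDREN = {
    "ul": {"li"},
    "li": {"ul", "b", "i"},
    "b": {"i"},
    "i": {"b"},
}
VOID_ELEMENTS = {"br", "hr", "img"}  # never take an end tag; allowed anywhere here


class NestingChecker(HTMLParser):
    """Tracks open elements on a stack and records nesting problems."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, innermost last
        self.problems = []  # human-readable findings, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return  # nothing to open, and this sketch does not restrict them
        if self.stack:
            parent = self.stack[-1]
            if tag not in ALLOWED_CHILDREN.get(parent, set()):
                self.problems.append(f"<{tag}> not allowed inside <{parent}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # also covers self-closing forms such as <br/>
        if tag not in self.stack:
            self.problems.append(f"stray </{tag}>")
            return
        # Anything opened after `tag` and still open is missing its end tag.
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                break
            self.problems.append(f"missing </{open_tag}> before </{tag}>")

    def close(self):
        super().close()
        # Elements still open at the end of input were never closed.
        while self.stack:
            self.problems.append(f"missing </{self.stack.pop()}> at end of document")


def check(markup):
    """Return the list of nesting problems found in `markup`."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


print(check("<ul><li><b>x</li></ul>"))  # ['missing </b> before </li>']
print(check("<ul><b>x</b></ul>"))       # ['<b> not allowed inside <ul>']
```

The stack is the key design choice: the innermost open element is always the candidate parent for the next start tag, and any element left above a matching end tag was never closed.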
If we have success with our software in general, we hope to expand the capability of the SGML parser user interface. Notable capabilities that we hope to include are spell checking in HTML text fields and user-specified DTDs.