Using XML to generate research tools for Wittgenstein scholars by collaborative groupwork


A brief introduction to XML
XML projects usually have at least three components: XML, DTD and XSL. The XML document is the marked-up text. XML is a content-markup language like SGML and MECS, i.e. it encodes "book title" (content) rather than "italic" (presentation). A familiar example of a presentation-markup language is HTML, which is used to create web pages. The DTD (Document Type Definition) is a list of all the tags used to mark up the XML document, with information about how they nest. One can invent any tags for any purpose, but they must be listed in the DTD if the document is to validate against it. Finally, the XSL stylesheet determines how the marked-up content will be presented and in what format. A common XSL transformation is to create HTML web pages, e.g. to take the content tag "book title" and present it as "italic".
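
The step from content markup to presentation can be sketched in a few lines. The following is only an illustration in Python, standing in for what an XSL stylesheet would do; it is not XSL itself. The sample text, the tag names `remark` and `booktitle`, and the function name `to_html` are all invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-markup sample; the tag names are invented for illustration.
SOURCE = "<remark>See <booktitle>Philosophical Investigations</booktitle>, part I.</remark>"

# Presentation rules: content tag -> HTML tag (the role an XSL stylesheet plays).
RULES = {"remark": "p", "booktitle": "i"}


def to_html(xml_text, rules):
    """Rename every content tag to its presentation tag and return HTML text."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags without a rule fall back to a neutral <span>.
        element.tag = rules.get(element.tag, "span")
    return ET.tostring(root, encoding="unicode")


html = to_html(SOURCE, RULES)
print(html)
```

Because the source records only that the phrase is a book title, changing the single rule for `booktitle` would change how every title is presented without touching the text.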

This project uses the Oxygen XML editor. Oxygen understands the relationship between XML, DTD and XSL. If one loads these three components, one can "validate" the XML file, i.e. check that it contains only tags that are listed in the DTD. One can also apply the XSL and transform the document into formatted output such as an HTML web page. If one has created a new XML file using tags of one's own design, Oxygen can scan the document and generate a DTD that lists all the tags it uses. Finally, Oxygen displays the various parts of the markup in colour, which helps to identify faults, and automatically suggests tags that are appropriate at any particular position in the document.
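
The two checks described above, validating tags against a list and drafting a DTD from an existing file, can be sketched as follows. This is a simplified illustration in Python and makes no claim about how Oxygen works internally: it compares tag names only, whereas real DTD validation also checks nesting and attributes. The sample text and the function names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag names are invented for illustration.
SAMPLE = "<remark>See <booktitle>Tractatus</booktitle> and <date>1921</date>.</remark>"


def used_tags(xml_text):
    """Return the sorted list of distinct tag names that occur in the document."""
    return sorted({element.tag for element in ET.fromstring(xml_text).iter()})


def undeclared_tags(xml_text, declared):
    """Return tags used in the document but missing from the declared list."""
    return [tag for tag in used_tags(xml_text) if tag not in declared]


def draft_dtd(xml_text):
    """Emit one permissive ELEMENT declaration per tag; nesting is left open."""
    return "\n".join(f"<!ELEMENT {tag} ANY>" for tag in used_tags(xml_text))


# Simplified "validation": the declared list omits <date>, so it is reported.
missing = undeclared_tags(SAMPLE, {"remark", "booktitle"})
dtd = draft_dtd(SAMPLE)
print(missing)
print(dtd)
```

A draft produced this way declares every element as `ANY`, so it would still need to be tightened by hand to express which tags may appear inside which.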

There are two main types of output from Oxygen. The first is HTML, which is used for web pages. Early HTML standards, e.g. 3.2, had limited options for controlling page layout. Control was achieved by using borderless (invisible) tables and nested tables. Most complex layouts on the internet were created using graphics, because individual areas of a single graphic could be activated as links (an image map). This gave control down to single pixels if necessary, but made loading slow because graphics files are large. With the advent of CSS (cascading style sheets), HTML 4 offers much more control over page layout and the positioning of elements. However, there is still the problem that the page designer cannot control the individual settings of the user's browser. The second type of output from Oxygen offers even more control. It is a new option called "Formatting Objects" (FO). FO is used to create "printed" pages, either on paper or as PDF files, and enables typesetting. FO is therefore interesting because it allows a high level of page control but can still be used in an internet environment via PDF.

XML can be combined with markup language standards such as the Guidelines of the Text Encoding Initiative (TEI).