SCHOLARLY TEXT PROCESSING AND FUTURE MARKUP SYSTEMS

Abstract

This paper gives a brief overview of the background and development of markup systems for text processing, concentrates on certain basic features of current markup systems and attempts to discern tendencies that seem likely to extend into the future. It aims to show that markup technology is important for the humanities, but equally that the humanities disciplines are also important for markup technology. They have already contributed a great deal to the development of markup theory and markup systems, and future technological development may therefore benefit considerably from further contributions from the humanities.

1. Introduction

The use of generic markup has become pervasive in nearly all kinds of document processing, and the number and diversity of systems, tools and applications for document markup has grown rapidly in recent years. The present account will concentrate on certain basic features of current markup systems and attempt to discern tendencies that seem likely to extend into the future.[1]

It aims to show that markup technology is important for the humanities, but equally that the humanities disciplines are also important for markup technology. They have already contributed a great deal to the development of markup theory and markup systems, and future technological development may therefore benefit considerably from further contributions from the humanities.

2. The Rise and Growth of Generic Markup

What is markup, and why is markup relevant to the concerns of scholarly text processing? According to one view, all texts, i.e. not only electronic documents, are marked up. On such a view the reason why humanities scholars should care about markup is simply that markup reflects the structures of texts – whether in the form of electronic, printed, manuscript or other written documents.[2] But another view has it that markup simply consists of the codes or reserved character strings which are inserted into the stream of characters of electronic text files in order to denote or signal features of the document which cannot readily be conveyed by characters directly representing its verbal content. In other words, markup consists of character strings carrying information about other character strings. On this view, too, it may firmly be maintained that virtually all electronic texts are marked up.
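
To illustrate the latter view with a minimal, hypothetical example (the element name is invented for the purpose), consider the line

    <title>De interpretatione</title>

Here the character strings <title> and </title> carry information about the character string enclosed between them, namely that it constitutes a title.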

In the early days of text processing, the lack of a universally accepted standard for document representation posed a serious problem. Software manufacturers employed their own separate encoding systems in the form of proprietary file formats, and for a long time they seemed to regard these systems as a strategic means of holding on to their customers. In any case they did not usually make documentation of their systems publicly available. Unfortunately this made it difficult not only for competitors, but also for users to understand these encodings. The lack of publicly available documentation and the corresponding lack of standards made the exchange and reuse of electronic texts as well as software for text processing difficult and costly in terms of resources.

Furthermore, most encoding systems were directed towards capturing and controlling the visual appearance of documents rather than their intellectual structure and contents. This kind of encoding merely replicated the functionality of print technology without taking advantage of new possibilities provided by the digital media. Documents with such procedural or presentational[3] markup were well suited for publication, but less so for computer-assisted retrieval, linguistic analysis and other uses which are peculiar to digital texts.

The result of this was considerable expense and inconvenience for users in general, but quite possibly an even greater problem in the humanities than elsewhere. Whereas other disciplines use texts primarily as a medium for the transmission of information about some object of study, in the humanities the object of study is often the text itself. In other settings texts tend to be of relevance for only limited periods of time, yet in the humanities scholars work with texts that are transmitted over hundreds or even thousands of years. Moreover, any text is a potential object of future historical interest. For humanities research it is therefore important not just to facilitate the exchange and reuse of the texts that record the results of research, but also to ensure that texts produced in very different contexts can be preserved in a form that will make them accessible also to research in the future.

In addition, humanities research often has to rely on software specially developed by those who work in the respective research environments. On top of the expense of developing this software there were the costs of maintaining it and ensuring that it could be used on texts stored in various and ever-changing formats. Scholars, and the institutions responsible for conserving source materials, such as archives and libraries, were among the first to encourage standardization of the formats used in text representation.

Internationally, considerable effort was (and still is) invested in the development of common standards for text encoding. Major players in the computer industry itself threw their support behind these developments, the principal aim of which can be described as improved efficiency in the production and distribution of electronic texts and the relevant software. One outcome of these efforts was the adoption of Standard Generalized Markup Language (SGML) as an ISO standard in 1986.[4]

In its simplest forms, SGML markup lends itself to a straightforward model for markup interpretation and processing: the features of a document are represented by SGML elements, which nest within each other and which normally contain character strings representing the verbal contents of the document. An SGML document therefore has a natural representation as a tree whose nodes represent elements and whose leaves represent the characters of the document. The structure of the elements, i.e. the legal forms of the document tree, may be restricted using a Document Type Definition (DTD), which provides a form of context-free grammar. The document structure may thus be checked by a validating SGML parser.
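
A minimal, hypothetical sketch may make this concrete (all element names are invented for the example). A DTD fragment such as

    <!ELEMENT anthology - - (poem+)>
    <!ELEMENT poem      - - (title?, stanza+)>
    <!ELEMENT stanza    - - (line+)>
    <!ELEMENT title     - - (#PCDATA)>
    <!ELEMENT line      - - (#PCDATA)>

defines a small grammar for collections of poems, and a document such as

    <anthology>
      <poem>
        <title>An example poem</title>
        <stanza>
          <line>First line of verse</line>
          <line>Second line of verse</line>
        </stanza>
      </poem>
    </anthology>

corresponds to a tree with anthology at its root. A validating parser would reject, for instance, a line element occurring directly inside anthology, since the grammar does not permit it there.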

SGML is a flexible and powerful tool. Its power consists above all in its ability to give users control over the document structure by designing DTDs against which documents can be validated. Its flexibility consists in allowing users to design their own DTDs, with tag vocabularies suited to their individual needs, rather than imposing a pre-defined and fixed tag set. Although in principle SGML can also be used for other purposes, the SGML community has strongly recommended so-called descriptive markup, as opposed to presentational or procedural markup. Users should in general not mark up their documents' visual appearance, but rather features ›underlying‹ the typography of conventional printed documents.[5]
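
The difference may be illustrated by two hypothetical encodings of the same phrase (the element and attribute names are again invented): presentational markup records how the phrase is to look, descriptive markup records what it is.

    <italic>ceteris paribus</italic>
    <foreign lang="la">ceteris paribus</foreign>

A formatter may still choose to render the foreign element in italics, but the descriptive encoding also supports quite different uses, such as retrieval of all Latin phrases in a corpus.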

Work on the Text Encoding Initiative (TEI) began in 1987, just one year after SGML had been approved as an ISO standard. The TEI Guidelines for Electronic Text Encoding and Interchange,[6] the result of a collaborative effort by a hundred or so researchers from a variety of humanities backgrounds, were published in 1994. The TEI Guidelines describe one of the most comprehensive and advanced text markup systems ever devised. They provide not a single DTD but a set of DTD fragments and an environment for creating customized DTDs. One such customization, known as TEI Lite, has become particularly popular.

However, a number of circumstances slowed down SGML's adoption and success during its first decade. The most important reason was probably the complexity of the standard itself. SGML incorporates many complicated optional features. Due to abbreviation options, element boundaries cannot be reliably determined without reference to the document grammar. Thus, even a non-validating parse of a document is not possible without processing the DTD. In addition, SGML includes several other features which make it difficult to write parsing routines. Consequently, SGML software development proceeded slowly.
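
For example, with the OMITTAG feature and a DTD declaring that the end-tag of item may be omitted (say, <!ELEMENT item - O (#PCDATA)>), a fragment like the following is legal SGML:

    <list>
    <item>first entry
    <item>second entry
    </list>

Only by consulting the DTD can a parser determine where each item element ends. This is merely a sketch of the kind of abbreviation at issue, not a description of any particular application.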

From 1993 onwards, the propagation of SGML received a considerable boost from the explosive growth of the World Wide Web. The document standard used on the web, known as HTML (HyperText Markup Language), is based on SGML, allowing us to claim that the incredible popularity of the Web also represents a success for SGML.

Even though HTML is an SGML-based standard, it has a number of peculiar characteristics that conflict with many of the fundamental ideas underlying SGML. Firstly, the user cannot alter the DTD, which means that HTML is essentially static. Secondly, HTML is far more appearance oriented than content oriented. Thirdly, the opportunities for automatic validation are only minimally exploited.[7]

These drawbacks of HTML led many people, not least in academic circles, to start looking for alternative ways to transfer SGML documents via the web. It was against this background that work was begun on XML (Extensible Markup Language). The aim was to combine the simplicity of HTML with the expressive power and flexibility of SGML. The World Wide Web Consortium published XML as a W3C Recommendation on February 10, 1998.[8]

XML has retained important features of SGML, such as the simple notation lending itself to a data model representing a document as a tree structure, the possibility of constraining document structure by means of a DTD, and the freedom of the user to define his own tag sets with their associated DTDs. The basic difference from SGML is that markup abbreviation has been eliminated, so that a document can be parsed without access to its DTD. Many other less widely used but complicating mechanisms of SGML have also been eliminated. Compared to SGML, software development for XML is consequently much easier.
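
To take up the list example again: in XML all tags must be explicit, so that element boundaries can be determined from the document alone.

    <list>
      <item>first entry</item>
      <item>second entry</item>
    </list>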

Like HTML, XML has enjoyed considerable success, albeit of a different kind. XML documents can easily be converted to HTML. It has become common practice to prepare and exchange documents in XML, and then to generate HTML for the visual presentation of those documents on the web. Great quantities of web content therefore use HTML exclusively as a presentation format, with XML as the underlying primary format.

Much SGML and HTML-based data and many associated applications have been or are in the process of being converted to XML. For example, HTML itself is now available in an XML-based version: XHTML.[9] Moreover, whereas TEI P3 (the version of the TEI Guidelines published in 1994) was based on SGML, TEI P4 (the follow-on version published in 2002)[10] is simply an XML-based version of the same system.

3. Current Markup Technologies

Although proprietary formats (like PostScript, PDF, RTF et cetera) are still widely in use, it is fair to say that XML is gaining ground so rapidly that it is perhaps already the predominant format for the encoding and exchange of text documents, or at least will be in the near future.

While part of the attractiveness of XML lies in its simplicity, a huge and potentially bewildering variety of related standards, technologies, applications and tools has emerged alongside XML, partly based on it and partly augmenting its capabilities. In this presentation, I limit myself to a brief mention of developments which seem particularly relevant to humanities computing (although none of them have been designed with humanities applications as their main object, and they all have other application areas as well).

XSL (Extensible Stylesheet Language)[11] is a set of specifications used primarily for transformation of XML documents to other forms of XML or to non-XML formats. XSL uses XSLT (XSL Transformations) for transforming documents; XPath (XML Path Language) to access or refer to specific parts of a document; and XSL-FO (XSL Formatting Objects) to specify document formatting.
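
As a minimal sketch (the source vocabulary is the invented anthology example used earlier, and the output rules are chosen only for illustration), the following XSLT stylesheet uses XPath expressions in its match and select attributes to turn such a document into simple HTML:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- the document root becomes an HTML page -->
      <xsl:template match="/">
        <html>
          <body>
            <xsl:apply-templates select="//poem"/>
          </body>
        </html>
      </xsl:template>

      <!-- each poem: its title as a heading, its stanzas as paragraphs -->
      <xsl:template match="poem">
        <h1><xsl:value-of select="title"/></h1>
        <xsl:apply-templates select="stanza"/>
      </xsl:template>

      <xsl:template match="stanza">
        <p>
          <xsl:for-each select="line">
            <xsl:value-of select="."/><br/>
          </xsl:for-each>
        </p>
      </xsl:template>

    </xsl:stylesheet>

An XSL-FO stylesheet for print output would follow the same pattern, but produce formatting objects rather than HTML elements.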

XML is suited for the representation not only of text documents, but also of database data. XQuery[12] provides for XML data a query language similar to those known from relational database systems. XQuery is based partly on XPath, but provides additional functionality such as construction of new XML elements and attributes, reordering and suppression of selected data, data typing et cetera.
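
A small FLWOR expression may serve as an illustration (the file name and the element names are again merely assumed): it selects poems by a word occurring in their titles and wraps the results in a newly constructed element.

    for $p in doc("anthology.xml")//poem
    where contains($p/title, "Rose")
    return <hit>{ $p/title }</hit>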

XLink (XML Linking Language)[13] provides mechanisms for creating and describing links in XML documents, both in ways familiar from the unidirectional links of HTML and in the form of more sophisticated hyperlinks. XPointer provides an addressing language – i.e. a language for specifying locations in XML documents – which is a superset of XPath. XLink can use XPointer expressions to specify the locations of link ends.
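
Schematically (file names, element names and the exact pointer expression are assumptions of the example, not requirements of the specifications), a simple XLink combined with an XPointer expression might look like this:

    <note xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:type="simple"
          xlink:href="edition.xml#xpointer(//poem[1]/stanza[2]/line[1])">
      See the commentary on this verse line.
    </note>

The xpointer() part addresses a location inside edition.xml by means of an XPath-based expression; the XLink attributes make the note element a link to that location.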

XForms,[14] one of the most recent additions to the wealth of W3C recommendations, replicates and greatly enhances the functionality of HTML forms for XML. In particular, XForms separates handling of data content from its presentation, and offers strong data typing.

XML defines the structure of markup, but provides limited means of constraining element content and attribute values. W3C XML Schema[15] allows DTD designers to define elements that respect complex data types, such as those found in high-level programming languages. Other schema languages for defining XML vocabularies are also in use; the two best known, after W3C XML Schema, are probably Relax NG and Schematron.
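
For instance (the element names and the pattern are invented for the example), a schema fragment might require a date element to contain a value of the built-in date type, and a shelfmark to match a particular pattern:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <xs:element name="purchaseDate" type="xs:date"/>

      <xs:element name="shelfmark">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:pattern value="MS [0-9]+"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>

    </xs:schema>

Neither constraint can be expressed in a plain DTD, which treats both element contents simply as character data.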

Different XML markup languages often provide different vocabulary and grammar for semantically equivalent structures. Architectural forms, specified in the ISO HyTime standard,[16] allow DTD designers to design reusable modules and to define element types as synonyms or subtypes of other well-known element types.

SMIL (Synchronized Multimedia Integration Language)[17] is an XML-based language that allows for the creation of streaming multimedia presentations of text, sound, still and moving images.
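
A minimal sketch of a SMIL 2.0 presentation (the file names are of course only placeholders) which displays an image of a manuscript page while a recorded reading is played in parallel:

    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <body>
        <par>
          <img src="folio17r.jpg" dur="20s"/>
          <audio src="reading-folio17r.mp3"/>
        </par>
      </body>
    </smil>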

Semantic Web[18] refers to a number of interrelated XML-based research and standardization efforts which lie at the intersection of markup technology and knowledge representation. One of these enterprises is W3C's Resource Description Framework (RDF),[19] another is the ISO Topic Maps standard.[20]
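
To give the flavour of RDF in its XML serialization (the resource URI is fictitious and the use of Dublin Core properties merely illustrative), a few statements about an electronic edition might be expressed as follows:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://example.org/editions/rose.xml">
        <dc:creator>N. N.</dc:creator>
        <dc:date>2004</dc:date>
        <dc:language>en</dc:language>
      </rdf:Description>
    </rdf:RDF>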

There is also a need to allow programs and scripts written in languages other than XSL to access and update the content, structure and style of XML documents. SAX (Simple API for XML) and W3C's DOM (Document Object Model)[21] satisfy this requirement by means of an API (Application Program Interface) to the data structure that results from parsing an XML document.

But how, more precisely, does all of this relate to the needs of the humanities? In general, and as explained above, the use of openly specified, non-proprietary formats such as XML to represent humanities research material – whether source material (literary or historical texts, databases et cetera) or the results of the research itself (monographs, articles et cetera) – ensures that the documents are readable and exchangeable, without loss or distortion of information, independently of the particular hardware and software platforms used.

Because of the widespread use of XML-based technology in the public as well as the private sector, a wealth of software is available for processing XML documents. Furthermore, XML allows projects or individual scholars to create and adapt XML-based tools and applications for their own purposes, without having to rely on the industry to provide such tools for them, while still being assured that what they do can be accessed and reused, as it is based on firm international standards.[22]

In order to give some more specific indication of what XML and related technologies may mean to the humanities, let us take a closer look at a typical kind of humanities project, e.g. the creation of a critical edition on the basis of some set of source manuscripts. One of the first requirements for such a project is to design or select a DTD appropriate for the purpose. Some projects will find that they can simply apply an existing DTD, such as the TEI. Others will find that they need to customize an existing DTD or build one from scratch, and they may find that they want to exert stricter control over element and attribute content than XML itself allows. In the latter case, XML Schema may be of help.

Once the DTD has been set up, source texts can be entered using virtually any text processing tool. Some editors require markup to be typed into the texts manually; other, XML-aware editors allow markup to be selected using graphical interface elements such as menus and toolbars. Some XML editors offer WYSIWYG options, employing XSL stylesheets to format the screen presentation of the text being edited. Transcriptions are validated either continuously during input, or manually at selected intervals, thus ensuring that the result of the transcription process is always a valid XML document.

In projects like this, transcriptions are usually edited in several cycles. For example, after the first entering of the text by transcribers, others may go over the transcription adding markup for names and dates, or for dramatic, metrical or thematic features et cetera. There is often then a danger of inadvertently corrupting some of the transcription while editing other parts of it. The newly adopted XForms standard promises a solution to such problems. It allows projects to create, with relative ease, their own specialized XML editors for editing selected elements while leaving others unaffected.

XSL stylesheets will typically be designed for alternative presentations of the texts in varying degrees of detail and according to project-designed specifications. By means of these stylesheets output files can be created in PDF, PostScript or other formats for the production of high-quality print, or in HTML for presentation on the Web. Web presentations can be enriched with RDF metadata for easier retrieval and cataloguing of the resource.

RDF or topic maps can also be used for storing and linking the text resource with hypertextual and richly structured presentations of bibliographic or biographic data, which may in their turn be stored in and extracted from relational databases by means of XML interfaces based on e.g. XQuery. Associated material in the form of still or moving images and sound can be integrated into such presentations by use of SMIL or similar XML-based standards.

Thus, nearly all aspects of traditional text-critical work, as well as the traditional printed or hypertextual presentation of such material in combination with multimedial presentations of additional material, may be done entirely within a framework of XML-based standards and technologies. And should a project like this need to develop its own software, the XML format is made accessible to most major programming languages by means of the DOM and SAX APIs.

4. Recent Trends and Developments

4.1 XML Technologies

Compared to the situation before the advent of SGML, it is fair to say that the currently widespread use of generic markup represents a victory over the proprietary formats that were dominant earlier. Even the most popular word processing systems still based on such proprietary formats are now appearing in versions which include at least some support for XML (Microsoft Office 2003, StarOffice, WordPerfect).

As already mentioned, however, proprietary formats like RTF or page description languages like PostScript and PDF are also still widely in use. It is not very likely that these formats will be completely replaced by generic markup. Quite the contrary: we have seen that tools have been and are being developed for conversion of XML documents to such formats for purposes of visual presentation. On the other hand, while it is relatively easy to generate PostScript, PDF, RTF et cetera from XML, it is hard to do the conversion the other way around. Since documents stored in XML lend themselves to a number of uses other than visual presentation, it is therefore likely that XML will replace the others as the most commonly used primary representational format for documents.

That is not to say, of course, that XML itself will necessarily remain unchanged. In particular, the surrounding technology is rapidly changing and developing. As mentioned, one of the strengths of XML compared to SGML is its simplicity. Because of this simplicity, it was easy to develop extensions in the form of XML-based technologies and applications. In the five years that have gone by since XML went public, the number and variety of such extensions (some of which were mentioned in the previous section) have grown so large that it is hardly within the capacity of any individual to be in command of all aspects of these technologies. Developing software to parse an XML document was and is within the reach of a few days' work for a skilled programmer, whereas developing software that complies and keeps up to date with the ongoing changes and developments in the various surrounding technologies requires quite considerable resources.

At least two scenarios seem possible: either XML remains a narrowly defined core standard surrounded by increasingly complex related and XML-based standards and technologies, or XML itself is extended and modified to include parts of the currently surrounding technology. The first scenario carries with it a danger that the surrounding technologies will develop in incompatible and confusing ways; the second that XML loses its simplicity and itself becomes increasingly complex.

In either case, it should be clear that what happens to XML-related technology in the future is of utmost importance to anyone who tries to keep up to date with document processing technology.

4.2 The Text Encoding Initiative

The TEI Guidelines have found wide acceptance in the humanities community and are by now regarded as a major reference and used by a great number of projects within the humanities. As already mentioned, the first public version of the TEI Guidelines was published in 1994. In December 2000, a non-profit corporation called the TEI Consortium[23] was set up to maintain and develop the TEI standard. The Consortium has executive offices in Bergen, Norway, and hosts at the University of Bergen, Brown University, Oxford University, and the University of Virginia. The Consortium is managed by a Board of Directors, and its technical work is overseen by an elected Council.

One of the first actions of the TEI Consortium was to prepare and publish (in June 2002) an XML version of the Guidelines, called P4. Apart from ensuring that documents produced to earlier TEI SGML-based specifications remain usable with the new XML-based version, P4 restricted itself to error correction only. The next version, P5, is already well under way and will contain more substantial extensions and improvements to the current version. A number of TEI work groups and task forces[24] are currently working on proposals for inclusion in P5.

The Character Encoding Workgroup is adapting the TEI's handling of character sets and languages to Unicode/ISO 10646 and providing users with advice on how they may migrate to Unicode. In the current version of the TEI Guidelines, documentation of character sets and languages is handled by the so-called Writing System Declaration. With Unicode/ISO 10646, which is required by the XML recommendation, the Writing System Declaration will become obsolete. Even so, there will still be a need to declare languages and writing systems independently of each other. The Work Group's recommendations will cater for this need.

Another Work Group is charged with stand-off markup and linking issues. Stand-off markup, i.e. markup which is placed outside of the text it is meant to tag, has become increasingly widespread in recent years, particularly in linguistics applications. It has proved useful for markup of multiple hierarchies as well as in situations when the target text cannot for some reason or other itself be modified. Links which go beyond the simple linking mechanisms of HTML are desirable in many of the same situations. The current TEI Guidelines already include methods for stand-off markup. The Guidelines also contain advanced mechanisms for linking, the so-called TEI Extended Pointers, which have provided an important part of the basis for the XML XPointer draft. The TEI Stand-off and Linking Work Group attempts to modify and extend the TEI Guidelines to answer the needs of the linguistic communities, as well as to synchronize the next version of the TEI Extended Pointers with the evolving XML XPointer standard.
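
Schematically, and with invented file names, element names and pointer syntax, stand-off annotation might take the following form: the base transcription is tokenized into elements carrying identifiers, and a separate annotation document points back into it without modifying it.

    <!-- text.xml: the base transcription -->
    <s id="s1">
      <w id="w1">O</w> <w id="w2">Rose</w> <w id="w3">thou</w>
      <w id="w4">art</w> <w id="w5">sick</w>
    </s>

    <!-- annotation.xml: stand-off markup pointing into text.xml -->
    <annotations xmlns:xlink="http://www.w3.org/1999/xlink">
      <ana type="noun" xlink:href="text.xml#xpointer(id('w2'))"/>
      <ana type="pronoun" xlink:href="text.xml#xpointer(id('w3'))"/>
    </annotations>

Since each annotation document forms a hierarchy of its own, several of them may coexist over the same base text, which is one way of dealing with multiple hierarchies.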

The TEI Guidelines contain mechanisms for the encoding of linguistic annotations using feature structure formalisms. This proposal is now generally recognized as covering many needs in the field of linguistics. Natural Language Processing (NLP) applications based on this proposal have further increased interest in this aspect of the TEI within the linguistics community. As the proposal is tightly integrated with the rest of the TEI scheme, its adoption offers the prospect of opening up the application of NLP techniques to a very wide community of users, while at the same time offering the NLP community access to a real-world range of different text types and applications. The Joint ISO-TEI activity on Feature Structures works in cooperation with the International Organization for Standardization (ISO TC37/SC4) in order to synchronize efforts, to the effect that the revised TEI encoding for Feature Structures in P5 will at the same time be an ISO standard.
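
Schematically, and following the general pattern of the TEI feature structure tag set (the details of element and attribute names differ somewhat between TEI versions), a morphological analysis of a word form might be encoded along these lines:

    <fs type="morphology">
      <f name="pos"><sym value="noun"/></f>
      <f name="number"><sym value="singular"/></f>
      <f name="case"><sym value="nominative"/></f>
    </fs>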

The TEI Metalanguage Markup Workgroup may be said to deal with the conceptual as well as the logistic basics of the TEI. The TEI Guidelines are an example of literate programming, in which the documentation and the information required to build DTDs are combined in a single document. The web and print versions of the Guidelines, and the DTD modules, are all generated using a set of transformations. The Metalanguage Work Group works to simplify, document, and extend this internal literate programming language and replace existing dependencies on SGML or the DTD language. XML schema languages are being used within the markup to document markup constraints.

In consideration of the large amount of text that has been prepared according to the SGML-based TEI P3 recommendation of 1994, the TEI Consortium recognizes a responsibility for facilitating effortless transition of these documents to later XML-based TEI versions. The TEI Migration Work Group collects case studies, provides examples and gives recommendations concerning strategies as well as software and best practice on conversion of TEI documents from SGML to XML.

The above work group activities will result in proposals all or most of which will probably be included in P5, which is planned for publication in the course of 2004. P5 will also include other substantial and general changes compared to earlier versions. For example, the document grammar will be expressed in an XML schema language (Relax NG) as well as an XML DTD. The Guidelines will define a TEI namespace, facilitating inclusion of elements from other XML standards in TEI documents, and vice versa. The methods for combining various TEI DTD fragments will make use of newer and simpler mechanisms than the traditional parameter entity-based methods. Last, but not least: the TEI root element will be changed from TEI.2 to TEI.[25]

In addition to the Work Groups mentioned so far, the TEI also organizes Special Interest Groups (SIGs). SIGs reflect user community interests not yet implemented in the form of work groups, and may as such be considered candidate work groups. A quick overview of the current TEI SIGs may therefore also give some indication of the direction in which the consortium may be moving in the years to come. SIGs have been established on subjects as diverse as Manuscript transcription and description; Human Language Technologies; Training TEI Trainers; Graphics and Text; Overlapping Markup; Multilingual markup; Presentation Issues; Authoring Issues; User Interface Issues; Digital Libraries.

4.3 Beyond XML

With what has been said so far, it might seem as if generic markup today is all about XML. Even so, a number of alternative technologies have been proposed, or are under development.[26] Many of these have been developed or proposed in response to what is seen by some as weaknesses of XML.

However, before going into such purported weaknesses, let us remind ourselves of the particular strengths of XML. Considering the fact that SGML was around for more than a decade without ever coming anywhere near the success which XML enjoyed almost immediately after its release, it is tempting to ask what it was that XML added to SGML.

One answer is that XML added nothing to SGML: As mentioned, XML is a proper subset of SGML. Another answer is that what XML added was simplicity, by taking away many of the specialized features which admittedly make SGML in many ways both more expressive and more flexible, but also more complex and difficult to use than XML.

The full answer, thus, is that XML not only removed some bells and whistles, but also managed to retain the most basic and important features of SGML. So the strengths of XML are those of SGML, i.e. the tight integration and mutual support of a simple linear form (the angle bracket notation), a natural interpretation in the form of a well-known data structure (the document tree), and a powerful constraint language (the DTD).[27]

Now, to the weaknesses. Common complaints about XML are that it provides poor support for interactive, multi-medial or multi-modal documents, that it does not have a well-defined semantics (or no semantics at all), and that it does not support the encoding of overlapping hierarchies and other complex structures.[28]

The first complaint, that XML provides poor support for interactive, multi-medial or multi-modal documents was to a large extent justified not so long ago, when e.g. Macromedia Flash provided better support for interactive and multi-medial streaming data. With the latest developments within XML-based technologies and standards such as e.g. SVG, SMIL and EMMA,[29] however, this objection becomes increasingly irrelevant.

The second complaint, that XML is a purely syntactic specification and has no semantics,[30] is often countered with the claim that being a purely syntactic standard is precisely one of the strengths of XML.[31] Even so, a generally applicable formal method of expressing the semantics of particular XML-based markup systems would be of great advantage to markup translation, document authenticity verification and a number of other common tasks. Considerable progress has been made in attempts to develop a formal semantics for XML markup,[32] but much work remains to be done in this area.

Many projects have addressed the third complaint, i.e. the problem that XML does not support complex document structures. It should be noted that this problem is easily explained by the tight integration between linear form, data structure and constraint language just mentioned. XML is based on a context-free grammar, which presupposes exactly the hierarchical structures we find imposed by XML. If one were to let go of the hierarchical nesting of elements in XML documents, there would be no known way of retaining the tight control over document structure provided by the DTD mechanism as we know it.[33]

Among complex structures, overlapping hierarchies are the ones which have received most attention. Overlap is ubiquitous in documents – pages, columns and lines often overlap chapters, paragraphs and sentences in printed material, verse lines often overlap metrical lines in dramatic poetry, hypertext links and anchors overlap in hypertexts et cetera.
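
A simple, hypothetical illustration of the difficulty (the element names are invented, and the fragment is deliberately not well-formed): if a sentence runs across a page boundary, the two hierarchies cannot both be encoded as properly nested elements.

    <page n="17"><s>This sentence begins on one page</page>
    <page n="18">and ends on the next.</s></page>

Here the s element starts inside the first page element but ends inside the second, which is not well-formed XML.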

The original SGML specification actually does have a mechanism which allows for the encoding of documents as overlapping hierarchies, namely CONCUR.[34] Unfortunately, this feature suffers from certain technical complications, it has only very rarely been implemented in SGML software, and it has been entirely removed from XML.

TEI has given a lot of attention to overlapping hierarchies, and provides a number of mechanisms to deal with them, such as milestone elements, so-called virtual elements and stand-off markup.[35] These are the methods most commonly used to represent overlapping hierarchies in XML today. A general drawback with these methods is that they presuppose customized processing in order to be effective.
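
The milestone technique, for example, keeps one hierarchy (here the sentence) as ordinary elements and reduces the other (the pagination) to empty marker elements at the boundary points, in the manner of the TEI pb element:

    <s>This sentence begins on one page <pb n="18"/>and ends on the next.</s>

The page is then no longer an element with content; any processing that needs pages as units must reconstruct them from the milestones, which is the kind of customized processing referred to above.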

An example of a more radical proposal is ›Just-In-Time-Trees‹ (JITTs).[36] According to this proposal, documents may still be stored using XML, but the XML representation is processed in non-standard ways and may be mapped onto different data structures than those known from XML.

Other and yet more radical proposals, which also attempt to solve problems with complex structures beyond overlap (i.e. including discontinuous elements, alternate ordering et cetera), offer alternatives to the basic XML linear form as well as its data model and processing model.

One such approach is the MLCD project with its TexMECS notation and GODDAG data structure,[37] another is the LMNL project, also with an alternative notation and a data structure based on Core Range Algebra.[38] Both proposals claim downward compatibility with XML.

However, none of these proposals provides a constraint language for complex structures anywhere near the strength of the XML DTD mechanism.[39] As long as this problem remains unsolved, it is unlikely that these proposals will be considered serious alternatives to XML, at least by the larger community.

5. Conclusion

While there is no reason to believe that XML is forever, there is every reason to believe that generic markup has come to stay with us for a long while. In a broader perspective, a curious fact about XML should not go unmentioned: XML came from the document world, soon bridged the gap to the database world, and is now used for representation also of non-textual material as diverse as e.g. graphics, mathematical and chemical notation, and music.[40]

Thus, generic markup seems to be turning into a general tool for what might be called knowledge representation, in fields widely different from textual studies and document management. It is to be expected that these other fields will contribute considerably to solutions to known problems, as well as presenting generic markup systems with entirely new problems to tackle.

In other words, the humanities are far from the only field which has a role to play in this development. It is worth noting, however, that the humanities have already made important contributions to the development of markup systems. In this situation it is a deplorable fact that many humanities scholars still regard markup as a product of computing technology and thus a purely technical tool of no interest or concern to humanities scholarship.

The experience and expertise of textual scholars may turn out to be essential, and they have a correspondingly high responsibility to make their methods available and adapt them for use in a digital environment.

Textual scholars should not relate to markup technology as passive recipients of products from the computing industry, but rather be actively involved in the development and in setting the agenda, as they possess insight which is essential to a successful shaping of digital text technology.

Claus Huitfeldt (Bergen, Norwegen)

Prof. Dr. Claus Huitfeldt
Filosofisk institutt
Sydnesplassen 7
NO-5007 Bergen
Claus.Huitfeldt@fil.uib.no


(24. März 2004)
[1] Needless to say, this account is constrained by the perspective and the limited knowledge of the author. My knowledge of markup is based primarily on experience from the work of the Text Encoding Initiative <http://www.tei-c.org/> (22.1.2004), the Wittgenstein Archives <http://www.aksis.uib.no/projects/wab> (22.1.2004), and research on problems concerning markup of complex documents <http://www.aksis.uib.no/projects/mlcd> (22.1.2004). – Many thanks to Michael Sperberg-McQueen (World Wide Web Consortium), Sebastian Rahtz (Oxford University), Ralph Jewell (University of Bergen) and Tone Merete Bruvik (Aksis, Bergen) for their comments and advice during my work on this article, for the shortcomings of which they are of course in no way responsible.
[2] This is the view expressed in one of the most influential articles written on markup theory, an article to which also the title of the present text alludes: James H. Coombs/Allen H. Renear/Steven J. DeRose: Markup Systems and the Future of Scholarly Text Processing. In: Communications of the ACM 30/11 (1987), pp. 933-947.
[3] This use of the term ›presentational‹ is not strictly in accordance with the taxonomy given in J. H. Coombs et al., where the visual layout itself is what is considered ›presentational markup‹. It has become customary, however, to use ›presentational markup‹ to refer to markup which records (or ›is about‹) visual layout.
[4] SGML: Information Processing – Text and Office Systems – Standard Generalized Markup Language (SGML), ISO 8879-1986, Geneva: International Organization for Standardization 1986.
[5] Cf. J. H. Coombs et al.: Markup Systems (footnote 2).
[6] TEI P3: C. Michael Sperberg-McQueen/Lou Burnard (Eds.): TEI P3: Guidelines for Electronic Text Encoding and Interchange. Chicago/Oxford/Providence/Charlottesville/Bergen: ACH-ACL-ALLC 1994.
[7] A. H. Renear/David Dubin/C. Michael Sperberg-McQueen/Claus Huitfeldt: XML Semantics and Digital Libraries. In: Catherine C. Marshall/Geneva Henry/Lois Delcambre (Eds.): Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, Houston, May 2003. New York: Association for Computing Machinery 2003, pp. 303-305.
[8] The World Wide Web Consortium <http://www.w3.org/XML/> (22.1.2004).
[9] The World Wide Web Consortium <http://www.w3.org/MarkUp/> (22.1.2004).
[10] TEI P4: C. Michael Sperberg-McQueen/Lou Burnard (Eds.): TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML Version. Oxford/Providence/Charlottesville/Bergen: Text Encoding Initiative Consortium 2002.
[11] The World Wide Web Consortium <http://www.w3.org/Style/XSL/> (22.1.2004).
[12] The World Wide Web Consortium <http://www.w3.org/XML/Query> (22.1.2004).
[13] The World Wide Web Consortium <http://www.w3.org/XML/Linking> (22.1.2004).
[14] The World Wide Web Consortium. See <http://www.w3.org/MarkUp/Forms/> (22.1.2004).
[15] The World Wide Web Consortium <http://www.w3.org/XML/Schema> (22.1.2004).
[16] Gary F. Simons: Using architectural forms to map TEI data into an object-oriented database. In: Computers and the Humanities 33/1-2 (1999), pp. 85-101 and ISO/IEC 10744:1997: Information processing – Hypermedia/Time-based Structuring Language (HyTime), 2nd ed. International Organization for Standardization, Geneva, May 1997, appendix A.3 Architectural Form Definition Requirements.
[17] The World Wide Web Consortium <http://www.w3.org/AudioVideo/> (22.1.2004).
[18] Tim Berners-Lee/James Hendler/Ora Lassila: The Semantic Web. In: Scientific American 284/5 (2001), pp. 35-43.
[19] The World Wide Web Consortium <http://www.w3.org/RDF/> (22.1.2004).
[20] Michel Biezunski/Martin Bryan/Steven R. Newcomb (Eds.): ISO/IEC 13250: 2000 Information technology – SGML Applications – Topic Maps. Geneva: International Organization for Standardization 2000.
[21] The World Wide Web Consortium <http://www.w3.org/DOM/> (22.1.2004).
[22] In practice XML and XML-based technologies such as XSL, XQuery etc. may be regarded as de facto industry standards. It should be noted, however, that they are so-called W3C ›recommendations‹ and not ISO standards.
[23] TEI Consortium: <http://www.tei-c.org> (22.1.2004).
[24] TEI Consortium: <http://www.tei-c.org/Activities/> (22.1.2004).
[25] One immediate effect is that the downward compatibility between different versions will be broken – any P4 (or earlier) TEI document will ipso facto be invalid in P5. However the TEI will continue to maintain P4 for any foreseeable future, and provide help and guidance in converting documents from earlier to later versions.
[26] See e.g., Steven J. Murdoch: Markup Language Survey <http://www.cl.cam.ac.uk/users/sjm217/projects/markup/survey/> (22.1.2004).
[27] C. Michael Sperberg-McQueen: »What matters?« Extreme Markup Languages 2002. Montreal/Canada, August <http://www.w3.org/People/cmsmcq/2002/whatmatters.html> (22.1.2004).
[28] By ›complex structures‹ I refer to such structural phenomena as overlapping elements, overlapping hierarchies, discontinuous elements, multiple alternative ordering of elements, structured attributes etc. – in short ›complex structure‹ is here admittedly defined simply as any structure not straightforwardly representable in SGML/XML. Cf. <http://www.aksis.uib.no/projects/mlcd> (22.1.2004).
[29] The World Wide Web Consortium <http://www.w3.org/Graphics/SVG/>, <http://www.w3.org/AudioVideo/> and also <http://www.w3.org/TR/EMMAreqs/> (each 22.1.2004).
[30] This may seem confusing in relation to another claim which is also often made, namely that XML is semantic markup. Unfortunately, the term ›semantic‹ in such contexts seems to have been confused with the more appropriate terms ›descriptive‹ or ›declarative‹. The point is that XML provides syntax, but no vocabulary, and thus no semantics. – Another source of confusion is that XML is sometimes used to represent semantics, e.g., in RDF, TopicMaps and other XML-based semantic web activities. In these cases, however, XML is used as a tool to represent the semantics of some subject matter other than XML. The various semantic web activities do not in general try to provide XML itself with a semantics.
[31] See e.g., Tim Bray: On Semantics and Markup <http://www.tbray.org/ongoing/When/200x/2003/04/09/SemanticMarkup> (22.1.2004).
[32] See for example, the BECHAMEL project: David Dubin/C. Michael Sperberg-McQueen/Allen Renear/Claus Huitfeldt: A logic programming environment for document semantics and inference. In: Literary and Linguistic Computing 18/2 (2003), pp. 225-233. (This is a corrected version of an article that appeared in 18/1, pp. 39-47). – At the risk of making confusion complete, it should still be mentioned that this formal semantics may in turn be represented in e.g., RDF or TopicMaps, although the BECHAMEL project currently uses other forms of representation.
[33] C. Michael Sperberg-McQueen: »What matters?« (footnote 27).
[34] C. Michael Sperberg-McQueen/Claus Huitfeldt: Concurrent document hierarchies in MECS and SGML. In: Literary and Linguistic Computing 14/1 (1999), pp. 29-42.
[35] David Barnard et al.: Hierarchical Encoding of Text: Technical Problems and SGML Solutions. In: Computers and the Humanities 29/3 (1995), pp. 211-231.
[36] Patrick Durusau: Just-In-Time-Trees (JITTs), see e.g., <http://sbl-site2.org/Extreme2002/> (22.1.2004).
[37] C. Michael Sperberg-McQueen/Claus Huitfeldt: Markup Languages for Complex Documents, see <http://www.aksis.uib.no/projects/mlcd> (22.1.2004).
[38] Jeni Tennison/Wendell Piez: The Layered Markup and Annotation Language (LMNL) <http://xml.coverpages.org/LMNL-Abstract.html> (22.1.2004).
[39] XML Schema allows for the expression of some context-sensitive constraints on XML documents. However providing a general constraint language for complex structures is another and more demanding task.
[40] For graphics, see e.g., SVG <http://www.w3.org/Graphics/SVG/> (22.1.2004), for mathematical notation MathML <http://www.w3.org/Math/> (22.1.2004), for chemical notation CML (Chemical Markup Language) <http://xml.coverpages.org/cml.html> (22.1.2004), for music <http://xml.coverpages.org/xmlMusic.html> (22.1.2004).