Markup on the web -- from TEI to XML

John A. Lehman,
Professor of Accounting and Information Systems,
Professor of International Business
School of Management,
University of Alaska Fairbanks
Fairbanks, AK 99775, USA, (907) 474-6275
ffjal@aurora.alaska.edu

Lisa M. Lehman,
Associate Professor of Information Science,
Rasmuson Library,
University of Alaska Fairbanks
Fairbanks, AK 99775, USA,
(907)474-7403
lisal@muskox.alaska.edu

Abstract

Over the last several years, most computer-based full-text projects have standardized on a family of SGML DTDs developed as part of the international Text Encoding Initiative (TEI). With the recent proposals to replace HTML with XML as the underlying technology of the www, participants in and managers of full-text projects have been confused about the relationships between SGML, XML, and TEI. This paper explains those relationships and discusses the probable impact on full-text projects of the more general acceptance of XML. Specifically, it provides guidelines for managers of full-text projects, and for library and information services executives, on the planning and execution of full-text projects in a changing technical environment.

Background

Since the 1960s, users of computers for text processing have struggled with incompatibilities and obsolescence arising from proprietary hardware and software. IBM's original solution, the Generalized Markup Language (GML), developed into the internationally standardised SGML (Standard Generalized Markup Language, ISO 8879) [Goldfarb 1990]. Because SGML is intended as a general-purpose approach able to deal with any type of document structure, it does not define specific structural elements itself. Rather, a separate Document Type Definition (DTD), written in SGML, describes the structure of a type of document. Various industry-standard DTDs exist; the best known is Hypertext Markup Language (HTML).
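As a rough illustration of this separation between the markup language and the document type, the following sketch declares a small DTD and reads its declarations back. Several things here are assumptions made for the example: the "memo" document type and its elements are hypothetical, the DTD is written in XML syntax rather than full SGML, and Python is used only because its standard library includes the non-validating expat parser.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

# A hypothetical "memo" document type. The declarations between [ and ]
# are the DTD; the markup after them is one document that follows it.
MEMO = """<!DOCTYPE memo [
  <!ELEMENT memo (to, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<memo><to>All staff</to><body>The archive moves on Friday.</body></memo>
"""


def declared_elements(source):
    """Return the element names declared in the document's internal DTD."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls this handler once per <!ELEMENT ...> declaration.
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    parser.Parse(source, True)
    return names


def used_elements(source):
    """Return the set of element names that actually occur in the document."""
    return {element.tag for element in ET.fromstring(source).iter()}


if __name__ == "__main__":
    print(declared_elements(MEMO))
    print(used_elements(MEMO) <= set(declared_elements(MEMO)))
```

The comparison covers element names only. It does not verify content models such as (to, body); that is the job of a validating parser, which the Python standard library does not provide.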

HTML was originally developed as a very simple SGML Document Type Definition (DTD) for markup of academic papers. It became the markup standard for the World Wide Web more or less by accident. The vast majority of web browsers interpret HTML directly rather than treating it as a DTD to be interpreted through SGML. There has long been a consensus that HTML is insufficient both for layout and for marking up complex documents. Various vendors, especially Microsoft and Netscape, have introduced proprietary extensions aimed primarily at HTML's limitations as a layout language. These extensions have led to incompatibilities between browsers, defeating the purpose of a standard, non-proprietary markup.

For most of this decade, projects involving computerization of textual resources have followed one of two separate but related paths. Projects where the structure of the texts was relatively simple, or where quasi-universal access was important, have used the World Wide Web and its underlying HTML Document Type Definition. Projects where the structure of the text was complex, and where special software with limited availability was acceptable, have used the Standard Generalized Markup Language (SGML), usually in association with an industry-standard Document Type Definition (DTD).

Elementary logic (Figure 1) tells us that the above decision rules fail for many projects. Specifically, texts with complicated structures which require widespread access are not well served by the alternatives which have developed.

Access    | Simple structure | Complex structure |
--------------------------------------------------
Universal | HTML             | ??                |
--------------------------------------------------
Limited   | HTML             | SGML              |
--------------------------------------------------
Figure 1
Document structures and solutions

The Extensible Markup Language (XML) is an attempt to deal with the limitations of HTML and with the limited distribution of SGML browsers. XML is a project of the World Wide Web Consortium (W3C); development of the specification is being supervised by their XML Working Group. Basically, XML is a subset of SGML designed for network use. The goals of the project are:

  1. To deal with complex documents on the www via a standards-based rather than a proprietary mechanism,
  2. To provide for the use of DTDs to specify document structure rather than trying to build all possibilities into a web browser,
  3. To preserve the investment which users of SGML have made in their text collections.

This last goal leads to an important question for managers of text projects: to what extent can their existing archives be viewed using XML browsers; in other words, to what extent are SGML-based textual archives web-ready?

Most humanities text archives created in the last several years have standardised on some variant of the TEI (Text Encoding Initiative) DTD, most commonly on the variant known as TEI-lite. Ironically, one of the major problems faced by the compilers of such archives has been the failure of SGML software to reach a mass market, leading to a serious lack of viewing software. The generally recommended solution to date has been to store the text in TEI/SGML, but to translate it on the fly to HTML for viewing. This solution has been less than satisfactory from both a technical and a cost standpoint.
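The store-in-TEI, view-in-HTML approach can be sketched as a simple element-by-element mapping. The sketch below is an illustration under stated assumptions, not a description of any project's actual converter: the mapping table is hypothetical, only a handful of TEI-lite element names (div, head, p, hi, lg, l) are covered, the sample text is invented, and the input is assumed to be well-formed XML rather than SGML with omitted tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a few TEI-lite element names to HTML tags.
TEI_TO_HTML = {
    "div": "div",
    "head": "h2",
    "p": "p",
    "hi": "em",
    "lg": "blockquote",
    "l": "div",
}

# Invented sample text, marked up with TEI-lite element names.
SAMPLE_TEI = (
    "<div><head>Chapter 1</head>"
    '<p>The river froze <hi rend="italic">early</hi> that year.</p></div>'
)


def tei_to_html(node):
    """Recursively copy a TEI element into an HTML element tree.

    Elements missing from the table become generic spans, and
    attributes are dropped.
    """
    html = ET.Element(TEI_TO_HTML.get(node.tag, "span"))
    html.text = node.text
    html.tail = node.tail
    for child in node:
        html.append(tei_to_html(child))
    return html


def render(tei_source):
    """Translate a TEI fragment to an HTML string, as a CGI script might."""
    root = tei_to_html(ET.fromstring(tei_source))
    return ET.tostring(root, encoding="unicode", method="html")


if __name__ == "__main__":
    print(render(SAMPLE_TEI))
```

Anything not in the table degrades to a generic span, and attributes such as rend are dropped, so part of the structural information encoded in the archive never reaches the reader.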

The development of XML as a technical basis for the www has gone far towards solving this problem. The TEI-lite DTD already exists in XML (http://www.loria.fr/~bonhomme/xml.html), and TEI has recently chartered a workgroup with the task of converting full TEI to XML. Thus, the short answer to the question posed above is that any archive marked up using TEI-lite can be viewed in an XML browser today.

Unfortunately, the provision of an XML browser which can interpret TEI does not entirely solve the problem. While casual users may be content to browse, few readers seem to prefer reading their texts on screen. Scholarly use of computer-based texts requires that the texts be available for computer-based analysis, and a browser does not provide these facilities except for simple searches.

The typical browser search mechanism has three major shortcomings. The first is that it is limited to simple pattern matching. The second is that it can search on text contents, but not on indexes and keywords. The third is that it has difficulty dealing with corpora rather than single texts. For all of these reasons, scholarly users require more sophisticated facilities.
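The difference between plain pattern matching and a search that uses the markup can be sketched briefly. The example below is a minimal illustration under stated assumptions: the two-text corpus is invented, the element names (text, head, sp, speaker, l) follow TEI-lite usage for drama, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# An invented two-text corpus, keyed by a hypothetical text identifier.
CORPUS = {
    "play_a": (
        "<text><head>The Storm</head>"
        "<sp><speaker>Anna</speaker><l>The storm is near</l></sp>"
        "<sp><speaker>Ben</speaker><l>No storm will come</l></sp></text>"
    ),
    "play_b": (
        "<text><sp><speaker>Anna</speaker><l>Calm seas today</l></sp></text>"
    ),
}


def plain_count(corpus, word):
    """Browser-style search: count raw occurrences, ignoring structure."""
    return sum(source.lower().count(word.lower()) for source in corpus.values())


def lines_by_speaker(corpus, speaker, word):
    """Structure-aware search: verse lines by one speaker, across all texts."""
    hits = []
    for title, source in corpus.items():
        root = ET.fromstring(source)
        for speech in root.iter("sp"):
            if speech.findtext("speaker") != speaker:
                continue
            for line in speech.iter("l"):
                text = "".join(line.itertext())
                if word.lower() in text.lower():
                    hits.append((title, text))
    return hits


if __name__ == "__main__":
    print(plain_count(CORPUS, "storm"))
    print(lines_by_speaker(CORPUS, "Anna", "storm"))
```

Here the plain search reports three occurrences of "storm", including the heading and another character's line, while the structure-aware search returns only the single line spoken by Anna, together with the text it came from.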

The standard approach to providing such facilities on the current www is to leave the browser, using either CGI or Java, and to write a separate program for analysis. The most common approach is to write one or more CGI scripts in the Perl language. Currently each Perl script must incorporate knowledge of the structure of the document it analyses, which is the very approach that standards-based text encoding was designed to avoid.

This problem should be largely solved within the next year or so, as Perl is presently being rewritten to recognize and interpret XML DTDs. Thus, future scholars should be able to use much more general-purpose tools for searching and analysing texts than is possible today.
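What a DTD-aware, general-purpose tool might look like can be sketched as follows. This is an illustration only, written in Python rather than Perl, and it says nothing about how an XML-aware Perl will actually work. The two document types (poem and letter), their contents, and the function names are hypothetical. The point is that one search routine serves both, because it learns the permitted element names from each document's own DTD instead of having them written into the script.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

# Two invented documents of different, hypothetical types.
POEM = """<!DOCTYPE poem [
  <!ELEMENT poem (title, l+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT l (#PCDATA)>
]>
<poem><title>Winter</title><l>Snow on the river</l><l>Ice on the road</l></poem>"""

LETTER = """<!DOCTYPE letter [
  <!ELEMENT letter (salute, p+)>
  <!ELEMENT salute (#PCDATA)>
  <!ELEMENT p (#PCDATA)>
]>
<letter><salute>Dear Editor</salute><p>The river froze early.</p></letter>"""


def declared_elements(source):
    """Read the element names declared in the document's internal DTD."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    parser.Parse(source, True)
    return names


def search(source, element, word):
    """Find `word` inside `element` for any document that carries its DTD."""
    if element not in declared_elements(source):
        raise ValueError(f"{element!r} is not declared in this document type")
    matches = []
    for node in ET.fromstring(source).iter(element):
        text = "".join(node.itertext())
        if word.lower() in text.lower():
            matches.append(text)
    return matches


if __name__ == "__main__":
    print(search(POEM, "l", "river"))
    print(search(LETTER, "p", "river"))
```

Asking the poem for an element it does not declare, such as p, raises an error instead of silently returning nothing, which is the kind of check a script with hard-coded structure cannot make.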

Conclusion

Earlier it was pointed out that full-text projects need to deal with complex documents on the www via a standards-based rather than a proprietary mechanism, to provide for the use of DTDs to specify document structure rather than trying to build all possibilities into a web browser, and to preserve the investment which users of SGML have made in their text collections by enabling the use of current DTDs such as the TEI family. Furthermore, it was pointed out that more generalizable methods for search and analysis were required. The combination of XML and an XML-aware Perl appears to meet these requirements in a way which will require no change to text projects for viewing on the next version of the www, and which will provide a far more capable set of analysis tools than has been available heretofore.

References

Charles Muller, "Some Basic Guidelines on the Minimal Preparation of Humanities Articles for Conversion to HTML Presentation" <http://www.acmuller.gol.com/HTMLarticles.htm>

http://www.jtauber.com/xml/

http://lcweb.loc.gov/global/etext/etext.html

SunSITE Digital Collections http://sunsite.berkeley.edu/Collections/

Peter Flynn et al, The XML FAQ Version 1.21 (3 February 1998) http://www.ucc.ie/xml/