Textual Encoding for Humanities Research

The Text Encoding Initiative

The Thomas MacGreevy Hypertext Chronology is committed to encoding principles based on the Text Encoding Initiative (TEI). TEI is 'an international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research, and to satisfy a broad range of uses by the language industries more generally.'

TEI was developed in the late 1980s to enable better interchange and integration of data in the humanities. It aims to support all genres of text in all languages and across all periods. It provides guidance on what and how to encode for 'loss-free platform-independent interchange of information'.

TEI is sponsored by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). It has international support from groups including Directorate XIII of the Commission of the European Communities (CEC/DG-XIII), the U.S. National Endowment for the Humanities (NEH), the Andrew W. Mellon Foundation, and the Social Sciences and Humanities Research Council of Canada.

Standard Generalised Markup Language

TEI is an application of Standard Generalised Markup Language (SGML). SGML is an ISO standard (ISO 8879:1986) notation for defining generalised markup languages. In other words, despite its name, SGML is not actually a language but a meta-language. An SGML Document Type Definition, or DTD, defines the rules for marking up a type of document, be it a journal article, a newsletter, a letter, and so on. Each SGML-marked-up document must contain, or refer to, the relevant DTD.
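The idea of a DTD as a set of rules for one type of document can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names and the LETTER_RULES table are hypothetical, the sample is written as XML because Python's standard library has no SGML parser, and a real DTD also governs ordering, repetition and attributes, which this check ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for a "letter" document type: each element name maps to
# the set of child elements it may contain. This is NOT real DTD syntax; it
# only checks which element may appear inside which.
LETTER_RULES = {
    "letter": {"sender", "date", "body"},
    "sender": set(),
    "date": set(),
    "body": {"p"},
    "p": set(),
}


def find_violations(element, rules):
    """Return messages for every child element the rules do not allow."""
    if element.tag not in rules:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in rules[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(find_violations(child, rules))
    return problems


# An invented letter that follows the rules, and one that does not.
valid = ET.fromstring(
    "<letter><sender>A. Writer</sender><date>1 May 1930</date>"
    "<body><p>Dear friend,</p></body></letter>"
)
invalid = ET.fromstring("<letter><body><date>1 May 1930</date></body></letter>")

print(find_violations(valid, LETTER_RULES))
print(find_violations(invalid, LETTER_RULES))
```

The first document produces no messages; the second is reported because a date appears inside the body, which the hypothetical rules do not permit.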

Among the applications of SGML, the most widely known is Hypertext Markup Language (HTML), the native language of the World Wide Web. HTML is a relatively simple application with a restricted tag set, designed specifically for Web distribution and access.

HTML describes the visual appearance of the documents it is used to define; the markup is directly translated into physical features by HTML browsers such as Netscape and Microsoft Internet Explorer. In general, however, SGML-based languages, for example TEI, describe the logical structure and meaning of documents. So, for example, SGML might be used to identify components such as the title, author and section headings of a piece of prose but there might be no indication as to where to display these components on the screen or what fonts to use.
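A short Python sketch may make the distinction concrete. The element names (essay, title, author, section, head) are invented for illustration and are not taken from the TEI tag sets, and the sample is XML so that Python's standard library can parse it. The markup says what each component is, and says nothing about fonts or position on the screen.

```python
import xml.etree.ElementTree as ET

# A hypothetical piece of prose marked up by logical role only.
SAMPLE = """
<essay>
  <title>On Encoding</title>
  <author>A. N. Author</author>
  <section><head>Background</head><p>First paragraph.</p></section>
  <section><head>Method</head><p>Second paragraph.</p></section>
</essay>
"""

root = ET.fromstring(SAMPLE)

# Because components are labelled by what they are, they can be retrieved
# by role without knowing anything about how they will be displayed.
title = root.findtext("title")
author = root.findtext("author")
headings = [head.text for head in root.iter("head")]

print(title)
print(author)
print(headings)
```

The same document could be printed, displayed on screen, or indexed by author and title, with each use deciding for itself how the components should look.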

Because SGML-based languages, with the exception of HTML, do not generally describe appearance, an application is required to convert the documents in question to a suitable viewing format. This application could be an SGML viewer such as SoftQuad's Panorama for Netscape. Panorama uses style sheets to relate logical markup to physical appearance.

Alternatively, SGML might be converted to another format for display. For example, the CELT (Corpus of Electronic Texts) project at University College Cork, which provides 'an online resource for contemporary and historical Irish documents in literature, history and politics', encodes electronic documents using the TEI DTD. The documents are then converted to HTML, and both the TEI and HTML versions are made available to the end-user. This makes it possible to maintain a corpus of richly marked-up texts while also making them easily accessible using more widely available Web technology.
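The following Python sketch shows the general shape of such a conversion. It is not the CELT project's method, which is not described here; the STYLE table and the element names are assumptions made for illustration, and the input is XML rather than SGML so that the standard library can read it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical "style sheet": logical element name -> HTML element name.
# Two different logical elements (emph, title) share one appearance (italics).
STYLE = {
    "text": "div",
    "head": "h2",
    "p": "p",
    "emph": "i",
    "title": "i",
}


def to_html(element, style):
    """Recursively rewrite a logically marked-up element as an HTML string."""
    tag = style.get(element.tag, "span")  # unmapped elements become <span>
    parts = [f"<{tag}>", escape(element.text or "")]
    for child in element:
        parts.append(to_html(child, style))
        parts.append(escape(child.tail or ""))  # text following the child
    parts.append(f"</{tag}>")
    return "".join(parts)


source = ET.fromstring(
    "<text><head>Reading</head>"
    "<p>She <emph>never</emph> finished <title>Ulysses</title>.</p></text>"
)
html = to_html(source, STYLE)
print(html)
# <div><h2>Reading</h2><p>She <i>never</i> finished <i>Ulysses</i>.</p></div>
```

The conversion loses information: in the HTML output the emphasised word and the book title are indistinguishable, which is one reason for keeping the richly marked-up source alongside the HTML version.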

SGML tags are composed of ASCII data; in other words, they are not proprietary to any particular computer program. David Seaman of the Electronic Text Center, University of Virginia, describes the difference from proprietary formats succinctly:

SGML tags comprise of ASCII data only; they are not proprietary to a particular computer program. This sets them apart from - say - the codes in a WordPerfect document, which belong to, and are meaningful only within, the WordPerfect program. And while the WordPerfect code defines something by its visual appearance - a word is italicized - SGML is designed to describe the class of information to which the phrase belongs. Italics can be used for a variety of purposes, and most SGML tagsets can clearly delineate an emphatic word from a book's title or a chapter heading.

SGML in Text Encoding Initiative Research

SGML was chosen as the vehicle for the TEI Guidelines, and the main deliverable of the TEI project is an SGML DTD containing over four hundred textual feature definitions, along with associated documentation and examples. These feature definitions are grouped into 'tag sets' of various kinds, for various encoding applications. The TEI tag sets are both comprehensive and extensible. They are documented in a 1,400-page reference manual, the Guidelines for Electronic Text Encoding and Interchange.

All TEI-conformant texts, that is, texts marked up in accordance with the TEI Guidelines, include a TEI 'header' plus the marked-up body of the text. The header provides information analogous to that provided by a title page of a printed document. It consists of four sections:

  • A bibliographic description of the text
  • A description of how the text has been encoded
  • A non-bibliographic description of the text (e.g. languages used, participants in production, etc.)
  • A revision history
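The four sections listed above correspond to four elements inside the header, which the TEI Guidelines name fileDesc, encodingDesc, profileDesc and revisionDesc, all contained in teiHeader. The Python sketch below builds a skeleton header and checks that all four are present. The content of each section is placeholder text invented for illustration, the structure is simplified, and the sample is not guaranteed to validate against the TEI DTD.

```python
import xml.etree.ElementTree as ET

# Skeleton header with invented placeholder content in each section.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample letter: an electronic transcription</title></titleStmt>
    <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a hypothetical manuscript.</p></sourceDesc>
  </fileDesc>
  <encodingDesc><p>Line breaks of the source are not recorded.</p></encodingDesc>
  <profileDesc><langUsage><language>English</language></langUsage></profileDesc>
  <revisionDesc><change>First draft of the header.</change></revisionDesc>
</teiHeader>
"""

# Header element name -> the section of the list above that it holds.
SECTIONS = {
    "fileDesc": "bibliographic description of the text",
    "encodingDesc": "description of how the text has been encoded",
    "profileDesc": "non-bibliographic description of the text",
    "revisionDesc": "revision history",
}

header = ET.fromstring(HEADER)
present = [child.tag for child in header]
missing = [name for name in SECTIONS if header.find(name) is None]

for name in present:
    print(f"{name}: {SECTIONS[name]}")
print("missing sections:", missing)
```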

The TEI header is sometimes used independently of the TEI tag sets, as a source of bibliographic information for a document which itself may not be encoded using them.

Extensible Markup Language

One of SGML's greatest assets, and one of its greatest drawbacks, is its comprehensiveness as a meta-language: it enables the definition of very complex languages incorporating features which are rarely used and difficult to understand.

HTML, on the other hand, with its small tag set, was never designed for encoding complicated information of the kind required in humanities computing. Thus, in 1996, the W3C (World Wide Web Consortium) announced 'an extremely simple dialect of SGML' called Extensible Markup Language, or XML.

The goal of XML is 'to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.' XML was not initially intended to take the place of HTML but to work alongside it. However, it has proved to offer such significant advantages over the fixed tag set of HTML that it is now predicted to take over from HTML in Web publishing within the next five years.

An XML version of the TEI DTD is currently being developed; only minor changes to the DTD are required to ensure XML compatibility.


W3C (World Wide Web consortium) HTML

Cover, Robin 'The SGML/XML Web Page'. URL:

Khare, Rohit and Adam Rifkin 'XML: A Door to Automated Web Applications' IEEE Internet Computing 1:4 (July-August 1997), pp. 78-79.

Seaman, David 'Guidelines for SGML Text Mark-up at the Electronic Text Center'. URL:

Sperberg-McQueen, C. M. and Lou Burnard, Editors 'A Gentle Introduction to SGML' Chapter 2 of Guidelines for Electronic Text Encoding and Interchange (TEI P3). URL:

The World Wide Web Consortium, 'SGML, XML, and Structured Document Interchange'. URL:

XML - Frequently Asked Questions. URL:

TEI Home Page. URL:

Judith Wusteman, Department of Library and Information Studies, University College Dublin, Belfield, Dublin 4
Susan Schreibman, Department of Modern English, University College Dublin, Belfield, Dublin 4