Thesaurus Linguae Graecae: A Digital Library of Greek Literature

1. GOALS AND SIGNIFICANCE

The Thesaurus Linguae Graecae (TLG), a research center at the University of California, Irvine, proposes a three-year plan to design and implement a search and retrieval system to facilitate Internet access to the extensive bibliographic and textual materials in its data bank. The project will focus on two primary areas: at the back end, it will specify a new document structure, explore new encoding techniques, and use metadata to both enhance the content of data and create new opportunities and capabilities to serve existing and new user communities; at the front end, it will offer customized (adaptive) query and browser facilities to support intelligent access to information and to facilitate collaboration in both the search process and sharing of scholarly information. With a large and diverse digital collection already in place, a world-wide community of regular users, and over twenty-five years of experience in the production of digital texts, the TLG already constitutes a technically and financially sustainable resource and is uniquely positioned to serve as a testbed for the technical and organizational challenges of future digital libraries in the Humanities.

2. INTRODUCTION

2.1. Description and history of the TLG Project

For more than twenty-seven years, the TLG has been developing a digital library containing all extant works of Greek literature from Homer to the Byzantine era. This corpus encompasses Homer, classical Greek philosophy and drama, the New Testament, writings of the Eastern Church Fathers, and the Justinian legal corpus, texts which have helped define western literature and thought. The importance of the project is not limited to the specific collection of texts it has assembled. The TLG is also known for its pioneering role in introducing the concept of computer-aided research in the Humanities and for developing encoding and formatting conventions and methodologies that have minimized the centuries-old need for laborious and time-consuming manual data-collection in advance of the analytical scholarly process.[1]

Today the TLG collection contains some 3,300 authors and 10,000 works comprising more than 75 million words. It is disseminated in CD ROM format in more than 50 countries world-wide and used by thousands of specialists and non-experts who are interested in the ancient world. TLG users include researchers, educators, and students from a wide range of disciplines such as classics, archaeology, history, art history, philosophy, linguistics, and theology/religious studies in more than 2,000 universities and research centers around the world.

The advantages of digitizing such a collection are obvious. Fast retrieval of information and preservation of the source materials from decay and oblivion are only two of TLG's contributions. Access to TLG materials has helped scholars conduct research more effectively and has served as a stimulus to scholarship, enhancing the quality of research and publication in a number of fields. Classicists today are more willing to research textual questions that were too laborious to pursue in the past without computerized resources. The ability to browse texts that previously were found only in remote libraries or to locate, in just a few minutes, all the occurrences and uses of a term in the digital equivalent of hundreds of thousands of printed pages has allowed researchers to concentrate on critical analysis and interpretation and to deepen their understanding of these texts.

2.2. Project objectives

Recent technological advances provide us with the tools to go even further and make us optimistic that the benefits the TLG offers can be further expanded and enhanced. Web access will allow users to search the TLG faster, more efficiently, and in conjunction with other Web projects, such as the Perseus multimedia library and lexicographic resources, the Duke Data Bank of Documentary Papyri, and the wealth of journals and other materials that are being made available over the Internet.[2] On-line access to such diverse resources will encourage collaboration among traditional scholars, enhance the learning experience of our students, and attract new audiences to the study of the humanities. It will also reinforce the potential of the digital library to provide added value. Although we do not believe we can help non-specialists read the Greek original, we expect that intelligent and simultaneous on-line access to the original texts together with translations, bibliographies, and dictionaries will remove the "intimidation" factor of the printed original and encourage far more people (including beginning students and individuals interested in our subject) to search through these resources.

Traditionally the TLG has concentrated on expanding its collection and providing content. Until recently, software development had not been a priority for the project; in fact all TLG search engines have been produced by independent software developers.[3] Although the mandate of the TLG to create a digital collection of all extant Greek literature has not changed, we see technology increasingly becoming an integral part of our work. The TLG is presently advancing along two interdependent paths. One is the expansion of the digital collection through the addition of new works and the updating of older editions.[4] The second path deals with text encoding and software development and includes:

* restructuring of the data using new and evolving methodologies of text encoding;

* developing the tools to process the extensive textual, bibliographic, and lexical information already available in the TLG data bank;

* designing a user interface that is conducive to the needs of users, both specialists and non-specialists.

Data structuring and proper text encoding will allow us to declare the internal structure of the documents in an explicitly formalized and widely used format, and develop search engines and query feedback techniques to process both content-based and structurally based queries. This is a necessary step because, despite its massive collection of digital texts, the TLG presently lacks a cognitively based search and retrieval mechanism. Existing retrieval tools have been created by independent software developers, mostly classicists.[5] These retrieval tools allow information recall by simple string and substring searches, keyword matching, or fixed indexing. Such searches, however, lack precision and presume that the user knows exactly what to look for and is prepared to invest the time and effort to filter out extraneous data. It should also be noted that existing retrieval software has been developed to search the texts in conjunction with the TLG CD ROM.[6]

The limitations of the existing retrieval tools can be attributed to a number of factors. The problem resides partly in the textual data itself, since TLG texts do not contain the markup necessary to facilitate complex and precise searches. Established computational techniques that work well for relatively uninflected languages, such as English and French, are much less effective in Greek, where a single vocabulary entry often generates hundreds of different forms. Furthermore, the TLG data bank contains many idiosyncrasies (e.g., words broken by hyphenation, fragmentary words from papyrus fragments, citation systems that vary from work to work, etc.), which compound the linguistic problems. Our aim will be to define and build a database architecture which can effectively deal with the particular linguistic and formatting issues of a large multilingual collection and to work with the user community to define a user interface that will assist specialists in conducting their research and enable users without proficiency in the Greek language to access the resources of the collection.

2.3. Broader significance of the project

Digital libraries employ technology-mediated structures to reduce the barriers of distance and time, support intelligent representation and retrieval of information, and facilitate research and learning. This is exactly the focus of the proposed system. The TLG brings three important assets into the ongoing digital libraries research: a corpus of unprecedented comprehensiveness and scope, comprising an extensive collection of intricate data; more than twenty-seven years of experience and expertise in large-scale data preservation and management; and a large, committed, and multidisciplinary base of users. Beyond the obvious subject-specific benefits, this project can provide a prototype for other digital collections of literary works and an excellent test case to measure the digital library's potential to deliver enhanced value. More specifically, this project will test:

* the ability of the digital library to surpass the functionality of traditional libraries and make highly specialized materials accessible to non-experts;

* the benefits of using SGML/XML to structure a large and complex collection of (non-commercial) textual materials, and to improve search and retrieval mechanisms;


* the capacity of Unicode to standardize and facilitate access to a diverse collection of non-Roman alphabet materials.

3. WHERE WE ARE NOW

3.1. Digitization

The data bank presently contains some 3,300 authors, many of them represented by more than one edition. Additions to the data bank continue at the expected rate of about 4 million words per year. Most of the additions represent new authors. For many authors already in the data bank, new editions must be entered to replace outdated, older editions. One obvious advantage of WWW dissemination is the freedom from the limitations imposed by a static medium such as the CD ROM. Texts can be continuously updated to reflect new editions, and, without the problem of storage, multiple (and often rare) editions of the texts can be entered so that users will be able to compare different editions.[7]

For reasons of economy and product quality, inputting of data is done manually at the keyboard. The alternatives, scanning and OCR systems, typically involve considerable manual effort, which would not be cost-effective for such a large and diverse document collection. The texts are entered in Beta-code, which is a character and formatting encoding convention.[8] Each of the twenty-four Greek letters has an assigned ASCII position, and diacritics are indicated by non-alphabetic characters following the accented vowel. Beta-code remains to this day the most practical way to encode polytonic Greek data. Its use has been necessitated by the absence of a standard, cross-platform font table and keyboard for ancient (polytonic) Greek [Rusten, 96]. The recent decision of the Unicode Consortium to assign slots to the whole range of standard ancient accent/character combinations in "Greek Extended" is an important step towards the establishment of Unicode as the convention for typing Greek characters. The availability of the Greek Extended set will make the move away from Beta-code more feasible, although many challenges remain to be dealt with, primarily due to the formatting complexity of the collection.
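The letter-plus-diacritic scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's conversion software: it assumes the commonly documented Beta-code letter and diacritic assignments, handles only lowercase text (capital-letter marking and formatting escapes are left out), and the function name `beta_to_unicode` is our own. Unicode normalization (NFC) composes each letter with its combining marks into precomposed characters, including those in the Greek Extended block.

```python
import re
import unicodedata

# Assumed mapping: the 24 Greek letters keyed by their ASCII positions.
LETTERS = dict(zip("ABGDEZHQIKLMNCOPRSTUFXYW",
                   "αβγδεζηθικλμνξοπρστυφχψω"))

# Assumed mapping: diacritic signs typed after the vowel, as combining marks.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(text):
    """Convert lowercase Beta-code letters and diacritics to Unicode Greek."""
    out = []
    for ch in text.upper():
        if ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        else:
            out.append(ch)  # spaces, punctuation, unhandled escapes pass through
    composed = unicodedata.normalize("NFC", "".join(out))
    # A sigma not followed by another letter is written in its final form.
    return re.sub(r"σ(?!\w)", "ς", composed)


print(beta_to_unicode("MH=NIN A)/EIDE QEA/"))
```

A full converter would also have to deal with capitals, brackets, and the other typographical escapes discussed in section 4.2.1, which is where most of the difficulty lies.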

3.2. Data verification

During the past year, the project has concentrated its efforts on producing an efficient library of verification and correction programs capable of detecting potential data entry errors in morphology, accentuation, formatting, and citation. Once the texts have been verified and corrected, they become permanent additions to the collection. In the process of correction, new vocabulary items are automatically indexed and added to a lexical database (currently implemented as a relational database on an MS SQL-Server). The database stores the vocabulary in three tables, one containing the author and work number, a second containing every unique word in the texts (presently over 1.5 million records), and a third table containing the work in which each word occurs and the number of times the word occurs in that work (currently over 11 million records). New vocabulary items are added to the lexical database for future morphological and semantic analysis.
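The three-table layout described above can be sketched as follows. This is an illustration only: the production database runs on MS SQL Server, whereas the sketch uses Python's built-in `sqlite3` module, and all table names, column names, and the sample author and work numbers are assumptions of ours rather than the actual schema.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE works (              -- author and work numbers
    work_id   INTEGER PRIMARY KEY,
    author_no TEXT NOT NULL,
    work_no   TEXT NOT NULL
);
CREATE TABLE words (              -- every unique word form
    word_id INTEGER PRIMARY KEY,
    form    TEXT NOT NULL UNIQUE
);
CREATE TABLE occurrences (        -- which work, and how many times
    word_id INTEGER NOT NULL REFERENCES words(word_id),
    work_id INTEGER NOT NULL REFERENCES works(work_id),
    freq    INTEGER NOT NULL,
    PRIMARY KEY (word_id, work_id)
);
""")


def index_work(conn, author_no, work_no, tokens):
    """Register a verified work and add its word forms to the index."""
    cur = conn.execute(
        "INSERT INTO works (author_no, work_no) VALUES (?, ?)",
        (author_no, work_no))
    work_id = cur.lastrowid
    for form, freq in Counter(tokens).items():
        # New vocabulary items are added; known forms are left untouched.
        conn.execute("INSERT OR IGNORE INTO words (form) VALUES (?)", (form,))
        word_id = conn.execute(
            "SELECT word_id FROM words WHERE form = ?", (form,)).fetchone()[0]
        conn.execute("INSERT INTO occurrences VALUES (?, ?, ?)",
                     (word_id, work_id, freq))
    return work_id


def total_frequency(conn, form):
    """Total number of occurrences of a word form across all works."""
    row = conn.execute(
        "SELECT COALESCE(SUM(freq), 0) FROM occurrences "
        "JOIN words USING (word_id) WHERE form = ?", (form,)).fetchone()
    return row[0]


# Made-up author/work numbers and a three-token sample.
index_work(conn, "0001", "001", ["MH=NIN", "A)/EIDE", "MH=NIN"])
print(total_frequency(conn, "MH=NIN"))
```

Keeping the per-work counts in a separate table means that a frequency query never has to touch the texts themselves.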

3.3. The Canon of Greek authors and works

In addition to the texts, the project has compiled the Canon of Greek authors and works [Berkowitz, 1990], a comprehensive database of all known ancient Greek and Byzantine authors, together with bibliographies of existing critical editions of their extant works.[9] The Canon was originally designed to function as a register, namely an electronic guide to the authors and works selected for inclusion in the data bank. Besides providing a list of authors and works included, or about to be included, the Canon contains various categories of information that help distinguish one author from another or define a particular work (e.g., traditional epithets that define the author's literary activity, geographic location, dates, dialects, meters, etc.), as well as bibliographic information for each work and links to other related works.

The bibliographic information of the Canon has now been transferred to a new and significantly enhanced relational database used to search the data as well as enter new information and edit existing entries.[10] The database is divided into three major subsystems: author/collection, work, and publication of the work. A schematic diagram of the Canon database is provided in Figure 1.

4. DESIGN OF FUTURE ACTIVITIES

Building on the existing data, we propose to develop a mechanism that will provide more efficient access to the data bank. Figure 2 illustrates the tasks that will have been completed prior to the beginning of the grant period (shaded area) and the activities to be undertaken during the grant period.

4.1. The Canon database

The WWW Canon database will be a fully functional (for all scholarly purposes) replication of the in-house Canon database.[11] Bibliographic information collected and organized by the research staff will be made available to all web users without paid subscription. The purpose of this database is twofold. From the users' point of view, it will provide an information resource, a public "encyclopedia" or a reference point for all publications related to the history of Greek literature. From the back-end programming point of view, the database will facilitate the recall of more specialized and complex queries by interacting with the other elements/data residing in the TLG server.

4.2. The texts

The encoding conventions applied to the TLG texts were developed in the 1970s. While groundbreaking at the time for the amount of scholarly information they captured, these conventions have been superseded by more recent developments and must be revised. We expect that much of our work in the next two years will be devoted to structuring our data in ways that will facilitate sophisticated and content-specific searches. The use of two encoding standards (SGML or XML and Unicode)[12] will also ensure that our data outlast future changes in the delivery system and remain compatible with those produced by other digital projects.

4.2.1. Unicode

We will seek to convert our texts to the Unicode Standard during the grant period [Rusten, 98 and Rusten, FAQ]. This project will be a significant test of the application of Unicode to a non-Roman-script, information-enriched text collection. The conversion of the text collection to Unicode will be far from straightforward. While there are many reasons to convert to Unicode, it is not clear how much information can be converted and how much will have to be handled with other encoding conventions. For instance, Greek characters and diacritics would be converted to Unicode from Beta-code as Greek text, and Roman characters as Roman text. But Beta-code also contains a copious amount of other typographical information, such as different kinds of brackets, punctuation signs, and miscellaneous text symbols, including numismatic, editorial, and alchemical signs. It is not immediately obvious whether all text symbols should be converted to Unicode characters, or whether some of them should be rendered as SGML entities, with some degree of semantic tagging (i.e., cross-reference to standard definitions). Indeed, not all these elements are likely to be incorporated in the Unicode standard in the near future, so an SGML-entity solution would have to be harmonized with a Unicode-based approach, to a degree unprecedented in similar projects to date.
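The choice between a Unicode character and an SGML entity can be pictured as a per-symbol lookup with a fallback. The sketch below is purely illustrative: the symbol names, the table, and the function `encode_symbol` are hypothetical, and which symbols actually lack a Unicode code point is exactly the open question raised above.

```python
# Hypothetical symbol inventory: names and assignments are invented for
# illustration and do not reproduce any actual Beta-code escape.
SYMBOL_TABLE = {
    "sym-dagger": "\u2020",  # editorial dagger, available in Unicode
    "sym-coin": None,        # assumed to have no Unicode code point
}


def encode_symbol(name):
    """Return the Unicode character if one exists, else an SGML entity."""
    if name not in SYMBOL_TABLE:
        raise KeyError(f"unknown symbol: {name}")
    char = SYMBOL_TABLE[name]
    if char is not None:
        return char
    # Fallback: an entity reference that a DTD could tie to a definition.
    return f"&{name};"


print(encode_symbol("sym-dagger"), encode_symbol("sym-coin"))
```

Under such a scheme, a symbol that later gains a Unicode assignment needs only a change to the table, not to the encoded texts.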

An added benefit of the conversion of Beta-code to Unicode would be a formal proposal to the Unicode Consortium of character assignments for the set of glyphs required to adequately render ancient Greek texts, including papyrological and epigraphical texts. Such a proposal would help address the problem of absent glyphs in the Unicode repertory requisite for our purposes, and would render a valuable service to the entire field, which to date has relied on ad hoc solutions.[13] This would be a project of some complexity, as the existing Beta-escapes have to be weeded out for semantically distinct, widely used characters, and it would have to be undertaken in consultation with epigraphists, papyrologists and palaeographers worldwide. The TLG, with its comprehensive and diverse text-bank, is best poised to initiate such a venture.

4.2.2. SGML or XML encoding

Beta-encoding does not address content structure or morphological and semantic features. This has been the main obstacle in developing tools to explore the rich potential of the TLG texts. The next step, therefore, will be to increase the functionality of our texts by adding SGML or XML (TEI conformant) encoding to new texts and new editions. Determining the appropriate level of tagging will be an important part of this process. Careful planning will be necessary, given the large variety of genres and citation systems present in our collection. Since the existing TLG texts are already marked in Beta-code for elements such as chapters, paragraphs, line number, end of line, etc., automatic retro-conversion should be possible.

The benefits of using such encoding are many. Conversion to Unicode will accelerate data entry and correction of the texts and, therefore, reduce the overall cost of digitization. At the same time, SGML or XML encoding will significantly improve content-based search queries and will provide a well-defined, stable standard for the long-term maintenance of our digital collection. By adopting a formal, explicit, and commonly used tagging system such as SGML or XML, and a philologically sensitive tagset such as TEI or some subset thereof, the project guarantees compatibility with other digital library projects and allows the use of cross-platform, cross-library software. A proper encoding regimen will also allow the separation of text characters from format tagging, which is indispensable to intelligent text processing and is not the case for the current encoding (Beta-code). For example, text searches would not have to scan past particular non-alphabetic ASCII characters, as they currently do in Beta-code, to determine what constitutes a Greek word.
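The cost of mixing letters and marks in one character stream can be seen in even a simple word search. The sketch below is a hypothetical illustration, not one of the existing retrieval tools: it assumes the commonly documented Beta-code diacritic signs and builds a regular expression that tolerates them after every letter, which is the kind of scanning that a clean separation of text and markup would make unnecessary.

```python
import re

# Assumed set of Beta-code diacritic signs that may follow any letter.
DIACRITICS = r"[()/\\=+|]*"


def beta_word_pattern(plain_word):
    """Build a regex matching a word regardless of its diacritic signs."""
    letters = [re.escape(ch) for ch in plain_word.upper()]
    body = DIACRITICS.join(letters) + DIACRITICS
    # The word must not be preceded by a letter or sign, nor followed by
    # a letter, so that only whole words are matched.
    return re.compile(r"(?<![A-Z()/\\=+|])" + body + r"(?![A-Z])")


line = "MH=NIN A)/EIDE QEA/"
print(beta_word_pattern("MHNIN").search(line).group())
print(beta_word_pattern("AEIDE").search(line).group())
```

With Unicode text and external markup, the same search reduces to comparing words, optionally after stripping accents through normalization.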

The choice of encoding is dictated by our desire to leverage the widely deployed existing technologies and make our data conveniently accessible to larger audiences; with the range of audience we wish to reach, it is increasingly apparent that XML offers the most feasible scheme in the near future for such an application. The choice of a tagset is a less straightforward decision. While TEI has a deserved reputation for its comprehensiveness, the conversion of full-TEI to XML-compliance is not yet a reality. Furthermore, it is not clear that our project requires the full range of TEI tags and options, or that the TEI tagset would be complete and adequate to our purposes. We envisage using a subset of TEI as a starting point, and augmenting this subset as the need arises through manipulating the appropriate DTD. Developing and testing a tagset for a collection of this level of comprehensiveness and diversity will most certainly be a significant challenge, and would be instructive for future work in tagsets in general.[14]

A significant part of the project would be a formal specification of the existing encoding scheme, which is already underway. As a product of over two decades of work formulated prior to current mainstream encoding research, Beta-code displays several inconsistencies which complicate its use unnecessarily for software developers and Greek scholars alike. Formal specification is a necessary first step in the conversion process, and will aid in identifying these inconsistencies.

4.3. The lexical database

The TLG is an invaluable resource for lexicographers who are interested in identifying forms, locating them in the texts, and analyzing their usage [Ad&S, 1994]. Owing to the highly inflected nature of Greek, no retrieval system can produce a high level of precision unless it can perform morphologically sensitive searches. For years, lexicographers have been telling us that their work would be greatly facilitated if the TLG word index were fully lemmatized, that is, if words were divided into morphologically related sets so that a search for a given word would locate not only the word itself but also its cognates. One of our objectives, therefore, is to organize the TLG lexical database so that the user can search the TLG for all instances of a given word by simply requesting its dictionary entry form. This is a necessary step in view of the complexity of Greek morphology and the fact that many morphologically related forms do not at all resemble each other. For example, at present the user who searches for the present stem of the verb /ktiz-õ/ "to create" will miss the future tense of the verb /ktis-õ/ or the past /e-ktis-a/ but will most likely receive unrelated occurrences of the verb /oiktizõ/ "to have pity" and /laktizõ/ "to kick". This is also frustrating to non-specialists who are trying to locate the use of a particular word and have to sort through thousands of irrelevant entries. Gregory Crane has already done work towards this purpose with the Perseus Project [Crane/Lexicon]. The morphological analyzer developed by the Perseus team is capable of identifying morphologically a large number of words, although it is still not achieving satisfactory levels of precision. The Perseus Project is continuing its work on Greek morphology. Our collaboration with Perseus will make sure that our complementary resources are available to our users and both projects avoid unnecessary duplication of effort.[15] Our objectives are to:

* structure our lexical database[16] so that morphologically sensitive searches can be performed as part of the larger scheme of searching the TLG data and

* link our resources to the Perseus morphological and lexicographical tools, so that users can access both sites easily and efficiently.
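The difference between a substring search and a lemmatized search can be shown with the verb forms cited above. This is a toy sketch: the word list and the lemma table are hand-built for illustration (in practice they would come from a morphological analyzer), and the function names are our own.

```python
# Transliterated forms taken from the example in the text.
WORD_INDEX = ["ktizõ", "ktisõ", "ektisa", "oiktizõ", "laktizõ"]

# Hypothetical lemma table: dictionary entry form -> attested forms.
LEMMAS = {
    "ktizõ": {"ktizõ", "ktisõ", "ektisa"},
    "oiktizõ": {"oiktizõ"},
    "laktizõ": {"laktizõ"},
}


def substring_search(stem, index):
    """Current behaviour: every indexed form that contains the string."""
    return [word for word in index if stem in word]


def lemma_search(lemma, index, lemmas):
    """Lemmatized behaviour: every indexed form of one dictionary entry."""
    forms = lemmas.get(lemma, set())
    return [word for word in index if word in forms]


print(substring_search("ktiz", WORD_INDEX))
print(lemma_search("ktizõ", WORD_INDEX, LEMMAS))
```

The substring search returns the two unrelated verbs and misses the future and past forms; the lemma lookup returns exactly the three forms of the requested verb.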

5. DELIVERY SYSTEM

TLG directions and methodologies have always been guided by the needs of its user base, mostly traditional scholars in a variety of disciplines. In the last few years the community of TLG users has been broadened to include non-specialists, life-long learners, and students. We believe that the delivery system and user interface will play a significant role in making our digital library accessible to the traditional scholar as well as to the wider audience.

5.1. Processing of data

Our ultimate goal is to develop an integrated delivery system that will link the texts to the other two internal databases (lexical and bibliographical) and allow the user to perform specialized searches on the texts by drawing information from all three types of data with the added benefit of content and structurally based encoding. The user should be able to perform simple database searches to retrieve all the occurrences of a word in a particular author, historical period, genre, dialect, or any other entity represented in the Canon database. The user should also be able to do such a search for one given form or its full morphological paradigm. More important, the user should be able to search for predefined descriptive relationships within the documents (e.g., locate the use of a word specifically in choral passages in Greek tragedy or in a speech, as opposed to a narrative passage). The delivery system would then parse and optimize the query to determine the individual components involved, combine the results from individual components, and ultimately display them to the user.
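A query that draws on all three kinds of data might be decomposed as in the sketch below. Everything in it is hypothetical: the work identifiers, the Canon fields, the placeholder tokens standing in for Greek words, and the `type` values on the `div` elements are invented for illustration, and the markup is only TEI-like, not a tagset the project has adopted.

```python
import xml.etree.ElementTree as ET

# Bibliographic component: a stand-in for the Canon database.
CANON = {
    "w1": {"author": "Author A", "genre": "tragedy"},
    "w2": {"author": "Author B", "genre": "history"},
}

# Textual component: structurally encoded texts with placeholder tokens.
TEXTS = {
    "w1": "<text><div type='choral'><l n='1'>alpha beta</l></div>"
          "<div type='speech'><l n='2'>beta gamma</l></div></text>",
    "w2": "<text><div type='narrative'><l n='1'>beta delta</l></div></text>",
}

# Lexical component: dictionary entry form -> attested forms.
LEMMAS = {"beta": {"beta"}}


def search(lemma, genre=None, passage_type=None):
    """Combine lexical, bibliographic, and structural constraints."""
    forms = LEMMAS.get(lemma, {lemma})
    hits = []
    for work_id, xml_source in TEXTS.items():
        if genre and CANON[work_id]["genre"] != genre:
            continue  # filtered out by Canon metadata
        root = ET.fromstring(xml_source)
        for div in root.iter("div"):
            if passage_type and div.get("type") != passage_type:
                continue  # filtered out by document structure
            for line in div.iter("l"):
                if forms & set((line.text or "").split()):
                    hits.append((work_id, div.get("type"), line.get("n")))
    return hits


print(search("beta", genre="tragedy", passage_type="choral"))
```

Each filter corresponds to one of the components named above, so the same query can be widened or narrowed by dropping or adding a constraint.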

5.2. User interface and system assessment

The TLG presently has more than 20,000 users around the world in a broad range of disciplines and levels of expertise. Assessment of user needs and readiness for particular technologies is an essential part of the TLG's working relationship with its community. Over the years, user surveys[17] have complemented less structured channels in determining present needs, in laying out areas of future expansion, and in providing a reliable picture of the types of questions our users are most likely to pursue.

In creating the tools for access to the structurally enhanced corpus, such community involvement, as well as more formal usability studies, will be principal mechanisms in constructing user interfaces matched to the scholarly needs and working styles of the expanding community of users. In the past, our users have been most comfortable with electronic presentation which replicated the printed page as well as technology allowed. The TLG followed that inclination even to the extent of retaining the citation systems of different (print) editions. The broad acceptance of web technology allows us to improve on the paper original not just in terms of the browsability and functionality of the electronic content, but also in the visual presentation of text. In designing the user interface, our goal will be to develop a model that will encourage collaboration among traditional scholars and students, and will enable non-specialists to navigate the intricacies of this multilingual and diverse corpus of materials.

We intend to co-design the user interface with a sample of TLG users that represents experts, novices, and teachers [Beyer & Holtzblatt, 1998]. We are interested in discerning the efficacy of collaboration in query formation and the analysis of results (especially among novice users), the appropriate use of visualization techniques, and whether or not the shared annotation of document retrieval sets results in collaboration among users. Empirical research for texts of this kind is limited [Ruhleder, 1991]. We plan to build on existing work in the area, and to build a real system in close collaboration with our existing and potentially new user community. We believe that our user interface work will not only create a more efficient and effective electronic scholarly environment, but, because we intend to build a socially and cognitively grounded collaborative inquiry system, it will also make a contribution to both applied and theoretical aspects of user interface design.

The TLG Project has always coupled long-term vision with practical concern for using technology in service of a specific community. In keeping with this vision, the third year of the proposed grant includes a comprehensive evaluation of the TLG system. This evaluation will include technical aspects (e.g., speed, efficiency, performance under load), [18] the impact of various aspects of the TLG on scholarly work, identification of future areas of development (e.g., different types of content-based and context-sensitive searches), and the changing nature of the community of users. We plan to evaluate the system as it is being developed by using rapid prototyping techniques. In the third year we will use in situ techniques to evaluate the extent to which users incorporate the TLG into their work and the extent to which the new system facilitates research and collaboration.

5.3. Dissemination

The results of this project will be disseminated through publications in refereed journals and presentations at national and international meetings. We plan to present papers at the annual meetings of the American Philological Association, the American Society for Information Science, the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and other scholarly societies.

Since its inception, the TLG has worked closely with publishers and other copyright holders to ensure both broad availability of the Project's CD ROM products and due respect for intellectual property rights. Indeed, the TLG electronic corpus positions the Project as a significant collaborator in any digital library or publishing venture which draws on the broad range of materials in that corpus. TLG funding has been a combination of grants, donations, and subscription fees, giving it practical experience in the economics of dissemination.

Dissemination of the materials will occur in several ways:

1. The Canon Database will be open and available to all Web users without subscription. It will be continuously updated and expanded.

2. The textual materials will first become available to a select number of beta test sites. Once the system has been properly evaluated, it will become available to all TLG subscribers via a secure, password-protected server. A reasonable subscription fee will be necessary to support the future expansion and refinement of the data bank. We will apply a sliding scale in which major research libraries and institutions with a larger number of users will pay a higher share compared to small undergraduate programs, secondary schools, and individuals.[19]

3. The option of using the CD ROM in conjunction with, or instead of, the WWW will be available especially to users outside the US.

6. PROJECT TEAM

To realize the goals of this project we have brought together a multidisciplinary team of collaborators with demonstrated competence and expertise in the various areas of research represented in this proposal, namely, Classics, electronic publishing, text and character encoding, system building, and end-user interface.

Professor Maria Pantelia, the TLG Director, will be the Principal Investigator of the project. The PI will oversee all activities and participate in the overall design of the various project components.

The bulk of the technical work will be performed by Nick Nicholas, TLG Associate Researcher, and Nishad Prakash, TLG Programmer/Analyst. Dr. Nick Nicholas is a linguist specialising in Greek linguistics, with a background in Computer Science. He has been involved in text tagging projects involving SGML and TEI and has conducted software development at the TLG involving Beta-code, text verification and correction (format and spell-checking), and CD ROM production. He will handle aspects of the project related to text encoding. Nishad Prakash, the TLG programmer, holds a degree in Classics and Computer Science. He has already worked on the implementation of the Canon database and will participate in the development of the search and retrieval system.

Professor Jeffrey Rusten, Classics, Cornell University, will participate in the redesign of the encoding system. A classicist with extensive experience in fonts and character encoding, Jeff Rusten has offered technical support on behalf of the American Philological Association for the GreekKeys set of fonts (polytonic Greek) since 1989. He has also produced a (beta) version of Athena Roman, a Unicode font.

Elli Mylonas, Associate Director for Projects and Research, and Lead Project Analyst for the Scholarly Technology Group at Brown University, will participate in the design and implementation of text encoding. Her areas of specialty are hypertext, SGML/XML, structured text problems and electronic publishing, namely, presentation of structured data by means of the WWW. She has served as the Managing Editor of the Perseus Project at Harvard University. Mylonas has published and spoken on hypertext and electronic text, as well as project management and academic software projects. Her background is in Classics.

Dr. Richard Giordano, Lecturer and Collaboration Systems Scientist, Center for Innovation in Product Development, MIT, will participate in the user-interface design and will oversee the evaluation phase of the project.

Professor Mark Ackerman, Information and Computer Science at UCI, will collaborate in the design of the user interface. Dr. Stephen Franklin, Office of Academic Computing and UCI Department of ICS, will serve as project technical adviser and assist in all stages, with particular attention to assessment and deployment.

Jennifer Beach, a Classics Ph.D. candidate, will assist in this project. She has over seven years of experience in programming and data administration and is very familiar with the format of the TLG texts.

7. CONCLUSION

The TLG was the first major computerized collection of texts in the Humanities. Over the last 27 years, it has engaged in the primary mission of any library, namely to provide physical and intellectual access to, and preservation of, unique resources. Its present collection encompasses a corpus of unprecedented diversity and comprehensiveness of content. This very content, collected over decades, with often inconsistent philosophies predating the modern notion of the digital library, offers a significant challenge to any contemporary digitization scheme. This project affords an opportunity to the TLG and its collaborators to utilize a vast data bank to test and exploit current technological developments, a process which will most certainly prove beneficial to ongoing digital libraries research.

The restructuring of the TLG collection and on-line access to it will support visual and structured representation of the texts reflecting the original sources, and facilitate dynamic collaboration with related digital projects. Flexible, platform-independent, and user-friendly availability of the TLG collection will enable scholars, educators, and non-specialists to explore the full potential of the texts in ways not possible before. Through this project, the TLG will expand the benefits it has always offered to the scholarly community, and make a culturally important corpus available to a broader audience.

[1] The TLG was founded in 1972. It has received funding from a wide range of sources including the National Endowment for the Humanities, The Andrew W. Mellon Foundation, The David and Lucile Packard Foundation, the University of California, and many individuals and private organizations.

[2] URLs: http://www.perseus.tufts.edu/ and http://scriptorium.lib.duke.edu/papyrus/. For a list of Classics resources, see http://www.tlg.uci.edu/~tlg/resources.html.

[3] A list of software to search the TLG is available at: http://www.tlg.uci.edu/~tlg/software.html

[4] Routine expansion and refinement of content is covered by subscription revenue and is not part of this proposal.

[5] In an effort to concentrate on its collection of texts, the TLG did not engage in software development. Text correction and verification were implemented with a system developed in the early 1980s by D. W. Packard and W. A. Johnson on the Ibycus computer (HP-1000). A new and rather sophisticated Correction and Verification system has been developed during the last two years. This system checks the formatting of the texts as well as accentuation and morphology in beta code.

[6] The only WWW-based text retrieval system for Greek texts has been developed by the Perseus Project at Tufts University. For the moment the Perseus site offers limited search capabilities. We envision a significantly expanded user interface.

[7] The TLG does not contain the critical apparatus, namely the variant readings offered by printed critical editions. The decision not to include the apparatus was made in the early stages of the project and was based primarily on financial and technical considerations. The ability to compare two editions will significantly increase the strength of the electronic text.

[8] Beta code is based on a post-positive convention consistent with Unicode and can be converted with simple character substitution. An electronic version of the beta code manual is available at: <http://www.tlg.uci.edu/~tlg/beta.html>. For encoding methods, see [MacKay, 1996].

[9] A printed version of the Canon has been published by Oxford University Press [Berkowitz, 1990]. An electronic version of the Canon is disseminated with the TLG CD ROM.

[10] The data was originally stored in a flat database in the Ibycus system.

[11] The in-house database contains "house-keeping" information such as shipment number, correction status, word count, etc., which is irrelevant to the scholarly content.

[12] Deciding which one of the available technologies is best suited is a critical part of this project.

[13] Extensive use of Unicode also presupposes the availability of Unicode fonts and keyboards for Polytonic Greek. Jeff Rusten, a collaborator on this project [Rusten, 98 and FAQ], has made a font set available for Windows NT. A Microsoft Unicode keyboard is not expected in the near future, although interim shareware solutions are available. Macintosh Unicode compatibility is also not expected soon. Nevertheless, given the three-year span of this project, as well as the ostensive commitment of the computer industry to the Unicode standard, we are confident that this is a timely path for the TLG to follow.

[14] Especially in view of the fact that XML has received broad commercial support but has not been adequately tested for academic applications.

[15] The roots of this collaboration go back several years. Over the last two years, we have been collaborating with Perseus to provide access to TLG texts over the WWW using the morphological and lexicographical resources developed by Perseus.

[16] In the area of lexical research, the Princeton WordNet [Fellbaum, 1998] and its European collaborators offer a useful approach, although our emphasis will be on morphology and not semantics.

[17] For a comprehensive study of user needs, see Karen Ruhleder's 1991 dissertation [Ruhleder, 1991].

[18] For questions of efficiency and straight technical features, see [Nielsen, 1993].

[19] The TLG CD ROM is disseminated on the basis of a five-year license agreement. Three types of licenses are offered (individual, institutional/departmental, and site license). The TLG is presently considering a restructuring (in most cases a reduction) of its fees. Future cost of subscription is largely contingent on the availability of external funding. A permanent endowment fund (to ensure the long-term viability of the project) has also been in place since 1992, following the award of a Challenge Grant from the National Endowment for the Humanities.