
Date: December 13, 2006
Last updated on December 15, 2006
We are pleased to release a new feature of the TLG® search engine: lemmatized searches.
Work on lemmatization began in 2003 and benefited from access to software known as Morpheus developed by the Perseus Project.
Morpheus was designed to deal effectively with a relatively narrow, well-documented cross section of the Greek language, namely the classical canon: Epic and Attic Greek with some Doric, Ionic, and Koine forms. The TLG corpus, by contrast, encompasses the totality of Greek literature, including Byzantine and Early Modern Greek texts. As a result, lemmatization of the TLG corpus required a different philosophy and a significantly more complex architecture, one that combines lexical and morphological databases with extensive programming in order to increase the number of parses and achieve broader and more accurate form recognition. This project was executed largely thanks to the efforts of Nick Nicholas and Nishad Prakash. Cindy Moore and Zeya Myint contributed to the implementation of the system. The current beta version of the TLG lemmatizer recognizes approximately 92% of the unique wordforms in the TLG corpus.
The lemmatizer makes use of the following sources:
The advantage of lemmatized searches is that the user can retrieve in a single search a large number of forms that might otherwise require multiple string searches. One disadvantage may be that the current TLG lemmatizer recognizes approximately 92% of the wordforms; some forms may therefore go unrecognized or may be recognized incorrectly. Due to morphological ambiguity, the engine may also retrieve forms that could belong to the requested lemma but actually belong to a similar lemma.
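The trade-off described above can be illustrated with a small sketch. It is not the TLG implementation: the table `ANALYSES` and the functions `forms_for_lemma` and `is_ambiguous` are hypothetical names invented for this example, and the handful of wordforms is toy data. The sketch assumes only that each wordform has been mapped in advance to one or more candidate lemmas.

```python
# Toy analysis table: wordform -> set of candidate lemmas.
# Hypothetical illustration data, not drawn from the TLG databases.
ANALYSES = {
    "λύω": {"λύω"},
    "λύεις": {"λύω"},
    "ἔλυσα": {"λύω"},
    "λέλυκα": {"λύω"},
    # Morphologically ambiguous: future of the verb λύω
    # or dative singular of the noun λύσις.
    "λύσει": {"λύω", "λύσις"},
    "λύσις": {"λύσις"},
    "λύσεως": {"λύσις"},
}


def forms_for_lemma(lemma, analyses=ANALYSES):
    """Return every wordform that has `lemma` among its candidate lemmas."""
    return sorted(form for form, lemmas in analyses.items() if lemma in lemmas)


def is_ambiguous(form, analyses=ANALYSES):
    """True when a wordform could belong to more than one lemma."""
    return len(analyses.get(form, ())) > 1


if __name__ == "__main__":
    # One lemma query stands in for several separate string searches.
    for form in forms_for_lemma("λύω"):
        note = " (ambiguous)" if is_ambiguous(form) else ""
        print(form + note)
```

In this sketch a query for λύω returns λύσει even in passages where it is really the dative of λύσις, which is the kind of over-retrieval that morphological ambiguity causes. A wordform missing from the table is never returned for any lemma, which corresponds to the unrecognized forms mentioned above.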