Authored: Nick Nicholas, TLG
Created: October 1999
Last Revised: 2002-09-09
The following defines the algorithm used to sort Greek words in the TLG project. The sort is used to order the TLG word index, as distributed on CD ROM, and words as retrieved in the TLG search engine.
A TLG word list is sorted primarily by Greek alphabetical order, according to the algorithm outlined below. For most purposes, it is necessary only to know the order of the Beta code letters (
'ABGDEVZHQIKLMNCOPRSTUFXYW), and that smooth breathings sort before rough breathings, acute accents before circumflex. The actual sort performed to produce the word list is more precise, and is detailed below.
To sort the list in Greek alphabetical order, a binary code is used (hereinafter referred to as the sorting key). The sorting key is derived from the ASCII representation of a Greek word, in Beta code. Each character is assigned the low seven bits of a byte, and the sign bit (bit 7) is turned on for each byte, to distinguish the sorting key from any surrounding ASCII where necessary.
The sorting key is conceptually the same as that used by the Packard Humanities Institute in the production of TLG CD ROMs #B-#D, and on the IBYCUS computer. However, the IBYCUS sorting process used 5-bit values in a 16-bit word. The need to use modern microcomputer-compatible 8-bit words, and to avoid any 8-bit word having a null value (which would be taken as indicating the end of the string), has made it necessary for us to instead allocate 8 bits for each value.
To sort a Beta code word list, each word in the list is converted to a string containing the sorting key, then a tab character, then the ASCII original of the word. As a result, where the sorting key algorithm yields identical results for any two words, ASCII sort takes over for the ASCII original strings. For instance, the algorithm ignores a third iota subscript in a word; so when presented with the strings
A|A|AR ᾳᾳαρ and
A|A|A|R ᾳᾳᾳρ, it will generate an identical sorting key for both. The strings would then be sorted in ASCII order (as given in the remainder of the string), whereby
With the exception of the partial word bytes (see below), each byte of the sorting key also has its 6th bit turned off, and its 5th bit turned on. The remaining five bits of each byte contain the same information as each 5-bit value in the IBYCUS code. That means that each byte contains 160 (binary
1010 0000) + the 5-bit value.
The input to the sorting algorithm is expected to have been subjected to normalization through the algorithm outlined in TLG Technical Note 001.
The sorting keys for any two words are compared byte by byte. If all bytes are equal up to the length of one word, the shorter string precedes the longer; for example,
KAI και <
KAINOS καινος. In the following, the sorting keys computed for each word are given in hexadecimal (with the standard prefix
In this scheme, values 0-4 of the 5-bit sorting key value represent the Greek breathings, while values 5-30 represent the Greek alphabet:
'ABGDE VZHQI KLMNC OPRST UFXYW
Each word is encoded so that the bytes representing each letter of the word (codes 5-30) are given first, followed by the breathing code (if any) for the entire word. For example, the following appear in sorted order:
|Smooth breathing on diphthong.|
|Rough breathing on vowel.|
|Rough breathing on diphthong.|
Given the sorting keys, the order of the words is:
If there is crasis in a word, the code for the first coronis follows the breathing code. This code, like those following, is also a 5-bit value prefixed by the bits
101. In that code, bit 0 indicates the type of coronis (0 if smooth, 1 if rough), and bits 1-4 indicate the location within the word (counting from the start of the word):
|No breathing;||2nd letter from beginning:||Smooth coronis;||Acute (see below).|
Sorting coronides separately from breathing marks allows a distinction to be made between
E)GW)=|MAI ἐγᾦμαι and
E)GW=|MAI ἐγῷμαι. If there is no coronis in the word, the 5-bit value assigned is zero.
On the TLG Search engine, rough coronides are normalized to smooth: thus no distinction is made between
X(WSχὡς, since the TLG Word Index already treats both as
XW)Sχὠς. The distinction between rough and smooth coronis is only preserved on TLG CD ROM #E as of this writing.
The coronis code is followed (if necessary) by an accent code. In this code, bit 0 indicates the accent (0 if acute, 1 if circumflex), and bits 1-4 indicate the location within the word (15 minus the number of letters from the end of the word: the last letter of the word has thus the location value 15-0). If the word is unaccented, or bears an accent before the 14th last letter, the location value is zero:
|Rough on diphthong;||No coronis;||Last letter:||acute.|
|No breathing;||No coronis;|
|3rd last letter:||circumflex.|
|No breathing;||No coronis;|
|2nd last letter:||circumflex.|
MU=WY μῦωψ sorts before
0xB2BA BEB9 A0A0 BB <
0xB2BA BEB9 A0A0 BD. In general, earlier accents sort before later accents in the word, and for words accented in the same position, acutes sort before circumflexes. (Graves are conflated with acutes in normalization, and are not considered separately.) In classical terms, the sort order is: Proparoxytone < Paroxytone < Properispomenon < Oxytone < Perispomenon.
The accent code, when necessary, is followed by the iota subscript code. In that code, the 5-bit value gives the location of the subscript: the number of letters from the end of the word, plus one (an absent value is denoted by a 5-bit value of zero):
|No coronis;||4th last letter:||acute;||Last letter: iota subscript.|
Thus words with later iota subscripts are sorted before words with earlier iota subscripts, ceteris paribus.
Up to two iota subscripts may be encoded for a word:
|No coronis;||3rd last letter:||circumflex;||3rd last letter: iota subscript;||Last letter: iota subscript.|
When necessary, the iota subscript codes are followed by a hypodiastole code, giving the location of the hypodiastole counting from the start of the word (where the first letter is 1):
|Rough on vowel;||No coronis;|
|3rd last letter:||acute;||No iota subscript (1);||No iota subscript (2);|
|1st letter, hypodiastole.|
Note that the diacritic codes are included only as needed: the sorting key follows the hierarchy letters > breathing > coronis > accent > 1st iota subscript > 2nd iota subscript > hypodiastole, and a value is not output if no values lower than it are output.
Thus, the encoding of
OI( οἱ need not include accent, subscript, coronis or hypodiastole information. The encoding of
MOI μοι need not even include breathing information, and can be terminated in three bytes (encoding just the letters.) On the other hand, the encoding of
PH| πῃ requires 6 bytes: a byte needs to be filled for the absent breathing, coronis and accent, before the subscript byte.
All such filler bytes contain the value zero for their 5-bit value, since the absence of a feature precedes the presence of the feature in sorting order. Thus,
0xB5AF A4A0 BE);
0xB6AC A0A0 A0A1<
0xB5AC A0A0 BFA1).
The final byte of the sorting key contains partial word information. In this byte, to distinguish it from preceding bytes, the binary prefix is
100 instead of
101. Because of this, a partial word byte can be appended to the remainder of the key without superfluous subscript, accent, coronis or breathing bytes intervening. If the word is partial to the left, the byte has the value 144 (
1001 0000); if it is partial to the right, it has the value 136 (
1000 1000); if it is fragmentary on both sides, the byte has the value 152 (
1001 1000). As a result, the following sorting order obtains:
|Partial to the right.|
|Partial to the left.|
|Partial on both sides.|
|No breathing;||No coronis;||Last letter:||circumflex.|
|No breathing;||No coronis;||Last letter:||circumflex;||Partial to the right.|
Because of the lower byte prefix, partial words always sort immediately after their equivalent full words, and before any words differing by additional diacritics. The following is a complete illustration of the TLG sorting algorithm.