3. "Don't Proliferate; Transliterate!"

1. Spoilsports

Because Unicode is begotten of compromise and negotiation, it is difficult to establish rules for which characters are truly platonic enough ideals to warrant inclusion in it, and which are not. We've seen that Coptic was conflated with Greek, and now is not, for example. The more general issue is what should be counted as a distinct character, and what should be counted only as a variant character, which need be differentiated only through markup. (For instance, the difference between a plain a and the same a set in a different style.) This issue is clear in well-established modern scripts, which have a history of standardisation and a well-defined repertoire.

In historical scripts, and particularly poorly understood, partly deciphered or undeciphered scripts, this issue is much murkier. It is so murky, in fact, that specialists have concluded it is pointless to even attempt to derive a repertoire of codepoints for inclusion in Unicode:

  1. For poorly understood scripts, we don't know which characters are distinct, and which are merely variants of each other.
  2. Since Unicode is a standard For The Ages, we cannot decide on one repertoire of codepoints today, and shift to a smaller or bigger repertoire when we know more: Unicode needs to get things right from the get-go, or we will be stuck with errors in perpetuity.
  3. Ergo, no Easter Island script for you.

But there's more. When we know enough of a script to publish texts in it, we form a standardised repertoire of graphemes to put such texts in. In doing so, we generalise from whatever variation there may have been in the script through time and space. Fraktur a and Celtic a are shaped quite differently, as are the a written in Roman script in 400 AD and in 900 AD; but we represent all of these as U+0061 Latin Small Letter A.
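The unification described here can be observed from a programming language's Unicode tables. What follows is only a sketch in Python, using the standard unicodedata module. The choice of MATHEMATICAL FRAKTUR SMALL A as the styled variant is an assumption made for illustration: that character exists in Unicode for mathematical notation, not for setting running Fraktur text, and compatibility normalisation folds it back to the one platonic a.

```python
import unicodedata

# The one platonic "a", however it is drawn in a given font.
plain_a = "\u0061"
print(unicodedata.name(plain_a))

# Assumption for illustration: a styled "a" that Unicode encodes only for
# mathematical notation, not for running Fraktur text.
fraktur_a = "\U0001D51E"
print(unicodedata.name(fraktur_a))

# Compatibility normalisation (NFKC) folds the styled letter back to U+0061.
folded = unicodedata.normalize("NFKC", fraktur_a)
print(folded == plain_a)
```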

Now, scholars understand enough about Egyptian Hieroglyphics, Sumerian Cuneiform, and Ogham to be able to publish texts in them. A photograph of the stone or tablet obviously does not count as publication: the scholar needs to reduce the glyphs in the original to a standardised set of graphemes, fix any errors, fill in gaps and so forth. So in most instances, you won't care whether the Ogham glyph for "b" was at a 65° or 70° angle; you just want to know that it was a "b". (And if you really wanted to know, you can always refer to the photo.)

The catch is, the standardised repertoire of glyphs scholars use to publish ancient texts is typically not a form of the script itself. It's Roman transliteration. (But see proviso below.) Scholars working on cuneiform don't take the Assyrian variant and the Babylonian variant of a sign, and decide which of the two to use consistently in their publications. [Glyphs courtesy of John Heise's ultracool online course on Akkadian.] That's because scholars working on Akkadian don't publish anything in cuneiform: they publish their texts in Roman transliteration, and both variants are normalised to the single Roman sequence ni. If you want to know what the wedges look like, you look at the clay; if you want to know what the wedges say, you read the transliteration.

This means that the scholars who work on historical scripts don't actually use those scripts in anything but the initial deciphering. Since they don't publish in Akkadian cuneiform, they have no earthly need for a standardised Akkadian cuneiform script: their standardised script is Latin. And since Unicode is concerned with the platonic ideal and not the minute variant, it will not find a repertoire of cuneiform platonic ideals, but only of Latin transliterated ideals—so ideal, one might say, they've even been stripped of their cuneiformicity.

And if the scholars don't need a standardised Akkadian cuneiform or Egyptian hieroglyphic, the only people who do are those who think the scripts are K00l, and want to use the scripts for recreational purposes. One might argue they are also used in teaching people about the scripts themselves, as distinct from the Akkadian or Old Egyptian languages; but when a textbook on Akkadian says "the Babylonian equivalent of [glyph] is [glyph]", it's not clear that we're actually dealing with text as distinct from illustrations. If the reading exercise in such a text features a normalised sequence of cuneiform glyphs, rather than a photo of the stone, then you might argue this really does constitute cuneiform used as text; but that kind of use is fairly marginal.

The preceding does not do justice to Carl-Martin Bunz's patient exposition Encoding Scripts from the Past, which constituted Unicode Technical Note #3. Nor does it really convey the tension that must have given rise to Bunz writing his note: reading between the lines (and the standard disclaimers apply), it is apparent that Unicode implementers, who would dearly love to implement hieroglyphics and cuneiform, think of the academics as spoilsports who are getting in the way of their scriptal delights—and have let them know it. (And I do not trivialise that kind of delight: if Daniels & Bright's The World's Writing Systems does not make your heart go aflutter, you have little business working on Unicode.) But the bottom line is, standardisation for such scripts is hard, and the people who would do the standardisation don't need it. Without the support of the people who work with the scripts for a living, any attempt at such standardisation is ill-starred, and likely to wait for feedback indefinitely.

For an example of the "don't need it" reaction, see the DIN (German Standards body) response to the proposal on Meroitic, and on Egyptian hieroglyphs.

2. Epichorica

What has this to do with Greek? The Greek repertoire of characters is well-defined and understood, and it's only very rarely that Greek is published in anything but Greek. Yet Greek script, like any other, went through a period of flux, particularly in its first few centuries; the repertoire wasn't always as established as it is now, and there was much greater variation in the shape of characters. In later times, that variation becomes the province of palaeography: scribes from different times and places write letters slightly differently (or occasionally quite differently), and the editor's job is to reduce those characters to their modern standard forms, and disentangle their combinations into ligatures.

Ligatures, as we have seen, are not something Unicode is concerned with; published texts rarely indicate that clusters of letters used to be joined and jumbled together, and if they do, that properly becomes an issue of markup rather than of the underlying platonic forms.

But for the first few centuries of Greek script, there was huge variation from place to place as to the shape of letters, and their phonetic value. Each city had its own epichoric alphabet. Epichoric is Greek for 'local' (ἐπιχώριος), and the fact that epigraphers call local alphabets epichoric instead of local is the kind of turf practice you might expect from the industry.

If you didn't know what epichoric meant, you're not alone: a scribe of the mediaeval Greek poem that I've recently coauthored a book on managed to mangle the phrase "epichoric proverb", referring to a pig, into ἐπιχοίριος—pig-ological.

At any rate, the variation was prodigious—some epsilons looking like betas, Η being either /h/ or /ɛː/, at least two different letters for /s/—and epigraphers wouldn't be doing anyone any favours by perpetuating this confusion in print. The normalised inventory of Greek letters is the standard Greek script as we know it; and that's what epigraphers reduce epichoric scripts down to. It is this reduction which is within the scope of Unicode: if epigraphers do anything with normalised Greek script that Unicode cannot cope with, then there is a case for expanding it. But there is no case for Unicode incorporating every epichoric variant of a letter, because that's not what anyone publishes or expects a font to contain.

So for instance the glyph that was to become Ψ stood for /ps/ in Ionia, /kh/ in Euboea, and /ks/ in Crete. But that is the worry of the epigrapher, not the computer programmer; because any normalised text, which would be published in Unicode, would have Ψ standing only for /ps/. Likewise, many scripts had a crooked rather than a straight line for their iota; but this does not mean a normalised text will represent iota as anything but Ι. The platonic forms Unicode bases itself on are graphemes, not phonemes: Unicode is not intended to do your phonological analysis for you, and if your script doesn't map to the phonology as elegantly as it might have, that's not for Unicode to fix. But the convention in Greek epigraphy is to take care of that mapping, and only let transliterations which do follow Greek phonology as we know it see the light of day. The only time you will ever see funny uses of psi, or crooked iotas, or epsilons that look like betas, is in histories of the Greek script (and even there as illustrations rather than text), and in depictions of inscriptions.

3. Epigraphers vs. Linguists

Almost. Epigraphers do normalise the shape of their letters to the modern standard; and they do accent and punctuate their texts as would be normal—including lower case and diacritics, both of which were unknown at the time of the inscriptions. However, they do not disrupt the graphemic system of the inscription, adapting it to the modern norm:

So the text an epigrapher publishes does not look the same as it would if the text had been preserved in manuscript, and published in normal orthography—even when the inscription dialect in question is Attic, whose phonology is what underlies the standard script. To illustrate, here is an inscription from the Acropolis in Athens, probably dating to 566 BC, as published by an epigrapher, and as it would appear in normal script:

[τὸ]ν δρόμον̣ [ἐποίε̄σαν .... Κρ]άτε̄ς [Θρασ]υκλε̄ς Ἀ[ρ]ι̣σ̣τό̣δ̣ιϙος Βρ̣[ύσο̄ν] Ἀντε̣̄́[νο̄ρ ... ιροποιοὶ τὸν ἀ]γο̄́[να θέσ]αν προ̄το̣[ι] γλα̣υ̣[ϙ]ο̣̄πιδι ϙόρ[ε̄ι]. (Jeffery 1990:401)

τὸν δρόμον ἐποίησαν ... Κράτης, Θρασυκλῆς, Ἀριστόδικος, Βρύσων, Ἀντήνωρ ... ἱροποιοὶ τὸν ἀγώνα θέσαν πρώτοι γλαυκώπιδι κόρῃ.

This race-track was made by ... Crates, Thrasycles, Aristodicus, Bryson, Antenor ..., Supervisors of Religious Rites [hieropoioi], who first established the race [in the Panathenaea Games] in honour of the Gleaming-Eyed Maiden [Athena].
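How much editorial apparatus the epigraphic text carries can be made concrete in a short Python sketch. It assumes, as in the transcription above, that damaged letters are marked with a combining dot below (U+0323) and restorations with square brackets; the function name strip_editorial_marks is hypothetical. Note what the sketch does not attempt: turning a macron-bearing ε into η or ει, or ϙ into κ, takes knowledge of the language, not just of the code points.

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW: letter damaged on the stone
MACRON = "\u0304"      # COMBINING MACRON: vowel read as long by the editor
KOPPA = "\u03d9"       # GREEK SMALL LETTER ARCHAIC KOPPA

def strip_editorial_marks(text):
    """Drop restoration brackets and damage underdots; keep the letters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if ch not in "[]" and ch != DOT_BELOW]
    return unicodedata.normalize("NFC", "".join(kept))

# The last word of the inscription, spelled out with escapes for clarity:
# koppa, omicron + acute, rho, then a restored long e and iota.
word = KOPPA + "\u03bf\u0301\u03c1[\u03b5" + MACRON + "\u03b9]"
cleaned = strip_editorial_marks(word)

print(cleaned)
print([unicodedata.name(ch) for ch in cleaned])
```

The koppa and the macron survive the clean-up, because they belong to the graphemic system the epigrapher is reporting; only the marks that describe the state of the stone are removed.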

By contrast, someone writing a linguistic or historical account using those texts will be likelier to use a normalised rendering, and a standard repertoire of symbols. There is a revealing contrast between the index of Jeffery (1990), a text on the early history of the Greek alphabet, and Buck (1955), a dialectology manual. Jeffery is studying the history of Greek letters in detail, so it is natural for her to follow her sources closely in transliteration, and to sort different Ancient graphemes as distinct letters. Buck, on the other hand, while a much more comprehensive and linguistically informed treatment of its subject matter than his predecessors, still takes Attic Greek as the departure point in his index, and is only concerned with phonology: he eliminates archaic letters where they were not emic, and in his index ignores even distinct dialectal phonemes which Attic dropped (digamma), sorting everything as if it were Attic. For his purposes, this makes sense: if you want to know about the old form of "king", ϝάναξ (wánax), you expect to find it listed under its classical form, ἄναξ (ánax). Compare the respective indices for ech- through to thar-.


Buck (1955):
  • ἐχθός
  • ἐψαφίττατο
  • ἕωκα
  • ϝέχω
  • ζά
  • ζᾶ
  • ζαμιοργία
  • ζαν
  • ζέλλω
  • ζέρεθρον
  • Ζῆνα
  • ζίκαια
  • ζίφυιος
  • Ζόνυσσος
  • ζτε̄ραῖον
  • ζώω
  • ἐ̄
  • ἤγραμμαι
  • ϝῆμα
  • ἦμεν
  • ἤμην
  • ἠμί
  • hε̄μίδιμμνον
  • ἠμίνα
  • hε̄μιρρήνιον
  • ἥμισσον
  • ἡμίτεια
  • ἠμιτυέκτο̄
  • ἥμυσυ
  • ἤν
  • ἦν
  • ἦναι
  • ἤνατος
  • ἦνεικα
  • ἦνται
  • ϝηρόντων
  • ἦς
  • ᾗς
  • ἥσσαντο
  • ἤστω
  • ἦται
  • ἤτω
  • ηὑτῶν
  • ἥχοι
  • ἠώς
  • θάλαθθα
  • θάλαττα
  • Θαρῆς
  • θαρρέω

Jeffery (1990):
  • εχε̣[ε]
  • εχινος
  • ϝαναξ
  • ϝεϝρε̄μενα
  • ϝειδο̄ς
  • ϝεκαβολο̄ι
  • ϝεξ
  • ϝεργα
  • ϝετεα
  • ϝ⊢εδιεστας
  • ϝικατι
  • ζο̄ος
  • ημεας
  • ε̄νικε
  • ε̄ριον
  • ⊢αγεν
  • [⊢αιρ]ε̄σει
  • ⊢αλιι̯ος
  • ⊢εζατο
  • ⊢ενατον
  • ⊢ε̄μιτριτον
  • ⊢ε̄ρο̄ος
  • ⊢ιαρος
  • ⊢ιατρο
  • ⊢ικατι
  • ⊢ιμερ[ος]
  • ⊢ιπ(π)ι[ϙο]
  • ⊢ιπ(π)οδρομο
  • ⊢ιροποιοι
  • ⊢οδο̄ι
  • ⊢οπλα
  • ⊢ορος
  • ⊢ορϙος
  • ⊢υιος
  • ⊢[υπαρχοις]
  • ⊢υποδ[εξαι?]
  • ⊢υπυ
  • θακος
  • [θαν]α̣τ̣οιο
  • θανο̄ν
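The two sorting policies can be contrasted in a few lines of Python. This is a sketch under stated assumptions, not a reconstruction of either index: the order strings and the function names are hypothetical, the extended order only slots digamma after epsilon (heta, koppa and the treatment of long vowels are left out), and the four sample words are taken from Buck's list above.

```python
import unicodedata

DIGAMMA = "\u03dd"  # GREEK SMALL LETTER DIGAMMA

ATTIC_ORDER = "αβγδεζηθικλμνξοπρστυφχψω"
# Hypothetical extended order: digamma gets its own slot after epsilon.
EPICHORIC_ORDER = "αβγδε" + DIGAMMA + "ζηθικλμνξοπρστυφχψω"

def bare_letters(word):
    """Strip accents, breathings and other combining marks; unify final sigma."""
    decomposed = unicodedata.normalize("NFD", word)
    letters = (ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(letters).replace("ς", "σ")

def attic_key(word):
    # Letters outside the Attic alphabet (digamma here) are invisible to the sort.
    return [ATTIC_ORDER.index(ch) for ch in bare_letters(word) if ch in ATTIC_ORDER]

def epichoric_key(word):
    # Digamma is sorted as a letter in its own right.
    return [EPICHORIC_ORDER.index(ch) for ch in bare_letters(word) if ch in EPICHORIC_ORDER]

words = ["ἦμεν", "ζά", DIGAMMA + "ῆμα", "ἤγραμμαι"]
buck_style = sorted(words, key=attic_key)
jeffery_style = sorted(words, key=epichoric_key)

print(buck_style)
print(jeffery_style)
```

With the Attic key the digamma is ignored and ϝῆμα files among the eta words; with the extended key it heads its own section before zeta.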

The alphabetical order Jeffery uses is:


4. Corinthian EI

That said, epigraphers do not leave the alphabet as open-ended as they might. Epigraphers will use digamma, heta, and koppa, but are reluctant to use san, and tend not to use the extra locally coined letters at all; if they feel they must represent the distinction between the normal and the innovated letter, they tend to use devices like diacritics instead.

To illustrate, consider the case of the Corinthian epsilons. Classical Greek had three e-like sounds. Short /e/ is what is represented in the standard script by epsilon. Long /ɛː/ is what is represented by eta, but was usually written just as epsilon in epichoric alphabets; eta was an Ionic innovation. A second long vowel, /eː/, arose latterly from two sources: monophthongisation of /ei/ ([hiéreia] > [hiéreːa] "priestess"; cf. [hiereús] "priest"), and lengthening of /e/ ([ksénwos] > Ionic [kséːnos] "stranger").

The missing digammas of both xenwos and wanax (see above) feature prominently in a fanfic piece by Kevin Wald on Xenwa, er, Xena, Warrior Wprincess.

Because this sound was at least sometimes originally a diphthong, it is written in standard script as the diphthong ει: ἱέρεια, ξεῖνος. (The former kind of ει is called a "genuine" diphthong, while the latter—which was never a diphthong at all—is termed "spurious" or "false".)

In Corinth, the monophthongisation took place early; but something strange then happened to the epichoric alphabet (Jeffery 1990:114-115):

Phonetic Value    Standard Greek    Corinth
/b/               Β                 (new glyph)
/e/               Ε                 Β
/ɛː/              Η                 Β
/eː/              ΕΙ                Ε

Corinth conflated /e/ and /ɛː/, as was pretty standard in that part of Greece. But it used a completely new glyph for beta, and the beta glyph for the epsilon. As for the epsilon glyph, it was put to use for the new monophthong. (Corcyra [Corfu], which was a Corinthian colony, still used ΒΙ = ΕΙ for the monophthong /eː/.) So Corinth divided up vowel space differently to standard Greek. (Tiryns, down the road, divided it differently again, using Β for /ɛː/ but Ε for /e/, and ΕΙ for /eː/, pretty much anticipating the Milesian alphabet division.)
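The Corinthian reshuffle amounts to a lookup table from glyph to normalised letter, which the following Python sketch spells out. The placeholder token for the new Corinthian beta glyph and the function name are hypothetical, the sample word is a simplified rendering of the name Poseidon in the dative rather than a quotation from a stone, and mapping the Ε shape to ει is one possible convention; vowel length for the Β shape still has to be supplied by the editor.

```python
# Glyph shapes as found in Corinth, identified by the code point of the
# standard letter they look like.
CORINTHIAN_BETA = "<corinthian-beta>"   # hypothetical placeholder for the new /b/ glyph
BETA_SHAPE = "\u0392"                   # looks like Β; stands for /e/ and /ɛː/
EPSILON_SHAPE = "\u0395"                # looks like Ε; stands for /eː/

CORINTH_TO_NORMALISED = {
    CORINTHIAN_BETA: "\u03b2",          # β
    BETA_SHAPE: "\u03b5",               # ε (the editor adds a macron where long)
    EPSILON_SHAPE: "\u03b5\u03b9",      # ει
}

def normalise_corinthian(glyphs):
    """Map a sequence of Corinthian glyphs to normalised lower-case Greek."""
    return "".join(CORINTH_TO_NORMALISED.get(g, g.lower()) for g in glyphs)

# Simplified sample: the fourth glyph is the Ε shape, the seventh a capital digamma.
inscription = ["Π", "Ο", "Τ", EPSILON_SHAPE, "Δ", "Α", "\u03dc", "Ο", "Ν", "Ι"]
print(normalise_corinthian(inscription))
```

The table is lossy in one direction only: every Corinthian glyph has a normalised rendering, but a normalised ε no longer says which shape was on the stone.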

And of course, this makes an unholy mess of transliteration. No one wants to transliterate Corinthian /b/ and /e/ ~ /ɛː/ as anything but β and ε. If you are intrigued by the Corinthian innovation, you might want to add a new letter for /eː/ in your transliteration; the problem is, ε is already taken for /e/. So how do you transliterate Corinthian?

Awkwardly. Buck (1955:294) uses a capital E for /eː/, and normal ε for /e/ ~ /ɛː/:

ΔϝΕνία τόδε [σᾶμα], τὸν ὄ̄λεσε πόντος ἀναι[δέ̄ς]
Δϝεινία τόδε σᾶμα, τὸν ὤλεσε πόντος ἀναιδής
This memorial is Dweinias', who perished by the reckless sea.

Σιμίο̄ν μ' ἀνέθ(ε̄)κε ΠοτΕδάϝο̄ν[ι ϝάνακτι]
Σιμίων μ' ἀνέθηκε Ποτειδάϝωνι ϝάνακτι
Simion raised me for Poseidon the King.

Obviously there are going to be problems with using that in the general case. Buck needs to differentiate the two e's because he's making a point about Corinthian phonology. But you couldn't keep using a capital e to distinguish from a lowercase e; the minute you talk about someone whose name starts with an epsilon, you're done for.

Ironically Jeffery (1990:404), so meticulous with her koppas and hetas, takes the opposite approach in her transliteration—and the one most epigraphers would take. The Corinthian Ε glyph stands for what was subsequently written as ει—and that's exactly how she transcribes it:

Δϝεινια τοδε̣ [σαμα] τ̣ον ο̄λεσε ποντος αναι[δε̄ς]
Σιμιο̄ν μ' ανεθ<ε̄>κε Ποτειδαϝο̄ν[ι ϝ̣α̣]νακτι

Given that the Corinthians didn't have an /ei/ distinct from /eː/, the only thing ει can mean in Corinth is /eː/. (If we ignore the fact that, as Jeffery 1990:115 reports, the Corinthians did occasionally get confused, and wrote ΒΙ for /ei/ and E for /eː/.) And Jeffery doesn't have to deal with the problem of how to write an epsilon as distinct from an epsilon. (If someone does feel the urge to use diacritics, the traditional closed vowel underdot will not do, since it's already being used in epigraphy for damaged letters. I would propose the IPA raising diacritic: ε̝.)
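Why the underdot will not do, and what the alternative looks like at the level of code points, can be checked in a short Python sketch. The variable names are hypothetical; the code points are the standard combining dot below (U+0323) and the combining up tack below (U+031D), which is how the IPA raising diacritic is encoded.

```python
import unicodedata

EPSILON = "\u03b5"
DOT_BELOW = "\u0323"      # already means "damaged letter" in epigraphy
UP_TACK_BELOW = "\u031d"  # the IPA raising diacritic

damaged_epsilon = EPSILON + DOT_BELOW
raised_epsilon = EPSILON + UP_TACK_BELOW

for label, text in [("damaged", damaged_epsilon), ("raised", raised_epsilon)]:
    names = [unicodedata.name(ch) for ch in text]
    print(label, text, names)

# Neither sequence has a precomposed form, so NFC leaves two code points each.
print(len(unicodedata.normalize("NFC", raised_epsilon)))
```

Both renderings are plain sequences of a base letter plus a combining mark, so either can be typed today without any new character being encoded; the open question is only which mark a reader will interpret correctly.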

And if even epigraphers are comfortable ditching the extra letter for expediency, you can bet your bottom dollar Greek Letter Corinthian Ei is not going to be proposed for inclusion in Unicode any time soon.

Or so I believed in 2003. EI is now included as Raised E in a proposal I have submitted to the UTC, L2/05-003 Proposal to add Greek epigraphical letters (see also L2/04-389). Raised E conflates the Corinthian glyph with the Boeotian use of ⊢ to represent a short raised /e/ (Thespiae, ca. 424 BC, for a raised /e/ before a vowel: Buck 1955:22, Jeffery 1990:89). The problems with the glyphs are still there; but enough epigraphers in feedback to me think that this deserves a distinct codepoint after all, that the proposal is worth making, even if the glyphs are not ready for primetime.

5. What to transliterate into

A little excursus I owe to discussion with John Cowan is the issue of target transliteration script. It's all very well to say that scholars choose not to proliferate glyphs in their source material, but instead choose a normalised script to transliterate into. But what determines scholars' choice of this script? Because it's not always Roman (let alone IPA).

The choice of script to transliterate-not-proliferate into for Western scholarship was dictated by two principles: patrimony and accessibility. If you were a Slavonicist writing for other Slavonicists, or an Arabist writing for other Arabists, you would be expected to leave your Cyrillic and Arabic (or Syriac or Hebrew) untransliterated: that was the patrimony you were discussing, after all. Your target audience would be sure to already know Cyrillic and Arabic. Furthermore, if your script had a significant contemporary constituency -- significant enough that a non-trivial typographical tradition could develop -- then the script was deemed accessible to your peers, even if that use was limited to the liturgy. A theologian or a linguist could quote Syriac or Coptic in those scripts, because those scripts continued in liturgical use -- so they were known to printers, and to specialists outside the field of palaeography, who could learn their Coptic and Syriac from printed books rather than manuscripts.

If on the other hand you were discussing material in a script which did not make it to print, but was present only in the original sources (accessible to the scholarly republic only with difficulty), then it was your business to transliterate it out of the original script, into a script you deemed accessible --- and which corresponded to your notion of the script's patrimony. Gothic was deemed part of the Germanic patrimony; so it was transliterated out of the long extinct and unfamiliar, Greek-like Gothic script, into the same alphabet used for Old English and Old Norse (with an addition or two). Slavicists rejected Glagolitic in favor of Cyrillic, as Glagolitic was not regarded as accessible enough, being restricted in printed use to a corner of Dalmatia.

In the same way, Semiticists treated Phoenician and its ilk as part of the Hebrew patrimony, and so transcribed it into Hebrew (as a furious thread on Unicode List through much of 2004 brought forth; Semiticists do not see any point in encoding Phoenician separately from Hebrew because they are isomorphic). It certainly helped that Hebrew persisted in liturgical use and had a print tradition, so it was accessible. But the choice of Hebrew rather than Latin transliteration reflected an ideological choice, as well as a practicality: Semiticists approached Semitic via texts published in modern Hebrew script, so modern Hebrew was the natural target script, which the other variants of Phoenician were fully isomorphic to. There was of course no surviving Moabite constituency to protest transliteration into an enemy script.

If you were writing for an audience of general linguists, however, your choice of script was constrained to what scripts you could expect a generalist to be familiar with. In the 19th century, that was really only Greek and Latin --- and maybe Hebrew. (Behaghel's Historical Syntax of German, for example, written in the 1920s, numbers its sections hierarchically with Roman numerals, Arabic numerals, Greek letters, and Hebrew letters --- so he expected his Germanist readers could tell a Bet from a Gimel. And at that font size, they'd be doing better than I would.) Handbooks of Indo-European never present Sanskrit or Armenian in their original scripts. You would, however, expect that a generalist could deal with the different traditions of Latin script use, and would leave Gothic, Irish, French, etc. in their conventional Latin transliteration or orthography. And to this day, no Indo-Europeanist will cite Greek in anything but Greek script.

In the late 20th century, the abandonment of Classical education means that you cannot expect a general linguist to have any fluency in reading Greek, and Greek is universally transliterated in generalist contexts (outside of traditional historical linguistics). Similarly, there will be some supplementing of Latin orthography with IPA, although it does not appear to have supplanted orthographic quotation of Latin text.

Even in specialist usage, the use of Greek script has disappeared in non-Greek contexts: those languages are no longer deemed part of the Greek patrimony -- possibly as a measure of political correctness, but more likely as an acknowledgement of the trickiness in pinning down the phonetics of the letters (a primary concern of decipherment, which the use of Greek script, with its reconstructed phonetic values, would not really address). 19th century editions of Phrygian or Lycian would think nothing of publishing their texts in the normalised Greek orthography. In the 20th century, the "transliterate don't proliferate" approach used Latin, not Greek transliteration, as its target --- though this was a Latin script used by Hellenists, so that Greek letters were readily called on (e.g. tau or beta, as opposed to any number of IPA renderings: recall that philologists don't do IPA).

Outside the quite distinct Mycenaean script tradition, this has not happened with texts in Greek. The Pamphylian script is pretty close to Carian and Lycian, but because Pamphylian itself is a dialect of Greek (however deviant), there has never been any hesitation in publishing it in normalised Greek script (with a couple of additions).

Nick Nicholas, opoudjis [AT] optusnet . com . au
Created: 2003-08-19; Last revision: 2005-05-22
URL: http://www.opoudjis.net/unicode/unicode_epichorica.html