Unicode Collation to alphabetically sort glyphs (e.g. Coptic)

Follow-up from here.

The default in Glyphs’ baked-in glyph data is to order characters numerically by their (hex/binary) Unicode codepoint value. Obvious though that may seem, characters are added to the Unicode standard on an ongoing basis, and are assigned codepoints in code chart slots left empty in earlier releases. Consequently, characters tend to end up in a sequence that has no intrinsic linguistic or otherwise meaningful value, but only reflects the insertion order in the Unicode standard.

The purpose of ordering characters and glyphs within a font (i.e. assigning consecutive GIDs) is to help end users easily find and pick characters from a glyph palette. While there are of course many ways to arrange letters and sorts [sic :-)], I think alphabetical sorting makes the most sense, since end users do know the order of the letters in an alphabet, Unicode bit order not so much. That’s why I sort glyphs in my fonts alphabetically, uppercase first, followed by lowercase (at least in bicameral scripts), grouped letter by letter, preceded by case alternates, if any, each followed by stylistic alternates, if any. E.g. A.titl a.sc A a a.calt. But it’s tedious manual work…

Point in case [sic :-p] for the Coptic. In the newly proposed sort order the sequence starts with Ϣ ϣ Ϥ ϥ Ϧ ϧ Ϩ ϩ Ϫ ϫ Ϭ ϭ Ϯ ϯ (U+03E2…U+03EF), only then to be followed by Ⲁⲁ Ⲃⲃ Ⲅⲅ… (~ABC…). That’s because a former version of the Unicode standard assumed Greek characters could be used for encoding Coptic, adding only those Coptic letters that are absent from the Greek alphabet and putting them in the left-over slots within the Greek code chart. Later it was agreed to grant the Coptic script its own characters and code chart(s), while the former letters obviously kept their lower codepoints, resulting in a numerical sort order which puts the lesser-used letters first…
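
To make the effect concrete, here is a minimal Python sketch (my own illustration, not anything Glyphs runs). It sorts the first three Coptic letter pairs together with Shei and Fei the naive way, by codepoint. The variable names are made up for the example.

```python
# Code points for the first three Coptic letter pairs (Alfa, Vida, Gamma)
# and for Shei and Fei, which were encoded earlier in the Greek code chart.
alphabet_start = [0x2C80, 0x2C81, 0x2C82, 0x2C83, 0x2C84, 0x2C85]
greek_chart_letters = [0x03E2, 0x03E3, 0x03E4, 0x03E5]

letters = [chr(cp) for cp in alphabet_start + greek_chart_letters]

# sorted() on single-character strings compares code points,
# i.e. it yields the numerical order discussed above.
by_codepoint = sorted(letters)

for ch in by_codepoint:
    print(f"U+{ord(ch):04X} {ch}")
```

Shei and Fei come out first, ahead of Alfa, purely because of when and where they were encoded.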

IANAL (I am not a linguist⸮) of oriental languages, but rely on Wikipedia to infer the correct/traditional alphabetical sort order for foreign languages and scripts, along with Carl Faulmann’s 1880 standard work Buch der Schrift. As with many other Euro-Semitic scripts, in Coptic the order of the letters is basically that of the Greek, or rather the original Phoenician, alphabet, with indigenous letters inserted more or less by phonetic association, and with deprecated letters used for counting appended. But my sources are not always exhaustive here, and for lack of a standard the process is not only tedious but also error-prone.

Luckily there is a standard, though! Unicode is, after all, about more than indexing characters alone, and we may happily use the Unicode Collation Charts for proper and standardized sorting. For the Coptic this is: https://www.unicode.org/charts/collation/chart_Coptic.html

Below is the glyph list for Coptic, using glyph names as proposed by @GeorgSeifert here, but arranged following the Unicode collation chart cited above. Nota bene: Ⳬⳬ Ⳳⳳ Ⳮⳮ are missing from Georg’s list and hence still need a glyph name.

Besides the little Coptic case study, here then is my proposal, generalized: Could Glyphs’ built-in glyph data use Unicode Collation Order for proper alphabetical glyph sorting, instead of numerical character codepoint values?

Unicode glyph name
uni2C80 Alfa-coptic
uni2C81 alfa-coptic
uni2C82 Vida-coptic
uni2C83 vida-coptic
uni2C84 Gamma-coptic
uni2C85 gamma-coptic
uni2C86 Dalda-coptic
uni2C87 dalda-coptic
uni2C88 Eie-coptic
uni2C89 eie-coptic
uni2CB6 Cryptogrammiceie-coptic
uni2CB7 cryptogrammiceie-coptic
uni2C8A Sou-coptic
uni2C8B sou-coptic
uni2C8C Zata-coptic
uni2C8D zata-coptic
uni2C8E Hate-coptic
uni2C8F hate-coptic
uni2C90 Thethe-coptic
uni2C91 thethe-coptic
uni2C92 Iauda-coptic
uni2C93 iauda-coptic
uni2C94 Kapa-coptic
uni2C95 kapa-coptic
uni2CB8 dialectPkapa-coptic
uni2CB9 dialectpkapa-coptic
uni2C96 Laula-coptic
uni2C97 laula-coptic
uni2C98 Mi-coptic
uni2C99 mi-coptic
uni2C9A Ni-coptic
uni2C9B ni-coptic
uni2CBA dialectPni-coptic
uni2CBB dialectpni-coptic
uni2CBC Cryptogrammicni-coptic
uni2CBD cryptogrammicni-coptic
uni2C9C Ksi-coptic
uni2C9D ksi-coptic
uni2C9E O-coptic
uni2C9F o-coptic
uni2CA0 Pi-coptic
uni2CA1 pi-coptic
uni2CA2 Ro-coptic
uni2CA3 ro-coptic
uni2CA4 Sima-coptic
uni2CA5 sima-coptic
uni2CA6 Tau-coptic
uni2CA7 tau-coptic
uni2CA8 Ua-coptic
uni2CA9 ua-coptic
uni2CAA Fi-coptic
uni2CAB fi-coptic
uni2CAC Khi-coptic
uni2CAD khi-coptic
uni2CAE Psi-coptic
uni2CAF psi-coptic
uni2CB0 Oou-coptic
uni2CB1 oou-coptic
uni2CBE OouOld-coptic
uni2CBF ⲿ oouOld-coptic
uni2CC0 Sampi-coptic
uni2CC1 sampi-coptic
uni03E2 Ϣ Shei-coptic
uni03E3 ϣ shei-coptic
uni2CEB
uni2CEC
uni2CC2 SheiCrossed-coptic
uni2CC3 sheiCrossed-coptic
uni2CC4 SheiOld-coptic
uni2CC5 sheiOld-coptic
uni2CC6 EshOld-coptic
uni2CC7 eshOld-coptic
uni03E4 Ϥ Fei-coptic
uni03E5 ϥ fei-coptic
uni03E6 Ϧ Khei-coptic
uni03E7 ϧ khei-coptic
uni2CF2
uni2CF3
uni2CC8 KheiAkhmimic-coptic
uni2CC9 kheiAkhmimic-coptic
uni03E8 Ϩ Hori-coptic
uni03E9 ϩ hori-coptic
uni2CCA HoriDialectP-coptic
uni2CCB horiDialectP-coptic
uni2CCC HoriOld-coptic
uni2CCD horiOld-coptic
uni2CCE HaOld-coptic
uni2CCF haOld-coptic
uni2CD0 HaLshaped-coptic
uni2CD1 haLshaped-coptic
uni2CD2 HeiOld-coptic
uni2CD3 heiOld-coptic
uni2CD4 HatOld-coptic
uni2CD5 hatOld-coptic
uni03EA Ϫ Gangia-coptic
uni03EB ϫ gangia-coptic
uni2CED
uni2CEE
uni2CD6 GangiaOld-coptic
uni2CD7 gangiaOld-coptic
uni03EC Ϭ Shima-coptic
uni03ED ϭ shima-coptic
uni2CD8 DjaOld-coptic
uni2CD9 djaOld-coptic
uni2CDA ShimaOld-coptic
uni2CDB shimaOld-coptic
uni2CDC Shima-nubian
uni2CDD shima-nubian
uni03EE Ϯ Dei-coptic
uni03EF ϯ dei-coptic
uni2CB2 dialectPalef-coptic
uni2CB3 dialectpalef-coptic
uni2CB4 AinOld-coptic
uni2CB5 ainOld-coptic
uni2CDE Ngi-nubian
uni2CDF ngi-nubian
uni2CE0 Nyi-nubian
uni2CE1 nyi-nubian
uni2CE2 Wau-nubian
uni2CE3 wau-nubian
uni2CE4 kai-coptic
uni2CE5 miro-coptic
uni2CE6 piro-coptic
uni2CE7 stauros-coptic
uni2CE8 tauro-coptic
uni2CE9 khiro-coptic
uni2CEA shimasima-coptic
uni2CF9 fullstop-nubian
uni2CFA directquestion-nubian
uni2CFB indirectquestion-nubian
uni2CFC versedivider-nubian
uni2CFD onehalf-coptic
uni2CFE fullstop-coptic
uni2CFF ⳿ morphologicaldivider-coptic

Glyphs does not use the Unicode value for sorting. It uses either the glyph names or a special key called “sortName”.

Yes, but no. Most of the <glyph>s in the default GlyphData.xml file do not have a sortName attribute. Neither can the glyph name alone be used for sorting (name="space" comes before name="exclam", though /s/ alphabetically comes after /e/). Hence, Glyphs does not, by default, use either of the two exclusively, but must also rely on the implied source order of the <glyph> items as listed within the <glyphData> array in GlyphData.xml.
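
A small Python sketch of that argument, using a hypothetical and heavily shortened fragment in the style of GlyphData.xml (the real file has more attributes per <glyph>). It only illustrates my reading of the file; it makes no claim about how Glyphs sorts internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the unicode and name attributes are shown.
FRAGMENT = """
<glyphData>
  <glyph unicode="0020" name="space"/>
  <glyph unicode="0021" name="exclam"/>
  <glyph unicode="0022" name="quotedbl"/>
</glyphData>
"""

root = ET.fromstring(FRAGMENT)
glyphs = root.findall("glyph")

# Order in which the <glyph> items appear in the file.
source_order = [g.get("name") for g in glyphs]

# Order you would get from sorting by glyph name alone.
name_order = sorted(source_order)

# Glyphs in this fragment that carry an explicit sortName (none do).
have_sort_name = [g.get("name") for g in glyphs if g.get("sortName")]

print(source_order)
print(name_order)
print(have_sort_name)
```

With no sortName present and a name sort that would put exclam before space, the only thing left to explain the familiar order is the position in the file.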

The question then is: How was GlyphData.xml generated in the first place? Put differently: what was the initial sorting algorithm used for arranging glyphs that have a codepoint assigned? Obviously, since these glyphs are mapped to characters (i.e. they have a unicode attribute) whose values are hex numbers, the naive sort order is a numerical one, which assumes that characters in the Unicode standard have been sorted alphabetically. They have not; see above.

For sorting, Unicode precisely specifies the Unicode Collation Algorithm (UCA), which is implemented in ICU (International Components for Unicode, originally developed at IBM); ICU has Python bindings in PyICU. A demo app is available online: http://demo.icu-project.org/icu-bin/collation.html

My proposal is to list the <glyph>s in GlyphData.xml in compliance with Unicode collation (using e.g. PyICU), in order to have true alphabetical sorting enabled by default. Surely users could roll their own, but then they must do incredibly tedious and brittle manual sorting, and/or rely on extremely verbose attribute usage in lists like sortName="teng001", sortName="teng002", sortName="teng003"… Repeat each time Unicode updates its standard…
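
To show what that roll-your-own route looks like, here is a rough Python sketch that numbers a few rows from the Coptic list above with running sortName values. The "copt" prefix and the helper name with_sort_names are invented for this example, and the order is typed in by hand rather than computed by UCA.

```python
# (code point, glyph name) pairs in the desired order,
# copied by hand from the Coptic list above.
ordered = [
    (0x2C88, "Eie-coptic"),
    (0x2C89, "eie-coptic"),
    (0x2CB6, "Cryptogrammiceie-coptic"),
    (0x2CB7, "cryptogrammiceie-coptic"),
    (0x2C8A, "Sou-coptic"),
    (0x2C8B, "sou-coptic"),
]


def with_sort_names(pairs, prefix="copt"):
    """Return <glyph> lines with a zero-padded running sortName.

    The prefix is a made-up value for illustration only.
    """
    lines = []
    for index, (codepoint, name) in enumerate(pairs, start=1):
        lines.append(
            f'<glyph unicode="{codepoint:04X}" name="{name}" '
            f'sortName="{prefix}{index:03d}"/>'
        )
    return lines


glyph_lines = with_sort_names(ordered)
print("\n".join(glyph_lines))
```

Inserting a single new letter in the middle means renumbering everything after it, which is the brittleness I mean.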

I just did a few experiments, with this result (†). Sorting by ICU/UCA (depending on parameters used) gives, for example, the following peculiar sequence inserted into the Greek alphabet:

… Κ κ ϰ Ϗ ϗ Λ λ ᴧ µ Μ μ ㎂ ㎌ ㎍ ㎕ ㎛ ㎲ ㎶ ㎼ Ν ν …

That’s indeed alphabetical sort order to the extreme! One would probably want to have UCA sorting scoped to alphabets, i.e. within subsets of characters with a common script and/or category, so CJK Compatibility symbols do not go on a trip to Greece ;-)…
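
A possible way to do such scoping, sketched in Python with the standard library only. Since unicodedata has no Script property, the first word of the character name is used here as a crude stand-in for the script; that is an assumption made for this example, and a real implementation would want proper script data (e.g. via ICU). The sketch also shows why the squared units travel with mu: their compatibility decomposition starts with μ.

```python
import unicodedata

# The sequence quoted above, written as code points.
sequence = [
    0x039A, 0x03BA, 0x03F0, 0x03CF, 0x03D7,  # Kappa, kappa and kai symbols
    0x039B, 0x03BB, 0x1D27,                  # Lamda forms
    0x00B5, 0x039C, 0x03BC,                  # micro sign, Mu, mu
    0x3382, 0x338C, 0x338D, 0x3395,          # squared units starting with mu
    0x339B, 0x33B2, 0x33B6, 0x33BC,
    0x039D, 0x03BD,                          # Nu, nu
]
chars = [chr(cp) for cp in sequence]


def crude_script(ch):
    """First word of the Unicode character name, used as a rough script proxy."""
    return unicodedata.name(ch).split()[0]


# Keep only characters whose name starts with GREEK; order is preserved.
greek_only = [ch for ch in chars if crude_script(ch) == "GREEK"]

# The compatibility decomposition of the first squared unit begins with mu.
decomposed = unicodedata.normalize("NFKD", chr(0x3382))

print(" ".join(greek_only))
print([f"U+{ord(c):04X}" for c in decomposed])
```

Note that this proxy also drops the micro sign (its name is MICRO SIGN), which may or may not be what one wants in a Greek section.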

It might take quite some fiddling with what is possible with PyICU configs, but I think it would indeed be worth the effort if you guys implemented proper alphabetical collation, from which fonts and auto-generated type specimens could benefit. Thanks a lot in advance!

(†) Beware: that page on GitHub may crash your browser while it tries to render all the Unicode BMP characters at once…

Instead of implementing automatic Unicode sorting, I would rather hard-code it into the GlyphData. There are a lot of glyphs without Unicode values, and those would not be handled well otherwise.

Have you tried the latest Glyphs version? You need to update glyph info for at least one glyph to trigger the sorting.