Unicode Collation to alphabetically sort glyphs (e.g. Coptic)

rhythmus · April 5, 2018, 11:08pm

Follow-up from here.

The default in Glyphs’ baked-in glyph data is to order characters numerically by the (hex/binary) Unicode codepoint value. Obvious though that may seem, characters are added to the Unicode standard on an ongoing basis, being assigned codepoints in code chart slots left empty in foregoing releases. Consequently, characters tend to end up in a sequence which does not necessarily have any intrinsic linguistic or otherwise meaningful value, but only reflects the insertion order in the Unicode standard.

The purpose of ordering characters and glyphs within a font (i.e. assigning consecutive GIDs) is to help end users easily find and pick characters from a glyph palette. While there are of course many ways to arrange letters and sorts [sic :-)], I think alphabetical sorting makes most sense, since end users do know the order of the letters in an alphabet, Unicode bit order not so much. That’s why I sort glyphs in my fonts alphabetically, uppercase first, followed by lowercase (at least in bicameral scripts), grouped letter by letter, preceded by case alternates, if any, each followed by stylistic alternates, if any. E.g. A.titl a.sc A a a.calt. But it’s tedious manual work…

Point in case [sic :-p] for the Coptic. In the newly proposed sort order the sequence starts with Ϣ ϣ Ϥ ϥ Ϧ ϧ Ϩ ϩ Ϫ ϫ Ϭ ϭ Ϯ ϯ (U+03E2…U+03EF), only then to be followed by Ⲁⲁ Ⲃⲃ Ⲅⲅ… (~ABC…). That’s because a former version of the Unicode standard assumed Greek characters could be used for encoding Coptic, only adding Coptic letters absent in the Greek alphabet, putting those in the left-over slots within the Greek code chart. Later it was agreed upon to grant the Coptic script its own characters and code chart(s), while the former letters obviously kept their lower-ranking codepoints, resulting in a numerical sort order which puts the lesser-used letters first…
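To make the effect concrete, here is a minimal Python sketch of a plain codepoint sort over five Coptic capitals taken from the glyph list in this post. It only illustrates the numerical ordering; it is not how Glyphs itself sorts.

```python
# Three capitals from the Coptic block and two from the older Greek-block slots.
letters = [
    "\u2C80",  # Ⲁ Alfa
    "\u2C82",  # Ⲃ Vida
    "\u2C84",  # Ⲅ Gamma
    "\u03E2",  # Ϣ Shei
    "\u03E4",  # Ϥ Fei
]

def codepoint_label(ch):
    """Format a character as U+XXXX."""
    return f"U+{ord(ch):04X}"

# Naive numerical sort: the key is simply the codepoint value.
by_codepoint = sorted(letters, key=ord)
labels = [codepoint_label(ch) for ch in by_codepoint]
print(labels)
```

Shei and Fei come out ahead of Alfa, Vida and Gamma, simply because U+03E2 and U+03E4 are smaller numbers than U+2C80.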

IANAL (I am not a linguist⸮) of oriental languages, but rely on Wikipedia to infer correct/traditional alphabetical sort order for foreign languages and scripts, along with Carl Faulmann’s 1880 standard Buch der Schrift. As with many other Euro-Semitic scripts, in Coptic the order of the letters basically is that of the Greek, or rather the original Phoenician, alphabet, with indigenous letters inserted more or less by phonetic association, and with deprecated letters used for counting appended. But my sources are not always exhaustive here, and for lack of a standard it’s not only a tedious, but also an error-prone process.

Luckily there is a standard, though! Unicode is after all more than about indexing characters alone, and we may happily use Unicode Collation Charts for proper and standardized sorting. For the Coptic this is: https://www.unicode.org/charts/collation/chart_Coptic.html

Below is the glyph list for Coptic, using glyph names as proposed by @GeorgSeifert here, but arranged following the Unicode collation chart cited above. Nota bene: Ⳬⳬ Ⳳⳳ Ⳮⳮ are missing from Georg’s list, hence still need a glyph name.

Besides the little Coptic case study, here then is my proposal, generalized: Could Glyphs’ built-in glyph data use Unicode Collation Order for proper alphabetical glyph sorting, instead of numerical character codepoint values?

Unicode		glyph name
uni2C80	Ⲁ	Alfa-coptic
uni2C81	ⲁ	alfa-coptic
uni2C82	Ⲃ	Vida-coptic
uni2C83	ⲃ	vida-coptic
uni2C84	Ⲅ	Gamma-coptic
uni2C85	ⲅ	gamma-coptic
uni2C86	Ⲇ	Dalda-coptic
uni2C87	ⲇ	dalda-coptic
uni2C88	Ⲉ	Eie-coptic
uni2C89	ⲉ	eie-coptic
uni2CB6	Ⲷ	Cryptogrammiceie-coptic
uni2CB7	ⲷ	cryptogrammiceie-coptic
uni2C8A	Ⲋ	Sou-coptic
uni2C8B	ⲋ	sou-coptic
uni2C8C	Ⲍ	Zata-coptic
uni2C8D	ⲍ	zata-coptic
uni2C8E	Ⲏ	Hate-coptic
uni2C8F	ⲏ	hate-coptic
uni2C90	Ⲑ	Thethe-coptic
uni2C91	ⲑ	thethe-coptic
uni2C92	Ⲓ	Iauda-coptic
uni2C93	ⲓ	iauda-coptic
uni2C94	Ⲕ	Kapa-coptic
uni2C95	ⲕ	kapa-coptic
uni2CB8	Ⲹ	dialectPkapa-coptic
uni2CB9	ⲹ	dialectpkapa-coptic
uni2C96	Ⲗ	Laula-coptic
uni2C97	ⲗ	laula-coptic
uni2C98	Ⲙ	Mi-coptic
uni2C99	ⲙ	mi-coptic
uni2C9A	Ⲛ	Ni-coptic
uni2C9B	ⲛ	ni-coptic
uni2CBA	Ⲻ	dialectPni-coptic
uni2CBB	ⲻ	dialectpni-coptic
uni2CBC	Ⲽ	Cryptogrammicni-coptic
uni2CBD	ⲽ	cryptogrammicni-coptic
uni2C9C	Ⲝ	Ksi-coptic
uni2C9D	ⲝ	ksi-coptic
uni2C9E	Ⲟ	O-coptic
uni2C9F	ⲟ	o-coptic
uni2CA0	Ⲡ	Pi-coptic
uni2CA1	ⲡ	pi-coptic
uni2CA2	Ⲣ	Ro-coptic
uni2CA3	ⲣ	ro-coptic
uni2CA4	Ⲥ	Sima-coptic
uni2CA5	ⲥ	sima-coptic
uni2CA6	Ⲧ	Tau-coptic
uni2CA7	ⲧ	tau-coptic
uni2CA8	Ⲩ	Ua-coptic
uni2CA9	ⲩ	ua-coptic
uni2CAA	Ⲫ	Fi-coptic
uni2CAB	ⲫ	fi-coptic
uni2CAC	Ⲭ	Khi-coptic
uni2CAD	ⲭ	khi-coptic
uni2CAE	Ⲯ	Psi-coptic
uni2CAF	ⲯ	psi-coptic
uni2CB0	Ⲱ	Oou-coptic
uni2CB1	ⲱ	oou-coptic
uni2CBE	Ⲿ	OouOld-coptic
uni2CBF	ⲿ	oouOld-coptic
uni2CC0	Ⳁ	Sampi-coptic
uni2CC1	ⳁ	sampi-coptic
uni03E2	Ϣ	Shei-coptic
uni03E3	ϣ	shei-coptic
uni2CEB	Ⳬ
uni2CEC	ⳬ
uni2CC2	Ⳃ	SheiCrossed-coptic
uni2CC3	ⳃ	sheiCrossed-coptic
uni2CC4	Ⳅ	SheiOld-coptic
uni2CC5	ⳅ	sheiOld-coptic
uni2CC6	Ⳇ	EshOld-coptic
uni2CC7	ⳇ	eshOld-coptic
uni03E4	Ϥ	Fei-coptic
uni03E5	ϥ	fei-coptic
uni03E6	Ϧ	Khei-coptic
uni03E7	ϧ	khei-coptic
uni2CF2	Ⳳ
uni2CF3	ⳳ
uni2CC8	Ⳉ	KheiAkhmimic-coptic
uni2CC9	ⳉ	kheiAkhmimic-coptic
uni03E8	Ϩ	Hori-coptic
uni03E9	ϩ	hori-coptic
uni2CCA	Ⳋ	HoriDialectP-coptic
uni2CCB	ⳋ	horiDialectP-coptic
uni2CCC	Ⳍ	HoriOld-coptic
uni2CCD	ⳍ	horiOld-coptic
uni2CCE	Ⳏ	HaOld-coptic
uni2CCF	ⳏ	haOld-coptic
uni2CD0	Ⳑ	HaLshaped-coptic
uni2CD1	ⳑ	haLshaped-coptic
uni2CD2	Ⳓ	HeiOld-coptic
uni2CD3	ⳓ	heiOld-coptic
uni2CD4	Ⳕ	HatOld-coptic
uni2CD5	ⳕ	hatOld-coptic
uni03EA	Ϫ	Gangia-coptic
uni03EB	ϫ	gangia-coptic
uni2CED	Ⳮ
uni2CEE	ⳮ
uni2CD6	Ⳗ	GangiaOld-coptic
uni2CD7	ⳗ	gangiaOld-coptic
uni03EC	Ϭ	Shima-coptic
uni03ED	ϭ	shima-coptic
uni2CD8	Ⳙ	DjaOld-coptic
uni2CD9	ⳙ	djaOld-coptic
uni2CDA	Ⳛ	ShimaOld-coptic
uni2CDB	ⳛ	shimaOld-coptic
uni2CDC	Ⳝ	Shima-nubian
uni2CDD	ⳝ	shima-nubian
uni03EE	Ϯ	Dei-coptic
uni03EF	ϯ	dei-coptic
uni2CB2	Ⲳ	dialectPalef-coptic
uni2CB3	ⲳ	dialectpalef-coptic
uni2CB4	Ⲵ	AinOld-coptic
uni2CB5	ⲵ	ainOld-coptic
uni2CDE	Ⳟ	Ngi-nubian
uni2CDF	ⳟ	ngi-nubian
uni2CE0	Ⳡ	Nyi-nubian
uni2CE1	ⳡ	nyi-nubian
uni2CE2	Ⳣ	Wau-nubian
uni2CE3	ⳣ	wau-nubian
uni2CE4	ⳤ	kai-coptic
uni2CE5	⳥	miro-coptic
uni2CE6	⳦	piro-coptic
uni2CE7	⳧	stauros-coptic
uni2CE8	⳨	tauro-coptic
uni2CE9	⳩	khiro-coptic
uni2CEA	⳪	shimasima-coptic
uni2CF9	⳹	fullstop-nubian
uni2CFA	⳺	directquestion-nubian
uni2CFB	⳻	indirectquestion-nubian
uni2CFC	⳼	versedivider-nubian
uni2CFD	⳽	onehalf-coptic
uni2CFE	⳾	fullstop-coptic
uni2CFF	⳿	morphologicaldivider-coptic

GeorgSeifert · April 6, 2018, 12:04pm

Glyphs does not use the unicode for sorting. It uses either the glyph names or a special key called “sortName”.

rhythmus · April 10, 2018, 8:36pm

Yes, but no. Most of the <glyph>s in the default GlyphData.xml file do not have a sortName attribute. Neither can the glyph name be used for sorting (name="space" comes before name="exclam", though /s/ alphabetically comes after /e/). Hence, Glyphs does not, by default, use either of the two exclusively, but instead must also rely on the implied source order in GlyphData.xml of the <glyph> items as listed within the <glyphData> array.
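A small Python sketch of that observation. The two-entry XML excerpt is hypothetical and much simplified compared with the real GlyphData.xml, and the "sortName, else name" fallback is only my assumption about what a name-based sort would do, not Glyphs’ actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified excerpt; real entries carry more attributes.
SAMPLE = """<glyphData>
  <glyph unicode="0020" name="space"/>
  <glyph unicode="0021" name="exclam"/>
</glyphData>"""

glyphs = ET.fromstring(SAMPLE).findall("glyph")

# Order in which the <glyph> items are listed in the file.
source_order = [g.get("name") for g in glyphs]

# Order obtained by sorting on the glyph name alone.
name_order = sorted(source_order)

def sort_key(glyph):
    """Assumed fallback chain: sortName if present, otherwise the glyph name."""
    return glyph.get("sortName") or glyph.get("name")

keyed_order = [g.get("name") for g in sorted(glyphs, key=sort_key)]

print(source_order, name_order, keyed_order)
```

With neither entry carrying a sortName, the keyed order equals the plain name order, in which exclam precedes space. The listing in the file has it the other way round, so something beyond name and sortName must decide.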

The question then is: How has GlyphData.xml been generated in the first place? In other words, what was the initial sorting algorithm used for arranging glyphs with a codepoint assigned? Obviously: since these glyphs are mapped to characters (i.e. they have a unicode attribute, whose value is a hex number), the naive sort order is a numerical sort, on the assumption that characters in the Unicode standard have been sorted alphabetically. Which they are not; see above.

For sorting, the standard precisely specifies the Unicode Collation Algorithm (UCA), which is implemented in ICU (International Components for Unicode, originally developed at IBM), which in turn has bindings for Python as PyICU. A demo app is available online: http://demo.icu-project.org/icu-bin/collation.html

My proposal is to list the <glyph>s in GlyphData.xml in compliance with Unicode collation (using e.g. PyICU), in order to have true alphabetical sorting enabled by default. Surely users could roll their own, but then they must do incredibly tedious and brittle manual sorting, and/or rely on extremely verbose attribute usage in lists like sortName="teng001", sortName="teng002", sortName="teng003"… Repeat each time Unicode updates its standard…
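The numbering itself could at least be scripted. A minimal sketch, assuming the glyph names are already in the desired collated order: the function name, the "copt" prefix and the padding width are made up for illustration, and the printed <glyph> lines are simplified fragments rather than complete GlyphData.xml entries.

```python
def make_sort_names(glyph_names, prefix, width=3):
    """Map each glyph name to prefix + zero-padded 1-based position."""
    return {
        name: f"{prefix}{position:0{width}d}"
        for position, name in enumerate(glyph_names, start=1)
    }

# First four names of the Coptic list above, already in collation order.
collated = ["Alfa-coptic", "alfa-coptic", "Vida-coptic", "vida-coptic"]

sort_names = make_sort_names(collated, prefix="copt")

for name, sort_name in sort_names.items():
    print(f'<glyph name="{name}" sortName="{sort_name}"/>')
```

Zero-padding keeps the generated values in the right order under plain string comparison. The whole list would still have to be regenerated whenever the collation data changes.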

rhythmus · April 10, 2018, 10:02pm

I just did a few experiments, with this result (†). Sorting by ICU/UCA (depending on parameters used) gives, for example, the following peculiar sequence inserted into the Greek alphabet:

… Κ κ ϰ Ϗ ϗ Λ λ ᴧ µ Μ μ ㎂㎌㎍㎕㎛㎲㎶㎼ Ν ν …

That’s indeed alphabetical sort order to the extreme! One would probably want to have UCA sorting scoped to alphabets, i.e. within subsets of characters with a common script and/or category, so CJK Compatibility symbols do not go on a trip to Greece ;-)…
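A rough sketch of such scoping in Python, standard library only. Since the standard library does not expose the Unicode Script property, the first word of the character name serves as a crude stand-in here; that heuristic is my assumption and will misfile some characters. The key defaults to the codepoint, but a collation key (for instance a PyICU collator’s getSortKey) could be passed in instead.

```python
import unicodedata
from itertools import groupby

def script_hint(ch):
    """Crude script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def scoped_sort(chars, key=ord):
    """Group characters by script hint, then sort inside each group by `key`."""
    ordered = sorted(chars, key=lambda ch: (script_hint(ch), key(ch)))
    return {hint: list(group) for hint, group in groupby(ordered, key=script_hint)}

sample = [
    "\u039C",  # Μ Greek capital mu
    "\u03BC",  # μ Greek small mu
    "\u3382",  # ㎂ CJK Compatibility square symbol
    "\u00B5",  # µ micro sign
    "\u039D",  # Ν Greek capital nu
    "\u03BD",  # ν Greek small nu
]

groups = scoped_sort(sample)
for hint, members in groups.items():
    print(hint, members)
```

The square symbol and the micro sign end up in groups of their own instead of between Μ and Ν. As a side effect of going by name, the Coptic letters at U+03E2…U+03EF get the same hint as those in the Coptic block proper.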

It might take quite some fiddling with what is possible with PyICU configs, but I think it would be well worth the effort if you guys implemented proper alphabetical collation, from which fonts and auto-generated type specimens could benefit. Thanks a lot in advance!

(†) Beware: that page on Github may crash your browser while it tries to render all the Unicode BMP characters at once…

GeorgSeifert · April 11, 2018, 5:24am

Instead of implementing automatic Unicode sorting, I would rather hard-code it into the GlyphData. There are a lot of glyphs without unicodes and those would not be handled well otherwise.

Have you tried the latest Glyphs version? You need to update glyph info for at least one glyph to trigger the sorting.