Lam-alef ligatures missing from GlyphsData.xml

khaled · August 26, 2020, 10:55pm

The two ligatures lam_alefWasla-ar and lam_alefWasla-ar.fina, though present in the Arabic -> Basic list, don’t have entries in GlyphsData.xml, which causes Make Composite Glyphs to make nonessential components for them. This is rather odd, since all other lam-alef ligatures in that list work fine.

GeorgSeifert · August 27, 2020, 8:17am

You are right. I’ll add them. What would you prefer as production name:
uni06440671 or uniFEDFFB51?
uniFEDFFB51 or uni06440671.fina
or any other combination. I tend to use the non-presentation-form Unicodes if possible.

khaled · August 27, 2020, 3:56pm

I’d prefer the uni06440671 and uni06440671.fina.
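For readers unfamiliar with the naming scheme, here is a minimal sketch of how such production names are put together: `uni` followed by four uppercase hex digits per code point, plus an optional suffix. The helper `uni_name` is hypothetical and is not part of Glyphs or GlyphsData.xml; it only handles BMP code points.

```python
def uni_name(codepoints, suffix=""):
    """Build a 'uniXXXXYYYY' style name from BMP code points (hypothetical helper)."""
    for cp in codepoints:
        if not 0 <= cp <= 0xFFFF:
            raise ValueError(f"U+{cp:X} is outside the BMP; 'uni' names use 4 hex digits")
    # Four uppercase hex digits per code point, concatenated without separators.
    return "uni" + "".join(f"{cp:04X}" for cp in codepoints) + suffix


# Names based on the original characters (lam U+0644, alef wasla U+0671).
nominal = uni_name([0x0644, 0x0671])
nominal_fina = uni_name([0x0644, 0x0671], ".fina")

# Name based on presentation forms (lam initial U+FEDF, alef wasla final U+FB51).
presentation = uni_name([0xFEDF, 0xFB51])

print(nominal, nominal_fina, presentation)
```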

In general I think using presentation forms for Arabic positional variants is bad practice and shouldn’t be enabled by default. Very few applications need them (and those don’t support OpenType at all, so users get a very degraded experience anyway). It also breaks text extraction from PDFs that use the cmap table or glyph names to populate the PDF’s /ToUnicode mapping, as one would end up with presentation forms in the extracted text instead of the original characters (some applications will normalize this away, but since it requires NFKC/NFKD normalization, not all applications will do it).
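The normalization point can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch: the string `extracted` is an assumed example of what text extraction might return if a lam + alef wasla ligature were mapped to presentation forms.

```python
import unicodedata

# Assumed example: lam initial form (U+FEDF) + alef wasla final form (U+FB51).
extracted = "\uFEDF\uFB51"


def codepoints(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Canonical normalization (NFC/NFD) leaves presentation forms untouched;
# only compatibility normalization (NFKC/NFKD) maps them back.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, codepoints(unicodedata.normalize(form, extracted)))
```

An application that only applies NFC or NFD (or no normalization at all) keeps the presentation forms, so searching the extracted text for the original characters fails.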

GeorgSeifert · August 27, 2020, 5:34pm

So should I remove all presentation Unicodes? I would very much like this.

khaled · August 27, 2020, 6:22pm

Yes, that would be my preference. If some people actually have a use for them (I highly doubt it), they can add the presentation form Unicode values manually.

GeorgSeifert · August 27, 2020, 9:14pm

That is quite a bit of work to untangle. Some characters in that block don’t have an obvious means of accessing them (FBC1, FD3E, FDFC). And a lot of production names need to be updated.

This is a diff for the changes (mostly done by some scripting).
0001-don-t-use-Arabic-presentation-form-unicodes.patch.zip (40.5 KB)

I found some issues that I need to solve tomorrow (FBF9 uighurkirghizyehHamzaabove_alefMaksura-ar: there is no uighurkirghizyehHamzaabove-ar).

khaled · August 27, 2020, 11:43pm

Looks fine, some things I noticed:

alef-ar.fina.short still has the production name uniFE8E.short.
alefFathatan-ar and alefFathatan-ar.fina should remain legacy encoded, as they are not of much use in modern fonts (they come from metal type, when it was easier to have one sort for this very common combination; it is basically a ligature of alef and fathatan, not a single letter).
thalAlefabove-ar, rehAlefabove-ar, hehAlefabove-ar.init and alefMaksuraAlefabove-ar.fina are also legacy ligatures of a base letter and a mark, so they should remain legacy encoded (no idea why these got into Unicode at all, can’t imagine what legacy use they were for!).
fathatan-ar.isol, fathatan-ar.medi, fatha-ar.isol, fatha-ar.medi, damma-ar.isol, damma-ar.medi, kasra-ar.isol, kasra-ar.medi, shadda-ar.isol, shadda-ar.medi, sukun-ar.isol, sukun-ar.medi, dammatan-ar.isol and kasratan-ar.isol are all legacy positional variants of vowel marks (for systems that didn’t have a way to place marks over glyphs); they should remain legacy encoded as well.

GeorgSeifert · August 28, 2020, 7:11am

Thanks a lot.

GeorgSeifert · August 28, 2020, 10:17am

Why is the last group different from all the other positional forms? So why should fathatan-ar.isol have a presentation unicode and not alef-ar.isol?

GeorgSeifert · August 28, 2020, 10:18am

And do you have a suggestion for FBF9/uighurkirghizyehHamzaabove_alefMaksura-ar?

khaled · August 28, 2020, 3:02pm

Combining marks don’t have positional forms. These were a hack in some old systems that couldn’t position marks, so the marks were placed over a space (.isol form) or over a tatweel (.medi form). So instead of:
ضَرَب
it would be:
ضـَر َب

But since combining marks have no positional forms, OpenType shaping engines will not apply isol or medi lookups on them, so if one really wants to emulate this behavior in OpenType it will need contextual substitutions, and for legacy systems (I have never seen any of this) it will need the legacy code points anyway.
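The space/tatweel hack is visible in the Unicode compatibility decompositions themselves. A small sketch using Python's standard `unicodedata` module (the exact output depends on the Unicode version bundled with the interpreter, and the helper `describe` is just for illustration):

```python
import unicodedata


def describe(ch):
    """Return (code point, compatibility decomposition) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.decomposition(ch)


fatha_isol = describe("\uFE76")  # ARABIC FATHA ISOLATED FORM
fatha_medi = describe("\uFE77")  # ARABIC FATHA MEDIAL FORM
alef_isol = describe("\uFE8D")   # ARABIC LETTER ALEF ISOLATED FORM, for contrast

for entry in (fatha_isol, fatha_medi, alef_isol):
    print(*entry)

# The nominal fatha (U+064E) is a combining mark: non-zero combining class.
print(unicodedata.combining("\u064E"))
```

The mark "forms" decompose to a space or a tatweel (U+0640) followed by the mark, whereas the letter form decomposes to the letter alone.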

khaled · August 28, 2020, 3:11pm

FBF9 decomposes to yeh-ar hamzaabove-ar alefMaksura-ar in Unicode, so it is basically the same as yehHamzaabove_alefMaksura-ar and has no use other than the legacy code point (there is no uighurkirghizyehHamzaabove-ar in Unicode AFAICT, probably because it is the same as yehHamzaabove-ar).
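This can be verified with Python's standard `unicodedata` module. In the sketch below, U+FC03 is taken to be the legacy code point of the isolated yehHamzaabove_alefMaksura-ar ligature; that pairing with the Glyphs name is an assumption made for illustration.

```python
import unicodedata


def nfkd(text):
    """Full compatibility decomposition of text."""
    return unicodedata.normalize("NFKD", text)


uighur_kirghiz = "\uFBF9"  # Uighur Kirghiz yeh-hamza + alef maksura, isolated form
plain = "\uFC03"           # assumed: yeh-hamza + alef maksura, isolated form

for ch in (uighur_kirghiz, plain):
    print(f"U+{ord(ch):04X}", [f"U+{ord(c):04X}" for c in nfkd(ch)])

# Both reduce to yeh (U+064A), hamza above (U+0654), alef maksura (U+0649).
same = nfkd(uighur_kirghiz) == nfkd(plain)
print(same)
```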

GeorgSeifert · August 28, 2020, 7:57pm

But those systems will need the legacy Unicode values for everything else, too? So we still don’t need them?

khaled · August 28, 2020, 9:10pm

Indeed, you probably can drop them completely without any loss.