Automatic assignment of correct unicode code points for Arabic characters

Hello!

I’m using Glyphs 3 (3.2.3 [3260], to be precise), and a lot of Unicode code points are missing from the glyph info for Arabic, which means that when I set up a new Arabic font, I have to go in and manually add the code points for the majority of characters. Would it be possible to rectify this so the code points are assigned automatically when starting a new Arabic project, or when adding new Arabic glyphs to an existing project?

Thanks!

What code points do you mean?

I produce Arabic fonts occasionally and never needed to add Unicode values manually.

Keep in mind that you should not encode positional alternates.


Right. It was about the positional forms and the double encoding of isolated ones. Out of curiosity, why do you suggest it should be done this way? Why is encoding positional forms and double encoding bad in your view?

It was a hack to somehow support Arabic in apps that didn’t support Unicode. It hasn’t been needed for a long time.

Wouldn’t Unicode values still be needed to be able to type positional forms out of their normal position (e.g., to make tables like the ones on Wikipedia’s list of Arabic letter components)?

What Janus asked above, yes, but also there are apps still in use out there that require this. One example is Kelk, which is widely used for book publishing in Iran.

Positional forms are supposed to be accessed through substitution features.

The Unicode values are only for backwards compatibility with outdated encodings (enabling round trip conversion). I don’t know about Kelk but if it doesn’t support OpenType, you can use a custom parameter in the exports, forgot the name now, will look it up in a bit and update here.

Be aware though: If you do this (encoding glyphs that are accessed through GSUB), you will effectively break the font, and very likely the character stream in documents that use the font. This will affect all text-related operations, including copying and pasting, searching, and things like changing the font.

I’ve had clients in East Asia that work with gaming consoles and such that also have this issue, and because many companies/clients using Arabic use software that isn’t easy to account for in terms of OT support, I prefer to add these to my fonts. It’s not been my experience that it has caused issues, but the lack of it has.

I do appreciate the warning though. Thanks!

Okay, but how would that enable you to write something like, “The medial form of ﺡ is ﺤ”? How can substitution features alone possibly support that?

Could you expand on that? There are a number of standard GSUB substitutions that replace one encoded glyph with another (e.g., the fina replacement of U+03C3 σ by U+03C2 ς in Greek), and they don’t break fonts – why would encoding Arabic positional forms with their Unicode values break the font?

You type them with a zero-width joiner (U+200D). (That needs some support from the font.)
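
For the curious, the ZWJ trick can be sketched in a few lines of Python (a minimal illustration, not from Glyphs itself; how the result actually renders depends on the shaping engine and the font):

```python
# The zero-width joiner U+200D signals joining context on both sides, so a
# ZWJ-aware shaping engine renders the lone letter in its medial form
# without any presentation-form code point being involved.
ZWJ = "\u200D"
hah = "\u062D"                 # ARABIC LETTER HAH (base character)
medial_in_context = ZWJ + hah + ZWJ
print(medial_in_context)       # shaped as the medial form ﺤ by a ZWJ-aware engine
```

The underlying text remains the plain base character plus two joiners, so copying, searching, and font changes all keep working.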

Three things:

  1. Yes, there are exceptions. Most notably, subscript and superscript figures that are both encoded and are supposed to have a substitution feature. These exceptions are specifically listed in the OT spec.
  2. AFAIK, the two sigmas actually are not an exception, they are considered different characters, and are also typed separately (take a look at Greek keyboards). No OT feature involved.
  3. Unicode is a character encoding, not a glyph encoding. Different variants of the same character (in this case positional forms) are bound to the same Unicode value. Yes, there are presentation forms in Unicode, but they are for round-tripping with old encodings only, so that you can convert old data. They are not supposed to be used anymore.

An example. The word كتب is the sequence of Unicodes 0643, 062A, 0628, which conforms to the way it is typed (and perceived semantically). If you force the presentation form codes, somewhere in the FXXX’s, obviously, most text operations will fail. Fonts will not be compatible with each other because one enforces the encoding and the other does not have it. And the worst is that the underlying text may end up broken because of this: instead of the correct 0643, 062A, 0628, I get a mis-encoded legacy string. To make matters worse, different engines handle presentation forms differently.
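
The round-trip role of the presentation forms can be shown with a short Python sketch (the FXXX values here are the standard Arabic Presentation Forms-B code points for this word; NFKC is Unicode’s compatibility normalization):

```python
import unicodedata

correct = "\u0643\u062A\u0628"   # كتب in logical order: kaf, teh, beh
legacy  = "\uFEDB\uFE98\uFE90"   # the same word hard-coded as presentation
                                 # forms: initial kaf, medial teh, final beh

print("\u0643" in legacy)        # False: searching for plain kaf fails
# NFKC folds the compatibility presentation forms back to the base letters:
print(unicodedata.normalize("NFKC", legacy) == correct)  # True
```

This is exactly the “round trip” the presentation forms exist for: converting old data back to proper Unicode, not representing text going forward.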

I understand you have not noticed the trouble it causes. Do not add them to all your fonts by default. Better to provide a separate ‘legacy’ or ‘Kelk’ version of the fonts for those clients with outdated or OpenType-incompatible software. Otherwise you are breaking things for many just to accommodate a few.

Not sure where the substitution actually happens, but if you take a look at most mobile Greek keyboard layouts, they don’t have separate keys for the two (except as a context menu option, the same way accented vowels etc. are typed on English mobile keyboards); rather, you type the letter by hitting the σ key, the final form ς appears, and this is then substituted automatically for σ if a non-spacing character is typed after it, similar to what happens when typing in Arabic.

What exactly do you mean by enforcing presentation codes? Are you referring to text input, or is merely providing a Unicode value for the glyphs in the font ‘enforcing’?

Obviously, no one is talking about trying to type out running text by inputting the FXXX values. The word should be represented by 0643, 062A, 0628, and the font should be set up so the engine can handle that correctly via substitutions.

I’m only talking about what happens when you actively input the positional form (i.e., the FXXX value). If the positional forms have no Unicode value, the result is a missing glyph (which can be seen by copying the medial form above into an InDesign document set in most Arabic fonts). You’d need to use the zero-width joiner workaround mentioned by Georg to get it to work.

Are you saying that giving the glyphs for the positional forms Unicode values (FXXX) in Glyphs will break the substitutions? Because from my testing using Adobe Arabic (which does just that), that doesn’t seem to be the case: the substitutions work just fine, but direct input of the FXXX characters also works just fine.

Are you sure there is some OT magic involved, not just a feature of the keyboard (or mobile OS)?

No, I’m not – as I said, I don’t know where exactly that substitution is happening; it may well be the OS doing it. In fact, since most Greek-supporting fonts probably don’t have any fina substitutions for sigma, it most likely is. But for those fonts that do, like Minion, setting OpenType positional forms to Automatic in InDesign will do it there as well.

Thanks Rainer. The thing is, Kelk is just an example. As I’m sure you know, not everyone in the world gets to work under ideal conditions, and sometimes accommodations need to be included in order for things to work as expected, particularly for a writing system like Arabic. As to the code points breaking things for people using systems that follow current specs: my understanding is that OpenType substitutions can’t affect the Unicode input stream; even when substituting one Unicode-encoded glyph for another, the input stream is left untouched. The only situation I’ve heard of where the original input stream is changed is when Acrobat Distiller makes a .pdf out of a .ps file (and the issue there has more to do with the software relying on PostScript glyph names rather than the Unicode stream, so that’s definitely a quirk of the software). Have you seen other situations where the user’s input stream is actually changed by a substitution?

In what circumstances are those presentation codes useful (other to show them individually)?
And how would someone input them?

Arabic presentation forms code points are a legacy for pre-OpenType Arabic fonts. They are a legacy in Unicode and are not supposed to be used for anything. The proper way to force certain joining behavior is to use ZWJ (to force joining) or ZWNJ (to force non-joining).
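
The ZWNJ side of this can be sketched in Python too (a hedged illustration of my own, not from the thread: the Persian plural suffix ها is conventionally written unjoined from the stem, which is exactly what ZWNJ is for):

```python
# ZWNJ (U+200C) forces a non-joining boundary between two otherwise
# joining letters; the underlying base characters stay unchanged.
ZWNJ = "\u200C"
stem = "\u06A9\u062A\u0627\u0628"     # کتاب "book"
suffix = "\u0647\u0627"               # ها plural suffix
with_zwnj = stem + ZWNJ + suffix      # renders as کتاب‌ها, suffix unjoined
print(with_zwnj)
```

Stripping the ZWNJ recovers the plain letter sequence, so text operations are unaffected, unlike hard-coding presentation forms.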

They exist in Unicode because IBM Egypt insisted on having them in the early days, because that is how some IBM systems supported Arabic. I’m actually not aware of any widely used pre-Unicode Arabic encoding that had them (they are not in MacArabic nor Windows-1256). Some IBM AIX code pages have a very small subset of them (code pages 1006, 1008, and 1046), but I don’t think they were ever in wide use.

It seems that some software makers, either mistakenly or lazily, avoided adding proper Arabic support via OpenType (or AAT, or even Graphite) and resorted to using the presentation forms. It is basically a hack, and most software makers move to OpenType eventually, as they need it anyway for other scripts (e.g. Devanagari).

Adding presentation-form code points to positional glyphs is mostly harmless in itself, as OpenType engines will ignore them anyway, and the systems that do use them would otherwise render Arabic unreadably.

I personally recommend against encoding Arabic presentation forms, for two reasons:

  1. They mask missing OpenType support. If your font has any substitutions other than the positional forms and lam-alef ligatures, they will not be used, and the users might not notice it. Positioning is completely out, so no mark attachment, cursive attachment, or even kerning. The font ends up being rendered in a degraded way (and depending on the font, it can be a severe degradation). I often feel sad seeing fonts I spent countless hours polishing and adding all sorts of contextual alternates to being used in such a limited way, with users none the wiser. Without the presentation forms, the user will see right away that the font does not work (they will eventually blame me, but I prefer that to the font being used improperly).
  2. Certain PDF creators will prefer the code points assigned to the glyphs over the original input text when adding textual data to a PDF, so users copying the text from the PDF will get presentation forms, and in many readers, searching with regular Arabic characters might not work.

I’m not aware of any other breakage that can result from using the Arabic presentation forms.


Showing them individually would be the main use case (and speaking anecdotally, it’s not rare at all that I have the need to individually type unusual Unicode characters such as positional forms, Mediaeval ligatures, etc., out of context).

If it’s not a character that I can easily type with any of the keyboard layouts I have activated on my machine, I will usually either add them via the symbol picker (I have the 🌐 key set to pop up the character palette for easy searching and picking) or by copy-pasting from somewhere like Wiktionary.

I don’t quite follow you here. Are these examples of missing OpenType support that encoding the positional forms would mask? Or are they examples of included OpenType support that encoding the positional forms would break? And if the latter, how/why does adding Unicode points to the positional forms ‘override’ (?) substitutions and break positioning? Shouldn’t such substitutions always be based on the glyphs, regardless of whether or not those glyphs happen to be associated with a Unicode point?

Arabic presentation forms only cover a subset of Arabic letters in Unicode. It is a fixed subset and no other Arabic presentation forms were added to Unicode for letters added after the first batch.

Adding presentation forms masks missing OpenType support.