How should this be identified then? Just by ranges or something?

It's been a while since I came across this, but IIRC there is a separate Indic property in the Unicode standard that says something about how characters combine, because the rules of Indic scripts are complex (see comments above ;-)
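(A rough illustration of why Indic marks are special — this uses Python's `unicodedata`, not the glib API, but it exposes the same Unicode character properties. Devanagari vowel signs are *spacing* combining marks with canonical combining class 0, unlike European accents, which is why normalization alone doesn't tell the whole story for Indic text:)

```python
import unicodedata

# COMBINING ACUTE ACCENT: a non-spacing mark (category Mn) with a
# non-zero canonical combining class, so normalization may reorder
# and compose it
acute = "\u0301"
print(unicodedata.category(acute))   # Mn
print(unicodedata.combining(acute))  # 230

# DEVANAGARI VOWEL SIGN AA: a spacing combining mark (category Mc)
# with combining class 0 -- it combines visually with the preceding
# consonant but never reorders or composes under normalization
aa = "\u093e"
print(unicodedata.category(aa))   # Mc
print(unicodedata.combining(aa))  # 0
```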

If that's all we're after, I can keep the normalization step and remove the manual (incomplete?) combining-character support, which should still do the right thing™ in the vast majority of cases.

Well, the NFKC¹ normalization should handle a lot of cases by itself. The extra combining-character support adds the case where there is no precomposed code point, so it will handle some more cases, but what proportion of those additional cases it gets correct I can't say. So better to keep it simple even if that misses a few cases.
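A quick sketch of the distinction being discussed, using Python's `unicodedata` purely for illustration (glib's `g_utf8_normalize()` would be the equivalent on the C side):

```python
import unicodedata

# "e" + COMBINING ACUTE ACCENT has a precomposed code point (U+00E9),
# so composing normalization collapses the pair into one code point
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

# NFKC additionally folds compatibility characters,
# e.g. the "fi" ligature U+FB01 becomes plain "f" + "i"
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"

# "q" + COMBINING ACUTE ACCENT has no precomposed form, so normalization
# leaves it as two code points -- this is the case the manual
# combining-character support was meant to cover
assert unicodedata.normalize("NFC", "q\u0301") == "q\u0301"
```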

Not counting the fact that it's currently terribly broken, yet nobody complained before.

Yes, it's hardly worth the effort to complicate a capability that appears to be little used; just being safe (i.e. selecting proper code points) is enough, since it can always be manually overridden if the simple answer is wrong.

Footnotes

  1. Why did glib use different names? There is a standard (NFC, NFKC, etc.), so why did they invent their own? It's a guess which standard name each glib name corresponds to, and as usual it's not documented!!! [end rant] That's why my suggestion of G_NORMALIZE_ALL_COMPOSE was so tentative.
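     For reference, my best guess at the correspondence, going from glib's `gunicode.h` (where `G_NORMALIZE_NFC` etc. are defined as aliases of the older names) rather than any documentation, so treat the mapping as an assumption:

     ```python
     import unicodedata

     # Assumed glib GNormalizeMode -> Unicode standard form mapping
     GLIB_TO_UNICODE = {
         "G_NORMALIZE_DEFAULT": "NFD",
         "G_NORMALIZE_DEFAULT_COMPOSE": "NFC",
         "G_NORMALIZE_ALL": "NFKD",
         "G_NORMALIZE_ALL_COMPOSE": "NFKC",
     }

     # Sanity check that the standard name my suggestion maps to
     # really does compose: e + combining acute -> U+00E9
     form = GLIB_TO_UNICODE["G_NORMALIZE_ALL_COMPOSE"]
     assert unicodedata.normalize(form, "e\u0301") == "\u00e9"
     ```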

