Fix utils_get_initials() reading past the end of the input string if that string was empty.
Also fix support for non-ASCII initials for which the UTF-8 character representation would have been truncated to the first byte only, leading to an invalid value.
The implementation here also tries to handle combining accents, although that part might not be really solid.
Fixes #3844.
---
This might get a bit over-the-top with the support for non-composable combining marks, but well, now it's there? :) You can view, comment on, or merge this pull request online at:
https://github.com/geany/geany/pull/3846
-- Commit Summary --
* Fix invalid memory access and Unicode support in utils_get_initials()
-- File Changes --
M src/utils.c (42) M tests/test_utils.c (23)
-- Patch Links --
https://github.com/geany/geany/pull/3846.patch https://github.com/geany/geany/pull/3846.diff
@elextr requested changes on this pull request.
Neat.
Its not perfect (eg Indic combining chars have class 0), but at least it should not result in invalid strings and the user can always override it with the preference.
@@ -766,21 +766,47 @@ gchar *utils_get_date_time(const gchar *format, time_t *time_to_use)
}
+/* Extracts initials from @p name, with some Unicode support */ +GEANY_EXPORT_SYMBOL
If its exported it should be documented ?
gchar *utils_get_initials(const gchar *name)
{ - gint i = 1, j = 1; - gchar *initials = g_malloc0(5); + GString *initials; + gchar *composed; + gboolean at_bound = TRUE; + gboolean prev_matched = FALSE; + + g_return_val_if_fail(name != NULL, NULL); + + composed = g_utf8_normalize(name, -1, G_NORMALIZE_DEFAULT_COMPOSE);
I suspect it should be `G_NORMALIZE_ALL_COMPOSE` so composite characters get split and other compatibility transforms. (==NFKC?).
I am unsure if (for example) ligatures can occur as first letters of names, and if they do is the "initial" the ligature or the first letter of it? Currently it selects the ligature as the "initial". Anyway its not critical so long as the user can override it with the "initials" preference.
@b4n commented on this pull request.
@@ -766,21 +766,47 @@ gchar *utils_get_date_time(const gchar *format, time_t *time_to_use)
}
+/* Extracts initials from @p name, with some Unicode support */ +GEANY_EXPORT_SYMBOL
No, it's only exported for tests, not `GEANY_API_SYMBOL`.
@elextr commented on this pull request.
@@ -766,21 +766,47 @@ gchar *utils_get_date_time(const gchar *format, time_t *time_to_use)
}
+/* Extracts initials from @p name, with some Unicode support */ +GEANY_EXPORT_SYMBOL
Ok.
@b4n commented on this pull request.
gchar *utils_get_initials(const gchar *name)
{ - gint i = 1, j = 1; - gchar *initials = g_malloc0(5); + GString *initials; + gchar *composed; + gboolean at_bound = TRUE; + gboolean prev_matched = FALSE; + + g_return_val_if_fail(name != NULL, NULL); + + composed = g_utf8_normalize(name, -1, G_NORMALIZE_DEFAULT_COMPOSE);
If you have more knowledge on combining, I'll take it, I only know accents myself :) (and then some ligatures that have assigned characters and that I feel like being variants, like Œ or Æ)
Its not perfect (eg Indic combining chars have class 0)
How should this be identified then? Just by ranges or something?
but at least it should not result in invalid strings
If that's all we're after, I can keep the normalization step and remove the manual (incomplete?) combining character support, which should still do the right thing™ in the vast majority of cases -- not counting the fact that it's currently terribly broken yet nobody complained before.
@elextr commented on this pull request.
gchar *utils_get_initials(const gchar *name)
{ - gint i = 1, j = 1; - gchar *initials = g_malloc0(5); + GString *initials; + gchar *composed; + gboolean at_bound = TRUE; + gboolean prev_matched = FALSE; + + g_return_val_if_fail(name != NULL, NULL); + + composed = g_utf8_normalize(name, -1, G_NORMALIZE_DEFAULT_COMPOSE);
Oooh, Æsop, there is a name with a ligature, so I guess its possible.
My rule is that if its inconvenient and annoying and inconsistent some natural language will do it!!!
How should this be identified then? Just by ranges or something?
Its a while since I came across this, but IIRC there is a separate indic setting in the unicode standard that says something about how it combines, because the rules of indic languages are complex (see comments above ;-)
If that's all we're after, I can keep the normalization step and remove the manual (incomplete?) combining character support, which should still do the right thing™ in the vast majority of cases
Well the NFKC[^1] normalization should handle a lot of cases by itself. The extra combining character support adds the case when there is no pre-combined code point, so it will handle some more cases, but what proportion of additional cases it gets correct I can't say. So better to be simple even if it misses a few cases.
not counting the fact that it's currently terribly broken yet nobody complained before.
Yes, its hardly worth the effort to complicate a capability that appears to be little used, just being safe (ie select proper code points) is enough since it can always be manually overridden if the simple answer is wrong.
[^1]: Why did glib use different names? There is a standard, NFC, NFKC etc so why did they invent their own names? Its a guess what standard name the glib names relate to, as usual its not documented!!! [end rant] Thats why my suggestion of `G_NORMALIZE_ALL_COMPOSE` was so tentative.
Well the NFKC normalization should handle a lot of cases by itself.
Yeah it should. And so you think I should use NFKC rather than NFC? I'm really not sure what makes more sense here.
Yes, its hardly worth the effort to complicate a capability that appears to be little used, just being safe (ie select proper code points) is enough since it can always be manually overridden if the simple answer is wrong.
Yeah, I guess I'll drop the manual combining character support if it's potentially not correct, as it makes things more complex for a very marginal gain.
@b4n pushed 1 commit.
53a42c6de7875b0f112b4ed22eb3f3e1c9cf967c Fix invalid memory access and Unicode support in utils_get_initials()
And so you think I should use NFKC rather than NFC?
I had another look at UAX#15 and its not really clear that either is better, both are likely to get different things wrong/right, selecting initials isn't what they are made for, so `(rand() % 100) > 50 ? "yes" : "no"` :-)
@elextr approved this pull request.
Merged #3846 into master.
github-comments@lists.geany.org