[geany/geany] Fix invalid memory access and Unicode support in utils_get_initials() (PR #3846)

List overview All Threads

newer

older

[geany/geany] Fix read past end of...

[geany/geany-themes] Where is the...

Colomban Wendling

19 Apr 2024 19 Apr '24

11:26 p.m.

Fix utils_get_initials() reading past the end of the input string if that string was empty.

Also fix support for non-ASCII initials for which the UTF-8 character representation would have been truncated to the first byte only, leading to an invalid value.

The implementation here also tries to handle combining accents, although that part might not be really solid.

Fixes #3844.

---

This might get a bit over-the-top with the support for non-composable combining marks, but well, now it's there? :) You can view, comment on, or merge this pull request online at:

https://github.com/geany/geany/pull/3846

-- Commit Summary --

* Fix invalid memory access and Unicode support in utils_get_initials()

-- File Changes --

M src/utils.c (42) M tests/test_utils.c (23)

-- Patch Links --

https://github.com/geany/geany/pull/3846.patch https://github.com/geany/geany/pull/3846.diff

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846 You are receiving this because you are subscribed to this thread. Message ID: <geany/geany/pull/3846@github.com>

Attachments:

attachment.html (text/html — 3.6 KB)

Show replies by date

elextr

20 Apr 20 Apr

3:20 a.m.

@elextr requested changes on this pull request.

Neat.

Its not perfect (eg Indic combining chars have class 0), but at least it should not result in invalid strings and the user can always override it with the preference.

...

@@ -766,21 +766,47 @@ gchar *utils_get_date_time(const gchar *format, time_t *time_to_use)

}

+/* Extracts initials from @p name, with some Unicode support */ +GEANY_EXPORT_SYMBOL

If its exported it should be documented ?

...

gchar *utils_get_initials(const gchar *name)

{ - gint i = 1, j = 1; - gchar *initials = g_malloc0(5); + GString *initials; + gchar *composed; + gboolean at_bound = TRUE; + gboolean prev_matched = FALSE; + + g_return_val_if_fail(name != NULL, NULL); + + composed = g_utf8_normalize(name, -1, G_NORMALIZE_DEFAULT_COMPOSE);

I suspect it should be `G_NORMALIZE_ALL_COMPOSE` so composite characters get split and other compatibility transforms. (==NFKC?).

I am unsure if (for example) ligatures can occur as first letters of names, and if they do is the "initial" the ligature or the first letter of it? Currently it selects the ligature as the "initial". Anyway its not critical so long as the user can override it with the "initials" preference.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#pullrequestreview-2012843161 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/review/2012843161@github.com

Colomban Wendling

1:03 p.m.

@b4n commented on this pull request.

...

@@ -766,21 +766,47 @@ gchar *utils_get_date_time(const gchar *format, time_t *time_to_use)

}

+/* Extracts initials from @p name, with some Unicode support */ +GEANY_EXPORT_SYMBOL

No, it's only exported for tests, not `GEANY_API_SYMBOL`.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#discussion_r1573247489 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/review/2013070881@github.com

elextr

1:08 p.m.

@elextr commented on this pull request.

...

@@ -766,21 +766,47 @@ gchar *utils_get_date_time(const gchar *format, time_t *time_to_use)

}

+/* Extracts initials from @p name, with some Unicode support */ +GEANY_EXPORT_SYMBOL

Ok.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#discussion_r1573248774 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/review/2013071467@github.com

Colomban Wendling

1:11 p.m.

@b4n commented on this pull request.

...

gchar *utils_get_initials(const gchar *name)

If you have more knowledge on combining, I'll take it, I only know accents myself :) (and then some ligatures that have assigned characters and that I feel like being variants, like Œ or Æ)

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#discussion_r1573249549 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/review/2013071806@github.com

Colomban Wendling

1:15 p.m.

...

Its not perfect (eg Indic combining chars have class 0)

How should this be identified then? Just by ranges or something?

...

but at least it should not result in invalid strings

If that's all we're after, I can keep the normalization step and remove the manual (incomplete?) combining character support, which should still do the right thing™ in the vast majority of cases -- not counting the fact that it's currently terribly broken yet nobody complained before.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#issuecomment-2067642126 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/c2067642126@github.com

elextr

1:18 p.m.

@elextr commented on this pull request.

...

gchar *utils_get_initials(const gchar *name)

Oooh, Æsop, there is a name with a ligature, so I guess its possible.

My rule is that if its inconvenient and annoying and inconsistent some natural language will do it!!!

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#discussion_r1573251565 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/review/2013072549@github.com

elextr

1:52 p.m.

...

How should this be identified then? Just by ranges or something?

Its a while since I came across this, but IIRC there is a separate indic setting in the unicode standard that says something about how it combines, because the rules of indic languages are complex (see comments above ;-)

...

If that's all we're after, I can keep the normalization step and remove the manual (incomplete?) combining character support, which should still do the right thing™ in the vast majority of cases

Well the NFKC[^1] normalization should handle a lot of cases by itself. The extra combining character support adds the case when there is no pre-combined code point, so it will handle some more cases, but what proportion of additional cases it gets correct I can't say. So better to be simple even if it misses a few cases.

...

not counting the fact that it's currently terribly broken yet nobody complained before.

Yes, its hardly worth the effort to complicate a capability that appears to be little used, just being safe (ie select proper code points) is enough since it can always be manually overridden if the simple answer is wrong.

[^1]: Why did glib use different names? There is a standard, NFC, NFKC etc so why did they invent their own names? Its a guess what standard name the glib names relate to, as usual its not documented!!! [end rant] Thats why my suggestion of `G_NORMALIZE_ALL_COMPOSE` was so tentative.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#issuecomment-2067650306 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/c2067650306@github.com

Colomban Wendling

2:08 p.m.

...

Well the NFKC normalization should handle a lot of cases by itself.

Yeah it should. And so you think I should use NFKC rather than NFC? I'm really not sure what makes more sense here.

...

Yes, its hardly worth the effort to complicate a capability that appears to be little used, just being safe (ie select proper code points) is enough since it can always be manually overridden if the simple answer is wrong.

Yeah, I guess I'll drop the manual combining character support if it's potentially not correct, as it makes things more complex for a very marginal gain.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#issuecomment-2067653720 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/c2067653720@github.com

Colomban Wendling

2:16 p.m.

@b4n pushed 1 commit.

53a42c6de7875b0f112b4ed22eb3f3e1c9cf967c Fix invalid memory access and Unicode support in utils_get_initials()

-- View it on GitHub: https://github.com/geany/geany/pull/3846/files/346ef35a59857f63e249419e95c5c... You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/push/18082793865@github.com

elextr

2:31 p.m.

...

And so you think I should use NFKC rather than NFC?

I had another look at UAX#15 and its not really clear that either is better, both are likely to get different things wrong/right, selecting initials isn't what they are made for, so `(rand() % 100) > 50 ? "yes" : "no"` :-)

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#issuecomment-2067659398 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/c2067659398@github.com

elextr

2:33 p.m.

@elextr approved this pull request.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#pullrequestreview-2013080468 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/review/2013080468@github.com

Colomban Wendling

5:27 p.m.

Merged #3846 into master.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3846#event-12548053851 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3846/issue_event/12548053851@github.com

251

Age (days ago)

252

Last active (days ago)

github-comments@lists.geany.org

12 comments

2 participants

tags (0)

participants (2)

Colomban Wendling
elextr