[geany/geany] Various encodings conversion fixes (PR #3716)

List overview All Threads

newer

older

[geany/geany] Fix in-file encoding...

[geany/geany-plugins] BUG during...

Colomban Wendling

8 Dec 2023 8 Dec '23

10:37 p.m.

* Fix detection of some non-UTF-8 files with a encoding stanza in them (e.g. `coding: iso-8859-2`) * Fix out-of-bounds reads loading an empty UTF-16 buffer with BOM (yeah, really) * Fix **silently** truncating at the first NUL in some case (part of #3700) * Fix handling of non-UTF-8 data that start with a UTF-8 BOM

Note: this still doesn't load data after the first NUL byte, but at least now it will always properly warn as we supposedly were already doing. You can view, comment on, or merge this pull request online at:

https://github.com/geany/geany/pull/3716

-- Commit Summary --

* Add some basic tests for encoding detection and conversion * encodings: Fix detection of non-UTF-8 inline encodings * Remove unused code * Add basic tests for UTF-16* and UTF-32* * Add some basic UTF-7 tests * Add some tests for empty data with BOM * Fix silently truncating files with NULs * Refactor to remove duplication, lower assumptions and improve safety

-- File Changes --

M src/encodings.c (151) M src/encodingsprivate.h (3) M src/libmain.c (2) M tests/Makefile.am (3) A tests/test_encodings.c (283)

-- Patch Links --

https://github.com/geany/geany/pull/3716.patch https://github.com/geany/geany/pull/3716.diff

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716 You are receiving this because you are subscribed to this thread. Message ID: <geany/geany/pull/3716@github.com>

Attachments:

attachment.html (text/html — 5.3 KB)

Show replies by date

elextr

9 Dec 9 Dec

2:45 a.m.

LGBI, I didn't inspect the test code more than a (very) quick skim. Compiles ok, but encoding tests not made because not in `tests/meson.build` and being an ASCII locale I have no other encodings, so untested.

The whole encodings thing is still a mess, but its now looks a more correct mess :-)

PS isn't the milestone a bit specific? :wink:

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1848078447 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1848078447@github.com

Colomban Wendling

9:07 p.m.

@b4n pushed 1 commit.

9b054d16b4ed54de2486c6762ea00478f041b08c meson: Add encodings test

-- View it on GitHub: https://github.com/geany/geany/pull/3716/files/cb68cdb2f1b5c544b3235edde3bb1... You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/push/16210614452@github.com

Colomban Wendling

9:10 p.m.

...

Compiles ok, but encoding tests not made because not in `tests/meson.build`

Oopsie, fixed. Aaah, that joy of having multiple build systems to maintain is back, I almost forgot it after Waf bit the dust…

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1848667721 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1848667721@github.com

elextr

10 Dec 10 Dec

10:37 a.m.

...

that joy of having multiple build systems to maintain is back

Feel free to remove autotools :wink:

Anyway, tests seem to pass, although with 10 billion[^1] `Geany-INFO` messages.

Several of these messages appear:

``` # Geany-INFO: Couldn't convert from ISO-8859-8-I to UTF-8 (Conversion from character set “ISO-8859-8-I” to “UTF-8” is not supported). ```

What that points out is that Geany has a fixed list of conversions, which are exposed to user selection, but not all systems support all conversions, as you might guess my `iconv -l` does not contain `ISO-8859-8-I`. If I try to open an ASCII file (which is valid for all 8859s) with `ISO-8859-8-I` I get `The file xxxx is not valid ISO-8859-8-I` which is wrong, does `g_convert()` return a better error we could show (like the one from the test)?

[^1]: slight exaggeration for effect

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1848910107 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1848910107@github.com

Colomban Wendling

1:54 p.m.

...

Several of these messages appear:
# Geany-INFO: Couldn't convert from ISO-8859-8-I to UTF-8 (Conversion from character set “ISO-8859-8-I” to “UTF-8” is not supported).
What that points out is that Geany has a fixed list of conversions, which are exposed to user selection, but not all systems support all conversions, as you might guess my `iconv -l` does not contain `ISO-8859-8-I`. If I try to open an ASCII file (which is valid for all 8859s) with `ISO-8859-8-I` I get `The file xxxx is not valid ISO-8859-8-I` which is wrong[2](#user-content-fn-2-75b2eae86374246fedd5de48e30f3431), does `g_convert()` return a better error we could show (like the one from the test)?

`g_convert()` reports the message you see in the info message inside the parentheses. So yes, we could display it, it's a *Mere Matter of Programming™*, to report that error all the way up to the caller that will show the error.

As of [ISO-8859-8-I](https://en.wikipedia.org/wiki/ISO/IEC_8859-8) it seems understandable not to be supported, as IIUC it's the same thing that should just be *rendered* differently (or not, which makes matter worse).

Anyhow, yes, this could be improved, and ideally we'd probably not display the encoding in the list if it's not supported by the library we use (iconv), as it makes little sense then. But I'd argue this should probably be a separate improvement :)

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1848957533 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1848957533@github.com

elextr

11 Dec 11 Dec

12:04 a.m.

...

Anyhow, yes, this could be improved, and ideally we'd probably not display the encoding in the list if it's not supported by the library we use (iconv), as it makes little sense then.

Would be nice but IIRC last time we discussed this there was no easy way of finding what was supported by `g_convert()`, and does that only use iconv, or does it use something else on Windows and Macos? And even if its iconv all the way, there is no programmatic way of determining what it supports. So probably it would involve shell testing `iconv -l` at install time and adapting the menu. Too hard.

...

So yes, we could display it, it's a Mere Matter of Programming™, to report that error all the way up to the caller that will show the error.

A perfectly good thing to put in a PR called "Various encodings conversion fixes" he says hopefully? Maybe the lowest level function could display the message if it was `force` otherwise don't worry about it because its just Geany trying all encodings it knows again ;-)

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1849117350 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1849117350@github.com

Colomban Wendling

9:26 p.m.

...

...
Anyhow, yes, this could be improved, and ideally we'd probably not display the encoding in the list if it's not supported by the library we use (iconv), as it makes little sense then.

Would be nice but IIRC last time we discussed this there was no easy way of finding what was supported by `g_convert()`, and does that only use iconv, or does it use something else on Windows and Macos?

It's (almost, but is for all practical purposes) documented as using `g_iconv()`, which is documented as using `iconv()` or libiconv as fallback.

...

And even if its iconv all the way, there is no programmatic way of determining what it supports.

Oops, I didn't know it was impossible so I did it in the meantime :smile: (not pushed here, but I have something locally) More seriously though, no there don't seem to be any way of querying the list of encodings it supports, but there is a way to query whether a particular combination is supported or not: just use `g_iconv_open(A, B)` and it'll error out with `EINVAL` if the conversion from B to A is not supported.

...

So probably it would involve shell testing `iconv -l` at install time and adapting the menu. Too hard.

Yeah, no, that's not gonna happen I don't think.

...

...
So yes, we could display it, it's a Mere Matter of Programming™, to report that error all the way up to the caller that will show the error.

A perfectly good thing to put in a PR called "Various encodings conversion fixes" he says hopefully?

Fair enough :smile: I should stop those fairly generic names even when I ended up fixing a various things on the way, I end up getting sidetracked forever :upside_down_face:

...

Maybe the lowest level function could display the message if it was `force` otherwise don't worry about it because its just Geany trying all encodings it knows again ;-)

Meh, that's a bit to ~~Geanyic~~as-hoc, but the error could definitely be propagated.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1850834644 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1850834644@github.com

elextr

11:31 p.m.

...

It's (almost, but is for all practical purposes) documented as using g_iconv(), which is documented as using iconv() or libiconv as fallback.

Ok, the [code-ocumentation](https://github.com/GNOME/glib/blob/main/glib/gconvert.c) shows its iconv all the way, not a #ifdef windows in sight :-)

...

just use g_iconv_open(A, B) and it'll error out with EINVAL if the conversion from B to A is not supported.

So we can use that in `init_encodings()` to remove unsupported ones from the list.

It won't add any not supported by whoever made the list in `encodings.c`, but hopefully in this time of Unicode that won't be many.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1850996725 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1850996725@github.com

Colomban Wendling

12 Dec 12 Dec

11:01 p.m.

...

A perfectly good thing to put in a PR called "Various encodings conversion fixes" he says hopefully? Maybe the lowest level function could display the message if it was `force` otherwise don't worry about it because its just Geany trying all encodings it knows again ;-)

Would you really like me to add this here, in addition to the already non-trivial changes, or should this be a follow-up?

I've got a few more cosmetic changes, including propagating the error better as you wished, as well as not showing knowingly unsupported encodings in the UI and removing a unnecessary control in the preferences regarding file encodings. Should I shove everything here because it's encoding-related, or should we keep the "fix encodings conversion" separate from the more cosmetic stuff?

...

It won't add any not supported by whoever made the list in `encodings.c`, but hopefully in this time of Unicode that won't be many.

:crossed_fingers:

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852882462 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852882462@github.com

elextr

11:39 p.m.

...

including propagating the error better as you wished, as well as not showing knowingly unsupported encodings in the UI

In the end thats all I am suggesting adding now.

...

removing a unnecessary control in the preferences regarding file encodings.

Which one? I only found encoding of new files and use fixed encoding both of which seem useful?

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852925749 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852925749@github.com

elextr

11:49 p.m.

...

or should we keep the "fix encodings conversion" separate from the more cosmetic stuff?

Well, ATM its just tests and fixes AFAICT, not sure what you see as "cosmetic"?

PS sure its several fixes, but they are all just part of making encoding work correctly, or as correctly as it can be made to work without rewriting.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852935282 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852935282@github.com

Colomban Wendling

11:52 p.m.

@b4n pushed 2 commits.

effe633ef710c690eeff4bb37ce5ce1d13d1c8fc Small code cleanup 053667e3a95650691f1a48673c94794d98b1da37 Report actual error to the caller in encodings_convert_to_utf8_auto()

-- View it on GitHub: https://github.com/geany/geany/pull/3716/files/9b054d16b4ed54de2486c6762ea00... You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/push/16254768474@github.com

Colomban Wendling

11:56 p.m.

...

In the end thats all I am suggesting adding now.

Done (plus a couple of weird loop I just couldn't let be). Let me know how I should update all strings :wink:

...

Which one? I only found "encoding of new files" and "use fixed encoding" both of which seem useful?

*Use fixed encoding when opening non-Unicode files*. It's useful, but we have code for adding a *Detect from file* entry directly in the combo box, as used in the file Open dialog. Using this there as well instead of having a separate checkbox strikes me as an improvement: simpler UI with less widgets and less code.

...

Well, ATM its just tests and fixes AFAICT, not sure what you see as "cosmetic"?

Cosmetic is the other this I have for the UI, and I count the more proper error message as cosmetic compared to the other fixes.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852942612 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852942612@github.com

elextr

13 Dec 13 Dec

12:06 a.m.

...

Use fixed encoding when opening non-Unicode files. It's useful, but we have code for adding a Detect from file entry directly in the combo box, as used in the file Open dialog. Using this there as well instead of having a separate checkbox strikes me as an improvement: simpler UI with less widgets and less code.

Well, IIUC the "use fixed encoding" is intended to save the user from having to switch the forced encoding on the open dialog from Unicode to fixed all the time, like on a Windows machine with a non-unicode default code page, but the user is also editing Unicode HTML. So if my understanding is correct its a useful convenience, not essential. But for sure moving it to the open dialog will make it more discoverable.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852951628 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852951628@github.com

Colomban Wendling

12:11 a.m.

@elextr it's already in the open dialog, and I'm not gonna move anything. I'm just gonna merge *Use fixed encoding when opening non-Unicode files* checkbox and *Default encoding (existing non-Unicode files)* combo, only keeping the combo and having an additional *Detect from file* item in it which takes over the current *Use fixed encoding when opening non-Unicode files* checkbox role.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852957174 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852957174@github.com

elextr

12:12 a.m.

...

I count the more proper error message as cosmetic compared to the other fixes.

And as you may have guessed I see them as fixes, so of course I say all in together :wink:

Will check the strings as soon as I find my purple editing pencil.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852959563 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852959563@github.com

elextr

12:21 a.m.

...

@elextr it's already in the open dialog, and I'm not gonna move anything. I'm just gonna merge Use fixed encoding when opening non-Unicode files checkbox and Default encoding (existing non-Unicode files) combo, only keeping the combo and having an additional Detect from file item in it which takes over the current Use fixed encoding when opening non-Unicode files checkbox role.

Ok, that one is cosmetic, removing a checkbox and adding a do nothing combo setting instead.

But I disagree that its the same as the encoding selection in the open dialog, the open dialog sets encoding to use for _all_ files, the one in prefs sets it for _non-unicode_ files IIUC. I didn't check the code, so if they in fact work the same then there is a bug because one of the functions is missing, or it was just described wrong and then the labels need fixing to say the right thing. Don't forget the manual if changing UI.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852982553 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852982553@github.com

Colomban Wendling

12:24 a.m.

...

But I disagree that its the same as the encoding selection in the open dialog

I'm not saying it's the same, just that we have the code for generating the combo box with the additional item as used in that dialog, so I don't even have to add this. All this commenting is for the future PR with it, no point in discussing thin air, it's confusing for no reason ;)

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852989850 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852989850@github.com

elextr

12:27 a.m.

...

All this commenting is for the future PR with it, no point in discussing thin air, it's confusing for no reason ;)

Ok, waiting for separate concrete :smile:

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1852998898 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1852998898@github.com

Colomban Wendling

12:32 a.m.

@elextr I'll pour it as soon as this gets merged ;) (unless I find out there's no dependency between the two, in case I can get those to cure in parallel)

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-1853012073 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c1853012073@github.com

Enrico Tröger

21 Apr 21 Apr

2:37 p.m.

There are actually some encodings which seem not support on Windows: ``` 15:31:32.356445: Geany INFO : Encoding ISO-8859-10 is not supported by the system 15:31:32.356445: Geany INFO : Encoding ISO-8859-14 is not supported by the system 15:31:32.356445: Geany INFO : Encoding ISO-8859-16 is not supported by the system 15:31:32.356445: Geany INFO : Encoding UTF-7 is not supported by the system 15:31:32.356445: Geany INFO : Encoding ARMSCII-8 is not supported by the system 15:31:32.356445: Geany INFO : Encoding EUC-TW is not supported by the system 15:31:32.356445: Geany INFO : Encoding GEORGIAN-ACADEMY is not supported by the system 15:31:32.356445: Geany INFO : Encoding HZ is not supported by the system 15:31:32.356445: Geany INFO : Encoding ISO-IR-111 is not supported by the system 15:31:32.356445: Geany INFO : Encoding TCVN is not supported by the system 15:31:32.356445: Geany INFO : Encoding TIS-620 is not supported by the system 15:31:32.356445: Geany INFO : Encoding VISCII is not supported by the syste ```

For the rest, I did some basic testing on Linux and works as expected and by inspection it looks (almost) fine. Even if there might be subtile bugs on edge cases, I would merge this in master already to get broader testing of the changes.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-2068028879 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c2068028879@github.com

Enrico Tröger

2:42 p.m.

@eht16 commented on this pull request.

...

	geany_debug("Couldn't convert from %s to UTF-8.", charset);

+ g_set_error(error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Data contains NULs"));

Do we always know for sure the error here is exclusively NUL bytes?

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#pullrequestreview-2013391366 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/review/2013391366@github.com

Colomban Wendling

7:33 p.m.

@b4n commented on this pull request.

...

	geany_debug("Couldn't convert from %s to UTF-8.", charset);

+ g_set_error(error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Data contains NULs"));

Yes, if the conversion didn't fail (`conv_error == NULL`), it means the conversion was OK, but `g_utf8_validate()` failed, which can only happen on NULs if the data is otherwise UTF-8 (which it ought to be, unless there's a glaring bug in `g_convert()` yielding invalid data without reporting an error).

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#discussion_r1573864000 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/review/2013486389@github.com

Colomban Wendling

9:33 p.m.

Merged #3716 into master.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#event-12551817780 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/issue_event/12551817780@github.com

Colomban Wendling

9:36 p.m.

@elextr Tell me if there are still strings to be updated or better error to report, I can make another follow-up :wink:

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-2068175821 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c2068175821@github.com

elextr

22 Apr 22 Apr

5:38 a.m.

@b4n the only one that I would question is the one that @eht16 had questioned, so I was waiting until that was decided. :-P

"Data contains NULs" is referring to the result of the conversion, but the user doesn't know that, they only have the input file and it may not contain NULs, eg it might have an overlong encoding of a nul.

So "Result of conversion to UTF-8 buffer contains NULs" might be better.

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/pull/3716#issuecomment-2068425388 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/pull/3716/c2068425388@github.com

312

Age (days ago)

448

Last active (days ago)

github-comments@lists.geany.org

26 comments

3 participants

tags (0)

participants (3)

Colomban Wendling
elextr
Enrico Tröger