* Fix detection of some non-UTF-8 files with a encoding stanza in them (e.g. `coding: iso-8859-2`) * Fix out-of-bounds reads loading an empty UTF-16 buffer with BOM (yeah, really) * Fix **silently** truncating at the first NUL in some case (part of #3700) * Fix handling of non-UTF-8 data that start with a UTF-8 BOM
Note: this still doesn't load data after the first NUL byte, but at least now it will always properly warn as we supposedly were already doing. You can view, comment on, or merge this pull request online at:
https://github.com/geany/geany/pull/3716
-- Commit Summary --
* Add some basic tests for encoding detection and conversion * encodings: Fix detection of non-UTF-8 inline encodings * Remove unused code * Add basic tests for UTF-16* and UTF-32* * Add some basic UTF-7 tests * Add some tests for empty data with BOM * Fix silently truncating files with NULs * Refactor to remove duplication, lower assumptions and improve safety
-- File Changes --
M src/encodings.c (151) M src/encodingsprivate.h (3) M src/libmain.c (2) M tests/Makefile.am (3) A tests/test_encodings.c (283)
-- Patch Links --
https://github.com/geany/geany/pull/3716.patch https://github.com/geany/geany/pull/3716.diff
LGBI, I didn't inspect the test code more than a (very) quick skim. Compiles ok, but encoding tests not made because not in `tests/meson.build` and being an ASCII locale I have no other encodings, so untested.
The whole encodings thing is still a mess, but its now looks a more correct mess :-)
PS isn't the milestone a bit specific? :wink:
@b4n pushed 1 commit.
9b054d16b4ed54de2486c6762ea00478f041b08c meson: Add encodings test
Compiles ok, but encoding tests not made because not in `tests/meson.build`
Oopsie, fixed. Aaah, that joy of having multiple build systems to maintain is back, I almost forgot it after Waf bit the dust…
that joy of having multiple build systems to maintain is back
Feel free to remove autotools :wink:
Anyway, tests seem to pass, although with 10 billion[^1] `Geany-INFO` messages.
Several of these messages appear:
``` # Geany-INFO: Couldn't convert from ISO-8859-8-I to UTF-8 (Conversion from character set “ISO-8859-8-I” to “UTF-8” is not supported). ```
What that points out is that Geany has a fixed list of conversions, which are exposed to user selection, but not all systems support all conversions, as you might guess my `iconv -l` does not contain `ISO-8859-8-I`. If I try to open an ASCII file (which is valid for all 8859s) with `ISO-8859-8-I` I get `The file xxxx is not valid ISO-8859-8-I` which is wrong, does `g_convert()` return a better error we could show (like the one from the test)?
[^1]: slight exaggeration for effect
Several of these messages appear:
# Geany-INFO: Couldn't convert from ISO-8859-8-I to UTF-8 (Conversion from character set “ISO-8859-8-I” to “UTF-8” is not supported).
What that points out is that Geany has a fixed list of conversions, which are exposed to user selection, but not all systems support all conversions, as you might guess my `iconv -l` does not contain `ISO-8859-8-I`. If I try to open an ASCII file (which is valid for all 8859s) with `ISO-8859-8-I` I get `The file xxxx is not valid ISO-8859-8-I` which is wrong[2](#user-content-fn-2-75b2eae86374246fedd5de48e30f3431), does `g_convert()` return a better error we could show (like the one from the test)?
`g_convert()` reports the message you see in the info message inside the parentheses. So yes, we could display it, it's a *Mere Matter of Programming™*, to report that error all the way up to the caller that will show the error.
As of [ISO-8859-8-I](https://en.wikipedia.org/wiki/ISO/IEC_8859-8) it seems understandable not to be supported, as IIUC it's the same thing that should just be *rendered* differently (or not, which makes matter worse).
Anyhow, yes, this could be improved, and ideally we'd probably not display the encoding in the list if it's not supported by the library we use (iconv), as it makes little sense then. But I'd argue this should probably be a separate improvement :)
Anyhow, yes, this could be improved, and ideally we'd probably not display the encoding in the list if it's not supported by the library we use (iconv), as it makes little sense then.
Would be nice but IIRC last time we discussed this there was no easy way of finding what was supported by `g_convert()`, and does that only use iconv, or does it use something else on Windows and Macos? And even if its iconv all the way, there is no programmatic way of determining what it supports. So probably it would involve shell testing `iconv -l` at install time and adapting the menu. Too hard.
So yes, we could display it, it's a Mere Matter of Programming™, to report that error all the way up to the caller that will show the error.
A perfectly good thing to put in a PR called "Various encodings conversion fixes" he says hopefully? Maybe the lowest level function could display the message if it was `force` otherwise don't worry about it because its just Geany trying all encodings it knows again ;-)
Anyhow, yes, this could be improved, and ideally we'd probably not display the encoding in the list if it's not supported by the library we use (iconv), as it makes little sense then.
Would be nice but IIRC last time we discussed this there was no easy way of finding what was supported by `g_convert()`, and does that only use iconv, or does it use something else on Windows and Macos?
It's (almost, but is for all practical purposes) documented as using `g_iconv()`, which is documented as using `iconv()` or libiconv as fallback.
And even if its iconv all the way, there is no programmatic way of determining what it supports.
Oops, I didn't know it was impossible so I did it in the meantime :smile: (not pushed here, but I have something locally) More seriously though, no there don't seem to be any way of querying the list of encodings it supports, but there is a way to query whether a particular combination is supported or not: just use `g_iconv_open(A, B)` and it'll error out with `EINVAL` if the conversion from B to A is not supported.
So probably it would involve shell testing `iconv -l` at install time and adapting the menu. Too hard.
Yeah, no, that's not gonna happen I don't think.
So yes, we could display it, it's a Mere Matter of Programming™, to report that error all the way up to the caller that will show the error.
A perfectly good thing to put in a PR called "Various encodings conversion fixes" he says hopefully?
Fair enough :smile: I should stop those fairly generic names even when I ended up fixing a various things on the way, I end up getting sidetracked forever :upside_down_face:
Maybe the lowest level function could display the message if it was `force` otherwise don't worry about it because its just Geany trying all encodings it knows again ;-)
Meh, that's a bit to ~~Geanyic~~as-hoc, but the error could definitely be propagated.
It's (almost, but is for all practical purposes) documented as using g_iconv(), which is documented as using iconv() or libiconv as fallback.
Ok, the [code-ocumentation](https://github.com/GNOME/glib/blob/main/glib/gconvert.c) shows its iconv all the way, not a #ifdef windows in sight :-)
just use g_iconv_open(A, B) and it'll error out with EINVAL if the conversion from B to A is not supported.
So we can use that in `init_encodings()` to remove unsupported ones from the list.
It won't add any not supported by whoever made the list in `encodings.c`, but hopefully in this time of Unicode that won't be many.
A perfectly good thing to put in a PR called "Various encodings conversion fixes" he says hopefully? Maybe the lowest level function could display the message if it was `force` otherwise don't worry about it because its just Geany trying all encodings it knows again ;-)
Would you really like me to add this here, in addition to the already non-trivial changes, or should this be a follow-up?
I've got a few more cosmetic changes, including propagating the error better as you wished, as well as not showing knowingly unsupported encodings in the UI and removing a unnecessary control in the preferences regarding file encodings. Should I shove everything here because it's encoding-related, or should we keep the "fix encodings conversion" separate from the more cosmetic stuff?
It won't add any not supported by whoever made the list in `encodings.c`, but hopefully in this time of Unicode that won't be many.
:crossed_fingers:
including propagating the error better as you wished, as well as not showing knowingly unsupported encodings in the UI
In the end thats all I am suggesting adding now.
removing a unnecessary control in the preferences regarding file encodings.
Which one? I only found encoding of new files and use fixed encoding both of which seem useful?
or should we keep the "fix encodings conversion" separate from the more cosmetic stuff?
Well, ATM its just tests and fixes AFAICT, not sure what you see as "cosmetic"?
PS sure its several fixes, but they are all just part of making encoding work correctly, or as correctly as it can be made to work without rewriting.
@b4n pushed 2 commits.
effe633ef710c690eeff4bb37ce5ce1d13d1c8fc Small code cleanup 053667e3a95650691f1a48673c94794d98b1da37 Report actual error to the caller in encodings_convert_to_utf8_auto()
In the end thats all I am suggesting adding now.
Done (plus a couple of weird loop I just couldn't let be). Let me know how I should update all strings :wink:
Which one? I only found "encoding of new files" and "use fixed encoding" both of which seem useful?
*Use fixed encoding when opening non-Unicode files*. It's useful, but we have code for adding a *Detect from file* entry directly in the combo box, as used in the file Open dialog. Using this there as well instead of having a separate checkbox strikes me as an improvement: simpler UI with less widgets and less code.
Well, ATM its just tests and fixes AFAICT, not sure what you see as "cosmetic"?
Cosmetic is the other this I have for the UI, and I count the more proper error message as cosmetic compared to the other fixes.
Use fixed encoding when opening non-Unicode files. It's useful, but we have code for adding a Detect from file entry directly in the combo box, as used in the file Open dialog. Using this there as well instead of having a separate checkbox strikes me as an improvement: simpler UI with less widgets and less code.
Well, IIUC the "use fixed encoding" is intended to save the user from having to switch the forced encoding on the open dialog from Unicode to fixed all the time, like on a Windows machine with a non-unicode default code page, but the user is also editing Unicode HTML. So if my understanding is correct its a useful convenience, not essential. But for sure moving it to the open dialog will make it more discoverable.
@elextr it's already in the open dialog, and I'm not gonna move anything. I'm just gonna merge *Use fixed encoding when opening non-Unicode files* checkbox and *Default encoding (existing non-Unicode files)* combo, only keeping the combo and having an additional *Detect from file* item in it which takes over the current *Use fixed encoding when opening non-Unicode files* checkbox role.
I count the more proper error message as cosmetic compared to the other fixes.
And as you may have guessed I see them as fixes, so of course I say all in together :wink:
Will check the strings as soon as I find my purple editing pencil.
@elextr it's already in the open dialog, and I'm not gonna move anything. I'm just gonna merge Use fixed encoding when opening non-Unicode files checkbox and Default encoding (existing non-Unicode files) combo, only keeping the combo and having an additional Detect from file item in it which takes over the current Use fixed encoding when opening non-Unicode files checkbox role.
Ok, that one is cosmetic, removing a checkbox and adding a do nothing combo setting instead.
But I disagree that its the same as the encoding selection in the open dialog, the open dialog sets encoding to use for _all_ files, the one in prefs sets it for _non-unicode_ files IIUC. I didn't check the code, so if they in fact work the same then there is a bug because one of the functions is missing, or it was just described wrong and then the labels need fixing to say the right thing. Don't forget the manual if changing UI.
But I disagree that its the same as the encoding selection in the open dialog
I'm not saying it's the same, just that we have the code for generating the combo box with the additional item as used in that dialog, so I don't even have to add this. All this commenting is for the future PR with it, no point in discussing thin air, it's confusing for no reason ;)
All this commenting is for the future PR with it, no point in discussing thin air, it's confusing for no reason ;)
Ok, waiting for separate concrete :smile:
@elextr I'll pour it as soon as this gets merged ;) (unless I find out there's no dependency between the two, in case I can get those to cure in parallel)
There are actually some encodings which seem not support on Windows: ``` 15:31:32.356445: Geany INFO : Encoding ISO-8859-10 is not supported by the system 15:31:32.356445: Geany INFO : Encoding ISO-8859-14 is not supported by the system 15:31:32.356445: Geany INFO : Encoding ISO-8859-16 is not supported by the system 15:31:32.356445: Geany INFO : Encoding UTF-7 is not supported by the system 15:31:32.356445: Geany INFO : Encoding ARMSCII-8 is not supported by the system 15:31:32.356445: Geany INFO : Encoding EUC-TW is not supported by the system 15:31:32.356445: Geany INFO : Encoding GEORGIAN-ACADEMY is not supported by the system 15:31:32.356445: Geany INFO : Encoding HZ is not supported by the system 15:31:32.356445: Geany INFO : Encoding ISO-IR-111 is not supported by the system 15:31:32.356445: Geany INFO : Encoding TCVN is not supported by the system 15:31:32.356445: Geany INFO : Encoding TIS-620 is not supported by the system 15:31:32.356445: Geany INFO : Encoding VISCII is not supported by the syste ```
For the rest, I did some basic testing on Linux and works as expected and by inspection it looks (almost) fine. Even if there might be subtile bugs on edge cases, I would merge this in master already to get broader testing of the changes.
@eht16 commented on this pull request.
geany_debug("Couldn't convert from %s to UTF-8.", charset);
+ g_set_error(error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Data contains NULs"));
Do we always know for sure the error here is exclusively NUL bytes?
@b4n commented on this pull request.
geany_debug("Couldn't convert from %s to UTF-8.", charset);
+ g_set_error(error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE, + _("Data contains NULs"));
Yes, if the conversion didn't fail (`conv_error == NULL`), it means the conversion was OK, but `g_utf8_validate()` failed, which can only happen on NULs if the data is otherwise UTF-8 (which it ought to be, unless there's a glaring bug in `g_convert()` yielding invalid data without reporting an error).
Merged #3716 into master.
@elextr Tell me if there are still strings to be updated or better error to report, I can make another follow-up :wink:
@b4n the only one that I would question is the one that @eht16 had questioned, so I was waiting until that was decided. :-P
"Data contains NULs" is referring to the result of the conversion, but the user doesn't know that, they only have the input file and it may not contain NULs, eg it might have an overlong encoding of a nul.
So "Result of conversion to UTF-8 buffer contains NULs" might be better.
github-comments@lists.geany.org