> By 8-bit ASCII I meant ASCII-compatible 8-bit extension the first time around.

As I explained, there is no such thing. ASCII only defines what values less than 128 mean; it does not define what values with the most significant bit set mean. So no, it's also not a suitable name for "no specified encoding", since the top 128 values are undefined. The ISO-8859 series encodings are examples of what the top 128 values can mean, and there are 15 of them (numbered 1 to 16, with part 12 never published). Actually "no specified encoding" might be a good label for the setting, since it alludes to falling back on searching for one.
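To make that concrete, here is a small Python sketch (just an illustration, nothing Geany runs) that decodes one arbitrarily chosen high byte, 0xA4, under three ISO-8859 parts and under plain ASCII:

```python
raw = bytes([0xA4])  # one byte with the most significant bit set

# The same byte maps to a different character in each ISO-8859 part.
meanings = {codec: raw.decode(codec)
            for codec in ("iso-8859-1", "iso-8859-5", "iso-8859-15")}
for codec, char in meanings.items():
    print(f"{codec}: U+{ord(char):04X}")

# Plain ASCII assigns no meaning to it at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("valid ASCII:", ascii_ok)
```

The byte comes out as the currency sign, a Cyrillic letter, or the euro sign depending on the part chosen, and ASCII rejects it outright.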

> So the encoding detection logic works sporadically even with longer files.

What you are actually seeing is "sporadically a longer file with random errors happens to be a valid file in some encoding". So that encoding is found. This is also why "Geany is perfectly able to handle some hybrid ASCII-binary files": the file happened to be valid in some encoding. The encoding detection is not buggy, the files are 😁
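A rough Python sketch of why this looks sporadic. The helper `valid_encodings`, the candidate list, and the sample bytes are all made up for illustration and are not Geany's actual detection code or search order:

```python
def valid_encodings(data: bytes, candidates):
    """Return the candidate codecs under which `data` decodes without error."""
    ok = []
    for codec in candidates:
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        ok.append(codec)
    return ok

candidates = ["utf-8", "utf-16-le", "iso-8859-1"]  # hypothetical candidates

# Mostly-ASCII data with two stray high bytes, 23 bytes long (odd length).
damaged_odd = b"plain text \xe9\xff more text"
# The same data plus one more byte, so the length is even.
damaged_even = damaged_odd + b"!"

fits_odd = valid_encodings(damaged_odd, candidates)
fits_even = valid_encodings(damaged_even, candidates)
print(fits_odd)
print(fits_even)
```

Neither sample is valid UTF-8, but the even-length one also happens to be valid UTF-16, so a single extra byte changes which encodings a search can land on.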

So the solution would be to fix the file, possibly with a script that replaces non-ASCII values with a selected ASCII value, or masks off the MSB, or replaces each non-ASCII value with a valid and very visible UTF-8 character, possibly an emoji like 👿. That's something only you as the programmer can decide and do.
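As a starting point, the three options could be sketched in Python like this. The function names and the sample bytes are hypothetical, and which option is right depends entirely on what the stray bytes mean in your files:

```python
def replace_with_ascii(data: bytes, substitute: bytes = b"?") -> bytes:
    """Replace every byte >= 0x80 with one chosen ASCII byte."""
    return bytes(b if b < 0x80 else substitute[0] for b in data)

def mask_msb(data: bytes) -> bytes:
    """Clear the most significant bit of every byte."""
    return bytes(b & 0x7F for b in data)

def replace_with_marker(data: bytes, marker: str = "\U0001F47F") -> bytes:
    """Replace every byte >= 0x80 with a visible character, encoded as UTF-8."""
    encoded_marker = marker.encode("utf-8")
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)
        else:
            out += encoded_marker
    return bytes(out)

damaged = b"ab\xe9c"  # made-up sample with one stray high byte
print(replace_with_ascii(damaged))
print(mask_msb(damaged))
print(replace_with_marker(damaged))
```

One caveat with masking: a 0x80 byte becomes 0x00, and an embedded NUL is something the Geany buffer cannot hold either, so the other two options are probably safer.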

> the file is sometimes opened properly, meaning 7-bit ASCII characters are displayed and other values are displayed with hex symbols. Other times it's an UTF-16 jungle because of a single 8-bit value occurrence or similar.

That's again a manifestation of finding an encoding where the file is valid and converting that to UTF-8 in the buffer.
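A tiny Python illustration of how that jungle comes about (the sample bytes are made up; this is not Geany code). Six bytes that a human reads as ASCII are also a perfectly valid UTF-16 string, and converting that string to UTF-8 gives something unrelated to the original bytes:

```python
data = b"hello!"  # six bytes a human would read as ASCII

# Interpreted as little-endian UTF-16, each byte pair is one code unit.
as_utf16 = data.decode("utf-16-le")
print(ascii(as_utf16))

# This is what would be stored after conversion to UTF-8.
buffer_bytes = as_utf16.encode("utf-8")
print(buffer_bytes)
```

The six bytes turn into three characters, and the converted bytes no longer resemble the input.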

The display as hex values is something the font management does, not Geany: either it generates a synthetic glyph when no font has one for that value, or a font in your stack has that glyph. Sometimes missing glyphs are shown as squares, not hex. (To be precise, it's done by HarfBuzz, used by Pango, which is part of the GTK stack, which is used by Scintilla, which is used by Geany, so it's well-buried behaviour and controlled by "many" things.)

> Why wouldn't user be able to manually set encoding

The current behaviour of searching for a valid encoding has evolved to handle a common use-case where files are in mixed encodings (is/was common on Windows in non-English speaking locales IIUC, and many Geany contributors are in such locales). It would be possible to add a "use this encoding only" option (somebody has to do it, though). But if the file was not valid in that encoding, the result would have to be a refusal to load, since Geany would have no idea what UTF-8 the file contents were meant to be converted to.

> Also, I tried to set something other than Without encoding in Preferences > Files > Encodings > Default encoding (existing non-Unicode files) and iso88591.txt still opens as ISO-8859-1. Does this also not add up to you or am I missing something?

Again, if you selected an encoding in which that file is not valid, the behaviour is to fall back to searching for a working encoding. All the encoding settings are "try this first, then search" rather than "just try this or fail". That's probably where the "None" for the default encoding comes from, meaning "Don't try anything first, just search".
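The difference between the two policies can be sketched in Python as below. This is only a model of the behaviour described here, not Geany's implementation: `load_as_utf8`, its `strict` flag, and `SEARCH_ORDER` are hypothetical names, and the real search order is different.

```python
SEARCH_ORDER = ["utf-8", "utf-16-le", "iso-8859-1"]  # hypothetical order

def load_as_utf8(data: bytes, preferred=None, strict=False):
    """Return (utf8_bytes, codec_used), or raise ValueError if nothing fits.

    strict=False models "try this first, then search".
    strict=True models a "use this encoding only" option.
    """
    order = [preferred] if preferred else []
    if not strict:
        order += [codec for codec in SEARCH_ORDER if codec != preferred]
    for codec in order:
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return text.encode("utf-8"), codec
    raise ValueError("file is not valid in any permitted encoding")

latin1_file = b"caf\xe9s"  # made-up ISO-8859-1 content, 5 bytes

# The preferred encoding fails, so the search still ends at ISO-8859-1.
converted, used = load_as_utf8(latin1_file, preferred="ascii")
print(used)

# With the hypothetical strict option, the only honest outcome is a refusal.
try:
    load_as_utf8(latin1_file, preferred="ascii", strict=True)
    refused = False
except ValueError:
    refused = True
print("refused:", refused)
```

In this model, a preferred encoding that does not fit changes nothing about the final result, which matches what you saw with iso88591.txt.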

Just to finally re-emphasise: the Geany buffer has to be valid UTF-8 with no embedded NULs. All the editing and other functions assume it and depend on it, so loading invalid UTF-8 sequences "without encoding" is not possible. The input must be valid in UTF-8 or in an encoding that can be converted to UTF-8.
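If you want to check a file against that rule before opening it, a minimal Python sketch would be something like this (`acceptable_for_buffer` is a hypothetical helper, not a Geany function):

```python
def acceptable_for_buffer(data: bytes) -> bool:
    """True if `data` is valid UTF-8 and contains no NUL bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(acceptable_for_buffer(b"hello"))      # plain ASCII
print(acceptable_for_buffer(b"caf\xe9"))    # stray ISO-8859-1 byte
print(acceptable_for_buffer(b"a\x00b"))     # embedded NUL
```

Anything that fails this check has to be converted from some other encoding first, or repaired.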

