[Github-comments] [geany/geany] Geany encoding determination broken? (#2910)

Sun Oct 3 01:44:25 UTC 2021

All the discussion above has been about the Preferences->Files setting which sets a default.  

That made me forget (well, it's normally hidden, and I don't edit anything but UTF-8, so I don't use it ... ever :-) about the setting on the open dialog, which _is_ enforced, even `None`. But it applies only to the selected file(s), not to files opened from the command line, goto symbol, or any other method; those continue to use the preference.

If it's set there, all your example files open, although `breaks.txt` and `utf16le.txt` appear to be truncated, as expected (no endline).
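As a rough illustration of why a UTF-16LE file would look truncated, here is a minimal Python sketch. It assumes the truncation comes from the NUL-byte limitation discussed in this thread; `truncate_at_first_nul` is a hypothetical helper for illustration, not Geany code.

```python
def truncate_at_first_nul(data: bytes) -> bytes:
    """Return the bytes before the first NUL, or all of data if there is none."""
    head, _sep, _tail = data.partition(b"\x00")
    return head


# In UTF-16LE every ASCII character is followed by a zero byte.
utf16le_sample = "hi\n".encode("utf-16-le")  # b'h\x00i\x00\n\x00'

# Anything that stops at the first NUL keeps only the first byte,
# so the trailing newline is lost as well.
visible = truncate_at_first_nul(utf16le_sample)
```

Here `visible` ends up as a single byte, which would match the "no endline" observation if the assumption holds.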

But as I said above, beware that it's unknown what functionality will work with invalid UTF-8 in the buffer.  So if Geany or a plugin doesn't work or crashes, you have been warned (but do report Geany crashes, we do try to fix those).  The Scintilla editing widget claims to handle illegal bytes and show them as lozenge shapes with the hex value in them, so actions executed directly by Scintilla will likely work fine, but you have no real way of telling which actions those are; mostly simple editing.

> Oh, come on. As if Geany was a sticky notes application, not a versatile developer's tool.

As I said, Geany is a volunteer project; people do what they need or want to do, and nobody has needed or wanted this enough to do it.  It's not a corporate-supported, designed tool, and it's not in competition with other IDEs or editors for feature completeness.  But the bigger it becomes, and the more rarely used features it gathers, the more work there is to support it.  So it's better that rare use-cases be handled by more appropriate tools.

> I understand the NULL value limitation and even then I think the user should be actively notified and also given a chance to load file up to the first NULL occurrence, with a red data-loss warning (and possibly a Save As... shortcut) included of course.

That's what the setting on the open dialog does, and save as will work.

> Setting the NULL limitation aside, we are talking about values 1-255, which are valid UTF-8. So theoretically any nonzero uint8_t data can be represented in such a buffer or am I mistaken somewhere?

Indeed, I think you misunderstand UTF-8. Yes, it uses byte values between 0 and 255, but not arbitrarily.  All ASCII code points are encoded as their own value, but any code point 128 or greater is encoded as a sequence of _more than one_ byte, each with a value >= 128.  The number of bytes increases as the code point value increases, up to four bytes, all with values >= 128. See [here](https://en.wikipedia.org/wiki/UTF-8#Encoding).

So your files with single bytes >= 128 are not UTF-8.
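A small Python sketch of the point, using only the standard library's UTF-8 codec. `is_valid_utf8` is a hypothetical helper for illustration; it says nothing about how Geany itself detects encodings.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Encoded length grows with the code point value.
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji
]
encoded_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Every byte of a multi-byte sequence is >= 128.
multibyte_all_high = all(
    b >= 0x80 for ch in samples[1:] for b in ch.encode("utf-8")
)

# A lone byte >= 128 (here 0xE9, "e acute" in ISO-8859-1) is not UTF-8.
lone_high_byte_ok = is_valid_utf8(b"caf\xe9")
proper_sequence_ok = is_valid_utf8(b"caf\xc3\xa9")
```

The lone `0xE9` byte fails to decode, while the two-byte sequence `0xC3 0xA9` for the same character succeeds.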

-- 
Reply to this email directly or view it on GitHub:
https://github.com/geany/geany/issues/2910#issuecomment-932846762