From my early tests it works fine loading and saving embedded NUL bytes, but I expect many features not to work properly (although e.g. regex search seems happy, yet displaying results not so much). It however should help getting to a point where these could be handled properly, and is handy also for slightly broken files and alike.
Anyway for now such files are loaded read-only and display a warning to the user to avoid most problems.
Related to (and should fix for the most part) #618 and #1708 (and possibly others). You can view, comment on, or merge this pull request online at:
https://github.com/geany/geany/pull/1709
-- Commit Summary --
* Avoid weird length value * Add sci_set_text_with_length() * Don't cut the loaded data at the first NUL when passing it to Scintilla * Don't cut UTF-8 documents at the first NUL byte when saving them * encodings: Accept NULs when validating UTF-8 * Properly load files with embedded NULs * Show an infobar for files with embedded NULs
-- File Changes --
M src/document.c (58) M src/documentprivate.h (1) M src/encodings.c (118) M src/encodingsprivate.h (2) M src/sciwrappers.c (10) M src/sciwrappers.h (2)
-- Patch Links --
https://github.com/geany/geany/pull/1709.patch https://github.com/geany/geany/pull/1709.diff
Havn't reviewed yet, but just on the point of actually doing this at all.
Ok, despite what `g_utf8_validate ()` claims `NUL` is a UTF-8 code point. So I suppose there is an argument that it should be loaded. But due the number of C NTS functions used in Geany its pretty much going to be a lottery if it edits, so loading it read-only is a good idea (if display and parse and search and any other function that will access an readonly file is NUL safe). But thats only loading files as UTF-8, any other encoding is going to depend on the system iconv implementation underlying the `g_convert()`
Showing an infobar identifying that its readonly, and why, allows the user to understand why they can't edit it, and where the problem is an encoding error, to locate where that occurred so they might be able to fix it with a hex editor or similar seems useful.
But why saving it? Its readonly, so it won't change. If the problem is encoding, then LOAD it with the correct encoding, don't save a new encoding of an incorrectly decoded file. If the file has wrong encoding (or mixed encodings) then saving it again won't fix it.
As for editing it, I have yet to see a real use-case for editing such files in a text editor/IDE, and given the amount of changes to Geany it would take to safely allow editing I do not see the point of making any changes to support more than loading and displaying files with NULs until a real use-case is provided.
So unless a compelling use-case is presented I would say load it, and lock it read-only, and tell the user why to help them find the problem, but I do not support starting any changes to allow editing, or saving.
But due the number of C NTS functions used in Geany
Have you actually checked how many functions used in Geany would be affected? I haven't, but I ask because it might be worth actually checking to see, maybe it's not so many more as one might assume.
As for editing it, I have yet to see a real use-case for editing such files in a text editor/IDE
The obvious one that comes to mind is actually fixing the file by removing the NUL, which IIRC Scintilla helpfully displays using a special little icon/glyph.
But thats only loading files as UTF-8, any other encoding is going to depend on the system iconv implementation underlying the `g_convert()`
Indeed, but I feel like those are somewhat likely to support whichever bytes, as it's basically the point in converting encodings -- but indeed I don't know. And on my system it seems like it does work fine for what I tested.
But why saving it? Its readonly, so it won't change.
All I did is fix it so it works, mostly because it was so easy -- and there was a very simple but weird bug in there not only easy to fix but also very confusing when looking at the code. So no, my idea was not to suggest we can work fully with files embedding NUL bytes and that then we need to be able to save them, but that why not fix this very important part for any future possibility in that area if it's so easy anyway. And encoding issue when saving will be the very same with NULs as any other byte: if the converter is happy with them it'll work, otherwise it won't. The problem was that UTF-8 files were artificially truncated at the first NUL, but actually non-UTF-8 were fine embedding NULs – which actually is likely to be important because I would imagine some encodings might generate NUL bytes in their encoded stream anyway (or at least could anyway).
Anyway: what I fixed was simply that now we properly write the whole thing in any case, with no corner case doing otherwise.
The obvious one that comes to mind is actually fixing the file by removing the NUL, which IIRC Scintilla helpfully displays using a special little icon/glyph.
Yes (like #1708), and yes (it displays it like any other control character, in a little box with the character name). And again, regex search actually works pretty OK with NULs, you can actually search for NUL bytes with them. Non-regex search doesn't, however (could be interesting to fix, but for now it's out of the scope of this).
Anyway, despite the "WIP" label, I don't really plan on working on fixing everything so it accepts NULs. I started it to see how hard it would be to at least simply load such files properly so one could look at them, also thinking that if that part is done it's more likely anyone would work on fixing some other code to handle those. So @elextr I think the goal of this PR matches your hopes: loading files, but nothing else -- and warning users it's not properly supported and can cause issues. But if one day we end up with full support for those, I don't think it would be a bad thing.
@codebrainz
There are plenty of tools out there that can remove NULs if that is actually the correct thing to do to the file (which its probably not if its an incorrect encoding). So its not a good trade-off of effort vs return to add the capability to Geany for an occasional faulty file. Just because its possible to do, doesn't mean it should be done.
Have you actually checked how many functions used in Geany would be affected? I haven't, but I ask because it might be worth actually checking to see, maybe it's not so many more as one might assume.
I don't support adding editing so I won't do that. :)
@b4n
some encodings might generate NUL bytes in their encoded stream anyway (or at least could anyway)
Sure, for example an ASCII file saved in UTF-16 will be full of NUL bytes, and it would be a bug if we did not allow for that, but thats different to NUL in the UTF-8. (and saving a file with "foo" as UTF-16 indeed works fine despite every second byte being NUL :)
Anyway: what I fixed was simply that now we properly write the whole thing in any case, with no corner case doing otherwise.
Ok, a silent truncation on saving of a buffer with a NUL is a "bad thing" :tm: , for example if a buggy plugin made a NUL.
Non-regex search doesn't, however (could be interesting to fix, but for now it's out of the scope of this).
So long as there have been comprehensive tests that __nothing__ that accesses the readonly buffer with NULs can (1) crash (2) hang (3) cause problems for other buffers (and that includes testing all plugins).
Note I didn't say anything had to work right, just not cause problems.
So @elextr I think the goal of this PR matches your hopes: loading files, but nothing else -- and warning users it's not properly supported and can cause issues.
And to be clear I support that if there is a proposal for how to do the comprehensive testing I mentioned above, and will review this when I get a chance.
But if one day we end up with full support for those, I don't think it would be a bad thing.
And as I said above, just because we __can__ doesn't mean we __should__ waste time on it when there are many other things to be done. And don't forget plugins.
And as I said above, just because we can doesn't mean we should waste time on it
There are quite a few bugs/issues/questions about this, so it wouldn't really be a waste of time, relative to any other changes/fixes/etc.
I really need this feature and will be happy to help with testing. I have come across multiple occasions when Geany won't load a file because of some NUL byte within. I then had to open it in Notepad++ (Win) or Vim (Linux) and clean the file of these bytes.
github-comments@lists.geany.org