I realized today that there may be an issue in Geany core: text files that inadvertently contain a '\0' (NUL) character somewhere in the middle (for instance, a log file) are only read (or displayed) up to that point. I have tested both Geany 1.38.0 and Geany 2.0, and both behave in exactly the same way.
For comparison, I have tried other editors, such as vim, nano, Kate and the ancient NEdit, and none of them suffers from this bug.
Could you fix it, please?
Closed #3700 as completed.
Closing as duplicate of #3686 and many more right back to #618
Thanks for your time evaluating and rejecting the bug report. From my point of view, it is a pity Geany diverges from the behavior of many other well established editors.
At least, Geany should show some warning telling it found a NUL character and it stopped reading.
Reopened #3700.
> At least, Geany should show some warning telling it found a NUL character and it stopped reading.
It definitely should. Actually, it's supposed to *not* open the file, and show an error in the *message window*, but not load the file truncated… Can you reproduce with a clean configuration (and no plugins)? If so, could you elaborate on the file(s) having issues (type, example, file system, etc.) please?
@elextr not supporting NULs is one thing, but we should not truncate silently (and we think we don't do that, so if we do there's a problem somewhere).
I have stumbled on similar behavior just this week. I don't have the original file for which it happened, but it turns out it is really simple to generate a file that reproduces this:

```bash
echo "here comes the null:_more text here" | sed 's/_/\x00/' > /tmp/test.txt
```

Just to check that it is really written as intended:

```
$ hexdump -C /tmp/test.txt
00000000  68 65 72 65 20 63 6f 6d  65 73 20 74 68 65 20 6e  |here comes the n|
00000010  75 6c 6c 3a 00 6d 6f 72  65 20 74 65 78 74 20 68  |ull:.more text h|
00000020  65 72 65 0a                                       |ere.|
00000024
```

Opening it with `geany -c /tmp/empty_dir -v /tmp/test.txt` shows only the text up to "here comes the null:" and the rest is truncated. The debug messages on stdout look as if the file passed as valid UTF-8.
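The two observations above (the file passes as valid UTF-8, yet only the text before the NUL is shown) are consistent with each other, as the following minimal Python sketch illustrates. It is not Geany code: `c_string_view` is a hypothetical helper that only emulates how NUL-terminated C string handling sees a buffer.

```python
# Same bytes as the reproduction file generated above.
data = b"here comes the null:\x00more text here\n"


def c_string_view(buf: bytes) -> bytes:
    """Emulate NUL-terminated string handling: stop at the first NUL byte.

    Hypothetical helper for illustration only, not a Geany function.
    """
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


# U+0000 is a valid code point, so a strict UTF-8 decode does not fail.
text = data.decode("utf-8")

# Anything that treats the buffer as a C string only sees the first part.
visible = c_string_view(data).decode("utf-8")

print(len(data), len(c_string_view(data)))
print(repr(visible))
```

So a UTF-8 validity check alone cannot flag such a file; the NUL has to be looked for explicitly.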
Also: while I was testing this, some pretty similar files sometimes opened as UTF-16LE, which totally scrambled the contents. Unfortunately, I can't reproduce that now, even if I try to generate the file exactly the same way as before. It almost feels like it depends on how things happen to be laid out in memory or something else undefined-behaviorish.
> @elextr not supporting NULs is one thing, but we should not truncate silently (and we think we don't do that, so if we do there's a problem somewhere).

Agreed, but the OP doesn't mention that. Like I said elsewhere, the code has evolved to be a bit of a mess and I wouldn't be surprised if it has "undocumented features".
Like this one [here](https://github.com/geany/geany/blob/e5680fe85de536fc61ff0f2d4eadc54171d6c982...) talks about `temp_enc_idx` being various unicode values, but in fact `encodings_scan_unicode_bom()` returns [`GEANY_ENCODING_NONE`](https://github.com/geany/geany/blob/e5680fe85de536fc61ff0f2d4eadc54171d6c982...) if there is no BOM, so `partial` is not set. So when it is returned to [here](https://github.com/geany/geany/blob/e5680fe85de536fc61ff0f2d4eadc54171d6c982...) as `readonly` (see I told you it was a mess) the message is skipped [here](https://github.com/geany/geany/blob/e5680fe85de536fc61ff0f2d4eadc54171d6c982...).
And now I am going to have a lie down with a wet cloth on my forehead :frowning_face:
@dolik-rce
> It almost feels like it depends on how things happen to be laid out in memory or something else undefined-behaviorish.

Laid out in the file, more specifically. Geany tries to be "helpful": unless the user specifies an encoding, Geany will try all the ones it knows about to see if they convert successfully. The first one that works wins, so a random file could be decoded as something weird if the bytes in it just happen to be a legal combination for that encoding.
One good thing it does NOT do (AFAICT) is use locale, that could make the same file open differently between different members of this project, let alone the rest of the interwebs, yuk.
Unfortunately determining encoding is impossible, the same sequence of bytes may be interpreted as several different encodings successfully. It becomes UB when different applications try different encodings in different orders, or use different heuristics to guess the encoding.
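The "first one that works wins" behavior and its dependence on ordering can be sketched in a few lines of Python. This is only an illustration: `guess_encoding` and the candidate lists are hypothetical and do not reflect the encodings Geany actually tries or the order it tries them in.

```python
from typing import Optional, Sequence


def guess_encoding(data: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate that decodes without error, else None.

    Hypothetical helper: the candidate list and its order are assumptions.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Same bytes as the reproduction file earlier in the thread (36 bytes).
sample = b"here comes the null:\x00more text here\n"

# Both encodings accept these bytes, so the order decides the result.
print(guess_encoding(sample, ["utf-8", "utf-16-le"]))
print(guess_encoding(sample, ["utf-16-le", "utf-8"]))

# With an odd number of bytes the UTF-16LE decode fails, so UTF-8 wins
# even when it is tried second.
print(guess_encoding(sample[:-1], ["utf-16-le", "utf-8"]))
```

In this sketch the same ASCII-plus-NUL bytes decode cleanly as UTF-16LE because every byte pair forms a legal code unit, which yields mostly CJK characters.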
It's all a HUUUUGE mess, @b4n @eht16 can we remove all encodings except Unicode, puhleeeeese!!!!
I performed some tests yesterday, and I found that when you allow Geany to guess the encoding, it switches to UTF-16 (or something like that), showing lots of Chinese-like characters. When UTF-8 or no encoding is forced on load, its behavior is the one I reported.
I don't know whether it is helpful, but over the years I have found some real scenarios where a NUL character can appear in a text file:
* A case I have found in the past was protein sequences in FASTA format from a non-redundant database or program. Some of these used (or allowed) the NUL character as a separator of concatenated descriptions in the header of a FASTA entry representing a cluster of protein sequences.
* Another case where the NUL character appears unintentionally is a Linux system hang, in the logs of a service. The service sends its logs through the system logging facility, and the system has a hard hang for whatever reason. The last block assigned to those files just before the hang can be "half baked", because the block was assigned but not written. Then, when the system is rebooted and the service restarted, the log is appended to, and the NUL characters are there.
@jmfernandez (and all the others asking the same thing) it doesn't matter how many use-cases are presented: "somebody" has to NUL-safe all of the Geany code, and all plugins, and only then will it be a responsible thing to load files containing NULs. Simply loading files is easy; checking and correcting everything else so that it is not going to misbehave or crash is potentially slow and tedious, and "somebody" has to do it.
I had a quick look at what we have for loading, and indeed it's a mess at least in that area. I'll try and have a stab at cleaning it up a tiny bit and avoid the loading of partial files without notifying. And I think we should probably *not* load truncated files, even if we'd warn better (and for that I think an infobar would indeed be better than a dialog that is skipped on startup), because it's just bound to lose data.
For the "file with NULs" case, I think there are only 3 options:

1) NOT open them at all and notify why that didn't open
2) open them read-only *including the NUL bytes* and *warn a **lot*** (possibly even preventing passing them non-read-only) -- here @elextr is worried it'd break a lot more things, I don't know if it's true if we don't allow editing, although possibly some features would be partly non-working indeed (but I don't expect crashes though)
3) as mentioned, fix all of Geany and plugins to properly handle NULs -- which is definitely doable, although a lot of work, and tricky to make it a bit foolproof not to risk introduce issues in the future too easily (which will anyway be a bit tricky, the C library not being very helpful there)
But again, I don't think having (partly working) code for "load truncated but read-only the user can choose to ignore" is a very good idea, especially as the warning can sometimes (even without a bug) be hard to notice.
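For illustration, option 1 (do not open the file and say why) amounts to something like the Python sketch below. The names `first_nul_offset` and `load_or_refuse`, the chunk size, and the error text are all hypothetical; none of this is Geany's actual loading code.

```python
import os
import tempfile
from typing import Optional


def first_nul_offset(path: str, chunk_size: int = 65536) -> Optional[int]:
    """Return the byte offset of the first NUL in the file, or None."""
    offset = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return None
            pos = chunk.find(b"\x00")
            if pos != -1:
                return offset + pos
            offset += len(chunk)


def load_or_refuse(path: str) -> bytes:
    """Load the whole file, or refuse with an explicit reason.

    The file is never returned truncated: it is all or nothing.
    """
    pos = first_nul_offset(path)
    if pos is not None:
        raise ValueError(f"{path}: NUL byte at offset {pos}, file not opened")
    with open(path, "rb") as fh:
        return fh.read()


# Demo with the same bytes as the reproduction file earlier in the thread.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"here comes the null:\x00more text here\n")
    demo_path = tmp.name

try:
    load_or_refuse(demo_path)
except ValueError as err:
    print("refused:", err)
finally:
    os.remove(demo_path)
```

Reporting the offset of the first NUL gives the user something concrete to act on, for example when cleaning the file with another tool.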
> I'll try and have a stab at cleaning it up a tiny bit and avoid the loading of partial files without notifying.
Good luck, have a supply of wet forehead cloths ready.
Totally agree the warning should be more visible, a nice fat infobar would be good, they didn't exist when Geany was written. Maybe flashing purple and orange stripes ;-P But it's not much use if there is no tab to put it on, so we have to use dialogs when the file is not opened.
1. do this at least until 2. is done
2. your "partly non-working" is my "broken", and the first one is regex search where Geany tells g_regex to use null termination. And search is one of the first things a user is going to want to do with a logfile ... Hence my do 1. until there is some confidence in 2. As for crashes, well some call me pessimistic, but I know I'm realistic and you are hopelessly optimistic ;-P
3. std::string FTW!!!! :-) we could do a gcc, compile Geany with C++ and migrate over time??
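The regex concern in point 2 can be illustrated with a small Python sketch. It does not call GLib; `search_nul_terminated` is a hypothetical helper that only emulates what happens when a search treats its subject as NUL-terminated, and the log contents are made up.

```python
import re
from typing import Optional


def search_nul_terminated(pattern: bytes, buf: bytes) -> Optional[re.Match]:
    """Emulate a search whose subject ends at the first NUL byte."""
    end = buf.find(b"\x00")
    subject = buf if end == -1 else buf[:end]
    return re.search(pattern, subject)


def search_with_length(pattern: bytes, buf: bytes) -> Optional[re.Match]:
    """Search the whole buffer, using its real length."""
    return re.search(pattern, buf)


# Made-up log with a block of NULs in the middle, as after a hard hang.
log = b"service started\n\x00\x00\x00\x00service restarted after hang\n"

# The NUL-terminated view never reaches the text after the NUL block.
print(search_nul_terminated(rb"restarted", log))

# A length-aware search finds it.
match = search_with_length(rb"restarted", log)
print(match.start() if match else None)
```

Even with the whole file loaded, a search built on NUL-terminated subjects would silently miss everything after the first NUL, which is the "broken" case described above.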
I know I have no right to give an opinion on the code, as I'm not involved in the development. But if the loaded file was flagged as "textually impure" on load, features which are not or cannot be fixed (like it seems to happen to regex library) should be disabled based on that flag ... which would introduce its own batch of bugs and corner cases.