[Geany-devel] [CODE REVIEW] Changes to encodings

Matthew Brush mbrush at xxxxx
Mon May 16 14:31:57 UTC 2011

On 05/16/11 00:18, Lex Trotman wrote:
>> I did think of this a little.  I think mostly other code related to
>> Scintilla will work fine since Scintilla doesn't usually assume that a NUL
>> character will terminate a string, it just treats it like any other valid
>> UTF-8 character, except displays a special character for control
>> characters[1].  See below for specifics.
> Yes, I expect Scintilla to work, but I expect problems with Geany code
> as things stop at NULs.  I would hope that nothing will crash, but I
> expect lots of truncation, which means that it doesn't work.  And that
> leads to user complaints no matter how much you document that it
> doesn't work, thats why I suggested that some things might need
> disabling, eg by forcing the file readonly (and preventing enabling
> changes).

IMO, it's far superior to the current way where it opens the file but 
chops off everything after the first NUL character.  At least with the 
changes, most things will work and the whole text file can be loaded.  I 
think this will cause less bug reports than there are currently on this 
NUL truncation bug.

>>> 1. editor.c and others including plugins use sci_add_text a number of
>>> times, will any of that break? If so it needs to be disabled or fixed.
>> Using the old function will just lop off the end of a string with embedded
>> NULs because it uses strlen(), but it will still add the truncated text
>> fine.
> Well truncated isn't right, so its not fine from a user standpoint.

Agreed that truncation isn't right, but it's how Geany is currently 
handling everything with NUL characters, my changes only improve this 
situation to some degree.

>>> 2. do searches work with files containing NULs?
>> Seem to work as expected.
> Hmmm, wonder how, the regex engine takes a NUL terminated string.  I
> would expect the Scintilla searches to work, but not regexes (well not
> past the first NUL anyway), so maybe they need disabling or need to be
> made to continue past the NUL.  Since one of the main reasons put
> forward in the bugs you listed for opening files with NULs is lightly
> corrupted log files, being able to search them is important.

Just tested searching with a regex, it works up until the first NUL 
character.  I'll have to see if it's something in the regex 
implementation or an assumption in Geany that NULs don't exist.  The 
regex search in SciTE works fine in this case.

>>> 3. How do selections work with embedded NULs, AFAICT all the selection
>>> calls use NUL termination?
>> Selecting with the mouse/keyboard works the same.  As far as selecting
>> programatically, assuming the Scintilla wrappers are more or less directly
>> wrapping the Scintilla messages, there shouldn't be a problem.  I'll check
>> these out shortly, but at glance, they look fine.
> But what we select gets truncated when pasted.

Just tested this, and when pasting text selected/copied with embedded 
NUL, it does truncate at the first NUL.  Again, I'll need to see if this 
is something can be fixed from the Geany side.  SciTE behaves the same 
way currently.  In theory, we know where the selection starts and ends, 
so it should be possible copy the text and know the length.

>>> 4. All the places sci_get_string, sci_get_line, sci_get_text,
>>> sci_get_contents and any others that get/set text using NUL
>>> termination need to be disabled or fixed.
>> The sci_get_string() function properly finds out the size of the string
>> using a Scintilla message and then ensuring a final NUL-terminator, so it
>> should work fine.  Any code calling these functions will either truncate the
>> returned string at the first NUL or handle it completely. Same goes for the
>> other sci_get_*() functions.
> Yes, but what does the code that used the call do, it has to stop at
> the first NUL in the file since it doesn't have any other length info.
>   Therefore it likely doesn't do all it should.

Agreed, but still an improvement over the current handling.  Functions 
like sci_get_contents() and sci_get_text() require a length parameter, 
so the calling code does have length info already.

>>> 5. All of the above goes for plugins too.
>> Aside from the issues identified with the API breakage, the worst case
>> scenario should be truncation of the string I think.
> Which again means that it is likely to not work right.

In the same way as Geany, probably yes.

>>> 6. Do lexers work with embedded NULs or does such a file have to be
>>> filetype None, if so it needs to be enforced.
>> At least the CPP lexer works fine, others to be tested, though I don't think
>> they rely too much (or at all) on NUL termination.
> On further thought I guess if they work with Scite they will be ok.

SciTE actually handles text files with NULs quite nicely, it doesn't 
even mark them as read-only or anything and it saves them no problems.

>> Probably a good idea would be to improve the warning dialog that pops up
>> when you open a file with embedded NULs that explains that some functions
>> may not work properly.
> Yes, I think that there would definitely need to be some warning.
> That means that you have to check the file for NUL, can this be coded
> in your encoding coding? :-D  So the file doesn't need scanning again.

It's done and working[1] in a similar fashion as before (sans the 
truncation), where it compares the full data size with strlen() of the 
data and if they're different, passes read-only back as true, which 
causes the warning dialog to show and the doc to be made read-only.

> If we let the user edit the file, the main thing that must work is
> finding and deleting the NULs so the user can get the file back to
> being a text file and the search capabilities.

Already works fine, you just turn read-only off, select the NUL 
character and delete it.  One note about "get the file back to being a 
text file" is that a file with embedded NULs is actually a perfectly 
valid text file.  The problem in Geany, is that it assumes that a text 
file is the same thing as a C string.  Any file that contains all valid 
characters in it's encoding (of which NUL is one of), is a valid text file.

Matthew Brush


More information about the Devel mailing list