[CODE REVIEW] Changes to encodings

List overview All Threads

newer

older

Can't compile Geany with -ansi...

Collecting patches inside wiki

Matthew Brush

16 May 2011 16 May '11

5:45 a.m.

Hi all,

I was wondering if anyone had time to review some changes I've been working on?

There's two commits involved (attached patches and linked below):

1) Allow embedded NUL chars in files (without truncating). 2) Add more regex patterns for guessing encoding.

The first one is kind of large and affects several files/functions. One thing I did which should change is in a few places I added a 'gsize tmp_size' var to pass to the encodings_convert_to_utf8* functions, where a simple NULL size parameter would've sufficed, in these cases, the string is known to be NUL-terminated so this parameter isn't required. Unfortunately I made a commit after this so fixing this will come afterwards (though it's not such a big deal). Also, I'm not completely satisfied with the name of the sci_add_text2() function added to the sciwrappers files, but there was already a function named sci_add_text() which truncates the text at the first NUL, and the new function is a more direct wrapper of the SCI_ADDTEXT message. These changes break the plugin API, and would require two small changes to the geany-plugins GeanyVC: geanyvc.c:480 and geanyvc.c:498 as well as GeanyGenDoc: ggd-plugin.c:343. It should be obvious why this breakage was required I hope.

The following bug reports are related to the first set of changes: [1][2][3][4]. A few of the comments seem to suggest that it's not desired to support text files containing NUL control bytes, but IMO, it's nice to be able to show these anyway, especially since Scintilla has a nice way of dealing with these. IMO, as long as the file gets marked read-only, displaying as much as possible is better than truncating at the first NUL byte.

The second one is a simple refactoring of the code around the regex-based encoding detection and adding a few more patterns to detect a few more types of files. Especially needing review are the actual pattern strings I used since, quite frankly, I suck at regexps. One thing I noticed looking at the bug report here[5], after creating these patches/commits, is that it seems the <?xml ... encoding="" ?> is already supported by the CODING regex, so you can ignore the new XML one I added if that truly is the case (I'll remove it from the proper patch I send in the future). Like I said, my regex skills aren't great :)

If you want to use the GitHub UI to review the commits [6][7] otherwise, patches are attached. If you have a GitHub account, you can leave comments on the commits and/or specific lines of the commits.

Any feedback would be much appreciated as I plan to submit these changes as proper patches once I've ensured they are good enough.

In the meanwhile, I will continue testing...

Cheers, Matthew Brush

[1] https://sourceforge.net/tracker/index.php?func=detail&aid=3232135&gr... [2] https://sourceforge.net/tracker/index.php?func=detail&aid=2893180&gr... [3] https://sourceforge.net/tracker/index.php?func=detail&aid=3232135&gr... [4] https://sourceforge.net/tracker/index.php?func=detail&aid=2695290&gr... [5] https://sourceforge.net/tracker/index.php?func=detail&aid=3183506&gr... [6] https://github.com/codebrainz/geany/commit/81c89cce736c88f235761ce5765f2361f... [7] https://github.com/codebrainz/geany/commit/5dec6db43b498378be3496d4e89496835...

Attachments:

0001-Make-encoding-not-care-about-embedded-NUL-control-ch.patch (text/x-patch — 19.1 KB)
0002-Add-regex-patterns-for-XML-CSS-and-HTTP-Mail-headers.patch (text/x-patch — 2.8 KB)

Show replies by date

Lex Trotman

16 May 16 May

7:09 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

Hi Matthew,

Comments from inspection only as I am still development machine challenged. Also only on the NUL handling, encodings are not my specialty (as in I avoid them like the plague).

...

not such a big deal). Also, I'm not completely satisfied with the name of the sci_add_text2() function added to the sciwrappers files, but there was

Seems ok to me, it adds text too :-)

[...]

...

The following bug reports are related to the first set of changes: [1][2][3][4]. A few of the comments seem to suggest that it's not desired to support text files containing NUL control bytes, but IMO, it's nice to be able to show these anyway, especially since Scintilla has a nice way of dealing with these. IMO, as long as the file gets marked read-only, displaying as much as possible is better than truncating at the first NUL byte.

I don't think anyone is actively *against* handling NULs, just against the work involved for little gain. But of course if you think its more important then ... oh you have :-)

...

If you want to use the GitHub UI to review the commits [6][7] otherwise, patches are attached. If you have a GitHub account, you can leave comments on the commits and/or specific lines of the commits.

Any feedback would be much appreciated as I plan to submit these changes as proper patches once I've ensured they are good enough.

The problem I see with opening files with embedded NULs is then ensuring that it doesn't break when code that expects NUL termination accesses the buffer.

1. editor.c and others including plugins use sci_add_text a number of times, will any of that break? If so it needs to be disabled or fixed. 2. do searches work with files containing NULs? 3. How do selections work with embedded NULs, AFAICT all the selection calls use NUL termination? 4. All the places sci_get_string, sci_get_line, sci_get_text, sci_get_contents and any others that get/set text using NUL termination need to be disabled or fixed. 5. All of the above goes for plugins too. 6. Do lexers work with embedded NULs or does such a file have to be filetype None, if so it needs to be enforced.

Cheers Lex

Matthew Brush

8:20 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

On 05/15/11 22:09, Lex Trotman wrote:

...

Hi Matthew,

Comments from inspection only as I am still development machine challenged. Also only on the NUL handling, encodings are not my specialty (as in I avoid them like the plague).

Thanks for taking a look!

...

The problem I see with opening files with embedded NULs is then ensuring that it doesn't break when code that expects NUL termination accesses the buffer.

I did think of this a little. I think mostly other code related to Scintilla will work fine since Scintilla doesn't usually assume that a NUL character will terminate a string, it just treats it like any other valid UTF-8 character, except displays a special character for control characters[1]. See below for specifics.

...

editor.c and others including plugins use sci_add_text a number of

times, will any of that break? If so it needs to be disabled or fixed.

Using the old function will just lop off the end of a string with embedded NULs because it uses strlen(), but it will still add the truncated text fine.

...

do searches work with files containing NULs?

Seem to work as expected.

...

How do selections work with embedded NULs, AFAICT all the selection

calls use NUL termination?

Selecting with the mouse/keyboard works the same. As far as selecting programatically, assuming the Scintilla wrappers are more or less directly wrapping the Scintilla messages, there shouldn't be a problem. I'll check these out shortly, but at glance, they look fine.

...

All the places sci_get_string, sci_get_line, sci_get_text,

sci_get_contents and any others that get/set text using NUL termination need to be disabled or fixed.

The sci_get_string() function properly finds out the size of the string using a Scintilla message and then ensuring a final NUL-terminator, so it should work fine. Any code calling these functions will either truncate the returned string at the first NUL or handle it completely. Same goes for the other sci_get_*() functions.

...

All of the above goes for plugins too.

Aside from the issues identified with the API breakage, the worst case scenario should be truncation of the string I think.

...

Do lexers work with embedded NULs or does such a file have to be

filetype None, if so it needs to be enforced.

At least the CPP lexer works fine, others to be tested, though I don't think they rely too much (or at all) on NUL termination.

Probably a good idea would be to improve the warning dialog that pops up when you open a file with embedded NULs that explains that some functions may not work properly.

Cheers, Matthew Brush

[1] http://www.scintilla.org/ScintillaDoc.html#SCI_SETCONTROLCHARSYMBOL

Lex Trotman

9:18 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

...

I did think of this a little. I think mostly other code related to Scintilla will work fine since Scintilla doesn't usually assume that a NUL character will terminate a string, it just treats it like any other valid UTF-8 character, except displays a special character for control characters[1]. See below for specifics.

Yes, I expect Scintilla to work, but I expect problems with Geany code as things stop at NULs. I would hope that nothing will crash, but I expect lots of truncation, which means that it doesn't work. And that leads to user complaints no matter how much you document that it doesn't work, thats why I suggested that some things might need disabling, eg by forcing the file readonly (and preventing enabling changes).

...

...

editor.c and others including plugins use sci_add_text a number of

times, will any of that break? If so it needs to be disabled or fixed.

Using the old function will just lop off the end of a string with embedded NULs because it uses strlen(), but it will still add the truncated text fine.

Well truncated isn't right, so its not fine from a user standpoint.

...

...

do searches work with files containing NULs?

Seem to work as expected.

Hmmm, wonder how, the regex engine takes a NUL terminated string. I would expect the Scintilla searches to work, but not regexes (well not past the first NUL anyway), so maybe they need disabling or need to be made to continue past the NUL. Since one of the main reasons put forward in the bugs you listed for opening files with NULs is lightly corrupted log files, being able to search them is important.

...

...

How do selections work with embedded NULs, AFAICT all the selection

calls use NUL termination?

Selecting with the mouse/keyboard works the same. As far as selecting programatically, assuming the Scintilla wrappers are more or less directly wrapping the Scintilla messages, there shouldn't be a problem. I'll check these out shortly, but at glance, they look fine.

But what we select gets truncated when pasted.

...

...

All the places sci_get_string, sci_get_line, sci_get_text,

sci_get_contents and any others that get/set text using NUL termination need to be disabled or fixed.

The sci_get_string() function properly finds out the size of the string using a Scintilla message and then ensuring a final NUL-terminator, so it should work fine. Any code calling these functions will either truncate the returned string at the first NUL or handle it completely. Same goes for the other sci_get_*() functions.

Yes, but what does the code that used the call do, it has to stop at the first NUL in the file since it doesn't have any other length info. Therefore it likely doesn't do all it should.

...

...

All of the above goes for plugins too.

Aside from the issues identified with the API breakage, the worst case scenario should be truncation of the string I think.

Which again means that it is likely to not work right.

...

...

Do lexers work with embedded NULs or does such a file have to be

filetype None, if so it needs to be enforced.

At least the CPP lexer works fine, others to be tested, though I don't think they rely too much (or at all) on NUL termination.

On further thought I guess if they work with Scite they will be ok.

...

Probably a good idea would be to improve the warning dialog that pops up when you open a file with embedded NULs that explains that some functions may not work properly.

Yes, I think that there would definitely need to be some warning. That means that you have to check the file for NUL, can this be coded in your encoding coding? :-D So the file doesn't need scanning again.

If we let the user edit the file, the main thing that must work is finding and deleting the NULs so the user can get the file back to being a text file and the search capabilities.

Cheers Lex

Matthew Brush

4:31 p.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

On 05/16/11 00:18, Lex Trotman wrote:

...

...
I did think of this a little. I think mostly other code related to Scintilla will work fine since Scintilla doesn't usually assume that a NUL character will terminate a string, it just treats it like any other valid UTF-8 character, except displays a special character for control characters[1]. See below for specifics.

Yes, I expect Scintilla to work, but I expect problems with Geany code as things stop at NULs. I would hope that nothing will crash, but I expect lots of truncation, which means that it doesn't work. And that leads to user complaints no matter how much you document that it doesn't work, thats why I suggested that some things might need disabling, eg by forcing the file readonly (and preventing enabling changes).

IMO, it's far superior to the current way where it opens the file but chops off everything after the first NUL character. At least with the changes, most things will work and the whole text file can be loaded. I think this will cause less bug reports than there are currently on this NUL truncation bug.

...

...
...

editor.c and others including plugins use sci_add_text a number of

times, will any of that break? If so it needs to be disabled or fixed.

Using the old function will just lop off the end of a string with embedded NULs because it uses strlen(), but it will still add the truncated text fine.

Well truncated isn't right, so its not fine from a user standpoint.

Agreed that truncation isn't right, but it's how Geany is currently handling everything with NUL characters, my changes only improve this situation to some degree.

...

...
...

do searches work with files containing NULs?

Seem to work as expected.

Hmmm, wonder how, the regex engine takes a NUL terminated string. I would expect the Scintilla searches to work, but not regexes (well not past the first NUL anyway), so maybe they need disabling or need to be made to continue past the NUL. Since one of the main reasons put forward in the bugs you listed for opening files with NULs is lightly corrupted log files, being able to search them is important.

Just tested searching with a regex, it works up until the first NUL character. I'll have to see if it's something in the regex implementation or an assumption in Geany that NULs don't exist. The regex search in SciTE works fine in this case.

...

...
...

How do selections work with embedded NULs, AFAICT all the selection

calls use NUL termination?

Selecting with the mouse/keyboard works the same. As far as selecting programatically, assuming the Scintilla wrappers are more or less directly wrapping the Scintilla messages, there shouldn't be a problem. I'll check these out shortly, but at glance, they look fine.

But what we select gets truncated when pasted.

Just tested this, and when pasting text selected/copied with embedded NUL, it does truncate at the first NUL. Again, I'll need to see if this is something can be fixed from the Geany side. SciTE behaves the same way currently. In theory, we know where the selection starts and ends, so it should be possible copy the text and know the length.

...

...
...

All the places sci_get_string, sci_get_line, sci_get_text,

sci_get_contents and any others that get/set text using NUL termination need to be disabled or fixed.

The sci_get_string() function properly finds out the size of the string using a Scintilla message and then ensuring a final NUL-terminator, so it should work fine. Any code calling these functions will either truncate the returned string at the first NUL or handle it completely. Same goes for the other sci_get_*() functions.

Yes, but what does the code that used the call do, it has to stop at the first NUL in the file since it doesn't have any other length info. Therefore it likely doesn't do all it should.

Agreed, but still an improvement over the current handling. Functions like sci_get_contents() and sci_get_text() require a length parameter, so the calling code does have length info already.

...

...
...

All of the above goes for plugins too.

Aside from the issues identified with the API breakage, the worst case scenario should be truncation of the string I think.

Which again means that it is likely to not work right.

In the same way as Geany, probably yes.

...

...
...

Do lexers work with embedded NULs or does such a file have to be

filetype None, if so it needs to be enforced.

At least the CPP lexer works fine, others to be tested, though I don't think they rely too much (or at all) on NUL termination.

On further thought I guess if they work with Scite they will be ok.

SciTE actually handles text files with NULs quite nicely, it doesn't even mark them as read-only or anything and it saves them no problems.

...

...
Probably a good idea would be to improve the warning dialog that pops up when you open a file with embedded NULs that explains that some functions may not work properly.

Yes, I think that there would definitely need to be some warning. That means that you have to check the file for NUL, can this be coded in your encoding coding? :-D So the file doesn't need scanning again.

It's done and working[1] in a similar fashion as before (sans the truncation), where it compares the full data size with strlen() of the data and if they're different, passes read-only back as true, which causes the warning dialog to show and the doc to be made read-only.

...

If we let the user edit the file, the main thing that must work is finding and deleting the NULs so the user can get the file back to being a text file and the search capabilities.

Already works fine, you just turn read-only off, select the NUL character and delete it. One note about "get the file back to being a text file" is that a file with embedded NULs is actually a perfectly valid text file. The problem in Geany, is that it assumes that a text file is the same thing as a C string. Any file that contains all valid characters in it's encoding (of which NUL is one of), is a valid text file.

Cheers, Matthew Brush

[1] https://github.com/codebrainz/geany/blob/improve_encodings/src/encodings.c#L...

Lex Trotman

17 May 17 May

3:05 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

...

IMO, it's far superior to the current way where it opens the file but chops off everything after the first NUL character. At least with the changes, most things will work and the whole text file can be loaded. I think this will cause less bug reports than there are currently on this NUL truncation bug.

I agree its better to load the file, but I don't know that it will get less bug reports if I know users (being one) :-)

My main concern with all these comments is to get a situation where there are as few surprises as possible, surprises are what confuses and upsets users.

...

...
...
...

editor.c and others including plugins use sci_add_text a number of

times, will any of that break? If so it needs to be disabled or fixed.

Using the old function will just lop off the end of a string with embedded NULs because it uses strlen(), but it will still add the truncated text fine.

Well truncated isn't right, so its not fine from a user standpoint.

Agreed that truncation isn't right, but it's how Geany is currently handling everything with NUL characters, my changes only improve this situation to some degree.

And I might add that even though we see it as an improvement and are thankful, users just take it as a given and go on to the next issue. Sigh...

...

...
...
...

do searches work with files containing NULs?

Seem to work as expected.

Hmmm, wonder how, the regex engine takes a NUL terminated string. I would expect the Scintilla searches to work, but not regexes (well not past the first NUL anyway), so maybe they need disabling or need to be made to continue past the NUL. Since one of the main reasons put forward in the bugs you listed for opening files with NULs is lightly corrupted log files, being able to search them is important.

Just tested searching with a regex, it works up until the first NUL character. I'll have to see if it's something in the regex implementation or an assumption in Geany that NULs don't exist. The regex search in SciTE works fine in this case.

Scite works because it uses Scintilla's built in but limited regex engine, we use the superior system or GNU regex engine, but it only takes a NUL terminated string to search. I don't think this one is fixable, lets put it on the list of things that don't work to be documented. Search is limited to text only.

...

...
...
...

How do selections work with embedded NULs, AFAICT all the selection

calls use NUL termination?

Selecting with the mouse/keyboard works the same. As far as selecting programatically, assuming the Scintilla wrappers are more or less directly wrapping the Scintilla messages, there shouldn't be a problem. I'll check these out shortly, but at glance, they look fine.

But what we select gets truncated when pasted.

Just tested this, and when pasting text selected/copied with embedded NUL, it does truncate at the first NUL. Again, I'll need to see if this is something can be fixed from the Geany side. SciTE behaves the same way currently. In theory, we know where the selection starts and ends, so it should be possible copy the text and know the length.

No, cut/paste happens in Scintilla we can't do anything about it, but my guess would be that it uses the GTK clipboard, and that only takes NUL terminated strings. Again I think, document and leave.

...

...
...
...

All the places sci_get_string, sci_get_line, sci_get_text,

sci_get_contents and any others that get/set text using NUL termination need to be disabled or fixed.

The sci_get_string() function properly finds out the size of the string using a Scintilla message and then ensuring a final NUL-terminator, so it should work fine. Any code calling these functions will either truncate the returned string at the first NUL or handle it completely. Same goes for the other sci_get_*() functions.

Yes, but what does the code that used the call do, it has to stop at the first NUL in the file since it doesn't have any other length info. Therefore it likely doesn't do all it should.

Agreed, but still an improvement over the current handling. Functions like sci_get_contents() and sci_get_text() require a length parameter, so the calling code does have length info already.

So long as truncation doesn't do something destructive.

...

...
...
...

All of the above goes for plugins too.

Aside from the issues identified with the API breakage, the worst case scenario should be truncation of the string I think.

Which again means that it is likely to not work right.

In the same way as Geany, probably yes.

So long as they don't do something unexpectedly destructive or crash. You've identified the VCS plugin already.

...

...
...
...

Do lexers work with embedded NULs or does such a file have to be

filetype None, if so it needs to be enforced.

At least the CPP lexer works fine, others to be tested, though I don't think they rely too much (or at all) on NUL termination.

On further thought I guess if they work with Scite they will be ok.

SciTE actually handles text files with NULs quite nicely, it doesn't even mark them as read-only or anything and it saves them no problems.

Well the main reason for the read-only is the truncation, so you don't write a short file over the long one. Now you've fixed that, but I still think its worth leaving it read-only.

...

...
...
Probably a good idea would be to improve the warning dialog that pops up when you open a file with embedded NULs that explains that some functions may not work properly.

Yes, I think that there would definitely need to be some warning. That means that you have to check the file for NUL, can this be coded in your encoding coding? :-D So the file doesn't need scanning again.

It's done and working[1] in a similar fashion as before (sans the truncation), where it compares the full data size with strlen() of the data and if they're different, passes read-only back as true, which causes the warning dialog to show and the doc to be made read-only.

Ok.

...

...
If we let the user edit the file, the main thing that must work is finding and deleting the NULs so the user can get the file back to being a text file and the search capabilities.

Already works fine, you just turn read-only off, select the NUL character and delete it. One note about "get the file back to being a text file" is that a file with embedded NULs is actually a perfectly valid text file.

Agree, I meant C's definition of "text file".

The

...

problem in Geany, is that it assumes that a text file is the same thing as a C string. Any file that contains all valid characters in it's encoding (of which NUL is one of), is a valid text file.

Agree.

Cheers Lex

Matthew Brush

4:17 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

On 05/16/11 18:05, Lex Trotman wrote:

...

I agree its better to load the file, but I don't know that it will get less bug reports if I know users (being one) :-)

My main concern with all these comments is to get a situation where there are as few surprises as possible, surprises are what confuses and upsets users.

Maybe the dialog warning that pops up on non-standard text files should say "Warning: Unlike most text editors, Geany is going to open this file despite it containing questionable text. Please do not report any bugs on issues occurring while this document is open." ... hahaha.

...

And I might add that even though we see it as an improvement and are thankful, users just take it as a given and go on to the next issue. Sigh...

Heh. See warning above :)

...

Scite works because it uses Scintilla's built in but limited regex engine, we use the superior system or GNU regex engine, but it only takes a NUL terminated string to search. I don't think this one is fixable, lets put it on the list of things that don't work to be documented. Search is limited to text only.

Yeah, I can't find any info on the GNU regex stuff that indicates it can handle embedded NULs. I'm pretty sure other, arguably superior, regex engines support it (according to BSD man pages and PCRE docs). At this point though, documenting it would be best, at least until we see if there's a workaround. A quick glance at Geany's source and it seems like we're using a mix of the built-in regex engine and the GNU/stdlib stuff, but not by replacing Scintilla regex backend as it suggests to do in the docs. Perhaps this could be something to look at.

...

No, cut/paste happens in Scintilla we can't do anything about it, but my guess would be that it uses the GTK clipboard, and that only takes NUL terminated strings. Again I think, document and leave.

Haven't checked the source much, but I would assume SCI_COPYRANGE and SCI_COPYTEXT would work since they accept positions/length. According to the GtkClipboard docs, the gtk_clipboard_set_text() take a length argument as well. Further study is required.

...

So long as truncation doesn't do something destructive.

I think by definition truncation is destructive :)

...

So long as they don't do something unexpectedly destructive or crash. You've identified the VCS plugin already.

I haven't looked at much of the plugins' source, so I'm not too sure what they are doing. The issues with GeanyVC and GeanyGenDoc aren't related to NUL chars or anything, just that they use a function for which I changed the signature in these changes, so they would need to be updated accordingly.

...

Well the main reason for the read-only is the truncation, so you don't write a short file over the long one. Now you've fixed that, but I still think its worth leaving it read-only.

Agreed. The user can always choose to ignore the warning message, set it read/write and modify it anyway.

Cheers, Matthew Brush

Lex Trotman

7:32 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

On 17 May 2011 12:17, Matthew Brush mbrush@codebrainz.ca wrote:

...

On 05/16/11 18:05, Lex Trotman wrote:

...
I agree its better to load the file, but I don't know that it will get less bug reports if I know users (being one) :-)

My main concern with all these comments is to get a situation where there are as few surprises as possible, surprises are what confuses and upsets users.

Maybe the dialog warning that pops up on non-standard text files should say "Warning: Unlike most text editors, Geany is going to open this file despite it containing questionable text. Please do not report any bugs on issues occurring while this document is open." ... hahaha.

Well since 90% of the stuff edited is questionable text (a definition that is independent of containing NULs) maybe we just try applying it to all files :-)

...

...
And I might add that even though we see it as an improvement and are thankful, users just take it as a given and go on to the next issue. Sigh...

Heh. See warning above :)

...
Scite works because it uses Scintilla's built in but limited regex engine, we use the superior system or GNU regex engine, but it only takes a NUL terminated string to search. I don't think this one is fixable, lets put it on the list of things that don't work to be documented. Search is limited to text only.

Yeah, I can't find any info on the GNU regex stuff that indicates it can handle embedded NULs. I'm pretty sure other, arguably superior, regex engines support it (according to BSD man pages and PCRE docs). At this point though, documenting it would be best, at least until we see if there's a workaround. A quick glance at Geany's source and it seems like we're using a mix of the built-in regex engine and the GNU/stdlib stuff, but not by replacing Scintilla regex backend as it suggests to do in the docs. Perhaps this could be something to look at.

Later.

...

...
No, cut/paste happens in Scintilla we can't do anything about it, but my guess would be that it uses the GTK clipboard, and that only takes NUL terminated strings. Again I think, document and leave.

Haven't checked the source much, but I would assume SCI_COPYRANGE and SCI_COPYTEXT would work since they accept positions/length. According to the GtkClipboard docs, the gtk_clipboard_set_text() take a length argument as well. Further study is required.

Yes but ScintillaGTK.cxx gets the length to pass to the gtk_selection_set_text using strlen, oops...only part is copied to the selection.

And there are some worrying comments about use of \0 in rectangular selections.

...

...
So long as truncation doesn't do something destructive.

I think by definition truncation is destructive :)

Heh heh, I meant silently do something destructive that the user might not notice.

...

...
So long as they don't do something unexpectedly destructive or crash. You've identified the VCS plugin already.

I haven't looked at much of the plugins' source, so I'm not too sure what they are doing. The issues with GeanyVC and GeanyGenDoc aren't related to NUL chars or anything, just that they use a function for which I changed the signature in these changes, so they would need to be updated accordingly.

Ok, I havn't either.

The things that I am still worried about is code that might get a buffer using strlen, and then it is too short for whats put in it and that causes a crash. Losing all your editor tabs because you opened a file with NULs in it would be bad.

Cheers Lex

Matthew Brush

16 May 16 May

8:37 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

On 05/15/11 20:45, Matthew Brush wrote:

...

The second one is a simple refactoring of the code around the regex-based encoding detection and adding a few more patterns to detect a few more types of files. Especially needing review are the actual pattern strings I used since, quite frankly, I suck at regexps. One thing I noticed looking at the bug report here[5], after creating these patches/commits, is that it seems the <?xml ... encoding="" ?> is already supported by the CODING regex, so you can ignore the new XML one I added if that truly is the case (I'll remove it from the proper patch I send in the future). Like I said, my regex skills aren't great :)

I've made a commit[1] to fix the issue mentioned here, which removes extra regex patterns and uses two very general ones (the coding one is almost the same as before). I need to test these better still but a few quick tests seem to suggest they work ok.

Cheers, Matthew Brush

[1] https://github.com/codebrainz/geany/commit/53c4761b38b9ac454b2ae5bf855e1c9ac...

Colomban Wendling

17 May 17 May

5:40 p.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

Le 16/05/2011 08:37, Matthew Brush a écrit :

...

On 05/15/11 20:45, Matthew Brush wrote:

...
The second one is a simple refactoring of the code around the regex-based encoding detection and adding a few more patterns to detect a few more types of files. Especially needing review are the actual pattern strings I used since, quite frankly, I suck at regexps. One thing I noticed looking at the bug report here[5], after creating these patches/commits, is that it seems the <?xml ... encoding="" ?> is already supported by the CODING regex, so you can ignore the new XML one I added if that truly is the case (I'll remove it from the proper patch I send in the future). Like I said, my regex skills aren't great :)

I've made a commit[1] to fix the issue mentioned here, which removes extra regex patterns and uses two very general ones (the coding one is almost the same as before). I need to test these better still but a few quick tests seem to suggest they work ok.

Oops, didn't saw this one when answering to the previous one.

Well, encoding detection it's better, but:

1) it could even become a single re, with "(coding|charset)" 2) the space is useless in "[:= ]" since it'd be matched just after 3) the geany_debug() call is redundant with encodings.c:357

I'm just wondering whether these aren't a little too loose in real world, but they look OK (not like the ones of the previous patch ^^).

and the changes in socket.c shouldn't be in the same commit ;)

Cheers, Colomban

Matthew Brush

18 May 18 May

2:15 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

On 05/17/11 08:40, Colomban Wendling wrote:

...

Le 16/05/2011 08:37, Matthew Brush a écrit :

...
I've made a commit[1] to fix the issue mentioned here, which removes extra regex patterns and uses two very general ones (the coding one is almost the same as before). I need to test these better still but a few quick tests seem to suggest they work ok.

Oops, didn't saw this one when answering to the previous one.

Well, encoding detection it's better, but:

it could even become a single re, with "(coding|charset)"

Good point!

...

the space is useless in "[:= ]" since it'd be matched just after

I'll trust you on this :)

...

the geany_debug() call is redundant with encodings.c:357

That was an accident leaving that in there, it wasn't even meant to be there permanently, just for my own testing (before I realized the similar debug message from elsewhere).

...

I'm just wondering whether these aren't a little too loose in real world, but they look OK (not like the ones of the previous patch ^^).

I had this exact same thought, but I think it's OK since it will catch any other types of files that may use a similar scheme, and also, AFAIK, it still needs to find a real encoding/charset in order to truly "work", so I think it's safe and flexible. Also getting rid of that nasty <meta> regex would be pleasant to readers of the code :)

...

and the changes in socket.c shouldn't be in the same commit ;)

True enough.

Like I said in my last message, I really appreciate you taking the time to review this code. I'm going to get hacking on the improvements shortly.

Cheers, Matthew Brush

Colomban Wendling

3:29 p.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

Le 18/05/2011 02:15, Matthew Brush a écrit :

...

On 05/17/11 08:40, Colomban Wendling wrote:

...

the space is useless in "[:= ]" since it'd be matched just after

I'll trust you on this :)

Hum, actually you shouldn't I was wrong, it has not the same meaning. The space in that character range allowed for space as the separator, witch you actually meant to do, so keep the space. Sorry for the noise ^^

...

...
I'm just wondering whether these aren't a little too loose in real world, but they look OK (not like the ones of the previous patch ^^).

I had this exact same thought, but I think it's OK since it will catch any other types of files that may use a similar scheme, and also, AFAIK, it still needs to find a real encoding/charset in order to truly "work", so I think it's safe and flexible. Also getting rid of that nasty

<meta> regex would be pleasant to readers of the code :)

Yeah, and as said these regex are less loose than was the previous ones, since they need "charset" or "encoding" to be directly followed by something useful.

Cheers, Colomban

Colomban Wendling

17 May 17 May

5:30 p.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

Le 16/05/2011 05:45, Matthew Brush a écrit :

...

Hi all,

Hey. A bit later, but I finally managed to find some time reading your patches. Below is basically only some comments on the patches, you already discussed some points with Lex, I won't add anything on what was said when I don't think I have something to add.

...

I was wondering if anyone had time to review some changes I've been working on?

No really as you see, times is rare these days :/ But I'm trying to do my best...

...

There's two commits involved (attached patches and linked below):

Allow embedded NUL chars in files (without truncating).

I did a fast and inaccurate review of the patch, see my comments on it: https://github.com/codebrainz/geany/commit/81c89cce736c88f235761ce5765f2361f...

yeah, I'm trying GitHub to see what it actually gives. But I'm afraid it'll make my comments less visible for the ML readers, so unlikely fr them to comment back…

...

Add more regex patterns for guessing encoding.

This one too, see https://github.com/codebrainz/geany/commit/5dec6db43b498378be3496d4e89496835...

...

The first one is kind of large and affects several files/functions. One thing I did which should change is in a few places I added a 'gsize tmp_size' var to pass to the encodings_convert_to_utf8* functions, where a simple NULL size parameter would've sufficed, in these cases, the string is known to be NUL-terminated so this parameter isn't required. Unfortunately I made a commit after this so fixing this will come afterwards (though it's not such a big deal). Also, I'm not completely satisfied with the name of the sci_add_text2() function added to the sciwrappers files, but there was already a function named sci_add_text() which truncates the text at the first NUL, and the new function is a more direct wrapper of the SCI_ADDTEXT message.

sci_add_text_full() ?

having counterpart for set_text() would probably be good, too.

...

These changes break the plugin API, and would require two small changes to the geany-plugins GeanyVC: geanyvc.c:480 and geanyvc.c:498 as well as GeanyGenDoc: ggd-plugin.c:343. It should be obvious why this breakage was required I hope.

yeah, sounds reasonable for such a change. However, prefer breaking the API once (or at least it a single row), so I wouldn't see these committed before further checking and other checks to whether other APIs have to be updated.

...

The following bug reports are related to the first set of changes: [1][2][3][4]. A few of the comments seem to suggest that it's not desired to support text files containing NUL control bytes, but IMO, it's nice to be able to show these anyway, especially since Scintilla has a nice way of dealing with these. IMO, as long as the file gets marked read-only, displaying as much as possible is better than truncating at the first NUL byte.

As you discussed with Lex, it's something that may please some, but supporting this partially may simply move the user complaints :D

...

The second one is a simple refactoring of the code around the regex-based encoding detection and adding a few more patterns to detect a few more types of files. Especially needing review are the actual pattern strings I used since, quite frankly, I suck at regexps. One thing I noticed looking at the bug report here[5], after creating these patches/commits, is that it seems the <?xml ... encoding="" ?> is already supported by the CODING regex, so you can ignore the new XML one I added if that truly is the case (I'll remove it from the proper patch I send in the future). Like I said, my regex skills aren't great :)

As said on GitHub, I'm puzzled with this one.

...

Any feedback would be much appreciated as I plan to submit these changes as proper patches once I've ensured they are good enough.

In the meanwhile, I will continue testing...

That's the most important part: daily testing ^^

Cheers, Colomban

Matthew Brush

18 May 18 May

2:05 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

On 05/17/11 08:30, Colomban Wendling wrote:

...

Le 16/05/2011 05:45, Matthew Brush a écrit :

...
I was wondering if anyone had time to review some changes I've been working on?

No really as you see, times is rare these days :/ But I'm trying to do my best...

Thanks a lot for taking the time to review these changes, and also for using GitHub since it makes it *very* easy for me. I will review them all and make the appropriate fixes after I finish checking my emails.

...

I did a fast and inaccurate review of the patch, see my comments on it: https://github.com/codebrainz/geany/commit/81c89cce736c88f235761ce5765f2361f...

yeah, I'm trying GitHub to see what it actually gives. But I'm afraid it'll make my comments less visible for the ML readers, so unlikely fr them to comment back…

I guess if more people were using it and watching each others repositories, everyone would be on the same page by getting notifications of such things.

...

...
afterwards (though it's not such a big deal). Also, I'm not completely satisfied with the name of the sci_add_text2() function added to the sciwrappers files, but there was already a function named sci_add_text() which truncates the text at the first NUL, and the new function is a more direct wrapper of the SCI_ADDTEXT message.

sci_add_text_full() ?

having counterpart for set_text() would probably be good, too.

I like the _full idea like GLib uses. Regarding set_text(), it's actually one of the Scintilla functions that expects NUL termination. The reason for the add_text2() one is that add_text() was wrapped "wrong" (it's supposed to take a length param like in Scintilla) and I didn't want to clobber it for existing code. That's not to say there couldn't be a set_text_full() which internally handles the case of embedded nulls by doing a clear_all() then a add_text() in the same way my changes do in document.c

...

...
These changes break the plugin API, and would require two small changes to the geany-plugins GeanyVC: geanyvc.c:480 and geanyvc.c:498 as well as GeanyGenDoc: ggd-plugin.c:343. It should be obvious why this breakage was required I hope.

yeah, sounds reasonable for such a change. However, prefer breaking the API once (or at least it a single row), so I wouldn't see these committed before further checking and other checks to whether other APIs have to be updated.

Agreed, if I understand you correctly.

...

...
The following bug reports are related to the first set of changes: [1][2][3][4]. A few of the comments seem to suggest that it's not desired to support text files containing NUL control bytes, but IMO, it's nice to be able to show these anyway, especially since Scintilla has a nice way of dealing with these. IMO, as long as the file gets marked read-only, displaying as much as possible is better than truncating at the first NUL byte.

As you discussed with Lex, it's something that may please some, but supporting this partially may simply move the user complaints :D

Agreed, although I think an appropriate warning message change when such a file is opened will go a little further to help with user complaints. Still, you're both right on this.

...

...
The second one is a simple refactoring of the code around the regex-based encoding detection and adding a few more patterns to detect a few more types of files. Especially needing review are the actual pattern strings I used since, quite frankly, I suck at regexps. One thing I noticed looking at the bug report here[5], after creating these patches/commits, is that it seems the<?xml ... encoding="" ?> is already supported by the CODING regex, so you can ignore the new XML one I added if that truly is the case (I'll remove it from the proper patch I send in the future). Like I said, my regex skills aren't great :)

As said on GitHub, I'm puzzled with this one.

When I write regex code it looks more like my cat walking across the keyboard than it does me actually writing code :)

Cheers, Matthew Brush

Colomban Wendling

3:28 p.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

Le 18/05/2011 02:05, Matthew Brush a écrit :

...

On 05/17/11 08:30, Colomban Wendling wrote:

...
Le 16/05/2011 05:45, Matthew Brush a écrit :

...
I was wondering if anyone had time to review some changes I've been working on?

No really as you see, times is rare these days :/ But I'm trying to do my best...

Thanks a lot for taking the time to review these changes, and also for using GitHub since it makes it *very* easy for me. I will review them all and make the appropriate fixes after I finish checking my emails.

No problem, as far as I actually find the time to ^^

...

...
I did a fast and inaccurate review of the patch, see my comments on it: https://github.com/codebrainz/geany/commit/81c89cce736c88f235761ce5765f2361f...

yeah, I'm trying GitHub to see what it actually gives. But I'm afraid it'll make my comments less visible for the ML readers, so unlikely fr them to comment back…

I guess if more people were using it and watching each others repositories, everyone would be on the same page by getting notifications of such things.

Yeah probably, but that's not (yet?) the case ^^

...

...
...
afterwards (though it's not such a big deal). Also, I'm not completely satisfied with the name of the sci_add_text2() function added to the sciwrappers files, but there was already a function named sci_add_text() which truncates the text at the first NUL, and the new function is a more direct wrapper of the SCI_ADDTEXT message.

sci_add_text_full() ?

having counterpart for set_text() would probably be good, too.

I like the _full idea like GLib uses. Regarding set_text(), it's actually one of the Scintilla functions that expects NUL termination. The reason for the add_text2() one is that add_text() was wrapped "wrong" (it's supposed to take a length param like in Scintilla) and I didn't want to clobber it for existing code. That's not to say there couldn't be a set_text_full() which internally handles the case of embedded nulls by doing a clear_all() then a add_text() in the same way my changes do in document.c

That's what I meant for set_text_full(). Basically we don't need to show the real SCI api here, just what's useful to us. If it means wrapping to calls, I'd say it's just fine. And having such a wrapper will help user (here meaning "client" code) to properly give a length parameter where set_text() used to be used.

...

...
...
These changes break the plugin API, and would require two small changes to the geany-plugins GeanyVC: geanyvc.c:480 and geanyvc.c:498 as well as GeanyGenDoc: ggd-plugin.c:343. It should be obvious why this breakage was required I hope.

yeah, sounds reasonable for such a change. However, prefer breaking the API once (or at least it a single row), so I wouldn't see these committed before further checking and other checks to whether other APIs have to be updated.

Agreed, if I understand you correctly.

Hum, yeah, that must be the worst sentence I written since years :D I just mean I don't want to break the ABI now, and again in two days (or whatever short delay) because we figure out we have to change some other API to support NULs in another place, and again and again.

Cheers, Colomban

Lex Trotman

19 May 19 May

2:07 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

...

That's what I meant for set_text_full(). Basically we don't need to show the real SCI api here, just what's useful to us. If it means wrapping to calls, I'd say it's just fine. And having such a wrapper will help user (here meaning "client" code) to properly give a length parameter where set_text() used to be used.

The sciwrapper functions that just wrap scintilla messages are documented by the scintilla docs.

If functions are added to sciwrapper that don't just wrap Scintilla messages or change the names then they need to be documented. Since we use doxygen comments for API generation the only option for wrappers that are not part of the API is a plain source code comment and hope people see it :-(

So if possible the sciwrapper functions should have the same name as the message. In this particular instance backward compatibility means that we can't do that, but IIUC the new function is part of the API anyway.

...

Hum, yeah, that must be the worst sentence I written since years :D I just mean I don't want to break the ABI now, and again in two days (or whatever short delay) because we figure out we have to change some other API to support NULs in another place, and again and again.

Good point.

Cheers Lex

Matthew Brush

2:33 a.m.

New subject: [Geany-devel] [CODE REVIEW] Changes to encodings

On 05/18/11 17:07, Lex Trotman wrote:

...

...
That's what I meant for set_text_full(). Basically we don't need to show the real SCI api here, just what's useful to us. If it means wrapping to calls, I'd say it's just fine. And having such a wrapper will help user (here meaning "client" code) to properly give a length parameter where set_text() used to be used.

The sciwrapper functions that just wrap scintilla messages are documented by the scintilla docs.

If functions are added to sciwrapper that don't just wrap Scintilla messages or change the names then they need to be documented. Since we use doxygen comments for API generation the only option for wrappers that are not part of the API is a plain source code comment and hope people see it :-(

So if possible the sciwrapper functions should have the same name as the message. In this particular instance backward compatibility means that we can't do that, but IIUC the new function is part of the API anyway.

Hi Lex,

These are the related sci_*_text_full() commits[1][2] I'll be sending as patches to the ML once the rest of the encodings stuff is cleaned up and tested more.

The first commit adds the functions with the doc comments starting with /* for all 3 functions, and then the second commit which makes 2 of the 3 functions part of the API, makes the comments start with /** so doxygen sees them. I did it in two commits like this in case maybe these weren' to become part of the plugin API. All 3 functions are well commented (mostly wording direct from Scintilla docs).

Just wanted to update here since I didn't link to this new branch yet where I'm putting cleaner versions of the changes. If you click 'Switch Branches' then 'encodings_wip' you can see the other cleaner versions of the changes so far. Feedback welcome.

[1] https://github.com/codebrainz/geany/commit/b5a5bc171d7550561ad3c5eceaa590f1a... [2] https://github.com/codebrainz/geany/commit/cf86804594e4ab55f117e25c25adf9612...

Cheers, Matthew Brush

5035

Age (days ago)

5038

Last active (days ago)

devel@lists.geany.org

16 comments

3 participants

tags (0)

participants (3)

Colomban Wendling
Lex Trotman
Matthew Brush