Hi,
I'm having some random crashes, possibly due to bad tag files? Patch attached "fixes" the segfault, but I'm pretty sure it's not the "correct" fix.
P.S. I also noticed in this block of code::
encodings_convert_to_utf8_from_charset(utf8_name, (gsize) -1, ...)
Is it OK the cast a negative number to `gsize` and will it have the desired effect for that function? The `-1` here is supposed to tell the encoding function that the string is nul terminated and is to be measured with `strlen()` IIUC.
Cheers, Matthew Brush
Hi,
Le 02/01/2012 06:01, Matthew Brush a écrit :
Hi,
I'm having some random crashes, possibly due to bad tag files? Patch attached "fixes" the segfault, but I'm pretty sure it's not the "correct" fix.
Well, you spotted a quite complex bug in tag encoding management... what we currently do is to assume tag's strings are encoded in the document's encoding, which is probably true for document tags but not necessarily for global tags.
I'm not sure what would be the right fix... we either need to always have the same encoding in a TMTag (better be UTF8), or we need to have a way to find what's the tag's encoding (e.g. a new TMTag field, ouch).
Your change at least prevents a crash, but it skips completely the tag. It's probably better than crashing, but still not perfect. Another bad solution would be to do something like [1].
This said, I'm pretty sure that if you get this bug you have some tags with non-ASCII characters, which is huh... weird. The only language I know would work with such mixed-encoding non-ASCII names is Python3 -- and even there it's considered bad practice to use the feature :D
So, I'm not sure what we should do here, but we probably do something. Any idea?
P.S. I also noticed in this block of code::
encodings_convert_to_utf8_from_charset(utf8_name, (gsize) -1, ...)
Is it OK the cast a negative number to `gsize` and will it have the desired effect for that function? The `-1` here is supposed to tell the encoding function that the string is nul terminated and is to be measured with `strlen()` IIUC.
It's ugly, but AFAIK it'll work fine everywhere since it's casted back to gssize in that function. But you're right, it should definitely be a gssize parameter. Unfortunately this function is in the plugin API, and I'm not sure of the implications of changing that parameter's type...
Cheers, Colomban
Cheers, Matthew Brush
On 02/01/2012 14:33, Colomban Wendling wrote:
encodings_convert_to_utf8_from_charset(utf8_name, (gsize) -1, ...)
Is it OK the cast a negative number to `gsize` and will it have the desired effect for that function? The `-1` here is supposed to tell the encoding function that the string is nul terminated and is to be measured with `strlen()` IIUC.
It's ugly, but AFAIK it'll work fine everywhere since it's casted back to gssize in that function. But you're right, it should definitely be a gssize parameter. Unfortunately this function is in the plugin API, and I'm not sure of the implications of changing that parameter's type...
That should be fixed. I'm not sure whether it needs an ABI break or not (guess not), but if the parameter can be -1 then make it gssize.
Nick
On 02/01/2012 15:54, Nick Treleaven wrote:
On 02/01/2012 14:33, Colomban Wendling wrote:
encodings_convert_to_utf8_from_charset(utf8_name, (gsize) -1, ...)
Is it OK the cast a negative number to `gsize` and will it have the desired effect for that function? The `-1` here is supposed to tell
the
encoding function that the string is nul terminated and is to be measured with `strlen()` IIUC.
It's ugly, but AFAIK it'll work fine everywhere since it's casted back to gssize in that function. But you're right, it should definitely be a gssize parameter. Unfortunately this function is in the plugin API, and I'm not sure of the implications of changing that parameter's type...
That should be fixed. I'm not sure whether it needs an ABI break or not (guess not), but if the parameter can be -1 then make it gssize.
Now fixed in Git. I decided an ABI break wasn't needed.
Le 03/01/2012 14:36, Nick Treleaven a écrit :
On 02/01/2012 15:54, Nick Treleaven wrote:
On 02/01/2012 14:33, Colomban Wendling wrote:
encodings_convert_to_utf8_from_charset(utf8_name, (gsize) -1, ...)
Is it OK the cast a negative number to `gsize` and will it have the desired effect for that function? The `-1` here is supposed to tell
the
encoding function that the string is nul terminated and is to be measured with `strlen()` IIUC.
It's ugly, but AFAIK it'll work fine everywhere since it's casted back to gssize in that function. But you're right, it should definitely be a gssize parameter. Unfortunately this function is in the plugin API, and I'm not sure of the implications of changing that parameter's type...
That should be fixed. I'm not sure whether it needs an ABI break or not (guess not), but if the parameter can be -1 then make it gssize.
Now fixed in Git. I decided an ABI break wasn't needed.
OK, cool. Let's just hope it don't lead to weird behavior, I'm still not 100% sure of what would happen even though I'm quite confident it'll work at least on x86 (and actually a test of mine shows it does on my x86_64 box :) ).
Cheers, Colomban
On Tue, Jan 3, 2012 at 1:33 AM, Colomban Wendling lists.ban@herbesfolles.org wrote:
Hi,
Le 02/01/2012 06:01, Matthew Brush a écrit :
Hi,
I'm having some random crashes, possibly due to bad tag files? Patch attached "fixes" the segfault, but I'm pretty sure it's not the "correct" fix.
Well, you spotted a quite complex bug in tag encoding management... what we currently do is to assume tag's strings are encoded in the document's encoding, which is probably true for document tags but not necessarily for global tags.
I'm not sure what would be the right fix... we either need to always have the same encoding in a TMTag (better be UTF8), or we need to have a way to find what's the tag's encoding (e.g. a new TMTag field, ouch).
I agree the only way is to have all tag files have the same UTF-8 encoding. Since we only read tag files we write that shouldn't be a problem.
[...] Cheers Lex