In-file encoding detection fails when the first 512 bytes aren't a valid UTF-8 string. The function `g_regex_match_full`, used [here](https://github.com/geany/geany/blob/06acd17cbeab6c0154e752b584aaf5001d9af1e8...), expects a valid UTF-8 string unless the `G_REGEX_RAW` option is used - see the [documentation](https://docs.gtk.org/glib/method.Regex.match_full.html). A possible solution is to pass `G_REGEX_RAW` to `g_regex_new`, used [here](https://github.com/geany/geany/blob/06acd17cbeab6c0154e752b584aaf5001d9af1e8...).
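To illustrate why a non-UTF-8 header trips a matcher that insists on valid UTF-8, here is a minimal plain-C sketch of a structural UTF-8 check. It is not GLib's or PCRE2's validation code; the function name `is_valid_utf8` and the sample bytes in the note below are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8.
 * Rejects stray continuation bytes, truncated sequences, overlong forms,
 * surrogates and code points above U+10FFFF. */
static bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;       /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */

        if (c < 0x80) { i++; continue; }                  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; }
        else return false;           /* continuation byte or 0xF8+ as lead */

        if (len - i <= need) return false;                /* truncated */
        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if ((need == 1 && cp < 0x80) ||
            (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000)) return false;    /* overlong */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* surrogate */
        if (cp > 0x10FFFF) return false;                  /* out of range */
        i += need + 1;
    }
    return true;
}
```

A pure-ASCII header such as `encoding: utf-8` passes this check, while a buffer starting with a byte like `0x93` (a continuation byte in UTF-8, but a plausible lead byte in a legacy double-byte encoding) is rejected at the first byte, before any pattern could be matched.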
Indeed, the file contents must be a byte sequence for this to work; UTF-16 and a number of other encodings won't work because the byte pattern simply doesn't exist in the file. What was the encoding that didn't work? Also, an encoding written in the file may be overridden by a BOM. The encodings code is pretty "messy" (technical term ;-) and "somebody" would have to check.
`RAW` is not a [PCRE2](https://www.pcre.org/current/doc/html/pcre2_compile.html) option, so I wonder what GLib translates it into, and how it interacts with `CASELESS`, which is fairly important.
Here's a sample text file where the detection doesn't work - encoding CP932. [test.txt](https://github.com/geany/geany/files/14471001/test.txt)
Detection works when I add `G_REGEX_RAW` to function `g_regex_new`.
From the GLib source: if `G_REGEX_RAW` is not used, it uses the compile option `PCRE2_UTF`; if `G_REGEX_RAW` is used, it uses the match option `PCRE2_NO_UTF_CHECK`.
Well, removing the `PCRE2_UTF` option does change `CASELESS` behaviour, but ASCII should still work, which is all we need for this purpose. Does it work with `RAW` if you change the case of "encoding" in your file? If so, a PR adding `G_REGEX_RAW` would probably be accepted.
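The point about ASCII-only caseless matching can be sketched in plain C: folding only `A`-`Z` is enough to find a keyword like "encoding" in a raw byte buffer, whatever the surrounding non-ASCII bytes are. This is an illustration of the idea, not Geany's regex or what PCRE2 does internally; `ascii_lower` and `find_ascii_caseless` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Lowercase ASCII letters only; every other byte is returned unchanged. */
static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: offset of the first ASCII-caseless occurrence of
 * the NUL-terminated needle in buf[0..len), or -1 if there is none. */
static long find_ascii_caseless(const unsigned char *buf, size_t len,
                                const char *needle)
{
    assert(buf != NULL && needle != NULL);

    size_t nlen = 0;
    while (needle[nlen] != '\0')
        nlen++;
    if (nlen == 0 || nlen > len)
        return -1;

    for (size_t i = 0; i + nlen <= len; i++) {
        size_t k = 0;
        while (k < nlen &&
               ascii_lower(buf[i + k]) ==
               ascii_lower((unsigned char)needle[k]))
            k++;
        if (k == nlen)
            return (long)i;  /* whole needle matched at offset i */
    }
    return -1;
}
```

Because the comparison works byte by byte, bytes outside the ASCII range never need to be decoded, so a header like `ENCODING=cp932` is found even when it follows bytes that are not valid UTF-8.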
Yes, the detection works when I change the case. I'll make a pull request later.
Fixed in #3716 (b23201c01390e6ed6d631cf232e404691ffe91ec)
Closed #3777 as completed.