[geany/geany] In-file encoding detection not always working (Issue #3777)

M-HT

2 Mar 2024

11:46 a.m.

In-file encoding detection fails when the first 512 bytes aren't a valid UTF-8 string. The function `g_regex_match_full`, used [here](https://github.com/geany/geany/blob/06acd17cbeab6c0154e752b584aaf5001d9af1e8...), expects a valid UTF-8 string unless the `G_REGEX_RAW` option is used (see the [documentation](https://docs.gtk.org/glib/method.Regex.match_full.html)). A possible solution is to add `G_REGEX_RAW` to the `g_regex_new` call used [here](https://github.com/geany/geany/blob/06acd17cbeab6c0154e752b584aaf5001d9af1e8...).
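To illustrate why a non-UTF-8 file trips this requirement, here is a minimal sketch in plain C of a UTF-8 well-formedness check. It is not Geany's or GLib's code: `is_valid_utf8` is a hypothetical helper written for this example, and the sample bytes in the note below it are an assumed CP932-style sequence, not taken from the attached file.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8.
 * Rejects stray continuation bytes, truncated sequences, overlong forms,
 * surrogates, and code points above U+10FFFF. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* decoded code point */

        if (c < 0x80) {                      /* plain ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) { /* 2-byte lead */
            extra = 1; cp = c & 0x1F;
        } else if (c >= 0xE0 && c <= 0xEF) { /* 3-byte lead */
            extra = 2; cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) { /* 4-byte lead */
            extra = 3; cp = c & 0x07;
        } else {
            return false;  /* continuation byte or invalid lead byte */
        }

        if (len - i <= extra)
            return false;  /* sequence is cut off by the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (extra == 2 && cp < 0x800)
            return false;  /* overlong 3-byte form */
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF))
            return false;  /* overlong 4-byte form or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;  /* UTF-16 surrogate */

        i += extra + 1;
    }
    return true;
}
```

With this check, a pure ASCII header such as `encoding: cp932` passes, while a buffer starting with bytes like `0x93 0xFA` is rejected at the first byte, because `0x93` can only appear as a continuation byte in UTF-8. A regex engine running in UTF mode is entitled to refuse such input in the same way.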

-- Reply to this email directly or view it on GitHub: https://github.com/geany/geany/issues/3777 You are receiving this because you are subscribed to this thread. Message ID: geany/geany/issues/3777@github.com


elextr

2 Mar

5:28 p.m.

Indeed, the file must contain the pattern as a byte sequence for this to work; UTF-16 and a number of other encodings won't work because the byte pattern simply doesn't exist in the file. What was the encoding that didn't work? Also, an encoding written in the file may be overridden by a BOM. The encodings code is pretty "messy" (technical term ;-) and "somebody" would have to check.

`RAW` is not a [PCRE2](https://www.pcre.org/current/doc/html/pcre2_compile.html) option; I wonder what GLib translates it into, and how it interacts with `CASELESS`, which is fairly important.


M-HT

6:13 p.m.

Here's a sample text file, encoded in CP932, where the detection doesn't work: [test.txt](https://github.com/geany/geany/files/14471001/test.txt)

Detection works when I add `G_REGEX_RAW` to function `g_regex_new`.

From the GLib source: if `G_REGEX_RAW` is not used, GLib uses the compile option `PCRE2_UTF`; if `G_REGEX_RAW` is used, it uses the match option `PCRE2_NO_UTF_CHECK`.


elextr

6:36 p.m.

Well, removing the `PCRE2_UTF` option does change `CASELESS` behaviour, but ASCII should still work, which is all we need for this purpose. Does it work with `RAW` if you change the case of "encoding" in your file? If so, a PR adding `G_REGEX_RAW` would probably be accepted.
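As a rough illustration of why ASCII-only case folding is enough here, the following plain C sketch searches a raw byte buffer for a keyword while ignoring ASCII case. It does not use GLib or PCRE2 and is not how Geany performs detection; `ascii_lower` and `find_ascii_caseless` are hypothetical names introduced for this example.

```c
#include <stddef.h>
#include <string.h>

/* Fold only A-Z to a-z; every other byte value is left untouched,
 * so bytes >= 0x80 from any legacy encoding pass through unchanged. */
static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: returns the byte offset of the first ASCII
 * case-insensitive occurrence of needle in buf[0..len), or -1 if absent.
 * The buffer is treated as raw bytes; no UTF-8 validity is assumed. */
long find_ascii_caseless(const unsigned char *buf, size_t len,
                         const char *needle)
{
    size_t n = strlen(needle);

    if (n == 0 || n > len)
        return -1;

    for (size_t i = 0; i + n <= len; i++) {
        size_t k = 0;
        while (k < n &&
               ascii_lower(buf[i + k]) ==
               ascii_lower((unsigned char)needle[k]))
            k++;
        if (k == n)
            return (long)i;
    }
    return -1;
}
```

For a buffer holding the bytes `0x93 0xFA`, a space, and then `EnCoDiNg:`, calling `find_ascii_caseless(buf, len, "encoding")` returns 3. The non-ASCII bytes in front never need to be decoded, which mirrors the expectation that a raw-mode, caseless match still finds an ASCII keyword regardless of its letter case.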


M-HT

8:13 p.m.

Yes, the detection works when I change the case. I'll make a pull request later.


Colomban Wendling

22 Apr

3:55 p.m.

Fixed in #3716 (b23201c01390e6ed6d631cf232e404691ffe91ec)


Colomban Wendling

3:55 p.m.

Closed #3777 as completed.

