I have many files on my webpage, which I wanted to standardize to utf-8. To automatically detect which ones contain problems I made a small script which detects the actual encoding (using `chardet` for the detection).
- A file containing the `ñ` is detected by `chardet` as ISO-8859-9 (as I understand it - correctly).
- The file contains a 0xF1 byte (coming from ISO-8859-9, also known as Latin-5)
- If I use `geany` to convert the document to UTF-8, the file doesn't seem to change (even though the tab changes to indicate the change). The 0xF1 is still present.
I suspect that geany (or the library it calls) doesn't recognize that a lone 0xF1 is _not_ valid utf-8, hence doesn't translate it.
Is there any work-around?
What does Geany detect the encoding as when it opens the file?
The `0xF1` byte is valid in more than one ISO-8859 encoding, e.g. it's `ntilde` in ISO-8859-1 too. So which encoding the file opens with is effectively arbitrary: it is whichever candidate encoding succeeds first, and that order is implementation dependent.
This is why setting the encoding explicitly (`File->Open->More Options->SE&SW Asian->Turkish (ISO 8859-9)`) is available, for when encodings are ambiguous.
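The ambiguity can be illustrated outside Geany with a minimal Python sketch. It only uses the standard codecs; `convert_to_utf8` is a hypothetical helper name introduced for illustration, not something Geany or `chardet` provides. The point is that the byte alone cannot identify the encoding, so the source encoding has to be named explicitly.

```python
RAW = b"\xf1"  # the single byte found in the original file

# Several legacy encodings map 0xF1 to the same character,
# so the byte alone cannot tell them apart.
decoded = {
    name: RAW.decode(name)
    for name in ("iso-8859-1", "iso-8859-9", "iso-8859-15")
}
print(decoded)


def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: re-encode bytes from an explicitly named encoding."""
    return data.decode(source_encoding).encode("utf-8")


# With the encoding stated explicitly the conversion is unambiguous.
print(convert_to_utf8(RAW, "iso-8859-9"))
```

For this particular byte any of the three encodings gives the same result; for bytes where they differ, only an explicit choice gives a predictable conversion.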
Closed #3141.
Sorry for the delay...
`File | Properties` says the file is utf-8, which is the same encoding as is selected in `Document | Set encoding...`. Running `chardet` on it returns `ISO-8859-9` (which I understand is as valid as ISO-8859-1 and others).
If I look inside the file, I find: (the `ñ` is at 0x122)
```
120 20 61 c3 b1 6f 73 3a 20 22 20 2e 0a 20 20 20 20   a..os: " ..
130 20 20 77 33 5f 6c 69 6e 6b 32 73 28 22 77 65 65   w3_link2s("wee
```

The strange thing is that `chardet` claims `ISO-8859-9`, which should have 'ñ' as a one-byte code. I load the file in Python3 as `open('weekly.php', 'rb')` and pass the contents as `bytes` to `chardet...
So, the issue does not seem to be Geany's.
Thanks for the info though!
The `c3 b1` shown is the correct UTF-8 encoding of `ñ` (U+00F1, which is the single byte `F1` in ISO-8859-1), so Geany has loaded the file correctly.
I'm afraid we have no responsibility for what mistakes chardet makes :smile:.
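As a rough cross-check that does not rely on a detector's guess, the bytes from the hex dump above can be decoded strictly as UTF-8 in Python. This is only a sketch: `is_valid_utf8` is a hypothetical helper name, and the byte string is copied from the first dump row rather than read from the real file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Bytes 0x120..0x12b from the dump above, where `c3 b1` encodes the ñ.
dump = bytes.fromhex("20 61 c3 b1 6f 73 3a 20 22 20 2e 0a")

print(is_valid_utf8(dump))        # the converted file content
print(repr(dump.decode("utf-8")))

# The same word with a lone 0xF1 byte, as in the unconverted file.
print(is_valid_utf8(b"a\xf1os"))
```

A strict decode either succeeds or fails, which makes it a simpler test for "is this file already UTF-8" than an encoding guess.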