Hi,
I'm being redirected here from the GLib bug tracker:
I'm using the text editor 'Geany' version 2.0 under Windows 11. According to the "About" dialog, Geany version 2.0 is based on GLib 2.78.0. I'm trying to convert a large UTF-8 encoded text file into ISO-8859-1 because I'm not skilled enough to properly handle UTF-8 strings in a program I'm making and I currently do not need to support anything but Spanish. However, when I attempt to change the file encoding, it throws an error during save:
`GLib.GException: Hay una secuencia de bytes no válida en la entrada de conversión.`
The supposedly bad character is:
``` 'ἀ' 'Greek Small Letter Alpha with Psili' U+1F00 UTF-8 bytes: 0xE1 0xBC 0x80 ```
This is correct. The file I'm processing contains linguistic information and contains a lot of unusual characters such as Greek letters. I've verified the file by hand in a hex editor and the bytes are properly encoded. Therefore, I've determined that this must be a bug in the GLib library. Or, at least, in the way Geany handles GLib exceptions.
My hypothesis is that this particular character is outside the range of characters ISO-8859-1 supports. Therefore, not finding a 1:1 equivalent, it throws a warning to alert the program it's going to lose information in the conversion process. What I don't understand is why it says "There is an invalid byte sequence in the conversion input" if the sequence is actually valid. If I'm right, it should use a different message.
GLib is correct: `Greek Small Letter Alpha with Psili` is not a character in ISO-8859-1; see https://en.wikipedia.org/wiki/ISO/IEC_8859-1.
ISO-8859-1 only encodes 256 characters of the million or so that Unicode does and Greek characters are not included, so it can't be converted. If the file contains a Greek character then your claim that you only need Spanish is false (at least as far as that file goes).
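To make the two failure cases discussed here concrete, the following is a minimal sketch in Python. It uses Python's own codecs, not GLib, so it says nothing about how GLib or Geany word their errors; the helper name `classify` is hypothetical. It only illustrates that "the input bytes are malformed" and "the input is valid but the target encoding has no such character" are distinguishable conditions.

```python
# Bytes quoted in the report: U+1F00 encoded as UTF-8.
raw = bytes([0xE1, 0xBC, 0x80])


def classify(data: bytes, target: str = "iso-8859-1") -> str:
    """Report which stage of a UTF-8 -> target conversion fails, if any."""
    try:
        # Stage 1: is the input well-formed UTF-8?
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid input bytes"
    try:
        # Stage 2: can every character be represented in the target?
        text.encode(target)
    except UnicodeEncodeError:
        return "valid input, not representable in target"
    return "ok"


print(classify(raw))                    # the character from the report
print(classify(b"\xe1\xbc"))            # truncated UTF-8 sequence
print(classify("año".encode("utf-8")))  # plain Spanish text
```

The first call falls into the second category: the bytes decode cleanly, but ISO-8859-1 has no slot for the resulting character.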
Closed #3792 as completed.
Yes, I know it's going to lose some information in the process. But as I said, the message shown to the user is wrong. There is no "invalid byte sequence" in the original text. It should say something like "the original text contains symbols that can't be translated into the requested encoding". Giving misleading information during an error is a bug in itself.
Besides, there should be a way to allow the user to proceed with the conversion anyway because the only other option is to remove those symbols by hand.
> But as I said, the message shown to the user is wrong. There is no "invalid byte sequence" in the original text. It should say something like "the original text contains symbols that can't be translated into the requested encoding". Giving misleading information during an error is a bug in itself.

The error message is from GLib, as your OP said, not Geany, so you need to raise that issue there, but I suspect it's probably actually from the underlying iconv or ICU library; I don't know what GLib uses on Windows. Maybe the error message makes sense at that level, where it is converting byte sequences and the byte sequence in the input is invalid in the output?
Geany spends all its time trying to preserve the user's data, so it's not going to let you save in a way that does not preserve content. Save in an encoding that supports the relevant characters, then convert with specialised tools if you want to.
> the only other option is to remove those symbols by hand.

Correct. Geany is not a transliteration program; changes can only be made by users.
For what you are trying to do you really need a transliteration program like [uconv](https://linux.die.net/man/1/uconv) (part of [ICU](https://en.wikipedia.org/wiki/International_Components_for_Unicode) IIUC) that can be told what to do with untranslatable data. Adding actions that are rarely used to Geany just complicates it more and makes more work for the volunteers that support it; better to use the right tool for rare operations. Beat it with a :hammer: not a :screwdriver: :grinning:
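As a rough illustration of "telling the tool what to do with untranslatable data", here is a small Python sketch of a lossy UTF-8 to ISO-8859-1 conversion run outside the editor. It is not uconv and does not reproduce its options; the function name `convert_lossy` is hypothetical, and the `policy` values are Python's standard codec error handlers.

```python
def convert_lossy(data: bytes, policy: str = "replace") -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1, applying `policy` to characters
    the target encoding cannot represent.

    policy is one of Python's codec error handlers, for example:
      "replace"           -> substitute '?'
      "ignore"            -> drop the character
      "xmlcharrefreplace" -> write a numeric reference such as &#7936;
    """
    # Decoding stays strict: malformed input should still be an error.
    text = data.decode("utf-8")
    return text.encode("iso-8859-1", errors=policy)


sample = "año ἀ".encode("utf-8")
print(convert_lossy(sample))
print(convert_lossy(sample, "ignore"))
print(convert_lossy(sample, "xmlcharrefreplace"))
```

Each policy loses information in a different way, so it is worth running this on a copy and keeping the UTF-8 original.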