[Github-comments] [geany/geany] fails to open Microsoft UTF-16LE file (MSO Word CUSTOM.DIC dictionary file) (#1238)

Colomban Wendling notifications at xxxxx
Mon Sep 19 10:11:21 UTC 2016


```
$ iconv -f UTF-16LE -t UTF-8 < CUSTOM-utf16le-2016.dic > CUSTOM-utf16le-2016.dic_utf8
iconv: illegal input sequence at position 34076
```

Apparently that file has bytes `\xca \xde` near the end, and that doesn't seem to be a sequence accepted by iconv (so I'd guess it really is invalid).
And after reading up on how UTF-16 works, it does indeed seem invalid: the input contains the sequence `0xD7B8 0xDECA`. `0xD7B8` is [HANGUL JUNGSEONG YU-O](http://www.charbase.com/d7b8-unicode-hangul-jungseong-yu-o), which is encoded as a single word (it is at or below `U+FFFF`, and not in the surrogate range `U+D800 - U+DFFF`). The next word is `0xDECA`, which, being in the range `0xDC00 - 0xDFFF`, should be the second word of a two-word pair.  It is not (the previous word is not in the `0xD800 - 0xDBFF` range), so it is invalid.
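
The surrogate-pair rule described above can be checked mechanically. The following is only a minimal sketch in Python, not what iconv or Geany actually do internally; the function name `find_unpaired_surrogates` is made up for illustration, and the sample bytes are just the two words quoted above, not the real file.

```python
import struct

def find_unpaired_surrogates(data: bytes):
    """Return (byte_offset, code_unit) for every unpaired surrogate in UTF-16LE data."""
    # Read the bytes as little-endian 16-bit words (a trailing odd byte is ignored).
    count = len(data) // 2
    units = struct.unpack("<%dH" % count, data[:count * 2])
    problems = []
    i = 0
    while i < count:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # First word of a pair: the next word must be in 0xDC00 - 0xDFFF.
            if i + 1 < count and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            problems.append((i * 2, unit))
        elif 0xDC00 <= unit <= 0xDFFF:
            # Second word of a pair without a preceding first word.
            problems.append((i * 2, unit))
        i += 1
    return problems

# The two words discussed above: 0xD7B8 followed by 0xDECA, in little-endian byte order.
sample = b"\xb8\xd7\xca\xde"
for offset, unit in find_unpaired_surrogates(sample):
    print("unpaired surrogate 0x%04X at byte offset %d" % (unit, offset))
```

Running it on the whole file (read in binary mode) would list every offending word, which could then be compared with the position iconv reports.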

I tested other editors, like GEdit and `vim`, and they both exhibit the same issue, failing to open the file properly.  GEdit opens it almost correctly, but only up to the invalid sequence, warning that the file might be truncated and that saving it could result in data loss.  Vim shows plain garbage.

All I can imagine is that the file is broken, and the other editors you tried either truncate it, or are more forgiving and leave the invalid bytes as-is.  As @elextr explained, we can't really do that because we need UTF-8 encoding in the buffer, so we need to be able to convert to and from it.  With invalid sequences, it wouldn't be possible to restore the original bytes.
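
To illustrate why a forgiving conversion cannot restore the original bytes, here is a small sketch using Python's built-in codecs. This is an assumption-laden stand-in: Geany does its conversion with its own code path, not with Python, but the round-trip problem is the same in principle.

```python
# The two words from the file: 0xD7B8 followed by the stray 0xDECA.
data = b"\xb8\xd7\xca\xde"

# A strict conversion refuses the input, much like iconv does.
try:
    data.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A forgiving conversion substitutes U+FFFD for the invalid word...
lossy = data.decode("utf-16-le", errors="replace")

# ...so converting back yields different bytes than the original file had.
round_tripped = lossy.encode("utf-16-le")

print("strict decode succeeded:", strict_ok)
print("round trip preserved the bytes:", round_tripped == data)
```

Once the invalid word has been replaced, nothing in the buffer records what the original bytes were, so saving would silently change the file.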

I'm actually fairly curious as to what the editors you see it working with actually do with those bytes, and whether they really don't break the file.

Also, there are fairly odd things even in the part that is fully valid UTF-16.  Is the file really supposed to contain things like `B風e-de-mer` on line 194, `C岡r` on line 326, `d诡rtement` on line 453, or `Ŵolian` on line 1737 (penultimate line, and the last before the invalid sequence)?

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/geany/geany/issues/1238#issuecomment-247955937