[Github-comments] [geany/geany] fails to open Microsoft UTF-16LE file (MSO Word CUSTOM.DIC dictionary file) (#1238)

Mon Sep 19 10:49:34 UTC 2016

On Mon, Sep 19, 2016 at 03:11:22AM -0700, Colomban Wendling wrote:
> All I can imagine is that the file is broken, and the other editors
> you tried either truncate it, or are more forgiving and leave the
> invalid bytes as-is.

To be fair, the only program I've used to edit this file (from memory)
is MS Word. But it is interesting as you point out. Perhaps there was an
edit some years ago in an editor which did not have encoding awareness.

> As @elextr explained, we can't really do that
> because we need UTF-8 encoding in the buffer, so we need to be able to
> convert to and from it.  With invalid sequences, it wouldn't be
> possible to restore it.

Very sensible.
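
For what it's worth, the round-trip problem can be sketched in a few
lines of Python. This is only an illustration, not what Geany does
internally: the sample bytes and the helper name `roundtrips` are made
up, and nothing here is taken from the actual CUSTOM.DIC file.

```python
# Hypothetical sample: "Wo" in UTF-16LE followed by a lone low surrogate
# (code unit 0xDC00), which is not valid UTF-16 on its own.
bad = "Wo".encode("utf-16-le") + b"\x00\xdc"
good = "Wo".encode("utf-16-le")


def roundtrips(raw: bytes) -> bool:
    """Return True if raw survives UTF-16LE -> UTF-8 -> UTF-16LE unchanged."""
    # Invalid code units cannot be represented in UTF-8, so the decoder
    # has to substitute something (here U+FFFD) and the original bytes
    # are lost at this point.
    text = raw.decode("utf-16-le", errors="replace")
    utf8_buffer = text.encode("utf-8")  # stand-in for the editor's buffer
    restored = utf8_buffer.decode("utf-8").encode("utf-16-le")
    return restored == raw


print(roundtrips(good))
print(roundtrips(bad))
```

The valid sample comes back byte-for-byte, while the one with the stray
surrogate does not, which is the "wouldn't be possible to restore it"
case.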

Perhaps an editing mode could show all invalid sequences in situ, make
it clear to the user that the file may already be corrupted, and so
allow the user to try to fix the file. As it is, I'll go to either
MS Word on Windows, or Vim on Linux.
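
Outside an editor, something similar can be approximated with a small
script that lists where the invalid sequences are. A rough sketch in
Python, assuming the file really is meant to be UTF-16LE; the function
name `find_invalid_utf16le` and the sample bytes are hypothetical:

```python
def find_invalid_utf16le(raw: bytes):
    """Return (byte_offset, hex_bytes) for each span that fails to decode."""
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-16-le")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice passed in.
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, raw[start:end].hex()))
            # Skip past the bad span; always advance at least one byte.
            pos = max(end, pos + 1)
    return problems


# Made-up sample: valid text, a lone low surrogate, then more valid text.
sample = "Wo".encode("utf-16-le") + b"\x00\xdc" + "li".encode("utf-16-le")
for offset, hexbytes in find_invalid_utf16le(sample):
    print(f"invalid bytes {hexbytes} at offset {offset}")
```

For a real file one would pass in `open(path, "rb").read()`, skipping a
leading BOM if there is one. Each slice is decoded again from the start
of the remaining data, so this is fine for a dictionary file but not
meant for anything large.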

> I'm actually fairly curious as to what the editors you see it working
> with actually do with those bytes, and if they really don't break the
> file.

Only opened it. Checking now...

OK:

- Notepad++ shows the invalid sequences as black rectangles with
  numbers, but stops at one line after "Ŵolian" and does not allow me to
  copy that last line.

- MSO Word has a dozen or so extra entries after that line "Ŵolian", and
  seems to somehow display most entries correctly, except for the very
  last entry.

- Akelpad also stops at the infamous "Ŵolian" line.

- Microsoft Wordpad also stops at the "Ŵolian" line.

So it seems that either:
- the encoding of this file is something other than the
  proclaimed UTF-16LE

- or the file is simply corrupted and Word does a better job
  of displaying the errors

- or what is proclaimed as UTF-16LE is a bastardized version of
  UTF-16LE, or some other Word-specific encoding pretending to be
  UTF-16LE

I don't really know what to do now, except to purge all the
"interesting" entries, and go from there. And that will be adequate...

> Also, there are fairly odd things even in the part fully valid UTF-16.
> Is the file really supposed to contain things like `B風e-de-mer` on
> line 194, `C岡r` on line 326, `d诡rtement` on line 453, or `Ŵolian`
> on line 1737 (penultimate line, and the last before the invalid
> sequence)?

I'm going to assume corruption.

Thank you for your patience and interest.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/geany/geany/issues/1238#issuecomment-247962453