[Github-comments] [geany/geany] fails to open Microsoft UTF-16LE file (MSO Word CUSTOM.DIC dictionary file) (#1238)

Zenaan Harkness notifications at xxxxx
Mon Sep 19 11:12:41 UTC 2016


On Mon, Sep 19, 2016 at 03:30:09AM -0700, Colomban Wendling wrote:
> It's pretty messy but fair enough.  However, we probably won't do
> that, because being able to have a fixed encoding in the data we load
> means that we have to handle encoding conversion in a single place,
> instead of everywhere something touches the data -- and there is a
> lot of code that does that; it's an editor after all.

> Also, as UTF-8 can represent virtually any textual data (anything
> inside Unicode), it would only help with invalid input (like here) or
> binary data (which probably would better be handled with a hex
> filter).  So I'm afraid it won't happen.

> If someone has a nice solution though, I'd love to be proven wrong.

Well, thinking about it, if it was a wanted feature, I would do this as
follows:

- have the raw valid text as a UTF-8 (of course) "linear array"
  (might be a window onto disk for large files, etc)

- indexing layers above this, to quickly identify graphemes, word
  boundaries, line boundaries and any other points of interest, such as:

- "invalid bytes insertion points" along with the corresponding invalid
  byte sequences
  - this way, those parts of the program (most of them) that don't want
    or need to handle invalid bytes, don't have to
  - and you have an easy index to re-insert the invalid sequences on
    saving, or some display/view onto the file that can represent
    invalid bytes
  - and you can offer easy options to the user, such as "save without
    invalid bytes" or "encode invalid bytes according to some format"
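
To make the idea above concrete, here is a minimal sketch in Python (not
Geany code, and not how Geany loads files). It assumes the whole file fits
in memory, and the names split_invalid and rejoin are hypothetical, made up
for this illustration. It keeps the valid text as UTF-8 and records each
invalid byte run next to its insertion point, given as a byte offset into
the valid UTF-8 data:

```python
from typing import List, Tuple

# Hypothetical index entry: (byte offset into the valid UTF-8 data, invalid bytes)
InvalidRun = Tuple[int, bytes]


def split_invalid(raw: bytes) -> Tuple[str, List[InvalidRun]]:
    """Separate raw bytes into valid UTF-8 text and an index of invalid runs."""
    valid = bytearray()
    invalid: List[InvalidRun] = []
    pos = 0
    while pos < len(raw):
        try:
            # Re-slicing on every error is quadratic in the worst case;
            # acceptable for a sketch, not for large files.
            raw[pos:].decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice raw[pos:].
            valid += raw[pos:pos + err.start]
            bad = raw[pos + err.start:pos + err.end]
            if invalid and invalid[-1][0] == len(valid):
                # Adjacent invalid bytes share one insertion point: merge them.
                invalid[-1] = (len(valid), invalid[-1][1] + bad)
            else:
                invalid.append((len(valid), bad))
            pos += err.end
        else:
            # The remainder decodes cleanly.
            valid += raw[pos:]
            pos = len(raw)
    return valid.decode("utf-8"), invalid


def rejoin(text: str, invalid: List[InvalidRun]) -> bytes:
    """Re-insert the recorded invalid runs to rebuild the original bytes."""
    data = text.encode("utf-8")
    out = bytearray()
    prev = 0
    for offset, bad in invalid:
        out += data[prev:offset]
        out += bad
        prev = offset
    out += data[prev:]
    return bytes(out)


if __name__ == "__main__":
    raw = b"caf\xc3\xa9 \xe2\x82 ok\xc0"
    text, invalid = split_invalid(raw)
    print(repr(text), invalid)
    print(rejoin(text, invalid) == raw)
```

In this sketch, "save without invalid bytes" is simply
text.encode("utf-8"), and a lossless save is rejoin(text, invalid). Note
that the offsets only stay correct while the text is unedited; a real
editor would have to shift them on every insert or delete, which is part
of the cost Colomban describes.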

Should be easy, and should also be how the program is implemented.

At least, that's how a superior programmer would implement it ;)

See for reference:
https://zenaan.github.io/zen/javadoc/zen/lang/string.html

Regards, and thanks again,


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/geany/geany/issues/1238#issuecomment-247966199