On Mon, Sep 19, 2016 at 03:30:09AM -0700, Colomban Wendling wrote:<br>

> It's pretty messy but fair enough.  However, we probably won't do<br>

> that, because being able to have a fixed encoding in the data we load<br>

> means that we have to handle encoding conversion in a single place,<br>

> instead of everywhere something touches the data -- and there is a<br>

> lot of code that does that, it's an editor after all.<br>

<br>

> Also, as UTF-8 can represent virtually any textual data (anything<br>

> inside Unicode), it would only help with invalid input (like here) or<br>

> binary data (which probably would better be handled with a hex<br>

> filter).  So I'm afraid it won't happen.<br>

<br>

> If someone has a nice solution though, I'd love to be proven wrong.<br>

<br>

Well, thinking about it, if this were a wanted feature, I would do it as<br>

follows:<br>

<br>

- have the raw valid text as a UTF-8 (of course) "linear array"<br>

  (might be a window onto disk for large files, etc)<br>

<br>

- indexing layers above this, to quickly identify graphemes, word<br>

  boundaries, line boundaries and any other points of interest, such as:<br>

<br>

- "invalid bytes insertion points" along with the corresponding invalid<br>

  byte sequences<br>

  - this way, those parts of the program (most of them) that don't want<br>

    or need to handle invalid bytes, don't have to<br>

  - and you have an easy index to re-insert the invalid sequences on<br>

    saving, or some display/view onto the file that can represent<br>

    invalid bytes<br>

  - and you can offer easy options to the user, such as "save without<br>

    invalid bytes", "encode invalid bytes according to some format", etc.<br>
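
<br>

To make the list above concrete, here is a rough Python sketch of the "invalid bytes insertion points" index only (not the grapheme/word/line layers).<br>
The names `split_invalid`, `reassemble` and `InvalidRun` are made up for illustration; nothing here is Geany or Scintilla code, and a real editor would have to keep the offsets up to date as the text is edited.<br>

```python
from typing import List, Tuple

# Hypothetical index entry: (byte offset into the clean UTF-8 text,
# raw invalid bytes that were removed at that offset).
InvalidRun = Tuple[int, bytes]


def split_invalid(raw: bytes) -> Tuple[bytes, List[InvalidRun]]:
    """Split raw file bytes into valid UTF-8 plus an index of invalid runs."""
    clean = bytearray()
    index: List[InvalidRun] = []
    pos = 0
    while pos < len(raw):
        try:
            # Re-decoding the tail each time is quadratic in the worst
            # case; good enough for a sketch, not for huge files.
            raw[pos:].decode("utf-8")
            clean += raw[pos:]
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice we decoded.
            clean += raw[pos:pos + err.start]
            bad = raw[pos + err.start:pos + err.end]
            if index and index[-1][0] == len(clean):
                # Adjacent invalid bytes share one insertion point.
                index[-1] = (len(clean), index[-1][1] + bad)
            else:
                index.append((len(clean), bad))
            pos += err.end
    return bytes(clean), index


def reassemble(clean: bytes, index: List[InvalidRun]) -> bytes:
    """Re-insert the invalid runs to get the original bytes back on save."""
    out = bytearray()
    prev = 0
    for offset, bad in index:
        out += clean[prev:offset]
        out += bad
        prev = offset
    out += clean[prev:]
    return bytes(out)


if __name__ == "__main__":
    raw = b"abc\xff\xfedef\xe2\x82"
    clean, index = split_invalid(raw)
    print(clean.decode("utf-8"), index)
    print(reassemble(clean, index) == raw)
```

With this split, "save without invalid bytes" amounts to writing `clean` as-is, and a plain save writes `reassemble(clean, index)`; code that only looks at `clean` never sees an invalid byte.<br>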

<br>

Should be easy, and should also be how the program is implemented.<br>

<br>

At least, that's how a superior programmer would implement it ;)<br>

<br>

See for reference:<br>

https://zenaan.github.io/zen/javadoc/zen/lang/string.html<br>

<br>

Regards, and thanks again,<br>

