<pre><code>$ iconv -f UTF-16LE -t UTF-8 &lt; CUSTOM-utf16le-2016.dic &gt; CUSTOM-utf16le-2016.dic_utf8
iconv: illegal input sequence at position 34076
</code></pre>


<p>Apparently that file has the bytes <code>\xca \xde</code> near the end, and iconv doesn't accept that sequence (so I'd guess it really is invalid).<br>

After testing and reading up on how UTF-16 works, it does indeed seem invalid: the input contains the sequence <code>0xD7B8 0xDECA</code>. <code>0xD7B8</code> is <a href="http://www.charbase.com/d7b8-unicode-hangul-jungseong-yu-o">HANGUL JUNGSEONG YU-O</a>, which is encoded as a single 16-bit word (it is no higher than <code>U+FFFF</code>, and not in the range <code>U+D800 - U+DFFF</code>). The next word is <code>0xDECA</code>, which is in the range <code>0xDC00 - 0xDFFF</code> and so should be the second word of a two-word pair.  It is not (the previous word is not in the <code>0xD800 - 0xDBFF</code> range), so it is invalid.</p>
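<p>The pairing rule described above can be sketched as a small Python scan over the 16-bit words. This is only an illustration, not what iconv or Geany actually do; the function name <code>find_unpaired_surrogates</code> is made up for this example, and it assumes little-endian input with no BOM.</p>

```python
import struct

def find_unpaired_surrogates(data: bytes):
    """Return (byte_offset, word) for each unpaired surrogate in UTF-16LE data.

    Hypothetical helper for illustration; a trailing odd byte is ignored.
    """
    n = len(data) // 2
    words = struct.unpack("<%dH" % n, data[:n * 2])
    bad = []
    i = 0
    while i < n:
        w = words[i]
        if 0xD800 <= w <= 0xDBFF:
            # First word of a pair: the next word must be in 0xDC00 - 0xDFFF.
            if i + 1 < n and 0xDC00 <= words[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append((i * 2, w))
        elif 0xDC00 <= w <= 0xDFFF:
            # Second word of a pair with no valid first word before it.
            bad.append((i * 2, w))
        i += 1
    return bad

# The two words from the file, as little-endian bytes.
for offset, word in find_unpaired_surrogates(b"\xb8\xd7\xca\xde"):
    print("unpaired surrogate 0x%04X at byte offset %d" % (word, offset))
```

<p>Run on the whole file, a scan like this should report the same place iconv complains about, assuming that is the only invalid sequence.</p>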


<p>I tested other editors, like GEdit and <code>vim</code>, and they both exhibit the same issue, failing to open the file properly.  GEdit opens it almost correctly, but only up to the invalid sequence, warning that the file might be truncated and that saving it could result in data loss.  Vim shows plain garbage.</p>


<p>All I can imagine is that the file is broken, and the other editors you tried either truncate it, or are more forgiving and leave the invalid bytes as-is.  As <a href="https://github.com/elextr" class="user-mention">@elextr</a> explained, we can't really do that because we need UTF-8 encoding in the buffer, so we need to be able to convert to and from it.  With invalid sequences, it wouldn't be possible to restore the original bytes.</p>
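<p>To illustrate why the round trip matters, here is a minimal Python sketch (again not Geany's code; <code>roundtrip_via_utf8</code> is a hypothetical name). A strict decode rejects the bytes, and a lenient decode that substitutes a replacement character can no longer reproduce the original bytes when converting back.</p>

```python
def roundtrip_via_utf8(data: bytes, errors: str = "strict") -> bytes:
    """Decode UTF-16LE, pass through UTF-8, and re-encode as UTF-16LE."""
    text = data.decode("utf-16-le", errors)
    utf8_buffer = text.encode("utf-8")  # what an editor buffer would hold
    return utf8_buffer.decode("utf-8").encode("utf-16-le")

broken = b"\xb8\xd7\xca\xde"  # 0xD7B8 followed by the lone 0xDECA

try:
    roundtrip_via_utf8(broken)
except UnicodeDecodeError as exc:
    print("strict decode fails:", exc)

# Lenient decoding succeeds, but the invalid word is replaced, so the
# bytes written back differ from the bytes that were read.
print(roundtrip_via_utf8(broken, errors="replace") == broken)
```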


<p>I'm actually fairly curious what the editors you see it working with do with those bytes, and whether they really don't break the file.</p>


<p>Also, there are fairly odd things even in the part that is fully valid UTF-16.  Is the file really supposed to contain things like <code>B風e-de-mer</code> on line 194, <code>C岡r</code> on line 326, <code>d诡rtement</code> on line 453, or <code>Ŵolian</code> on line 1737 (the penultimate line, and the last one before the invalid sequence)?</p>




