Windows filenames (NTFS) may contain unicode characters. Passing the filename to `geany.exe` as command line argument does not open the file.
At startup, it can get all arguments in UTF-16, and convert them to UTF-8. So filenames containing unicode characters can be passed to `geany.exe`. You can view, comment on, or merge this pull request online at:
https://github.com/geany/geany/pull/1258
-- Commit Summary --
* Provide utf-8 command line arguments on Windows
-- File Changes --
M src/main.c (36)
-- Patch Links --
https://github.com/geany/geany/pull/1258.patch https://github.com/geany/geany/pull/1258.diff
Does this work if the locale is a non-unicode code page?
Should work, regardless of locale settings.
I'm not sure if it's necessary but have you checked you don't have to handle argv[0] the same way ? (Eg geany is in an directory containing unicode characters)
I created the pull request mainly for the convenience of "open with" Geany on Windows
On the other hand it should be very rare for Geany's installation path to contain unicode character. And I'd like to let geany launch as fast as possible when I press Win+R and type `geany`. So I avoided the conversion logic when `argc == 0`.
I guess maintainers can make the final decision.
I think we should handle the argv[0] issue as well. But only if it is necessary at all, see below.
Can you provide a test case for this? I can open files on a NTFS drive with non-ASCII characters in their path in Geany either from the command line (cmd.exe) as well as using the "Open With" Explorer menu item.
So, there must be some more criteria which involve the behaviour.
This can be reproduced when filename character is not supported by system locale's encoding. E.g. When system locale is "English (United States)", double clicking `read我.txt` opens an empty tab in Geany with name `read?.txt`. The file is not really opened.
In the initial description you talk about "unicode". Unicode characters can be saved using different encodings, this can be UTF-8, UTF-16 and so on. We cannot know which encoding the filename of a file on disk was used.
Yes, we can assume UTF-16 or we can assume UTF-8, maybe even both by trying. But then the next user wants UTF-32 BE, then UTF-32 LE and whatever else.
The behaviour is probably very similar on non-Windows platforms. I think usually filenames should follow the system's locale. Mixed charsets are never a good idea.
Anyway, I don't know how I could create a file on Windows with a filename not in the system's locale and so cannot really test it. I just tested your changes with non-ASCII filenames in the system's locale and it still works.
Before this could get merged, two remarks: - could you try to handle argv[0] as well as mentioned above - it'd be nice if you could move the Windows specific code into a new function in src/win32.c, then src/main.c gets less cluttered. Usually we try to have the weird Windows-specific ~~hidden~~abstracted in src/win32.c :)
@cshu pushed 1 commit.
e9e7c11 most-logic-in-win32
I moved most code to win32.c, and included argv[0] in conversion.
Different file systems can use various encoding for filenames. (NTFS, FAT32, etc.) But `CommandLineToArgvW` always provide filenames in UTF-16. So in the case of windows api functions, underlying file system usually should not be a problem.
It's easy to get some random CJK characters from typing code unit. (Ctrl+Shift+I on firefox/chrome, in the console, type '\u6666')
eht16 commented on this pull request.
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv)
Maybe `win32_convert_argv_to_utf8()` would be a better name.
+{
+ int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i) + { + i--; + utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL); + } + *pargc = num_arg; + *pargv = utf8argv; + LocalFree(szarglist); +} + +void win32_free_argv_made_in_utf8(gint argc, gchar **argv)
The code is quite generic and not Windows-specific, so it could go into src/utils.c as `utils_free_argv()`, similar to `utils_free_pointers()`.
@@ -1231,6 +1230,9 @@ gint main_lib(gint argc, gchar **argv)
#endif
gtk_main(); +#ifdef G_OS_WIN32 + win32_free_argv_made_in_utf8(argc, argv);
Maybe the call including the G_OS_WIN32 guard could go into a seperate function in src/libmain.c as it should be also in https://github.com/geany/geany/blob/master/src/libmain.c#L1090. Strictly there are more places where it should be called, e.g. every place where we call `exit()` but this would require some refactoring before which is out of scope of this PR (and technically it is not very necessary as the memory in cleaned up anyway on `exit()`).
eht16 requested changes on this pull request.
Thanks for the hints for reproducing. After being able to reproduce the issue and testing your changes, it worked fine.
Thanks.
b4n requested changes on this pull request.
The idea seems good, and without having actually tested the PR I can confirm from other tests I did yetserday that the combination of `GetCommandLineW()` and `CommandLineToArgvW()` seems to work just fine.
+{
+ int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i) + { + i--; + utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL); + } + *pargc = num_arg; + *pargv = utf8argv; + LocalFree(szarglist); +} + +void win32_free_argv_made_in_utf8(gint argc, gchar **argv)
should actually use `g_strfreev()` as the `argv[]` array ends with a `NULL` element.
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv) +{ + int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i)
I'd prefer a `for()` loop like ```C for (int i = 0; i < num_arg; i++) utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL); ```
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv) +{ + int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i) + { + i--; + utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL);
Checking for error might be needed, as [Windows seems to allows invalid UTF-16 in filenames](https://en.wikipedia.org/wiki/UTF-8#WTF-8), and I'm afraid `g_utf16_to_utf8()` doesn't allow this. Another solution might be a custom "broken UTF-16 to WTF-8" converter, like I played with in https://github.com/b4n/wtf8tools. It seems most places in GLib don't have problem with WTF8, as it's structurally fine and "just" encodes weird stuff.
Anyway, something should be done because only the last element of `argv[]` should ever be `NULL`, and I wouldn't be surprised it'd crash somewhere if it wasn't respected.
OK, this is a low-risk issue I guess (invalid UTF-16 filenames), but still a theoretically possible one.
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv)
Agreed
@@ -1231,6 +1230,9 @@ gint main_lib(gint argc, gchar **argv)
#endif
gtk_main(); +#ifdef G_OS_WIN32 + win32_free_argv_made_in_utf8(argc, argv);
I'm not sure it's worth having a function which is a no-op on non-Windows just for not duplicating one guarding of a free; if we really don't want duplication here we find a way to only have one exit path :)
@@ -1231,6 +1230,9 @@ gint main_lib(gint argc, gchar **argv)
#endif
gtk_main(); +#ifdef G_OS_WIN32 + win32_free_argv_made_in_utf8(argc, argv);
Alternatively the conversion and freeing could be done in `main()` itself (from *main.c*), so `main_lib()` simply receives a nice argument list directly:
```C int main(int argc, char **argv) { int ret; #ifdef G_OS_WIN32 /* on Windows, get the Unicode argulment list and convert it to UTF-8 into argv[] */ LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &argc);
argv = g_new(char *, argc + 1); for (int i = 0; i < argc; i++) argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL); argv[argc] = NULL; LocalFree(szarglist); #endif
ret = main_lib(argc, argv);
#ifdef G_OS_WIN32 g_strfreev(argv); #endif
return ret; } ```
b4n commented on this pull request.
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv)
Actually recent GLib (2.40+) have `g_win32_get_command_line()` which basically does just that.
(and doesn't bother with checking for error in the conversion, so it might be "alright" -- or maybe they find it OK because they don't return arg count and expect callers to use `g_strv_length()` which that would simply truncate the list, not sure)
Hum, the [docs on `g_win32_get_command_line()`](https://developer.gnome.org/glib/unstable/glib-Windows-Compatibility-Functio...) mention that [`g_option_context_parse()`](https://developer.gnome.org/glib/unstable/glib-Commandline-option-parser.htm...) (that we use) expects locale-encoded strings, not UTF-8 ones. GLib 2.40+ has [`g_option_context_parse_strv()`](https://developer.gnome.org/glib/unstable/glib-Commandline-option-parser.htm...) which expects *GLib filename encoding* (that is, UTF-8 on Windows), but it's fairly recent.
(Or we could please @codebrainz and rewrite using GApplication which does this for us… but that would be a lot of changes :))
(Or we could please @codebrainz and rewrite using GApplication which does this for us… but that would be a lot of changes :))
or maybe qapplication ;-P
codebrainz commented on this pull request.
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv) +{ + int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i) + { + i--; + utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL);
IMO, it would be cleaner if it didn't hardcode UTF-16 assumptions. While it's highly unlikely to change any time soon, Win32 does provide functions to convert to/from various encodings to its native "wide" encoding without hard-coding what that is.
b4n commented on this pull request.
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv) +{ + int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i) + { + i--; + utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL);
@codebrainz myeah, theoretically it might, but in practice it won't change (nobody can afford that honestly), and that Windows API is a lot more annoying to use for us. But if it does handle the invalid Unicode filenames and `g_utf16_to_utf8()` doesn't, it might be worth it.
codebrainz commented on this pull request.
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv) +{ + int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i) + { + i--; + utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL);
that Windows API is a lot more annoying to use for us
It's not _a lot_ more annoying, just `WideCharToMultiByte` with `CPUTF8`. Could even write a little function like `win32_wide_to_utf8` that hides the pre-flight pass and allocates the memory with GLib's allocator. It's maybe 5-10 lines of code to write that function.
b4n commented on this pull request.
@@ -1032,4 +1032,31 @@ gchar *win32_get_user_config_dir(void)
return g_build_filename(g_get_user_config_dir(), "geany", NULL); }
+ +void win32_make_argc_and_argv_in_utf8(gint *pargc, gchar ***pargv) +{ + int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i) + { + i--; + utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL);
Well it's totally doable, but it involves calling `WideCharToMultiByte()` twice (or over-alloc `len*4+1` bytes), allocating manually and the like. Indeed nothing complicated, but at least 4 lines or so worth of WINAPI code, where we basically have access to a 1-line thing. So if it gives us something, yeah sure, otherwise I'm not really sure it's worth it.
Hell, let's see how it'd look: ```C static gchar *wcstr_to_utf8(wchar_t *wcstr) { int len = WideCharToMultiByte(CP_UTF8, 0, wcstr, -1, NULL, 0, NULL, NULL); gchar *utf8str = g_malloc(len + 1); WideCharToMultiByte(CP_UTF8, 0, wcstr, -1, utf8str, len + 1, NULL, NULL); utf8str[len] = 0; // FIXME: is that useful? return utf8str; } ``` And according to the [docs](https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%...) it probably doesn't really help:
**WC_ERR_INVALID_CHARS**: **Windows Vista and later:** Fail if an invalid input character is encountered. If this flag is not set, the function silently drops illegal code points. A call to GetLastError returns ERROR_NO_UNICODE_TRANSLATION. Note that this flag only applies when CodePage is specified as CP_UTF8 or 54936 (for Windows Vista and later). It cannot be used with other code page values.
So my understanding would be that in case of invalid UTF-16 (as it is possible in filenames), it would result in uninteresting UTF-8, while we'd actually want "WTF-8".
eht16 commented on this pull request.
@@ -1231,6 +1230,9 @@ gint main_lib(gint argc, gchar **argv)
#endif
gtk_main(); +#ifdef G_OS_WIN32 + win32_free_argv_made_in_utf8(argc, argv);
Yeah, also nice. But I'd still vote to put the Windows specific code in the first G_OS_WIN32 block into src/win32.c.
eht16 commented on this pull request.
+{
+ int num_arg; + LPWSTR *szarglist = CommandLineToArgvW(GetCommandLineW(), &num_arg); + char **utf8argv = g_new0(char *, num_arg + 1); + int i = num_arg; + while(i) + { + i--; + utf8argv[i] = g_utf16_to_utf8((gunichar2 *)szarglist[i], -1, NULL, NULL, NULL); + } + *pargc = num_arg; + *pargv = utf8argv; + LocalFree(szarglist); +} + +void win32_free_argv_made_in_utf8(gint argc, gchar **argv)
Absolutely, I missed the point that the array is NULL-terminated. Sorry.
@cshu still interested in this PR? I think it would be cool and the sooner we get it merged (regarding the next release) the better it is.
@eht16 No, I leave it to you. This PR is a small change.
I wonder, what makes this Windows-specific? On-disk file names can be arbitrary encoding on all platforms, right?
github-comments@lists.geany.org