Without this patch, openSUSE's glfw package would always differ in `/usr/share/geany/tags/glfw.c.tags` because inode numbers differ between builds (that happen in disposable VMs)
This was previously discussed in https://bugzilla.opensuse.org/show_bug.cgi?id=1049382 You can view, comment on, or merge this pull request online at:
https://github.com/geany/geany/pull/1989
-- Commit Summary --
* Make geany -g tags output reproducible
-- File Changes --
M src/tagmanager/tm_workspace.c (19)
-- Patch Links --
https://github.com/geany/geany/pull/1989.patch https://github.com/geany/geany/pull/1989.diff
codebrainz commented on this pull request.
@@ -477,7 +460,7 @@ static GList *lookup_includes(const gchar **includes, gint includes_count)
size_t idx_glob; #endif
- table = g_hash_table_new_full(tm_file_inode_hash, g_direct_equal, NULL, g_free); + table = g_hash_table_new_full(g_str_hash, g_direct_equal, NULL, g_free);
Should the 2nd parameter be `g_str_equal`, or is it meant to compare the address/pointer of the keys?
bmwiedemann commented on this pull request.
@@ -477,7 +460,7 @@ static GList *lookup_includes(const gchar **includes, gint includes_count)
size_t idx_glob; #endif
- table = g_hash_table_new_full(tm_file_inode_hash, g_direct_equal, NULL, g_free); + table = g_hash_table_new_full(g_str_hash, g_direct_equal, NULL, g_free);
It probably does not make much of a difference, so I updated it to `g_str_equal`
elextr commented on this pull request.
@@ -477,7 +460,7 @@ static GList *lookup_includes(const gchar **includes, gint includes_count)
size_t idx_glob; #endif
- table = g_hash_table_new_full(tm_file_inode_hash, g_direct_equal, NULL, g_free); + table = g_hash_table_new_full(g_str_hash, g_direct_equal, NULL, g_free);
It does in fact matter, the key is a newly copied string, and you need to compare the strings, not the pointers to the strings so `g_str_equal` is right.
You don't say what the actual problem is, but I presume your problem is that the order of tags changes because the order of files in the list `lookup_includes()` returns changes as the inode changes. If you want reproducible builds, instead of relying on application internals and globbing orders you should list the files explicitly instead of using globs.
This change will also fail to de-dup paths that are linked to the same file, but hopefully that will be rare.
If we can live with that, performance wise this should be better (stated without benchmarking of course :) because it no longer stats the filesystem to make a hash key.
@elextr we did list files explicitly in https://build.opensuse.org/package/view_file/openSUSE:Factory/glfw/glfw.spec line 88 (still missing a `| sort` in that version), but even then, the hashing influenced entries in the output.
the hashing influenced entries in the output.
Geany either:
1. if `-P` creates a temporary file full of `#include`s one for each file in the list and uses the C preprocessor to combine them, or
2. if no `-P` just copys the content of all the files in the list together into a temporary file and parses that.
Both of these do it in the hash table list order.
So, maybe to ensure reproducible output you can pre-combine the files you want in the order you want by running the C preprocessor first and then pass the result of that to Geany.
If you can get reproducible output without depending on implementation details that might change to support other languages besides C, that would be good.
@elextr I do not understand 'This change will also fail to de-dup paths that are linked to the same file' I thought, the old code did not do that either, but just hashed it into the same bucket.
Diffs looked like http://rb.zq1.de/compare.factory-20180730/glfw-compare.out ```diff --- old//usr/share/geany/tags/glfw.c.tags 2017-02-12 12:00:00.000000000 +0000 +++ new//usr/share/geany/tags/glfw.c.tags 2017-02-12 12:00:00.000000000 +0000 @@ -525,23 +525,23 @@ DN_MULTISHOTÌ65536Ö0 DN_RENAMEÌ65536Ö0 DTTOIFÌ131072Í(dirtype)Ö0 -DT_BLKÌ4Îanon_enum_26Ö0 +DT_BLKÌ4Îanon_enum_29Ö0 DT_BLKÌ65536Ö0 -DT_CHRÌ4Îanon_enum_26Ö0 +DT_CHRÌ4Îanon_enum_29Ö0 DT_CHRÌ65536Ö0 [...] -KHR_xlib_surfaceÌ64Î_GLFWlibrary::anon_struct_20Ö0ÏGLFWbool +KHR_xlib_surfaceÌ64Î_GLFWlibrary::anon_struct_4Ö0ÏGLFWbool ```
So is there any reason against merging this change?
@elextr I do not understand 'This change will also fail to de-dup paths that are linked to the same file'
I thought, the old code did not do that either, but just hashed it into the same bucket.
Yeah, it looks like that, which seems absurd. My guess would be that the goal was to do what @elextr expects, that is using the inode itself as a key, which indeed would have the nice property of avoid re-computing the same file even under a different path, but as is it actually leads to *never* match any entry (as `g_direct_equal()` on a different string pointer will always deem those different). So your changes are better than the previous behavior, as at least identical paths will match.
However, this function seems a bit absurd. I don't see why it doesn't simply builds the list directly, because traversing a GHashTable is not very fast, and IIUC it wouldn't change anything. So we could probably change it to just do that, which would be simpler and fix everyone's concerns.
Also, relying on the order of traversal of a GHashTable seems fragile, I don't think it is guaranteed to be stable across GLib versions, architectures or whatnot.
@b4n, no the current code works just fine, the hash key is the [inode](https://github.com/geany/geany/blob/97547edb52e34b8f78f44a13548a0fb55baf1a44...), which is an `int` stored in the pointer, so `g_direct_equal()` is correct for the current system. But it has to be changed to `g_str_equal()` if the key becomes the filename string.
Originally a list was [used](https://github.com/geany/geany/blob/d6c16742b769947914b63a3a24e0b1440593d43d...) but then it was [de-duped](https://github.com/geany/geany/blob/d6c16742b769947914b63a3a24e0b1440593d43d...), so likely thats why it got changed to a hash but that change is copied from [Anjuta](https://github.com/geany/geany/blob/d3dfe1dd1d7aa96be821ba175624843c65fe970b...), so who knows.
Not sure why duplicates are a problem, TM should handle multiple definitions of the same tagname? @b4n TM-spurtese needed.
@bmwiedemann did you try running `cpp` yourself? Or you could just `cat` the files together if you don't want recursive inclusion, thats all Geany is doing internally.
@b4n, no the current code works just fine, the hash key is the [inode](https://github.com/geany/geany/blob/97547edb52e34b8f78f44a13548a0fb55baf1a44...)
Not it's not, it's *hashed* as an inode, but the [key is](https://github.com/geany/geany/blob/97547edb52e34b8f78f44a13548a0fb55baf1a44...) [a filename](https://github.com/geany/geany/blob/97547edb52e34b8f78f44a13548a0fb55baf1a44...), so the equal function will compare the string pointers.
Originally a list was [used](https://github.com/geany/geany/blob/d6c16742b769947914b63a3a24e0b1440593d43d...) but then it was [de-duped](https://github.com/geany/geany/blob/d6c16742b769947914b63a3a24e0b1440593d43d...)
Well, if we used a hash table as a mean of deduping and a list as a mean of well, building the list, it would not be a problem.
Not it's not, it's hashed as an inode, but the key is a filename, so the equal function will compare the string pointers.
ahhh, ok. So a) it doesn't work now because its wrong, and b) comparing strings won't actually work either if both paths resolve to the same inode, the hash table will see the hash collide but the strings being different and add it again.
So using inode as the hash is just a slow way of making a hash of a file path. In fact Anjuta seems to have made a right hash of the whole thing [pun intended]. Maybe thats why it switched to sqlite for its tags later.
Well, if we used a hash table as a mean of deduping and a list as a mean of well, building the list, it would not be a problem.
Yeah, since 2.40 `g_hash_table_insert()` returns if it inserted or not, so the filename could be added to the list on the fly. But Glib 2.40 didn't exist back when this was originally written.
So using inode as the hash is just a slow way of making a hash of a file path.
Well, the inode gets a goodish hash, but the equal check makes it worse than a string hash and comparison on the file path yeah. If really the inode thing was useful, it should be used as the key directly rather than a bad convoluted hash that's guaranteed to have a worse `stat()` call count anyway.
Yeah, since 2.40 `g_hash_table_insert()` returns if it inserted or not, so the filename could be added to the list on the fly. But Glib 2.40 didn't exist back when this was originally written.
The code [already checks if the hash table contains the element before inserting it](https://github.com/geany/geany/blob/97547edb52e34b8f78f44a13548a0fb55baf1a44...), so it should be OK no matter which version is used.
The code already checks if the hash table contains the element before inserting it, so it should be OK no matter which version is used.
Except that it will always fail to find it due to the bug, so it gets added anyway.
If really the inode thing was useful, it should be used as the key directly rather than a bad convoluted hash that's guaranteed to have a worse stat() call count anyway.
Well, as I said above, everything gets added because a hash collision won't stop different paths being added, and due to the bug all paths are seen as different at the moment anyway, so duplicates so far havn't caused any "trouble" (the word from the original de-dup comment). @b4n do you know what "trouble" they might cause?
So clearly the inode stuff is not useful, and we should simply use `g_str_hash()` or totally forget de-duping since so far the fact that its failing hasn't caused problems (except perhaps some tags files are silently bigger than they need to be, but nobody has noticed and complained).
Isn't the inode only unique within the same file system? We can't assume all files are on the same fs.
@kugel- true.
But it doesn't matter, its a hash not the key, they can collide, it just costs time if they do. Nothing should break. Also what do you get for "inode" for files served via Samba?
But the fact that the key comparisons are wrong in the current code means it just creates a list of all the files using a g_hash in pseudo random order, then copies that list to a g_list, no de-duping, nothing, nada, useless waste of time.
So unless somebody can identify the "trouble" having multiple copies of the same file will cause (apart from some space and time) then I think we should just expand the globs straight into the g_list and thats it.
I agree, keep it simple, especially if it allows this issue to be resolved.
Also waiting to see if the OP had success `cat`ing the files themselves, or running `cpp` themselves.
Even if we go the `g_list` route, reproducible output will still not be guaranteed, just an implementation side effect. That may be acceptable of course.
yes, cat'ing files on the caller side also helps:
```diff -geany -c geany_config -g glfw.c.tags $(find src ( ! -name CMakeFiles ) -type f ( -iname "*.c" -o -iname "*.h" ) ( ! -iname "win32*" ) ( ! -iname "cocoa*" ) | sort -) include/GLFW/glfw3.h +cat $(find src ( ! -name CMakeFiles ) -type f ( -iname "*.c" -o -iname "*.h" ) ( ! -iname "win32*" ) ( ! -iname "cocoa*" ) | sort) include/GLFW/glfw3.h > tags.in.h +geany -c geany_config -g glfw.c.tags tags.in.h ```
but having tools themselves more deterministic is still a plus, because it avoids having to patch all callers.
but having tools themselves more deterministic is still a plus, because it avoids having to patch all callers.
Sure, but its a workaround until Geany is changed.
Also in many cases the callers __should__ care which order files are included, for context sensitive languages like C/C++ knowing that an identifier is a type name is important to correct parsing. So you want to control order so that the declaration of the type name is before the use. Essentially you want things to be seen in the order that the compiler will see them.
So the files need to be explicitly specified in the correct order.
That means Geany should keep the order the same as the command line, except that where a command line item is a glob, it is expanded to all matching items.
This behaviour is deterministic so meets both requirements.
See https://github.com/geany/geany/pull/1991
Closed #1989 via fb39f67f90376f3334dd78a80ef984cc5373bae0.
github-comments@lists.geany.org