Hi,
Le 26/04/2012 17:02, Nick Treleaven a écrit :
On 24/04/2012 22:31, Colomban Wendling wrote:
Le 17/04/2012 18:20, Nick Treleaven a écrit : Sorry for the long delays -- and also small activity -- recently. I have/had a lot of non-Geany stuff to do and stuff, the whole story, you know.
No problem.
I finally committed it and pushed it so you can see it, comment, blame, flame& more: see https://github.com/b4n/geany/tree/wip%2Fctagsmanager
A few points, as they comes to my mind:
- it support asynchronous parsing (though not concurrent parsing);
What's the difference? Also, what does it buy us?
What I meant when saying it's asynchronous but not concurrent is that it supports launching the parsers in a separate thread, but it cannot launch several parsers at once. This is because CTags parsers aren't thread-safe (have a lot of global states and no locks).
What asynchronous parsing gives us is quite simple: no blocking. This means that a slow parser (like e.g. the HTML one on Windows before you changed the regex library) wouldn't lock Geany. Same for project plugins that want to parse thousands of files: this would still take a long time, but Geany would be still usable in the meantime.
- it is under the new ctagsmanager/ directory;
- it uses the same tag parsers tagmanager used, in ctagsmanager/ctags;
- all types have different names than the tagmanager ones, though currently the API is almost an 1:1 mapping -- and that's maybe a huge mistake?;
I don't understand why tagmanager has to be replaced, why not just replace the parts you want to improve? Rewriting it is likely to lead to a new set of bugs and be hard to review and merge changes from master.
I think Lex and Matthew probably summarize it quite correctly: it's not that TM is bad or has to be replaced; it is that (I'm) unable to understand it enough to fix anything in it, like scope completion. Maybe it's only me, but I tried hard :)
And the only reason why maybe TagManager would need replacement/large changes is asynchronous parsing. I'm not sure how hard it would be to get it with TM -- but again, maybe it's only that I don't understand it enough.
- there is 2 backends as of today:
- a "simple" one that is simple and that doesn't waste (too much) memory, but that searches in O(n);
Linear searching seems like a bad idea when we currently have O(log n). Removing x tags is O(n * x) by the look of it.
I agree it's not interesting regarding performances, and this backend isn't meant for it. I needed an implementation for early testing, so I wrote it simple. And it showed useful for testing later on too, since because of it's simplicity it isn't bugged -- just slow :)
BTW, how does TagManager do fast searches? I always though it did a sorting with new attributes each time they changed, so in such situations it's even worse than O(n), isn't it?
- a "multi-cache" one that, as its name suggests, maintains multiple caches (sorted tags arrays). it uses a little more memory and is slower on insertion since it maintains several sorted lists, but a search on an already existing cache is very fast.
Won't this be slow for adding many tags at once? How is this design better than tagmanager?
Well, adding a tag would require a bsearch() on each cache, yes. However, adding tags is mostly done in a separate thread (if async parsing it used), so it can be less of a problem.
And a search is simply a bsearch() (O(n log n), right?) given the cache already exists. If the cache doesn't exist, it has to be created so yeah on the first search it's slow.
Given my global tags merge change (see below), I think tagmanager needs some changes to avoid sorting the workspace tags array each time tags are updated, but this is certainly doable.
In practice I haven't yet tested anything big enough to see any difference in performances between these two backends, and that's something that probably isn't worth bothering about for now.
- this "backend" abstraction might be really overkill, and maybe we could do better without it?;
I don't see why having two is better. The memory overhead for a pointer array is not much vs. the size of the tag structures. Fast searching is important.
Multiple backends isn't really useful probably. As said above, the first was mostly a testing one, and I wanted to be able to keep both during development for better testing. At one point I also though maybe there could be some backends optimized for particular situations, like fast search, fast insertion, low memory consumption, etc., but I don't see much use for this anymore.
- tags (and most types) are reference-counted (so they aren't necessarily duplicated, e.g. in the multicache backend);
I don't really understand src/symbols.c since the real-time parsing change, so don't really understand why this is needed.
It's not strictly needed, but it makes some memory management easier, and fits well with GTK memory management. And this last point helps a lot to maintain the GtkTreeStore on src/symbols.c, now tags are updated and not removed. But that's not new, I already added this in TM.
BTW, what don't you understand in symbols.c? Maybe I could improve the docs.
- tag matching/finding is done using ctm_data_backend_find() (or a convenience wrapper), which takes 2 functions for performing the search:
- a sort function, used to, heh, sort the tags to search and/or the resulting list (the "simple" backend should use it to sort the result; and the "multi-cache" backend uses it to sort the caches)
It doesn't sound a good idea to have to sort the haystack on each lookup. Lookup should be very fast, like tagmanager. Adding/removing tags is always less important because real-time updates can be disabled anyway.
That's why the multicache backend (the one supposed to be somewhat fast) only creates new cache on lookup if there isn't compatible cache yet, which should happen roughly only once, on the first lookup.
However, I still don't understand how TM does address the issue, and I though it actually re-sorted the array on lookup if the attributes changed. And IIUC too, the attributes may change often if performing search A and the B, then A again, the A, B, A, B and so forth.
- a match function, use to check whether a tag should be included in the results. Like the sort function it returns a strcmp()-like value, with the only difference that it probably returns 0 for more tags.
It is somewhat similar to what tagmanager did, but it's more flexible -- and maybe complex, though many sort/match functions would already be provided.
- no pruning is done yet, so there is duplicate in the results;
- there is an (almost) working scope completeion implementation;
- ... I don't see anything to add, so I'll stop here :)
All this isn't of course written in stone: if we already redo a lot of things, let's get something nice -- though IMO we'll always be better than with tagmanager, since each time I wanted to touch it it took me hours, and sometimes I even haven't been able to fix the problem (thinking of e.g. scope completion...).
I don't really see what the problem understanding it is. I thought scope completion was just tm_workspace_find_scoped and related functions, not some tagmanager-wide problem.
It's probably not some TM-wide problem, but I tried to understand what was the problem, but never got to understand that thing. And certainly not how I would add some more precision in the scope search, like same file, close to the position, etc.
Also, later in the thread he says that performance problems with resorting global and workspace tags cannot be fixed with the design of tagmanager. I've been working yesterday on improving this significantly by merging the new tags each time instead of resorting *all* the tags. I hope to commit this in the next few days.
Cool! I haven't done any profiling on either tagmanager or may new attempt, so I can't tell what's actually slow, but any improvement is good to have :)
I finished the global tags merging change: https://github.com/geany/geany/commit/712cdd6aa0005f65bd77b15c34cd2906786462...
IMO this should make loading global tags fast enough, even for heavy use, as the whole array is now only traversed once.
That's a really good idea yeah, good job :)
For avoiding resorting of workspace tags when only reparsing one source object, we can remove the source object's old tags & merge the new tags. This is all O(n) for the workspace array. I haven't looked into implementing this yet because now it's clear you're working on a competing change.
I'm not sure how it is for now (actually i didn't check just now), but yes it looks sensible.
The other changes I've made were in response to a user's problem with generating C tag files, and some changes to the CTags files.
Good. However not stripping G_BEGIN_DECLS confuses the CTags parser and makes particularly makes it think the first identifier after it has "G_BEGIN_DECLS" type; which then may cause problems for e.g. scope completion.
Regards, Colomban