Regarding the regex ctags parser vs. Geany's parser: we could use the regex ctags version. I was just thinking that, since we already have the hand-written version in Geany, it could serve as the base for a hand-written parser that might eventually be submitted upstream, so I kept the `geany_` parser. Hand-written parsers tend to offer more flexibility in parsing and are usually much faster than regex parsers. But before such a parser could be submitted upstream, it would have to offer all the functionality the current regex parser offers.
I think the custom parser is currently a bit messy and could use some restructuring, but your point is valid. Right now, I think the only real advantage of the regex version of the parser is its readability, but it doesn't seem to be really leveraging the full power of regexes -- it is rather simple and can probably be translated to "plain C" easily. (Also, at first glance, it seems that those regexes aren't too good either; for example, I believe the second one will match `functionality = 42` too.)
Re: speed, I see four "levels" at which the parser can be implemented:

1. Compare characters one by one
2. `strncmp()` and `strstr()`
3. **`sscanf()`** :bulb:
4. Regular expressions
I think the regex parser could easily be re-implemented using `sscanf()` as a faster alternative to regexes, so if that's an option I think it'd be an elegant solution -- readable, efficient, and less prone to errors than options 1 and 2.
Something like

```c
if (sscanf(line, " function [%*[^]%]] = %[A-Za-z0-9_]", buffer) == 1)
    ...
if (sscanf(line, " function%*[ \t]%*[A-Za-z0-9_] = %[A-Za-z0-9_]", buffer) == 1)
    ...
```

etc. (where `" "` matches zero or more whitespace chars, `"%*[ \t]"` matches one or more spaces/tabs, `"%*[^]%]"` matches anything but `]` and `%`, etc. -- it's not incredibly readable, but it's fast.)
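To make that a bit more concrete, here is a rough sketch of how those patterns could be wrapped into one helper. The name `match_function_name`, the `TAG_NAME_MAX` limit, and the third pattern (for a function with no return value) are my own assumptions for illustration, not anything from the existing parser. The only substantive change to the formats is a field width on `%[...]`, so a long identifier can't overflow the buffer.

```c
#include <stdio.h>

/* Hypothetical limit; must agree with the "63" widths in the formats below. */
#define TAG_NAME_MAX 63

/*
 * Tries to extract a function name from one source line.
 * Returns 1 and fills `name` on a match, 0 otherwise.
 */
static int match_function_name(const char *line, char name[TAG_NAME_MAX + 1])
{
    /* function [a, b] = name(...) */
    if (sscanf(line, " function [%*[^]%]] = %63[A-Za-z0-9_]", name) == 1)
        return 1;

    /* function ret = name(...) */
    if (sscanf(line, " function%*[ \t]%*[A-Za-z0-9_] = %63[A-Za-z0-9_]", name) == 1)
        return 1;

    /* function name(...)   (assumed extra case: no return value) */
    if (sscanf(line, " function%*[ \t]%63[A-Za-z0-9_]", name) == 1)
        return 1;

    return 0;
}
```

Because `%*[ \t]` needs at least one space or tab after `function`, a line like `functionality = 42` is rejected by the last two patterns, and the first one rejects it for lack of a `[`. One caveat: character ranges such as `A-Z` inside a scanset are implementation-defined in the C standard, although the common libcs treat them as ranges.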
So, what do you think? Would `sscanf()` be fast enough, or would it be better to keep matching individual chars and substrings?
Probably could be done by checking, after

    p = (const unsigned char *) strstr((const char *) line, "struct");

that `*(p-1)` and `*(p+6)` are not alnum (plus all the necessary range checks).
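A minimal sketch of that boundary check, range checks included, might look like the following. The helper names `find_word` and `is_ident_char` are hypothetical, and treating `_` as part of an identifier is my assumption on top of the plain alnum test.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption: identifiers consist of alphanumerics and '_'. */
static int is_ident_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/*
 * Returns a pointer to the first occurrence of `word` in `line` that is
 * not part of a longer identifier, or NULL if there is none.
 */
static const char *find_word(const char *line, const char *word)
{
    size_t len = strlen(word);
    const char *p = line;

    if (len == 0)
        return NULL;

    while ((p = strstr(p, word)) != NULL) {
        /* p[-1] is only read when p is not the start of the line. */
        int left_ok = (p == line) || !is_ident_char((unsigned char) p[-1]);
        /* p[len] is at worst the terminating NUL, so it is always readable. */
        int right_ok = !is_ident_char((unsigned char) p[len]);

        if (left_ok && right_ok)
            return p;
        p++; /* keep searching after this false hit */
    }
    return NULL;
}
```

With this, `find_word(line, "struct")` would accept `s = struct('a', 1)` but skip `mystruct = 1` and `structure = 1`.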
That still won't ignore words in strings (and maybe other corner cases). Honestly I think I'd entirely ditch parsing structs; knowing that a certain variable at a certain point in the program is a struct isn't really that relevant, and universal-ctags doesn't do it anyway. Class parsing would be more useful. For a similar reason, I'd avoid parsing all variables as universal-ctags does; having a list with EVERY variable assignment in EVERY function in the script seems excessive. (However, it might be a good idea to list `global` and `persistent` variables.)