Perhaps the sscanf method can find strings and comments first and not look for other stuff inside them?

The regexes defined in upstream ctags are meant to match from the beginning of the line (notice the ^ at the start), so they're not going to accidentally match anything inside an end-of-line comment or a string.

If you want to ignore definitions inside block comments (and I considered modifying my current PR, or creating a new one, for that), the good news is that it's relatively easy to do: a block comment is always delimited by lines containing only %{ and %} (possibly with some whitespace) and nothing else, so there's no risk of accidentally interpreting the content of a string or comment as a block comment delimiter. The only difficulty is that block comments can be nested (think of C's #if 0 ... #endif), but that's easy to solve with a counter.
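
For illustration, here's a minimal sketch of the counter idea (not actual PR code; the helper names are made up), assuming the parser sees the input one line at a time:

```c
#include <ctype.h>
#include <string.h>

/* True if the line contains only `token`, possibly surrounded by whitespace. */
static int lineIsOnly (const char *line, const char *token)
{
	size_t len = strlen (token);
	while (isspace ((unsigned char) *line))
		line++;
	if (strncmp (line, token, len) != 0)
		return 0;
	line += len;
	while (isspace ((unsigned char) *line))
		line++;
	return *line == '\0';
}

/* Track the nesting depth of %{ ... %} blocks across lines; definitions
 * are only parsed at depth 0.  Returns true if the line belongs to a
 * block comment (including the delimiter lines themselves). */
static int skipBlockComment (const char *line, int *depth)
{
	if (lineIsOnly (line, "%{"))
	{
		(*depth)++;
		return 1;
	}
	if (*depth > 0 && lineIsOnly (line, "%}"))
	{
		(*depth)--;
		return 1;
	}
	return *depth > 0;
}
```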

As for end-of-line comments, those always start with % or ... (excluding ones inside a string, but you won't find strings in function definitions), so they're easy to avoid. And strings don't span multiple lines. So I think that, other than the rare %{ ... %} case, it should be straightforward.
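
A rough sketch of that, too (made-up name, and only safe under the assumption above that the interesting lines contain no strings):

```c
#include <string.h>

/* Cut off an end-of-line comment in place.  Only safe on lines that
 * can't contain strings, e.g. function definition lines. */
static void stripEolComment (char *line)
{
	char *p;
	if ((p = strchr (line, '%')) != NULL)
		*p = '\0';
	if ((p = strstr (line, "...")) != NULL)
		*p = '\0';
}
```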

Don't know how fast sscanf is; it's possible it won't be any faster than regexes, since it's still scanning the input more than once. That's what is slow about regex-based parsers: the fact that multiple regexes are applied to the same input, not that a well-optimised regex library like PCRE is slow. Repeated scans are one thing well-written character-by-character parsers try to avoid.

My understanding is that the scanf family of functions only needs to parse one character at a time, and never backtracks or looks ahead more than one character, so it doesn't need to perform multiple passes over the input. For example, scanf("%d", &i) will read character by character from stdin, and as soon as it finds a character that can't be part of an integer (not a digit or a leading whitespace/sign), it'll put that one character back into stdin (see ungetc()) and return. Think of it as a possessive regex.
Regular expressions, on the other hand, may have to try different combinations until one of them works (unless they're possessive), hence the need for backtracking, multiple passes, and the resulting inefficiency.
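
This one-character lookahead is easy to observe with sscanf() (a small standalone example, not parser code):

```c
#include <stdio.h>

int main (void)
{
	int i;
	char c;
	/* %d consumes '4' and '2', peeks at 'x', sees it can't be part of
	 * an integer, and stops; %c then picks up that same 'x'.  With
	 * stdin the peeked character would be pushed back via ungetc(). */
	if (sscanf ("42x", "%d%c", &i, &c) == 2)
		printf ("i = %d, next char = '%c'\n", i, c);  /* i = 42, next char = 'x' */
	return 0;
}
```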

For example, matching the string abcde123 against the regex ([a-z]+)([aeiou]+) succeeds, because the first capture group backs off to abcd (even though it could match all of abcde), letting the second group match e. But the scanf pattern %[a-z]%[aeiou] will fail, because the first specifier just eats all the letters and doesn't leave any for the second.
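
A quick standalone check of that claim (the return value of sscanf() is the number of successful conversions):

```c
#include <stdio.h>

int main (void)
{
	char letters[16], vowels[16] = "";
	/* %[a-z] greedily eats "abcde"; the next character is '1', which
	 * doesn't match [aeiou], so the second conversion fails and
	 * sscanf() returns 1 instead of 2.  No backtracking happens. */
	int n = sscanf ("abcde123", "%15[a-z]%15[aeiou]", letters, vowels);
	printf ("%d conversion(s): \"%s\"\n", n, letters);  /* 1 conversion(s): "abcde" */
	return 0;
}
```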

So, in other words, sscanf() is a glorified char-by-char parser that only ever needs to look ahead one character. It's also probably very well optimized, so it might even be faster than writing the parsing state machine yourself. Maybe we could just write some tests and time them.
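
Something like this made-up micro-benchmark (the pattern is hypothetical, not the one from the PR) would give a first idea; the regex version would be timed the same way for comparison:

```c
#include <stdio.h>
#include <time.h>

int main (void)
{
	char name[64];
	long i, n = 1000000;
	clock_t t0 = clock ();
	/* Apply the same scanning pattern to one line many times and
	 * report the elapsed CPU time. */
	for (i = 0; i < n; i++)
		sscanf ("function out = my_func(a, b)",
		        "function %*s = %63[a-zA-Z0-9_]", name);
	printf ("%ld calls: %.3f s (last match: %s)\n",
	        n, (double) (clock () - t0) / CLOCKS_PER_SEC, name);
	return 0;
}
```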

The writers of the Python parser took the "every assignment is a declaration" approach, while the Julia parser writers took the opposite one: "no assignment is a declaration". So Python is a precedent for making all the names available, and I haven't seen many complaints about it.

I had never noticed this, and it feels kinda wrong that the same variable can be "declared" in multiple places. But then again, Python programs are usually a bunch of functions/classes with maybe a few "file-scope" variables, and the parser ignores assignments performed inside a function (which are local to the function). So for a typical Python file structure, it makes sense to assume that every assignment done outside of a function is some sort of "global" variable. But one could also write a Python script where most or all of the code sits outside of any function and is executed directly (this would look a bit ugly in Geany because of all the variables, but that's the price to pay if we want a "normal" module-like Python file to look good).

Similarly, Matlab files come in two types: "scripts", where all the code inside is executed, and "functions", containing one or more function definitions. Maybe we can just disable variable assignment detection inside functions (i.e., once a line defining a function has been scanned), and then we'd have Python's behavior.
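
A sketch of that idea (hypothetical names; a real implementation would need a stricter check for the function keyword, e.g. skipping leading whitespace and handling nested functions):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Once a "function" line has been seen, the file is a function file,
 * so any later assignment is local to a function and gets no tag. */
static bool seenFunction = false;

static bool shouldTagAssignment (const char *line)
{
	if (strncmp (line, "function", 8) == 0 && isspace ((unsigned char) line[8]))
		seenFunction = true;
	return !seenFunction;
}
```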

However, I'd argue that in the case of Matlab files there's nothing similar to C's "file-scope variables", since you'll never mix function definitions and variable declarations, so parsing variables may be a bit pointless. But I'm OK with it as long as the ones inside functions are excluded.

Honestly I think I'd entirely ditch parsing structs.

Don't know Matlab enough to comment, but if no Matlabbers object, I guess it's OK if upstream doesn't do it.

I have some experience with Matlab and I'd say structs aren't used that often, or at least I don't use them often (and they're far from the only kind of data structure). Plus, using struct explicitly isn't the only way to declare a struct (and I'd say it's rarely done); just doing a.b.c.d = 42 will declare a as a struct containing a struct containing a struct, if a doesn't exist yet. In practice, I think I'd only use struct explicitly if I wanted to "reset" a struct variable to the empty struct.

