Re: speed, I see four "levels" in which the parser can be implemented:

  1. Compare characters one by one

Definitely the fastest, but not very readable. If you wanted to go this way, it would be better to convert the parser into a token-based parser (i.e. a "proper" parser). Such parsers first split an input like

function y = func4(a, b)

into tokens like these

  1. function - keyword
  2. y - identifier
  3. = - = operator
  4. func4 - identifier
  5. (
  6. a - identifier
  7. ,
  8. b - identifier
  9. )

and then perform analysis on top of these pre-parsed tokens. While creating the tokens, these parsers also skip things like whitespace and comments, so the rest of the code doesn't have to worry about them. The tokenizer reads the input character by character and does the necessary comparisons character-wise, which is why these parsers are very fast. In ctags, these are all the parsers that don't use readLineFromInputFile() or regular expressions.
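To make the token idea concrete, here is a minimal sketch of such a tokenizer in plain C. It is not how any ctags parser is actually written: `Token`, `TokenKind` and `next_token()` are hypothetical names, the input is a plain string rather than the ctags input API, and Matlab strings and the transpose operator are ignored for brevity.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not the ctags API. */
typedef enum {
    TOK_KEYWORD, TOK_IDENTIFIER, TOK_OPERATOR, TOK_PUNCT, TOK_EOF
} TokenKind;

typedef struct {
    TokenKind kind;
    char text[64];
} Token;

/* Read the next token starting at *cursor and advance *cursor past it.
 * Whitespace and %-comments are skipped here, so callers never see them. */
static Token next_token(const char **cursor)
{
    const char *p = *cursor;
    Token tok = { TOK_EOF, "" };

    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (*p != '%')
            break;
        while (*p != '\0' && *p != '\n')   /* skip comment to end of line */
            p++;
    }

    if (*p == '\0') {
        /* end of input: tok is already TOK_EOF */
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        size_t n = 0;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n < sizeof tok.text - 1)   /* truncate overly long names */
                tok.text[n++] = *p;
            p++;
        }
        tok.text[n] = '\0';
        tok.kind = strcmp(tok.text, "function") == 0 ? TOK_KEYWORD
                                                     : TOK_IDENTIFIER;
    } else {
        /* single-character operator or punctuation */
        tok.text[0] = *p;
        tok.text[1] = '\0';
        tok.kind = (*p == '=') ? TOK_OPERATOR : TOK_PUNCT;
        p++;
    }

    *cursor = p;
    return tok;
}
```

Calling next_token() repeatedly on "function y = func4(a, b)" yields the nine tokens listed above and then TOK_EOF; the analysis code only ever looks at tokens, never at raw characters.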

This is definitely the way to go if you want the best possible parser - but such parsers take more time to implement, and you'd have to rewrite the current Matlab parser from scratch.

readLine() based parsers are definitely shittier but often just fine if the language isn't too crazy.

  2. strncmp() and strstr()

This is used in most ctags readLine() based parsers.
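For illustration, a line-based check of this kind might look roughly like the sketch below. `function_name_from_line()` is a hypothetical helper, not code from any existing ctags parser, and it assumes the line has already had its comment removed (a "=" inside a trailing comment would confuse it).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if `line` is a Matlab function definition, copy the
 * function name into `out` and return 1; otherwise return 0. */
static int function_name_from_line(const char *line, char *out, size_t out_size)
{
    if (out_size == 0)
        return 0;

    while (isspace((unsigned char)*line))
        line++;

    /* The keyword must be followed by whitespace, so "functional" is rejected. */
    if (strncmp(line, "function", 8) != 0 || !isspace((unsigned char)line[8]))
        return 0;

    const char *p = line + 8;

    /* With return values ("function y = func4(a, b)") the name follows the
     * "="; only a "=" that comes before the argument list counts. */
    const char *paren = strchr(p, '(');
    const char *eq = strstr(p, "=");
    if (eq != NULL && (paren == NULL || eq < paren))
        p = eq + 1;

    while (isspace((unsigned char)*p))
        p++;

    size_t n = 0;
    while ((isalnum((unsigned char)*p) || *p == '_') && n + 1 < out_size)
        out[n++] = *p++;
    out[n] = '\0';

    return n > 0;
}
```

A readLine() based parser would call something like this once per input line and emit a tag whenever it returns 1.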

  3. sscanf() 💡

As with Lex, I'm not entirely sure about the performance of this - even though you don't have to backtrack, I'm not sure how these rules are evaluated and whether it's fast enough. Also, personally, I'd prefer plain C code for this - it's more readable and it can be reused. For example, you can strip everything after % in C first, and then the rest of the code doesn't have to care about comments any more. (This is one of the typical simplifications of readLine() based parsers: % could be inside a string, in which case you shouldn't strip it.)
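A rough sketch of both ideas, for comparison: stripping the comment in plain C, then matching the line with sscanf(). Both function names are made up for this example, and the format string is just one possible way to write the rule, not something taken from the PR.

```c
#include <stdio.h>
#include <string.h>

/* Cut the line at the first '%'. This is the simplification mentioned
 * above: a '%' inside a string literal gets cut as well. */
static void strip_percent_comment(char *line)
{
    char *percent = strchr(line, '%');
    if (percent != NULL)
        *percent = '\0';
}

/* Hypothetical sscanf() rule for "function <out> = <name>(...)".
 * Returns 1 if both fields were read. Limitations of this sketch:
 * multiple outputs such as "[a, b] = f(x)" do not match, and because a
 * whitespace directive also matches zero characters, "functiony = f(x)"
 * would be accepted. */
static int scan_function_line(const char *line, char out_var[64], char name[64])
{
    return sscanf(line, " function %63[^= \t] = %63[^( \t]", out_var, name) == 2;
}
```

The sscanf() version is shorter, but its limitations are hidden inside the format string, while the plain C helper can be reused by every other rule in the parser.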

What's sure is that ctags parsers don't really use this method.

  4. Regular expressions

Regexii (as the ancient Romans commonly called them) are probably slowest and also least flexible but fastest to write and better than no parser at all.
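To show what "fastest to write" means in practice, here is a sketch using POSIX <regex.h> as a stand-in; ctags has its own regex machinery, which this does not use. The pattern and the name `regex_function_name()` are assumptions made for the example.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: extract the function name from a Matlab
 * "function" line using a POSIX extended regular expression.
 * Returns 1 and fills `out` on a match, 0 otherwise. */
static int regex_function_name(const char *line, char *out, size_t out_size)
{
    /* group 1: optional "<outputs> =" part, group 2: the function name */
    static const char *pattern =
        "^[[:space:]]*function[[:space:]]+"
        "([^=(]*=[[:space:]]*)?"
        "([A-Za-z_][A-Za-z0-9_]*)";
    regex_t re;
    regmatch_t m[3];
    int found = 0;

    /* Compiling on every call keeps the sketch short; a real parser
     * would compile the pattern once and reuse it. */
    if (out_size == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, line, 3, m, 0) == 0 && m[2].rm_so != -1) {
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);
        if (len >= out_size)
            len = out_size - 1;            /* truncate to fit the buffer */
        memcpy(out, line + m[2].rm_so, len);
        out[len] = '\0';
        found = 1;
    }

    regfree(&re);
    return found;
}
```

The whole rule is one pattern, which is why this is quick to write, but every special case (comments, strings, line continuations) has to be squeezed into that pattern or handled elsewhere.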

> Honestly I think I'd entirely ditch parsing structs; knowing that a certain variable at a certain point in the program is a struct isn't really that relevant, and universal-ctags doesn't do it anyway. Class parsing would be more useful.

I'm not a Matlab user but I guess this is probably fine for Geany.

> For a similar reason, I'd avoid parsing all variables as universal-ctags does; having a list with EVERY variable assignment in EVERY function in the script seems excessive. (However, it might be a good idea to list global and persistent variables.)

This is where you might run into a problem in universal-ctags - the "kinds" a ctags parser supports are kind of an interface, and dropping one means a backwards-incompatible change. In any case, before you spend more time on this parser, I'd suggest opening an issue in the universal-ctags project describing which way you want to proceed and asking whether that's fine, to avoid unnecessary work (the maintainer of the project tends to be very responsive and supportive).

