Make the ctags parser ignore function definitions in Matlab files written within block comments, like: ```matlab %{ function y = This(is, not, a, function, definition) %}
function y = But(this, is)
%{ not a block comment ("%{" needs to be on a line on its own) function y = And(also, this) %} ``` Block comments are useful for commenting out entire sections of code which may include function definitions, specially since they can be nested (which is also handled appropriately by this PR).
Additionally, this PR allows function declarations to be indented. This is quite common when defining class methods within a class.
This PR is a continuation of PR #3358, which I understand was ready to merge but has been sitting there for a while. So by merging this PR you also get that one. Two for the price of one! (I didn't make this a separate PR because it touches some of the same lines #3358 touches, so making it a separate PR would have led to merge conflicts in the future. Each individual feature is implemented as a separate commit though.) You can view, comment on, or merge this pull request online at:
https://github.com/geany/geany/pull/3563
-- Commit Summary --
* ctags: Matlab: function name only, without args * Modify (fix) test accordingly * Improve ctags tests for Matlab * ctags: Matlab: ignore defs within block comments * ctags: Matlab: ignore leading whitespace
-- File Changes --
M ctags/parsers/geany_matlab.c (52) M tests/ctags/matlab_backtracking.m.tags (4) M tests/ctags/matlab_test.m (14) M tests/ctags/matlab_test.m.tags (18)
-- Patch Links --
https://github.com/geany/geany/pull/3563.patch https://github.com/geany/geany/pull/3563.diff
@cousteaulecommandant pushed 5 commits.
56afcc71d19fb6ed26441b6eaf5adddc620999d1 ctags: Matlab: function name only, without args 568806ea78e98ac2e82b288fa15d443600631294 Modify (fix) test accordingly 19feb68c28427d423015b71428962ed766fc8c57 Improve ctags tests for Matlab 35d8ffa9953996bec6819406c1d47d7848ef194c ctags: Matlab: ignore defs within block comments 75ad0794ffa2824e4fa604efadfc5763c62655a2 ctags: Matlab: ignore leading whitespace
#3358 which I understand was ready to merge but has been sitting there for a while.
Geany does __NOT__ want to maintain parsers separate from ctags, we simply do not have the effort available to devote to language specifics, and making changes available upstream means more people benefit from them and more language expert eyes get to check and improve them.
So my understanding was that either that a version of the Geany C parser was to be upgraded to include any extra capability that the ctags regex parser had and was then merged upstream, or the regex based parser from upstream was merged into Geany. Not that changes were made solely on the Geany parser.
So my understanding was that either that a version of the Geany C parser was to be upgraded to include any extra capability that the ctags regex parser had and was then merged upstream, or the regex based parser from upstream was merged into Geany. Not that changes were made solely on the Geany parser.
But until that happens, shouldn't the existing parser at least be patched to work properly? I agree that the ideal solution will eventually involve throwing all this code in the garbage and replacing it with the new fancy tags anyway, but until then, let's at least have a Matlab parser that isn't broken, even if it is only going to be temporary until it gets replaced.
(As for making the changes available upstream, I understand that the upstream ctags parser for Matlab is already way more complete than this, so it wouldn't benefit from these changes anyway; as I said, this is just meant as a workaround to fix something that is currently broken in Geany.)
But until that happens
Well I was trying to _subtly_ hint that you should make it happen :-)
The existing upstream is from 2008 according to its copyright (didn't check git) so it is going to take one of you matlab users to do it for it to happen, clearly the originator is unlikely to improve it. It is currently a regex parser but a C parser can do things the regex one can't, and is usually faster (speed doesn't matter for uctags, but does for Geany where it has to run between keystrokes without causing delays).
As @techee said on #3358, the existing parser was left as he thought it would be a basis for making a "proper" parser for upstream, so no, the plan wasn't to throw it in the bin.
The existing upstream [...] is currently a regex parser but a C parser can do things the regex one can't, and is usually faster
My bad; I completely misunderstood the other discussion then. I thought you meant that you _didn't_ want to pull in the one from uctags (because it used regex and Geany didn't handle those nicely), but that the plan was to eventually use it (because there was work in making Geany support regex-based uctags).
Well I was trying to _subtly_ hint that you should make it happen :-)
Daaammit. OK, I'll look into it.
it is going to take one of you matlab users to do it for it to happen, clearly the originator is unlikely to improve it.
Oh no. I became the chosen one, and it's for something with Matlab 😨
because it used regex and Geany didn't handle those nicely
IIUC the @techee comments on #3557 Geany can handle regex parsers by importing the `.c` file, ie `matlab.c`.
So its all about quality and speed, thats where there are always concerns about any regex parser.
A parser written in C operating character by character can always be made better and faster than a regex one, and it can always be made more complete than the regex one, and it can handle languages that the regex one can't (eg C and C++), but of course its harder to write and maintain.
So the default preference is to not pull a regex parser unless there is nothing else available. But if the choice is a regex parser or nothing Geany will take the regex parser unless there are major issues with it. So for example on #3557 it was accepted that the regex based forth parser would be ok since it was as fast as the C language parser and no other forth parser was available.
But the "problem" for the new matlab expert guy (you ;-) is that there _is_ an alternative parser for matlab, the existing Geany parser. But it has a different set of capabilities than the ctags regex one (and a bug which you are trying to address). So my questions to you, the newly minted matlab parser expert :-) are:
- what features does the ctags regex parser have that the geany one does not (I have significant doubts about your previous description of the regex parser as "almost complete", but maybe matlab is extremely simple) - what features does the geany parser have that the ctags one does not have (if any) - how do they compare in speed (and how do they compare to the C language parser) - how much effort is it to add features from the Geany parser to the ctags regex one - how much effort is it to add features from the ctags regex parser to the Geany one
The answers to those questions are what should guide the best approach.
An example where a regex parser might not be as capable as a parser written in C is making tags for the properties and methods in a matlab classdef. Using the example in https://au.mathworks.com/help/matlab/ref/classdef.html. At the moment the ctags parser makes tags for the properties `CurrentSpeed` and `SpeedRange` but as stand alone variables, not associated with class `Motor`, and similarly functions `start()` and `stop()` are tagged as standalone functions not associated with `Motor`.
Obviously a parser written in C can associate members and methods with the containing entity since the C/C++ parser shows nested struct/classes to any depth. But it is "slightly" more [complex](https://github.com/geany/geany/tree/master/ctags/parsers/cxx) than ctags `matlab.c`.
IIUC there is a simple context (scope) stack for regex parsers, but I am not sure it can distinguish between a `function foo()` as a standalone function or as a method of a class?
* what features does the ctags regex parser have that the geany one does not (I have significant doubts about your previous description of the regex parser as "almost complete", but maybe matlab is extremely simple)
Not "almost complete"; just "way more complete". OK, maybe that was an overstatement anyway. My point was that it had a few improvements over Geany's current one. Specifically, it deals with the main issue I fixed in **#3358**, also deals with **leading spaces** (i.e. indented definitions) which I fixed here, and also parses **classdef** tags (class definitions) which I haven't implemented yet. Then again, it doesn't deal with comments that may be problematic in certain scenarios. It also parses **_variables_** (not just structs assigned with the `struct()` function as Geany does), but I think that's overkill since it'll catch EVERY assignment to a variable.
Looking back, it is an extremely basic parser. But it looked more complete and correct when I first looked at it.
* what features does the geany parser have that the ctags one does not have (if any)
I didn't find an obvious one. Well, it parses structs. (Or rather, it detects when you're assigning the output of the `struct()` function to a variable, but that's far from the only way to create a struct.)
* how do they compare in speed (and how do they compare to the C language parser)
I really should implement a timing function to answer this; it would be quite helpful for deciding.
* how much effort is it to add features from the Geany parser to the ctags regex one
Before #3358 very easy since there's only the `struct` thing. After it, still quite easy. However, I have no idea how to add the "ignore definitions inside block comments" since that'd require a stateful parser, and the one implemented in u.ctags doesn't seem to have that. (I'd be surprised if that couldn't be done in ctags at all; it's just that I don't know how.)
* how much effort is it to add features from the ctags regex parser to the Geany one
Not much. Most of the issues were already fixed by #3358 and the commit "Ignore leading whitespace" in this PR. The only missing feature is the classdef creation, which is a really naïve implementation anyway. I have a draft for that somewhere, but I parked the development of that when I was trying to figure out how to add classes.
functions `start()` and `stop()` are tagged as standalone functions not associated with `Motor`.
This is where it gets challenging. I can think of an implementation that remembers the last declared class, and shoves all functions declared afterwards into that class (there's usually a single class definition per file). But this means that once the class finishes, other functions declared outside of the class will erroneously be considered part of the class. (This is an unlikely scenario though, and I'm not even sure you can do that sort of thing in Matlab.) Also, I'm not sure if classes can be nested. (Wow, so much for the "Matlab expert" you just recruited!) If these two issues are a problem, then the proper way to go would be to learn where a class _ends_, and that implies implementing a proper parser that actually does some lexing, counting all the `<keyword>`...`end` blocks. This can be extra tricky since `end` can mean different things. So I think it'll be smarter to stick to the "line-oriented parser, with something that remembers the last declared class", even if it has some corner cases.
the ctags parser makes tags for the properties CurrentSpeed and SpeedRange
This is going to be a bit trickier, specially if the user decides to define multiple properties in the same line. Even neglecting that corner case, it's something I'll have to think of. This is the section where all the class attributes are defined so it's important to get it right.
ctags cheats here, and you see it work because it thinks you're assigning to regular variables (because of the `=`), but it would fail e.g. with ```matlab properties width, height, depth end ```
@cousteaulecommandant thanks for the analysis, couple of comments/suggestions
Then again, it doesn't deal with comments that may be problematic in certain scenarios.
This is more a note to me and @techee than you, we do need to be careful of theoretical situations and perfection, the most likely case for problems is slabs of commented out code, and maybe its tolerable for symbols to be parsed in that situation. Then again if comments can be detected by a regex early in the list it has the ability to skip further scanning of those line(s) which would avoid the problem.
It also parses variables (not just structs assigned with the struct() function as Geany does), but I think that's overkill since it'll catch EVERY assignment to a variable.
Ahh, another "assignment is declaration" language, the Python parser catches every assignment IIRC, but then again the Julia one does not, so either seems to be acceptable to users.
I really should implement a timing function to answer this; it would be quite helpful for deciding.
What you need is a big matlab file (IIUC they are mostly teeny, again making regex performance irrelevant) your Terabyte matlab model controlling the known universe should do ;-).
Ctags has the `--totals=extra` option that prints bytes per second scanning so scan a big C++ file (Scintilla Editor.cxx from this repository is ~8800 lines for eg) and your big matlab file and compare them. For Geany time `geany -g tags_file your_big_matlab_file.m` and the C++ file again but you will have to do the math yourself to get bytes per second.
Your analysis of the difficulty of adding `start()` and `stop()` and properties etc to a class looks right to me.
"line-oriented parser, with something that remembers the last declared class"
The scope stack functionality that ctags provides for optlib (regex) implementations is supposed to support that, except with matlab you have the scope classdef Motor but inside that you have scope `properties` and `events` and `methods`, but AFAICT the ctags scope only allows access to top of stack, so you can't access the `classdef` scope that you want to make the method tags a member of because it isn't on top of stack, `properties`, `events` or `methods` is, but I can't see how the tag kind can depend on that and the scope depend on the classdef. But maybe I'm missing something @techee heeeeeeeelp (appeal to the __real__ Geany ctags expert)?
Detecting `end` then pops the stack, so that would work, so long as every construct that has `end` gets a scope pushed on the stack it then doesn't matter which one `end` pops.
And as you point out `properties` can be a list of names, and so can `events` so thats another context that the tag has to depend on.
To summarise my last ramblings, unless we can identify how the regex parser can handle matlab context and scoping we have to conclude that it can't be expanded to do that, suggesting that adding the same naive `classdef` to the Geany parser and pushing it upstream will not lose ctags functionality and is better in the long run since it can support scoping in the future.
@cousteaulecommandant Sorry for the late response :-(
If it's not a big problem (and it shouldn't be, you should be able to simulate what the regex parser does), I'd suggest that you update your parser with all features that the universal-ctags regex parser has (otherwise it won't be accepted) and then submit it to the universal-ctags project. The maintainer is very responsive and I don't expect he wouldn't accept it. Or maybe even better, open an issue there and ask there and ask first to avoid some unnecessary work.
A readline-based parser will always be a compromise - it's easy to create something functional quickly but the more and more language features you add, the more messy it will be. So if you want to create some more complex parser, you'd have to create a token-based parser.
From my point of view, the Matlab parser in Geany fulfilled its purpose - it attracted a Matlab developer, you, who also knows C and who is able to improve it. Done :-). So I think in the next time we update ctags parsers, I'll grab whatever parser there is in the ctags project - if you manage to get your parser to ctags afterwards, just ping me, and I'll grab it for Geany.
Thanks! I'll see what I can do. I'll start with something that does exactly what their parser does, which shouldn't be hard since I already have the parsing done (although I need to understand how to integrate the parsing function with the rest of ctags, and whether something extra needs to be done), and then maybe add some extra functionality such as handling commented out code or an improved class parser.
Or maybe even better, open an issue there and ask there and ask first to avoid some unnecessary work.
Thanks, that's probably a good idea :)
A readline-based parser will always be a compromise - it's easy to create something functional quickly but the more and more language features you add, the more messy it will be. So if you want to create some more complex parser, you'd have to create a token-based parser.
Yeah, I don't expect to have a "real", fully featured, 100% compliant parser, which would need to be token-based and quite complex because Matlab can get quite weird, but just something that will get it right most of the time (easy to deceive on purpose if you're evil enough, but hard to break by accident).
PS: Although you've made it clear that you don't want to maintain your own ctags parser, I'd consider at least pulling #3358 since the current parser is pretty much broken (not to a "causes Geany to crash" degree, but definitely to a "hardly ever produces the right output" level) - i.e. it fixes a _bug_ in the parser that leaves it virtually useless, not a mere feature improvement.
I need to understand how to integrate the parsing function with the rest of ctags
I think the only thing you have to do is to overwrite the existing regex-based parser with the code of this parser. And then have a look at the unit tests to see what fails.
I'd consider at least pulling https://github.com/geany/geany/pull/3358
Done, thanks!
Closed #3563.
I'm closing this PR - Geany uses the regex-based parser from ctags now.
Yeah, I noticed that.
I might eventually contribute to upstream ctags instead (I've already tried my hand with Verilog), but I haven't used Matlab in a while so I'm not that much in a hurry now.
github-comments@lists.geany.org