Hi,
On 29/01/2014 17:37, Larry Bradley wrote:
> [...]
> I have the geany 1.23 source, and I've actually made some changes to
> the VHDL Scintilla lexer and filetypes.vhdl to handle folding and
> syntax highlighting properly.
You should use the development version (Git repository), so that your changes will be easier to merge later.
> However, I would like to do a better job of supporting Jal.
First of all, you should take a look at the file named HACKING in the source tree. It contains a lot of generic and specific guidance on how to hack on the Geany source, and has a dedicated section on adding new filetypes.
> In particular, I would like the symbol tree to be able to show the variables and constants defined in a Jal program. Using the VHDL filetype, Geany shows the functions and procedures (I did nothing to cause this to happen), but not the variables.
Only some filetypes actually display variables. Basic, for example, does, while Pascal does not.
The symbols are extracted with a CTags parser, e.g. tagmanager/ctags/vhdl.c. Whether or not a particular type of symbol appears in Geany depends on basically two things:
1) the ability of the relevant parser to generate "tags" for those symbols;
2) whether or not the type of those generated tags is mapped to a category displayed in the symbols list.
The first point obviously requires the parser to be tuned to handle that particular construct. The second depends both on what type the parser reports for the tags, and on whether that type is mapped to something for this language in src/symbols.c:add_top_level_items().
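Just to make the second point concrete, here is a tiny standalone mock-up of what that mapping amounts to conceptually -- this is not Geany's actual code (the real thing is a big switch over filetypes in src/symbols.c), and the names are made up for the example:

#include <stdio.h>

/* Tag types a parser might report (illustrative subset, not Geany's enum) */
enum tag_type { TAG_FUNCTION, TAG_VARIABLE, TAG_CONSTANT, TAG_OTHER };

/* For one filetype, map each tag type to the group shown in the symbols list.
 * A NULL label means tags of that type are simply not displayed. */
static const char *jal_group_for(enum tag_type type)
{
    switch (type)
    {
        case TAG_FUNCTION: return "Functions";
        case TAG_VARIABLE: return "Variables";
        case TAG_CONSTANT: return "Constants";
        default:           return NULL; /* not mapped -> not shown */
    }
}

int main(void)
{
    /* pretend the parser emitted these tags for a Jal file */
    struct { const char *name; enum tag_type type; } tags[] = {
        { "blink_led", TAG_FUNCTION },
        { "counter",   TAG_VARIABLE },
        { "MAX_COUNT", TAG_CONSTANT },
    };

    for (unsigned i = 0; i < sizeof tags / sizeof tags[0]; i++)
    {
        const char *group = jal_group_for(tags[i].type);
        if (group)
            printf("%s: %s\n", group, tags[i].name);
    }
    return 0;
}

Any tag type the parser emits that is not mapped to a group for that filetype simply never shows up in the symbols list, even if the parser extracts it perfectly well.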
> I've no problems with making changes to the Geany code, but I've no idea where to start with the display of variables and constants. The Scintilla lexers that I've seen, and the Scintilla documentation, do not make it really obvious how one writes a lexer.
Scintilla lexers do not generate tags; that is the job of the CTags parsers.
The real difference in how they work is that the only goal of a Scintilla lexer is to highlight the code properly, which generally requires only basic knowledge of the syntax (e.g. what a string is, what a comment is, etc.) -- basically, only the first step of understanding the language is required: identifying tokens. Having a very tolerant Scintilla lexer is a good thing, since it is definitely meant to highlight a document while it is being modified.
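To give an idea of what "identifying tokens" means at that level, here is a standalone toy (it has nothing to do with the actual Scintilla API) that styles a Jal-looking snippet knowing only two things about the syntax -- I'm assuming line comments start with "--" and strings are double-quoted, adjust to whatever Jal really uses:

#include <stdio.h>
#include <string.h>

enum style { STYLE_DEFAULT, STYLE_COMMENT, STYLE_STRING };

int main(void)
{
    const char *text = "var byte counter = 0 -- a counter\n"
                       "const byte GREETING = \"hi\"\n";
    size_t len = strlen(text);
    enum style state = STYLE_DEFAULT;

    for (size_t i = 0; i < len; i++)
    {
        char c = text[i];

        /* the whole "lexer": a few state transitions on the current character */
        if (state == STYLE_DEFAULT && c == '-' && text[i + 1] == '-')
            state = STYLE_COMMENT;
        else if (state == STYLE_DEFAULT && c == '"')
            state = STYLE_STRING;
        else if (state == STYLE_STRING && c == '"')
        {
            putchar('S'); /* closing quote is still part of the string */
            state = STYLE_DEFAULT;
            continue;
        }
        else if (state == STYLE_COMMENT && c == '\n')
            state = STYLE_DEFAULT; /* line comments end at the newline */

        /* print one style letter per input character instead of colouring it */
        putchar(state == STYLE_COMMENT ? 'C' :
                state == STYLE_STRING  ? 'S' : '.');
    }
    putchar('\n');
    return 0;
}

A real lexer of course reports the styles through Scintilla's own interfaces instead of printing letters, but the amount of language knowledge involved stays about that small.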
On the other hand, since a CTags parser has to extract particular information from the data, it has to understand some parts of it. In general, this requires the first step (dividing the input into tokens), although sometimes only very basic differentiation is needed [1]; but it also requires the second step: understanding, to some extent, what those tokens actually mean. Whether or not it has to understand the whole language depends on how the language is constructed and on how clever the author of the parser is at finding tricks. For example, for a language that uses keywords to introduce everything the parser wants to extract (PHP or Python pretty much fit), the parser can simply search for those keywords, start extracting the relevant information from there, and not care much about what is in between. On the other hand, for languages with a more "free" syntax (like C, C++ and other crazy languages :), the parser may have to do more work just to be able to find what is interesting (e.g. one could imagine a C or C++ parser cutting the input into statements and then analysing the content of those statements).
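As a rough sketch of that keyword-driven approach for Jal -- standalone C again, not the actual code layout of a tagmanager/ctags parser, and assuming Jal declarations look more or less like "var byte counter" and "const byte MAX_COUNT = 10" -- the whole extraction can be little more than:

#include <stdio.h>
#include <string.h>

/* Emit a "tag" for the identifier following a "var" or "const" keyword,
 * not caring about anything else on the line (type word, initialiser...). */
static void scan_line(const char *line)
{
    char keyword[16], type[32], name[64];

    /* crude: keyword, then a type word, then the identifier we want to tag */
    if (sscanf(line, "%15s %31s %63[A-Za-z0-9_]", keyword, type, name) == 3)
    {
        if (strcmp(keyword, "var") == 0)
            printf("variable: %s\n", name);
        else if (strcmp(keyword, "const") == 0)
            printf("constant: %s\n", name);
    }
}

int main(void)
{
    const char *source[] = {
        "var byte counter",
        "const byte MAX_COUNT = 10",
        "forever loop -- nothing interesting for the tag parser here",
    };

    for (size_t i = 0; i < sizeof source / sizeof source[0]; i++)
        scan_line(source[i]);
    return 0;
}

In a real parser you would of course emit proper tag entries instead of printf()ing, and handle the corner cases of the actual grammar, but the principle is exactly that: look for the introducing keyword, grab the name, ignore the rest.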
In practice, however, one will generally start from an existing parser or lexer for a language similar to the one they want to support.
For writing your Scintilla lexer, pick one for a similar language (here you picked VHDL, IIUC), copy it and modify it. If the language in question really has only small differences from an existing one, one might even simply tweak an existing lexer to handle both languages -- but this should be done with caution so as not to make things hard to follow, and should only be done for very similar languages. Note that in the context of a Scintilla lexer, "very similar" is more about which syntactic elements exist and what their syntax is (comments, strings, etc.) than about how the language works. For example C, C++, Java, JavaScript and a few others all use the same lexer, because most of their syntactic elements are the same. Also, Scintilla is a separate project, and we prefer new lexers to be integrated into it before we add them, so we don't diverge. But don't worry, Scintilla easily accepts new lexers.
For the CTags parser, it is less often a good idea to have a single parser handle different languages, because the differences are generally larger -- unless of course one language is a strict superset or subset of another. There are 2 types of CTags parsers: regular-expression-based parsers, and plain C ones.
1) Regular expression parsers are quite simple, and simply consist of a set of line-based regular expressions that extract the tags. These are limited (it is impossible to really handle multi-line constructs like comments or multi-line strings), but really simple (a standalone illustration follows this list).
2) Plain C parsers are more complex, but can handle anything the programmer can handle. They are just normal C code reading the data and handling it in any appropriate manner.
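For instance, a line-based regular expression catching the same hypothetical Jal declarations as in the sketch above could look like this (standalone POSIX regex code just to show the idea; in a real regex-based parser the pattern is registered with the parser framework rather than applied by hand):

#include <stdio.h>
#include <regex.h>

int main(void)
{
    /* group 1: keyword (var/const), group 2: the identifier to tag */
    const char *pattern =
        "^[ \t]*(var|const)[ \t]+[A-Za-z_]+[ \t]+([A-Za-z_][A-Za-z0-9_]*)";
    const char *lines[] = {
        "var byte counter",
        "const byte MAX_COUNT = 10",
        "forever loop",
    };
    regex_t re;
    regmatch_t m[3];

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 1;

    for (size_t i = 0; i < sizeof lines / sizeof lines[0]; i++)
    {
        if (regexec(&re, lines[i], 3, m, 0) == 0)
            printf("%.*s -> %.*s\n",
                   (int)(m[1].rm_eo - m[1].rm_so), lines[i] + m[1].rm_so,
                   (int)(m[2].rm_eo - m[2].rm_so), lines[i] + m[2].rm_so);
    }

    regfree(&re);
    return 0;
}

One regular expression per kind of construct you want to tag is often all such a parser contains, which is why they are so easy to write -- and why they fall over as soon as a declaration spans several lines.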
Ah, and don't take the C parser (c.c) as an example -- don't even look at it if you don't want to go crazy ;) If you want a nice, complete (and complex) parser for a relatively easy language, you can look at the ones for PHP and Rust.
Anyway, don't hesitate to ask any further questions you have.
Regards, Colomban
[1] e.g. it's generally not important to differentiate an identifier from a number constant, because in most languages they are used in the same way, and if a number appears where an identifier is expected, it simply means the input is malformed.