[Geany-Devel] Syntax highlighting, folding, etc for a "new" language.

Colomban Wendling lists.ban at xxxxx
Wed Jan 29 17:34:01 UTC 2014


Le 29/01/2014 17:37, Larry Bradley a écrit :
> [...]
> I have the geany 1.23 source, and I've actually made some changes to
> the VHDL scintilla lexer and filetypes.vhdl to handle folding and
> syntax highlighting properly.

You should use the development version (Git repository), so your changes
would be easier to merge later.

> However, I would like to do a better job of supporting Jal.

First of all, you should take a look at the file named HACKING in the
source tree.  It contains plenty of generic and specific guidance on how
to hack on the Geany source, and has a specific section on new filetypes.

> In particular, I would like the symbol tree to be able to show the 
> variables and constants defined in a Jal program. Using the VHDL 
> filetype, Geany shows the functions and procedures (I did nothing to 
> cause this to happen), but not the variables.
> Only some filetypes actually display variables. Basic, for example, 
> does, while Pascal does not.

The symbols are extracted with a CTags parser, e.g.
tagmanager/ctags/vhdl.c.  Whether or not a particular type of symbol
appears in Geany depends on basically two things:

1) the ability of the relevant parser to generate "tags" for those symbols;

2) whether or not the type of those generated tags is mapped to a
category displayed in the symbols list.

The first point obviously requires the parser to be tuned to handle that
particular thing.  The second depends both on what type the parser
reports for the tags, and on whether that type is mapped to something
for this language in src/symbols.c:add_top_level_items().

> I've no problems with making changes to the Geany code, but I've no
> idea where to start with the display of variables and constants. The 
> scintilla lexers that I've seen, and the scintilla documentation do
> not make it really obvious how one writes a lexer.

Scintilla lexers do not generate tags; that's the job of the CTags parsers.

The main difference between the two is that the goal of a Scintilla
lexer is only to properly highlight the code, which generally requires
only basic knowledge of the syntax (e.g. what is a string, a comment,
etc.) -- basically, only the first step of general language
understanding is required: identifying tokens.  Having a very tolerant
Scintilla lexer is a good thing, since it is meant to highlight a
document while it is being modified.

On the other hand, since the CTags parser has to extract particular
information from the data, it has to understand some parts of it.  In
general, this requires the first step (dividing into tokens), although
sometimes only very basic differentiation is needed [1];  but also the
second step: understanding, to some extent, what those tokens actually
mean.  Whether it has to understand the whole language depends on how
the language is constructed and how clever the author of the parser is
at finding tricks.  For example, for a language that uses keywords to
introduce everything the parser wants to extract (PHP and Python pretty
much fit), the parser can pretty much simply search for those keywords,
start extracting the relevant information from there, and not care much
about what is in between.  On the other hand, for languages with a more
"free" syntax (like C, C++ and other crazy languages :), the parser may
have to do more work just to be able to find what is interesting (e.g.
one could imagine a C or C++ parser cutting the input into statements,
and then analyzing each statement's content).

In practice, however, one will generally take as a basis an existing
parser or lexer for a language similar to the one they want to support.

For writing your Scintilla lexer, pick one for a similar language (here
you picked VHDL, IIUC), copy it and modify it.  If the language in
question really has only small differences from an existing one, one
might even simply tweak an existing lexer to handle both languages --
but this should be done with caution so as not to make things hard to
follow, and only for very similar languages.
Note that in the context of a Scintilla lexer, "very similar" refers
more to which syntactic elements exist and what their syntax is
(comments, strings, etc.) than to how the language works.  For example
C, C++, Java, JavaScript and a few others all use the same lexer,
because most of their syntactic elements are the same.
Also, Scintilla is a separate project, and we prefer new lexers to be
integrated into it before we add them, so we don't diverge.  But don't
worry, Scintilla easily accepts new lexers.

For the CTags parser, it's less commonly a good idea to have one single
parser for different languages, because the differences are generally
larger -- unless of course one language is a perfect superset or subset
of another one.
There are 2 types of CTags parsers:  regular-expression based parsers,
and plain C ones.

1) Regular expression parsers are quite simple: they consist of a set
of line-based regular expressions that extract the tags.  They are
limited (it's impossible to really handle multi-line constructs like
comments or multi-line strings), but really simple to write.

2) Plain C parsers are more complex, but can handle anything the
programmer can: they are just normal C code reading the data and
handling it in whatever way is appropriate.

Ah, and don't take the C parser (c.c) as an example -- don't even look
at it if you don't want to become crazy ;)
If you want a nice, complete (and complex) parser for a relatively easy
language, you can look at the ones for PHP and Rust.

Anyway, don't hesitate to ask any further question you have.


[1] e.g. it's generally not important to differentiate an identifier
from a number constant, because in most languages they are used in the
same places, and if a number appears where an identifier is expected it
simply means the input is malformed.
