[Geany-devel] New messages/output parsing proposition

Fri Aug 26 01:37:07 UTC 2011

Hi Dimitar,

Some good ideas.

On 26 August 2011 03:34, Dimitar Zhekov <dimitar.zhekov at gmail.com> wrote:
> Hi,
>
> 0. Not having the Lex imagination,

Dunno about imagination, it's more experience [1] :-)

> and not wanting to integrate a Real
> Parser into Geany and push it onto the users, I ended up with a pretty
> simple universal parsing proposition - mostly a combination of what
> we already have.

In general, I like the idea of being able to explicitly specify which
capture groups supply which information, but unfortunately a single
index per field doesn't work in many cases. For example, a rough regex
that captures the file and line number from both C and Python messages
is:

(^([^:]+):([0-9]+):)|(, \('([^']+)', ([0-9]+))

For C, file_idx = 2 and line_idx = 3; for Python, file_idx = 5 and
line_idx = 6, so a single number per field won't work.
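
To make the index problem concrete, here is a minimal sketch using Python's re module rather than Geany's engine. The two sample messages are made up for illustration; only the regex is the one given above.

```python
import re

# The combined C/Python regex from above, unchanged.
COMBINED = re.compile(r"(^([^:]+):([0-9]+):)|(, \('([^']+)', ([0-9]+))")

# Made-up sample messages, one per branch of the alternation.
c_msg = "main.c:42: error: 'x' undeclared"
py_msg = "syntax error, ('spam.py', 7)"

c_match = COMBINED.search(c_msg)
py_match = COMBINED.search(py_msg)

# C branch: file and line land in groups 2 and 3; groups 5 and 6 are unset.
c_fields = (c_match.group(2), c_match.group(3),
            c_match.group(5), c_match.group(6))

# Python branch: the same information lands in groups 5 and 6 instead.
py_fields = (py_match.group(2), py_match.group(3),
             py_match.group(5), py_match.group(6))
```

Whichever branch matches, the groups belonging to the other branch are unset, so no single pair of indexes picks out the file and line for both.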

Instead the capture groups should be identified by name (with the
option for non-unique names) eg:

(?J)(^(?<file>[^:]+):(?<line>[0-9]+):)|(, \('(?<file>[^']+)', (?<line>[0-9]+))

That assumes Geany uses a regex engine that supports named captures,
eg GRegex.

Since Geany 0.21 will depend on GLib >= 2.16, and GRegex was added in
GLib 2.14, after the 0.21 release we should remove the options for
different regex engines and use only GRegex, making life simpler all
round.

This also lets the information be specified in the regex itself, so no
extra fields are needed in the UI dialogs to specify the indexes.
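
As a rough illustration of lookup by name, here is a sketch in Python. Note the assumptions: Python's re module spells named groups (?P<name>...) and rejects duplicate names, so it cannot run the (?J) pattern above directly. The sketch emulates non-unique names with a per-branch suffix (file_c, file_py, and so on, all hypothetical names) that is stripped after matching. With PCRE-style duplicate names the suffixes would not be needed.

```python
import re

# Same two branches as above, with named groups. The _c/_py suffixes are a
# workaround for Python's re, which does not allow duplicate group names.
NAMED = re.compile(
    r"(?:^(?P<file_c>[^:]+):(?P<line_c>[0-9]+):)"
    r"|(?:, \('(?P<file_py>[^']+)', (?P<line_py>[0-9]+))"
)


def extract(message):
    """Return (filename, line) for a message, or None if nothing matches."""
    m = NAMED.search(message)
    if m is None:
        return None
    fields = {}
    for name, value in m.groupdict().items():
        if value is not None:
            # "file_c" -> "file", "line_py" -> "line"
            fields[name.split("_")[0]] = value
    return fields["file"], int(fields["line"])
```

The caller only ever asks for "file" and "line", regardless of which branch matched, which is the point of naming the groups.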

>
>
> 1. A message is parsed using:
>
> regex: Extended syntax regular expression with subexpressions.
>
> line_idx: 1-based index of the line number subexpression. If zero,
> Geany will read the first two (or three, if there is a third) matches,
> and check whether the first or second match is purely digits. If so,
> it'll try to use the matches as <line> [column] <filename>, or as
> <filename> <line> [column].

This can be the fallback if there are no named captures so existing
regexii will still work.
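
A possible shape for that fallback, sketched in Python under some assumptions: parse_location and is_digits are hypothetical helpers, not Geany functions, the named groups are assumed to be called "file" and "line", and the column is ignored for brevity.

```python
import re


def is_digits(text):
    """True if text consists purely of ASCII digits."""
    return text is not None and re.fullmatch(r"[0-9]+", text) is not None


def parse_location(pattern, message):
    """Return (filename, line) or None. Hypothetical helper for illustration."""
    m = pattern.search(message)
    if m is None:
        return None
    # Preferred path: the regex names its groups (assumed to capture digits
    # for "line").
    named = m.groupdict()
    if named.get("file") and named.get("line"):
        return named["file"], int(named["line"])
    # Fallback for existing unnamed regexes: look at the first two or three
    # groups (unset optional groups are skipped) and see which one is digits.
    groups = [g for g in m.groups()[:3] if g is not None]
    if len(groups) >= 2 and is_digits(groups[0]):
        # <line> [column] <filename>
        return groups[-1], int(groups[0])
    if len(groups) >= 2 and is_digits(groups[1]):
        # <filename> <line> [column]
        return groups[0], int(groups[1])
    return None
```

An old-style pattern such as ^([^:]+):([0-9]+): and a new-style one with named groups would both go through the same entry point.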

[...]
>
> 2. Some examples:
>
> "^([:]+):([0-9]+): "
> file_idx=1, line_idx=2
> the standard grep-line syntax
>

[:]+ should be [^:]+, i.e. "one or more characters that are not a
colon", not "one or more colons".
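
A quick check of the difference, using Python's re module and a made-up grep-style line:

```python
import re

msg = "main.c:42: warning: unused variable"

# As written in the example: the filename group only accepts colons,
# so it cannot match "main.c".
broken = re.match(r"^([:]+):([0-9]+): ", msg)

# With the negated character class the filename is everything up to
# the first colon.
fixed = re.match(r"^([^:]+):([0-9]+): ", msg)
```

Here broken is None, while fixed captures the filename and the line number.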

> "libtool *--mode=link| from ([:]+):([0-9]+)[,:]([0-9]+[,:])?$" \
>        "|^([:]+):([0-9]+):([0-9]+:)? (warning: )?"
> file_idx=1, line_idx=2, col_idx=3, warn_idx=4 :)
> gcc/g++ syntax, ignoring libtool --mode=link (see msgwindow.c)
>

The * after libtool needs escaping, and there are several instances of
the [:]+ problem from example 1.

>
> 3. Recommendations for writing regular expressions:
>
> The default file_idx is 0, so if you specify line_idx > 0, be sure to
> set file_idx > 0 too (unless the message does not contain a filename).
>
> For more precise parsing, try to match the start or end of line, or
> some unambiguous text.
>
> Exclude the delimiter following a subexpression: "[^:]+:" instead of
> ".+:" (note that Geany assumes colon-less document names).

Colon-less filenames are a reasonable assumption, although some stupid
tool keeps creating a directory called ":" on my system, which may
cause problems, but I haven't figured out which tool it is.

>
>
> 4. Rationale for an explicitly specified line/column # to start with
> a number instead of being purely digits:
>
> You can always specify a purely-digit subexpression.
>
> For some messages, it may be easier to define the line and column
> including some trailing text in the subexpression.

You could use non-capturing groups for the stuff you don't want and put
just the digits in a capturing group.
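
For instance, a sketch in Python with a made-up message format, where the surrounding words vary but only the digits are captured:

```python
import re

# (?:...) groups the text we need to match but do not want to capture,
# so the line and column groups contain digits only.
LOCATION = re.compile(
    r"^([^,]+), (?:line |l\.)([0-9]+)(?:, (?:column |c\.)([0-9]+))?"
)

long_form = LOCATION.match("spam.py, line 12, column 5: unexpected token")
short_form = LOCATION.match("spam.py, l.12: unexpected token")
```

The non-capturing groups do not take up an index, so the file, line and column stay in groups 1, 2 and 3 however much extra text has to be matched around them.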

>
>
> 5. Initial implementation:
>
> The built-in parsing will match the current behaviour, with some
> small improvements.

Yep, user-defined regexen must still work; we don't want everybody to
have to go back and edit all their regexen after they upgrade to 0.22.

>
> The list model should either be removed and left to filename-line-
> [column] parsing, or used to store the parsing results. Personally I
> prefer the former. Combining the parts into a string and parsing them
> back seems a bit awkward, but such things happen in programming.
>
>
> 6. Possible extensions:
>
> More than one user-defined regex and set of indexes: error_regex_1,
> line_idx_1 etc.

IMHO named is better, easier to understand and specify etc.

>
> When building, attempt all default expressions, and check if the
> filename matches the well-known extensions for this expression. For
> the 1st line in a build output, the current-file-type-expression is
> tried first. For the next lines, the last successful (best-matching)
> expression is tried first.

To be exact, the decoding isn't filetype dependent, it's
compiler/linker/tool dependent.

So extracting the filetype by consensus from the error message and
using that to decode the message isn't necessarily right, e.g. when your
makefile uses a different compiler from the one that the standard
filetype is defined for.  Checking with multiple compilers is a good
idea, e.g. clang is (in my limited experience) more standards-enforcing
than gcc, so both should be tried.
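
The "try all expressions, last successful one first" ordering from the quoted proposal could be sketched roughly like this in Python. ExpressionList is a hypothetical class for illustration, and the two patterns are the named C and Python expressions from earlier in this mail, written with Python's (?P<name>...) syntax.

```python
import re

GCC_LIKE = re.compile(r"^(?P<file>[^:]+):(?P<line>[0-9]+):")
PY_LIKE = re.compile(r", \('(?P<file>[^']+)', (?P<line>[0-9]+)")


class ExpressionList:
    """Tries every expression, keeping the last successful one at the front."""

    def __init__(self, patterns):
        self.patterns = list(patterns)

    def match(self, line):
        for index, pattern in enumerate(self.patterns):
            m = pattern.search(line)
            if m is not None:
                # Move the expression that worked to the front so it is
                # tried first for the next line of output.
                self.patterns.insert(0, self.patterns.pop(index))
                return m.group("file"), int(m.group("line"))
        return None
```

Nothing here depends on the filetype; the order is driven purely by what the tool's output has matched so far.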

Tool dependence means it depends on the command, which is why I
suggested in another post that per-command expressions might be
needed.  While verbose, that's very simple :-)
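
To show how simple, here is a sketch in Python of a per-command table. The table, the choice of keying on the first word of the command, and the locate helper are all assumptions for illustration; nothing like this exists in Geany as far as this mail goes.

```python
import re

# Hypothetical table: one expression per build command, keyed by program name.
PER_COMMAND = {
    "gcc": re.compile(r"^(?P<file>[^:]+):(?P<line>[0-9]+):"),
    "python": re.compile(r", \('(?P<file>[^']+)', (?P<line>[0-9]+)"),
}


def locate(command, output_line):
    """Parse output_line with the expression registered for command, if any."""
    words = command.split()
    pattern = PER_COMMAND.get(words[0]) if words else None
    if pattern is None:
        return None
    m = pattern.search(output_line)
    if m is None:
        return None
    return m.group("file"), int(m.group("line"))
```

The cost is one regex per command, but there is no guessing about which tool produced a given line.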

>
> copy_idx - Index of the text to be copied on right-click Copy, 0 if
> none. If the matching text is empty, the entire line will be copied.
>
> Allow the plugins to define their own expressions for the messages
> tab (which will make the list model obsolete).
>
>
> RFC.
>

Cheers
Lex

[1] "Experience" is a nice way of saying I have fallen into that trap,
made that mistake, implemented it the wrong way before, etc.