Hi Dimitar,
Some good ideas.
On 26 August 2011 03:34, Dimitar Zhekov dimitar.zhekov@gmail.com wrote:
Hi,
- Not having the Lex imagination,
Dunno about imagination, it's more experience [1] :-)
and not wanting to integrate a Real
Parser into Geany and push it onto the users, I ended up with a pretty simple universal parsing proposition - mostly a combination of what we already have.
In general, I like the idea of being able to explicitly specify which capture groups supply which information, but unfortunately a single number doesn't work in many cases, eg
a rough C and Python file and line number capturing regex is:
(^([^:]+):([0-9]+):)|(, \('([^']+)', ([0-9]+))
for C file_idx = 2, line_idx = 3 and for Python file_idx = 5, line_idx = 6 so a single number won't work.
Instead the capture groups should be identified by name (with the option for non-unique names) eg:
(?J)(^(?<file>[^:]+):(?<line>[0-9]+):)|(, \('(?<file>[^']+)', (?<line>[0-9]+))
That assumes Geany uses a regex engine that supports named captures, eg GRegex.
Now, since Geany 0.21 will depend on glib >= 2.16 and GRegex was added in 2.14, after the 0.21 release we should remove the options for different regex engines and use only GRegex, making life simpler all round.
Also this allows the information to be specified in the regex, no extra fields needed in the UI dialogs to specify the indexes.
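A rough sketch of the named-capture idea, in Python only because it is quick to try out. Note that Python's re module rejects duplicate group names (it has no equivalent of (?J)), so the two alternatives are split into two patterns that share the names "file" and "line". PATTERNS and parse_message are made-up names for illustration, not anything in Geany or GRegex.

```python
import re

# Each message style gets its own pattern, but both use the same group names,
# so the caller never needs to know any capture indexes.
PATTERNS = [
    re.compile(r"^(?P<file>[^:]+):(?P<line>[0-9]+):"),        # C-style
    re.compile(r", \('(?P<file>[^']+)', (?P<line>[0-9]+)"),   # Python-style
]


def parse_message(text):
    """Return (filename, line) from the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group("file"), int(match.group("line"))
    return None
```

With an engine that does allow duplicate names, the two patterns could stay as one alternation and the lookup by name would be the same.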
- A message is parsed using:
regex: Extended syntax regular expression with subexpressions.
line_idx: 1-based index of the line number subexpression. If zero, Geany will read the first two (or three, if there is a third) matches, and check whether the first or second match is purely digits. If so, it'll try to use the matches as <line> [column] <filename>, or as <filename> <line> [column].
This can be the fallback if there are no named captures so existing regexii will still work.
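For what it's worth, here is how I read the line_idx = 0 fallback, as a Python sketch. guess_fields is a hypothetical helper, and the handling of the odd cases (fewer than two matches, a non-numeric middle field) is my assumption, since the proposal doesn't spell them out.

```python
import re


def _is_digits(text):
    # "Purely digits" is taken to mean ASCII 0-9 only.
    return re.fullmatch(r"[0-9]+", text) is not None


def guess_fields(matches):
    """Guess file/line/column from the first two or three captured strings."""
    parts = [m for m in matches[:3] if m]
    if len(parts) < 2:
        return None
    if _is_digits(parts[0]):
        # <line> [column] <filename>
        line = int(parts[0])
        if len(parts) == 3:
            if not _is_digits(parts[1]):
                return None
            column, filename = int(parts[1]), parts[2]
        else:
            column, filename = None, parts[1]
    elif _is_digits(parts[1]):
        # <filename> <line> [column]
        filename, line = parts[0], int(parts[1])
        has_column = len(parts) == 3 and _is_digits(parts[2])
        column = int(parts[2]) if has_column else None
    else:
        return None
    return {"file": filename, "line": line, "column": column}
```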
[...]
- Some examples:
"^([:]+):([0-9]+): " file_idx=1, line_idx=2 the standard grep-line syntax
[:]+ should be [^:]+, ie "one or more not a :", not "one or more :"
"libtool *--mode=link| from ([:]+):([0-9]+)[,:]([0-9]+[,:])?$" \ "|^([:]+):([0-9]+):([0-9]+:)? (warning: )?" file_idx=1, line_idx=2, col_idx=3, warn_idx=4 :) gcc/g++ syntax, ignoring libtool --mode=link (see msgwindow.c)
Needs to escape the * after libtool, plus several instances of the problem in example 1.
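To show the difference on the first example, a quick Python check. The sample line is invented, and the two patterns are the grep-style expression as posted and with the character class negated.

```python
import re

line = "src/main.c:42: warning: unused variable"

# As posted: [:]+ means "one or more colons", so the filename never matches.
as_posted = re.compile(r"^([:]+):([0-9]+): ")
# Corrected: [^:]+ means "one or more characters that are not a colon".
corrected = re.compile(r"^([^:]+):([0-9]+): ")

m_bad = as_posted.search(line)    # no match
m_good = corrected.search(line)   # groups are the filename and line number
```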
- Recommendations for writing regular expressions:
The default file_idx is 0, so if you specify line_idx > 0, be sure to set file_idx > 0 too (unless the message does not contain a filename).
For more precise parsing, try to match the start or end of line, or some unambiguous text.
Exclude the delimiter following a subexpression: "[^:]+:" instead of ".+:" (note that Geany assumes colon-less document names).
Colon-less filenames are a reasonable assumption, although some stupid tool keeps creating a directory called ":" on my system, which may cause problems, but I haven't figured out which tool it is.
- Rationale for an explicitly specified line/column # to start with a number instead of being purely digits:
You can always specify a purely-digit subexpression.
For some messages, it may be easier to define the line and column including some trailing text in the subexpression.
Could use non-capturing groups for the stuff you don't want, just put the digits in a capturing group.
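A small Python sketch of what I mean, based on the gcc-style column subexpression from the examples (with the [^:] fix applied). The message text is made up. The first pattern captures the column together with its trailing colon; the second wraps the optional part in a non-capturing group so only the digits are captured.

```python
import re

msg = "foo.c:12:7: error: expected ';'"

# Column group includes the trailing delimiter.
with_trailing = re.compile(r"^([^:]+):([0-9]+):([0-9]+:)? ")
# (?:...) groups the optional "digits + colon" without capturing it;
# the inner group captures just the digits.
digits_only = re.compile(r"^([^:]+):([0-9]+):(?:([0-9]+):)? ")

column_with_delim = with_trailing.match(msg).group(3)
column_digits = digits_only.match(msg).group(3)
```

With the second form the parser can require purely-digit line and column captures without making the expressions any harder to write.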
- Initial implementation:
The built-in parsing will match the current behaviour, with some small improvements.
Yep, user defined regexen must still work, we don't want everybody to have to go back and edit all their regexen after they upgrade to 0.22.
The list model should either be removed and left to filename-line- [column] parsing, or used to store the parsing results. Personally I prefer the former. Combining the parts into a string and parsing them back seems a bit awkward, but such things happen in programming.
- Possible extensions:
More than one user-defined regex and set of indexes: error_regex_1, line_idx_1 etc.
IMHO named is better, easier to understand and specify etc.
When building, attempt all default expressions, and check if the filename matches the well-known extensions for this expression. For the 1st line in a build output, the current-file-type-expression is tried first. For the next lines, the last successful (best-matching) expression is tried first.
To be exact, the decoding isn't filetype dependent, it's compiler/linker/tool dependent.
So extracting the filetype by consensus from the error message and using that to decode the message isn't necessarily right, eg when your makefile is using a different compiler to the one that the standard filetype is defined for. Checking with multiple compilers is a good idea, eg clang is (in my limited experience) more standards enforcing than gcc so both should be tried.
Tool dependence means it depends on the command, which is why I suggested in another post that per-command expressions might be needed. Whilst verbose that's very simple :-)
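For the "last successful expression is tried first" part only, a Python sketch of one way the ordering could work. ExpressionList is a hypothetical name, the expressions are assumed to use named "file" and "line" groups, and nothing here decides whether the list is keyed by filetype or by command.

```python
import re


class ExpressionList:
    """Try expressions in order, moving the one that matched to the front."""

    def __init__(self, patterns):
        self.patterns = [re.compile(p) for p in patterns]

    def parse(self, text):
        for index, pattern in enumerate(self.patterns):
            match = pattern.search(text)
            if match:
                # The next line of output will most likely come from the
                # same tool, so try this expression first next time.
                self.patterns.insert(0, self.patterns.pop(index))
                return match.group("file"), int(match.group("line"))
        return None


C_STYLE = r"^(?P<file>[^:]+):(?P<line>[0-9]+):"
PY_STYLE = r", \('(?P<file>[^']+)', (?P<line>[0-9]+)"
```

A per-command setup would simply hold one such list (or a single expression) per build command.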
copy_idx - Index of the text to be copied on right-click Copy, 0 if none. If the matching text is empty, the entire line will be copied.
Allow the plugins to define their own expressions for the messages tab (which will make the list model obsolete).
RFC.
Cheers Lex
[1] experience is a nice way of saying have fallen into that trap, made that mistake, implemented it the wrong way before etc.