[Geany-devel] Use of Scintilla word boundaries for word searches

Sun Aug 21 03:07:28 UTC 2011

Hi Guys,

So to summarise the thread:

1. Natural language word breaks are hard; we don't want to go there.

Refer to http://www.unicode.org/reports/tr29/ from which I quote
"programmatic text boundaries can match user perceptions quite
closely, although sometimes the best that can be done is not to
surprise the user."

As Colomban said, and IIUC, the algorithm in this standard is the one
used in Pango and ICU.

It requires Unicode data tables which Geany and Scintilla do not have.

2. As Dimitar said, Scintilla has its own definition of word start and
end, under which a word is a contiguous sequence of word chars or a
contiguous sequence of punctuation chars.  In UTF-8 mode a word char
is anything >= 0x80, or something that is isalnum or '_', or whatever
the setting from Geany says.  Scintilla is going to use this
definition when searching.

3. Scintilla lexers use the programming language definition to
highlight; AFAICT most don't use the wordchars above.

4. Tagmanager parsers also use the programming language definition,
which should match the lexer in an ideal world since most programming
languages are precisely defined.

As Colomban said, it is too much work to make tagmanager and the
lexers use the same data, but IMHO if they disagree then it's a bug,
since programming languages clearly define what Dimitar called
symbols.
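
To make the definition in point 2 concrete, here is a rough sketch in
C of how I read it.  This is not Scintilla's code: classify(),
run_at() and the extra_wordchars parameter are names I made up for
illustration, with extra_wordchars standing in for the setting from
Geany.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Character classes, loosely following point 2 (not Scintilla's own enum). */
typedef enum { CC_SPACE, CC_WORD, CC_PUNCT } CharClass;

/* extra_wordchars is a hypothetical stand-in for the filetype setting;
 * pass NULL to get only the built-in rule. */
CharClass classify(unsigned char ch, const char *extra_wordchars)
{
    if (ch >= 0x80)                  /* any part of a UTF-8 multibyte sequence */
        return CC_WORD;
    if (isalnum(ch) || ch == '_')
        return CC_WORD;
    if (ch != '\0' && extra_wordchars && strchr(extra_wordchars, ch))
        return CC_WORD;
    if (ch <= ' ')                   /* blanks and control characters */
        return CC_SPACE;
    return CC_PUNCT;
}

/* Finds the contiguous run of same-class characters containing text[pos].
 * The run is the half-open range [*start, *end); its class is returned. */
CharClass run_at(const char *text, size_t len, size_t pos,
                 const char *extra_wordchars, size_t *start, size_t *end)
{
    assert(text != NULL && pos < len);

    CharClass cc = classify((unsigned char)text[pos], extra_wordchars);
    size_t s = pos;
    size_t e = pos + 1;

    while (s > 0 && classify((unsigned char)text[s - 1], extra_wordchars) == cc)
        s--;
    while (e < len && classify((unsigned char)text[e], extra_wordchars) == cc)
        e++;

    *start = s;
    *end = e;
    return cc;
}
```

With the text "foo->bar_1 $x", the run at "->" is a punctuation "word"
of its own, and "$x" only becomes a single word if '$' is passed in
extra_wordchars, which is why the per-filetype setting matters.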

For a programming language, find usage should find the symbol, and if
Geany uses Scintilla words to decide what characters are part of the
symbol, that means the Scintilla wordchars definition must match the
programming language symbol definition.  So Geany may have to set it
based on the filetype.

I don't see much point in Geany having its own definition; that's a
fourth one to get out of sync, and in fact the original bug is because
editor.c/read_current_word() has its own definition that doesn't match
the filetype.  I haven't examined the usages of this function in
detail, but maybe it should be replaced by one that uses Scintilla.

For non-programming languages, where it's a different filetype, it
can have a different set of wordchars to approximate natural language
words, but comments in a program are going to be stuck with the
programming language definition.

So then we have to ensure that the filetype wordchars lists are right,
and that tagmanager and the lexers have no obvious bugs. That's all :-)

Cheers
Lex