[Geany-devel] Use of Scintilla word boundaries for word searches

Sun Aug 21 16:23:45 UTC 2011

On 21/08/2011 05:07, Lex Trotman wrote:
> Hi Guys,
> 
> So to summarise the thread:
> 
> 1. Natural language word breaks are hard, we don't want to go there.
> 
> Refer to http://www.unicode.org/reports/tr29/ from which I quote
> "programmatic text boundaries can match user perceptions quite
> closely, although sometimes the best that can be done is not to
> surprise the user."
> 
> As Colomban said and IIUC, the algorithm in this standard is used in
> Pango and ICU.

Yeah, it's hard, and not necessarily useful for Geany's goals --
though having it wouldn't be a problem, only implementing it.

> It requires Unicode data tables which Geany and Scintilla do not have.

That's not (completely?) true however.  GLib has at least a part of the
Unicode tables, and both Scintilla and Geany depend on Pango through
GTK anyway.  However, I doubt Geany can do something on its side, and I
doubt Scintilla wants to have a hard dependency on Pango for non-GTK
platforms, nor to behave differently depending on the platform.

So basically yes, 1 is unlikely to be really fixed.

> 2. As Dimitar said Scintilla has its own definition of word start and
> end, leading to a word being a contiguous sequence of word chars or a
> contiguous sequence of punctuation chars.  Where a word char in UTF-8
> mode is >=0x80 or ( (isalnum or _) or the setting from Geany ).   It
> is going to use this definition when searching.
> 
> 3. Scintilla lexers use the programming language definition to
> highlight, AFAICT most don't use wordchars above.
> 
> 4. Tagmanager parsers also use the programming language definition,
> which should match the lexer in an ideal world since most programming
> languages are precisely defined.
> 
> As Colomban said it is too much work to make tagmanager and lexer use
> the same data, but IMHO if they disagree then it's a bug, since
> programming languages clearly define what Dimitar called
> symbols.

Yes, if tagmanager and the SCI lexer disagree on what a "symbol" is, there
is probably a bug.  However, as I already noted, I think SCI lexers don't
need to be as precise as tagmanager's parsers, since they probably only
care about words for highlighting keywords.

> For a programming language find usage should find the symbol, and if
> Geany uses Scintilla words to decide what characters are part of the
> symbol that means that the Scintilla wordchars definition must match
> the programming language symbol definition.  So Geany may have to set
> it based on the filetype.
> 
> I don't see much point in Geany having its own definition, that's a
> fourth one to get out of sync and in fact the original bug is because
> editor.c/read_current_word() has its own definition that doesn't match
> the filetype.

If you allow the user to change the wordchars of Scintilla (which is what
the wordchars and whitespace_chars settings allow), then they are likely
not to match tagmanager's.  And here comes the problem when looking up a
tagmanager symbol from a Scintilla word.

So either we require the filetype to keep wordchars matching
tagmanager's (and more or less what the language sees as a symbol), and
thus don't allow what the user wanted to do in the first place (make "$"
part of a Scintilla word for navigation and selection facilities), or we
let the user tune wordchars and can no longer assume that a Scintilla
word is a tagmanager symbol.
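
To make the mismatch concrete, here is a rough sketch in C.  It is not
Scintilla or Geany code: is_word_char() and word_at() are hypothetical
helpers that only approximate the UTF-8-mode rule quoted in point 2
(byte >= 0x80, alnum, "_", or listed in a user-tunable extra set), and
the extra set stands in for whatever the wordchars setting adds.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: approximates the word-char rule from point 2.
 * "extra" plays the role of the user-tunable wordchars additions. */
static int is_word_char(unsigned char c, const char *extra)
{
    if (c >= 0x80 || isalnum(c) || c == '_')
        return 1;
    /* Guard against '\0': strchr() would match the terminator. */
    return c != '\0' && strchr(extra, c) != NULL;
}

/* Hypothetical helper: copy the word surrounding byte offset "pos" of
 * "text" into "buf".  "buf" is left empty if "pos" is out of range or
 * not on a word char.  The result is truncated to fit "buflen". */
static void word_at(const char *text, size_t pos, const char *extra,
                    char *buf, size_t buflen)
{
    size_t len = strlen(text);
    size_t start = pos, end = pos, n;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';
    if (pos >= len || !is_word_char((unsigned char) text[pos], extra))
        return;

    /* Extend left and right while we stay on word chars. */
    while (start > 0 && is_word_char((unsigned char) text[start - 1], extra))
        start--;
    while (end < len && is_word_char((unsigned char) text[end], extra))
        end++;

    n = end - start;
    if (n >= buflen)
        n = buflen - 1;
    memcpy(buf, text + start, n);
    buf[n] = '\0';
}
```

With the line `echo $foo;` and the cursor on the "f" (offset 6), an
empty extra set gives the word "foo", while an extra set of "$" gives
"$foo".  If tagmanager recorded the symbol as "foo" (an assumption for
the sake of the example), a lookup using the second result would miss.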

> Although I haven't examined the usages of this function
> in detail, maybe it should be replaced by one that uses Scintilla.

Depends on whether we allow the user to tune Scintilla's wordchars as
she wants or not, see above.

> For non-programming languages, where it's a different filetype then it
> can have a different set of wordchars to make an approximation to

Actually here it's more whitespace_chars that'd need to include more
stuff, but yeah.

> natural language words, but for comments in a program, it's going to be
> stuck with the programming language definition.

Which won't be a problem 90% of the time, since Good Practice (tm) wants
programmers to write their comments in pure ASCII/English :)

> So then we have to ensure that the filetype wordchars lists are right,
> and that tagmanager and lexers have no obvious bugs. That's all :-)

...and we don't support what the user wanted to do?

Cheers,
Colomban