On 21/08/2011 05:07, Lex Trotman wrote:
Hi Guys,
So to summarise the thread:
- Natural language word breaks are hard, we don't want to go there.
Refer to http://www.unicode.org/reports/tr29/ from which I quote: "programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user."
As Colomban said and IIUC, the algorithm in this standard is used in Pango and ICU.
Yeah, it's hard, and not necessarily useful for Geany's goals -- having it wouldn't be a problem, implementing it would.
It requires Unicode data tables which Geany and Scintilla do not have.
That's not (completely?) true, however. GLib ships at least part of the Unicode tables, and both Scintilla and Geany depend on Pango through GTK anyway. However, I doubt Geany can do much on its side, and I doubt Scintilla wants a hard dependency on Pango for non-GTK platforms, or to behave differently depending on the platform.
So basically yes, 1 is unlikely to be really fixed.
- As Dimitar said Scintilla has its own definition of word start and
end, leading to a word being a contiguous sequence of word chars or a contiguous sequence of punctuation chars, where a word char in UTF-8 mode is any byte >= 0x80, or isalnum, or '_', or one of the extra characters set by Geany. It is going to use this definition when searching.
- Scintilla lexers use the programming language definition to
highlight, AFAICT most don't use wordchars above.
- Tagmanager parsers also use the programming language definition,
which should match the lexer in an ideal world since most programming languages are precisely defined.
As Colomban said, it is too much work to make tagmanager and the lexer use the same data, but IMHO if they disagree then it's a bug, since programming languages clearly define what Dimitar called symbols.
Yes, if tagmanager and the SCI lexer disagree on what a "symbol" is, there is probably a bug. However, as I already noted, I think SCI lexers don't need to be as precise as tagmanager's parsers, since they probably don't care about words except for highlighting keywords.
For a programming language, find usage should find the symbol, and if Geany uses Scintilla words to decide which characters are part of the symbol, that means the Scintilla wordchars definition must match the programming language's symbol definition. So Geany may have to set it based on the filetype.
I don't see much point in Geany having its own definition; that's a fourth one to get out of sync, and in fact the original bug is because editor.c/read_current_word() has its own definition that doesn't match the filetype.
If you allow the user to change Scintilla's wordchars (which the wordchars and whitespace_chars settings allow), then they are likely not to match tagmanager's. And here comes the problem when looking up a tagmanager symbol from a Scintilla word.
So either we require the filetype wordchars to fit tagmanager's (and more or less what the language sees as a symbol), and thus don't allow what the user wanted to do in the first place (make "$" part of a Scintilla word for navigation and selection purposes).
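A per-filetype override could look like this in a filetype definition file. The [settings] group is real in Geany filetype files, but treat the wordchars key here as a sketch of the idea being discussed, not a guaranteed setting of the version in question:

```ini
# sketch of a filetypes.php override -- the wordchars key is illustrative
[settings]
wordchars=_$abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
```

This is where the tension shows: adding "$" here helps navigation in PHP, but only if tagmanager agrees that "$" belongs to the symbol.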
Although I haven't examined the usages of this function in detail, maybe it should be replaced by one that uses Scintilla.
Depends on whether we allow the user to tune Scintilla's wordchars as she wants or not, see above.
For non-programming languages, where it's a different filetype, it can have a different set of wordchars to make an approximation to
Actually here it's more whitespace_chars that'd need to include more stuff, but yeah.
natural language words, but for comments in a program, it's going to be stuck with the programming language definition.
Which won't be a problem 90% of the time, since Good Practice (tm) wants a programmer to write their comments in pure ASCII/English :)
So then we have to ensure that the filetype wordchars lists are right, and that tagmanager and the lexers have no obvious bugs. That's all :-)
...and we don't support what the user wanted to do?
Cheers, Colomban