Hi Guys,
So to summarise the thread:
1. Natural language word breaks are hard, we don't want to go there.
Refer to http://www.unicode.org/reports/tr29/ from which I quote "programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user."
As Colomban said, and IIUC, the algorithm in this standard is used in Pango and ICU.
It requires Unicode data tables which Geany and Scintilla do not have.
2. As Dimitar said, Scintilla has its own definition of word start and end, under which a word is either a contiguous sequence of word chars or a contiguous sequence of punctuation chars. In UTF-8 mode a word char is anything >= 0x80, or else (isalnum or _) or the setting from Geany. Scintilla is going to use this definition when searching.
3. Scintilla lexers use the programming language definition to highlight; AFAICT most don't use the wordchars above.
4. Tagmanager parsers also use the programming language definition, which should match the lexer in an ideal world since most programming languages are precisely defined.
As Colomban said, it is too much work to make tagmanager and the lexers use the same data, but IMHO if they disagree then it's a bug, since programming languages clearly define what Dimitar called symbols.
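To make point 2 concrete, here is a rough sketch in C of the rule as I read it. This is not Scintilla's actual code: is_word_byte(), classify() and word_extent() are names made up for illustration, and I'm assuming that a wordchars setting, when given, replaces the default (isalnum or _) set rather than adding to it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: is this byte of a UTF-8 buffer a word char?
 * wordchars stands in for the setting from Geany; NULL means "use the
 * default (isalnum or _)". Assumption: a non-NULL setting replaces the
 * default set for ASCII bytes. */
static bool is_word_byte(unsigned char ch, const char *wordchars)
{
    if (ch >= 0x80)
        return true;                 /* part of a multi-byte UTF-8 sequence */
    if (wordchars != NULL)           /* guard '\0': strchr would match the terminator */
        return ch != '\0' && strchr(wordchars, ch) != NULL;
    return isalnum(ch) || ch == '_';
}

typedef enum { CC_SPACE, CC_WORD, CC_PUNCT } CharClass;

/* Everything that is neither a word char nor whitespace is punctuation. */
static CharClass classify(unsigned char ch, const char *wordchars)
{
    if (is_word_byte(ch, wordchars))
        return CC_WORD;
    if (ch == '\0' || isspace(ch))
        return CC_SPACE;
    return CC_PUNCT;
}

/* Find the [*start, *end) byte range of the contiguous run of same-class
 * characters that contains pos. */
static void word_extent(const char *text, size_t len, size_t pos,
                        const char *wordchars, size_t *start, size_t *end)
{
    assert(text != NULL && start != NULL && end != NULL && pos < len);

    CharClass cc = classify((unsigned char)text[pos], wordchars);
    size_t s = pos;
    size_t e = pos + 1;

    while (s > 0 && classify((unsigned char)text[s - 1], wordchars) == cc)
        s--;
    while (e < len && classify((unsigned char)text[e], wordchars) == cc)
        e++;

    *start = s;
    *end = e;
}
```

With the default set, "foo_bar->baz" splits into the word "foo_bar", the punctuation run "->", and the word "baz", which is the behaviour a search on whole words would see.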
For a programming language, find usage should find the symbol, and if Geany uses Scintilla words to decide which characters are part of the symbol, then the Scintilla wordchars definition must match the programming language's symbol definition. So Geany may have to set it based on the filetype.
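A minimal sketch of what "set it based on the filetype" could look like, in C. The table, its entries and wordchars_for_filetype() are all hypothetical and only illustrate the idea; the values are not taken from Geany's real filetype definitions. The resulting string would be what gets handed to Scintilla (via SCI_SETWORDCHARS, IIUC).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Default set: ASCII letters, digits and underscore. */
#define DEFAULT_WORDCHARS \
    "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

typedef struct {
    const char *filetype;   /* filetype name (hypothetical keys) */
    const char *wordchars;  /* characters that may appear in a symbol */
} WordcharsEntry;

/* Illustrative entries only: languages whose symbols contain
 * characters outside the default set. */
static const WordcharsEntry wordchars_table[] = {
    { "PHP", DEFAULT_WORDCHARS "$" },
    { "CSS", DEFAULT_WORDCHARS "-" },
};

/* Return the wordchars string for a filetype, falling back to the
 * default set when the filetype has no entry of its own. */
static const char *wordchars_for_filetype(const char *filetype)
{
    assert(filetype != NULL);

    size_t n = sizeof wordchars_table / sizeof wordchars_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(wordchars_table[i].filetype, filetype) == 0)
            return wordchars_table[i].wordchars;
    }
    return DEFAULT_WORDCHARS;
}
```

The point of keeping this in one table is that there is a single place to check against the language definition, instead of another hard-coded list somewhere in the editor code.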
I don't see much point in Geany having its own definition; that's a fourth one to get out of sync, and in fact the original bug is because editor.c/read_current_word() has its own definition that doesn't match the filetype. I haven't examined the usages of this function in detail, but maybe it should be replaced by one that uses Scintilla.
For non-programming languages, where it's a different filetype, it can have a different set of wordchars to approximate natural language words, but comments in a program are going to be stuck with the programming language definition.
So then we have to ensure that the filetype wordchars lists are right, and that tagmanager and lexers have no obvious bugs. That's all :-)
Cheers Lex