Hi,
I'm trying to address bug 3386129 [1], and I'd like comments & reviews about my fix, because the whole thing doesn't look obvious at all...
We already have 2 ways of determining what a "word" is: a manual one using GEANY_WORDCHARS or a caller-given list of wordchars, and one that uses Scintilla's word boundaries.
The former seems to make more sense when the caller code knows the kind of characters it wants (e.g. tags lookups), but the latter is better when getting the word to search for. Currently the latter is only used to select the current word, and the former everywhere else we want to get the current word.
Actually if we do a whole word search we *need* to get what Scintilla sees as a "word" since we leave the search to it (see bug report for an example of what's going wrong otherwise).
So in the attached patch, I added an alternative way to get the current word (that uses the same algorithm as the word selection) and tried to use it whenever the word was fetched for a search.
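To make the "manual" approach concrete, here is a minimal sketch, not Geany's actual code (the function name and signature are ours): expand around a position over a caller-given set of wordchars, which is essentially what the GEANY_WORDCHARS-based path does.

```c
#include <string.h>

/* Hypothetical sketch of the "manual" word lookup: expand around
 * position pos over the characters listed in wordchars, writing the
 * word's bounds to *start (inclusive) and *end (exclusive). */
static void read_word_bounds(const char *text, int pos,
                             const char *wordchars,
                             int *start, int *end)
{
    *start = pos;
    *end = pos;
    /* walk left while the previous character is a word character */
    while (*start > 0 && strchr(wordchars, text[*start - 1]) != NULL)
        (*start)--;
    /* walk right while the current character is a word character */
    while (text[*end] != '\0' && strchr(wordchars, text[*end]) != NULL)
        (*end)++;
}
```

The Scintilla-based alternative would instead ask the view itself (via its word-boundary messages), which is exactly why the two can disagree when their character classes differ.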
I'd appreciate reviews & comments :)
Cheers, Colomban
[1] https://sourceforge.net/tracker/?func=detail&atid=787791&aid=3386129...
On Fri, 19 Aug 2011 18:10:42 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Hi,
I'm trying to address bug 3386129 [1], and I'd like comments & reviews about my fix, because the whole thing doesn't look obvious at all...
We already have 2 ways of determining what a "word" is: a manual one using GEANY_WORDCHARS or a caller-given list of wordchars, and one that uses Scintilla's word boundaries.
3? Shouldn't we have symbolchars for the current programming language ([A-Za-z_] if unknown), and wordchars that match the current locale? They don't have much in common.
The former seems to make more sense when the caller code knows the kind of characters it wants (e.g. tags lookups), but the latter is better when getting the word to search for.
Looks like symbolchars and wordchars to me; when writing 3rd party software for a specific country, or using Geany as a text editor, it's common to search for locale words.
Actually if we do a whole word search we *need* to get what Scintilla sees as a "word" since we leave the search to it (see bug report for an example of what's going wrong otherwise).
There is always a SCI_SETWORDCHARS... Hmmm, we even use it to set the sci wordchars to the filetype wordchars if we don't know the exact lexer or something? Well, I guess it's really non-trivial.
So in the attached patch, I added an alternative way to get the current word (that uses the same algorithm as the word selection) and tried to use it whenever the word was fetched for a search.
Makes sense to me. Though I'm not sure about that SCI_SETWORDCHARS we use in highlighting:styleset_common().
I'd appreciate reviews & comments :)
+ guint start;
+ guint end;
We always use gint for positions. Nothing else suspicious, at least at first sight.
On 20 August 2011 03:40, Dimitar Zhekov dimitar.zhekov@gmail.com wrote:
On Fri, 19 Aug 2011 18:10:42 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Hi,
I'm trying to address bug 3386129 [1], and I'd like comments & reviews about my fix, because the whole thing doesn't look obvious at all...
We already have 2 ways of determining what a "word" is: a manual one using GEANY_WORDCHARS or a caller-given list of wordchars, and one that uses Scintilla's word boundaries.
3? Shouldn't we have symbolchars for the current programming language ([A-Za-z_] if unknown), and wordchars that match the current locale? They don't have much in common.
By wordchars we mean symbolchars, this confusion has existed from the beginnings of C at least, and we ain't gonna change it now. :-)
Locale/human language word ends are not as simple as sets of characters so lets not go there, we would need something like IIUC to do that.
The former seems to make more sense when the caller code knows the kind of characters it wants (e.g. tags lookups), but the latter is better when getting the word to search for.
Shouldn't the tags be using the same definition of word chars as Scintilla's highlighting? I don't trust "knowing" stuff in two places, they will never match :-) I understand that it might be a bit of work to hack tagmanager into line though.
[...]
There is always a SCI_SETWORDCHARS... Hmmm, we even use it to set the sci wordchars to the filetype wordchars if we don't know the exact lexer or something? Well, I guess it's really non-trivial.
We should be always setting Scintilla's wordchars from the filetype file, although IIUC a few lexers think they know better and ignore them.
So in the attached patch, I added an alternative way to get the current word (that uses the same algorithm as the word selection) and tried to use it whenever the word was fetched for a search.
Makes sense to me. Though I'm not sure about that SCI_SETWORDCHARS we use in highlighting:styleset_common().
Required, to make highlighting match word definitions (assuming lexer cooperation).
[...] Nothing else suspicious, at least at first sight.
+1
Maybe everything should use the filetype wordchars definition, with GEANY_WORDCHARS moved to filetypes.common as the default.
Cheers Lex
On Sat, 20 Aug 2011 09:33:29 +1000 Lex Trotman elextr@gmail.com wrote:
Shouldn't we have symbolchars for the current programming language ([A-Za-z_] if unknown), and wordchars that match the current locale? They don't have much in common.
By wordchars we mean symbolchars, this confusion has existed from the beginnings of C at least, and we ain't gonna change it now. :-)
It never did, at least not for the countries where the latin letters are not [the base of] the alphabet.
Locale/human language word ends are not as simple as sets of characters so lets not go there, we would need something like IIUC to do that.
Scintilla and regex use charsets, so we can't jump beyond that. Sorry, Lex, I have to side with Colomban here: the (locale) word chars are good for word searching, (symbol) word chars for tags, find usage etc.
Maybe everything should use the filetype wordchars definition, with GEANY_WORDCHARS moved to filetypes.common as the default.
Now that would probably be the first editor which works with UTF-8, but can't do locale word search. :)
On 20 August 2011 23:11, Dimitar Zhekov dimitar.zhekov@gmail.com wrote:
On Sat, 20 Aug 2011 09:33:29 +1000 Lex Trotman elextr@gmail.com wrote:
Shouldn't we have symbolchars for the current programming language ([A-Za-z_] if unknown), and wordchars that match the current locale? They don't have much in common.
By wordchars we mean symbolchars, this confusion has existed from the beginnings of C at least, and we ain't gonna change it now. :-)
It never did, at least not for the countries where the latin letters are not [the base of] alphabet.
Ah well, C was invented in an English speaking country & we native English speakers are not smart enough to handle more than 26 letters :-D
Locale/human language word ends are not as simple as sets of characters so lets not go there, we would need something like IIUC to do that.
s/IIUC/ICU/ to correct for fumble fingers.
Scintilla and regex use charsets, so we can't jump beyond that. Sorry,
I agree, that's what I said above; but then it doesn't do words as in natural-language words, so what exactly do you mean? I agree, but I'm confused.
Lex, I have to side with Colomban here: the (locale) word chars are good for word searching, (symbol) word chars for tags, find usage etc.
I don't think Geany does words, it only ever does symbols, see my question below.
Maybe everything should use the filetype wordchars definition, with GEANY_WORDCHARS moved to filetypes.common as the default.
Now that would probably be the first editor which works with UTF-8, but can't do locale word search. :)
Can you point me to where Geany loads locale dependent wordchars and where it gets it from?
Cheers Lex
On 20/08/2011 15:58, Lex Trotman wrote:
On 20 August 2011 23:11, Dimitar Zhekov dimitar.zhekov@gmail.com wrote:
[…] Scintilla and regex use charsets, so we can't jump beyond that. Sorry,
I agree, thats what I said above, but therefore it doesn't do words as in natural language words, so what exactly do you mean, I agree, but I'm confused?
Lex, I have to side with Colomban here: the (locale) word chars are good for word searching, (symbol) word chars for tags, find usage etc.
I don't think Geany does words, it only ever does symbols, see my question below.
Actually (as supposed in another mail), I guess that a "word" in Scintilla's opinion when doing a word search is "everything non-empty between two whitespace characters", so we DO both.
My patch tries to use this definition of a "word" for search, and keeps the other ("a sequence of wordchars") for symbol stuff.
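The definition above ("everything non-empty between two whitespace characters") can be sketched as a whole-word test; this is our own illustration, not Scintilla's or Geany's code, and the whitespace set passed in is the caller's (e.g. Geany's whitespace_chars setting):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of a whole-word check: a match of length len starting at
 * start is a whole word when the characters just before and after it
 * belong to the whitespace class (or the buffer ends there). */
static bool is_whole_word_match(const char *text, int start, int len,
                                const char *whitespace_chars)
{
    bool left_ok = (start == 0) ||
                   strchr(whitespace_chars, text[start - 1]) != NULL;
    char after = text[start + len];
    bool right_ok = (after == '\0') ||
                    strchr(whitespace_chars, after) != NULL;
    return left_ok && right_ok;
}
```

With punctuation included in the whitespace set (as Geany's whitespace_chars does), "regcomp" inside "(regcomp (&" counts as a whole word; with only blanks it would not.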
Cheers, Colomban
On Sat, 20 Aug 2011 23:58:47 +1000 Lex Trotman elextr@gmail.com wrote:
Can you point me to where Geany loads locale dependent wordchars and where it gets it from?
Scintilla does it... oh, wait. Neither scintilla nor scite can find the word in ‘боза’, or even ‘boza’ (grep does, for 8-bit text). *Gee*.
On Sat, 20 Aug 2011 16:46:16 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Actually (as supposed in another mail), I guess that a "word" in Scintilla's opinion when doing a word search is "everything non-empty between two whitespace characters", so we DO both.
Like the famous "...consists of sequences of non-blank characters separated by blanks". :)
My patch tries to use this definition of a "word" for search, and keeps the other ("a sequence of wordchars") for symbol stuff.
Well, since using "Find previous/next selection" on a non-selected "g_new0(gchar" still finds the next/previous g_new0 or gchar (depending on the cursor position), and not the entire text, I guess it's all right... OTOH, finding "regcomp (" in "(regcomp (&" still doesn't succeed. Guess I can't have it both ways.
On 20/08/2011 18:06, Dimitar Zhekov wrote:
On Sat, 20 Aug 2011 23:58:47 +1000 Lex Trotman elextr@gmail.com wrote:
Can you point me to where Geany loads locale dependent wordchars and where it gets it from?
Scintilla does it... oh, wait. Neither scintilla nor scite can find the word in ‘боза’, or even ‘boza’ (grep does, for 8-bit text). *Gee*.
Not sure I get it?
On Sat, 20 Aug 2011 16:46:16 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Actually (as supposed in another mail), I guess that a "word" in Scintilla's opinion when doing a word search is "everything non-empty between two whitespace characters", so we DO both.
Like the famous "...consists of sequences of non-blank characters separated by blanks". :)
Yep, that's it ^^
My patch tries to use this definition of a "word" for search, and keeps the other ("a sequence of wordchars") for symbol stuff.
Well, since using "Find previous/next selection" on a non-selected "g_new0(gchar" still finds the next/previous g_new0 or gchar (depending on the cursor position), and not the entire text, I guess it's all right...
Actually, AFAICT, it'll only find more things, since when the "whole word" mode is activated we will be sure that the "word" we give will be recognized as such by Scintilla. And if you want, you *can* make "foo(bar" count as one word, by removing "(" from the whitespace chars.
I think I'll commit it, we could unify the thing later anyway if we find a clear and working unified method. No objections?
OTOH, finding "regcomp (" in "(regcomp (&" still doesn't succeed. Guess I can't have it both ways.
Hum, sorry? How could finding "regcomp (" in a document containing "(regcomp (&" fail?
Cheers, Colomban
On Sat, 20 Aug 2011 19:45:36 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Scintilla does it... oh, wait. Neither scintilla nor scite can find the word in ‘боза’, or even ‘boza’ (grep does, for 8-bit text).
Not sure I get it?
For Scintilla/Geany, "boza" or "боза", enclosed in non-ascii quotes, is not a word any more.
I think I'll commit it, we could unify the thing later anyway if we find a clear and working unified method. No objections?
Not from me.
Well, since using "Find previous/next selection" on a non-selected "g_new0(gchar" still finds the next/previous g_new0 or gchar [...]
OTOH, finding "regcomp(" in "(regcomp(&" still doesn't succeed. Guess I can't have it both ways.
Hum, sorry? How could finding "regcomp(" in a document containing "(regcomp (&" fail?
Word-finding "regcomp(" will not find it in "(regcomp(&", despite being enclosed by punctuation characters.
On 20/08/2011 20:02, Dimitar Zhekov wrote:
On Sat, 20 Aug 2011 19:45:36 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Scintilla does it... oh, wait. Neither scintilla nor scite can find the word in ‘боза’, or even ‘boza’ (grep does, for 8-bit text).
Not sure I get it?
For Scintilla/Geany, "boza" or "боза", enclosed in non-ascii quotes, is not a word any more.
Ah OK, got it. Yeah, it doesn't detect the quote as a "blank char", so it doesn't fit "...consists of sequences of non-blank characters separated by blanks".
We could maybe add the non-ASCII quotes to the default blank chars (assuming using non-ASCII chars here works) -- though there may be way too many "blank chars" to cover.
Maybe Scintilla could use a more complex algorithm to find the start/end of a Unicode word (Pango does it, but it seems to be a hard piece of work), and it'd only work on Unicode data. And it would need a clever way of integrating the wordchars/blankchars, maybe simply ((unicode_is_word() || is_wordchar()) && ! is_blankchars())...
Well, since using "Find previous/next selection" on a non-selected "g_new0(gchar" still finds the next/previous g_new0 or gchar [...]
OTOH, finding "regcomp(" in "(regcomp(&" still doesn't succeed. Guess I can't have it both ways.
Hum, sorry? How could finding "regcomp(" in a document containing "(regcomp (&" fail?
Word-finding "regcomp(" will not find it in "(regcomp(&", despite being enclosed by punctuation characters.
Ah, yeah word search. Again, doesn't fit in "...consists of a sequence of non-blank characters separated by blanks" :(
On Sat, 20 Aug 2011 20:19:56 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
For Scintilla/Geany, "boza" or "боза", enclosed in non-ascii quotes, is not a word any more.
Ah OK, got it. Yeah, it doesn't detect the quote as "blank chars", so doesn't fit in "...consists of sequences of non-blank characters separated by blanks".
If you enclose boza in ascii quotes, it doesn't fit the definition either, but is recognized as a word. There is no reason to guess, Scintilla is open source, and here is the exact definition:
/**
 * Check that the character at the given position is a word or
 * punctuation character and that the previous character is of
 * a different character class.
 */
bool Document::IsWordStartAt(int pos) {
And the same goes for word end and whole word. Weird.
On 20/08/2011 20:56, Dimitar Zhekov wrote:
On Sat, 20 Aug 2011 20:19:56 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
For Scintilla/Geany, "boza" or "боза", enclosed in non-ascii quotes, is not a word any more.
Ah OK, got it. Yeah, it doesn't detect the quote as "blank chars", so doesn't fit in "...consists of sequences of non-blank characters separated by blanks".
If you enclose boza in ascii quotes, it doesn't fit the definition either, but is recognized as a word.
Yes it does, since Geany defines whitespace chars to include the quote (see filetypes.common):
whitespace_chars=\s\t!"#$%&'()*+,-./:;<=>?@[\]^`{|}~
There is no reason to guess, Scintilla is open source, and here is the exact definition:
True enough, I should have thought of digging into that weird C++ thing :-'
/**
 * Check that the character at the given position is a word or
 * punctuation character and that the previous character is of
 * a different character class.
 */
bool Document::IsWordStartAt(int pos) {
And the same goes for word end and whole word. Weird.
Hi Guys,
So to summarise the thread:
1. Natural language word breaks are hard, we don't want to go there.
Refer to http://www.unicode.org/reports/tr29/ from which I quote "programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user."
As Colomban said, and IIUC, the algorithm in this standard is used in Pango and ICU.
It requires Unicode data tables which Geany and Scintilla do not have.
2. As Dimitar said Scintilla has its own definition of word start and end, leading to a word being a contiguous sequence of word chars or a contiguous sequence of punctuation chars. Where a word char in UTF-8 mode is >=0x80 or ( (isalnum or _) or the setting from Geany ). It is going to use this definition when searching.
3. Scintilla lexers use the programming language definition to highlight, AFAICT most don't use wordchars above.
4. Tagmanager parsers also use the programming language definition, which should match the lexer in an ideal world since most programming languages are precisely defined.
As Colomban said it is too much work to make tagmanager and lexer use the same data, but IMHO if they disagree then it's a bug, since programming languages clearly define what Dimitar called symbols.
For a programming language find usage should find the symbol, and if Geany uses Scintilla words to decide what characters are part of the symbol that means that the Scintilla wordchars definition must match the programming language symbol definition. So Geany may have to set it based on the filetype.
I don't see much point in Geany having its own definition, that's a fourth one to get out of sync, and in fact the original bug exists because editor.c/read_current_word() has its own definition that doesn't match the filetype. Although I haven't examined the usages of this function in detail, maybe it should be replaced by one that uses Scintilla.
For non-programming languages, where it's a different filetype, it can have a different set of wordchars to make an approximation to natural-language words; but for comments in a program, it's going to be stuck with the programming language definition.
So then we have to ensure that the filetype wordchars lists are right, and that tagmanager and lexers have no obvious bugs. That's all :-)
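Point 2's character classes can be sketched in a few lines; this is our own illustration of the described default (the function name is ours, and SCI_SETWORDCHARS would replace the ASCII part in practice):

```c
#include <ctype.h>
#include <stdbool.h>

/* Sketch of the default word-character test described in point 2:
 * in UTF-8 mode any byte >= 0x80 counts as a word character, plus
 * ASCII alphanumerics and '_'. */
static bool is_default_wordchar(unsigned char ch)
{
    return ch >= 0x80 || isalnum(ch) || ch == '_';
}
```

Note how this makes every multi-byte UTF-8 sequence a "word" sequence wholesale, which is exactly why it cannot match a per-language symbol definition like Lua's.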
Cheers Lex
On 21/08/2011 05:07, Lex Trotman wrote:
Hi Guys,
So to summarise the thread:
1. Natural language word breaks are hard, we don't want to go there.
Refer to http://www.unicode.org/reports/tr29/ from which I quote "programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user."
As Columban said and IIUC, the algorithm in this standard is used in Pango and ICU.
Yeah, it's hard, and not necessarily useful for Geany's goals -- having it wouldn't be a problem, but implementing it would be.
It requires Unicode data tables which Geany and Scintilla do not have.
That's not (completely?) true however. GLib has at least a part of the Unicode tables, and both Scintilla and Geany depend on Pango through GTK anyway. However, I doubt Geany can do much on its side, and I doubt Scintilla wants a hard dependency on Pango for non-GTK platforms, or to behave differently depending on the platform.
So basically yes, 1 is unlikely to be really fixed.
2. As Dimitar said Scintilla has its own definition of word start and end, leading to a word being a contiguous sequence of word chars or a contiguous sequence of punctuation chars. Where a word char in UTF-8 mode is >=0x80 or ( (isalnum or _) or the setting from Geany ). It is going to use this definition when searching.
3. Scintilla lexers use the programming language definition to highlight, AFAICT most don't use wordchars above.
4. Tagmanager parsers also use the programming language definition, which should match the lexer in an ideal world since most programming languages are precisely defined.
As Colomban said it is too much work to make tagmanager and lexer use the same data, but IMHO if they disagree then it's a bug, since programming languages clearly define what Dimitar called symbols.
Yes, if tagmanager and a SCI lexer disagree on what is a "symbol", there is probably a bug. However, as I already noted, I think SCI lexers don't need to be as precise as tagmanager's parsers, since they probably don't care about words except for highlighting keywords.
For a programming language find usage should find the symbol, and if Geany uses Scintilla words to decide what characters are part of the symbol that means that the Scintilla wordchars definition must match the programming language symbol definition. So Geany may have to set it based on the filetype.
I don't see much point in Geany having its own definition, thats a fourth one to get out of sync and in fact the original bug is because editor.c/read_current_word() has its own definition that doesn't match the filetype.
If you allow the user to change Scintilla's wordchars (which the wordchars and whitespace_chars settings allow), then they're likely not to match tagmanager's. And here comes the problem when looking up a tagmanager symbol from a Scintilla word.
So either we require the filetype to keep the wordchars matching tagmanager's (and more or less what the language sees as a symbol), and thus don't allow what the user wanted to do in the first place (make "$" part of a Scintilla word for navigation and selection facilities).
Although I havn't examined the usages of this function in detail, maybe it should be replaced by one that uses Scintilla.
Depends on whether we allow the user to tune Scintilla's wordchars as she wants or not, see above.
For non-programming languages, where its a different filetype then it can have a different set of wordchars to make an approximation to
Actually here it's more whitespace_chars that'd need to include more stuff, but yeah.
natural language words, but for comments in a program, its going to be stuck with the programming language definition.
Which won't be a problem 90% of the time, since Good Practice (tm) wants a programmer to write their comments in pure ASCII/English :)
So then we have to ensure that the filetype wordchars lists are right, and that tagmanager and lexers have no obvious bugs. Thats all :-)
...and we don't support what the user wanted to do?
Cheers, Colomban
[...]
That's not (completely?) true however. GLib has at least a part of the Unicode tables, and both Scintilla and Geany depend on Pango through GTK anyway. However, I doubt Geany can do much on its side, and I doubt Scintilla wants a hard dependency on Pango for non-GTK platforms, or to behave differently depending on the platform.
True.
So basically yes, 1 is unlikely to be really fixed.
[...]
As Colomban said it is too much work to make tagmanager and lexer use the same data, but IMHO if they disagree then it's a bug, since programming languages clearly define what Dimitar called symbols.
Yes, if tagmanager and a SCI lexer disagree on what is a "symbol", there is probably a bug. However, as I already noted, I think SCI lexers don't need to be as precise as tagmanager's parsers, since they probably don't care about words except for highlighting keywords.
The lexers care about words if they use them as the definition of symbols. At least the C/C++ lexer needs to know symbols to highlight typenames.
[...]
If you allow the user to change Scintilla's wordchars (which the wordchars and whitespace_chars settings allow), then they're likely not to match tagmanager's. And here comes the problem when looking up a tagmanager symbol from a Scintilla word.
So either we require the filetype to keep the wordchars matching tagmanager's (and more or less what the language sees as a symbol), and thus don't allow what the user wanted to do in the first place (make "$" part of a Scintilla word for navigation and selection facilities).
I don't know PHP, but it seems like the $ is part of the symbol, so it should be part of the word. Users can always break things by editing config files so I wouldn't worry about them changing wordchars, so long as the comment in the config mentions it affects Scintilla only.
Lots of confusion seems to be caused by the fact that wordchars are being used as symbolchars...
[...]
So then we have to ensure that the filetype wordchars lists are right, and that tagmanager and lexers have no obvious bugs. That's all :-)
...and we don't support what the user wanted to do?
IIUC the user wanted it to be consistent, i.e. $ is part of the symbol.
If $ is part of the PHP filetype's wordchars it would be consistent if we used Scintilla word to select the usage search pattern as suggested?
Cheers Lex
PS I just realised that some languages, e.g. Lua, allow any Unicode letters in symbols; the Scintilla definition can't be right for them since it treats all characters >= 0x80 as wordchars, and a quick look at the Lua lexer seems to show it also treats anything > 0x80 as a symbolchar. But the Lua tagmanager parser uses any non-blank sequence before '=' or '('. No match!! Essentially we can't win for some languages, so just make the minimal change that doesn't break things.
On 22/08/2011 03:14, Lex Trotman wrote:
[...]
As Colomban said it is too much work to make tagmanager and lexer use the same data, but IMHO if they disagree then it's a bug, since programming languages clearly define what Dimitar called symbols.
Yes, if tagmanager and a SCI lexer disagree on what is a "symbol", there is probably a bug. However, as I already noted, I think SCI lexers don't need to be as precise as tagmanager's parsers, since they probably don't care about words except for highlighting keywords.
The lexers care about words if they use them as the definition of symbols. At least the C/C++ lexer needs to know symbols to highlight typenames.
True.
[...]
If you allow the user to change Scintilla's wordchars (which the wordchars and whitespace_chars settings allow), then they're likely not to match tagmanager's. And here comes the problem when looking up a tagmanager symbol from a Scintilla word.
So either we require the filetype to keep the wordchars matching tagmanager's (and more or less what the language sees as a symbol), and thus don't allow what the user wanted to do in the first place (make "$" part of a Scintilla word for navigation and selection facilities).
I don't know PHP, but it seems like the $ is part of the symbol, so it should be part of the word. Users can always break things by editing config files so I wouldn't worry about them changing wordchars, so long as the comment in the config mentions it affects Scintilla only.
[...]
So then we have to ensure that the filetype wordchars lists are right, and that tagmanager and lexers have no obvious bugs. That's all :-)
...and we don't support what the user wanted to do?
IIUC the user wanted it to be consistent, i.e. $ is part of the symbol.
If $ is part of the PHP filetype's wordchars it would be consistent if we used Scintilla word to select the usage search pattern as suggested?
But then tagmanager's and Scintilla's definitions of a PHP symbol would differ, since tagmanager doesn't include $ in the symbols. And then if we use the Scintilla definition of a word to do a tag search, it would fail if it included a $ :(
So yeah, for searching it is better (what my patch [1] tried to implement), but it wouldn't be for tag lookup.
PS I just realised that some languages, e.g. Lua, allow any Unicode letters in symbols; the Scintilla definition can't be right for them since it treats all characters >= 0x80 as wordchars, and a quick look at the Lua lexer seems to show it also treats anything > 0x80 as a symbolchar. But the Lua tagmanager parser uses any non-blank sequence before '=' or '('. No match!! Essentially we can't win for some languages, so just make the minimal change that doesn't break things.
Well... either fix one or the other, or assume it's not important. I don't know Lua but I'd be tempted to guess both definitions will actually be OK in a well-formed file (e.g. that the difference between both only includes invalid syntax). But of course I can be completely wrong, since I don't know Lua.
Cheers, Colomban
[1] applied as r5895
On 20/08/2011 20:02, Dimitar Zhekov wrote:
On Sat, 20 Aug 2011 19:45:36 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote: [...]
I think I'll commit it, we could unify the thing later anyway if we find a clear and working unified method. No objections?
Not from me.
Committed as r5895.
On 20/08/2011 01:33, Lex Trotman wrote:
On 20 August 2011 03:40, Dimitar Zhekov dimitar.zhekov@gmail.com wrote:
On Fri, 19 Aug 2011 18:10:42 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Hi,
I'm trying to address bug 3386129 [1], and I'd like comments & reviews about my fix, because the whole thing doesn't look obvious at all...
We already have 2 ways of determining what a "word" is: a manual one using GEANY_WORDCHARS or a caller-given list of wordchars, and one that uses Scintilla's word boundaries.
3? Shouldn't we have symbolchars for the current programming language ([A-Za-z_] if unknown), and wordchars that match the current locale? They don't have much in common.
By wordchars we mean symbolchars, this confusion has existed from the beginnings of C at least, and we ain't gonna change it now. :-)
Locale/human language word ends are not as simple as sets of characters so lets not go there, we would need something like IIUC to do that.
Yeah, I didn't mean anything about locale characters. However Scintilla seems to "magically" handle them correctly... maybe because of whitespacechars.
The former seems to make more sense when the caller code knows the kind of characters it wants (e.g. tags lookups), but the latter is better when getting the word to search for.
Shouldn't the tags be using the same definition of word chars as Scintilla's highlighting? I don't trust "knowing" stuff in two places, they will never match :-) I understand that it might be a bit of work to hack tagmanager into line though.
I'm afraid it'd be more than "a bit of work": most lexers won't work correctly if the symbolchars are not the expected ones. See the example from the bug I refer to in my first mail: the reporter wanted to add "$" to the PHP wordchars for it to be selected as part of the variable. Choosing whether to have the "$" in the symbolchars is NOT possible from a parser's POV. So either it's not configurable or it's not used for parsing.
Scintilla lexers are a bit less concerned about this than tagmanager, since they generally don't mind much about the file content and only take care of a few control sequences (blocks, comments, etc.), plus of course the keywords, which might use wordchars.
Actually the rationale for using Scintilla's representation of a "word" when we do a search is that if we do a "full word" search and the peeked "word" doesn't match the one Scintilla would have peeked, the search won't return the expected results.
[...]
There is always a SCI_SETWORDCHARS... Hmmm, we even use it to set the sci wordchars to the filetype wordchars if we don't know the exact lexer or something? Well, I guess it's really non-trivial.
We should be always setting Scintilla's wordchars from the filetype file, although IIUC a few lexers think they know better and ignore them.
I'm not completely sure what Scintilla uses wordchars for, but in combination with "whitespace characters" it seems to use them for keyboard navigation [1].
So in the attached patch, I added an alternative way to get the current word (that uses the same algorithm as the word selection) and tried to use it whenever the word was fetched for a search.
Makes sense to me. Though I'm not sure about that SCI_SETWORDCHARS we use in highlighting:styleset_common().
Required, to make highlighting match word definitions (assuming lexer cooperation).
[...] Nothing else suspicious, at least at first sight.
+1
Maybe everything should use the filetype wordchars definition, with GEANY_WORDCHARS moved to filetypes.common as the default.
Well, although it looks sensible at first sight (and probably would be in most situations), we have a few places where we tune wordchars for special cases, like:
callbacks.c:988: search for a color representation, adds # to wordchars
callbacks.c:1618: search for a filename
However, maybe these should use something specific rather than the other way around.
Cheers, Colomban
[1] http://www.scintilla.org/ScintillaDoc.html#SCI_SETWORDCHARS
Le 19/08/2011 19:40, Dimitar Zhekov a écrit :
On Fri, 19 Aug 2011 18:10:42 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Hi,
I'm trying to address bug 3386129 [1], and I'd like comments & reviews about my fix, because the whole thing don't look obvious at all...
We already have 2 ways of determining what a "word" is: a manual one using GEANY_WORDCHARS or a caller-given list of wordchars, and one that uses Scintilla's word boundaries.
3?
No, my sentence was misleading, sorry.
1) a manual one using GEANY_WORDCHARS or a caller-given list of wordchars
2) one that uses Scintilla's word boundaries
[...]
I'd appreciate reviews & comments :)
- guint start;
- guint end;
We always use gint for positions. Nothing else suspicious, at least from a first sight.
True. Though actually it only prevents a "signed vs. unsigned" warning, but I'll fix this.
Cheers, Colomban
On 22 August 2011 20:14, Thomas Martitz thomas.martitz@student.htw-berlin.de wrote:
Am 19.08.2011 18:10, schrieb Colomban Wendling:
Hi,
Just curious: how does searching for whole words in Find in Files add to this mix? IIUC it uses grep's -w option?
Only in that it uses the same technique to select the default search string. But you get the chance to edit it for FIF; you don't get that chance for Find, so Find needs to be more correct or it's not useful.
Cheers Lex
Best regards.
Am 22.08.2011 12:59, schrieb Lex Trotman:
On 22 August 2011 20:14, Thomas Martitz thomas.martitz@student.htw-berlin.de wrote:
Just curious: how does searching for whole words in Find in Files add to this mix? IIUC it uses grep's -w option?
Only in that it uses the same technique to select the default search string. But you get the chance to edit it for FIF; you don't get that chance for Find, so Find needs to be more correct or it's not useful.
Uhm, I mean that for FIF, grep decides about the word boundaries, which may differ from GEANY_WORDCHARS and everything discussed here, no?
Best regards.
Le 22/08/2011 13:15, Thomas Martitz a écrit :
Am 22.08.2011 12:59, schrieb Lex Trotman:
On 22 August 2011 20:14, Thomas Martitz thomas.martitz@student.htw-berlin.de wrote:
Just curious: how does searching for whole words in Find in Files add to this mix? IIUC it uses grep's -w option?
Only in that it uses the same technique to select the default search string. But you get the chance to edit it for FIF; you don't get that chance for Find, so Find needs to be more correct or it's not useful.
Uhm, I mean that for FIF, grep decides about the word boundaries, which may differ from GEANY_WORDCHARS and everything discussed here, no?
Yeah, yet another definition. Though this one is, according to the manual:
Word-constituent characters are letters, digits, and the underscore.
So it's the same as GEANY_WORDCHARS, but not Scintilla's definition. And the algorithm doesn't include any non-ASCII characters, making e.g. a word search for "hé" match "héhé" (the second byte of the first "é" being treated as a separator).
I'm afraid making everything use one single definition will be near impossible :( But as Lex highlighted, in FIF it's less of an issue since the user has a chance to edit the grabbed word.
Cheers, Colomban
On Mon, 22 Aug 2011 14:43:35 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Uhm, I mean that for FIF, grep decides about the word boundaries, which may differ from GEANY_WORDCHARS and everything discussed here, no?
Yeah, yet another definition. Though this one is, according to the manual:
Word-constituent characters are letters, digits, and the underscore.
And the algorithm doesn't include any non-ASCII characters, making e.g. a word search for "hé" match "héhé" (the second byte of the first "é" being treated as a separator).
grep uses plain char and doesn't support UTF-8. But if your "héhé" fits in an 8-bit code page, and you have the proper LC_CTYPE set, it works. I checked this with cp1251 "боза" earlier when we discussed word finding (though I was 99% sure it'd work).
echo '@@боза@@' | grep -w '@боза@' works too, but echo 'а@боза@@' does not, and neither does '9@...' or '_@...' . So it checks the characters before and after the match for not being isalnum() or underscore.
Le 22/08/2011 19:16, Dimitar Zhekov a écrit :
On Mon, 22 Aug 2011 14:43:35 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
Uhm, I mean that for FIF, grep decides about the word boundaries, which may differ from GEANY_WORDCHARS and everything discussed here, no?
Yeah, yet another definition. Though this one is, according to the manual:
Word-constituent characters are letters, digits, and the underscore.
And the algorithm doesn't include any non-ASCII characters, making e.g. a word search for "hé" match "héhé" (the second byte of the first "é" being treated as a separator).
grep uses plain char and doesn't support UTF-8. But if your "héhé" fits in an 8-bit code page, and you have the proper LC_CTYPE set, it works. I checked this with cp1251 "боза" earlier when we discussed word finding (though I was 99% sure it'd work).
echo '@@боза@@' | grep -w '@боза@' works too, but echo 'а@боза@@' does not, and neither does '9@...' or '_@...' . So it checks the characters before and after the match for not being isalnum() or underscore.
...but б@боза@@ will match (at least using UTF-8), because б is a two-byte UTF-8 character, thus neither byte matches (isalnum() || '_') (they are > 0x80, actually 0xb1 0xd0).
On Mon, 22 Aug 2011 19:22:09 +0200 Colomban Wendling lists.ban@herbesfolles.org wrote:
grep uses plain char and doesn't support UTF-8. [...]
...but б@боза@@ will match (at least using UTF-8), [...]
Neither the character classes nor the upper/lower case conversions will work properly if you feed an 8-bit char program with UTF-8. grep -i -w produces lots of false positives and/or negatives, and so do the [] sets. Except for the 8859-1 characters, of course.