[Geany-Devel] pull request on GitHub, to add GeanyHighlightSelectedWords, into Geany Plugins

Fri May 29 11:29:04 UTC 2015

KMP calculates a table of how many characters it can skip if some of
the search string matches but not all of it.  Its more useful for
relatively large search strings since it can skip potentially a lot of
compares.  Since this use is always going to search the whole file, to
mark all occurrances rather than stop at the first/next match, its
probably worthwhile for any reasonable size search string despite the
preparation needed.

But I would really like to see good benchmarks, since with modern CPUs
and caches I would also suspect many of the comparisons on the naive
algorithm will be hidden by the memory access time and its simple
order of access might make better use of prefetches.

And of course there are the SIMD instructions in modern processors
just waiting to be exploited by the adventurous :)

Cheers
Lex

On 29 May 2015 at 20:49, Thomas Martitz <kugel at rockbox.org> wrote:
> Am 29.05.2015 um 12:44 schrieb marius buzea:
>>
>> Hello,
>>
>>
>> With KMP it is possible to search all occurrences of a m length string,
>> into a n length string,
>> using O(m+n) machine operations.    Next page:
>>          http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm
>> describes the algorithm.
>>
>>
>>
>> The KMP works well with the utf-8 encoding of unicode.    One property of
>> utf8 is that
>> the encoding one unicode symbol is not a substring of another utf8
>> substring.      This
>> property allows to take the utf-8 encoding of the string you wish to
>> search, and to
>> find this utf8 encoding string, in the utf8 encoding of the text string.
>> Geany uses
>> scintilla, and scintilla uses utf8 to encode the document it displays, and
>> scintilla has
>> a command that gives the raw utf8 byte array for a [start, end) range.
>> So, KMP
>> gives great speed for searching all occurrences, and may be used with the
>> underlying
>> text representation of scintilla used by geany.     The utf-8 encoding of
>> a unicode
>> string of length n, is less than 6n, each utf8 encoding is at most 6
>> bytes.
>>
>>
>>
>> I also think that including this functionality/feature into Geany core
>> would be a good choice.
>> It would be a small tradeoff between keeping the core small, and adding
>> this new functionality,
>> but this is your choice.
>>
>>
>>
>> If you wish to extend automark, then this is good choice too.   If you
>> wish, and if it helps,
>> please reuse any part of the implementation provided here:
>>
>> http://sourceforge.net/p/geanyhighlightselectedword/code/HEAD/tree/trunk/GeanyHighlightSelectedWord/GeanyHighlightSelectedWord.c
>> If needed, I would help.
>>
>>
>> What should I do next?     Should I not do the pull request for
>> GeanyHighlightSelectedWord?
>> It is okay with me.    GeanyHighlightSelectedWord would then be still
>> available at sourceforge until
>> Geany provides this functionality from its core, or from automark.
>>
>
>
> I wonder if this algorithm should be applied to all searches, and thus be
> integrated into scintilla. Does it have any major drawbacks? I read it has
> to some kind "prefix table" prior to running the search, but I guess that's
> negligible for all reasonable search terms?
>
>
> Best regards
> _______________________________________________
> Devel mailing list
> Devel at lists.geany.org
> https://lists.geany.org/cgi-bin/mailman/listinfo/devel