搜索流的字符串有效的方法字符串、有效、方法

2023-09-10 23:48:53 作者:她很笨没有你根本不行~!

让我们假设有一个文本流(或阅读器在Java中),我想检查是否有特定的字符串。文本的流可能非常大,因此只要搜索字符串发现我想返回true,也尽量避免存储整个输入在内存中。

Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.

天真,我可能会尝试做这样的事情(在Java中):

Naively, I might try to do something like this (in Java):

public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    char[] buffer = new char[1024];
    int numCharsRead;
    while((numCharsRead = reader.read(buffer)) > 0) {
        if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
            return true;
    }
    return false;
}

当然,这将失败,如果它发生在每千缓冲器的边界上,以检测在给定的搜索字符串

Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:

搜寻文字:计算器 流缓冲1:ABC .........栈 流缓冲区2:溢出....... XYZ

Search text: "stackoverflow" Stream buffer 1: "abc.........stack" Stream buffer 2: "overflow.......xyz"

我怎么能修改此code,以便它正确地发现通过缓冲区的边界,但没有加载整个流到内存中给定的搜索字符串?

How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?

编辑:注意搜索的字符串流时,我们试图为减少的数量从流中读取(以避免在网络/磁盘延迟),并为保持内存占用不变不论数据流中的金额。在字符串匹配算法是次要的,但很明显的实效,这将是很好的发现,使用的溶液一个更有效的那些算法

Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.

推荐答案

我做了一些改动的克努特莫里斯普拉特算法局部搜索。由于实际的比较位置始终小于或等于下一个也没有必要额外存储器。在code有一个Makefile,也可在 github上它是写在HAXE目标多种编程语言的一次,包括Java。

I did a few changes to the Knuth Morris Pratt algorithm for partial searches. Since the actual comparison position is always less or equal than the next one there is no need for extra memory. The code with a Makefile is also available on github and it is written in Haxe to target multiple programming languages at once, including Java.

我还写了一篇相关的文章:搜索在流子:有轻微的克努特 - 莫里斯 - 普拉特算法HAXE 的修改。文中提到的雅加达的RegExp ,现在退休了,搁在Apache阁楼。雅加达正则表达式库匹配的在对RE类方法使用的CharacterIterator作为参数。

I also wrote a related article: searching for substrings in streams: a slight modification of the Knuth-Morris-Pratt algorithm in Haxe. The article mentions the Jakarta RegExp, now retired and resting in the Apache Attic. The Jakarta Regexp library "match" method in the RE class uses a CharacterIterator as a parameter.

class StreamOrientedKnuthMorrisPratt {
    var m: Int;
    var i: Int;
    var ss:
    var table: Array<Int>;

    public function new(ss: String) {
        this.ss = ss;
        this.buildTable(this.ss);
    }

    public function begin() : Void {
        this.m = 0;
        this.i = 0;
    }

    public function partialSearch(s: String) : Int {
        var offset = this.m + this.i;

        while(this.m + this.i - offset < s.length) {
            if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset,1)) {
                if(this.i == this.ss.length - 1) {
                    return this.m;
                }
                this.i += 1;
            } else {
                this.m += this.i - this.table[this.i];
                if(this.table[this.i] > -1)
                    this.i = this.table[this.i];
                else
                    this.i = 0;
            }
        }

        return -1;
    }

    private function buildTable(ss: String) : Void {
        var pos = 2;
        var cnd = 0;

        this.table = new Array<Int>();
        if(ss.length > 2)
            this.table.insert(ss.length, 0);
        else
            this.table.insert(2, 0);

        this.table[0] = -1;
        this.table[1] = 0;

        while(pos < ss.length) {
            if(ss.substr(pos-1,1) == ss.substr(cnd, 1))
            {
                cnd += 1;
                this.table[pos] = cnd;
                pos += 1;
            } else if(cnd > 0) {
                cnd = this.table[cnd];
            } else {
                this.table[pos] = 0;
                pos += 1;
            }
        }
    }

    public static function main() {
        var KMP = new StreamOrientedKnuthMorrisPratt("aa");
        KMP.begin();
        trace(KMP.partialSearch("ccaabb"));

        KMP.begin();
        trace(KMP.partialSearch("ccarbb"));
        trace(KMP.partialSearch("fgaabb"));

    }
}