查找两个字符串共享的所有正字长子的最大长度正字、长子、字符串、长度

2023-09-11 00:08:10 作者:violate(违背)

我正在制作一个Python脚本,可以找到两个字符串共享的所有正字长子的(最长的可能)长,无视尾随的标点。给定两个字符串:

I am working to produce a Python script that can find the (longest possible) length of all n-word-length substrings shared by two strings, disregarding trailing punctuation. Given two strings:

这是一个简单的字符串

这又是一个样本串

我希望脚本来识别这些字符串具有2个字中常见的序列(这是),随后的3个字中常见的序列(样本串)。这是我目前的做法:

I want the script to identify that these strings have a sequence of 2 words in common ("this is") followed by a sequence of 3 words in common ("a sample string"). Here is my current approach:

a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break

print matchStreakList

该脚本正确输出常见的字长子(2,3),并且已经这样做了所有测试迄今的(最大)长度。我的问题是:有没有对两个字符串的其中上述方法将无法正常工作?更重要的一点:是否可用于寻找所有的n个字的长度的子串的最大长度即两个字符串共享有现存Python库或公知的方法?

This script correctly outputs the (maximum) lengths of the common word-length substrings (2, 3), and has done so for all tests so far. My question is: Is there a pair of two strings for which the approach above will not work? More to the point: Are there extant Python libraries or well-known approaches that can be used to find the maximum length of all n-word-length substrings that two strings share?

[这个问题是从最长的公共子串的问题,这就是我要找的只是一个特例(因为我想找到所有常见的子串,而不仅仅是最长公共子)截然不同。 This SO帖子暗示的方法,如1)聚类分析,2)编辑距离程序,以及3)最长公共序列算法可能是合适的方法,但是我没有找到任何可行的解决方案,我的问题是,也许稍微更轻松的在,因为我处理由空格为界的话链接提及。]

[This question is distinct from the longest common substring problem, which is only a special case of what I'm looking for (as I want to find all common substrings, not just the longest common substring). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable approaches, but I didn't find any working solutions, and my problem is perhaps slightly easier that that mentioned in the link because I'm dealing with words bounded by whitespace.]

编辑:

我开始在这个问题上的赏金。在情况下,它会帮助别人,我想澄清的几个简单的点。首先,乐于助人的回答以下建议的@DhruvPathak没有找到两个字符串共享的所有最大长正字长子。例如,假设两个字符串,我们分析如下:

I'm starting a bounty on this question. In case it will help others, I wanted to clarify a few quick points. First, the helpful answer suggested below by @DhruvPathak does not find all maximally-long n-word-length substrings shared by two strings. For example, suppose the two strings we are analyzing are:

他们都是白色的一尘不染一张纸,当他们第一次出生   但他们是在通过每一个鹅毛管

"They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"

你是全白,可爱的,一尘不染的一张纸,当你第一次   是天生的;但你要潦草,并抹杀由每鹅   鹅毛笔

"You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

在这种情况下,最大长n字长子(不考虑尾随标点)的列表是:

In this case, the list of maximally long n-word-length substrings (disregarding trailing punctuation) is:

all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every

使用下列程序:

Using the following routine:

#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print matches(a,b)

人们得到的输出:

One gets output:

['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']

首先,我不知道如何可以从该列表中选择一个只包含整个单词的子串。其次,这份名单并不包括有,所需的最大长共通的n词长度子之一。是否有可以找到所有通过这两个字符串(你们都是......和他们都是......)?

In the first place, I am not sure how one could select from this list the substrings that contain only whole words. In the second place, this list does not include "are", one of the desired maximally-long common n-word-length substrings. Is there a method that will find all of the maximally long n-word-long substrings shared by these two strings ("You are all..." and "They all are...")?

推荐答案

还有歧义在这里,我不想花时间争论他们。但我想我可以添加一些有帮助的呢; - )

There are still ambiguities here, and I don't want to spend time arguing about them. But I think I can add something helpful anyway ;-)

我写的Python的 difflib.SequenceMatcher ,并花了的很多的时间找到预期的情况下快速的方法来寻找最长公共子串。从理论上讲,这应该(如果你想谷歌获得更多的短语在引号搜索项)完成与后缀树,或相关的后缀数组增加了最长的共同preFIX阵列。这些可以解决在最坏情况下的线性时间的问题。但是,因为有时的情况下,最坏情况下的线性时间的算法是极度复杂和微妙,并蒙受巨大常数因子 - 他们仍然可以还清巨大,如果给定的主体将要搜索的许多倍,但是这不是典型的案例Python的 difflib 和它看起来并不像你的情况下,无论是。

I wrote Python's difflib.SequenceMatcher, and spent a lot of time finding expected-case fast ways to find longest common substrings. In theory, that should be done with "suffix trees", or the related "suffix arrays" augmented with "longest common prefix arrays" (the phrases in quotes are search terms if you want to Google for more). Those can solve the problem in worst-case linear time. But, as is sometimes the case, the worst-case linear-time algorithms are excruciatingly complex and delicate, and suffer large constant factors - they can still pay off hugely if a given corpus is going to be searched many times, but that's not the typical case for Python's difflib and it doesn't look like your case either.

不管怎样,我在这里的贡献是重写 SequenceMatcher find_longest_match()方法返回的所有的发现沿途的(本地)的最大匹配。注:

Anyway, my contribution here is to rewrite SequenceMatcher's find_longest_match() method to return all the (locally) maximum matches it finds along the way. Notes:

我要使用 to_words()函数雷蒙德赫廷杰给你,但没有转换为小写。转换成小写导致输出是不完全像你说你想要的。 excel两个字符串相减 技能分享 一些数据有关的Excel小技巧

I'm going to use the to_words() function Raymond Hettinger gave you, but without the conversion to lower case. Converting to lower case leads to output that isn't exactly like what you said you wanted.

不过,正如我在注释中指出已经,这确实输出鹅毛笔,这是不是在您需要的输出列表。我不知道为什么它不是,因为鹅毛笔会出现在两个输入端。

Nevertheless, as I noted in a comment already, this does output "quill", which isn't in your list of desired outputs. I have no idea why it isn't, since "quill" does appear in both inputs.

这里的code:

import re
def to_words(text):
    'Break text into a list of words without punctuation'
    return re.findall(r"[a-zA-Z']+", text)

def match(a, b):
    # Make b the longer list.
    if len(a) > len(b):
        a, b = b, a
    # Map each word of b to a list of indices it occupies.
    b2j = {}
    for j, word in enumerate(b):
        b2j.setdefault(word, []).append(j)
    j2len = {}
    nothing = []
    unique = set() # set of all results
    def local_max_at_j(j):
        # maximum match ends with b[j], with length j2len[j]
        length = j2len[j]
        unique.add(" ".join(b[j-length+1: j+1]))
    # during an iteration of the loop, j2len[j] = length of longest
    # match ending with b[j] and the previous word in a
    for word in a:
        # look at all instances of word in b
        j2lenget = j2len.get
        newj2len = {}
        for j in b2j.get(word, nothing):
            newj2len[j] = j2lenget(j-1, 0) + 1
        # which indices have not been extended?  those are
        # (local) maximums
        for j in j2len:
            if j+1 not in newj2len:
                local_max_at_j(j)
        j2len = newj2len
    # and we may also have local maximums ending at the last word
    for j in j2len:
        local_max_at_j(j)
    return unique

然后:

a = "They all are white a sheet of spotless paper " \
    "when they first are born but they are to be " \
    "scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless " \
    "paper, when you first are born; but you are to " \
    "be scrawled and blotted by every goose's quill"

print match(to_words(a), to_words(b))

显示:

set(['all',
     'and blotted by every',
     'first are born but',
     'are to be scrawled',
     'are',
     'spotless paper when',
     'white a sheet of',
     'quill'])

修改 - 它是如何工作

有一个伟大的许多序列匹配和比对算法是最好理解为工作在2维矩阵,用规则来计算矩阵元素,后来除preting条目的意思。

A great many sequence matching and alignment algorithms are best understood as working on a 2-dimensional matrix, with rules for computing the matrix entries and later interpreting the entries' meaning.

有关的输入序列 A B ,图像矩阵 M LEN(一) LEN(B)列。在这个应用中,我们希望 M [I,J] 包含最长公共连续子与结束的长度[I] B [J] ,和计算规则的非常的方便:

For input sequences a and b, picture a matrix M with len(a) rows and len(b) columns. In this application, we want M[i, j] to contain the length of the longest common contiguous subsequence ending with a[i] and b[j], and the computational rules are very easy:

M [I,J] = 0 如果 A [1]!= B [J] M [I,J] = M [I-1,J-1] + 1 如果 A [1] == B〔 J] (在这里我们假设一个出界外基质标准默默地返回0)。 M[i, j] = 0 if a[i] != b[j]. M[i, j] = M[i-1, j-1] + 1 if a[i] == b[j] (where we assume an out-of-bounds matrix reference silently returns 0).

国米pretation也很容易在这种情况下:有一个当地最大的非空的比赛在结尾的[I] B [J] 的长度, M [I,J] ,当且仅当 M [I,J] 终止非零但 M [I + 1,J + 1] 是出0或-bounds。

Interpretation is also very easy in this case: there is a locally maximum non-empty match ending at a[i] and b[j], of length M[i, j], if and only if M[i, j] is non-zero but M[i+1, j+1] is either 0 or out-of-bounds.

您可以使用这些规则来写的很简单和放大器;紧凑code,有两个回路,其计算 M 正确这个问题。不足之处是,code将(最好,平均和最坏的情况下) 0(LEN(一)* len个(B))时间的和的空间。

You can use those rules to write very simple & compact code, with two loops, that computes M correctly for this problem. The downside is that the code will take (best, average and worst cases) O(len(a) * len(b)) time and space.

虽然它可能被莫名其妙首先,在code我贴的做法刚好以上。连接被掩盖,因为code被大量优化,在几个方面,对预期的情况:

While it may be baffling at first, the code I posted is doing exactly the above. The connection is obscured because the code is heavily optimized, in several ways, for expected cases:

而不是做一遍计算 M ,接着又传球给除preT的结果,计算及跨pretation是交错的单遍 A

Instead of doing one pass to compute M, then another pass to interpret the results, computation and interpretation are interleaved in a single pass over a.

由于这一点,整个矩阵并不需要存储。而不仅仅是当前行( newj2len )和previous行(j2len )同时present

Because of that, the whole matrix doesn't need to be stored. Instead only the current row (newj2len) and the previous row (j2len) are simultaneously present.

而因为矩阵在这个问题的一般的大多是零,一排这里重新psented稀疏$ P $,通过字典映射列索引为非零值。零项是免费的,因为他们从来没有明确地存储。

And because the matrix in this problem is usually mostly zeroes, a row here is represented sparsely, via a dict mapping column indices to non-zero values. Zero entries are "free", in that they're never stored explicitly.

在处理一排,有没有需要遍历每一列:pcomputed的$ P $ B2J 字典告诉我们,正是有趣的列索引在当前行(那些符合当前字列 A )。

When processing a row, there's no need to iterate over each column: the precomputed b2j dict tells us exactly the interesting column indices in the current row (those columns that match the current word from a).

最后,部分意外,所有的preceeding优化凑到以这样的方式,有从来不需要知道当前行的索引,所以我们不必费心计算,要么。

Finally, and partly by accident, all the preceeding optimizations conspire in such a way that there's never a need to know the current row's index, so we don't have to bother computing that either.

编辑 - 污垢简单的版本

下面是code直接实现二维矩阵,没有尝试优化(比一个计数器通常可以避免明确地存储0项除外)。这是非常简单的,短期和简单的:

Here's code that implements the 2D matrix directly, with no attempts at optimization (other than that a Counter can often avoid explicitly storing 0 entries). It's extremely simple, short and easy:

def match(a, b):
    from collections import Counter
    M = Counter()
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                M[i, j] = M[i-1, j-1] + 1
    unique = set()
    for i in range(len(a)):
        for j in range(len(b)):
            if M[i, j] and not M[i+1, j+1]:
                length = M[i, j]
                unique.add(" ".join(a[i+1-length: i+1]))
    return unique

当然;-)返回相同的结果作为优化匹配()我张贴在第一。中

编辑 - 和另一个没有一个字典

只是为了好玩恭喜,如果你有矩阵模型拍下来,这code将很容易效仿。有关这方面的问题值得注意的是,一个矩阵单元格的值只依赖于沿对角线的值单元格的西北部。因此,它是足够好只是遍历所有主对角线,从西部和北部边界的所有单元格继续东南部。这样,只有小恒的空间是必要的,无论是输入长度的:

Just for fun :-) If you have the matrix model down pat, this code will be easy to follow. A remarkable thing about this particular problem is that a matrix cell's value only depends on the values along the diagonal to the cell's northwest. So it's "good enough" just to traverse all the main diagonals, proceeding southeast from all the cells on the west and north borders. This way only small constant space is needed, regardless of the inputs' lengths:

def match(a, b):
    from itertools import chain
    m, n = len(a), len(b)
    unique = set()
    for i, j in chain(((i, 0) for i in xrange(m)),
                      ((0, j) for j in xrange(1, n))):
        k = 0
        while i < m and j < n:
            if a[i] == b[j]:
                k += 1
            elif k:
                unique.add(" ".join(a[i-k: i]))
                k = 0
            i += 1
            j += 1
        if k:
            unique.add(" ".join(a[i-k: i]))
    return unique