What are good test cases for benchmarking & stress-testing substring search algorithms?

2023-09-11 05:44:08 | Author: 在此绝望

I'm trying to evaluate different substring search (ala strstr) algorithms and implementations and looking for some well-crafted needle and haystack strings that will catch worst-case performance and possible corner-case bugs. I suppose I could work them out myself but I figure someone has to have a good collection of test cases sitting around somewhere...

Answer

Some thoughts and a partial answer to myself:

Worst case for brute force algorithm:

a^(n+1) b in (a^n b)^m

e.g. aaab in aabaabaabaabaabaabaab
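The quadratic blow-up of this case is easy to verify with a comparison counter. A minimal sketch (the `brute_force_count` helper and the parameters n=2, m=7 are my own, chosen to reproduce the aaab example):

```python
def brute_force_count(needle, haystack):
    """Naive substring search; returns (match index or -1, char comparisons)."""
    comparisons = 0
    for i in range(len(haystack) - len(needle) + 1):
        for j in range(len(needle)):
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
        else:
            return i, comparisons
    return -1, comparisons

n, m = 2, 7
needle = "a" * (n + 1) + "b"    # a^(n+1) b  ->  "aaab"
haystack = ("a" * n + "b") * m  # (a^n b)^m  ->  "aabaabaabaabaabaabaab"
idx, comps = brute_force_count(needle, haystack)
# every alignment partially matches before failing, so comparisons grow ~ n*m
```

Every period of the haystack costs a fresh partial match, which is exactly what makes the naive loop O(nm) here.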

Worst case for SMOA:

Something like yxyxyxxyxyxyxx in (yxyxyxxyxyxyxy)^n. Needs further refinement. I'm trying to ensure that each advancement is only half the length of the partial match, and that maximal suffix computation requires the maximal amount of backtracking. I'm pretty sure I'm on the right track because this type of case is the only way I've found so far to make my implementation of SMOA (which is asymptotically 6n+5) run slower than glibc's Two-Way (which is asymptotically 2n-m but has moderately painful preprocessing overhead).
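A quick way to sanity-check that this family of inputs produces long partial matches everywhere (the generator below is my own sketch; it only demonstrates the repeated 13-of-14-character near-misses, not SMOA itself):

```python
needle = "yxyxyxxyxyxyxx"          # from the text above
haystack = "yxyxyxxyxyxyxy" * 20   # (yxyxyxxyxyxyxy)^n with n = 20

def partial_match_len(needle, haystack, i):
    """Length of the longest needle prefix matching haystack at offset i."""
    j = 0
    while j < len(needle) and i + j < len(haystack) and haystack[i + j] == needle[j]:
        j += 1
    return j

# the needle never occurs, yet every period boundary matches 13 of its 14 chars
lengths = [partial_match_len(needle, haystack, i)
           for i in range(0, len(haystack) - len(needle), 14)]
```

The needle differs from the haystack's period only in its final character, so every attempt is a maximal near-miss.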

Worst case for anything rolling-hash based:

Whatever sequence of bytes causes hash collisions with the hash of the needle. For any reasonably-fast hash and a given needle, it should be easy to construct a haystack whose hash collides with the needle's hash at every point. However, it seems difficult to simultaneously create long partial matches, which are the only way to get the worst-case behavior. Naturally for worst-case behavior the needle must have some periodicity, and a way of emulating the hash by adjusting just the final characters.
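As an illustration, with a deliberately weak additive rolling hash the collide-at-every-position haystack is trivial to build (this toy Rabin-Karp and the bb / acac... pair are my own example, not from the answer):

```python
def rabin_karp_count(needle, haystack):
    """Rabin-Karp with a weak additive rolling hash.
    Returns (match index or -1, number of hash-hit verifications)."""
    m = len(needle)
    target = sum(ord(c) for c in needle)
    h = sum(ord(c) for c in haystack[:m])
    verifications = 0
    for i in range(len(haystack) - m + 1):
        if i > 0:                       # slide the window by one character
            h += ord(haystack[i + m - 1]) - ord(haystack[i - 1])
        if h == target:                 # hash hit: must verify char by char
            verifications += 1
            if haystack[i:i + m] == needle:
                return i, verifications
    return -1, verifications

needle = "bb"            # byte sum 196
haystack = "ac" * 10     # every window "ac"/"ca" also sums to 196
idx, ver = rabin_karp_count(needle, haystack)
# all 19 windows collide and force a verification, but none matches
```

This shows the collide-everywhere part; as the answer notes, combining it with long partial matches (so each verification is also expensive) is the harder half of the construction.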

Worst case for Two-Way:

Seems to be very short needle with nontrivial MS decomposition - something like bac - where the haystack contains repeated false positives in the right-half component of the needle - something like dacdacdacdacdacdacdac. The only way this algorithm can be slow (other than by glibc authors implementing it poorly...) is by making the outer loop iterate many times and repeatedly incur that overhead (and making the setup overhead significant).
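The bac / dacdac... pair can be checked quickly. The sketch below only confirms the shape of the input (treating "ac" as the right-half component, per the description above), not Two-Way's running time:

```python
needle = "bac"
haystack = "dac" * 7     # dacdacdacdacdacdacdac

# the needle never occurs, but its right-half component "ac" is a
# false positive at every period of the haystack
positions = [i for i in range(len(haystack)) if haystack.startswith("ac", i)]
```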

Other algorithms:

I'm really only interested in algorithms that are O(1) in space and have low preprocessing overhead, so I haven't looked at their worst cases so much. At least Boyer-Moore (without the modifications to make it O(n)) has a nontrivial worst-case where it becomes O(nm).
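For reference, the classical O(nm) case is easy to trigger on the bad-character heuristic alone. The sketch below uses Horspool (a Boyer-Moore simplification that keeps only that heuristic), with inputs of my own choosing:

```python
def horspool_count(needle, haystack):
    """Horspool search; returns (match index or -1, char comparisons)."""
    m, n = len(needle), len(haystack)
    # bad-character shift table built from all but the last needle char
    shift = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    comparisons = 0
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0:                   # compare right to left
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
            j -= 1
        else:
            return i, comparisons
        i += shift.get(haystack[i + m - 1], m)
    return -1, comparisons

needle = "b" + "a" * 4   # "baaaa"
haystack = "a" * 20
idx, comps = horspool_count(needle, haystack)
# every alignment matches m-1 chars, mismatches on 'b', and shifts by 1
```

Each of the 16 alignments costs 5 comparisons before a shift of one, i.e. ~nm work on an n=20 haystack.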