I have a large set of strings. I want to divide the strings into subsets such that:
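1. Each item in a subset shares one or more contiguous characters.
2. The contiguous characters that define a subset are unique across the set of subsets (that is, the shared characters are enough to make membership in that subset mutually exclusive with membership in any other subset).
3. The subsets are approximately the same size.
4. The resulting set of subsets is the minimum number of subsets that satisfies the criteria above.
For example, given the following set: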
Alan, Larry, Alfred, Barbara, Alphonse, Carl
I can divide this set into two subsets of equal size. Subset 1 defined by the contiguous characters "AL" would be
Alan, Alfred, Alphonse
Subset 2, defined by the contiguous characters "AR", would be
Larry, Barbara, Carl.
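To make the example concrete, here is a minimal Python sketch that checks whether a candidate set of defining substrings splits the strings into mutually exclusive subsets. It only verifies a proposed answer; it does not find the substrings. The function name `partition_by_keys` is hypothetical, and case-insensitive matching is an assumption taken from the example, where "AR" matches "Larry".

```python
def partition_by_keys(strings, keys):
    """Assign each string to the single key it contains (case-insensitive).

    Raises ValueError if a string contains no key or more than one key,
    because the defining substrings must be mutually exclusive.
    """
    subsets = {key: [] for key in keys}
    for s in strings:
        # Collect every candidate key that occurs somewhere in the string.
        matches = [key for key in keys if key.lower() in s.lower()]
        if len(matches) != 1:
            raise ValueError(f"{s!r} matches {len(matches)} keys: {matches}")
        subsets[matches[0]].append(s)
    return subsets


names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
result = partition_by_keys(names, ["AL", "AR"])
# result == {"AL": ["Alan", "Alfred", "Alphonse"],
#            "AR": ["Larry", "Barbara", "Carl"]}
```

A pair of keys such as "A" and "L" would be rejected, because "Alan" contains both.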
I am looking for an algorithm that would do this for any arbitrary set of strings. The result does not have to contain exactly two subsets, but it should contain as few subsets as possible, and the subsets should be approximately equal in size.
Elliott
Have a look at http://en.wikipedia.org/wiki/Suffix_array. It is possible that what you really want to do is to create a suffix array for each document and then merge all the suffix arrays, keeping pointers back to the original documents, so that you can search the whole collection at once for a string by looking for it as a prefix of a suffix in the merged array.
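As a rough illustration of the merged suffix array idea, here is a minimal Python sketch. It assumes the documents are short strings such as names, builds the array naively by storing every suffix explicitly, and lowercases everything so that matching is case-insensitive. The function names `build_merged_suffix_array` and `find_docs_containing` are hypothetical, and a real implementation for large texts would store offsets only and use a proper construction algorithm.

```python
from bisect import bisect_left


def build_merged_suffix_array(docs):
    """Return a sorted list of (suffix, doc_index, offset) over all docs.

    Naive construction: every suffix is stored explicitly, which costs
    quadratic space per document. Fine for short strings, not for large texts.
    """
    entries = []
    for doc_index, doc in enumerate(docs):
        text = doc.lower()
        for offset in range(len(text)):
            # doc_index and offset are the pointers back to the original.
            entries.append((text[offset:], doc_index, offset))
    entries.sort()
    return entries


def find_docs_containing(entries, pattern):
    """Return sorted indices of the documents that contain `pattern`."""
    pattern = pattern.lower()
    # Suffixes that start with `pattern` form one contiguous run in sorted order.
    start = bisect_left(entries, (pattern,))
    hits = set()
    for suffix, doc_index, _ in entries[start:]:
        if not suffix.startswith(pattern):
            break
        hits.add(doc_index)
    return sorted(hits)


names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
entries = build_merged_suffix_array(names)
al_docs = find_docs_containing(entries, "al")  # [0, 2, 4]
ar_docs = find_docs_containing(entries, "ar")  # [1, 3, 5]
```

Because every substring is a prefix of some suffix, one binary search per candidate substring gives the set of strings that contain it, which is the information needed to compare candidate subsets by size.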