例如,从其他字符串集合分割字符串字符串

2023-09-11 05:26:49 作者:一杯忘情

我想建立的字符串(如收取任何复杂的数据结构),我可以使用,有效的,为榜样的集合,知道在哪里我可以拆分给定的字符串。 在例子中,我有这个字符串集合:

I'd like to build a collection of String (any complex data structure like a collection) that I can use, efficiently, as "example" to know where I can split a given string. In example I had this String collection:

在阿巴科code,交流。 粗体字可以大胆。 叶的树的文件夹,和树木。

和给定的字符串:

OME codeexchangeuthercanbetreeofword

和获取,从算法是这样的:

and obtain, from algorithm, something like:

OME code交换乌瑟尔可以是字的树

部分青梅和乌瑟尔不能分裂,因此将保留原样(这将是很好,如果我标志着这部分不一样认可)。 我试着分析KMP算法,但太多远离我的需求,我想组织收集在一个有效的时间方式(小于线性到集合的大小)。

The part "ome" and "uther" cannot be splitted, so will be left as is (it would be nice if I mark this part like NOT-RECOGNIZED). I try to analyse KMP algorithm, but are too much far away from my needs, and I'd like to organize collection in a efficient time manner (less than linear to collection size).

我忘了说:

分割上串与俚语混合自然语言的单词都没有空格 我已经尝试过基于字典的动态算法的加权词,但实在是太多了科目误差当量的错分裂(错误的我的意思是自然语言) 我需要的最好成绩为这种分裂,用字序列从字符串的集合称为好榜样

推荐答案

动态规划可有帮助这里。

f(0) = 0
f(i) = min { f(j) + (dictionary.contains(word.substring(j,i)) ? 0 : i-j)  for each j=0,...,i }

我们的想法是做使用上述递归函数,而试图最小化字母不适应数字穷举搜索。采用DP技术,您可以避免重复计算,有效地得到正确的答案。

The idea is to do an exhaustive search using the above recursive function while trying to minimize the number of letters that do not 'fit'. Using DP techniques, you can avoid repeating calculations and get the correct answer efficiently.

使用实际的分区可以通过记住每一步是什么Ĵ被选来完成,并从去年回溯步骤来第一位。

Getting the actual partitions can be done by remembering at each step what j was chosen, and retrace your steps from last to first.

的Java code:

    String word = "omecodeexchangeuthercanbetreeofword";
    Set<String> set = new HashSet<>(Arrays.asList("abaco", "code", "exchange", "bold", "word", "can", "be", "tree", "folder", "and", "of", "leaf"));
    int n = word.length() + 1;
    int[] f = new int[n];
    int[] jChoices = new int[n];
    f[0] = 0;
    for (int i = 1; i < n; i++) {
        int best = Integer.MAX_VALUE;
        int bestJ = -1;
        for (int j = 0; j < i; j++) {
            int curr = f[j] + (set.contains(word.substring(j, i)) ? 0 : (i-j));
            if (curr < best) {
                best = curr;
                bestJ = j;
            }
        }
        jChoices[i] = bestJ;
        f[i] = best;
    }
    System.out.println("unmatched chars: " + f[n-1]);
    System.out.println("split:");
    int j = n-1;
    List<String> splits = new ArrayList<>();
    while (j > 0) { 
        splits.add(word.substring(jChoices[j],j));
        j = jChoices[j];
    }
    Collections.reverse(splits);
    for (String s : splits) System.out.println(s + " " + (set.contains(s)?"(match)":"(does not match)"));