Text packing algorithm

2023-09-11 01:53:11 Author: 栀夏微凉

I bet somebody has solved this before, but my searches have come up empty.

I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.

Example: doll dollhouse house

These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at position 0, and house is five letters at position 4.

What I've come up with so far is:

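1. Sort the words from longest to shortest: (dollhouse, house, doll).
2. Scan the buffer to see whether the string already exists as a substring; if so, note its position.
3. If it doesn't exist, append it to the end of the buffer.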

Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
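For reference, the approach described above could be sketched roughly as follows. This is only an illustration of the sort-then-scan idea, not the asker's actual code; the name `pack_naive` and the `(position, length)` tuple layout are assumptions made for the example.

```python
def pack_naive(words):
    """Pack words longest-first, reusing any existing occurrence in the buffer.

    Returns the buffer and a dict mapping each word to (position, length).
    """
    buffer = ""
    index = {}
    # dict.fromkeys removes duplicates while keeping the input order
    for word in sorted(dict.fromkeys(words), key=len, reverse=True):
        pos = buffer.find(word)      # -1 if the word is not in the buffer yet
        if pos < 0:
            pos = len(buffer)        # append at the end of the buffer
            buffer += word
        index[word] = (pos, len(word))
    return buffer, index


buffer, index = pack_naive(["doll", "dollhouse", "house", "ragdoll"])
print(buffer)   # dollhouseragdoll
print(index)
```

With ragdoll in the list the buffer comes out as 16 characters, against 12 for ragdollhouse, because the scan only reuses whole words and never the partial overlap between the end of ragdoll and the start of dollhouse.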

This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.

As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm

Recommended answer

This is the shortest superstring problem: find the shortest string that contains each of a given set of strings as a substring. According to this IEEE paper (which you may not have access to, unfortunately), solving this problem exactly is NP-complete (strictly speaking, that is the decision version; the optimization problem is NP-hard). However, heuristic solutions are available.

As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
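As a rough illustration of this first step, here is a sketch that uses plain substring checks in place of a generalised suffix tree. That swap is an assumption made to keep the example short: it is quadratic in the number of words, so it may be too slow for very large lists. The name `drop_contained` is made up for the example.

```python
def drop_contained(words):
    """Return the words that are not substrings of any other word."""
    # Longest first, so every possible container is seen before what it contains.
    unique = sorted(dict.fromkeys(words), key=len, reverse=True)
    kept = []
    for word in unique:
        # Checking against kept words is enough: a dropped word always lies
        # inside some kept word, so anything inside it does too.
        if not any(word in longer for longer in kept):
            kept.append(word)
    return kept


print(drop_contained(["doll", "dollhouse", "house", "ragdoll"]))
# ['dollhouse', 'ragdoll']
```

The dropped words can be located afterwards by searching for them in the final packed buffer, so nothing needs to be recorded at this stage.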

Then, by repeatedly merging the two strings with the longest overlap, you are guaranteed to produce a solution whose length is at most 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees, as suggested in a comment by Zifre on Konrad Rudolph's answer. Alternatively, you might be able to use the generalised suffix tree somehow.
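A minimal sketch of the greedy merge might look like the following. It computes overlaps by brute force instead of radix trees or a suffix tree, so it only illustrates the idea and would need a faster overlap search for tens of thousands of words. The function names are made up for the example, and it assumes fully-contained words have already been removed.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(words):
    """Repeatedly merge the ordered pair of strings with the longest overlap."""
    pieces = list(dict.fromkeys(words))   # drop exact duplicates
    while len(pieces) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the shared characters only once.
        merged = pieces[best_i] + pieces[best_j][best_k:]
        pieces = [p for n, p in enumerate(pieces) if n not in (best_i, best_j)]
        pieces.append(merged)
    return pieces[0] if pieces else ""


packed = greedy_superstring(["dollhouse", "ragdoll"])
print(packed)   # ragdollhouse

# Every original word, including the contained ones, can then be located.
all_words = ["doll", "dollhouse", "house", "ragdoll"]
print({w: (packed.find(w), len(w)) for w in all_words})
```

When no two strings overlap at all, the loop still merges a pair with overlap 0, which is plain concatenation, so the result is always a valid superstring.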

I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.