Word prediction algorithm

2023-09-12 21:19:27 · Author: 囚禁自己


I'm sure there is a post on this, but I couldn't find one asking this exact question. Consider the following:

We have a dictionary of words available. We are fed many paragraphs of words, and I wish to be able to predict the next word of a sentence given this input.


Say we have a few sentences such as "Hello my name is Tom", "His name is Jerry", "He goes where there is no water". We check a hash table to see if a word exists. If it does not, we assign it a unique ID and put it in the hash table. This way, instead of storing a "chain" of words as a bunch of strings, we can just have a list of unique IDs.
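A minimal sketch of this word-to-ID interning step (Python; the function and variable names are my own, not from the question):

```python
def intern_words(sentences):
    word_to_id = {}  # hash table: word -> unique ID
    chains = []      # each sentence becomes a list of IDs
    for sentence in sentences:
        chain = []
        for word in sentence.lower().split():
            if word not in word_to_id:
                # New word: assign the next free unique ID.
                word_to_id[word] = len(word_to_id)
            chain.append(word_to_id[word])
        chains.append(chain)
    return word_to_id, chains
```

Running this on the three example sentences reproduces the ID chains given below, with "is" mapped to 3 in every sentence it appears in.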


Above, we would have for instance (0, 1, 2, 3, 4), (5, 2, 3, 6), and (7, 8, 9, 10, 3, 11, 12). Note that 3 is "is", and we added new unique IDs as we discovered new words. So say we are given the sentence "her name is"; this would be (13, 2, 3). We want to know, given this context, what the next word should be. This is the algorithm I thought of, but I don't think it's efficient:

We have N chains (observed sentences), where a chain might be, e.g., (3, 6, 2, 7, 8). Each chain is on average of size M, where M is the average sentence length. We are given a new chain of size S, e.g. (13, 2, 3), and we want to know: what is the most probable next word?

Algorithm:


First scan the entire list of chains for those which contain the full input S ((13, 2, 3) in this example). Since we have to scan N chains, each of length M, and compare S numbers at a time, it's O(N*M*S).


If no chains in our scan have the full S, next scan after removing the least significant word (i.e. the first one, so remove 13). Now scan for (2, 3) as in step 1, in worst case O(N*M*S), where S is really S-1.


Continue scanning this way until we get results > 0 (if ever).


Tally the next words in all of the remaining chains we have gathered. We can use a hash table which counts every time we add, and keeps track of the most added word. O(N) worst case build, O(1) to find max word.
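The scan-prune-tally procedure described in these steps could be sketched as follows (a hypothetical implementation of the question's own algorithm, not an endorsement of its efficiency; names are my own):

```python
from collections import Counter

def predict_next(chains, s):
    # Try the full context first; prune the least significant
    # (leftmost) word until some chain matches.
    while s:
        tally = Counter()
        for chain in chains:                      # N chains
            # Stop early enough that a matched context still has a
            # following word to tally.
            for i in range(len(chain) - len(s)):  # ~M positions
                if chain[i:i + len(s)] == s:      # S comparisons
                    tally[chain[i + len(s)]] += 1
        if tally:
            # Hash table tracks counts; most_common(1) gives the
            # most frequently observed next word.
            return tally.most_common(1)[0][0]
        s = s[1:]                                 # prune first word
    return None  # context never seen: nothing to predict
```

For the example chains, `predict_next(chains, [13, 2, 3])` finds no match for the full context, prunes to (2, 3) ("name is"), and tallies the followers 4 ("Tom") and 6 ("Jerry").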


Each scan takes O(M*N*S) worst case. This is because there are N chains, each chain has M numbers, and we must check S numbers for an overlapping match. We scan S times worst case ((13, 2, 3), then (2, 3), then (3): 3 scans = S). Thus, the total complexity is O(S^2 * M * N).


So if we have 100,000 chains and an average sentence length of 10 words, we're looking at 1,000,000*S^2 to get the optimal word. Clearly, N >> M, since sentence length does not scale with number of observed sentences in general, so M can be a constant. We can then reduce the complexity to O(S^2 * N). O(S^2 * M * N) may be more helpful for analysis though, since M can be a sizeable "constant".


This could be the completely wrong approach to take for this type of problem, but I wanted to share my thoughts instead of just blatantly asking for assistance. The reason I'm scanning the way I do is because I only want to scan as much as I have to. If nothing has the full S, just keep pruning S until some chains match. If they never match, we have no idea what to predict as the next word! Any suggestions for a less time/space-complex solution? Thanks!

Answer


This is the problem of language modeling. For a baseline approach, the only thing you need is a hash table mapping fixed-length chains of words, say of length k, to the most probable following word. (*)


At training time, you break the input into (k+1)-grams using a sliding window. So if you encounter

The wrath sing, goddess, of Peleus' son, Achilles


you generate, for k=2,

START START the
START the wrath
the wrath sing
wrath sing goddess
sing goddess of
goddess of peleus
of peleus son
peleus son achilles
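The sliding-window step can be sketched like this (Python; assumes simple lowercasing and punctuation stripping, and pads with k START markers so the first real word gets a full context):

```python
def ngrams(sentence, k=2):
    # Lowercase and strip commas/apostrophes, as in the example above.
    cleaned = sentence.lower().replace(",", "").replace("'", "")
    # Pad with k START markers before sliding the window.
    tokens = ["START"] * k + cleaned.split()
    # Slide a window of size k+1 over the token list.
    return [tuple(tokens[i:i + k + 1]) for i in range(len(tokens) - k)]
```

On the Iliad line above this yields eight 3-grams, starting with ("START", "START", "the") and ending with ("peleus", "son", "achilles").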


This can be done in linear time. For each 3-gram, tally (in a hash table) how often the third word follows the first two.


Finally, loop through the hash table and for each key (2-gram) keep only the most commonly occurring third word. Linear time.


At prediction time, look only at the k (2) last words and predict the next word. This takes only constant time since it's just a hash table lookup.
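Putting the training and prediction steps together, a minimal sketch of this baseline (Python; function names and the START padding convention are my own choices) might look like:

```python
from collections import Counter, defaultdict

def train(sentences, k=2):
    # Count, for each k-gram context, how often each word follows it.
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["START"] * k + sentence.lower().split()
        for i in range(len(tokens) - k):
            context = tuple(tokens[i:i + k])
            counts[context][tokens[i + k]] += 1
    # Keep only the most commonly occurring follower per context.
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def predict(model, last_k_words):
    # Constant time: a single hash-table lookup on the last k words.
    return model.get(tuple(w.lower() for w in last_k_words))
```

Training is linear in the input size; prediction is a single dictionary lookup, matching the complexity claims above.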


If you're wondering why you should keep only short subchains instead of full chains, then look into the theory of Markov windows. If your model were to remember all the chains of words that it has seen in its input, then it would badly overfit its training data and only reproduce its input at prediction time. How badly depends on the training set (more data is better), but for k>4 you'd really need smoothing in your model.


(*) Or to a probability distribution, but this is not needed for your simple example use case.