Which algorithm segments a sequence of numbers into n subsets so as to minimize the standard deviation of the subset sums?

2023-09-11 22:48:32 Author: 旧楹联红褪墨残谁来揭

I'm looking for an algorithm to segment a sequence of positive numbers into n subsequences such that the standard deviation of the sums of the subsequences is minimized.

The ordering of the numbers in each subsequence needs to be the same as in the original sequence; in other words, each subsequence is a contiguous run of the original.

For example:

Suppose I have a sequence {1,1,1,1,1,1,10,1} that I want to segment into 2 subsequences. I believe the optimal solution would be {1,1,1,1,1,1}, {10,1}.

The sum of the 1st subsequence is 6 and the sum of the 2nd subsequence is 11. The (sample) standard deviation of the two numbers is ~3.5, which I believe is the lowest possible.

Suppose I have a sequence {4,1,1,1,1,6} that I want to segment into 3 subsequences. I believe the optimal solution would be {4}, {1,1,1,1}, {6}. The sums of the subsequences are 4, 4, and 6. The (sample) standard deviation of the 3 numbers is ~1.15, which I believe is the lowest possible.
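The two figures quoted in these examples can be checked with a few lines of Python. This is only a sanity check, and it assumes the sample standard deviation (dividing by n - 1), since that is the convention that reproduces ~3.5 and ~1.15; the population version would give 2.5 and about 0.94 instead. The helper name `segment_sums` is made up for this sketch.

```python
from statistics import stdev  # sample standard deviation (divides by n - 1)


def segment_sums(segments):
    """Return the sum of each segment, in order."""
    return [sum(seg) for seg in segments]


# The two splits proposed in the question.
first = segment_sums([[1, 1, 1, 1, 1, 1], [10, 1]])
second = segment_sums([[4], [1, 1, 1, 1], [6]])

print(first, round(stdev(first), 2))
print(second, round(stdev(second), 2))
```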

The best algorithm I was able to come up with is to compute the cumulative sums of the sequence and cut the sequence at each multiple of totalSum/numSubsequences.

For example, given the sequence {4,1,1,1,1,6}, the cumulative sums are {4,5,6,7,8,14}. The total of all numbers in the sequence is 14, so, given that I want 3 subsequences, I should cut the sequence where the running total reaches 14/3 ≈ 4.67 and 2 * 14/3 ≈ 9.33.

However, there is no actual place in the sequence where the cumulative total equals 14/3 ≈ 4.67: the first cumulative total is 4 and the next is 5. So should I round up or round down? In this case, rounding down to 4 gives the optimal solution, but that isn't always the case. The best I can think of is to try every combination of rounding up and down, but that results in O(2^numSubsequences) complexity.
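To make the rounding dilemma concrete, the sketch below lists, for each target k * totalSum / numSubsequences, the cumulative sums on either side of it. It only illustrates the heuristic described above and does not solve the problem; the function name `candidate_cuts` is an assumption of this sketch, and the input is assumed to consist of positive numbers.

```python
from bisect import bisect_left
from itertools import accumulate


def candidate_cuts(seq, n):
    """For each target k * total / n, report the cumulative sums around it.

    Returns the list of cumulative sums and, per target, a tuple
    (target, nearest cumulative sum below, nearest cumulative sum at or above).
    """
    cums = list(accumulate(seq))
    total = cums[-1]
    cuts = []
    for k in range(1, n):
        target = k * total / n
        idx = bisect_left(cums, target)  # first index with cums[idx] >= target
        below = cums[idx - 1] if idx > 0 else 0
        cuts.append((target, below, cums[idx]))
    return cums, cuts


cums, cuts = candidate_cuts([4, 1, 1, 1, 1, 6], 3)
print(cums)
for target, below, above in cuts:
    print(f"target {target:.2f}: round down to {below} or up to {above}")
```

With one binary choice per target, trying every combination costs 2^(n-1) evaluations, which is the exponential blow-up mentioned above.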

This seems like the type of problem that would have a preexisting algorithm, but my Googling has failed me. I am aware of the Partition Problem, which is NP-complete, but that deals with unordered sets, not ordered sequences.

Any help would be appreciated.

Recommended answer

Suppose the length of the original sequence is L and the number of subsequences is N.

You may simplify the expression for the standard deviation to sqrt(E[X^2] - E[X]^2), where E denotes expectation/average and X denotes your random variable: in your case, the sum of a subsequence. (A similar formula applies to the "sample standard deviation".) Note that E[X] does not depend on how you split your sequence, because it is always the total sum divided by N. Thus, we just want to minimize E[X^2] or, equivalently, the sum of X^2 over the subsequences (the two differ by a factor of N, by the definition of average).
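The reduction can be written out explicitly. In the sketch below, X_j denotes the sum of the j-th subsequence and S the total of the whole sequence; both symbols are introduced here only for illustration.

```latex
\begin{align*}
\sigma^2 &= \frac{1}{N}\sum_{j=1}^{N} X_j^2 - \left(\frac{1}{N}\sum_{j=1}^{N} X_j\right)^2
          = \frac{1}{N}\sum_{j=1}^{N} X_j^2 - \frac{S^2}{N^2}, \\
s^2 &= \frac{1}{N-1}\left(\sum_{j=1}^{N} X_j^2 - \frac{S^2}{N}\right).
\end{align*}
```

Since S and N are fixed by the input, both the population variance and the sample variance are increasing functions of the sum of the X_j^2, so minimizing either standard deviation is the same as minimizing that sum. For the split {6, 11}, for instance, the sum of squares is 157 and S = 17, giving s^2 = 157 - 144.5 = 12.5 and s ≈ 3.54.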

At this point, we can see that the problem can be solved with dynamic programming. Let f(i,j), for i from 0 to L and j from 1 to N, be the minimal sum of squares of the subsequence sums over all splits of the first i elements of your sequence into j subsequences. Then f(i,j) can be computed in terms of the values f(i',j') with i' <= i and j' < j. More specifically, if your sequence is a[k], indexed from 0 to L-1:

f(i,1) = sum( a[k] for 0 <= k < i )^2
f(i,j) = minimum of  f(l,j-1) + sum( a[k] for l <= k < i )^2  for l from 0 to i

Having computed f(L,N), you can use standard dynamic programming techniques to recover the splits. In particular, you can store, for each pair (i,j), the l that minimizes f(i,j).

The runtime of this solution is O(L^2 N) because you compute O(L N) different values of f and the minimum is over O(L) different values of l.
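For readers who prefer Python, the recurrence and the back-pointer recovery can be sketched as follows. This is a minimal sketch and not the answer's own code: the names `best_split`, `prefix`, and `start` (the text's l) are choices made here, and prefix sums are used so that each segment sum costs O(1), which keeps the whole computation at O(L^2 N).

```python
def best_split(a, n):
    """Split list a into n contiguous segments, minimizing the sum of
    squared segment sums (equivalently, the standard deviation of the sums)."""
    length = len(a)

    # prefix[i] is the sum of the first i elements.
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)

    inf = float("inf")
    # f[j][i]: minimal cost of splitting the first i elements into j segments.
    # g[j][i]: start index of the last segment in one such optimal split.
    f = [[inf] * (length + 1) for _ in range(n + 1)]
    g = [[0] * (length + 1) for _ in range(n + 1)]
    f[0][0] = 0  # zero elements in zero segments costs nothing

    for j in range(1, n + 1):
        for i in range(length + 1):
            for start in range(i + 1):  # start == i means an empty last segment
                cost = f[j - 1][start] + (prefix[i] - prefix[start]) ** 2
                if cost < f[j][i]:
                    f[j][i] = cost
                    g[j][i] = start

    # Follow the stored start indices from the end back to the beginning.
    segments = []
    i = length
    for j in range(n, 0, -1):
        start = g[j][i]
        segments.append(a[start:i])
        i = start
    segments.reverse()
    return segments


print(best_split([1, 1, 1, 1, 1, 1, 10, 1], 2))
print(best_split([4, 1, 1, 1, 1, 6], 3))
```

Empty segments are allowed by the recurrence, but with positive numbers and n no larger than the sequence length they never appear in an optimal split, because splitting a segment of two or more positive numbers always lowers the sum of squares.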

Here's a straightforward implementation in Perl:

#!/usr/bin/perl

use strict;
use warnings;

local $\ = $/;   # append a newline to every print
print join ", ", map {"@$_"} best( 2, qw(1 1 1 1 1 1 10 1) );
# prints "1 1 1 1 1 1, 10 1"

print join ", ", map {"@$_"} best( 3, qw(4 1 1 1 1 6) );
# prints "4, 1 1 1 1, 6"

sub best {
    my( $N, @a ) = @_;

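    # $f[$i][$j]: minimal sum of squared segment sums for the first $i elements in $j segments
    # $g[$i][$j]: index at which the last of those $j segments starts
    my( @f, @g, $i, $j, $k, $sum );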

    # DP base case: a single segment must cover the whole prefix
    $sum = 0;
    $f[0][1] = $g[0][1] = 0;
    for $i ( 1 .. @a ) {
        $sum += $a[$i-1];
        $f[$i][1] = $sum * $sum;
        $g[$i][1] = 0;
    }

    # DP recurrence: try every start $k for the last segment, accumulating its sum from the right
    for $j ( 2 .. $N ) {
        $f[0][$j] = $g[0][$j] = 0;
        for $i ( 1 .. @a ) {
            $sum = 0;
            # Start from the option of an empty last segment
            $f[$i][$j] = $f[$i][$j-1];
            $g[$i][$j] = $i;
            for $k ( reverse 0 .. $i-1 ) {
                $sum += $a[$k];
                if( $f[$i][$j] > $f[$k][$j-1] + $sum * $sum ) {
                    $f[$i][$j] = $f[$k][$j-1] + $sum * $sum;
                    $g[$i][$j] = $k;
                }
            }
        }
    }

    # Walk the stored start indices backwards to rebuild the segments
    my( @result );
    $i = @a; $j = $N;

    while( $j ) {
        $k = $g[$i][$j];
        unshift @result, [@a[$k .. $i-1]];
        $i = $k;
        $j--;
    }

    return @result;
}
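
As a cross-check on small inputs, the result of the dynamic program can be compared against exhaustive search. The Python sketch below is an illustration only: the name `brute_force_split` is made up here, and it assumes positive numbers and n no larger than the sequence length, so that every segment is non-empty. It tries all C(L-1, n-1) cut placements, so it is useful only for testing.

```python
from itertools import combinations


def brute_force_split(a, n):
    """Try every way to cut a into n non-empty contiguous segments and
    return one with the smallest sum of squared segment sums."""
    best, best_cost = None, None
    for cuts in combinations(range(1, len(a)), n - 1):
        bounds = (0, *cuts, len(a))
        segments = [a[s:e] for s, e in zip(bounds, bounds[1:])]
        cost = sum(sum(seg) ** 2 for seg in segments)
        if best_cost is None or cost < best_cost:
            best, best_cost = segments, cost
    return best


print(brute_force_split([1, 1, 1, 1, 1, 1, 10, 1], 2))
print(brute_force_split([4, 1, 1, 1, 1, 6], 3))
```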