试图组值?

2023-09-11 23:05:00 作者:淡笑、此时的颓废

我有这样一些数据:

1 2
3 4
5 9
2 6
3 7

和正在寻找这样的一个输出(组ID和该组的成员):

and am looking for an output like this (group-id and the members of that group):

1: 1 2 6
2: 3 4 7
3: 5 9

第一行,因为1连接到2和2连接到6。 因为3被连接到4和3个第二排被连接到7

First row because 1 is "connected" to 2 and 2 is connected to 6. Second row because 3 is connected to 4 and 3 is connected to 7

这看起来对我来说,一个图的遍历,但是最终的顺序并不重要,所以我在想,如果有人能提出一个简单的解决方案,我可以在一个大的数据集用(十亿项)。

This looked to me like a graph traversal but the final order does not matter so I was wondering if someone can suggest a simpler solution that I can use on a large dataset (billions of entries).

从注​​释:

问题是找到一组不相交的子图给定一组的边。 在边缘无定向;行'1 2'表示1连接到2和2连接到1。 的1:样品输出可能是'答:'不改变答案的意思

修改1:

问题现在看起来解决了。感谢大家的帮助。我需要一些更多的帮助采摘,可以在数十亿这样的条目中使用的最佳解决方案。

Problem looks solved now. Thanks to everyone for their help. I need some more help picking the best solution that can be used on billions of such entries.

编辑2:

测试输入文件:

1 27
1 134
1 137
1 161
1 171
1 275
1 309
1 413
1 464
1 627
1 744
2 135
2 398
2 437
2 548
2 594
2 717
2 738
2 783
2 798
2 912
5 74
5 223
7 53
7 65
7 122
7 237
7 314
7 701
7 730
7 755
7 821
7 875
7 884
7 898
7 900
7 930
8 115
9 207
9 305
9 342
9 364
9 493
9 600
9 676
9 830
9 941
10 164
10 283
10 380
10 423
10 468
10 577
11 72
11 132
11 276
11 306
11 401
11 515
11 599
12 95
12 126
12 294
13 64
13 172
13 528
14 396
15 35
15 66
15 210
15 226
15 360
15 588
17 263
17 415
17 474
17 648
17 986
21 543
21 771
22 47
23 70
23 203
23 427
23 590
24 286
24 565
25 175
26 678
27 137
27 161
27 171
27 275
27 309
27 413
27 464
27 627
27 684
27 744
29 787

基准:

我尝试了一切,并张贴TokenMacGuy的版本是最快的,我试图样本数据集。该数据集有为此我花了大约6秒的双路四核2.4GHz的机器上大约1万个条目。我还没有得到对整个数据集上运行的机会,但是我会尽快发布的基准,因为它是可用的。

I tried out everything and the version posted by TokenMacGuy is the fastest on the sample dataset that I tried. The dataset has about 1 million entries for which it took me about 6 seconds on a Dual Quad-Core 2.4GHz machine. I haven't gotten a chance to run it on the entire dataset yet but I will post the benchmark as soon as it is available.

推荐答案

我已经成功为O(n log n)的。

I've managed O(n log n).

下面是一个(有点紧张)C ++实现:

Here is a (somewhat intense) C++ implementation:

#include <boost/pending/disjoint_sets.hpp>
#include <boost/property_map/property_map.hpp>

#include <map>
#include <set>
#include <iostream>


typedef std::map<int, int> rank_t;
typedef std::map<int, int> parent_t;

typedef boost::associative_property_map< rank_t > rank_pmap_t;
typedef boost::associative_property_map< parent_t > parent_pmap_t;

typedef boost::disjoint_sets< rank_pmap_t, parent_pmap_t > group_sets_t;

typedef std::set<int> key_set;
typedef std::map<int, std::set<int> > output;

对于某些类型定义出的方式,这里是真正的肉。我使用提振:: disjoint_sets ,这是恰好是一个非常良好的再presentation的问题。第一个函数检查是否任何给定的按键都没有过的,并在需要时将它们添加到收藏。的重要组成部分,是真正的 union_set(A,B)哪个环节两套在一起。如果其中一个或另一个组已经在集团集合,它们就挂了。

With some typedefs out of the way, here's the real meat. I'm using boost::disjoint_sets, which is just happens to be an exceptionally good representation for the problem. This first function checks to see if either of the keys given have been seen before, and adds them to the collections if needed. the important part is really the union_set(a, b) which links the two sets together. If one or the other of the sets are already in the groups collection, they get linked too.

void add_data(int a, int b, group_sets_t & groups, key_set & keys)
{
  if (keys.count(a) < 1) groups.make_set(a);
  if (keys.count(b) < 1) groups.make_set(b);
  groups.union_set(a, b);
  keys.insert(a);
  keys.insert(b);
}

这是不是太令人兴奋了,它只是遍历所有我们见过的钥匙,并得到再presentative键的键,然后增加对(重新presentative,键)到地图。一旦这样做了,打印出图。

This isn't too exciting, it just iterates through all of the keys we've seen and gets the representative key for that key, then adds the pair (representative, key) to a map. Once that's done, print out the map.

void build_output(group_sets_t & groups, key_set & keys)
{
  output out;
  for (key_set::iterator i(keys.begin()); i != keys.end(); i++)
    out[groups.find_set(*i)].insert(*i);

  for (output::iterator i(out.begin()); i != out.end(); i++)
  {
    std::cout << i->first << ": ";
    for (output::mapped_type::iterator j(i->second.begin()); j != i->second.end(); j++)
      std::cout << *j << " ";
    std::cout << std::endl;
  }
}

int main()
{

  rank_t rank;
  parent_t parent;
  rank_pmap_t rank_index(rank);
  parent_pmap_t parent_index(parent);


  group_sets_t groups( rank_index, parent_index );
  key_set keys;

  int a, b;
  while (std::cin >> a)
  {
    std::cin >> b;
    add_data(a, b, groups, keys);
  }  


  build_output(groups, keys);
  //std::cout << "number of sets: " << 
  //  groups.count_sets(keys.begin()), keys.end()) << std::endl;

}

我熬了很多时间学习如何使用的boost :: disjoint_sets 这个问题。目前似乎没有太大的任何文件。

I stayed up many hours learning how to use boost::disjoint_sets on this problem. There doesn't seem to be much of any documentation on it.

关于性能。该 disjoint_sets 结构是O(字母(N)),其关键业务( MAKE_SET find_set union_set ),这是pretty的接近恒定的,所以如果它是建筑结构的只是一个问题,整个算法会为O(n&阿尔法(N))(这实际上是为O(n)),但我们必须把它打印出来。这意味着我们必须建立一些关联容器,它不能执行为​​O更好的(N log n)的。这也许可以通过选择不同的关联容器(比如的hash_set 等),因为一旦你填充的初始列表,你可以保留一个最优量来获得一个常数因子加速的空间。

About the performance. The disjoint_sets structure is O(α(n) ) for its key operations (make_set, find_set and union_set) which is pretty close to constant, and so if it were just a matter of building the structure, the whole algorithm would be O(n α(n) ) (which is effectively O(n) ) but we have to print it out. That means we have to build up some associative containers, which cannot perform better than O(n log n). It might be possible to get a constant factor speedup by choosing a different associative containers (say, hash_set etc), since once you populate the initial list, you can reserve an optimal amount of space.

相关推荐