How can I make my greedy set cover implementation faster?

2023-09-10 23:41:43 Author: 风吹柳絮飞

I came up with the following implementation for the Greedy Set Cover after much discussion regarding my original question here. From the help I received, I encoded the problem into a "Greedy Set Cover" and after receiving some more help here, I came up with the following implementation. I am thankful to everyone for helping me out with this. The following implementation works fine but I want to make it scalable/faster.

By scalable/faster, I mean to say that:

My dataset contains 50K-100K sets in S.
The number of elements in U itself is very small, on the order of 100-500.
The size of each set in S can be anywhere from 0 to 40.

Here is what I tried:

U = set([1,2,3,4])
R = U
S = [set([1,2]), 
     set([1]), 
     set([1,2,3]), 
     set([1]), 
     set([3,4]), 
     set([4]), 
     set([1,2]), 
     set([3,4]), 
     set([1,2,3,4])]
w = [1, 1, 2, 2, 2, 3, 3, 4, 4]

C = []
costs = []

def findMin(S, R):
    minCost = 99999.0
    minElement = -1
    for i, s in enumerate(S):
        try:
            cost = w[i]/(len(s.intersection(R)))
            if cost < minCost:
                minCost = cost
                minElement = i
        except ZeroDivisionError:
            # The set covers nothing new in R; skip it
            pass
    return S[minElement], w[minElement]

while len(R) != 0:
    S_i, cost = findMin(S, R)
    C.append(S_i)
    R = R.difference(S_i)
    costs.append(cost)

print "Cover: ", C
print "Total Cost: ", sum(costs), costs

I am not an expert in Python but any Python-specific optimizations to this code would be really nice.

Answer

What sorts of times are you getting vs. what you need? Surely most of the execution time is spent in C-level code finding set intersections, so there's not much optimization you can do. With some random data (results may vary with your data of course, not sure if these are good values) of 100,000 sets, 40 elements in each set, 500 unique elements, and weights random from 1 to 100,

import random

print 'generating test data'
num_sets = 100000
set_size = 40
elements = range(500)
U = set(elements)
R = U
S = []
for i in range(num_sets):
    random.shuffle(elements)
    S.append(set(elements[:set_size]))
w = [random.randint(1, 100) for i in xrange(num_sets)]  # one weight per set

C = []
costs = []
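The profile listing below mentions a wrapper function func at test.py:40, so presumably the greedy loop was wrapped in a function and run under cProfile; a minimal sketch of such a harness (the name func is an assumption taken from the profile output):

import cProfile

def func():
    # R is rebound inside the loop, hence the global declaration
    global R
    while len(R) != 0:
        S_i, cost = findMin(S, R)
        C.append(S_i)
        R = R.difference(S_i)
        costs.append(cost)

cProfile.run('func()')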

I got performance like this with cProfile:

         8200209 function calls in 14.391 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.391   14.391 <string>:1(<module>)
       41    4.802    0.117   14.389    0.351 test.py:23(findMin)
        1    0.001    0.001   14.391   14.391 test.py:40(func)
  4100042    0.428    0.000    0.428    0.000 {len}
       82    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       41    0.001    0.000    0.001    0.000 {method 'difference' of 'set' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4100000    9.160    0.000    9.160    0.000 {method 'intersection' of 'set' objects}

Hm, so apparently about 1/3 of the time isn't spent in set intersections. But I personally wouldn't optimize any further, especially at the cost of clarity. There's not going to be much you can do about the other 2/3, so why bother?
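That said, if one did want to trim the Python-level third, the natural target is findMin itself. The following is only an illustrative sketch of the kind of Python-specific tweak the question asks for, not a measured improvement: it skips sets that cover nothing new instead of relying on exceptions, and forces float division (which also matters for the cost ratios under Python 2):

def findMin(S, R):
    minCost = None
    minElement = -1
    for i, s in enumerate(S):
        covered = len(s & R)          # elements this set would newly cover
        if covered == 0:
            continue                  # nothing new, skip without raising
        cost = float(w[i]) / covered  # force float division under Python 2
        if minCost is None or cost < minCost:
            minCost = cost
            minElement = i
    return S[minElement], w[minElement]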