How to compare clusters?

2023-09-11 02:33:01 Author: 你的名字是我久治不愈的病

Hopefully this can be done with Python! I ran two clustering programs on the same data and now have a cluster file from each. I reformatted the files so that they look like this:

Cluster 0:
    Brucellaceae(10)
        Brucella(10)
            abortus(1)
            canis(1)
            ceti(1)
            inopinata(1)
            melitensis(1)
            microti(1)
            neotomae(1)
            ovis(1)
            pinnipedialis(1)
            suis(1)
Cluster 1:
    Streptomycetaceae(28)
        Streptomyces(28)
            achromogenes(1)
            albaduncus(1)
            anthocyanicus(1)

etc.

These files contain bacterial species info. Each entry starts with the cluster number (Cluster 0), then right below it the family (Brucellaceae) and the number of bacteria in that family (10). Under that are the genera found in that family (name followed by a count, e.g. Brucella(10)), and finally the species in each genus (abortus(1), etc.).

My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs number their clusters differently, so two clusters may be the same even if the actual "Cluster Number" is different (the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only difference being the cluster number). So I need something that ignores the cluster number and focuses on the cluster contents.

Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!

Recommended answer

Given:

file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

Is this what you need?

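def parse_file(open_file):
    # Flatten the tree into 'family.genus.species' strings, ignoring cluster numbers.
    # Assumes one space of indentation per level, as in the sample data above.
    result = []
    levels = ['', '', '']

    for line in open_file:
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            # A 'Cluster N:' header: start a fresh path and skip the line.
            levels = ['', '', '']
            continue
        item = line.strip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result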

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]

To see the differences:

for desc, items in differences:
    print(desc)
    print()
    for item in items:
        print('\t' + item)
    print()

prints

common elements

    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower

missing from file2

    tiny.green.dot
    giant.red.apple

missing from file1

    giant.red.tomato
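
The comparison above flattens both files, so it reports which species differ but not whether whole clusters line up. If the grouping itself matters (Cluster 1 in one file matching Cluster 43 in the other), one possible approach is to turn each cluster into a frozenset of its species paths and compare the sets of clusters. The following is only a sketch: parse_clusters and the sample data are made up for illustration, it assumes every header looks like "Cluster N:" and every species sits three levels deep, and it tracks indentation with a stack so that any consistent indent width should work.

```python
import re

def parse_clusters(lines):
    """Return a set of frozensets, one per cluster, of 'family.genus.species' names."""
    clusters = []
    stack = []  # (indent, name) pairs describing the current path in the tree
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if re.match(r'Cluster \d+:', text):
            # New cluster: the number itself is deliberately discarded.
            clusters.append(set())
            stack = []
            continue
        indent = len(line) - len(line.lstrip())
        # Drop path entries that are at the same depth or deeper than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, text.split('(', 1)[0]))
        if len(stack) == 3:  # family, genus, species
            clusters[-1].add('.'.join(name for _, name in stack))
    return set(frozenset(c) for c in clusters)

# Hypothetical sample data: the first cluster is the same in both files,
# but numbered differently and with its species in a different order.
a = '''Cluster 1:
    giant(2)
        red(2)
            brick(1)
            apple(1)
Cluster 2:
    tiny(1)
        green(1)
            dot(1)'''.split('\n')
b = '''Cluster 43:
    giant(2)
        red(2)
            apple(1)
            brick(1)
Cluster 44:
    tiny(1)
        blue(1)
            candy(1)'''.split('\n')

clusters_a = parse_clusters(a)
clusters_b = parse_clusters(b)

same = clusters_a & clusters_b    # clusters with identical contents
only_a = clusters_a - clusters_b  # clusters with no exact match in b
only_b = clusters_b - clusters_a  # clusters with no exact match in a

for label, group in [('same', same), ('only in a', only_a), ('only in b', only_b)]:
    print(label)
    for cluster in group:
        print('\t' + ', '.join(sorted(cluster)))
```

Because the clusters are stored in a set, two clusters with identical contents inside the same file collapse into one entry, and a cluster that differs by a single species shows up as unmatched on both sides. For real files, passing the result of open(path).read().splitlines() in place of the sample lists should work the same way.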
 