Clustering words based on their character set

2023-09-11 05:41:32 Author: 我们配吗

Say there is a set of words, and I would like to cluster them based on their char bag (the multiset of their characters). For example,

{tea, eat, abba, aabb, hello}

will be clustered into

{{tea, eat}, {abba, aabb}, {hello}}.

abba and aabb are clustered together because they have the same char bag, i.e. two a's and two b's.

To make it efficient, a naive way I can think of is to convert each word into a char-count series. For example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. I can then build a dictionary and group the words that share the same key.
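The char-count key described above can be sketched in Python as follows. This is only an illustration of the idea, not code from the question: the function names `char_count_key` and `cluster_by_char_bag` are made up for the example, and it assumes the words contain no digits, so that a key such as a2b2 stays unambiguous.

```python
from collections import Counter, defaultdict


def char_count_key(word):
    """Build a key such as 'a2b2' from a word's character counts.

    Hypothetical helper: characters are sorted so that every word with
    the same char bag produces the same string.
    """
    counts = Counter(word)
    return "".join(f"{ch}{n}" for ch, n in sorted(counts.items()))


def cluster_by_char_bag(words):
    """Group words that share the same char-count key."""
    groups = defaultdict(list)
    for word in words:
        groups[char_count_key(word)].append(word)
    # Dictionaries keep insertion order, so clusters appear in the order
    # in which their first word was seen.
    return list(groups.values())


if __name__ == "__main__":
    print(cluster_by_char_bag(["tea", "eat", "abba", "aabb", "hello"]))
```

Sorting here is done over the distinct characters of a word rather than over all of its characters, which is a small saving but does not remove the sort or the string key.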

There are two issues here: first, I have to sort the chars to build the key; second, the string key looks awkward and does not perform as well as a char/int key.

Is there a more efficient way to solve the problem?

Recommended answer

To detect anagrams, you can use a hashing scheme based on the product of prime numbers: with a->2, b->3, c->5, etc., "abba" and "aabb" both hash to 36 (although a different letter-to-prime mapping would be better). See my answer here.
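A minimal Python sketch of the prime-product scheme is shown below. It uses the plain alphabetical mapping from the example (a->2, b->3, c->5, ...), not the better mapping the answer alludes to. It also assumes the words consist only of the ASCII letters a to z. The names `PRIMES`, `prime_product_key`, and `cluster_by_prime_product` are illustrative.

```python
from collections import defaultdict

# The first 26 primes, assigned to 'a'..'z' in alphabetical order.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]


def prime_product_key(word):
    """Multiply the primes of all letters; no sorting is needed.

    Because prime factorization is unique, two words get the same
    product exactly when they have the same char bag.
    """
    key = 1
    for ch in word.lower():
        index = ord(ch) - ord("a")
        if not 0 <= index < 26:
            raise ValueError(f"unsupported character: {ch!r}")
        key *= PRIMES[index]
    return key


def cluster_by_prime_product(words):
    """Group words whose prime products are equal."""
    groups = defaultdict(list)
    for word in words:
        groups[prime_product_key(word)].append(word)
    return list(groups.values())


if __name__ == "__main__":
    print(prime_product_key("abba"))  # 2 * 3 * 3 * 2 = 36
    print(cluster_by_prime_product(["tea", "eat", "abba", "aabb", "hello"]))
```

Python integers have arbitrary precision, so the product cannot overflow here. In a language with fixed-width integers, long words can overflow the key, which is one plausible reason to prefer a mapping that gives the smallest primes to the most frequent letters.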