Say there is a set of words and I would like to cluster them based on their char bag (multiset). For example
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba and aabb are clustered together because they have the same char bag, i.e. two a's and two b's.
To make it efficient, a naive way I can think of is to convert each word into a char-count series. For example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. That way I can build a dictionary and group words that have the same key.
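To make the idea concrete, here is a minimal Python sketch of that char-count key. The function names `char_count_key` and `group_by_char_bag` are only illustrative, and it assumes the words contain no digits (otherwise a key like a12 would be ambiguous).

```python
from collections import Counter, defaultdict


def char_count_key(word):
    """Build a key such as 'a2b2' from the sorted (char, count) pairs."""
    counts = Counter(word)
    return "".join(f"{ch}{n}" for ch, n in sorted(counts.items()))


def group_by_char_bag(words):
    """Group words that share the same char-count key."""
    groups = defaultdict(list)
    for word in words:
        groups[char_count_key(word)].append(word)
    # Dicts keep insertion order, so groups appear in first-seen order.
    return list(groups.values())


clusters = group_by_char_bag(["tea", "eat", "abba", "aabb", "hello"])
print(clusters)  # [['tea', 'eat'], ['abba', 'aabb'], ['hello']]
```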
Two issues here: first, I have to sort the chars to build the key; second, the string key looks awkward and does not perform as well as char/int keys.
Is there a more efficient way to solve the problem?
For detecting anagrams you can use a hashing scheme based on the product of prime numbers: a->2, b->3, c->5, etc. This gives "abba" == "aabb" == 36 (though a different letter-to-prime mapping would be better).
See my answer here.
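A rough Python sketch of the prime-product idea, assuming the words consist only of lowercase a-z. The names `PRIMES`, `prime_product_key` and `group_by_prime_product` are made up for illustration. By unique prime factorization, two words get the same product exactly when they have the same char bag, and no sorting is needed.

```python
from collections import defaultdict

# First 26 primes, one per letter: a->2, b->3, c->5, ...
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]


def prime_product_key(word):
    """Multiply the primes assigned to each letter (assumes lowercase a-z)."""
    key = 1
    for ch in word:
        key *= PRIMES[ord(ch) - ord("a")]
    return key


def group_by_prime_product(words):
    """Group words whose prime products are equal, i.e. anagrams."""
    groups = defaultdict(list)
    for word in words:
        groups[prime_product_key(word)].append(word)
    return list(groups.values())


print(prime_product_key("abba"), prime_product_key("aabb"))  # 36 36
print(group_by_prime_product(["tea", "eat", "abba", "aabb", "hello"]))
```

Python integers do not overflow, so the product stays exact for any word length. In a language with fixed-width integers, long words could overflow the key, which is one reason to give the smallest primes to the most frequent letters.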