什么是选择的基础上对大集匹配时,位掩码不计特殊项目的一组功能的最佳方法是什么?基础上、掩码、大集、特殊

2023-09-11 05:15:15 作者:[ 入戏太深 ]

假设我有一个大的,静态的对象集,我有,我想根据一套复杂的标准,这就需要一个昂贵的测试匹配所有的对象。

还假设这是可能的,以确定一大组的功能,可以用于排除可能的匹配,从而避免了昂贵的测试。如果一个功能是present对象中的我测试,然后我可以排除一组不具有此功能的任何对象。换句话说,该特征的presence是必要的,但不足以使测试通过

在这种情况下,我可以precompute一个位掩码为组中的每个对象表示每个功能是否present或缺如的对象。我还可以计算它,我要测试的对象,然后通过这样的(伪code)数组循环:

  objectMask = computeObjectMask(myObject的)
对于(每个TestObject定义对象集)
{
    如果((testObject.mask&安培;!objectMask)= objectMask)
    {
        //早早出局,有些功能在objectMask
        //但不是在testObject.mask,所以测试无法通过
    }
    否则,如果(doComplicatedTest(TestObject中,myObject的)
    {
        //找到了匹配!
    }
}
 

如果你想计算的相关性所以我的问题是,给定一个有限的掩码大小,以及可能的功能的大名单,而每个功能的目标设置(加上访问设置的​​对象的频率表功能等)之间,有什么算法,我可以用它来选择功能列入最优集在我的位掩码最大化早期出局的数量并减少测试次数?的

五年级 第四单元 易错题

如果我只是选择顶的X最常见的功能,然后有机会的特征在这两个面具是要高一些,所以它看起来像早期出局的数量将减少。但是,如果我选择的X最不常见的功能,然后objectMask可能经常是零,这意味着没有及早出局是可能的。看来pretty的轻松体验,并拿出了一套中等频率的功能,提供良好的性能,但我感兴趣的是是否有这样做的理论最好的方式。

请注意:每个功能的频率被假定为一组可能的myObjects作为对象集一样,虽然我很想知道如何处理,如果事实并非如此。我也有兴趣知道,如果有一个算法寻找最佳的功能集因为要匹配的候选对象的大样本。

可能的应用:选配针对大量正则表达式的输入字符串,反对使用一个标准的词语,如必须包含在同一顺序相同的字母,但可能有多余的字符插入任何地方大词典匹配的字符串单词等实施例的特点:包含文字字符D,包含字符˚F后跟字符ģ稍后在字符串等等显然一组可能的特征将是高度依赖于特定的应用 解决方案

您可以尝试阿霍 - corasick算法。它最快的多模式匹配。基本上,它是一个有限状态机故障链路计算与广度优先搜索的线索的。

Suppose I have a large, static set of objects, and I have an object that I want to match against all of them according to a complicated set of criteria that entails an expensive test.

Suppose also that it's possible to identify a large set of features that can be used to exclude potential matches, thereby avoiding the expensive test. If a feature is present in the object I am testing, then I can exclude any objects in the set that don't have this feature. In other words, the presence of the feature is necessary but not sufficient for the test to pass.

In that case, I can precompute a bitmask for each object in the set indicating whether each feature is present or absent in the object. I can also compute it for the object that I want to test, and then loop through the array like this (pseudo-code):

objectMask = computeObjectMask(myObject)
for(each testObject in objectSet)
{
    if((testObject.mask & objectMask) != objectMask)
    {
        // early out, some features are in objectMask
        // but not in testObject.mask, so the test can't pass
    }
    else if(doComplicatedTest(testObject, myObject)
    {
        // found a match! 
    }
}

So my question is, given a limited bitmask size, and a large list of possible features, and a table of the frequencies of each feature in object set (plus access to the object set if you want to compute correlations between features and so on), what algorithm can I use to choose the optimal set of features for inclusion in my bitmask to maximize the number of early outs and minimize the number of tests?

If I just choose the top x most common features, then chance of a feature being in both masks is higher, so it seems like the number of early outs would be reduced. However if I choose the x least common features then objectMask might frequently be zero, meaning no early outs are possible. It seems pretty easy to experiment and come up with a set of middling-frequency features that gives good performance, but I'm interested in whether there is a theoretical best way of doing it.

Note: the frequency of each feature is assumed to be the same in the set of possible myObjects as in the objectSet, although I'd be interested to know how to handle if it isn't. I'd also be interested to know if there is an algorithm for finding the best feature set given a large sample of candidate objects that are to be matched against the set.

Possible applications: matching an input string against a large number of regexes, matching a string against a large dictionary of words using a criteria such as "must contain the same letters in the same order, but possibly with extra characters inserted anywhere in the word", etc. Example features: "contains the literal character D", "contains the character F followed by the character G later in the string" etc. Obviously the set of possible features will be highly dependent on the specific application.

解决方案

You can try aho-corasick algorithm. Its the fastest multi pattern matcher. Basically its a finite state machine with failure links computed with a breadth-first search of the trie.