Find a series of data using non-exact measurements (fuzzy)

2023-09-11 04:34:14 Author: 如果i

This is a more complex follow-up question to: Efficient way to look up sequential values

Each Product can have many Segment rows (thousands). Each segment has a position column that starts at 1 for each product (1, 2, 3, 4, 5, etc.) and a value column that can contain any value, such as 323.113, 5423.231, 873.42, 422.64, 763.1, etc. The data is read-only.

It may help to think of the product as a song and the segments as a set of musical notes in the song.

Given a subset of contiguous segments, like a snippet of a song, I would like to identify potential matches for products. However, due to potential errors in measurements, the segments in the subset may not match the segments in the database exactly.

How can I identify product candidates by finding the segments of products which most closely match the subset of segments I have measured? Also, is a database the best medium for this type of data?

Here are just some thoughts for how I was about to approach this problem. Please don't take these as exact requirements. I am open to any kind of algorithms to make this work as best as possible. I was thinking there needs to be multiple threshold variables for determining closeness. One possibility might be to implement a proximity threshold and a match threshold.

For example, given these values:

Product A contains these segments: 11,21,13,13,15.
Measurement 1 has captured: 20,14,14,15.
Measurement 2 has captured: 11,21,78,13.
Measurement 3 has captured: 15,13,21,13,11.

If a proximity threshold allowed the measured segment to be 1 above or below the actual segment, then Measurement 1 may match Product A because, although many segments do not match exactly, they are within the proximity threshold relative to the actual values.

If a match threshold allowed for measurements with matches of 3 or more, Measurement 2 may return Product A because, although one of the segments (78) far exceeds the proximity threshold, it still matches 3 segments in the correct order and so is within the match threshold.

Measurement 3 would not match Product A because, although all measured segments exist in the actual segments, they are not within the proximity or match thresholds.
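The two thresholds described above can be sketched as a sliding-window comparison. This is only one possible reading of the question, not a stated requirement: the function names, the default `proximity` of 1, and the default `match_threshold` of 3 are illustrative assumptions taken from the example values.

```python
def count_close(window, measured, proximity):
    """Count positions where the measured value is within `proximity` of the actual value."""
    return sum(1 for a, m in zip(window, measured) if abs(a - m) <= proximity)


def best_alignment(actual, measured, proximity=1):
    """Slide `measured` along `actual`; return (best_count, offset) of the best alignment.

    Returns (0, None) if the measurement is longer than the product or nothing is close.
    """
    best = (0, None)
    k = len(measured)
    for offset in range(len(actual) - k + 1):
        count = count_close(actual[offset:offset + k], measured, proximity)
        if count > best[0]:
            best = (count, offset)
    return best


def is_candidate(actual, measured, proximity=1, match_threshold=3):
    """A product is a candidate if some alignment has enough in-order close segments."""
    count, _ = best_alignment(actual, measured, proximity)
    return count >= match_threshold


product_a = [11, 21, 13, 13, 15]
print(is_candidate(product_a, [20, 14, 14, 15]))      # Measurement 1
print(is_candidate(product_a, [11, 21, 78, 13]))      # Measurement 2
print(is_candidate(product_a, [15, 13, 21, 13, 11]))  # Measurement 3
```

A scan like this touches every product, so with thousands of segments per product it is probably better suited to re-ranking a short list of candidates than to searching the whole data set.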

Update: One of the answers asked me to define what I mean by "most closely match". I'm not exactly sure how to answer that, but I'll try to explain by continuing with the song analogy. Let's say the segments represent the maximum frequencies of a recorded song. If I record that same song again it will be similar, but due to background noise and other limitations of the recording equipment, some of the frequencies will match, some will be close, and a few will be way off. In this scenario, how would you define when one recording "matches" another? That's the same kind of matching logic I want to use for this problem.

Recommended answer

If you take your song example literally, one approach is to boil down your input to a bit-vector fingerprint, and then look up that fingerprint in a database as an exact match. You can increase the chances of finding a good match by extracting several fingerprints from your input and/or by also trying, for example, all bit-vectors that are within a small number of bit errors (such as 1) of your fingerprint.
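As a rough illustration of the fingerprint lookup, here is a minimal sketch. The fingerprint used (one bit per adjacent pair of values, set when the value rises) is an assumption chosen only to keep the example small; it is not the Shazam scheme, and `build_index`, `lookup`, and the window size are hypothetical names and parameters.

```python
from collections import defaultdict


def fingerprint(values):
    """Hypothetical fingerprint: one bit per adjacent pair, 1 if the value increases."""
    bits = 0
    for prev, cur in zip(values, values[1:]):
        bits = (bits << 1) | (1 if cur > prev else 0)
    return bits


def neighbours(fp, nbits):
    """Yield the fingerprint itself and every fingerprint exactly 1 bit error away."""
    yield fp
    for i in range(nbits):
        yield fp ^ (1 << i)


def build_index(products, window):
    """Map the fingerprint of every contiguous window to the (product, offset) pairs producing it."""
    index = defaultdict(set)
    for name, segments in products.items():
        for off in range(len(segments) - window + 1):
            index[fingerprint(segments[off:off + window])].add((name, off))
    return index


def lookup(index, measured, window):
    """Extract several fingerprints from the measurement and probe each with 1-bit tolerance."""
    hits = set()
    for off in range(len(measured) - window + 1):
        fp = fingerprint(measured[off:off + window])
        for candidate in neighbours(fp, window - 1):
            hits |= index.get(candidate, set())
    return hits


index = build_index({"A": [11, 21, 13, 13, 15]}, window=4)
print(lookup(index, [20, 14, 14, 15], window=4))
print(lookup(index, [11, 21, 78, 13], window=4))
```

Each probe is an exact-match lookup, so the index could just as well be a keyed database table. A fingerprint this coarse will produce many false positives on real data, so the hits should be treated as candidates to verify against the stored values.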

If you have access to the ACM Digital Library, you can read a description of this sort of approach in "The Shazam Music Recognition Service" at http://delivery.acm.org/10.1145/1150000/1145312/p44-wang.pdf?ip=94.195.253.182&acc=ACTIVE%20SERVICE&CFID=53180383&CFTOKEN=41480065&acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744. There is also some information at http://www.music.mcgill.ca/~alastair/621/porter11fingerprint-summary.pdf.

The input format you describe suggests that you might be able to do something with the random projection method described in http://en.wikipedia.org/wiki/Locality_sensitive_hashing.
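A minimal sketch of the random projection idea follows, assuming each fixed-length window of segment values is treated as a vector. Centring the vector on its mean before projecting is an extra assumption made here so that all-positive measurements do not all land on the same side of every hyperplane. The names `make_hyperplanes`, `lsh_signature`, and `hamming` are illustrative.

```python
import random


def make_hyperplanes(dim, nbits, seed=0):
    """Draw `nbits` random hyperplane normals in `dim` dimensions (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(nbits)]


def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of the plane the (mean-centred) vector falls on."""
    mean = sum(vec) / len(vec)
    centred = [v - mean for v in vec]  # assumption: centring, see lead-in
    sig = 0
    for plane in planes:
        dot = sum(p * c for p, c in zip(plane, centred))
        sig = (sig << 1) | (1 if dot >= 0 else 0)
    return sig


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dim=4, nbits=16)
actual = lsh_signature([21, 13, 13, 15], planes)
noisy = lsh_signature([20, 14, 14, 15], planes)
unrelated = lsh_signature([5, 90, 2, 40], planes)
print(hamming(actual, noisy), hamming(actual, unrelated))
```

Vectors pointing in similar directions tend to agree on most bits, so a small Hamming distance between signatures suggests a candidate match that is worth checking against the real values.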

To answer your second question, depending on exactly what a position corresponds to, you might consider boiling down the numbers to hash fingerprints made up of bits or characters, and storing these in a text search database, such as Apache Lucene.
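One way to picture the character-fingerprint idea is to quantise each value into a bucket and emit overlapping runs of buckets as tokens that a text index could store and search. This is only a sketch under assumptions: the bucket width, the shingle length, and the function names are hypothetical, and no Lucene API is shown.

```python
def to_tokens(values, bucket=2.0, shingle=3):
    """Quantise values into buckets and join each run of `shingle` buckets into one token."""
    symbols = [str(int(v // bucket)) for v in values]
    return ["_".join(symbols[i:i + shingle])
            for i in range(len(symbols) - shingle + 1)]


def shared_tokens(actual, measured, bucket=2.0, shingle=3):
    """Tokens common to both sequences; more shared tokens suggests a closer match."""
    return set(to_tokens(actual, bucket, shingle)) & set(to_tokens(measured, bucket, shingle))


print(to_tokens([11, 21, 13, 13, 15]))
print(shared_tokens([11, 21, 13, 13, 15], [11, 21, 13, 12]))
```

Fixed buckets have an edge effect: two values that are close but fall on opposite sides of a bucket boundary produce different tokens, so in practice one might index each sequence under more than one bucket offset.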