Creating strings with all possible combinations

2023-09-11 05:55:07 Author: 莫念初

I am using an OCR algorithm (Tesseract-based) that has difficulty recognizing certain characters. I have partially solved that by creating my own "post-processing hash table", which holds pairs of characters. For example, since the text is just numbers, I have figured out that if there is a Q character in the text, it should be a 9 instead.
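A post-processing table like the one described could be sketched as below. The names `CORRECTIONS` and `fix_ocr` are made up for illustration; only the Q to 9 pair comes from the description above.

```python
# Hypothetical post-processing table: maps a misrecognized character to the
# digit it most likely stands for. Only the Q -> 9 pair comes from the text.
CORRECTIONS = {"Q": "9"}


def fix_ocr(text):
    """Replace every character that has an unambiguous correction."""
    return "".join(CORRECTIONS.get(ch, ch) for ch in text)


print(fix_ocr("12Q4Q"))  # prints 12949
```

A one-to-one table like this only works when each misread character has a single likely meaning.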

However, I have a more serious problem with the characters 6 and 8, since both of them are recognized as B. Since I know what I am looking for (when I am translating the image to text) and the strings are fairly short (6 to 8 digits), I thought I would create strings with all possible combinations of 6 and 8 and compare each one of them to the one I am looking for.
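Since the expected string is known in advance, one possibly simpler option is to compare the OCR output against it position by position, without building every combination first. This is a different technique from the enumeration described above, and the names `AMBIGUOUS` and `could_match` are hypothetical.

```python
# Characters the OCR may report, mapped to the digits they could stand for.
# Only the B -> 6/8 ambiguity comes from the question.
AMBIGUOUS = {"B": "68"}


def could_match(ocr_text, expected):
    """Return True if expected is one of the strings ocr_text could stand for."""
    if len(ocr_text) != len(expected):
        return False
    for seen, wanted in zip(ocr_text, expected):
        # An ambiguous character matches any of its candidates;
        # every other character must match exactly.
        if wanted not in AMBIGUOUS.get(seen, seen):
            return False
    return True


print(could_match("L0B7B0B5", "L0876065"))  # prints True
print(could_match("L0B7B0B5", "L0976065"))  # prints False
```

This uses only builtins, so it should not depend on whether itertools is available.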

So, for example, I have the following string recognized by the OCR:

L0B7B0B5

So each B here can be either 6 or 8.

Now I want to generate a list like the one below:

L0878085
L0878065
L0876085
L0876065
.
.

So it's a kind of binary table with 3 digits, and in this case there are 8 options. But the number of B characters in the string can be other than 3 (it can be any number).
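The binary-table idea can be taken literally: count from 0 to 2^k - 1, where k is the number of B characters, and let each bit choose between 6 and 8. The sketch below uses only builtins; the function name `expand` and its parameters are made up for illustration.

```python
def expand(word, ambiguous="B", options="68"):
    """List every string obtained by replacing each ambiguous character
    with one of the two options."""
    count = word.count(ambiguous)
    results = []
    for n in range(2 ** count):
        candidate = word
        for position in range(count):
            # Bit `position` of n (most significant first) selects the option
            # for the leftmost ambiguous character still in the string.
            bit = (n >> (count - 1 - position)) & 1
            candidate = candidate.replace(ambiguous, options[bit], 1)
        results.append(candidate)
    return results


print(len(expand("L0B7B0B5")))  # prints 8
```

The list grows as 2^k, which stays small for strings of 6 to 8 characters.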

I have tried to use the Python itertools module with something like this:

list(itertools.product(*["86"] * 3))

This gives the following result:

[('8', '8', '8'), ('8', '8', '6'), ('8', '6', '8'), ('8', '6', '6'), ('6', '8', '8'), ('6', '8', '6'), ('6', '6', '8'), ('6', '6', '6')]

I assumed I could then use this to swap out the B characters. However, for some reason I can't get itertools to work in my environment. I assume it has something to do with the fact that I am using Jython and not pure Python.

I would be happy to hear any other ideas on how to complete this task. Maybe there is a simpler solution I haven't thought of?

Recommended answer

itertools.product accepts a repeat keyword that you can use:

In [92]: from itertools import product

In [93]: word = "L0B7B0B5"

In [94]: subs = product("68", repeat=word.count("B"))

In [95]: list(subs)
Out[95]: 
[('6', '6', '6'),
 ('6', '6', '8'),
 ('6', '8', '6'),
 ('6', '8', '8'),
 ('8', '6', '6'),
 ('8', '6', '8'),
 ('8', '8', '6'),
 ('8', '8', '8')]

A fairly concise way to then make the substitutions is a reduce operation over the string replace method:

In [97]: subs = product("68", repeat=word.count("B"))

In [98]: [reduce(lambda s, c: s.replace('B', c, 1), sub, word) for sub in subs]
Out[98]: 
['L0676065',
 'L0676085',
 'L0678065',
 'L0678085',
 'L0876065',
 'L0876085',
 'L0878065',
 'L0878085']
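The session above relies on reduce being a builtin, which is the case in Python 2. In Python 3 it has to be imported from functools, so a standalone version might look like the following sketch; the variable name `candidates` is introduced here for illustration.

```python
from functools import reduce  # builtin in Python 2, lives in functools in Python 3
from itertools import product

word = "L0B7B0B5"
subs = product("68", repeat=word.count("B"))

# For each tuple of digits, replace the leftmost remaining B with the next digit.
candidates = [reduce(lambda s, c: s.replace("B", c, 1), sub, word) for sub in subs]

print(candidates[0], candidates[-1])  # prints L0676065 L0878085
```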

Another approach, using a couple more functions from itertools:

In [90]: from itertools import chain, izip_longest

In [91]: subs = product("68", repeat=word.count("B"))

In [92]: [''.join(chain(*izip_longest(word.split('B'), sub, fillvalue=''))) for sub in subs]
Out[92]: 
['L0676065',
 'L0676085',
 'L0678065',
 'L0678085',
 'L0876065',
 'L0876085',
 'L0878065',
 'L0878085']
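The name izip_longest exists only in Python 2; Python 3 calls the same function zip_longest. A sketch of the same interleaving idea for Python 3, followed by the comparison against a known target, could look like this. The `expected` value is a made-up example, not taken from the question.

```python
from itertools import chain, product, zip_longest  # izip_longest in Python 2

word = "L0B7B0B5"
pieces = word.split("B")  # the text between the B characters

# Interleave the fixed pieces with each tuple of digits and join the result.
candidates = [
    "".join(chain(*zip_longest(pieces, sub, fillvalue="")))
    for sub in product("68", repeat=word.count("B"))
]

expected = "L0876065"  # hypothetical string we are looking for
print(expected in candidates)  # prints True
```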