I am using an OCR algorithm (Tesseract-based) that has difficulty recognizing certain characters. I have partially solved this by creating my own "post-processing hash table" that holds pairs of characters. For example, since the text is just numbers, I have figured out that if there is a Q character in the text, it should be 9 instead.
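A minimal sketch of such a post-processing table, assuming a plain dictionary is enough. The names `CORRECTIONS` and `post_process` are hypothetical, and only the Q to 9 pair comes from the description above:

```python
# Hypothetical correction table: maps a misrecognized character to the
# character it should have been. Only the Q -> 9 pair is from the question.
CORRECTIONS = {"Q": "9"}


def post_process(text, corrections=CORRECTIONS):
    """Replace every character that has an entry in the correction table."""
    return "".join(corrections.get(ch, ch) for ch in text)


cleaned = post_process("L0Q7Q0")
```

Characters without an entry pass through unchanged, so an ambiguous character such as B has to be handled separately.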
However, I have a more serious problem with the 6 and 8 characters, since both of them are recognized as B. Since I know what I am looking for (when I am translating the image to text) and the strings are fairly short (6 to 8 digits), I thought I could create strings with all possible combinations of 6 and 8 and compare each of them to the one I am looking for.
So for example, I have the following string recognized by the OCR:
L0B7B0B5
So each B here can be either 6 or 8.
Now I want to generate a list like the one below:
L0878085
L0878065
L0876085
L0876065
.
.
So it is a kind of binary table with 3 digits, and in this case there are 8 options. But the number of B characters in the string can be other than 3 (it can be any number).
I have tried to use the Python itertools module with something like this:
list(itertools.product(*["86"] * 3))
This gives the following result:
[('8', '8', '8'), ('8', '8', '6'), ('8', '6', '8'), ('8', '6', '6'), ('6', '8', '8'), ('6', '8', '6'), ('6', '6', '8'), ('6', '6', '6')]
I assume I can later use these tuples to replace the B characters. However, for some reason I can't make itertools work in my environment. I assume it has something to do with the fact that I am using Jython and not pure Python.
I would be happy to hear any other ideas on how to complete this task. Maybe there is a simpler solution I haven't thought of?
itertools.product accepts a repeat keyword that you can use:
In [92]: from itertools import product
In [93]: word = "L0B7B0B5"
In [94]: subs = product("68", repeat=word.count("B"))
In [95]: list(subs)
Out[95]:
[('6', '6', '6'),
('6', '6', '8'),
('6', '8', '6'),
('6', '8', '8'),
('8', '6', '6'),
('8', '6', '8'),
('8', '8', '6'),
('8', '8', '8')]
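If itertools.product cannot be imported, one possible cause is the interpreter version: product was added in Python 2.6, so an older Jython may not provide it. That is an assumption, since the question does not state the Jython version. In that case the same tuples can be built by hand. The helper name `product_repeat` below is hypothetical, and the function is only a sketch of the repeat form, not a full replacement for itertools.product:

```python
def product_repeat(choices, repeat):
    """Return all tuples of length `repeat` drawn from `choices`.

    Produces the same tuples, in the same order, as
    list(itertools.product(choices, repeat=repeat)).
    """
    results = [()]  # one empty tuple: the only product of zero factors
    for _ in range(repeat):
        # Extend every partial tuple with every possible choice.
        results = [prefix + (c,) for prefix in results for c in choices]
    return results


subs = product_repeat("68", 3)
```

The list grows as 2 to the power of the number of B characters, which stays small for strings of 6 to 8 characters.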
A fairly concise way to make the substitutions is then to do a reduce operation with the string replace method:
In [96]: from functools import reduce  # required on Python 3; reduce is a builtin on Python 2
In [97]: subs = product("68", repeat=word.count("B"))
In [98]: [reduce(lambda s, c: s.replace('B', c, 1), sub, word) for sub in subs]
Out[98]:
['L0676065',
'L0676085',
'L0678065',
'L0678085',
'L0876065',
'L0876085',
'L0878065',
'L0878085']
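The same idea can be wrapped in a reusable function. This is a sketch, assuming Python 2.6 or later (or Python 3); the function name `expand_ambiguous`, its parameters, and the `expected` value are hypothetical and chosen for illustration:

```python
from functools import reduce
from itertools import product


def expand_ambiguous(word, ambiguous="B", choices="68"):
    """Return every string obtained by replacing each `ambiguous`
    character in `word` with one of `choices`."""
    subs = product(choices, repeat=word.count(ambiguous))
    # For each tuple, replace the ambiguous characters one at a time,
    # left to right, with the characters of the tuple.
    return [
        reduce(lambda s, c: s.replace(ambiguous, c, 1), sub, word)
        for sub in subs
    ]


candidates = expand_ambiguous("L0B7B0B5")
expected = "L0878065"  # hypothetical value the caller is looking for
found = expected in candidates
```

This relies on `choices` not containing the ambiguous character itself, otherwise replace would hit an already substituted position.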
Another approach, making use of a couple more functions from itertools:
In [90]: from itertools import chain, izip_longest  # Python 2 name; on Python 3 it is zip_longest
In [91]: subs = product("68", repeat=word.count("B"))
In [92]: [''.join(chain(*izip_longest(word.split('B'), sub, fillvalue=''))) for sub in subs]
Out[92]:
['L0676065',
'L0676085',
'L0678065',
'L0678085',
'L0876065',
'L0876085',
'L0878065',
'L0878085']
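Since the question says the expected string is already known, it may not be necessary to generate the candidates at all. A simpler alternative, sketched here under that assumption, compares the OCR output to the expected string position by position and treats each B as "6 or 8". The function name `matches_expected` is hypothetical, and the code uses no itertools, so it should also run on an older Jython:

```python
def matches_expected(ocr_text, expected, ambiguous="B", choices="68"):
    """Return True if `expected` could have produced `ocr_text`, given
    that any character in `choices` may be read as `ambiguous`."""
    if len(ocr_text) != len(expected):
        return False
    for o, e in zip(ocr_text, expected):
        if o == ambiguous:
            # An ambiguous character matches any of the allowed choices.
            if e not in choices:
                return False
        elif o != e:
            return False
    return True


ok = matches_expected("L0B7B0B5", "L0878065")
```

This runs in time proportional to the string length instead of growing with the number of combinations.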