I regularly receive encoded PDF files. The encoding works like this:
The PDF files display correctly in Acrobat Reader, but when I select all, copy from Acrobat Reader, and paste into a text editor, the content turns out to be encoded. For example:
13579 -> 3579;
hello -> jgnnq
It's basically an offset (maybe a swap) of ASCII characters.
The question is: how can I find the offset automatically when I have access to only a few samples? I cannot be sure whether the encoding offset changes between files. All I know is that some text will usually (if not always) show up inside the PDF, e.g. "Name:", "Summary:", "Total:".
Thanks!
Edit: thanks for the feedback. I'll try to break the question into smaller questions:
Part 1: http://stackoverflow.com/questions/2712466/how-to-detect-identical-parts-inside-string
You need to brute-force it.
If the pattern is a simple shift of the character codes, as in your examples (where every character code is increased by 2):
h i j
e f g
l m n
l m n
o p q
1 2 3
3 4 5
5 6 7
7 8 9
9 : ;
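To confirm that a sample pair really uses one constant shift, the per-character differences can be computed directly. A minimal sketch in Python 3; the helper name `char_offsets` is made up for illustration:

```python
def char_offsets(plain, encoded):
    """Return the set of code-point differences between paired characters."""
    if len(plain) != len(encoded):
        raise ValueError("samples must have the same length")
    return {ord(e) - ord(p) for p, e in zip(plain, encoded)}

# Sample pairs taken from the question
print(char_offsets('hello', 'jgnnq'))   # {2}
print(char_offsets('13579', '3579;'))   # {2}
```

A set with a single element suggests a plain offset; several different values would point to a swap or a per-character mapping instead.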
You could easily implement a check like this against known words:
>>> text = 'jgnnq'
>>> knowns = ['hello', '13579']
>>>
>>> for i in range(-5, 6):  # try every shift from -5 to +5 inclusive
...     rot = ''.join(chr(ord(j) + i) for j in text)  # shift each character code by i
...     for x in knowns:
...         if x in rot:  # a known word appears in the shifted text
...             print(rot)
...
hello
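Since the question mentions labels that usually appear in the decoded text ("Name:", "Summary:", "Total:"), the same brute-force idea can be wrapped in a function that scores each candidate offset by how many of those labels it reveals. This is only a sketch in Python 3, assuming one constant shift over the whole text; the names `shift` and `find_offset`, the search range of ±20, and the sample string are illustrative choices, not something from the question:

```python
def shift(text, offset):
    """Shift every character's code point by offset; out-of-range results become '?'."""
    out = []
    for ch in text:
        code = ord(ch) + offset
        out.append(chr(code) if 0 <= code < 0x110000 else '?')
    return ''.join(out)

def find_offset(encoded, markers, max_shift=20):
    """Return (offset, hits) for the decoding offset that reveals the most markers, or None."""
    best = None
    for offset in range(-max_shift, max_shift + 1):
        decoded = shift(encoded, offset)
        hits = sum(1 for m in markers if m in decoded)
        if hits and (best is None or hits > best[1]):
            best = (offset, hits)
    return best

markers = ['Name:', 'Summary:', 'Total:']
# Made-up sample text, encoded with +2 to simulate the PDF content
sample = shift('Name: Alice  Total: 42', 2)
print(find_offset(sample, markers))   # (-2, 2)
```

Running it on each new file would also show whether the offset has changed, because the returned offset is recomputed from the file's own text. A result of `None` means no candidate in the range produced any known label, which would hint at something other than a plain offset.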