How to determine whether a PDF is tagged or not?

2023-09-07 15:41:16 Author: trammels

How can I tell whether a PDF is tagged or not? I'm developing a program that copies the text inside a PDF file and displays it in my app. To test this, I copied a table from a PDF file (ordinary copy and paste) and pasted it into MS Word. The result was plain text without any table structure. I have also seen reports that when you copy a table from a PDF file and paste it into Word, it becomes an image. Is that true?

Solution

How to determine if a PDF is tagged or not?

Depending on the library you are using to process your files, you could try to retrieve the MarkInfo entry from the Catalog dictionary.

From the PDF Specification:

TABLE 3.25 Entries in the catalog dictionary

KEY: MarkInfo
TYPE: dictionary
VALUE: (Optional; PDF 1.4) A mark information dictionary containing information about the document's usage of Tagged PDF conventions (see Section 10.6, "Logical Structure").
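As a rough illustration of that check, here is a minimal Python sketch. It assumes your PDF library has already given you the Catalog as plain Python dicts with name keys such as "/MarkInfo" (that representation, and the function name claims_tagged, are assumptions made for this example, not part of any particular library). The flag itself lives in the boolean Marked entry of the mark information dictionary, which defaults to false when absent.

```python
from typing import Any, Mapping


def claims_tagged(catalog: Mapping[str, Any]) -> bool:
    """Return True if the catalog's MarkInfo dictionary has Marked set to true.

    Assumes the catalog has already been converted to plain dicts and bools;
    library-specific wrapper objects may need converting first.
    """
    mark_info = catalog.get("/MarkInfo")
    if not isinstance(mark_info, Mapping):
        # MarkInfo is optional, so a missing entry means "not declared as tagged".
        return False
    # Marked defaults to false when the entry is absent.
    return mark_info.get("/Marked", False) is True


if __name__ == "__main__":
    print(claims_tagged({"/MarkInfo": {"/Marked": True}}))
    print(claims_tagged({"/Type": "/Catalog"}))
```

With a library such as pypdf, the catalog is usually reachable through the reader's trailer under "/Root", but how booleans and indirect references are represented depends on the library, so treat the conversion step as something to verify for your own setup.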

However, even if the Marked entry of this dictionary is set to true, it does not mean that the tags will actually be there, and if they are, they might not be useful to you at all for extracting tables. You can still find PDF files with tables that use the tags only for marking paragraphs and pictures.
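One way to see which tags a file really contains is to walk its structure tree (the StructTreeRoot entry of the Catalog) and count the structure types found in each element's S entry. The sketch below is only illustrative: it assumes the tree has already been loaded into plain Python dicts and lists, and the function name count_structure_types is made up for this example. It follows only the K (kids) entries and ignores marked-content identifiers.

```python
from collections import Counter
from typing import Any, Optional


def count_structure_types(node: Any, counts: Optional[Counter] = None) -> Counter:
    """Count structure types ("/S" values) reachable from node via "/K" entries.

    Assumes the structure tree is made of plain dicts and lists; integers
    (marked-content IDs) and anything else are skipped.
    """
    counts = Counter() if counts is None else counts
    if isinstance(node, list):
        for kid in node:
            count_structure_types(kid, counts)
    elif isinstance(node, dict):
        tag = node.get("/S")
        if tag is not None:
            counts[tag] += 1
        if "/K" in node:
            count_structure_types(node["/K"], counts)
    return counts


if __name__ == "__main__":
    # Hypothetical tree: tagged, but only with a paragraph and a figure.
    tree = {
        "/Type": "/StructTreeRoot",
        "/K": [{"/S": "/P", "/K": 0}, {"/S": "/Figure", "/K": [1]}],
    }
    counts = count_structure_types(tree)
    print("has table tags:", "/Table" in counts)
```

A file like the hypothetical one above would count as tagged, yet it offers nothing to help with table extraction, which is exactly the situation described here.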

Long story short, unless you are generating the files that your application is going to consume, and therefore know which tags to look for, it is not a good idea to rely on these tags for extracting tables from PDF files.