How can I extract text from Word files using C#?

2023-09-06 06:11:16 Author: 你那么骄傲怎么不脱衣炫耀

I am trying to convert a large number (100,000) of Word DOC files. They are quite old, from roughly the 1995 to 2000 versions of Word, I suppose. I keep going around in circles between what I find here on Stack Overflow and the MS documentation.

What I want to do is simply read the file, put the text into a string, parse the string, and pull out the structured fields (each file is actually a structured report, with lines that look like Patient: Jon Doe). From that point on, I know what I am doing: I can parse the string data, store it in useful variables, and write that data to a database. What I do not know is how to get the document text into a string in the first place. Any help?

PPS: I found this reference, which supposedly converts a DOC file into a text file. It's a start, but I'd rather avoid doing a bunch of file manipulations.

Recommended answer

If you use the Word object model, you always have to instantiate a specific version of Word on the client (running Word on a server is not recommended). Unfortunately, you then depend on that version's restrictions on older files. In Word 2010, for example, you can open Office 95 files only in sandbox mode, which means you cannot access the file content programmatically. You will also have to deal with unknown template content, such as documents with macros attached.
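For reference, the object-model route looks roughly like the sketch below. It is only an illustration under several assumptions: it runs on Windows with some version of Word installed, it uses late binding (`dynamic`) so no interop assembly reference is needed, and the class and method names (`WordTextReader`, `ReadText`, `NormalizeParagraphMarks`) are made up for this example. Given the restrictions described above, the installed Word version may refuse to open the oldest files, in which case `Documents.Open` will fail.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of any library.
public static class WordTextReader
{
    // Word returns '\r' as the paragraph mark in Range.Text;
    // converting it to '\n' makes line-based parsing easier.
    public static string NormalizeParagraphMarks(string raw)
    {
        return raw.Replace("\r\n", "\n").Replace('\r', '\n');
    }

    // Opens the document read-only in a hidden Word instance
    // and returns the text of the main document body.
    public static string ReadText(string path)
    {
        Type wordType = Type.GetTypeFromProgID("Word.Application");
        if (wordType == null)
            throw new InvalidOperationException("Word does not appear to be installed.");

        dynamic word = Activator.CreateInstance(wordType);
        dynamic doc = null;
        try
        {
            word.Visible = false;
            // Positional arguments: FileName, ConfirmConversions, ReadOnly.
            doc = word.Documents.Open(path, false, true);
            string text = doc.Content.Text;
            return NormalizeParagraphMarks(text);
        }
        finally
        {
            if (doc != null) doc.Close(false); // false = do not save changes
            word.Quit();
            Marshal.ReleaseComObject((object)word);
        }
    }
}
```

A call such as `string report = WordTextReader.ReadText(@"C:\reports\sample.doc");` (the path is a placeholder) would then hand the text to the parsing step. For a batch of 100,000 files it would probably make sense to keep one Word instance alive across many documents instead of starting and quitting Word for each file as this sketch does.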

In your case I would rather look for a third-party component that gives you access to the content. I know from document management systems such as OpenText eDocs and Autonomy iManage that they use other tools to full-text index documents of all types and to present the content in a viewer application. If you look in that direction, you may find something useful.