Reading XML with an "&" into a C# XmlDocument object

2023-09-02 10:38:13 Author: 克制不去在乎ら

I have inherited a poorly written web application that seems to have errors when it tries to read in an xml document stored in the database that has an "&" in it. For example there will be a tag with the contents: "Prepaid & Charge". Is there some secret simple thing to do to have it not get an error parsing that character, or am I missing something obvious?

Edit: Are there any other characters that will cause this same type of parser error for not being well-formed?

Recommended answer

The problem is that the xml is not well-formed. Properly generated xml would list that data like this:

Prepaid &amp; Charge

I've had to fix the same problem before, and I did it with this regex:

Regex badAmpersand = new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

Combine that with a string constant defined like this:

const string goodAmpersand = "&amp;";

Now you can say badAmpersand.Replace(<your input>, goodAmpersand);

Note that a simple String.Replace("&", "&amp;") isn't good enough, since you can't know in advance for a given document whether any & characters will be coded correctly, incorrectly, or even both in the same document.
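Putting the pieces together, here is a minimal sketch of how the regex and the constant might be wrapped around XmlDocument.LoadXml. The class name AmpersandFixer and the method names Fix and Load are made up for illustration; only the pattern and the replacement string come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: the class and method names are illustrative only.
public static class AmpersandFixer
{
    // Matches an '&' that does NOT start a named entity (e.g. &amp;)
    // or a decimal character reference (e.g. &#38;).
    private static readonly Regex BadAmpersand =
        new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

    private const string GoodAmpersand = "&amp;";

    // Escapes only the bare ampersands; already-escaped ones are left alone.
    public static string Fix(string rawXml)
    {
        return BadAmpersand.Replace(rawXml, GoodAmpersand);
    }

    // Runs the fix first, then hands the text to the parser.
    public static XmlDocument Load(string rawXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Fix(rawXml));
        return doc;
    }
}
```

Calling AmpersandFixer.Load on the string read from the database, instead of calling LoadXml directly, is one way to fit the extra pass in without touching the rest of the application.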

The catches here are that you have to do this to your xml document before loading it into your parser, which likely means an extra pass through it. Also, it does not account for ampersands inside of a CDATA section. Finally, it only catches ampersands, not other illegal characters like <. Update: based on the comment, I need to update the expression for hex-coded (&#x...;) entities as well.
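For the hex-coded case, one possible adjustment is to add a third alternative to the lookahead. This is only a sketch, not the updated expression the answer refers to: the {1,6} bound on the hex digits is my assumption, and the class name HexAwareAmpersandFixer is made up. It still does not handle CDATA sections or any other illegal character.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: names and the hex-digit bound are assumptions.
public static class HexAwareAmpersandFixer
{
    // Same idea as the original pattern, plus a third alternative for
    // hexadecimal character references such as &#x26;.
    // XML only allows a lowercase 'x' in "&#x", so 'X' is not accepted here.
    private static readonly Regex BadAmpersand =
        new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};|#x[0-9a-fA-F]{1,6};)");

    public static string Fix(string rawXml)
    {
        return BadAmpersand.Replace(rawXml, "&amp;");
    }
}
```

Without the extra alternative, a correctly written &#x26; would be treated as a bare ampersand and turned into &amp;#x26;, which changes the data.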

Regarding which characters can cause problems, the actual rules are a little complex. For example, certain characters are allowed in data, but not as the first letter of an element name. And there's no simple list of illegal characters. Instead, a large (non-contiguous) swath of Unicode is defined as legal, and anything outside of that is illegal.
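If you want to check text against the legal character ranges rather than memorize them, .NET 4.0 and later expose XmlConvert.IsXmlChar. The sketch below drops anything that is not a legal XML 1.0 character; the class and method names are made up, and dropping (rather than replacing or reporting) the offending characters is just one possible policy. It says nothing about the stricter rules for element names.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper: names and the "drop it" policy are assumptions.
public static class XmlCharFilter
{
    public static string StripIllegalChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                // Legal single UTF-16 code unit (tab, CR, LF, U+0020 and up, ...).
                sb.Append(c);
            }
            else if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // A valid surrogate pair encodes U+10000..U+10FFFF, which is legal.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            // Anything else (most control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

Note that this only removes characters that can never appear in a document. Characters like & and < are legal XML characters, so they pass through here and still need escaping.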

So when it comes down to it, you have to trust your document source to have at least a certain amount of compliance and consistency. For example, I've found that people are often smart enough to make sure the tags work properly and escape <, even if they don't know that & isn't allowed, hence your problem today. However, the best thing would be to get this fixed at the source.

Oh, and a note about the CDATA suggestion: I'd use that to make sure xml that I'm creating is well-formed, but when dealing with existing xml from outside, I find the regex method easier.
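On the creating side, a rough sketch of what "fixed at the source" could look like: build the document through the XmlDocument API instead of concatenating strings, and the escaping (or the CDATA wrapping) is done for you. The element name "item" and the class and method names are placeholders, not anything from the original application.

```csharp
using System;
using System.Xml;

// Hypothetical helper: the element name "item" and all names here are placeholders.
public static class WellFormedWriter
{
    // Lets the XML API escape the text: '&' is written out as "&amp;".
    public static string WithEscaping(string value)
    {
        var doc = new XmlDocument();
        XmlElement item = doc.CreateElement("item");
        item.InnerText = value;
        doc.AppendChild(item);
        return doc.OuterXml;
    }

    // Wraps the text in a CDATA section, so '&' is stored literally.
    public static string WithCData(string value)
    {
        var doc = new XmlDocument();
        XmlElement item = doc.CreateElement("item");
        item.AppendChild(doc.CreateCDataSection(value));
        doc.AppendChild(item);
        return doc.OuterXml;
    }
}
```

Either form loads back into XmlDocument without the regex pass, and the element's InnerText comes back as the original "Prepaid & Charge".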