Why does SAXParser fail randomly?

2023-09-07 12:51:24 Author: 喝⑨泡⑧讲⑩话

I'm using a SAX parser in my Android application to read a few feeds at a time. The script is executed as follows.

                     // Begin FeedLezer
                    try {

                        /** Handling XML **/
                        SAXParserFactory spf = SAXParserFactory.newInstance();
                        SAXParser sp = spf.newSAXParser();
                        XMLReader xr = sp.getXMLReader();

                        /** Send URL to parse XML Tags **/
                        URL sourceUrl = new URL(
                            BronFeeds[i]);

                        /** Create handler to handle XML Tags ( extends DefaultHandler ) **/
                        Feed_XMLHandler myXMLHandler = new Feed_XMLHandler();
                        xr.setContentHandler(myXMLHandler);
                        xr.parse(new InputSource(sourceUrl.openStream()));

                    } catch (Exception e) {
                        System.out.println("XML Parsing Exception = " + e);
                    }
                     sitesList = Feed_XMLHandler.sitesList;

                    String titels = sitesList.getMergedTitles();

And here are Feed_XMLHandler.java and Feed_XMLList.java, which I basically both just took from the web.

However, this code fails at times. I'll show some examples.

https://m.xsw88.com/allimgs/daicuo/20230907/5571.png.jpg It goes very well here. It even recognizes and displays apostrophes. Even when clicking the articles open, almost all of the text shows, so that's all good. The source feed is here. I can't control the feed.

https://m.xsw88.com/allimgs/daicuo/20230907/5572.png.jpg Here, it doesn't go so well. It does display the ï, but it chokes on the apostrophe (there's supposed to be 'NORAD' after the Waarom). Here

https://m.xsw88.com/allimgs/daicuo/20230907/5573.png.jpg This is the worst one. As you can see, the title only displays an apostrophe, whilst it is supposed to be a 'blablabla'. Also, the text ends in the middle of the line, without any special characters in the quote. The feed is here

In all cases, I have no control over the feed. I think the script does choke on special characters. How can I make sure SAX fetches all the strings correctly?

If anyone knows an answer to this, you'd really help me out a LOT :D

Thanks in advance.

Recommended answer

This is from the FAQ of Xerces.

Why does the SAX parser lose some character data or why is the data split into several chunks? If you read the SAX documentation, you will find that SAX may deliver contiguous text as multiple calls to characters, for reasons having to do with parser efficiency and input buffering. It is the programmer's responsibility to deal with that appropriately, e.g. by accumulating text until the next non-characters event.
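To see the behaviour the FAQ describes, here is a minimal sketch that records every call to characters() as a separate chunk. ChunkRecorder and chunksOf are hypothetical names made up for this illustration, and the sample title is invented. How many chunks you get depends on the parser; many parsers split text at entity references such as &apos;, but that is not guaranteed, so no expected output is shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from the original code): stores each characters() call separately.
class ChunkRecorder extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // One list entry per callback, so splits made by the parser become visible.
        chunks.add(new String(ch, start, length));
    }

    // Parses an XML string and returns the text chunks in the order they were delivered.
    static List<String> chunksOf(String xml) throws Exception {
        ChunkRecorder recorder = new ChunkRecorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        return recorder.chunks;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample title; the chunk count is parser-dependent.
        for (String chunk : chunksOf("<title>Waarom &apos;NORAD&apos; na&#239;ef is</title>")) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

Whatever the split, concatenating the chunks in order gives back the full text, which is why accumulating them is the safe approach.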

Your code is clearly adapted from one of the many XML parsing tutorials (like this one here). Now, the tutorial is good and all, but it fails to mention something very important...

Notice this part here...

    public void characters(char[] ch, int start, int length)
            throws SAXException
    {
              if(in_ThisTag){
                     myobj.setName(new String(ch, start, length)); // overwrites the value on every call
              }
    }

I bet at this point you're checking booleans to track which tag you're in and then setting a value in some kind of class you made, or something like that.

But the problem is, the SAX parser (which is buffered) will not necessarily give you all the characters between a pair of tags in one go. Say you have <tag> Lorem Ipsum...really long sentence...</tag>: your SAX parser may call the characters function in chunks.

So the trick here is to keep appending the values to a string variable and then actually set (or commit) it to your structure when the tag ends (i.e. in endElement).

Example

@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {

    in_Tag = false; // leaving the element; this is the flag characters() checks

    /** set value */
    if (localName.equalsIgnoreCase("tag")) {
        sitesList.setName(currentValue);
        currentValue = ""; // reset the buffer for the next element
    }

}

@Override
public void characters(char[] ch, int start, int length)
        throws SAXException {

    if (in_Tag) {
        currentValue += new String(ch, start, length); //keep appending string, don't set it right here....maybe there's more to come.
    }

}

Also, it would be better if you use StringBuilder for the appending, since that'll be more efficient....
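A minimal sketch of the StringBuilder variant could look like the following. TitleHandler, parseTitles and the "title" element name are assumptions chosen for illustration, not the original Feed_XMLHandler. It matches on qName because the default desktop SAXParserFactory is not namespace-aware and may leave localName empty; on Android the localName check from the example above may be what you want.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <title> element.
class TitleHandler extends DefaultHandler {
    final List<String> titles = new ArrayList<>();
    private final StringBuilder currentValue = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equalsIgnoreCase("title")) {
            inTitle = true;
            currentValue.setLength(0); // start with an empty buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            // Append only; more chunks for the same element may still arrive.
            currentValue.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("title")) {
            titles.add(currentValue.toString()); // commit once the element is complete
            inTitle = false;
        }
    }

    // Parses an XML string and returns the collected titles.
    static List<String> parseTitles(String xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }
}
```

Calling TitleHandler.parseTitles on a feed string returns each title in full, no matter how many characters() calls the parser used to deliver it.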

Hope it makes sense! If it didn't, check this and here

 