How to fix incorrectly nested / unclosed HTML tags?

2023-09-10 23:32:18 Author: 不配拥有爱

I need to sanitize user-submitted HTML by closing every open tag in the correct nesting order. I have been looking for an algorithm or Python code to do this, but have found nothing except some half-baked implementations in PHP, etc.

For example, something like

<p>
  <ul>
    <li>Foo

becomes

<p>
  <ul>
    <li>Foo</li>
  </ul>
</p>

Any help would be appreciated :)

Recommended answer

Using BeautifulSoup:

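# Requires the beautifulsoup4 package (imported as bs4)
from bs4 import BeautifulSoup

html = "<p><ul><li>Foo"
# The built-in "html.parser" backend keeps the <p> and closes tags in nesting order
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())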

gives you

<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>

As far as I know, you can't stop prettify() from putting the <li> and </li> tags on separate lines from Foo.

Using Tidy:

import tidy  # provided by uTidylib

html = "<p><ul><li>Foo"
# show_body_only drops the <html>/<head>/<body> wrapper Tidy would add
print(tidy.parseString(html, show_body_only=True))

gives you

<ul>
<li>Foo</li>
</ul>

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))

comes out as

<p></p>
<ul>
<li>Foo</li>
</ul>

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indenting:

print(tidy.parseString(html, show_body_only=True, indent=True))

becomes

<ul>
  <li>Foo
  </li>
</ul>

All of these have their ups and downs, but hopefully one of them is close enough.
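If neither library is an option, the underlying idea (keep a stack of open tags and close them innermost-first) can be sketched with only the standard library. This is a rough illustration rather than what BeautifulSoup or Tidy do internally; the names TagCloser, close_tags and VOID_TAGS are made up for this example, and the sketch only balances tags: it drops comments and doctypes and does not filter unsafe markup.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit HTML while tracking open tags (hypothetical helper)."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching open tag: drop it
        # Close inner tags first so the nesting stays correct.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def close_tags(html):
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Whatever is still open at the end gets closed innermost-first.
    while parser.stack:
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)


print(close_tags("<p><ul><li>Foo"))  # prints <p><ul><li>Foo</li></ul></p>
```

Unlike Tidy, this keeps the <p> from the example, because it never applies HTML's content-model rules (such as a <ul> implicitly ending a paragraph); whether that is desirable depends on how strict the output needs to be.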