I have an ASP.NET web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.
On the server, I would like to strip the HTML from the text so I can store only the text in a full-text indexed column for searching.
It's a breeze to strip the HTML on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?
Please see my answer.
I downloaded the HtmlAgilityPack and created this function:
using System.Text.RegularExpressions;
using System.Web; // HttpUtility

string StripHtml(string html)
{
    // Insert a space after every tag so that words in adjacent elements do not run together.
    html = html.Replace(">", "> ");

    // Parse the HTML.
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Extract the text content and decode HTML entities (e.g. &amp; becomes &).
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

    // Collapse each run of whitespace into a single space and trim both ends.
    return Regex.Replace(text, @"\s+", " ").Trim();
}
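To see what the function produces, here is a minimal console sketch of the same approach. It assumes the HtmlAgilityPack NuGet package is referenced; the class names `HtmlText` and `Demo` and the sample input are made up for illustration, and it swaps in `System.Net.WebUtility.HtmlDecode` so that it does not need a reference to System.Web outside an ASP.NET project.

```csharp
using System;
using System.Net;                      // WebUtility.HtmlDecode
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the name is not from the original answer.
public static class HtmlText
{
    public static string StripHtml(string html)
    {
        // Treat null or empty input as "no text".
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Insert a space after every tag so adjacent elements do not run together.
        html = html.Replace(">", "> ");

        // Parse the markup with HtmlAgilityPack (assumed to be installed via NuGet).
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // Take the text content and decode entities such as &amp;.
        string text = WebUtility.HtmlDecode(doc.DocumentNode.InnerText);

        // Collapse whitespace runs to one space and trim both ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Sample input invented for this example.
        string html = "<p>Hello <b>world</b> &amp; friends</p>";

        // Prints: Hello world & friends
        Console.WriteLine(HtmlText.StripHtml(html));
    }
}
```

One consequence of adding a space after every `>` is that inline tags inside a word split it: `<b>H</b>ello` comes out as `H ello`. For a full-text index that may be acceptable, but it is worth knowing if the text is also shown to users.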