Screen scraping with HTMLAgility: help please

2023-09-06 22:31:26 Author: 啊呸

Last night when I asked about screen scraping, I was given a link to an excellent article, which got me to this point. I have a few questions, however. I will post my code as well as the HTML source below. I am trying to grab the data from the data table and then send it to a SQL table. I have had success grabbing Description Widget 3.5, etc., through Last Modified By Joe. However, because the first two tr rows also contain img src=/......" alt="00721408", the numbers do not get grabbed. First, I am stuck as to how to alter the code so that all the data in the table is grabbed. Second, what do I need to do next to prepare the data to be sent to a SQL table? My code is as follows:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Windows.Forms;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the html document
            var webGet = new HtmlWeb();
            var doc = webGet.Load("http://localhost");

            // Get all tables in the document
            HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");

            // Iterate all rows in the first table
            HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
            for (int i = 0; i < rows.Count; ++i)
            {
                // Iterate all columns in this row
                HtmlNodeCollection cols = rows[i].SelectNodes(".//td");
                for (int j = 0; j < cols.Count; ++j)
                {
                    // Get the value of the column and print it
                    string value = cols[j].InnerText;
                    Console.WriteLine(value);
                }
            }
        }
    }
}

<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td><img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td><img src="/partcode/manu/00721408" alt="00721408" /></td></tr>
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>
<tr><td>Last Modified</td><td></td><td>26 Jan 2011,  8:08 PM</td></tr>
<tr><td>Last Modified By</td><td></td><td>
Manu
</td></tr>
</table>

<p>

</body></html>

Solution

While fragile, something like this would work in your case. It basically just prints the alt attribute text of every image in a cell in addition to the cell's own text:

// Iterate all rows in the first table
HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
for (int i = 0; i < rows.Count; ++i)
{
    // Iterate all columns in this row.
    // SelectNodes returns null (not an empty collection) when a row has no <td>.
    HtmlNodeCollection cols = rows[i].SelectNodes(".//td");
    if (cols == null)
        continue;

    for (int j = 0; j < cols.Count; ++j)
    {
        // Print the alt text of any <img> directly inside this cell
        var images = cols[j].SelectNodes("img");
        if (images != null)
        {
            foreach (var image in images)
            {
                if (image.Attributes["alt"] != null)
                    Console.WriteLine(image.Attributes["alt"].Value);
            }
        }

        // Get the value of the column and print it
        string value = cols[j].InnerText;
        Console.WriteLine(value);
    }
}
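
As a possible next step toward the SQL table, it may help to collect each row as a label/value pair instead of printing cells one by one. The following is only a rough sketch under a few assumptions: the class and method names (TableScraper, ExtractPairs, CellValue) are made up for illustration, the table is selected by the class="data" attribute from the posted HTML, the first cell of a row is treated as the label and the last cell as the value, and an image's alt text is used only when the cell has no text of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TableScraper
{
    // Returns one (label, value) pair per row of the table with class="data".
    public static List<KeyValuePair<string, string>> ExtractPairs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pairs = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table[@class='data']//tr");
        if (rows == null)
            return pairs;

        foreach (var row in rows)
        {
            // Only the <td> elements that are direct children of this row.
            var cells = row.Elements("td").ToList();
            if (cells.Count < 2)
                continue;

            // Assumption: first cell is the label, last cell is the value.
            string label = CellValue(cells[0]);
            string value = CellValue(cells[cells.Count - 1]);
            pairs.Add(new KeyValuePair<string, string>(label, value));
        }

        return pairs;
    }

    // Returns the trimmed cell text, or the alt text of the first <img>
    // inside the cell when the cell contains no text.
    private static string CellValue(HtmlNode cell)
    {
        string text = HtmlEntity.DeEntitize(cell.InnerText).Trim();
        if (text.Length > 0)
            return text;

        var img = cell.SelectSingleNode(".//img[@alt]");
        return img == null
            ? string.Empty
            : img.GetAttributeValue("alt", string.Empty).Trim();
    }
}
```

With the document loaded as in the question, this could be called as TableScraper.ExtractPairs(doc.DocumentNode.OuterHtml). Each pair can then be bound as a parameter of a parameterized INSERT (for example with SqlCommand and Parameters.AddWithValue) rather than concatenated into the SQL string; the target table and column names would depend on your schema.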