Scraping an AJAX web page with Python and/or Scrapy

What I want to do is scrape petition data - name, city, state, date, signature number - from one or more petitions at petitions.whitehouse.gov.

I assume at this point that Python is the way to go - probably the Scrapy library - along with some functions to deal with the AJAX aspects of the site. The reason for this scraper is that the petition data is not available to the public.

I am a freelance tech journalist and I want to be able to dump each petition's data into a CSV file in order to analyze the number of people from each state who sign a state's petition, and with data from multiple petitions, find the number of people who sign multiple petitions, etc., and then make some conclusions about the political viability of the petition process and the data itself.
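
For the analysis step, once a petition is dumped to CSV, a per-state tally is only a few lines of Python. A minimal sketch, assuming a hypothetical lowercase 'state' column header; adjust it to whatever headers the actual dump uses:

import csv
from collections import Counter

# Minimal sketch: tally signatures per state from a petition CSV dump.
# The column name 'state' is a hypothetical header, not confirmed.
def signatures_by_state(csv_path):
    with open(csv_path, newline='') as f:
        reader = csv.DictReader(f)
        return Counter(row['state'] for row in reader if row.get('state'))

# Example: print the ten states with the most signatures.
# print(signatures_by_state('petition.csv').most_common(10))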

The petition functions at petitions.whitehouse.gov run as a Drupal module, and the White House developers responded to my issue on GitHub (https://github.com/WhiteHouse/petition/issues/44) that they are working on an API to allow access to petition data from the module. But there is no release date for that API, and it doesn't solve the problem of the petition data currently on petitions.whitehouse.gov.

I've emailed the White House and the White House developers, stating that I am a freelance journalist and asking for some way to access the data. The White House Office of Digital Strategy told me that "Unfortunately, we don't have the means to provide data exports at this time, but we are working to open up the data going forward via the API." There is an "Open Data" initiative at the White House, but apparently petition data is not covered.

Privacy and TOS: there is little expectation of privacy in signing a petition, and no clear TOS addresses web scraping of this data.

What has been done: some faculty at UNC have written (what I assume is) a Python script to scrape the data (http://www.unc.edu/~ncaren/secessionists/), but they don't want to release the script to me, saying they are still working on it. They did send me a CSV data dump of one petition I am particularly interested in.

What I've done: I've set up a GitHub project for this (https://github.com/markratledge/whitehousescraper), because I want any petition data scraper to be useful for everyone - petitioners themselves, journalists, etc. - who wants to be able to get this data.

I have little to no experience with Python and shell scripts, and what I want to do is obviously beyond my experience at this point.

I ran a GUI script to send a "spacebar" to the web browser every five seconds or so, and in that way scraped ~10,000 signatures by cutting and pasting the browser text into a text editor. From there, I could process the text with grep and awk into a CSV format. This, of course, doesn't work too well; Chrome bogged down with the size of the page, and it took hours to get that many signatures.

What I've found so far: from what I can gather from other SO questions and answers, it looks like Python and Scrapy (http://scrapy.org) are the way to go to avoid problems with browsers. But the page uses an AJAX function to load the next set of signatures. It appears to be a "static" AJAX request, because the URL doesn't change.

In Firebug, the JSON request URLs appear to have a random string appended to them, with a page number just before it. Does this say anything about what needs to be done? Does a script need to emulate these and send them to the web server?

Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/2/50b32771ee140f072e000001
Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/3/50b1040f6ce61c837e000006
Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/4/50afb3d7c988d47504000004
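
Given that only the URL changes between requests, one quick experiment is to fetch one of those URLs directly, outside a browser, and inspect what comes back. A minimal sketch with the requests library, under the assumption that the endpoint returns JSON (the data.markup usage in the page's JS below suggests a markup field):

import requests

# IDs copied from the Firebug captures above.
petition_id = '50ab2aa8eab72abc4a000020'
page = 2
last_signature = '50b32771ee140f072e000001'

url = ('https://petitions.whitehouse.gov/signatures/more/'
       f'{petition_id}/{page}/{last_signature}')

resp = requests.get(url)
data = resp.json()   # assumption: the endpoint returns JSON
print(data.keys())   # look for a 'markup' field holding signature HTML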

This is the JS function that loads the signatures on the page:

(function ($) {
Drupal.behaviors.morePetitions = {
  attach: function(context) {
    $('.petition-list .show-more-petitions-bar').unbind();
    $(".petition-list .show-more-petitions-bar").bind('click',
      function () {
        $('.show-more-petitions-bar').addClass('display-none');
        $('.loading-more-petitions-bar').removeClass('display-none');

        var petition_sort = retrieveSort();
        var petition_cols = retrieveCols();
        var petition_issues = retrieveIssues();
        var petition_search = retrieveSearch();
        var petition_page = parseInt($('#page-num').html());

        var url = "/petitions/more/"+petition_sort+"/"+(petition_page + 1)+"/"+petition_cols+"/"+petition_issues+"/"+petition_search+"/";
        var params = {};
        $.getJSON(url, params, function(data) {
          $('#petition-bars').remove();
          $('.loading-more-petitions-bar').addClass('display-none');
          $('.show-more-petitions-bar').removeClass('display-none');
          $(".petition-list .petitions").append(data.markup).show();

          if (typeof wh_petition_adjustHeight == 'function') {
            wh_petition_adjustHeight();
          }

          Drupal.attachBehaviors('.petition-list .show-more-petitions-bar');
          if (typeof wh_petition_page_update_links == 'function') {
            wh_petition_page_update_links();
          }
        });

        return false;
      }
    );
  }
};
})(jQuery);

and it is fired when this element is revealed by scrolling to the bottom of the browser window:

<a href="/petition/.../l76dWhwN?page=2&amp;last=50b3d98e7043012b24000011" class="load-next no-follow active" rel="509ec31cadfd958d58000005">Load Next 20 Signatures</a>
<div id="last-signature-id" class="display-none">50b3d98e7043012b24000011</div>

So, what's the best way to do this? Where do I go with Scrapy? Or is there another Python library better suited for this?

Feel free to comment, point me in a direction with code snippets or other SO questions/answers, or contribute on GitHub. What I'm trying to do is obviously beyond my experience at this point.

Recommended Answer

The 'random link' looks like it has the form:

https://petitions.whitehouse.gov/signatures/more/petitionid/pagenum/lastpetition, where petitionid is static for a single petition, pagenum increments each time, and lastpetition is returned by each request.

My usual approach would be to use the requests library to emulate a session for cookies and then work out what requests the browser is making.

import requests

# A session persists cookies across requests, like a browser would.
s = requests.Session()

# Toy example: httpbin echoes back the query parameters we send.
url = 'http://httpbin.org/get'
params = {'cat': 'Persian',
          'age': 3,
          'name': 'Furball'}
r = s.get(url, params=params)
print(r.json())
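
Applying the same idea to the signature endpoint, a pagination loop might look like the sketch below. This is an assumption-heavy outline, not a tested implementation: it presumes JSON responses with a markup field (per the page's JS in the question) and that each new last-signature ID can be pulled out of the returned HTML as a 24-character hex string:

import re
import requests

s = requests.Session()
base = 'https://petitions.whitehouse.gov/signatures/more'

petition_id = '50ab2aa8eab72abc4a000020'  # from the captures in the question
last_id = '50b32771ee140f072e000001'      # hypothetical starting cursor

for page in range(2, 10):
    resp = s.get(f'{base}/{petition_id}/{page}/{last_id}')
    markup = resp.json().get('markup', '')  # assumed field, per the page JS

    # Assumption: signature IDs appear in the markup as 24-char hex strings;
    # use the last one as the cursor for the next request.
    ids = re.findall(r'[0-9a-f]{24}', markup)
    if not ids:
        break
    last_id = ids[-1]
    # ...parse names/cities/states out of `markup` and append CSV rows here...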

I'd pay particular attention to the following link:

<a href="/petition/shut-down-tar-sands-project-utah-it-begins-and-reject-keystone-xl-pipeline/H1MQJGMW?page=2&amp;last=50b5a1f9ee140f227a00000b" class="load-next no-follow active" rel="50ae9207eab72aed25000003">Load Next 20 Signatures</a>