HTTP GET blocked on Craigslist

2023-09-11 09:18:52 Author: 穿透心灵的冰

I'm trying to do an HTTP GET on Craigslist (sfbay.craigslist.org). Here is my Ruby code, which is really simple:

require 'net/http'
result = Net::HTTP.get(URI.parse('http://sfbay.craigslist.org'))

I end up getting the error "This IP has been automatically blocked."

This only happens when I run the code from Amazon EC2 or on Heroku. When I run the same code on my own computer, I get the correct result. Does this have to do with Amazon EC2?

I'm wondering whether other people have had the same issue. What can I do to access Craigslist from EC2?

Recommended answer

I can confirm that Craigslist blocks the major Amazon EC2 IP ranges by IP address (not by user agent). The same request works from elsewhere, though I suspect any significant volume would get other IPs blocked too.
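If you want to check the "by IP, not by user agent" point from your own host, a minimal sketch along these lines may help. It assumes the block page contains the message quoted in the question; the helper names `blocked_body?` and `fetch_with_user_agent` and the `BLOCK_MARKER` constant are made up for illustration, and redirects are not followed.

```ruby
require 'net/http'
require 'uri'

# Assumption: the block page contains the message quoted in the question.
BLOCK_MARKER = 'This IP has been automatically blocked'

# Hypothetical helper: true if a response body looks like the block page.
def blocked_body?(body)
  body.to_s.include?(BLOCK_MARKER)
end

# Hypothetical helper: a single GET with an explicit User-Agent header.
# Redirects are not followed; the raw Net::HTTPResponse is returned.
def fetch_with_user_agent(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# Example usage (performs real network requests, so left commented out):
#
#   ['Ruby', 'Mozilla/5.0 (X11; Linux x86_64)'].each do |ua|
#     response = fetch_with_user_agent('http://sfbay.craigslist.org', ua)
#     puts "#{ua}: HTTP #{response.code}, blocked=#{blocked_body?(response.body)}"
#   end
```

If every user agent reports blocked from EC2 while the same script reports not blocked from a home connection, that is consistent with an IP-range block rather than a user-agent filter.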

You could step around it with Tor. More significantly, this Stack Overflow question discusses data sources used by Craigslist mashups.

I even tested an EC2 instance in Brazil, assuming they might not have blocked all the CIDRs. No bueno.