Extracting JavaScript variable values via web scraping

2023-09-11 01:05:13 Author: 一句再见再也不见。

For a company project, I need to create a web scraping application with PHP and JavaScript (including jQuery) that will extract specific data from each page of our clients' websites. The scraping app needs to get two types of data for each page: 1) determine whether certain HTML elements with specific IDs are present, and 2) extract the value of a specific JavaScript variable. The JS variable name is the same on each page, but the value is usually different.

I believe I know how I can get the first data requirement: using the PHP file_get_contents() function to get each page's HTML and then use JavaScript/jQuery to parse that HTML and search for elements with specific IDs. However, I'm not sure how to get the 2nd piece of data - the JavaScript variable values. The JavaScript variable isn't even found within each page's HTML; instead, it is found in an external JavaScript file that is linked to the page. And even if the JavaScript were embedded in the page's HTML, I know that file_get_contents() would only extract the JavaScript code (and other HTML) and not any variable values.
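
For the first requirement, the ID check on already-fetched HTML could be sketched roughly as below. This assumes the page source is available as a string; `hasElementWithId` and `escapeRegExp` are hypothetical helper names, and the regular expression is only a rough stand-in for a real HTML parser (it handles quoted `id` attributes only).

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: returns true if some opening tag in `html` carries
// id="<id>" or id='<id>'. Unquoted attribute values are not handled.
function hasElementWithId(html, id) {
  const pattern = new RegExp(
    '<[a-zA-Z][^>]*\\s[iI][dD]\\s*=\\s*(["\'])' + escapeRegExp(id) + '\\1'
  );
  return pattern.test(html);
}
```

Note that this only inspects the static source, so it says nothing about elements that scripts add later.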

Can anyone suggest a good approach to getting this variable value for each page of a given website?

Edit: Just to clarify, I need the values of the JavaScript variables after the JavaScript code has been run. Is such a thing even possible?

Recommended answer

Presumably this is impossible, because it seems so simple, but if it's your own .js file you're trying to detect, why not just have that .js write something to the page that a scrape can detect?

Use the JS to populate a tag like this somewhere (via element.innerHTML, presumably):

<span><!--Important js thing has been activated!--></span>
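
A minimal sketch of that idea, carrying the variable's value inside the comment. The id `js-marker` and all function names here are hypothetical, not from the original post. It also assumes the scraper reads the page after scripts have run (for example through a headless browser), since a plain fetch such as file_get_contents() returns only the unexecuted source.

```javascript
// Hypothetical id of an empty <span> that already exists in the page.
const MARKER_ID = 'js-marker';

// Build the comment that carries the value. "--" is not allowed inside an
// HTML comment, so percent-encode the value, including every "-".
function buildMarkerComment(value) {
  const safe = encodeURIComponent(String(value)).replace(/-/g, '%2D');
  return '<!--' + safe + '-->';
}

// Page side: the external .js file calls this once the variable is set.
// `doc` is the browser's `document`; it is a parameter so the function
// can also be exercised outside a browser.
function publishValue(doc, value) {
  const el = doc.getElementById(MARKER_ID);
  if (el) {
    el.innerHTML = buildMarkerComment(value);
  }
}

// Scraper side: recover the value from the rendered HTML, or null if the
// marker was never filled in.
function extractValue(renderedHtml) {
  const pattern = new RegExp(
    '<span id="' + MARKER_ID + '"><!--(.*?)--></span>'
  );
  const match = pattern.exec(renderedHtml);
  return match ? decodeURIComponent(match[1]) : null;
}
```

In the page this would be called as something like `publishValue(document, theVariable)`, where `theVariable` stands for whatever the real variable is named.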

Edit: alternatively, maybe use document.write, if the script needs to be detectable on load.
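
The document.write variant might look something like the sketch below. The function name `writeMarker` and the id `js-marker` are hypothetical, and the `doc` parameter stands in for the browser's `document` so the function can be tried outside a browser.

```javascript
// Hypothetical helper: emits the marker while the page is still being parsed.
// document.write only appends to the page during loading; calling it after
// the page has finished loading replaces the whole document.
function writeMarker(doc, value) {
  // Percent-encode the value, including "-", so it cannot end the comment.
  const safe = encodeURIComponent(String(value)).replace(/-/g, '%2D');
  doc.write('<span id="js-marker"><!--' + safe + '--></span>');
}

// In the external .js file this would run at top level, for example:
// writeMarker(document, theVariable);  // theVariable is a placeholder name
```

As with the innerHTML version, the marker only exists in the DOM after the script has executed, so the scraper still has to read the rendered page.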