Optimizing an algorithm to compare the templates of two URLs

2023-09-10 22:45:58 Author: 玍銹的英雄夢

Edit: please read this again, as I have added some of my own work.

My task is to compare the templates of two URLs. My algorithm is ready, but it takes too long to produce the final answer.

I wrote my code in Java, using Jsoup and Selenium.

Here, "template" means the way a page presents its content.

Example:

A shoe page on any shopping website contains:

Images on the left.
Price and size on the right.
Reviews at the bottom.

If both URLs show a specific product, the algorithm returns "Both are from same templates". For example, this link and this link have the same template.

If one URL shows a product and the other URL shows a category, the algorithm returns "No match". For example, this link and this link are from different templates.

I think this algorithm needs some optimization, which is why I am posting this question on this forum.

My algorithm

1) Fetch and parse the two input URLs and build their DOM trees.
2) If a page contains UL or TABLE tags, remove them. I do this because the two pages may contain different numbers of items.
3) Count the number of tags in both pages: initial_tag1 and initial_tag2.
4) Remove tags that have the same position on the corresponding pages and the same Id, together with the subtree below them, if that subtree has fewer than 10 nodes.
5) Remove tags that have the same position on the corresponding pages and the same Class name, together with the subtree below them, if that subtree has fewer than 10 nodes.
6) Remove tags that have no Id and no Class name, together with the subtree below them, if that subtree has fewer than 10 nodes.
7) Steps 4, 5 and 6 have (N*N) complexity, where N is the number of tags. [In this way, the DOM tree shrinks at every step.]
8) When the recursion finishes, count the remaining tags: final_tag1 and final_tag2.
9) If final_tag1 and final_tag2 are less than initial_tag1*(0.2) and initial_tag2*(0.2), I say that the two URLs match; otherwise they do not.
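The final check in the last step can be written as a small predicate. This is only a minimal sketch: the class and method names are hypothetical, and it assumes the comparison is meant pairwise (final_tag1 against initial_tag1, final_tag2 against initial_tag2), which is my reading of the description rather than something the post states.

```java
// Hypothetical helper; names are illustrative and not taken from the original code.
class TemplateDecision {
    // Fraction of the initial tags that may remain for a page to count as matched.
    static final double THRESHOLD = 0.2;

    // True when both pages shrank below 20% of their own initial tag counts.
    // Assumption: each final count is compared with its own page's initial count.
    static boolean sameTemplate(int initialTag1, int finalTag1,
                                int initialTag2, int finalTag2) {
        return finalTag1 < initialTag1 * THRESHOLD
            && finalTag2 < initialTag2 * THRESHOLD;
    }

    public static void main(String[] args) {
        boolean match = sameTemplate(500, 40, 480, 60);
        System.out.println(match ? "Both are from same templates" : "No match");
    }
}
```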

I have thought a lot about this algorithm, and I found that removing a node from the DOM tree is a pretty slow process. This may be the culprit that slows the algorithm down.

I discussed this with some geeks, and they said: use a score for every tag instead of removing tags, add the scores up, and at the end return (score I got)/(accumulatedPoints) or something similar; on that basis, decide whether the two URLs are similar or not.
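One possible reading of that suggestion, sketched in Java: leave both trees untouched, walk the tag positions, add a point for each position whose signature agrees on both pages, and divide by the number of positions compared. Everything here is an assumption made for illustration: the TemplateScore class, the use of a position string as map key, and the use of an id or class string as the signature. The cut-off for calling two pages similar is not given in the post and would have to be tuned.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each page is a map from a tag's position key
// (for example an XPath-like string) to a non-null signature such as its id or class.
class TemplateScore {
    // Returns (matching positions) / (all positions seen on either page).
    static double similarity(Map<String, String> page1, Map<String, String> page2) {
        int accumulatedPoints = 0; // every tag position that is compared
        int score = 0;             // positions whose signatures agree
        for (Map.Entry<String, String> e : page1.entrySet()) {
            accumulatedPoints++;
            if (e.getValue().equals(page2.get(e.getKey()))) {
                score++;
            }
        }
        // Positions that exist only on page 2 count against the match.
        for (String key : page2.keySet()) {
            if (!page1.containsKey(key)) {
                accumulatedPoints++;
            }
        }
        return accumulatedPoints == 0 ? 0.0 : (double) score / accumulatedPoints;
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("/html/body/div[1]", "#gallery");
        a.put("/html/body/div[2]", "#price");
        Map<String, String> b = new HashMap<>();
        b.put("/html/body/div[1]", "#gallery");
        b.put("/html/body/div[2]", "#category-list");
        System.out.println(similarity(a, b));
    }
}
```

Nothing is removed from either tree, so each lookup is a hash probe and the whole comparison is linear in the number of tags.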

But I didn't understand this. So can you explain what these geeks meant, or suggest any other optimized algorithm that solves this problem efficiently?

Thanks in advance. Looking forward to your kind response.

Recommended answer

To improve the complexity of your algorithm, assuming you are using Jsoup, you must adapt your data structures to your algorithm.

4) What do you mean by the position of a tag? The XPath of the tag? If so, precompute this value once for each tag, which is O(n) overall, and store it in each node. If required, you can also store it in a HashMap to retrieve it in O(1).
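A minimal sketch of that precomputation, assuming "position" means an XPath-like path. To keep the example self-contained it uses a hypothetical Node class as a stand-in for a parsed DOM element rather than Jsoup's own types; the PositionIndex name and the path format are also illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PositionIndex {
    // Hypothetical stand-in for a parsed DOM element.
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        Node add(Node child) { children.add(child); return this; }
    }

    // Visits every node once and records it under its XPath-like position.
    static Map<String, Node> index(Node root) {
        Map<String, Node> byPosition = new HashMap<>();
        visit(root, "/" + root.name + "[1]", byPosition);
        return byPosition;
    }

    private static void visit(Node node, String path, Map<String, Node> byPosition) {
        byPosition.put(path, node);
        Map<String, Integer> seen = new HashMap<>(); // per-name sibling counters
        for (Node child : node.children) {
            int n = seen.merge(child.name, 1, Integer::sum);
            visit(child, path + "/" + child.name + "[" + n + "]", byPosition);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("html")
            .add(new Node("body").add(new Node("div")).add(new Node("div")));
        System.out.println(index(root).keySet());
    }
}
```

With one such map per page, checking whether a tag from page 1 has a counterpart at the same position on page 2 becomes a single lookup instead of a scan over all tags.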

5) Index your tags by class name using a MultiMap. You will save a lot of computation.

6) Index the tags that have no Id and no class name.

All of these precomputations can be performed in a single traversal of the tree, so they are O(n).
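As a rough illustration, the indexes for steps 5 and 6 (plus an Id index for step 4) can be filled in one pass. This sketch again uses a hypothetical Node class instead of Jsoup's types, and a plain Map<String, List<Node>> as a stand-in for a MultiMap, since the JDK itself does not ship one. It also treats the class attribute as a single string, which is a simplification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TagIndex {
    // Hypothetical stand-in for a parsed DOM element; id and className may be empty.
    static class Node {
        final String name, id, className;
        final List<Node> children = new ArrayList<>();
        Node(String name, String id, String className) {
            this.name = name;
            this.id = id;
            this.className = className;
        }
    }

    final Map<String, List<Node>> byId = new HashMap<>();
    final Map<String, List<Node>> byClass = new HashMap<>(); // plain-Java multimap
    final List<Node> anonymous = new ArrayList<>();          // no id and no class

    // Builds all three indexes in a single depth-first traversal.
    static TagIndex build(Node root) {
        TagIndex index = new TagIndex();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (!node.id.isEmpty()) {
                index.byId.computeIfAbsent(node.id, k -> new ArrayList<>()).add(node);
            }
            if (!node.className.isEmpty()) {
                index.byClass.computeIfAbsent(node.className, k -> new ArrayList<>()).add(node);
            }
            if (node.id.isEmpty() && node.className.isEmpty()) {
                index.anonymous.add(node);
            }
            for (Node child : node.children) {
                stack.push(child);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Node root = new Node("div", "", "");
        root.children.add(new Node("span", "", "price"));
        root.children.add(new Node("img", "hero", ""));
        TagIndex index = build(root);
        System.out.println(index.byClass.keySet() + " " + index.byId.keySet()
            + " anonymous=" + index.anonymous.size());
    }
}
```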

Generally, if you want to reduce computation, you have to store more data in memory. Since the DOM of a page is a very small amount of data, this is no problem in your case.