有什么方法来检测类似putjbtghguhjjjanika字符串？有什么、字符串、方法来、类似

2023-09-11 01:43:07 作者：最后你却说我是伤你的罪人

人在我的网站上搜索和其中的一些搜索是这些的：

People search in my website and some of these searches are these ones:

tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a

我的问题是，有没有什么办法来检测字符串，类似于那些上面？

My question is there any way to detect strings that similar to ones above ?

我想这是不可能探测到它们的100％，但任何解决方案将受到欢迎：）

I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)

编辑：我的意思是jibberish搜索。比如有人喜欢搜索asdqweasdqw，paykaprkg，iwepr wepr流串在我的搜索引擎，我想检测jibberish搜索。

edit: I mean the "jibberish searches". For example some people search strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect jibberish searches.

这并不重要，如果搜索结果将是0或其他任何东西。我不能用这个逻辑。

It doesn't matter if search result will be 0 or anything else. I can't use this logic.

一些新的品牌或产品将被忽略，如果我会考虑普通的话。

Some new brands or products will be ignored if I will consider "regular words".

感谢您的帮助

推荐答案

您可以从英文一堆文字塑造性格的典范角色转换。因此，例如，你发现它是有多么普遍要经过T（$ P $共同ptty的）一个H。在英语中，你期望一个Q后，你会得到一个U。如果你得到一个Q后面不是一个'U'其他的东西，这将非常低的概率发生，因此它应该是pretty的惊人。规范化计数你的表，让你有一个概率。那么对于一个查询，穿行于矩阵，计算你需要的过渡产品。然后由查询的长度规范化。当数低，你可能有乱码的查询（或东西用不同的语言）。

You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).

如果你有一堆查询日志，您可能首先使普通英语文本模型，然后重加权自己的查询在模型的训练阶段。

If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.

有关的背景，了解马尔可夫链的。

编辑，我用Python实现这个位置：

Edit, I implemented this here in Python:

https://github.com/rrenaud/Gibberish-Detector

和buggedcom在PHP重写了它：

and buggedcom rewrote it in PHP:

https://github.com/buggedcom/Gibberish-Detector-PHP

my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True

上一篇：哈希函数串函数

下一篇：快速稳定的排序算法实现在javascript算法、稳定、快速、javascript

相关推荐

精彩图集

精彩推荐

图片推荐