有什么方法来检测类似putjbtghguhjjjanika字符串?有什么、字符串、方法来、类似

2023-09-11 01:43:07 作者:最后你却说我是伤你的罪人

人在我的网站上搜索和其中的一些搜索是这些的:

People search in my website and some of these searches are these ones:

tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a

我的问题是,有没有什么办法来检测字符串,类似于那些上面?

My question is there any way to detect strings that similar to ones above ?

我想这是不可能探测到它们的100%,但任何解决方案将受到欢迎:)

I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)

编辑:我的意思是jibberish搜索。比如有人喜欢搜索asdqweasdqw,paykaprkg,iwepr wepr流串在我的搜索引擎,我想检测jibberish搜索。

edit: I mean the "jibberish searches". For example some people search strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect jibberish searches.

这并不重要,如果搜索结果将是0或其他任何东西。我不能用这个逻辑。

It doesn't matter if search result will be 0 or anything else. I can't use this logic.

一些新的品牌或产品将被忽略,如果我会考虑普通的话。

Some new brands or products will be ignored if I will consider "regular words".

感谢您的帮助

推荐答案

您可以从英文一堆文字塑造性格的典范角色转换。因此,例如,你发现它是有多么普遍要经过T($ P $共同ptty的)一个H。在英语中,你期望一个Q后,你会得到一个U。如果你得到一个Q后面不是一个'U'其他的东西,这将非常低的概率发生,因此它应该是pretty的惊人。规范化计数你的表,让你有一个概率。那么对于一个查询,穿行于矩阵,计算你需要的过渡产品。然后由查询的长度规范化。当数低,你可能有乱码的查询(或东西用不同的语言)。

You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).

如果你有一堆查询日志,您可能首先使普通英语文本模型,然后重加权自己的查询在模型的训练阶段。

If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.

有关的背景,了解马尔可夫链的。

编辑,我用Python实现这个位置:

Edit, I implemented this here in Python:

https://github.com/rrenaud/Gibberish-Detector

和buggedcom在PHP重写了它:

and buggedcom rewrote it in PHP:

https://github.com/buggedcom/Gibberish-Detector-PHP

my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True