How do I count words and expressions in a text?

2023-09-11 06:19:28 Author: 男优报刊

I want to count how many times certain words or phrases appear in a text, but I want to use a string similarity algorithm to do it.

Every word or expression has a value, so I will set the relevance of the text according to the number of words found, and so on.

I guess the String class in Java cannot offer this. Will I need to iterate over the whole text for each word or expression I want to find?

Is there a text processing library for this?

Example: find texts that contain "videogame", "i have a videogame" and things like that, also evaluating similar expressions. I guess that if I iterate over the text once for each word or expression I need to evaluate, I will not find the similar words, and it will also be slower.

Recommended answer

The inverted index that Denniss mentioned is what you are looking for. You'll need to define your Document very well if you want a powerful engine.

For phrase matches, your Document should store the position of each word (the key of the Map) within that Document. Once you have found all the words you were looking for, you can tell whether these words appeared together in the original document.

For example:

doc1: "Hello World"
doc2: "Hello Beautiful World"

inverted index {
  "Beautiful": [(doc2, 2)],
  "Hello": [(doc1, 1), (doc2, 1)],
  "World": [(doc1, 2), (doc2, 3)],
}

query: "Hello World"

Both documents contain the words "Hello" and "World", but doc1 has them next to each other (positions 1 and 2) and doc2 does not (positions 1 and 3).
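
The example above can be sketched in Java roughly as follows. This is only a minimal illustration, not a production index: the class name `PositionalIndex` and its methods are made up for this sketch, tokenization is a plain lowercase split on whitespace, and positions are 1-based to match the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class: a tiny positional inverted index for illustration only.
class PositionalIndex {
    // term -> (document id -> 1-based positions of the term in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    // Assumption: lowercase + whitespace split; a real tokenizer would do more.
    private static String[] tokenize(String text) {
        return text.trim().toLowerCase().split("\\s+");
    }

    // Records every token of the document together with its position.
    void add(String docId, String text) {
        String[] tokens = tokenize(text);
        for (int i = 0; i < tokens.length; i++) {
            postings.computeIfAbsent(tokens[i], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(i + 1);
        }
    }

    // Returns the ids of documents that contain the words at consecutive positions.
    Set<String> phraseQuery(String phrase) {
        String[] words = tokenize(phrase);
        Set<String> hits = new TreeSet<>();
        Map<String, List<Integer>> first = postings.get(words[0]);
        if (first == null) {
            return hits;
        }
        for (Map.Entry<String, List<Integer>> entry : first.entrySet()) {
            for (int start : entry.getValue()) {
                if (matchesFrom(entry.getKey(), words, start)) {
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    // Checks that word k of the phrase occurs at position start + k in the document.
    private boolean matchesFrom(String docId, String[] words, int start) {
        for (int k = 1; k < words.length; k++) {
            Map<String, List<Integer>> byDoc = postings.get(words[k]);
            if (byDoc == null) {
                return false;
            }
            List<Integer> positions = byDoc.get(docId);
            if (positions == null || !positions.contains(start + k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.add("doc1", "Hello World");
        index.add("doc2", "Hello Beautiful World");
        System.out.println(index.phraseQuery("Hello World")); // prints [doc1]
    }
}
```

Each query word is looked up once in the map, so the text itself is never rescanned per word, which is the cost the question was worried about.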

If you want to find similar words, you'll need a new structure. First, you need to define what "similar" means. Levenshtein distance is what you need for that.
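
As a rough illustration of that definition, here is a minimal sketch of the classic dynamic-programming computation of Levenshtein distance (the number of single-character insertions, deletions, and substitutions needed to turn one string into the other). The class and method names are made up for this sketch.

```java
// Hypothetical helper class for illustration.
class Levenshtein {
    // Returns the edit distance between a and b, keeping only two rows of the table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("videogame", "videogames")); // prints 1
        System.out.println(distance("kitten", "sitting"));       // prints 3
    }
}
```

Two words can then be treated as "similar" when their distance is at or below some small threshold, for example 1 or 2; the right threshold is a judgment call that depends on the data.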

To implement it, you'll need a whole new structure such as an automaton: a Levenshtein automaton.
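
Building a real Levenshtein automaton is more involved than fits here. As a much simpler stand-in, the sketch below scans the index vocabulary and keeps every term within a given edit distance of the query word. It is a brute-force linear scan, not an automaton; it returns the same set of terms that an automaton for that word and distance would accept, only more slowly on a large vocabulary. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical class: brute-force fuzzy term lookup, NOT a Levenshtein automaton.
class FuzzyTermLookup {
    // Standard two-row dynamic-programming edit distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the vocabulary terms within maxDistance edits of the query, sorted.
    static List<String> similarTerms(Collection<String> vocabulary, String query, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (String term : vocabulary) {
            // The length difference is a lower bound on the edit distance, so skip early.
            if (Math.abs(term.length() - query.length()) > maxDistance) {
                continue;
            }
            if (distance(term, query) <= maxDistance) {
                result.add(term);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> vocabulary =
                Arrays.asList("beautiful", "hello", "world", "videogame", "videogames");
        System.out.println(similarTerms(vocabulary, "beautifull", 1)); // prints [beautiful]
        System.out.println(similarTerms(vocabulary, "videogame", 1));  // prints [videogame, videogames]
    }
}
```

The matching terms can then be looked up in the inverted index like any exact query word.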

Full-text search is a huge area. Implementing a search engine is hard, and many libraries and applications already do it.

(I work for Indextank.com, a real-time full-text search engine. If you need a search engine running in a couple of minutes, try us out.)