
2023-09-11 06:19:28 Author: 男优报刊


I want to count how many times some given words or phrases appear in a text, but I want to use a string similarity algorithm rather than exact matching.


Every word or expression has a value, so I will set the relevance of the text according to the number of words found, and so on.


I guess that the Java String class cannot offer this. Will I need to iterate over the whole text for each word or expression I want to find?



Example: find texts that contain "videogame", "i have a videogame" and the like, also matching similar expressions. I guess that if I iterate over the text for each word or expression I need to evaluate, I will not find the similar words, and it will be slower.



The inverted index that Denniss mentioned is what you are looking for. You'll need to define your Document very well if you want a powerful engine.


For phrase matches, your Document should store the position of each word (the key of the Map) within the Document. Once you have found all the words you were looking for, you can tell whether these words were adjacent in the original document.


doc1: "Hello World"
doc2: "Hello Beautiful World"

inverted index {
  "Beautiful": [(doc2, 2)],
  "Hello": [(doc1, 1), (doc2, 1)],
  "World": [(doc1, 2), (doc2, 3)]
}

query: "Hello World"


Both documents have the words "Hello" and "World", but doc1 has them together (positions 1 and 2) and doc2 doesn't (positions 1 and 3).
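
As a rough illustration, the positional index and phrase check described above could be sketched in Java like this. The class name PositionalIndex and its methods add and phraseQuery are made up for this example, and words are split on whitespace only, with no lowercasing or stemming; a real engine would do much more.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical positional inverted index: word -> (document id -> positions). */
class PositionalIndex {
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Splits the text on whitespace and records each word's 1-based position. */
    public void add(String docId, String text) {
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            postings.computeIfAbsent(words[i], k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(i + 1);
        }
    }

    /** Returns the ids of documents containing the phrase's words at consecutive positions. */
    public Set<String> phraseQuery(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        Set<String> result = new TreeSet<>();
        Map<String, List<Integer>> first = postings.getOrDefault(words[0], Map.of());
        for (Map.Entry<String, List<Integer>> entry : first.entrySet()) {
            String docId = entry.getKey();
            // Try every occurrence of the first word as a possible phrase start.
            for (int start : entry.getValue()) {
                if (matchesFrom(docId, words, start)) {
                    result.add(docId);
                    break;
                }
            }
        }
        return result;
    }

    /** Checks that word k of the phrase occurs at position start + k in the document. */
    private boolean matchesFrom(String docId, String[] words, int start) {
        for (int offset = 1; offset < words.length; offset++) {
            List<Integer> positions = postings
                    .getOrDefault(words[offset], Map.of())
                    .getOrDefault(docId, List.of());
            if (!positions.contains(start + offset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.add("doc1", "Hello World");
        index.add("doc2", "Hello Beautiful World");
        System.out.println(index.phraseQuery("Hello World")); // prints [doc1]
    }
}
```

Only doc1 is returned because in doc2 "World" sits at position 3, not at the position right after "Hello".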


If you want to find similar words, you'll need a new structure. First, you need to define what is similar. Levenshtein distance is what you need for that.
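
To make "similar" concrete, here is a minimal sketch of the plain dynamic-programming Levenshtein distance, plus a brute-force counter that scans a text word by word. The names FuzzyCounter, levenshtein and countSimilar, as well as the maxDistance threshold, are assumptions for this example. This is not an automaton and it does not use an index, so it only suits small inputs.

```java
import java.util.Locale;

/** Sketch: count the words of a text that lie within a given edit distance of a target word. */
class FuzzyCounter {

    /** Dynamic-programming Levenshtein distance, keeping only two rows of the table. */
    public static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning the empty string into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Counts words (split on non-word characters, case-insensitive) within maxDistance of target. */
    public static int countSimilar(String text, String target, int maxDistance) {
        String wanted = target.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty() && levenshtein(word, wanted) <= maxDistance) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(countSimilar("I have a videogame and two videogames", "videogame", 1)); // prints 2
    }
}
```

Each call compares the target against every word in the text, which is exactly the per-word iteration the question worries about.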


To implement it, you'll need a whole new structure, an automaton: the Levenshtein automaton.


Full-text search is a huge area. Implementing a search engine is hard and many libraries and applications already do it.


(I work for Indextank.com, a realtime full-text search engine. If you need a search engine running in a couple of minutes, try us out.)