Fast text search over logs

2023-09-11 05:36:14 Author: 稻草人

Here's the problem I'm having: I've got a set of logs that can grow fairly quickly. They're split into individual files every day, and the files can easily grow to a gig in size. To help keep the size down, entries older than 30 days or so are cleared out.

The problem comes when I want to search these files for a certain string. Right now, a Boyer-Moore search is infeasibly slow. I know that applications like dtSearch can provide a really fast search using indexing, but I'm not really sure how to implement that without the index taking up twice the space the logs already do.
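
For reference, what I'm doing now is essentially the whole-file pass below. This is just a minimal sketch using C++17's std::boyer_moore_searcher, not the exact production code, and the file handling is simplified.

```cpp
#include <algorithm>   // std::search
#include <cstddef>
#include <fstream>
#include <functional>  // std::boyer_moore_searcher (C++17)
#include <iterator>
#include <string>
#include <vector>

// Whole-file scan: read one day's log into memory and run Boyer-Moore
// over it. Works, but on multi-gigabyte logs this linear pass is the
// part that gets slow.
std::vector<std::size_t> scan_file(const std::string& path,
                                   const std::string& needle)
{
    std::ifstream in(path, std::ios::binary);
    std::string haystack((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

    std::boyer_moore_searcher searcher(needle.begin(), needle.end());
    std::vector<std::size_t> hits;

    auto it = haystack.cbegin();
    while (true) {
        it = std::search(it, haystack.cend(), searcher);
        if (it == haystack.cend())
            break;
        hits.push_back(static_cast<std::size_t>(it - haystack.cbegin()));
        ++it;  // keep scanning after this match
    }
    return hits;
}
```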

Are there any resources I can check out that can help? I'm really looking for a standard algorithm that'll explain what I should do to build an index and use it to search.

Edit: Grep won't work, as this search needs to be integrated into a cross-platform application. There's no way I'll be able to swing including any external program in it.

The way it works is that there's a web front end that has a log browser. This talks to a custom C++ web server backend. This server needs to search the logs in a reasonable amount of time. Currently searching through several gigs of logs takes ages.

Edit 2: Some of these suggestions are great, but I have to reiterate that I can't integrate another application; it's part of the contract. But to answer some questions, the data in the logs ranges from messages received in a health-care-specific format to messages relating to those. I'm looking to rely on an index because, while it may take up to a minute to rebuild the index, searching currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data IS discarded before it's even recorded: unless some debug logging options are turned on, more than half of the log messages are ignored.

The search basically goes like this: a user on the web form is presented with a list of the most recent messages (streamed from disk as they scroll, yay for ajax). Usually they'll want to search for messages with some information in them, maybe a patient ID or some string they've sent, so they enter that string into the search. The search gets sent asynchronously, and the custom web server linearly searches through the logs 1 MB at a time for results. This process can take a very long time when the logs get big, and it's what I'm trying to optimize.
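
To make that concrete, the chunked pass is roughly the sketch below. The 1 MB chunk size matches what I described above; the overlap handling (so a match spanning two chunks isn't missed) and the exact I/O are simplified stand-ins for the real server code, and the Boyer-Moore searcher from the earlier sketch could be swapped in for std::string::find on each window.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Linear scan in fixed-size chunks. The tail of each chunk (needle
// length minus one bytes) is carried over so a match that straddles a
// chunk boundary is still found.
std::vector<std::size_t> chunked_scan(const std::string& path,
                                      const std::string& needle,
                                      std::size_t chunk_size = 1 << 20)
{
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;

    std::ifstream in(path, std::ios::binary);
    std::string chunk(chunk_size, '\0');
    std::string window;            // carried-over tail + current chunk
    std::size_t window_start = 0;  // file offset of window[0]

    while (in) {
        in.read(&chunk[0], static_cast<std::streamsize>(chunk_size));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        window.append(chunk.data(), got);

        std::size_t pos = 0;
        while ((pos = window.find(needle, pos)) != std::string::npos) {
            hits.push_back(window_start + pos);
            ++pos;
        }

        // Keep only the bytes that could begin a boundary-spanning match.
        std::size_t keep = needle.size() - 1;
        if (window.size() > keep) {
            window_start += window.size() - keep;
            window.erase(0, window.size() - keep);
        }
    }
    return hits;
}
```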

Answer

Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of these algorithms once upon a time, and some of them are very sophisticated.

If you can identify the "words" in the text you want to index, just build a large hash table of those words that maps a hash of each word to its occurrences in each file. If users repeat the same search frequently, cache the search results. When a search is done, you can then check each location to confirm the search term actually falls there, rather than just a word with a matching hash.
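
A very rough sketch of that idea, assuming the logs can be tokenized on alphanumeric boundaries: map each word to the (file, byte offset) pairs where it occurs, then answer a single-word query by re-checking each candidate offset against the file. For simplicity this stores the words themselves rather than only their hashes, and it ignores multi-word queries, persisting the index, and the 30-day rollover, all of which you'd still need to handle.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// One posting: which file a word occurs in and at what byte offset.
struct Posting {
    std::string file;
    std::size_t offset;
};

using Index = std::unordered_map<std::string, std::vector<Posting>>;

// Tokenize a log file on non-alphanumeric boundaries and record where
// each word occurs. Indexing per file keeps the 30-day rollover simple:
// drop a file's postings when the file itself is purged.
void index_file(const std::string& path, Index& index)
{
    std::ifstream in(path, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    std::size_t i = 0;
    while (i < data.size()) {
        if (!std::isalnum(static_cast<unsigned char>(data[i]))) { ++i; continue; }
        std::size_t start = i;
        while (i < data.size() &&
               std::isalnum(static_cast<unsigned char>(data[i]))) ++i;
        index[data.substr(start, i - start)].push_back({path, start});
    }
}

// Answer a single-word query: look up the candidate postings, then
// confirm each one against the file so a stale index (or, if you store
// only hashes, a collision) never produces a false positive.
std::vector<Posting> lookup(const Index& index, const std::string& word)
{
    std::vector<Posting> confirmed;
    auto it = index.find(word);
    if (it == index.end() || word.empty()) return confirmed;

    for (const Posting& p : it->second) {
        std::ifstream in(p.file, std::ios::binary);
        in.seekg(static_cast<std::streamoff>(p.offset));
        std::string buf(word.size(), '\0');
        if (in.read(&buf[0], static_cast<std::streamsize>(buf.size())) && buf == word)
            confirmed.push_back(p);
    }
    return confirmed;
}
```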

Also, who really cares if the index is larger than the files themselves? If your system is really this big, with so much activity, is a few dozen gigs for an index the end of the world?