What is a good algorithm for compacting records in a blocked file?

2023-09-11 22:43:42 Author: 豹纹小袜袜 ヾ


Suppose you have a large file made up of fixed-size blocks. Each block contains some number of variable-sized records. Each record must fit completely within a single block, so by definition a record is never larger than a full block. Over time, records are added to and deleted from these blocks as they come and go from this "database".


At some point, especially after many records have been added to the database and several removed, many of the blocks may end up only partially filled.


What is a good algorithm to shuffle the records around in this database to compact out unnecessary blocks at the end of the file by better filling up the partially filled blocks?

Requirements of the algorithm:

- The compaction must happen in place in the original file, without temporarily extending the file by more than a few blocks from its starting size.
- The algorithm should not unnecessarily disturb blocks that are already mostly full.
- Entire blocks must be read from or written to the file at one time, and writes should be assumed to be relatively expensive.
- If records are moved from one block to another, they must be added at their new location before being removed from their starting position, so that no records are lost if the operation is interrupted and the compaction "fails". (Assume that this temporary duplication of records can be detected at recovery.)
- The memory available for this operation is only on the order of perhaps several blocks, which is a very small percentage of the overall file size.
- Assume that records are on the order of 10 bytes to 1K bytes, with an average size of maybe 100 bytes. The fixed-size blocks are on the order of 4K or 8K, and the file is on the order of thousands of blocks.
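
The requirement to add records at their new location before removing them from the old one comes down to the order of the two block writes. The following is a minimal sketch of that ordering, not a definitive implementation. It assumes a block can be modelled as a dict from record id to bytes, and that `read_block` and `write_block` are hypothetical callables supplied by the caller that read and write whole blocks.

```python
def move_records(read_block, write_block, src, dst, record_ids):
    """Move records from block src to block dst, copying before deleting.

    read_block(i) and write_block(i, records) are hypothetical helpers
    that transfer one whole block at a time.
    """
    src_records = read_block(src)
    dst_records = read_block(dst)

    # Step 1: add the records to the destination block and persist it.
    for rid in record_ids:
        dst_records[rid] = src_records[rid]
    write_block(dst, dst_records)

    # Step 2: only after the destination is written, remove the records
    # from the source block. An interruption between the two writes
    # leaves duplicates, never a lost record.
    for rid in record_ids:
        del src_records[rid]
    write_block(src, src_records)
```

Each move costs two block writes here, so batching all records that travel between the same pair of blocks into one call keeps the number of expensive writes down.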

Recommended answer


This sounds like a variation of the bin packing problem, except that you already have an inferior allocation that you want to improve. So I suggest looking at variations of the approaches that are successful for the bin packing problem.


First of all, you probably want to parameterize your problem by defining what you consider "full enough" (where a block is full enough that you don't want to touch it), and what is "too empty" (where a block has so much empty space that it has to have more records added to it). Then, you can classify all the blocks as full enough, too empty, or partially full (those that are neither full enough nor too empty). You then redefine the problem as how to eliminate all the too empty blocks by creating as many full enough blocks as possible while minimising the number of partially full blocks.
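
The three-way classification can be sketched in a few lines. The function name `classify_block` and the threshold defaults of 90% and 50% are assumptions chosen for illustration; the right values depend on how aggressively you want to compact.

```python
from enum import Enum


class BlockClass(Enum):
    FULL_ENOUGH = "full enough"
    PARTIALLY_FULL = "partially full"
    TOO_EMPTY = "too empty"


def classify_block(used_bytes, block_size, full_enough=0.90, too_empty=0.50):
    """Classify one block by its fill ratio.

    full_enough and too_empty are hypothetical tuning knobs, expressed
    as fractions of the block size.
    """
    if not 0 <= used_bytes <= block_size:
        raise ValueError("used_bytes must be between 0 and block_size")
    fill = used_bytes / block_size
    if fill >= full_enough:
        return BlockClass.FULL_ENOUGH   # leave this block alone
    if fill < too_empty:
        return BlockClass.TOO_EMPTY     # its records should be moved out
    return BlockClass.PARTIALLY_FULL    # candidate destination for records
```

With 4K blocks, for example, `classify_block(3000, 4096)` falls between the two thresholds, so that block would be treated as a destination rather than a source.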


You'll also want to work out what's more important: getting the records into the fewest blocks possible, or packing them adequately but with the fewest blocks read and written.


My approach would be to make an initial pass over all the blocks and classify each one into one of the three classes defined above. For each block, you also want to keep track of the free space available in it, and for the too empty blocks, you'll want a list of all the records and their sizes. Then, starting with the largest records in the too empty blocks, move them into the partially full blocks. If you want to minimise reads and writes, move them into any of the blocks you currently have in memory. If you want to minimise wasted space, find the block with the least amount of empty space that will still hold the record, reading the block in if necessary. If no block will hold the record, create a new block. If a block in memory reaches the "full enough" threshold, write it out. Repeat until all the records in the too empty blocks have been placed.
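
As a rough illustration of the "minimise wasted space" variant, here is a sketch that only plans the moves and performs no file I/O. It assumes each block is represented as a list of record sizes, which the initial pass could collect. The name `plan_moves`, the threshold defaults, and the convention that a newly created block gets the next index past the end of the file are all assumptions made for this example.

```python
def plan_moves(blocks, block_size, full_enough=0.90, too_empty=0.50):
    """Plan record moves out of too-empty blocks, largest records first.

    blocks is a list of lists of record sizes (a hypothetical in-memory
    summary). Returns a list of (record_size, src_block, dst_block).
    """
    free = {}    # destination block index -> free bytes
    donors = []  # (record_size, source block index)
    for idx, records in enumerate(blocks):
        used = sum(records)
        fill = used / block_size
        if fill < too_empty:
            donors.extend((size, idx) for size in records)
        elif fill < full_enough:
            free[idx] = block_size - used
        # Blocks that are already full enough are left untouched.

    moves = []
    next_new = len(blocks)
    for size, src in sorted(donors, reverse=True):
        # Best fit: the destination with the least free space that
        # can still hold the record.
        candidates = [(space, dst) for dst, space in free.items()
                      if space >= size]
        if candidates:
            _, dst = min(candidates)
        else:
            # No block can hold the record: open a new one.
            dst = next_new
            next_new += 1
            free[dst] = block_size
        free[dst] -= size
        if block_size - free[dst] >= full_enough * block_size:
            del free[dst]  # full enough now: write it out and retire it
        moves.append((size, src, dst))
    return moves
```

Scanning every destination for the best fit is linear per record, which seems acceptable for a file of a few thousand blocks. The new-block fallback can grow the file, so it would need to be bounded to respect the question's limit of a few extra blocks.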


I've skipped over more than a few details, but this should give you some idea. Note that the bin packing problem is NP-hard, which means that for practical applications, you'll want to decide what's most important to you, so you can pick an approach that will give you an approximately best solution in reasonable time.