cromfs - Copyright (C) 1992,2011 Bisqwit (http://iki.fi/bisqwit/) License: GPL3 Homepage: http://bisqwit.iki.fi/source/cromfs.html How mkcromfs remembers which blocks it has encoded. -------------------------------- One of the greatest powers of mkcromfs is that it remembers all blocks it has encoded, and if it meets an identical block again later, in the same file or in another file, it can encode it as a reference to the previous data instead of adding it into the stream that is compressed by LZMA. It does this by the means of hashing: typedef std::multimap block_index_t; This STL container is an associative container where the key is a 32-bit hash value and the value is a block_info struct. block_info is a struct that contains two integers that tell which fblock the data was written in and at which offset. When mkcromfs wants to encode a new block, it calculates a hash of the block, and checks for it in this map. block_index_t::iterator i = block_index.find(hash); If it is found, it loads the data from the fblock indicated in the block_info struct, and checks if it's identical to the data that is being encoded. If it is, it will reuse that struct. If it is not found, or the data was not identical, it will insert a new key-value pair to the map and add the block data to the stream. block_index.insert(std::make_pair(hash, blk)); When adding the block data to the stream, it does a search to see if the data already exists in the particular fblock it's appending, and will overlap the block with the tail of the fblock if it's able to do so. -------------------------------- I chose to document this technique, because it could be useful in other compression programs as well. block_info could be implemented as containing a filename and the starting offset. This way, no lookups within the encoded data would be necessary. -------------------------------- Note: The actual code in mkcromfs is different and more complex for memory efficiency reasons.