StackSync

An open-source scalable personal cloud

Chunking

In the early versions of StackSync, the client implemented a static chunking algorithm, which works well for a system with a small number of users. But as we want to create a truly scalable system, we were forced to use a dynamic chunking.

A content-based dynamic chunking can significantly reduce the amount of space used by users in the storage service. Unlike the static chunking, when the user modifies a file, the chunking algorithm detects what parts of the file have not been modified and do not transfer them again. On the other hand, the computational cost of applying this algorithm is more
intensive. But as it is not a time-critical operation and the storage savings are significant, we implemented it.

The content-based chunking to be implemented was the TTTD (Two Thresholds Two Divisors). This algorithm must specify the minimum and maximum sizes of chunks (two limits). From the first byte of the file, the algorithm reads byte by byte until it reaches the minimum size of the chunk. Then, it computes the hash of the chunk and calculates the module of the hash with a main and a secondary value (two dividers). If the result is equal to the first divisor, it means that the chunk must begin there.

chunking

In the following figure we can see the worst case scenario for a static chunking. Add a byte to the beginning of the file will cause all chunks to be different from the previous version. In contrast to the dynamic chunking, the only chunk that would be affected would be the first one, once you find the intersection point with the divisors, all other chunks will be the same and will not upload them.

Chunking