An open-source scalable personal cloud

File processing

The desktop client must integrate with the operating system so that, whenever something changes in the synchronized folder, the application receives a notification and can process the event. Once the application identifies which files have been created or modified, it proceeds to store them in StackSync.

First, it will extract all the metadata it needs about the file (e.g. file name, size, modification date, etc.). Next, it will store this metadata in a database that keeps the information about the different versions of each file. Finally, the file will be split into smaller pieces called chunks. Each chunk is treated independently and is identified by its hash value. To reconstruct the original file, the chunks must be joined in the correct order.
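The chunking step can be sketched as follows. This is a minimal illustration and not StackSync's actual implementation: it assumes fixed-size chunks and SHA-256 as the hash function, neither of which is specified in the text, and the names chunk_bytes, reconstruct and CHUNK_SIZE are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

# Assumed chunk size; the text only says chunks are between 32KB and 1MB.
CHUNK_SIZE = 512 * 1024


def chunk_bytes(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[Tuple[str, bytes]]:
    """Split data into fixed-size chunks, each identified by its SHA-256 hex digest."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        chunks.append((hashlib.sha256(piece).hexdigest(), piece))
    return chunks


def reconstruct(chunk_ids: List[str], store: Dict[str, bytes]) -> bytes:
    """Rebuild the file by joining chunks in the order given by chunk_ids."""
    return b"".join(store[chunk_id] for chunk_id in chunk_ids)
```

The ordered list of chunk identifiers is what has to be kept as metadata: the chunk contents can live anywhere, but without the order the file cannot be rebuilt. Note that with fixed-size chunks an insertion near the start of a file shifts every later chunk boundary, so a real client might prefer a different splitting strategy.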

The chunking provides StackSync with the following advantages:

  1. Optimize the bandwidth usage. When a file is modified, it is processed again and new chunks are generated. But only those parts of the file that have been modified will generate different chunks compared to the previous version. The client will be aware of this and will only send the chunks that differ from the previous version.
  2. Optimize the storage usage. Without chunking, even the smallest modification of a file would force the client to upload the whole file again, consuming more storage space. With chunking, only the new chunks are stored, which significantly reduces storage usage.
  3. Improve large file transfers. Syncing a 3GB file is a costly operation for both the client and the server. Chunks are much smaller (between 32KB and 1MB), so each one can be transferred quickly. In addition, if the connection is lost while synchronizing, the client may resume the process from the last uploaded chunk and avoid uploading the entire file again.
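The first and third advantages both come down to comparing chunk identifiers. The sketch below assumes the client keeps a set of identifiers it knows are already stored (from the previous version or from an interrupted upload); the function name chunks_to_upload is hypothetical.

```python
from typing import List, Set


def chunks_to_upload(new_ids: List[str], known_ids: Set[str]) -> List[str]:
    """Return, in order and without repeats, the chunk IDs not yet stored remotely."""
    seen = set(known_ids)  # copy, so the caller's set is left untouched
    pending = []
    for chunk_id in new_ids:
        if chunk_id not in seen:
            seen.add(chunk_id)
            pending.append(chunk_id)
    return pending
```

Resuming after a lost connection is then the same call with a larger known set: every chunk that was uploaded before the failure is added to known_ids, and only the remainder is returned.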

When the chunking process finishes, the client proceeds to upload the chunks to the storage service, storing the chunks in the user’s private space.

Once the unique chunks are successfully submitted to the Storage back-end, the Indexer (see figure below) will communicate with the SyncService to commit the changes to the Metadata back-end, which is the component responsible for keeping versioning information and other metadata. The Metadata back-end may be a relational database like MySQL or a NoSQL data store such as Cassandra or Riak. Irrespective of the chosen database technology, the SyncService needs to provide a consistent view of the synced files. Allowing new commit requests to see uncommitted changes may result in unwanted conflicts.
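One way to picture the consistency requirement is a commit that is accepted only if it was built on the latest committed version. The sketch below is an in-memory stand-in, not the SyncService's real interface: the class MetadataStore, its commit method and the use of a single lock are all assumptions made for illustration.

```python
import threading
from typing import Dict, List, Tuple


class MetadataStore:
    """Hypothetical in-memory stand-in for the Metadata back-end."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # file_id -> list of versions, each version being an ordered chunk-ID list
        self._history: Dict[str, List[List[str]]] = {}

    def commit(self, file_id: str, base_version: int, chunk_ids: List[str]) -> Tuple[bool, int]:
        """Accept the commit only if base_version is the latest committed version.

        Returns (accepted, latest_version). The lock ensures a commit never
        observes another commit that is only partially applied.
        """
        with self._lock:
            versions = self._history.setdefault(file_id, [])
            current = len(versions)
            if base_version != current:
                return False, current  # conflict: client worked on a stale version
            versions.append(list(chunk_ids))
            return True, current + 1
```

A client whose commit is rejected learns the latest version number and can reconcile before retrying. A relational database would typically provide the same guarantee through a transaction rather than an explicit lock.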

When more than one user works with the same file, there is a high chance that they will accidentally update their local working copies at the same time. It is therefore critical that metadata is consistent at all times, not only to establish the most recent version of each individual file but also to record its entire version history. Although relational databases are often slower than NoSQL data stores, many NoSQL data stores do not natively support ACID transactions, which could compromise consistency unless additional complex programming is performed. Since the nature of the Metadata back-end strongly determines both the scalability and the complexity of the synchronization logic, an open modular architecture like StackSync can reduce the cost of innovation by providing the flexibility to meet changing needs.

Finally, when the SyncService finishes the commit operation, it will notify all interested devices of the latest changes. The device that originally modified the local working copy of the file will just update the Indexer when the confirmation from the SyncService arrives. The other devices will both update their local databases and download the new chunks from the Storage back-end. We are assuming that an efficient communication middleware mediates between devices and the SyncService. This middleware should support efficient marshaling and message compression to reduce traffic overhead. Most importantly, it should support scalable change notification to a large number of entities, using either pull or push strategies. Recall that, to deduplicate files and offer continuous reconciliation, the local database at the Indexer must stay in sync with the Metadata back-end, so notifications must be delivered quickly.
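The difference between the originating device and the others can be sketched as below. The message fields and the Device class are hypothetical and only illustrate the two code paths; the actual download from the Storage back-end is replaced by a simple set update.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Notification:
    """Hypothetical commit notification sent by the SyncService."""
    file_id: str
    version: int
    chunk_ids: List[str]
    origin_device: str


@dataclass
class Device:
    device_id: str
    local_db: Dict[str, int] = field(default_factory=dict)  # file_id -> version
    local_chunks: Set[str] = field(default_factory=set)

    def handle(self, note: Notification) -> List[str]:
        """Apply a notification and return the chunk IDs that had to be fetched."""
        self.local_db[note.file_id] = note.version
        if note.origin_device == self.device_id:
            return []  # originator already holds the chunks; it only records the confirmation
        # dict.fromkeys removes repeated IDs while preserving order
        missing = [c for c in dict.fromkeys(note.chunk_ids) if c not in self.local_chunks]
        self.local_chunks.update(missing)  # stands in for downloading from storage
        return missing
```

Because chunks are identified by their hash, a receiving device that already holds some of them locally only needs to fetch the rest.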

Sync interactions

The figure above illustrates the interaction between all components of a file sync engine in StackSync.