Any file sync system worth its salt is able to skip transferring data the receiving side already has. This is known as source (or client-side) deduplication. Many storage systems (both cloud-based and local) also perform data deduplication, which makes it possible to store a single copy of each piece of data even when it is duplicated across files or users. (The two are usually implemented together, but are strictly speaking independent.)
Source and storage deduplication speed up operations and decrease transfer and storage costs. But they also open the door to a number of side channel attacks that threaten privacy, allow service abuse, and expose both users and service operators to legal risk. Popular public cloud services are particularly exposed, because anybody can register and use them (and, as shown below, attack them). The Sync Appliance, on the other hand, runs on your own computers, so access is restricted to authorized users (both regular users and guests, the latter with more limited capabilities). It nonetheless incorporates protections against the attacks detailed below.
Side channel attacks
In a threat model where the attacker can do everything a sync/storage client can (i.e., knows the protocol and can doctor any message), three kinds of attacks are made possible by source and data deduplication:
- abuse of the system for file distribution
- data privacy attacks
- abuse of the system as a covert channel
Abuse of the system for file distribution
Sync systems with source deduplication use a small value as a proxy for a file or data blob (generally a message digest such as SHA-256), sent along with the message signaling that a file has been added or modified. The recipient can check whether it already has data with the same digest and tell the sender there is no need to send it, or simply abstain from requesting it (the probability of a collision, as long as the underlying cryptographic primitive has not been broken, is effectively zero).
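As a minimal sketch (in Python, with illustrative names rather than any service's actual protocol), the exchange looks like this:

```python
import hashlib

# A naive source-deduplication handshake (illustrative names, not any
# service's actual protocol). The server trusts the digest alone: if it
# already stores data with that digest, the client skips the transfer.

class NaiveDedupServer:
    def __init__(self):
        self.blobs = {}  # digest (hex) -> data

    def add_file(self, digest: str) -> bool:
        """Return True if the client must upload the actual data."""
        return digest not in self.blobs

    def upload(self, data: bytes):
        self.blobs[hashlib.sha256(data).hexdigest()] = data

def sync(server: NaiveDedupServer, data: bytes):
    # The client announces the digest first and sends data only if asked.
    digest = hashlib.sha256(data).hexdigest()
    if server.add_file(digest):
        server.upload(data)
```

Note that add_file accepts any digest claim without any proof that the client holds the corresponding data.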
If the server takes a message digest at face value, mere knowledge of the digest of a file amounts to access to that file, because an attacker can use a modified client to pretend a file with a given digest has been added, and the server will skip data transmission if the file had been uploaded previously by any user.
This makes it possible for an attacker to abuse a file sync service as a file distribution service (a “free content delivery network”): publish the digest of a file, sync the file via the service, and let other users obtain it by using a modified client that “adds” a phony file with that digest; each recipient then downloads its “own” file via the web interface. Dropbox was one of the services affected by such an attack (early versions of Dropbox made the attack even easier, because clients could request any data block given its digest, regardless of whether the data was actually associated with the account performing the request).
This attack is prevented with proof of ownership (PoW) schemes, in which a client has to prove it actually holds some data before the server accepts that as a fact. There are several PoW schemes of varying complexity. The simplest, most naïve one is for the server to request small snippets of the data at random positions from the client.
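A sketch of that naive check (probe counts and lengths are arbitrary here):

```python
import os

# Naive proof-of-ownership: before honoring a digest claim, the server
# asks for a few snippets at random offsets and compares them against
# its own copy of the data.

def challenge(stored_len: int, n_probes: int = 8, probe_len: int = 16):
    """Server side: pick random (offset, length) probes over the stored data."""
    max_off = max(1, stored_len - probe_len)
    return [(int.from_bytes(os.urandom(4), "big") % max_off, probe_len)
            for _ in range(n_probes)]

def prove(claimed: bytes, probes):
    """Client side: answer with the requested snippets of the claimed data."""
    return [claimed[off:off + ln] for off, ln in probes]

def verify(stored: bytes, probes, answers) -> bool:
    """Server side: accept the claim only if every snippet matches."""
    return all(stored[off:off + ln] == ans
               for (off, ln), ans in zip(probes, answers))
```

Merely knowing a file's digest is no longer enough to pass this check; the attacker has to hold the actual bytes.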
Data privacy attacks
The second category of attacks builds on the fact that the system exposes to an attacker whether it already has a given blob of data (a whole file or a chunk). With source deduplication, this is told directly to the client (to avoid sending the data). With data deduplication, it can be inferred indirectly when information about the storage is available to the attacker (e.g., total data stored or free space). These attacks are possible whenever there is cross-user deduplication, which is unfortunately one of the most interesting use cases in the first place.
An attacker can both confirm whether a file already existed, and add multiple versions of the same file to determine which one was already present in the system, thus performing a brute force attack on its contents (e.g., recovering a 4-digit PIN in an otherwise known or guessable PDF).
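As a sketch, reusing the naive server above and a hypothetical make_pdf_with_pin helper that produces each candidate document:

```python
import hashlib

# Brute-force attack using the deduplication response as an oracle.
# make_pdf_with_pin stands in for whatever produces each candidate file
# (e.g., the known PDF template protected with a given PIN).

def find_pin(server, make_pdf_with_pin):
    for pin in range(10000):
        candidate = make_pdf_with_pin(f"{pin:04d}")
        digest = hashlib.sha256(candidate).hexdigest()
        if not server.add_file(digest):
            # "no upload needed" means this exact version is already stored
            return pin
    return None
```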
Note that these attacks are not prevented by PoW schemes.
An early instance of these attacks affected the Tahoe-LAFS decentralized cloud storage system, which defended against them by using heuristics to disable deduplication selectively; e.g., deduplication can be disabled for PDF files under a given size.
Protection
Indirect inference of whether data is already present in the storage backend is prevented in the obvious way, by not allowing malicious users to obtain precise information on the storage backend (total data or free space).
There exists a randomized scheme where the server allows source deduplication only once a file has been uploaded a given number of times t, where t is an arbitrary threshold, unknown to the client and different for each file. When the server requests a file (signalling it as “not known yet”), the attacker cannot tell whether the file was actually absent from the service or is being requested again despite already being known.
Details
In this scheme, the server must track the number of uploads of any given file to the system.
It also requires that data deletion be tracked correctly, to prevent an attacker from determining the apparent threshold t by removing t copies of the file (possibly using t different accounts) and then uploading them again: if source deduplication now happens after t - 1 attempts, the attacker will know another user has uploaded the file in the meantime.
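Putting the two requirements together, a sketch of the scheme (keeping the toy interface from the earlier sketches; the maximum threshold is an arbitrary parameter):

```python
import random

# Randomized deduplication threshold: each digest gets a secret per-file
# threshold t, and the data is requested until t uploads have been seen,
# even when the server already stores it.

class RandomizedDedupServer:
    def __init__(self, max_threshold: int = 10):
        self.max_threshold = max_threshold
        self.blobs = {}      # digest -> data
        self.uploads = {}    # digest -> live upload count
        self.threshold = {}  # digest -> secret threshold t

    def add_file(self, digest: str) -> bool:
        """Return True if the client must upload the data."""
        t = self.threshold.setdefault(digest,
                                      random.randint(1, self.max_threshold))
        self.uploads[digest] = self.uploads.get(digest, 0) + 1
        return digest not in self.blobs or self.uploads[digest] < t

    def delete_file(self, digest: str):
        # Deletions must decrement the count, or an attacker could drain
        # t copies and re-probe to detect uploads by other users.
        if self.uploads.get(digest, 0) > 0:
            self.uploads[digest] -= 1
```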
Moreover, it is necessary to prevent timing (side channel) attacks on the randomized scheme. Consider what happens when a file is present or not:
file not present              file present
----------------------------  --------------------------------------------------
determine if present          determine if present
file missing: request it      randomized scheme: determine the threshold and act
                              on it (request the file if the number of uploads
                              is below the threshold)
The attacker can infer which of the two situations applies through precise timing of the response, which will be slightly faster when the file is not present. This is not possible with a single attempt, but the attacker can initiate (and quickly interrupt) an arbitrary number of uploads of the same file and time them all, obtaining accurate averages with arbitrary precision.
One solution against this attack is to deny the attacker timing information by making both paths take the same fixed amount of time (e.g., always respond 100 ms after the request is received).
Another is to cache the “request or not” decision for a given file, so that there is no longer a forking code path: all subsequent requests are serviced in constant time (or, more precisely, in time independent of whether the file was present to begin with). Such caching must be global to the service (i.e., honored by all servers if there are several).
It is desirable to purge such a cache periodically, both to let source deduplication work effectively and to avoid unbounded growth. If a cached decision expires after one hour, the attacker gets at most one actual timing measurement per hour. Over the public internet, many requests are needed for accurate timing analysis, so the amount of data leaked is very small, and arbitrarily so; this compounds with the fact that brute force attacks need many iterations.
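A sketch of such a cache (the one-hour TTL matches the example above; a real deployment would use a store shared by all servers rather than a per-process dictionary):

```python
import time

# Cached-decision defense: the first query for a digest fixes the
# "request it or not" answer for CACHE_TTL seconds, so repeated timing
# probes all take the same code path.

CACHE_TTL = 3600  # one hour

class DecisionCache:
    def __init__(self):
        self.cache = {}  # digest -> (decision, expiry timestamp)

    def decide(self, digest: str, compute_decision) -> bool:
        now = time.time()
        cached = self.cache.get(digest)
        if cached is not None and cached[1] > now:
            return cached[0]  # no forking code path on repeated probes
        decision = compute_decision(digest)  # at most one real measurement per TTL
        self.cache[digest] = (decision, now + CACHE_TTL)
        return decision
```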
The Sync Appliance allows choosing the source deduplication strategy separately for regular users and for guests (who cannot have private data of their own and only work in spaces they have been invited to):
- deduplicate always (faster)
- randomized scheme
- never perform source deduplication (slower)
By default, deduplication is always performed for regular users (trusted), and the randomized scheme is used for guests (external to the organization and not quite as trusted).
Covert channel
Knowledge of whether a file exists in the storage system or not amounts to one bit of information. An unprotected service with source deduplication thus allows transferring one bit per file in a roundabout manner: one user creates (or not) a file with a specific content, and another verifies whether it exists by attempting the corresponding upload. (It is easy to come up with a protocol to transfer an arbitrarily long bit string: create files with contents XXXX000001, XXXX000002, and so on, where XXXX is a random, unique prefix chosen beforehand.)
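As a sketch against the naive server above (the prefix plays the role of XXXX):

```python
import hashlib

# Covert channel over deduplication: bit i is set by uploading (or not)
# a file with content prefix + counter i, and read back by checking
# whether that file deduplicates.

def send_bits(server, prefix: bytes, bits):
    for i, bit in enumerate(bits):
        if bit:
            server.upload(prefix + f"{i:06d}".encode())

def receive_bits(server, prefix: bytes, n: int):
    digests = (hashlib.sha256(prefix + f"{i:06d}".encode()).hexdigest()
               for i in range(n))
    # add_file returning False means the server already has the file: bit set
    return [0 if server.add_file(d) else 1 for d in digests]
```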
Fortunately, the same measures used to thwart the above privacy attacks are effective in preventing such a covert channel.
Wrap-up
Deduplication speeds up operations and decreases transfer and storage costs. At the source, it allows a node to avoid sending data already known to the recipient. In a storage backend, data deduplication ensures only one copy of the data is saved. A consequence is that the sender knows (directly in the former case, indirectly in the latter) whether the data was already present at the recipient. This opens the door to a number of side channel attacks that threaten data privacy, allow service abuse, and expose both users and service operators to legal risk. These attacks can be thwarted, while retaining most of the efficiency gains of deduplication, by selectively disabling source deduplication with a combination of heuristics and more rigorous probabilistic schemes, and, in the case of storage deduplication, by preventing leakage of precise storage backend information.