Scatter

Scatter generates M grains from a given file, which contain 1/Nth of the information each. Any N out of M grains allow reconstructing the original file with high probability (> 99 %).

Scatter is suitable for distributing data over multiple disks or other storage devices. It can recover data even if several grains are lost (e.g. disk failure) or unavailable at the same time. The resilience can be increased arbitrarily by increasing M.

Download

scatter.tar.bz2

Have a look at the README file for installation instructions.

Usage

scatter -g|--generate M [OPTIONS] FILE
Generates M grains from the (regular) file FILE. Grains are stored as files, and by default named FILE-SUFFIX, where SUFFIX is a random sequence. The code length N is 8 by default, but can be changed with the --code-length option (see below).

scatter -m|--merge [OPTIONS] GRAIN GRAIN …
Attempts to reconstruct the original file from the provided grain files. At least N grains are necessary, but more grains can be provided, in which case scatter will pick a suitable subset for reconstruction. Note that all grains must stem from the same source file, and be encoded with the same code length N.

scatter -i|--identify [OPTIONS] GRAIN [GRAIN …]
Prints information about grains.

Options

-c|--code-length N
Indicates the code length (2 ≤ N ≤ 256) used when generating grains. The default is N = 8. Longer codes require more computation.

-o|--output-file FILE
Specifies the base name of the generated grains, or the output filename when merging grains.

-h|--help
Prints a short help text and version information.

Typical usage examples

scatter -g 10 photo.png
Generates 10 grains with code length 8 (the default) of photo.png.

scatter -g 70 -c 64 -o /backup/photo.png photo.png
Generates 70 grains with code length 64 of photo.png and stores them in the backup folder.

scatter -m -o photo-reconstructed.png /backup/photo.png-*
Attempts to reconstruct the original file from the grains in the backup folder, and stores the reconstructed file in the current folder as photo-reconstructed.png.

Description

Scatter spreads the data of a single file over multiple smaller files (grains). Each grain thereby contains \(\frac{1}{N}\) of the information of the original file. If \(N\) independent grains are combined, the original file can be reconstructed.

The probability that \(N\) randomly selected grains are independent – the condition for file reconstruction – is approximately 99.6%. Each additional grain increases the reconstruction probability significantly. For \(N + 1\) available grains, for example, the probability is above 99.996%.

Note that if you generate \(M ≥ N\) grains of code length \(N\) at the same time, scatter will make sure that at least one subset of \(N\) grains is independent. Hence, the original file can always be reconstructed if none of the grains are lost.

When reconstructing a file, you can provide any number of grains. All grains must stem from the same original file, and use the same code length \(N\) (but may have been generated at different times and different runs of scatter). Among the provided grains, scatter will look for a subset of \(N\) independent grains and reconstruct the file using that subset.

In-Depth Description

Grains are linear combinations of the original file's bytes \(b_1, b_2, …, b_{Nk}\) over the finite field \(GF(2^8)\). The coefficients \(c_1, c_2, …, c_N\) are chosen randomly, and stored in the header of the file. For \(N = 3\), for example, the grain data \(d_1, d_2, …, d_k\) is calculated as follows:

\[ \left[ \begin{array}{ccc} c_1 & c_2 & c_3 \end{array} \right] \left[ \begin{array}{ccc} b_1 & b_4 & ⋯ & b_{3k - 2} \\ b_2 & b_5 & ⋯ & b_{3k - 1} \\ b_3 & b_6 & ⋯ & b_{3k} \end{array} \right] = \left[ \begin{array}{ccc} d_1 & d_2 & ⋯ & d_k \end{array} \right] \]

Hence, the first output byte is a linear combination of the first 3 input bytes, and each subsequent block of 3 input bytes yields one additional output byte. If the file size is not a multiple of 3, the input bytes are zero-padded. Since the matrix multiplication is carried out over \(GF(2^8)\), \(d_1, d_2, …, d_k\) are again bytes (i.e., values from 0 to 255, inclusive) and can be stored without further processing. The complexity of generating M grains is \(O(MNk)\).

To reconstruct a file from N independent grains, scatter creates the generator matrix G, and inverts it to obtain the reconstruction matrix. For N = 3, the generator matrix for the grain files f1, f2, and f3 takes the following form:

\[ G = \left[ \begin{array}{ccc} c_{f1,1} & c_{f1,2} & c_{f1,3} \\ c_{f2,1} & c_{f2,2} & c_{f2,3} \\ c_{f3,1} & c_{f3,2} & c_{f3,3} \end{array} \right] \]

If the grains are independent, this matrix is invertible, and the bytes of the original file (\(b_1, b_2, …, b_{3k}\)) can be reconstructed as follows:

\[ G^{-1} \left[ \begin{array}{ccc} d_{f1,1} & d_{f1,2} & ⋯ & d_{f1,k} \\ d_{f2,1} & d_{f2,2} & ⋯ & d_{f2,k} \\ d_{f3,1} & d_{f3,2} & ⋯ & d_{f3,k} \end{array} \right] = \left[ \begin{array}{ccc} b_1 & b_4 & ⋯ & b_{3k - 2} \\ b_2 & b_5 & ⋯ & b_{3k - 1} \\ b_3 & b_6 & ⋯ & b_{3k} \end{array} \right] \]

Again, matrix inversion and multiplication are carried out over \(GF(2^8)\). The complexity of the implemented matrix inversion algorithm is \(O(N^3)\) while processing the bytes takes \(O(N^2 k)\) time.

Grain File Format

A grain file contains (in that order):

Bytes	Description
4	Magic number "SGRN"
20	SHA1 of the original file
2	Number of padded bytes p
2	Code length N
N	Code coefficients
k	Grain data \(d_1, d_2, …, d_k\)

Note that the data length k is not stored explicitly in the file, as it can be derived straightforwardly from the size of the grain file, \(s_g\):

\[ k = s_g - 4 - 20 - 2 - 2 - N \]

The size \(s_f\) of the original file can be calculated as follows:

\[ s_f = Nk - p \]

Grains generated from the same file using the same code length have exactly the same size. In addition, the first 28 bytes of the header are identical. It's primarily the SHA1 sum which helps identifying grains, and avoiding that grains of different files are combined together. The SHA1 sum is also used as a checksum when verifying the reconstructed file.

Related Work

Scatter is an application of random network coding.

Scatter can be compared to split, a Unix tool to split files into smaller pieces. Split simply cuts the byte stream, and provides no resilience to data loss. In mathematical terms, split is equivalent to scatter with an identity matrix (and consequently M = N).

Scatter is conceptually similar to RAID5. The parity block can be seen as a linear combination over \(GF(2)\) (bit-wise XOR). RAID5 is limited to a single parity block (i.e. M = N + 1), but much faster than scatter. While RAID5 is used in live systems (where speed is crucial), scatter is more suitable for spreading backups over multiple disks.

Questions, Suggestions and Bug Reports

Please direct questions, suggestions, and bug reports to Thomas Lochmatter (thomas.lochmatter@viereck.ch).

License

Scatter is available under the terms of the MIT license.