encryption / vocabulary / long term storage
While investigating the most appropriate encryption cipher and format, I realised I didn’t have enough vocabulary on the subject. This post aims to close that knowledge gap somewhat.
I’m looking at symmetric ciphers here, as they are used when storing lots of data. (In fact, when encrypting larger amounts of data, public/private key encryption (an asymmetric cipher) is only used to encrypt a separate key for the symmetric cipher, which is then used for the bulk of the data.)
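As a minimal sketch of that hybrid scheme — assuming OpenSSL and an RSA public key in pub.pem, with all file names chosen for this example:
$ openssl rand -hex 32 > session.key          # random symmetric key
$ openssl pkeyutl -encrypt -pubin -inkey pub.pem \
    -in session.key -out session.key.rsa      # wrap the key asymmetrically
$ openssl enc -e -aes-256-cbc -pbkdf2 \
    -pass file:./session.key \
    -in original.txt -out original.txt.enc    # bulk data goes symmetric
Only the wrapped session.key.rsa needs to travel with the data; the symmetric key itself is never stored in the clear.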
Properties
Initially, our objective was to find a suitable encryption file standard that satisfies the following properties:
safety
The encryption standard should be modern and hard to crack. We’re going for long term storage here; we still want to be relatively safe in 10 years.
integrity
If a file has been tampered with, we need to know. This may seem obvious, but without any digest/hash or message authentication code (generally an HMAC), a flipped bit will silently propagate through the symmetric decryption process and produce garbage, instead of signaling an error. (And when using a keyed hash (HMAC), we can check authenticity as well.)
standards
When we have to look at the files in 10 years, we must still be able to decrypt them. The secret key derivation, digest calculation and the symmetric cipher must still be around. And, importantly, all chosen parameters must be documented in or near the encrypted file container. (Salt, initialization vector, key derivation method and rounds, the chosen cipher…)
speed
When the above requirements have been satisfied, we want some speed as well. When access to the data is needed, we don’t want to have to wait for slow bzip2 decompression. Also relevant here is that some encryption methods support parallel decryption, while others don’t. (Although that can be mitigated by splitting up files into 1-4GB chunks, which incidentally improves cloud/object storage usability.)
Ingredients
For symmetric file encryption, we’ll need the following ingredients:
password
The password will be used to derive a secret key. The more entropy the password has, the fewer rounds of derivation we need to do.
key derivation function
The derivation function takes the password, an optional salt and derivation parameters and yields the key.
- GnuPG — PGP, GPG, I’ll call it GnuPG, because that’s the application we’ll be using — uses iterated and salted S2K.
- OpenSSL enc and others use PBKDF2 (PKCS5_PBKDF2_HMAC).
The derivation function salts the password, so one cannot precompute
results for common passphrases. It also ensures that separate files are
encrypted with a different key, even though they share the same
password.
More derivation iterations make the password harder to brute-force.
Both S2K and PBKDF2-HMAC accept various digest functions. Right now
the common one is SHA-256.
A quick note about iterations and counts:
- A PBKDF2-HMAC iteration means 2 times an N-bit hash is done.
- SHA-256 internally uses 64 octets, SHA-512 uses 128 octets.
- So, for each iteration, PBKDF2-HMAC with SHA-256 hashes 128 octets.
- The iterated and salted S2K counts octets, not iterations.
- Thus, the GnuPG default s2k-count of 65M (65011712) equals around 500,000 PBKDF2-HMAC iterations of SHA-256, or 250,000 iterations when using SHA-512 (see the quick check below).
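A quick sanity check of that arithmetic in plain shell — the S2K octet count divided by the octets hashed per PBKDF2-HMAC iteration:
$ echo $((65011712 / (2 * 64)))     # two SHA-256 blocks per iteration
507904
$ echo $((65011712 / (2 * 128)))    # two SHA-512 blocks per iteration
253952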
initialization vector (iv)
To avoid repetition in block cipher cryptographic output — which may
leak information — parts of the previous block get intermixed into the
next block. This removes repetitive patterns. To shield the first
block as well, the data starts with random initialization data. (See
the ECB penguin for a visual example of why a block cipher needs this
additional data.)
Generally, the IV should be random data, sent as cleartext. OpenSSL enc however derives the IV from the password; only the salt is sent in the clear.
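You can see this derivation in action with the -P flag, which prints the salt, key and IV and exits without encrypting anything (password chosen for this example):
$ openssl enc -aes-256-cbc -pbkdf2 -md sha512 -pass pass:example -P
salt=...
key=...
iv =...
Run it twice and the salt — and therefore the derived key and IV — changes.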
compression
Compression creates a more uniform distribution of octets, increasing the entropy per byte and again reducing patterns. However, the compression may increase encryption times by a lot. (Note that compression should not be used in interactive contexts. See a human readable TLS CRIME vulnerability explanation.)
message authentication
As mentioned above, block/stream cipher encrypted data means: feed it
garbage in, you get garbage out. Without an integrity check, you won’t
know that you’re dealing with garbage. The ubiquitous scheme for
validation here is Encrypt-then-MAC, whereby a keyed digest is
appended to the output. During decryption, this can be checked for data
errors while not giving away any clues about the cleartext. GnuPG
ships with something simpler called modification detection code (MDC,
using SHA-1 as the digest). OpenSSL enc doesn’t support any and would
require a separate hashing (or, better yet, HMAC) pass.
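Such a separate Encrypt-then-MAC pass could look like this — a sketch, assuming a second secret in mackey.txt, independent of the encryption password:
$ openssl enc -e -pass file:./password.txt -pbkdf2 \
    -aes-256-cbc -in original.txt -out encrypted.bin
$ openssl dgst -sha256 -hmac "$(cat mackey.txt)" \
    -out encrypted.bin.hmac encrypted.bin
On decryption, recompute the HMAC over the ciphertext and compare it before feeding anything to openssl enc -d.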
ciphers and block modes
Encryption ciphers come in various flavors. There are certainly newer
ones, but AES-128 and AES-256 are common and fast when hardware
acceleration (AES-NI) is used. As for block cipher modes: GnuPG uses
Ciphertext Feedback (CFB) mode. It is similar to the Cipher Block
Chaining (CBC) mode which OpenSSL enc has for AES. CBC and CFB cannot
be encrypted in parallel, as each block requires input from the
previous one. (Decryption of both can in principle be parallelized,
since all ciphertext blocks are available up front.)
Note that not all ciphers are block ciphers. Some, like RC4 and
ChaCha20, are stream ciphers, and they don’t require any additional
block mode — be it CBC or CFB or CTR. (About the Counter (CTR)
mode: this mode uses a sequential integer as repetition-busting data.
This one can be encrypted/decrypted in parallel.)
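Whether AES-NI is actually helping, and how a stream cipher stacks up, is easy to measure with the built-in benchmark (figures vary per CPU):
$ openssl speed -evp aes-256-cbc
$ openssl speed -evp chacha20
Without AES-NI, ChaCha20 typically wins by a wide margin.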
Usage
With all of those ingredients on the table, a sample encryption pass with GnuPG might look like this:
$ time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo SHA512 \
--cipher-algo AES256 \
--output ./encrypted.gpg \
./original.txt
real 0m10.923s
user 0m10.743s
sys 0m0.180s
Pretty slow. Let’s disable (the default ZLIB) compression:
$ time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo SHA512 \
--compress-algo none \
--cipher-algo AES256 \
--output ./encrypted-no-compression.gpg \
./original.txt
real 0m3.725s
user 0m3.136s
sys 0m0.588s
Much better, but now our sample file is larger. Let’s add the very
lightweight qlzip1 (qpress) to the mix. (See QuickLZ and qpress-deb.)
$ time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo SHA512 \
--compress-algo none \
--cipher-algo AES256 \
--output ./encrypted.qz1.gpg \
<(qlzip1 < ./original.txt)
real 0m1.402s
user 0m1.486s
sys 0m0.203s
Now we’re getting somewhere. Aside from the highly compressible (json)
original.txt, we’ve produced the following files in the meantime:
$ du -sh original.txt encrypted*
959M original.txt
90M encrypted.gpg
959M encrypted-no-compression.gpg
86M encrypted.qz1.gpg
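Decryption streams just as nicely, so nothing has to hit the disk uncompressed. A sketch, assuming the qlzcat1 decompressor that ships alongside qlzip1:
$ gpg --batch --quiet \
    --passphrase-file ./password.txt \
    --decrypt ./encrypted.qz1.gpg | qlzcat1 > restored.txt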
Checking various s2k-count and s2k-digest-algo values, to get a feel for the key derivation speed:
$ for c in 65011712 8388608 65536; do
for a in SHA1 SHA256 SHA512; do
rm -f ./encrypted-1-one-byte.qz1.gpg
out=$(a=$a c=$c bash -c '\
time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo $a --s2k-count $c \
--compress-algo none \
--cipher-algo AES256 \
--output ./encrypted-1-one-byte.qz1.gpg \
<(qlzip1 < ./1-byte-file.txt)' \
2>&1)
printf 't %s count %8d algo %6s\n' \
$(echo $out | awk '{print $2}') $c $a
done
done
t 0m0.490s count 65011712 algo SHA1
t 0m0.373s count 65011712 algo SHA256
t 0m0.275s count 65011712 algo SHA512
t 0m0.068s count 8388608 algo SHA1
t 0m0.052s count 8388608 algo SHA256
t 0m0.039s count 8388608 algo SHA512
t 0m0.003s count 65536 algo SHA1
t 0m0.004s count 65536 algo SHA256
t 0m0.005s count 65536 algo SHA512
So, as promised, taking a longer hash results in fewer S2K iterations: the counts above correspond to roughly 250,000, 32,000 and 250 iterations of 2 x SHA-512, respectively. Notice how the elapsed time quickly drops from half a second to milliseconds.
This is where strong (high entropy) passwords come into play. If the password is as random as the derived key would be, we don’t need to do any sizable derivation pass. That is, unless we want to slow people (both legitimate users and attackers) down on purpose. (If someone is trying to crack our file, they might as well skip the derivation step and guess the key instead. Only if they also want our (master) password should they do the derivation. Rotating the password every so often mitigates this.)
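Generating such a high entropy password is a one-liner; 32 random octets hold as much entropy as the 256-bit key we derive from them:
$ openssl rand -base64 32 > ./password.txt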
Okay. So what about OpenSSL? How does OpenSSL enc compare?
$ time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo SHA512 --s2k-count 65536 \
--compress-algo none \
--cipher-algo AES256 \
--output encrypted.qz1.gpg <(qlzip1 < ./original.txt)
real 0m1.097s
user 0m1.186s
sys 0m0.171s
$ time openssl enc -e -pass file:./password.txt \
-pbkdf2 -md sha512 -iter 250 -salt \
-aes-256-cbc \
-out encrypted.qz1.ossl-pbkdf2-250-sha512-aes256cbc \
-in <(qlzip1 < original.txt)
real 0m1.072s
user 0m1.047s
sys 0m0.197s
Pretty comparable! But, note how we have to store
ossl-pbkdf2-250-sha512-aes256cbc somewhere, because only the key
derivation salt is stored in the clear.
And, the OpenSSL version does not even include error detection!
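That also means decryption only works if we feed every parameter back in exactly; a sketch, again assuming qlzcat1 for decompression:
$ openssl enc -d -pass file:./password.txt \
    -pbkdf2 -md sha512 -iter 250 \
    -aes-256-cbc \
    -in encrypted.qz1.ossl-pbkdf2-250-sha512-aes256cbc \
    | qlzcat1 > restored.txt
Forget which digest or iteration count was used, and all you get is a "bad decrypt" error at best.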
So, we’ve looked at GnuPG and OpenSSL. What about zip? 7zip? Any others?
Both zip and 7-zip don’t only do compression and encryption, they also do file archiving. Archiving is not something we’re looking for here. Nor is their compression. Even if we checked the contents when decrypting (needed for integrity purposes), we’d still be using the wrong tool for the job. Did I mention that streaming (on-the-fly) encryption/decryption is a plus? They can’t do that. Being able to chain tools together keeps memory usage low and usability high.
What about something homegrown?
Well yes, for the full comparison experience, I happen to have a private
customer-built custom encryption tool — with a custom file format —
lying around: let’s call it customcrypt.py. It uses
PBKDF2HMAC(algorithm=SHA512, iterations=250) for key derivation,
HMAC(derived_key2, algorithm=SHA256) for Encrypt-then-MAC verification,
and a ChaCha20 stream cipher (hence no block mode required).
$ time python3 customcrypt.py \
-e <(qlzip1 < original.txt) \
encrypted.qz1.custom
real 0m1.212s
user 0m1.321s
sys 0m0.180s
Not too shabby either. (Although I question the usefulness of deriving an integrity key from the same password.) The drawback of this particular file format is that it hardcodes some parameters; i.e., it’s not self-documenting. (A problem we can fix.) But also, in 10 years, this script may get lost or have broken dependencies. (A problem that’s harder to fix.)
For the tests above, GnuPG 2.2.4 (on Ubuntu/Bionic) was used. Algorithm availability on the CLI:
$ gpg --version
gpg (GnuPG) 2.2.4
libgcrypt 1.8.1
...
Cipher: IDEA, 3DES, CAST5, BLOWFISH, AES, AES192, AES256, TWOFISH,
CAMELLIA128, CAMELLIA192, CAMELLIA256
Hash: SHA1, RIPEMD160, SHA256, SHA384, SHA512, SHA224
Compression: Uncompressed, ZIP, ZLIB, BZIP2
For OpenSSL the version was 1.1.1 (also Ubuntu). Algorithms can be found thusly:
$ openssl help
...
Message Digest commands (see the `dgst' command for more details)
blake2b512 blake2s256 gost md4
md5 rmd160 sha1 sha224
... (elided a few) ...
Cipher commands (see the `enc' command for more details)
aes-128-cbc aes-128-ecb aes-192-cbc aes-192-ecb
aes-256-cbc aes-256-ecb aria-128-cbc aria-128-cfb
... (elided a few) ...
Next time, we’ll look at symmetric cipher speed, which is the last factor in the mix. Then we can make a final verdict.