encryption / vocabulary / long term storage
While investigating the most appropriate encryption cipher and format, I realised I didn’t have enough vocabulary on the subject. This post aims to close that knowledge gap somewhat.
I’m looking at symmetric ciphers here, as they are used when storing lots of data. (In fact, when encrypting larger amounts of data, public/private key encryption (an asymmetric cipher) is only used to encrypt a separate key for the symmetric cipher, which is then used for the bulk of the data.)
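As a minimal sketch of that hybrid scheme — assuming OpenSSL and an RSA public key in pub.pem, with all file names chosen for this example:
$ openssl rand -hex 32 > session.key          # random symmetric key
$ openssl pkeyutl -encrypt -pubin -inkey pub.pem \
    -in session.key -out session.key.rsa      # wrap the key asymmetrically
$ openssl enc -e -aes-256-cbc -pbkdf2 \
    -pass file:./session.key \
    -in original.txt -out original.txt.enc    # bulk data goes symmetric
Only the wrapped session.key.rsa needs to travel with the data; the symmetric key itself is never stored in the clear.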
Properties
Initially, our objective was to find a suitable encryption file standard that satisfies the following properties:
safety
The encryption standard should be modern and hard to crack. We’re going for long term storage here; we still want to be relatively safe in 10 years.
integrity
If a file has been tampered with, we need to know. This may seem obvious, but without any digest/hash or message authentication code (generally an HMAC), a flipped bit will silently propagate through the symmetric decryption process and produce garbage, instead of signaling an error. (And when using a keyed hash (HMAC), we can check authenticity as well.)
standards
When we have to look at the files in 10 years, we must still be able to decrypt them. The secret key derivation, digest calculation and the symmetric cipher must still be around. And, importantly, all chosen parameters must be documented in or near the encrypted file container. (Salt, initialization vector, key derivation method and rounds, the chosen cipher…)
speed
When the above requirements have been satisfied, we want some speed as well. When access to the data is needed, we don’t want to have to wait for slow bzip2 decompression. Also relevant here is that some encryption methods support parallel decryption, while others don’t. (Although that can be mitigated by splitting up files into 1-4GB chunks, which incidentally improves cloud/object storage usability.)
Ingredients
For symmetric file encryption, we’ll need the following ingredients:
password
The password will be used to derive a secret key. The more entropy the password has, the fewer rounds of derivation we need to do.
key derivation function
The derivation function takes the password, an optional salt and derivation parameters and yields the key.
- GnuPG — PGP, GPG, I’ll call it GnuPG, because that’s the application we’ll be using — uses iterated and salted S2K.
- OpenSSL enc and others use PBKDF2 (PKCS5_PBKDF2_HMAC).
The derivation function salts the password, so one cannot precompute
results for common passphrases. It also ensures that separate files are
encrypted with a different key, even though they share the same
password.
More derivation iterations make the password harder to brute-force.
Both S2K and PBKDF2-HMAC accept various digest functions. Right now
the common one is SHA-256.
A quick note about iterations and counts:
- A PBKDF2-HMAC iteration means 2 times an N-bit hash is done.
- SHA-256 internally uses 64 octets, SHA-512 uses 128 octets.
- So, for each iteration, PBKDF2-HMAC with SHA-256 hashes 128 octets.
- The iterated and salted S2K counts octets, not iterations.
- Thus, the GnuPG default s2k-count of 65M (65011712) equals around 500,000 PBKDF2-HMAC iterations of SHA-256, or 250,000 iterations when using SHA-512 (see the quick check below).
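A quick sanity check of that arithmetic in plain shell — the S2K octet count divided by the octets hashed per PBKDF2-HMAC iteration:
$ echo $((65011712 / (2 * 64)))     # two SHA-256 blocks per iteration
507904
$ echo $((65011712 / (2 * 128)))    # two SHA-512 blocks per iteration
253952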
initialization vector (iv)
To avoid repetition in block cipher cryptographic output — which may
leak information — parts of the previous block get intermixed into the
next block. This removes repetitive patterns. To shield the first
block as well, the data starts with random initialization data. (See
the ECB penguin for a visual example of why a block cipher needs this
additional data.)
Generally, the IV should be random data, sent as cleartext. OpenSSL enc however derives the IV from the password; only the salt is sent in the clear.
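You can see this derivation in action with the -P flag, which prints the salt, key and IV and exits without encrypting anything (password chosen for this example):
$ openssl enc -aes-256-cbc -pbkdf2 -md sha512 -pass pass:example -P
salt=...
key=...
iv =...
Run it twice and the salt — and therefore the derived key and IV — changes.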
compression
Compression creates a more uniform distribution of octets, increasing the entropy per byte and again reducing patterns. However, the compression may increase encryption times by a lot. (Note that compression should not be used in interactive contexts. See a human readable TLS CRIME vulnerability explanation.)
message authentication
As mentioned above, block/stream cipher encrypted data means: feed it
garbage in, you get garbage out. Without an integrity check, you won’t
know that you’re dealing with garbage. The ubiquitous scheme for
validation here is Encrypt-then-MAC, whereby a keyed digest is
appended to the output. During decryption, this can be checked for data
errors while not giving away any clues about the cleartext. GnuPG
ships with something simpler called modification detection code (MDC,
using SHA-1 as the digest). OpenSSL enc doesn’t support any and would
require a separate hashing (or, better yet, HMAC) pass.
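Such a separate Encrypt-then-MAC pass could look like this — a sketch, assuming a second secret in mackey.txt, independent of the encryption password:
$ openssl enc -e -pass file:./password.txt -pbkdf2 \
    -aes-256-cbc -in original.txt -out encrypted.bin
$ openssl dgst -sha256 -hmac "$(cat mackey.txt)" \
    -out encrypted.bin.hmac encrypted.bin
On decryption, recompute the HMAC over the ciphertext and compare it before feeding anything to openssl enc -d.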
ciphers and block modes
Encryption ciphers come in various flavors. There are certainly newer
ones, but AES-128 and AES-256 are common and fast when hardware
acceleration (AES-NI) is used. As for block cipher modes: GnuPG uses
Ciphertext Feedback (CFB) mode. It is similar to the Cipher Block
Chaining (CBC) mode which OpenSSL enc has for AES. CBC and CFB cannot
be encrypted in parallel, as each block requires input from the
previous one. (Decryption of both can in principle be parallelized,
since all ciphertext blocks are available up front.)
Note that not all ciphers are block ciphers. Some, like RC4 and
ChaCha20, are stream ciphers, and they don’t require any additional
block mode — be it CBC or CFB or CTR. (About the Counter (CTR)
mode: this mode uses a sequential integer as repetition-busting data.
This one can be encrypted/decrypted in parallel.)
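Whether AES-NI is actually helping, and how a stream cipher stacks up, is easy to measure with the built-in benchmark (figures vary per CPU):
$ openssl speed -evp aes-256-cbc
$ openssl speed -evp chacha20
Without AES-NI, ChaCha20 typically wins by a wide margin.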
Usage
With all of those ingredients on the table, a sample encryption pass with GnuPG might look like this:
$ time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo SHA512 \
--cipher-algo AES256 \
--output ./encrypted.gpg \
./original.txt
real 0m10.923s
user 0m10.743s
sys 0m0.180s
Pretty slow. Let’s disable (the default ZLIB) compression:
$ time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo SHA512 \
--compress-algo none \
--cipher-algo AES256 \
--output ./encrypted-no-compression.gpg \
./original.txt
real 0m3.725s
user 0m3.136s
sys 0m0.588s
Much better, but now our sample file is larger. Let’s add the very
lightweight qlzip1 (qpress) to the mix. (See QuickLZ and qpress-deb.)
$ time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo SHA512 \
--compress-algo none \
--cipher-algo AES256 \
--output ./encrypted.qz1.gpg \
<(qlzip1 < ./original.txt)
real 0m1.402s
user 0m1.486s
sys 0m0.203s
Now we’re getting somewhere. Aside from the highly compressible (json)
original.txt, we’ve produced the following files in the meantime:
$ du -sh original.txt encrypted*
959M original.txt
90M encrypted.gpg
959M encrypted-no-compression.gpg
86M encrypted.qz1.gpg
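Decryption streams just as nicely, so nothing has to hit the disk uncompressed. A sketch, assuming the qlzcat1 decompressor that ships alongside qlzip1:
$ gpg --batch --quiet \
    --passphrase-file ./password.txt \
    --decrypt ./encrypted.qz1.gpg | qlzcat1 > restored.txt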
Checking various s2k-count and s2k-digest-algo values, to get a feel for the key derivation speed:
$ for c in 65011712 8388608 65536; do
for a in SHA1 SHA256 SHA512; do
rm -f ./encrypted-1-one-byte.qz1.gpg
out=$(a=$a c=$c bash -c '\
time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo $a --s2k-count $c \
--compress-algo none \
--cipher-algo AES256 \
--output ./encrypted-1-one-byte.qz1.gpg \
<(qlzip1 < ./1-byte-file.txt)' \
2>&1)
printf 't %s count %8d algo %6s\n' \
$(echo $out | awk '{print $2}') $c $a
done
done
t 0m0.490s count 65011712 algo SHA1
t 0m0.373s count 65011712 algo SHA256
t 0m0.275s count 65011712 algo SHA512
t 0m0.068s count 8388608 algo SHA1
t 0m0.052s count 8388608 algo SHA256
t 0m0.039s count 8388608 algo SHA512
t 0m0.003s count 65536 algo SHA1
t 0m0.004s count 65536 algo SHA256
t 0m0.005s count 65536 algo SHA512
So, as promised, taking a longer hash results in fewer S2K iterations: the counts above correspond to roughly 250,000, 32,000 and 250 iterations of 2 x SHA-512, respectively. Notice how the elapsed time quickly drops from half a second to milliseconds.
This is where strong (high entropy) passwords come into play. If the password is as random as the derived key would be, we don’t need to do any sizable derivation pass. That is, unless we want to slow people (both legitimate users and attackers) down on purpose. (If someone is trying to crack our file, they might as well skip the derivation step and guess the key instead. Only if they also want our (master) password should they do the derivation. Rotating the password every so often mitigates this.)
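Generating such a high entropy password is a one-liner; 32 random octets hold as much entropy as the 256-bit key we derive from them:
$ openssl rand -base64 32 > ./password.txt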
Okay. So what about OpenSSL? How does OpenSSL enc compare?
$ time gpg --batch \
--symmetric --passphrase-file ./password.txt \
--s2k-mode 3 --s2k-digest-algo SHA512 --s2k-count 65536 \
--compress-algo none \
--cipher-algo AES256 \
--output encrypted.qz1.gpg <(qlzip1 < ./original.txt)
real 0m1.097s
user 0m1.186s
sys 0m0.171s
$ time openssl enc -e -pass file:./password.txt \
-pbkdf2 -md sha512 -iter 250 -salt \
-aes-256-cbc \
-out encrypted.qz1.ossl-pbkdf2-250-sha512-aes256cbc \
-in <(qlzip1 < original.txt)
real 0m1.072s
user 0m1.047s
sys 0m0.197s
Pretty comparable! But, note how we have to store
ossl-pbkdf2-250-sha512-aes256cbc somewhere, because only the key
derivation salt is stored in the clear.
And, the OpenSSL version does not even include error detection!
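That also means decryption only works if we feed every parameter back in exactly; a sketch, again assuming qlzcat1 for decompression:
$ openssl enc -d -pass file:./password.txt \
    -pbkdf2 -md sha512 -iter 250 \
    -aes-256-cbc \
    -in encrypted.qz1.ossl-pbkdf2-250-sha512-aes256cbc \
    | qlzcat1 > restored.txt
Forget which digest or iteration count was used, and all you get is a "bad decrypt" error at best.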
So, we’ve looked at GnuPG and OpenSSL. What about zip? 7zip? Any others?
Both zip and 7-zip don’t only do compression and encryption, they also do file archiving. Archiving is not something we’re looking for here. Nor is their compression. Even if we checked the contents when decrypting (needed for integrity purposes), we’d still be using the wrong tool for the job. Did I mention that streaming (on-the-fly) encryption/decryption is a plus? They can’t do that. Being able to chain tools together keeps memory usage low and usability high.
What about something homegrown?
Well yes, for the full comparison experience, I happen to have a private
customer-built custom encryption tool — with a custom file format —
lying around: let’s call it customcrypt.py. It uses
PBKDF2HMAC(algorithm=SHA512, iterations=250) for key derivation,
HMAC(derived_key2, algorithm=SHA256) for Encrypt-then-MAC verification,
and a ChaCha20 stream cipher (hence no block mode required).
$ time python3 customcrypt.py \
-e <(qlzip1 < original.txt) \
encrypted.qz1.custom
real 0m1.212s
user 0m1.321s
sys 0m0.180s
Not too shabby either. (Although I question the usefulness of deriving an integrity key from the same password.) The drawback of this particular file format is that it hardcodes some parameters; i.e., it’s not self-documenting. (A problem we can fix.) But also, in 10 years, this script may get lost or have broken dependencies. (A problem that’s harder to fix.)
For the tests above, GnuPG 2.2.4 (on Ubuntu/Bionic) was used. Algorithm availability on the CLI:
$ gpg --version
gpg (GnuPG) 2.2.4
libgcrypt 1.8.1
...
Cipher: IDEA, 3DES, CAST5, BLOWFISH, AES, AES192, AES256, TWOFISH,
CAMELLIA128, CAMELLIA192, CAMELLIA256
Hash: SHA1, RIPEMD160, SHA256, SHA384, SHA512, SHA224
Compression: Uncompressed, ZIP, ZLIB, BZIP2
For OpenSSL the version was 1.1.1 (also Ubuntu). Algorithms can be found thusly:
$ openssl help
...
Message Digest commands (see the `dgst' command for more details)
blake2b512 blake2s256 gost md4
md5 rmd160 sha1 sha224
... (elided a few) ...
Cipher commands (see the `enc' command for more details)
aes-128-cbc aes-128-ecb aes-192-cbc aes-192-ecb
aes-256-cbc aes-256-ecb aria-128-cbc aria-128-cfb
... (elided a few) ...
Next time, we’ll look at symmetric cipher speed, which is the last factor in the mix. Then we can make a final verdict.