Providing integrity for Encrypted data with HMACs in .NET

Once again I’m about to dip into and scratch the surface of cryptography. So here’s the disclaimer: This is not my job. I don’t do this for a living. Don’t ever make up your own encryption algorithms. Try not to write your own cryptographic code. Don’t take anything that I say as the advice of a security expert or as legal advice, and don’t assume that any code in this post or any linked post is correct in any way, shape, or form. If you have a project that requires cryptographic security, I suggest you find someone who has been doing this far longer than I have to write the code for you. Then have several third-party security firms review or write your security code. In short, I’m not responsible for your mistakes, or for the correctness of the code or writing presented here.

As even more of a taste of just how hard it is, I admit that I thought the last article I wrote covered most of the basics of modern encryption in .NET pretty well. The result? Nope. Missed something big. See, cryptography is hard not because it’s hard to say “take a password plus this data and make it so someone can’t see the data without the password”, but because of all the ways someone could get access to your password, or access your data when it’s not encrypted, or access your computer when it’s unlocked, or crack your password if you pick a bad one to begin with. Take a look at this reddit thread that nicely rips apart my previous article. In all seriousness, I really appreciate the feedback, especially when it comes to security, because it’s hard and the devil is in the details.

I forgot to cover data integrity in my last article. Encryption is not enough. Redditors referred to it as authentication, which is one of the uses of HMACs, but authenticity is not what’s important in the example case I presented in my original article. We are interested in “how to detect that someone or something accidentally or maliciously changed the encrypted data”; essentially, we’re interested in cryptographic data integrity. In addition to that, there were a few other things wrong with the article that the redditors pointed out, such as my salt values being overly large, just picking CBC without covering other chaining modes, not discussing how to remove keys from memory, and referencing Jeff Atwood. The long and the short of it: they’re completely right. Getting crypto correct is hard, and good crypto systems are worth millions of dollars. You’re probably not getting paid that much to write a little encryption. So use a library that’s already been written and provides high-level abstractions, and don’t write crypto code yourself.

Alright. HMAC, what is it? The MAC stands for Message Authentication Code, the H stands for Hash. Put it all together and you have a Hash-based Message Authentication Code. It’s a keyed hashing function that’s deliberately designed to resist malicious tampering. The key is preventing malicious tampering. A normal hash function (e.g. MD5 or SHA1) would detect accidental byte tweaks, but somebody maliciously tampering with your data could tinker with it, create a new hash code for the data, replace the old hash code, and you would never be the wiser. For this reason, Message Authentication Codes generally fall into the category of keyed hash algorithms: they mix a key or derived key with a hashing or encryption function to produce a value that an attacker can’t reproduce, providing both data integrity and (if you happen to be in a sender and receiver role where both parties share a key) authentication.

There are, however, a couple of things to take into account. First, we shouldn’t use the same key both to encrypt and to authenticate the data. The more times the same key is used, especially against a known piece of data, the more likely it is that an attack can be developed and used to figure out our key. Arguably, the final key used for encryption and the final key used for message authentication should be as different as possible. One way to do this is to change the input (here, the salt) so that the encryption key and the authentication key each run through the KDF (Key Derivation Function) from a different starting point. As an example, consider the following:

// Derive two separate keys from the same password with PBKDF2: one salt per key, 10,000 iterations each.
var deriveKey = new Rfc2898DeriveBytes(password, passwordSalt, 10000);
var deriveHMAC = new Rfc2898DeriveBytes(password, hmacSalt, 10000);
// 32 bytes (256 bits) for the AES-256 key, and 32 bytes for the HMAC key.
var aes256Key = deriveKey.GetBytes(32);
var hmacKey = deriveHMAC.GetBytes(32);

Because the KDF mixes the password with the salt, and because we use different salts, the derived keys already differ after the first round. So we have two derived keys: one for the actual encryption, and one to prevent tampering and provide data integrity.
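
To make the two-salt idea concrete, here is a minimal, self-contained sketch. The `KeyDerivationSketch` class, the `DeriveKey` helper, the sample password, and the 16-byte salt size are all my own illustrative assumptions, not part of the original demo code. It also uses the `Rfc2898DeriveBytes` overload that names the hash algorithm explicitly (SHA-1, which is what the three-argument constructor uses by default), since newer .NET versions flag the shorter constructor as obsolete.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;

public static class KeyDerivationSketch
{
    // Hypothetical helper: derives a 32-byte key from a password and salt
    // using PBKDF2 with 10,000 iterations, as in the snippet above.
    public static byte[] DeriveKey(string password, byte[] salt)
    {
        using (var kdf = new Rfc2898DeriveBytes(password, salt, 10000, HashAlgorithmName.SHA1))
        {
            return kdf.GetBytes(32);
        }
    }

    public static void Main()
    {
        // Illustrative salt size (an assumption); the salts come from a cryptographic RNG.
        var passwordSalt = new byte[16];
        var hmacSalt = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(passwordSalt);
            rng.GetBytes(hmacSalt);
        }

        const string password = "not a real password";
        var aes256Key = DeriveKey(password, passwordSalt);
        var hmacKey = DeriveKey(password, hmacSalt);

        // Same password, different salts: the two keys should not match.
        Console.WriteLine("Keys match: " + aes256Key.SequenceEqual(hmacKey));

        // Same password, same salt: the key can be re-derived at decryption time.
        Console.WriteLine("Re-derived key matches: " + aes256Key.SequenceEqual(DeriveKey(password, passwordSalt)));
    }
}
```

Both salts have to be stored alongside the encrypted data (they are not secret), otherwise neither key can be re-derived when it is time to decrypt.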

On a side note, it’s arguably better to authenticate the encrypted output (encrypt then authenticate) rather than authenticate the plaintext and then encrypt, or authenticate and encrypt. Again, I know it’s beating a dead horse, but cryptography is hard, and ultimately the security of a system is going to depend on the security of the entire system, not just the individual parts.

So. We have an HMAC key, we have an encryption key, and we know that we want to encrypt, then authenticate the encrypted output. In addition, we want to make sure anything else that could easily be tampered with is also authenticated, such as our Initialization Vector, since any change to it can easily affect our decrypted output. Another small advantage that’s almost not worth mentioning: by authenticating the encrypted output instead of the plaintext, we can detect that something has changed before we even start decrypting.

So here’s an example. In this case, I chose HMACSHA1; there are others listed on MSDN, but I chose this one because it uses the same hash algorithm used internally by the KDF from my previous post, Rfc2898DeriveBytes (aka PBKDF2).

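// Key the HMAC with the derived hmacKey, not the encryption key.
var hmac = new HMACSHA1(hmacKey);
// Authenticate the IV together with the ciphertext so that neither can be changed undetected.
var ivPlusEncryptedText = iv.Concat(cipherTextBytes).ToArray();
var hmacHash = hmac.ComputeHash(ivPlusEncryptedText);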

In this case, we’re using our derived hmacKey, and we’re computing the hash of the initialization vector concatenated with our encrypted ciphertext. That gives us everything we need for a self-validating “package” of data that is secured and can’t be tampered with without us knowing, unless the attacker knows our keys or can break the underlying primitives (AES-256, HMAC-SHA1), but at that point this whole discussion is pointless.
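
As a rough illustration of what “can’t be tampered with without us knowing” means, the sketch below computes the tag over the IV plus ciphertext, flips a single bit of the IV, and recomputes. The `TagSketch` class, the `ComputeTag` helper, and the placeholder key, IV, and ciphertext bytes are assumptions made up for this example; in the real flow those values come from the KDF and AES steps.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;

public static class TagSketch
{
    // Hypothetical helper: HMAC-SHA1 over IV || ciphertext, mirroring the snippet above.
    public static byte[] ComputeTag(byte[] hmacKey, byte[] iv, byte[] cipherTextBytes)
    {
        using (var hmac = new HMACSHA1(hmacKey))
        {
            return hmac.ComputeHash(iv.Concat(cipherTextBytes).ToArray());
        }
    }

    public static void Main()
    {
        // Placeholder inputs for illustration only.
        var hmacKey = Enumerable.Range(0, 32).Select(i => (byte)i).ToArray();
        var iv = new byte[16];
        var cipherTextBytes = new byte[] { 0x10, 0x20, 0x30, 0x40 };

        var tag = ComputeTag(hmacKey, iv, cipherTextBytes);
        Console.WriteLine("Tag length in bytes: " + tag.Length);

        // Flip one bit of the IV, as an attacker might, and recompute the tag.
        var tamperedIv = (byte[])iv.Clone();
        tamperedIv[0] ^= 0x01;
        var tamperedTag = ComputeTag(hmacKey, tamperedIv, cipherTextBytes);
        Console.WriteLine("Tags match after tampering: " + tag.SequenceEqual(tamperedTag));

        // Someone without the key cannot produce the original tag either.
        var otherKey = new byte[32];
        var forgedTag = ComputeTag(otherKey, iv, cipherTextBytes);
        Console.WriteLine("Tag matches under a different key: " + tag.SequenceEqual(forgedTag));
    }
}
```

Plain concatenation works here because the AES IV has a fixed length of 16 bytes, so there is only one way to split the authenticated bytes back into IV and ciphertext. If a variable-length field were added to the package, its length would need to be authenticated too.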

With decryption, remember how I said we compute the encryption and HMAC keys separately? Because we did that, and because we computed the HMAC over the encrypted data, we can perform the validation step before we compute our decryption key. The only reason to do this is so that if the data is invalid or has been tampered with, we don’t take the time to also compute the decryption key. Small things, but I wanted to explain why the key computation for the encryption and the HMAC is kept separate:

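// Re-derive the HMAC key first; the encryption key is not needed until the data has been validated.
var deriveHmac = new Rfc2898DeriveBytes(password, hmacSalt, 10000);
var hmacKey = deriveHmac.GetBytes(32);
var hmacsha1 = new HMACSHA1(hmacKey);
// Recompute the tag over exactly the same bytes that were authenticated during encryption.
var ivPlusEncryptedText = ivBytes.Concat(encryptedBytes).ToArray();
var hash = hmacsha1.ComputeHash(ivPlusEncryptedText);

// hmac is the tag that was stored alongside the encrypted data.
if (!BytesAreEqual(hash, hmac))
   throw new CryptographicException("Your encrypted data was tampered with!");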

var deriveKey = new Rfc2898DeriveBytes(password, passwordSalt, 10000);
var aes256Key = deriveKey.GetBytes(32);

byte[] decryptedData;
using (var transform = new AesManaged())
using (var ms = new MemoryStream(encryptedBytes))
using (var cryptoStream = new CryptoStream(ms, transform.CreateDecryptor(aes256Key, ivBytes), CryptoStreamMode.Read))
using (var plainStream = new MemoryStream())
{
   // A single Read call is not guaranteed to return all of the plaintext, so copy until the stream ends.
   cryptoStream.CopyTo(plainStream);
   decryptedData = plainStream.ToArray();
}
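
The `BytesAreEqual` helper called above isn’t shown in this post. One plausible way to write it, and this is only my assumption about what such a helper could look like, is a comparison that always walks the full length of both arrays instead of returning at the first mismatch, so that how long the check takes doesn’t hint at how many leading bytes of a forged tag were correct. The `ComparisonSketch` class name is made up for the example.

```csharp
using System;

public static class ComparisonSketch
{
    // Hypothetical stand-in for the BytesAreEqual helper used in the decryption code.
    public static bool BytesAreEqual(byte[] a, byte[] b)
    {
        if (a == null || b == null || a.Length != b.Length)
            return false;

        // Accumulate differences with OR so every byte is inspected,
        // whether or not an earlier byte already differed.
        int diff = 0;
        for (int i = 0; i < a.Length; i++)
            diff |= a[i] ^ b[i];

        return diff == 0;
    }

    public static void Main()
    {
        var expected = new byte[] { 1, 2, 3, 4 };
        Console.WriteLine(BytesAreEqual(expected, new byte[] { 1, 2, 3, 4 })); // True
        Console.WriteLine(BytesAreEqual(expected, new byte[] { 1, 2, 3, 5 })); // False
        Console.WriteLine(BytesAreEqual(expected, new byte[] { 1, 2, 3 }));    // False
    }
}
```

On newer .NET versions, `CryptographicOperations.FixedTimeEquals` in `System.Security.Cryptography` does this job and is probably the better choice over a hand-rolled loop.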

So, there you go. A basic explanation of why HMACs are important, what I missed, some code, and the disclaimer to write security code at your own risk. Full demo code, demo code output, and a bunch of random links after the break.

See the code, output, and related links