Authentication for beginners, part 1: You don’t store the passwords

The best way to protect a secret is not knowing the secret at all.

Oct 25, 2024

If you're new to authentication and think 'what's so hard about storing usernames and passwords to make a simple accounts system?' - well then you're already doing it wrong!

The truth is, you should never store the passwords of your users. But then the question arises, what do you mean “I should not store” and how is it possible?

In this post, I will explain what I mean exactly, and how to implement a secure accounts system without storing any of the user's passwords.

TL;DR

The best way to protect a secret is not knowing the secret at all.

The accounts system should only ever store a proxy to the user's password, which we often refer to as password hashes. We use the user’s password as the secret to compute its hash via a computationally-expensive algorithm (by purpose).

Because a user's password is very often the same or similar across different websites, the system should always generate a random string to mixin with the password to achieve a certain level of unpredictability (of the secret) for computing the signature. This process is referred to as “adding salt” and those signatures are then called “salted hashes”.

As of this writing, popular and safe choices of signature algorithms are Argon2, Bcrypt, Scrypt and PBKDF2 with cryptographically-randomly-generated slats and right parameters.

You wanted to store passwords as plaintext

First of all, I want to clarify what “store” means in the context of this post. It means keeping something in persistent storage, very often, if you’re building an accounts system for your website, then your persistent storage is a database.

Let’s say, you have the following database table schema for your `users` table, where you have a column for the username and another column for the user’s password.

How would you go about checking a user's credentials when they try to authenticate (aka. log in, sign in)?

In the Go programming language, you could do it as follows (the code snippet is for illustration purpose only):

You may now ask, “isn’t that right? What’s wrong with it? Can’t users sign in perfectly to my website and my job is done?”

Well, it is functionally correct, but security-wise, it is a nightmare.

“How so? Aren’t users without the correct passwords unable to sign in? Are you selling me shit like NFTs?” You’re getting skeptical.

Systems developed nowadays are well aware of common attack vectors, some even directly baked into the framework you choose. Unfortunately, there is still an unsolved attack vector that accounts for many of the data breaches that you hear about in the news, which is what we call “social engineering”. What that means, in simple terms, is that the system itself is robust and secure (like you’re checking a user's credentials to be able to authenticate), but access to the system infrastructure and persistent storage, on the other hand, is often overlooked. That is saying, if the access to the system infrastructure is compromised (which is often much easier than attacking the system directly), then all of the defenses built in the system are effectively bypassed because of the malicious users (aka. hackers) get the backdoor access.

Now, think of it again, all of your users’ passwords in what, plaintext, what a pleasant gift to the hackers for them to sell. You’re literally making them rich by sacrificing your own business brand (tax-free assets transfer, huh?).

“Wait a minute”, you said, “access control in my company is heavily locked down.” Sure, you can and you should, but then why are there “insider trades” in the stock market even in such a highly regulated and policed industry? What about a malicious application silently installed on your employees devices and monitoring screens and/or network traffic before your IT gets alerted? What about an application bug that accidently logs the credentials and/or sends them to third-party vendors (who have their own set of security problems)? What about taking pictures from a mobile device? What about the database backup storage got compromised? The list can go on for the rest of my life, but I hope you get what’s the problem here: there are infinite possibilities to leak users’ passwords in this imperfect world.

“What do I do now?” Don’t sit there with you hands in your hair just yet, remember:

The best way to protect a secret is not knowing the secret at all.

If your persistent storage and your employees do not even know what the passwords are, it is impossible to leak through them (by chance or on purpose) just because they have high privilege access to your system infrastructure.

Taking one step back, the fundamental reason that we want to avoid storing users’ passwords is coming from two basic security principles when doing your system designs:

The least privilege principle: Access is only granted to those who actually need it, and revoked when no longer needed.
The defense in depth principle: Put security measures in every possible layer, not just the current outermost layer (usually the user-facing layer).

How do these two principles apply to our storyline here?

The least privilege principle for user passwords
1. Who else besides the users themselves need to know the password? The answer is only the application logic (that lives only in memory) that validates the username and password.
The defense in depth principle for user passwords
1. The system should not only check user passwords for authentication, but also store the passwords securely.

“OK OK, I’m sold, what should I do instead?”

Let’s ask ourselves this question: why do users need to enter their username and password for authentication, in the first place?

Because we need an approach for users to prove that they are who they claim they are by asking something theoretically only the users themselves know. Therefore, the question becomes, is there a way to prove the user knows the correct password without storing the password itself?

Yes, in computer science, a common technology that can achieve exactly that, is called hashing. It uses the following attributes that we can take advantage of:

The same output is guaranteed given the same input
It is absurdly time-consuming to reverse the input from an output
- This is why hashing algorithms are always evolving with the commodity computing powers and mathematical theories. What was considered irreversible before may not hold true today or in the near future.
Some algorithms are designed to be computationally expensive, making computing all possible variants of the outputs absurdly expensive.
- Similarly, what it means to be “absurdly expensive” evolves with the commodity computing powers and mathematical theories.

To summarize, instead of storing user passwords in plaintext, we should store hashed results (aka. signatures) in the persistent storage.

MD5 looks pretty random to me?

MD5 was once the popular choice but has been considered insecure because it lost some of the attributes with the current commodity computing powers and mathematical theories:

It is now practically possible to get the same output with different inputs (aka. collisions)
It is now computationally inexpensive

Same for many other older hashing algorithms like MD4, NTLM, etc. you may never even heard of (which might be good so you won’t try to use those insecure algorithms), making them no longer suitable to be used as password hashing algorithms.

Then what about the SHA family?

SHA1 is in a very close spot as MD5. SHA256 and SHA512 are better choices nowadays, but in terms of protecting user passwords, SHA family isn’t a great choice because they are not computationally expensive enough to prevent cracking passwords with a brute-force approach.

Salted hashes

“I know I don’t need to store the actual passwords by using a hashing algorithm, but what the heck are salts?”

One of the advantages to us, can be the disadvantage to us as well. The attribute that “the same output is guaranteed given the same input” of hashing algorithms gives predictability to both us and hackers.

Let’s say, when the databases were compromised on other sites through data breaches, or precomputed hashes of common passwords offline (see https://weakpass.com as an example), hackers then have a very large dataset of mappings between leaked/known passwords and their corresponding hashes. With that, they are able to compare against leaked data from your database to efficiently reveal the original passwords, authenticate against your system legitimately, and start stealing user’s data. This attack technique is called dictionary attack.

The defense technique that we call “adding salts” comes from the idea that, every time you add salts to your dishes, no two additions will be exactly the same in terms of the amount of salts, nor the shape of every salt. This technique effectively makes any precomputation of hashes useless, and requires hackers to recompute almost everything, which when things done right on our side, becomes absurdly expensive.

The salts however, must be cryptographically-randomly generated to make this defense technique effective, i.e. do not use pseudo-random generators.

In the Go programming language, you could do it as follows (the code snippet is good to use as-is):

Who are the cool kids?

As of this writing, popular and safe choices of signature algorithms are Argon2, Bcrypt, Scrypt and PBKDF2 with cryptographically randomly-generated salts and right parameters.

“Wait wait, not that fast! What are the right parameters?”

Unlike hashing algorithms like the SHA family, where you have fixed parameters (e.g. SHA1 computes 20-byte signatures, SHA256 computes 32-byte signatures, SHA512 computes 64-byte signatures), modern hashing algorithms are configurable in terms of the level of computational expensiveness.

For example,

you may choose the length of entropy, which usually means the length of the salts added. As a contrary, SHA family or any older hashing algorithms have no builtin support for salts.
and number of compute iterations, SHA family has exactly one iteration, modern algorithms default to thousands or tens of thousands of iterations. It makes the computation of the final signature both CPU and memory intensive, and ultimately achieving the goal of being computational expensive.

Be careful though, always use well-respected implementations of those hashing algorithms in your programming language, and do not implement any of those algorithms on your own and think you're a genius unless you indeed are (in which case, better contribute to those implementations instead). The reasons being those well-respected implementations are usually battle-tested in production systems who often had to go through extensive penetration tests to ensure there are no exploitable flaws in those systems. Better to enjoy the free rides when it comes to cryptography than pay the cost on your own.

Be aware of the performance implications

For new accounts systems, it is recommended to start using Argon2 as the strongest password hashing algorithm, but like every design decision in computer science, it’s all about trade-offs.

Theoretically, the longer the entropy and the more compute iterations the more secure, it is more costly for your computing infrastructure as well. You definitely do not want to end up in a situation where the old saying goes, “killing flies with a machine gun”, that the algorithm and the parameters you choose drags down your system performance to to the point regular user authentication becomes a timing attack to your system (e.g. it takes one minute to complete any authentication because you chose to have ten million compute iterations with 99 as your entropy length). A good balance is that the signature computation is expensive enough to make computing at a massive scale absurdly expensive for unethical hackers, but cheap enough to feel instantaneous at the time of user authentication.

Speaking of the timing attack, one little detail is that you should always use a constant-time comparison function when possible.

In the Go programming language, you could do it as follows (the code snippet is for illustration purpose only):

You can find a simple benchmark on my GitHub repository to give you a grasp of the performance implications between different password hashing algorithms with recommended parameters.

Some takeaways:

Since SHA family alone isn’t a good password hashing algorithm, their results are meaningless compared to other real algorithms, but you can tell between SHA256 and SHA512 that the cost is linear to the hash size they compute.
None of the good password hashing algorithms look great on benchmark because that’s by design, and only Argon2 meaningfully increases the speed from multiple cores starting at 4, but its memory is insane compared to others.
Based on the recommended parameters, PBKDF2 with SHA512 seems to strike a good balance between speed and memory usage.

Why does it have to be hashed? Can’t encryption do it?

Encryption is better than plaintext, but not so much and still goes against the least privilege principle because the persistent storage does not need to know the users’ passwords at all. In the event of the database being compromised, likely your encryption secret is being exposed too, making your encrypted passwords effectively plaintext to the hackers.

Welcome to the reality shit show!

You should have a good understanding about how to securely handle your users’ passwords, but does this worry you in any way?

If not, then you should! Because you should always use a different randomly generated password for every website you have an account with, you never know how they handle your password in their persistent storage, and if you share passwords across different services, and when one gets compromised, it would put all your accounts in other services in danger! The best you can do is to get yourself a proper password manager.

Identity, Authenticity, and Security

Discussion about this post