With the increasing popularity of blog platforms like WordPress or TypePad, security has become a major issue. This is the first of a two part series of posts I’m writing about password security schemes. In this post I’m going to introduce some cryptography notions and some general approaches to password security.
First of all, I’m going to explain how do most password schemes work. A user creates an account on the application (let it be web or desktop) giving, among other details, a user name and a password. The application then takes this user name and password and stores it, usually, in a database. At this point there are two approaches. The first one,
and the easiest, is to store the user name and password as is in the database. A password that hasn’t experience any transformation, like in this case, is called a clear text password. Hence the method is called, clear text password storage. It’s an easy method, as you just have to retrieve the password for a given user and compare it, character by character with the one a user is giving you as part of his login. If they match, the user is allowed in the system. The problem with this method is that if anyone has access to the database where the passwords are stored, either via a legit way or not, it’s game over. The second method tries to avoid the previous problem with the use of a hash function. And what is a hash function? Well, it’s a mathematical function, used extensibly in cryptography. What it does is, it takes an input, let it be a string, and produces a unique fixed length string of gibberish which receives the name of hash. The key is that first, for the same string, it produces the same output (theoretically), and second, it’s a one way function, meaning that it’s very easy to compute the hash given an input, but it’s computationally unfeasible to retrieve the original input, given the hash. As an example:
md5sum(“AlwaysNewMistakes”) = 76dc83e4e19a1fd01bac6fbdfec92a27
You can try this at home, and you’ll see that every single time you input that string, you’ll obtain the same hash. There are several well known hash functions like md5, sha1, sha2 or blowfish. Each of them are different, and for the same input they’ll produce a different hash, so for example, an md5 hash won’t be “compatible” with an sha1 hash:
sha1sum(“AlwaysNewMistakes”) = e91972ddd17e1dd9dc3ede454fc652f3e2fe236f
Well, back to our subject. As I was saying, the second method uses a hash function to obfuscate the password. So, right before the password is stored in the database, it’s process through a hash function (usually md5) and instead of storing the password, it’s the hash of the password what gets stored. Let’s remember that theoretically, there can only be one hash for that password. So, when a user needs to login, the system takes the user’s password, generates a hash for it and then retrieves the one stored in the database and compares both:
IF md5(stored_password) == md5(user_input_password) THEN ACCESS!
So, with this approach, if someone tampers with the database where the passwords are stored, they won’t have the passwords, just the hashes. The intruder will need to find the string that generates that hash, and as I said before, hash functions are one way, so it’s impossible to retrieve the original string from the hash. Now, this method is way cooler than the first one, much more secure and pretty inexpensive to implement. But, as you’ve already figured, it still has some problems.
The question is, if storing the hash is so secure, then what’s all that fuzz about cracking passwords that you see in hackers movies? The truth is that, although you can’t reverse the hash to obtain the original string, you can test different strings to see if they generate the same hash. This process is what is known as password cracking (also known as password cryptanalysis). There are three approaches to eventually crack the password. The first one is called Bruteforce, and as it names indicates, it’s based on generating all possible combinations of letters, numbers and symbols and passing them through a hash function. Once a hash is obtained, it will be checked against the ones we are trying to crack:
Trying AAAA… -> md5(AAAA) -> ae5b468c7707a1f3d36c49b1fe2ef850
Checking hash: ae5b468c7707a1f3d36c49b1fe2ef850 == 76dc83e4e19a1fd01bac6fbdfec92a27 -> No match
Trying AAAB… -> md5(AAAB) -> d8063b11214a9f867d6184a8779ace6b
Checking hash: d8063b11214a9f867d6184a8779ace6b == 76dc83e4e19a1fd01bac6fbdfec92a27 -> No match
….
Trying ZZZZ… -> md5(ZZZZ) -> a2d048bcc847c4a7dc1ebfaecb27a6a0
Checking hash: a2d048bcc847c4a7dc1ebfaecb27a6a0 == 76dc83e4e19a1fd01bac6fbdfec92a27 -> No match
You get the idea. The problem with this method is that is computationally very expensive. If you don’t know the
original password length (and hash function) this process can take forever. That’s why attackers usually try a second method called Dictionary Attack. It is well known that many users tend to use non random passwords, most of them easy to guess. Taking this as a premise, we can build a dictionary of common words like “god”, “secret”, “password”, etc. and then run a little program that reads them, calculates the hash and compares it with the one from the password we are trying to crack. The Dictionary Attack has the advantage of being faster than the Bruteforce and with a higher rate of success. The drawback is that first, you need to build a dictionary and second, if that dictionary doesn’t contains the password or a derivation of it, you won’t crack it.
The third method is known as Rainbow tables and it’s an evolution from the Dictionary Attack. Computing a hash for thousands of words, as the ones in a dictionary, can be time consuming and require quite some powerful hardware. The solution? Why not precomputate all the hashes of a dictionary and store them in a table. That way, the next time you look for the hash of the word “secret” you will already have it, speeding the process of cracking a password. It’s obvious that the first time you build a Rainbow table it will take time and resources (several Gb of storage). Please note that a Rainbow table depends on the character set, the number of words and the hash algorithm that it employs. So it’s not like one Rainbow table to rule them all. Also, the lookup mechanism used in a Rainbow table is quite more complex than what I just explained, but the underlaying idea is basically the same.
Now, back to the beginning and our password schemes. Using a hash instead of the plain text password is secure, but it still can be defeated. So, that’s the part where we introduce a new concept, the salt. A salt is a value that is append to the password string before obtaining the hash value:
md5(Salt + MyPassword) = hash
Just to clarify, the salt is a value we (our application) generates and can be a random or a predefined value. The use of a salt value gives an extra protection layer. On one side it increases the complexity of a Bruteforce or Dictionary Attack against the password, as the intruder has to take the salt into account when calculating the hashes.
If the salt is an undisclosed value (read hidden value), the intruder might find it impossible to crack the password as he will be computing hash(password) instead of hash(salt + password). On another side, if the salt value is a random alphanumeric value it will increase the passwords complexity and will reduce the chances of being discovered using bruteforce or a dictionary, as it won’t match lists of common or used passwords. Finally, adding a salt value avoids the use of Rainbow tables against our passwords. Why is that? Well, as I said before, a Rainbow table has to be generated for a specific character set with an specific hash function. All the precomputed hashes stored in the table aren’t generated with a salt value, so the intruder won’t be able to use already made tables. Instead he will have to generate his own set for that specific salt value (if it’s constant), which as you’ve guessed, defeats the point of using precomputed tables.
Although the previous method is quite secure, we still have a slight problem. If we use a constant salt value, there is always a possibility of someone creating an ad hoc Rainbow table for it. This is specially true if your software is very well known. For example, if WordPress used a default salt value for their passwords, someone will most probably create a Rainbow table for it. You can argue that you can change it and you’ll be protected, nevertheless, the percentage of users that actually do that is very low. On the other side, why take the risk when you can make it better?
Now, lets say that instead of a constant salt we use a different pseudo random salt for each password. That is, for each new password we store, we generate a pseudo random salt value (I stress pseudo random as there isn’t a way, yet, to obtain truly random numbers with a computer) and store the salted hash. The question that arouses then is, how do I know which value I used when salting the passwords? The answer is that you store it with the hash. In this scenario we will do the following:
- Generate a pseudo random salt value, S
- Obtain the hash of the password: H = hash(S + password)
- Store S and H in the database
Even though we store the salt in the database, we’ve achieved an extra layer of security. With this method, it’s nearly impossible to precompute a Rainbow table and we avoid bruteforcing and dictionary attacks. The problem is that if someone breaches the database and is able to retrieve the passwords with the salt values they could, theoretically, craft a bruteforce or dictionary attack using each different salt value. To avoid this, to a certain extent, we can also use a constant hidden salt value. That is, we can hardcode a constant salt value in the configuration file of our application. That way, even if the database is leaked, they won’t have the constant salt value, rendering any possible crack attempt:
hash = md5(randomSalt + password + constantSalt)
There is a caveat, if an intruder also gains access to the application and can read the constant salt value from the configuration file, we’ll be back to square one. Nevertheless, if an intruder reaches that point, it’s already game over for your application, as it means they have access to the system where it’s installed.
For the record, there are different variations to the methods I’ve exposed. That’s the case of WordPress which processes its passwords with a double hash:
hash = md5(md5(password))
I will talk a little more about this method on the next post, but just to clarify, hashing a value twice doesn’t adds any extra security. It might render any bruteforcing or dictionary attack a little harder but nothing more.
Another important note. In the above examples I’ve been using the md5 hashing algorithm. Right now md5 has been broken. This means that there is a way of creating the same hash value with two different input strings. This is called a collision and it renders a hashing algorithm useless. As I said before, one of the key strengths of a hash is that for an input string there is only one output hash, if this doesn’t holds, then it’s useless. So, my recommendation is to use sha2 or blowfish (as sha1 is also known to have collision problems).
I hope this post has been helpful in giving a little insight into passwords implementations. I’m writing a second part, this time with real code and some problems that have been flagged in the way WordPress manages its passwords.
Idea: Embedding RSS comments in a content RSS feed
Mark Evans had a nice question on his blog some time ago and while commenting it I got an idea. One of the biggest problems (among others) about blogs, comments and RSS is that you can’t have both at the same time. That is, you can’t download one RSS feed with the content and the comments (as far as I know, so if I’m wrong, please correct me). I am aware that some feed readers can display both, but it’s not mainstream yet. On of those systems is the one on fav.or.it which I had the luck to see at FOWA in London this year. They are doing an awesome job and their interface is very very cool. Anyhow, my idea is based on something I read on the latest OPML draft specification (2.0) by Dave Winer.
Most blogs and content platforms offer two RSS feeds. One for the actual content and another one for the comments. The problem with it is that you need to subscribe to both to get them AND most probably, your feed reader will display them as separate feeds. In my opinion this is a burden and it might be one of the reasons why we see
such low ratios of comments per subscriber. I always tell the same fact, but it’s astonishing, at least for me, that a blog like TechCrunch with 165k+ subscribers gets an average of 20-30 comments per post. So, looking at the RSS 2.0 specification I remembered that there is a <comments> tag for each RSS item where, in theory, you have a link to the comments. Now, getting back to the OPML draft, Dave added a new feature which I think could allow a better comment handling. The basic tag in OPML is the <outline>. This tag might have an attribute called type, which can take several values, among them the “link” value. Quoting the OPML specification: “An outline element whose type is link must have a url attribute whose value is an http address“. In the new specification, he added what he calls inclusion. That is, if the url of an outline with link type ends in .opml, the file will be downloaded and included when displayed. I think the idea is very cool, specially if we give it a little twist. I was thinking that if instead of doing the inclusion with .opml files, we do it with an RSS file and we add that capability to the <comments> tag in the RSS specification we could have a wonderful tool to include comments in a feed.
So, putting all this together we might end up with something like this:
An RSS file of a blog (http://feeds.feedburner.com/AlwaysNewMistakes) which has items like the following one:
Now, if instead of having a comments tag like that, we mix it with the OPML notion of outline included link we could rewrite it as:
With that slight modification we can now have the number of comments we have and the xmlURL of the comment’s RSS feed for that post. What happens on the other side? Well, the feed reader must read that tag and include it as if it was part of the feed. Adding the number attribute allow us, on one side to show the readers how many comments the post has and on another hand lets the feed reader know if he needs to download the RSS feed of the comments again. Evidently, the number attribute will change with time if we have new comments and will be updated every time our feed reader requests the RSS feed.
So, what do you think of my idea? As with every thing I’m sure this issue has been discussed by someone before me and it might even have a different solution I’m unaware of. So please, if you know a cool way to handle comments drop me a comment and I’ll include it in the post.
UPDATE: This is what happens when you rush a post. You get it wrong. What I just described exists already and is known as wfw:commentRss. The only difference with what I was proposing is that the comment count is specified by a different tag (slash:comments), which by the way has nothing to do with the wfw namespace. Now, the commentRss tag was proposed in 2003. I’m quite fascinated as to why it hasn’t been adopted in feed readers yet. Some readers display the comments count, some others even parse the commentRss tag, but I don’t know yet of a reader that actually pulls that data based on the counter and collapse it with the content. Maybe you could even ask the user to activate that feature on a per post basis.