LinkedIn, eHarmony - Analysis of over 10 million leaked passwords

It is currently high season for password hacking. The victims, among them LinkedIn and eHarmony, each have millions of users. Speculation holds that the same security exploit may have enabled the hacking of these three popular social sites within a window of days, but details are unknown at the moment. This post introduces the notion of hashes and provides an analysis of the LinkedIn passwords, with an outlook on a comparative analysis of the three obtained datasets. Shuffled, anonymous data excerpts have been uploaded to my Data-Hub.

In the 1970s, computer administrators figured that storing raw user passwords in databases is a bad thing to do, and started to hash the passwords instead. Hashing is a one-way process that uses a so-called hash function: computing the hash from the input is easy, while recovering the input from the hash is meant to be infeasible. Different hash functions exist for different purposes. An InChI Key, for example, attempts to yield maximum dissimilarity for a set of strings which characterize chemical compounds. For passwords, by contrast, a hash function should be hard to break, whether by brute force or via algorithmic weaknesses, and it may have to compensate for intrinsic weaknesses of the provided data, such as notoriously short user passwords.
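As a minimal illustration of what such a hash looks like, the following Python sketch computes an unsalted SHA1 digest with the standard library's hashlib module. The function name sha1_hex and the sample passwords are illustrative choices, not something taken from the leaked data.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Return the unsalted SHA1 digest of a password as 40 hex characters."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    digest = sha1_hex("password")
    print(digest)       # always the same 40 hex characters for the same input
    print(len(digest))  # a SHA1 digest is 160 bits, i.e. 40 hex characters
    # A one-character change in the input yields an unrelated digest.
    print(sha1_hex("Password") == digest)
```

Because the function is deterministic and unsalted, two users with the same password end up with the same hash, which is what makes precomputed tables effective.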

The increase in data and computing power has made many initially applied hash functions such as MD5 and SHA1 obsolete for password hashing, as they can nowadays be cracked within seconds to minutes, helped by vast look-up tables (LUTs) of precomputed hashes and their corresponding passwords. Such tables now easily cover 20% or more of all user passwords. This means that a fifth or more of the entire user base of a major site can be exposed immediately after a hack! In many cases, the creaking dinosaurs of password hash functions (MD5, SHA1, ...) are used solely to provide interoperability with the databases of older applications.
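The look-up table idea can be sketched in a few lines of Python. This is a toy version under stated assumptions: the candidate list is a made-up handful of common passwords, and the helper names build_lookup_table and recover are hypothetical. Real tables hold many millions of entries.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Unsalted SHA1 digest as a hex string."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def build_lookup_table(candidates):
    """Precompute hash -> password for every candidate password."""
    return {sha1_hex(p): p for p in candidates}


def recover(leaked_hashes, table):
    """Return hash -> password for every leaked hash found in the table."""
    return {h: table[h] for h in leaked_hashes if h in table}


if __name__ == "__main__":
    # Illustrative candidates only; not taken from any leaked dataset.
    table = build_lookup_table(["123456", "password", "linkedin", "qwerty"])
    leaked = [sha1_hex("linkedin"), sha1_hex("correct horse battery staple")]
    recovered = recover(leaked, table)
    print(f"recovered {len(recovered)} of {len(leaked)} hashes")
```

Each recovery is a single dictionary look-up, so the cost of the attack lies almost entirely in building the table once.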

In any case, a hash should increase the overall entropy of the stored data, as can be clearly seen in Fig 1.1, computed over a 10 MB subset of the LinkedIn database (10,000,000 / 40 = 250,000 passwords, at 40 hexadecimal characters per SHA1 hash).
Fig 1.1 Showing the increased, yet evened-out entropy, regardless of the application of weak (i.e. not computationally demanding) hash functions such as unsalted SHA1 in LinkedIn's case. Update: The command-line tool's applicability to password hash files is further elaborated here.
In comparison, the raw passwords feature an accumulated entropy of roughly 0.3, varying widely over the entire recovered raw-password database.
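The exact entropy measure behind Fig 1.1 is not spelled out in this post. One plausible reading is the Shannon entropy of the character frequencies, which the following Python sketch computes in bits per character. Treat it as an assumption about the method, not a description of the tool that produced the figure; the function name shannon_entropy is illustrative.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(shannon_entropy("aaaa"))  # a single repeated symbol carries no information
    print(shannon_entropy("abab"))  # two equally likely symbols
    print(shannon_entropy("abcd"))  # four equally likely symbols
```

Under this measure, hex digests tend toward an even spread over their 16 symbols, whereas raw passwords concentrate on a few frequent characters.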

How to obtain raw passwords from hashes:
Fig 1.2 Showing the graphical user interface of Hashcat, an application used for rapidly cracking password hash datasets

Hackers typically leak password hashes without usernames on platforms such as Pastebin and Dropbox, making them available for password analysis by third parties. Many programs facilitate recovering the original user passwords from weak password hashes. Hashcat is one of the best of these programs and provides a cross-platform graphical user interface (GUI). Pipal is a password-analysis application, with GUIs available as well; it was used for the analysis of the 280 MB LinkedIn dataset. In response to the recent LinkedIn break-in, Hashcat released a special version to crack the leaked LinkedIn password hashes.

Why analyze?
Password analysis over large datasets enables researchers, application developers and user interface designers to come up with better means of helping the user toward the long-term goal of raising security. Additionally, since information is the stepping stone, users can act on the information provided, be it health or security related. This notion of information awareness is the self-declared modus operandi of some hacker groups such as LulzSec. Security is becoming increasingly important in a rapidly expanding social web environment, where more and more personal information is distributed across various web resources, often without the user having any control over how the data is disseminated.
Most users memorize only one or two passwords of arguable length and strength, which they then reuse across many web site registrations. Additionally, OAuth login procedures, in which a third party such as Facebook or Google vouches for the user's identity, could potentially worsen the security problem if data leaks through these central providers, either directly or through man-in-the-middle attacks.

This recovered about 40% of the zeroed passwords, or 1.4 million out of the 3.7 million. That's to be expected, as these are passwords that the hacker already found, so should be easily found by us. - Robert Graham

Prior to my post, Robert Graham performed an analysis of the LinkedIn passwords, which is available here.
Using Hashcat, roughly 20% of the 6.5 million leaked hashes could be recovered.

Fig 1.3 Password length. Many web sites no longer allow passwords shorter than 8 characters, which may explain the popularity of 8-character passwords
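A length distribution like the one in Fig 1.3 can be tallied from a list of recovered passwords in a few lines of Python. This is a sketch, not the Pipal implementation; the function name cumulative_length_share and the sample list are made up for illustration.

```python
from collections import Counter


def cumulative_length_share(passwords):
    """Map each length to the share of passwords of that length or shorter."""
    counts = Counter(len(p) for p in passwords)
    total = sum(counts.values())
    shares = {}
    running = 0
    for length in sorted(counts):
        running += counts[length]
        shares[length] = running / total
    return shares


if __name__ == "__main__":
    # Illustrative sample, not real leaked passwords.
    sample = ["abc123", "password", "qwerty12", "letmein", "correcthorse"]
    for length, share in cumulative_length_share(sample).items():
        print(f"<= {length:2d} characters: {share:.0%}")
```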

As shown in Fig 1.3, 20% of users have passwords between one and six characters long, and 70% of passwords are between one and eight characters long. In other words, only 30% of users have a password of more than 8 characters, long enough to computationally offset brute-force password attacks. These are attacks wherein generated candidate passwords are hashed and compared with the target user hashes of the dataset over a given search space. Passwords of six characters shouldn't be allowed by any security-conscious site, as they are notoriously easy to crack in these times of abundant computing power. Although it is possible to make short passwords more secure through a process known to cryptographers as key stretching, this method is infrequently encountered in web programming.
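To make the brute-force argument concrete, the following Python sketch counts the candidates in a search space and runs a naive exhaustive search against an unsalted SHA1 hash. It is a toy illustration: dedicated crackers such as Hashcat are vastly faster, and the helper names search_space and brute_force are hypothetical.

```python
import hashlib
import itertools
import string


def search_space(alphabet_size: int, max_length: int) -> int:
    """Number of candidate passwords of length 1..max_length."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


def brute_force(target_hash: str, alphabet: str, max_length: int):
    """Hash every candidate up to max_length; return the match or None."""
    for length in range(1, max_length + 1):
        for combo in itertools.product(alphabet, repeat=length):
            candidate = "".join(combo)
            digest = hashlib.sha1(candidate.encode("utf-8")).hexdigest()
            if digest == target_hash:
                return candidate
    return None


if __name__ == "__main__":
    lower_alnum = string.ascii_lowercase + string.digits  # 36 symbols
    # Every additional character multiplies the work by the alphabet size.
    print(search_space(len(lower_alnum), 6))
    print(search_space(len(lower_alnum), 8))
    target = hashlib.sha1(b"abc").hexdigest()
    print(brute_force(target, string.ascii_lowercase, 3))
```

Going from six to eight lowercase alphanumeric characters multiplies the search space by roughly 36 squared, which is why the length threshold matters so much.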

Fig 1.4 Digits at the end of user passwords are popular. Single digit: 12%, two digits: 21%, three digits: 7%, other: 60%

User passwords often have digits at the end or beginning. In LinkedIn's case, the most frequent digit is 1 (12%), followed by 3 (7%), 2 (6%) and 0 (5%), with all remaining digits distributed at roughly 4% each, when considering the passwords ending in three digits. This is consistent with previous analyses.
In any case, the clear winner in popularity is the sequence 12345... among digit-sequence lengths of 0 to 8.
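Counting how many digits a password ends with, as summarized in Fig 1.4, is straightforward with a regular expression. The following Python sketch shows one way to do it; it is not how Pipal computes its statistics, and the function name trailing_digit_lengths and the sample list are illustrative.

```python
import re
from collections import Counter

# One or more digits anchored at the very end of the string.
TRAILING_DIGITS = re.compile(r"\d+\Z")


def trailing_digit_lengths(passwords):
    """Count passwords by the number of digits they end with (0 = none)."""
    counts = Counter()
    for password in passwords:
        match = TRAILING_DIGITS.search(password)
        counts[len(match.group()) if match else 0] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample, not real leaked passwords.
    sample = ["linkedin1", "summer12", "monkey", "john123", "12345"]
    for digits, count in sorted(trailing_digit_lengths(sample).items()):
        print(f"{digits} trailing digit(s): {count}")
```

Note that a password consisting only of digits counts as ending in as many digits as it has characters.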

Fig 1.5 User password character sets, with the most popular option being lowercase alphanumeric characters.
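A character-set breakdown such as the one in Fig 1.5 can be approximated by labelling each password with the character classes it contains. The Python sketch below uses made-up labels ("lower", "upper", "digit", "other") that do not necessarily match the categories of the chart; the function name classify_charset is hypothetical.

```python
from collections import Counter


def classify_charset(password: str) -> str:
    """Label a password by the character classes it contains."""
    classes = []
    if any(c.islower() for c in password):
        classes.append("lower")
    if any(c.isupper() for c in password):
        classes.append("upper")
    if any(c.isdigit() for c in password):
        classes.append("digit")
    if any(not c.isalnum() for c in password):
        classes.append("other")
    return "+".join(classes) if classes else "empty"


if __name__ == "__main__":
    # Illustrative sample, not real leaked passwords.
    sample = ["linkedin1", "monkey", "Summer12", "p@ssw0rd", "123456"]
    for label, count in Counter(map(classify_charset, sample)).most_common():
        print(f"{label}: {count}")
```

In this labelling, "lower+digit" corresponds to the lowercase alphanumeric category.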

Further reading and Resources

6.5 Million Encrypted LinkedIn Passwords Leaked Online 
LinkedIn Password Leak: Salt Their Hide
Searchable Database of LinkedIn Password hashes

Update: Read this post on how to securely look up your own hash.

Several sites offer advice on how vendors can improve user-password security. To be expanded....
Best way to store password in database
Internet Security: SQL Injection Attacks

Conclusive remarks
Sony saw its share of password leakage, and the subsequent blame and community retribution, almost two years ago. Since then, numerous other companies have fallen victim to data breaches. Password hashes based on the canonical hash algorithms (MD5, SHA1) no longer guarantee security, as vast look-up tables and so-called precomputed rainbow tables have become available, all in light of a generally narrow spectrum of user passwords. User security awareness, user interface designs which encourage security, and new cryptographic means are necessary to improve upon the status quo.
Web programmers are advised to use md5crypt or a similar accepted algorithm or scheme specifically designed for hashing passwords. For interoperability with older datasets using MD5 (e.g. phpBB), it is recommended to rehash the passwords or password hashes in their entirety.
Time constraints currently prevent me from showing you a comparative analysis, but I can roughly say that LinkedIn users are slightly above average in their use of 8-character passwords and of lowercase alphanumeric passwords, as opposed to lowercase passwords without numbers or special characters. I still intend to provide a comparative analysis of all three datasets, including those from LinkedIn and eHarmony, in the future.
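As a hedged sketch of what a scheme designed for password hashing looks like, the following Python code uses PBKDF2 from the standard library's hashlib module, with a random per-user salt and a work factor for key stretching. PBKDF2 is swapped in here as a stand-in because it ships with Python; it is not md5crypt. The iteration count and all function names are assumptions made for illustration. The last two helpers show one way to rehash legacy MD5 hashes without knowing the original passwords.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune it to your hardware


def hash_password(password: str, salt: bytes = None):
    """Derive a salted, stretched hash; returns (salt, derived_key)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, derived


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    _, derived = hash_password(password, salt)
    return hmac.compare_digest(derived, expected)


def wrap_legacy_md5(md5_hex: str):
    """Rehash a stored MD5 hex digest without knowing the password."""
    return hash_password(md5_hex)


def verify_wrapped(password: str, salt: bytes, expected: bytes) -> bool:
    """Verify a login against a wrapped legacy MD5 hash."""
    md5_hex = hashlib.md5(password.encode("utf-8")).hexdigest()
    return verify_password(md5_hex, salt, expected)


if __name__ == "__main__":
    salt, stored = hash_password("hunter2")
    print(verify_password("hunter2", salt, stored))
    print(verify_password("wrong guess", salt, stored))
```

The salt defeats precomputed tables, since every user's hash has to be attacked separately, and the iteration count makes each brute-force guess correspondingly more expensive.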

I leave you with LinkedIn's and eHarmony's official Twitter responses, without further comment: