[Crypto] Birthday attack: yet another explanation

For this example, I devised my own hash function. It's SHA2-224, but I use only 32 bits. Quick question, what this code will print?

#!/usr/bin/env python3
import hashlib
from collections import defaultdict

def my_hash(s):
    m = hashlib.sha224()
    m.update(s.encode('utf-8'))
    return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes

hashes=defaultdict(int)

for ctr in range(0x10000): # 16 bits
    tmp=my_hash(str(ctr))
    hashes[tmp]=hashes[tmp]+1

for h in hashes:
    if hashes[h]>1:
        print (h, hashes[h])

It will...

166942807 2

So, you have to enumerate \( 2^{16} \) random inputs so that two 32-bit hash outputs will somewhere collide. This is very counter-intuitive.

Now let's state this: there is a string that contains something like "random string|salt or padding". Your task is to create two such strings. First must be "Hello, world!", second "Goodbye, cruel world...". You can add any salt/padding, but 32-bit hashes of these strings must be the same.

Tell this to a random guy at bus stop and he will say that you probably have to enumerate all 32-bit values in order to find a collision. And this wrong.

#!/usr/bin/env python3
import hashlib, math

hash_called=0

def my_hash(s):
    global hash_called
    hash_called=hash_called+1
    m = hashlib.sha224()
    m.update(s.encode('utf-8'))
    return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes

hashes={}

s1="Hello, world!"
s2="Goodbye, cruel world..."

ctr=0
while True:
    # hash both strings

    s=s1+"|"+str(ctr)
    h=my_hash(s)
    # stash this hash in hashes dictionary for future use
    hashes[h]=s

    s=s2+"|"+str(ctr)
    h=my_hash(s)
    if h in hashes:
        print ("collision found")
        print ("hash:", h)
        print ("first string: ["+s+"]")
        print ("second string: ["+hashes[h]+"]")
        print ("hash_called:", hash_called)
        print ("binlog(hash_called): %2.3f" % math.log(hash_called, 2))
        exit(0)
    ctr=ctr+1

That code need to invoke hashing function only 240326 times, and this is ~18 bits instead of 32.

collision found
hash: 237048764
first string: [Goodbye, cruel world...|120162]
second string: [Hello, world!|97153]
hash_called: 240326
binlog(hash_called): 17.875

Let's go further.

--- 32.py       2022-05-15 22:30:44.511264522 +0300
+++ 40.py       2022-05-15 22:30:48.311304071 +0300
@@ -8,7 +8,7 @@
     hash_called=hash_called+1
     m = hashlib.sha224()
     m.update(s.encode('utf-8'))
-    return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes
+    return int.from_bytes(m.digest()[0:5], byteorder='big') # 40-bit, 5 bytes

 hashes={}

The result:

collision found
hash: 723356978576
first string: [Goodbye, cruel world...|1977920]
second string: [Hello, world!|1339750]
hash_called: 3955842
binlog(hash_called): 21.916

Further:

--- 40.py       2022-05-15 22:30:48.311304071 +0300
+++ 48.py       2022-05-15 22:30:52.343346033 +0300
@@ -8,7 +8,7 @@
     hash_called=hash_called+1
     m = hashlib.sha224()
     m.update(s.encode('utf-8'))
-    return int.from_bytes(m.digest()[0:5], byteorder='big') # 40-bit, 5 bytes
+    return int.from_bytes(m.digest()[0:6], byteorder='big') # 48-bit, 6 bytes

 hashes={}
collision found
hash: 203312475864873
first string: [Goodbye, cruel world...|10179356]
second string: [Hello, world!|3772087]
hash_called: 20358714
binlog(hash_called): 24.279

You can see some regularity here. During bruteforce, you have enumerate \( 2^{half\_of\_bits} \). Actually, this is \( 2 \cdot \sqrt{search\_space} \).

And there is no remedy against this. All hash functions and symmetrical encryption algorithms are vulnerable. Remember MD5 (outdated long ago), which output is 128 bits. Even if it will still be robust, you'll have to enumerate \( 2^{64} \) inputs. This is doable on a powerful cluster.

If SHA1 wouldn't be cracked, this is 160 bits / 2 = 80 bits. Hard, but it's widely considered that organizations like NSA may have such power.

The only solution is to use hashes with bigger outputs and keep in mind that attacker can try to enumerate all inputs, square root of search space.

And what is in the name? This astonishes youngsters, when they enrolled first time to school or kindergarten. They found coincidences when list their birthdays. Only 23 random persons in groups is enough to have a probability of 50% that there is a pair with the same birthday. For a group of 50-60 random persons, it's almost 100%.


(Update at May-2023.)

This applicable not only to cryptographic hash functions:

I wrote an app that kept track of things by their CRC values for tracking items. We needed to track potentially millions of items and we started with a CRC32 which in real world practice has a collision rate of around 1 in 2^16 which was an unpleasant surprise. We then re-coded to use a CRC64 which had a real world collision rate of about 1 in 2^23. We tested this after the unpleasant surprise of the 32 bit one we started with and accepted the small error rate of the 64 bit one.

( src )

(the post first published at 20220515.)


List of my other blog posts.

Subscribe to my news feed

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.