[Crypto] Birthday attack: yet another explanation

For this example, I devised my own hash function. It's SHA2-224, but I use only 32 bits. Quick question, what this code will print?

#!/usr/bin/env python3
import hashlib
from collections import defaultdict

def my_hash(s):
    m = hashlib.sha224()
    m.update(s.encode('utf-8'))
    return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes

hashes=defaultdict(int)

for ctr in range(0x10000): # 16 bits
    tmp=my_hash(str(ctr))
    hashes[tmp]=hashes[tmp]+1

for h in hashes:
    if hashes[h]>1:
        print (h, hashes[h])

It will...

166942807 2

So, you have to enumerate \( 2^{16} \) random inputs so that two 32-bit hash outputs will somewhere collide. This is very counter-intuitive.

Now let's state this: there is a string that contains something like "random string|salt or padding". Your task is to create two such strings. First must be "Hello, world!", second "Goodbye, cruel world...". You can add any salt/padding, but 32-bit hashes of these strings must be the same.

Tell this to a random guy at bus stop and he will say that you probably have to enumerate all 32-bit values in order to find a collision. And this wrong.

#!/usr/bin/env python3
import hashlib, math

hash_called=0

def my_hash(s):
    global hash_called
    hash_called=hash_called+1
    m = hashlib.sha224()
    m.update(s.encode('utf-8'))
    return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes

hashes={}

s1="Hello, world!"
s2="Goodbye, cruel world..."

ctr=0
while True:
    # hash both strings

    s=s1+"|"+str(ctr)
    h=my_hash(s)
    # stash this hash in hashes dictionary for future use
    hashes[h]=s

    s=s2+"|"+str(ctr)
    h=my_hash(s)
    if h in hashes:
        print ("collision found")
        print ("hash:", h)
        print ("first string: ["+s+"]")
        print ("second string: ["+hashes[h]+"]")
        print ("hash_called:", hash_called)
        print ("binlog(hash_called): %2.3f" % math.log(hash_called, 2))
        exit(0)
    ctr=ctr+1

That code need to invoke hashing function only 240326 times, and this is ~18 bits instead of 32.

collision found
hash: 237048764
first string: [Goodbye, cruel world...|120162]
second string: [Hello, world!|97153]
hash_called: 240326
binlog(hash_called): 17.875

Let's go further.

--- 32.py       2022-05-15 22:30:44.511264522 +0300
+++ 40.py       2022-05-15 22:30:48.311304071 +0300
@@ -8,7 +8,7 @@
     hash_called=hash_called+1
     m = hashlib.sha224()
     m.update(s.encode('utf-8'))
-    return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes
+    return int.from_bytes(m.digest()[0:5], byteorder='big') # 40-bit, 5 bytes

 hashes={}

The result:

collision found
hash: 723356978576
first string: [Goodbye, cruel world...|1977920]
second string: [Hello, world!|1339750]
hash_called: 3955842
binlog(hash_called): 21.916

Further:

--- 40.py       2022-05-15 22:30:48.311304071 +0300
+++ 48.py       2022-05-15 22:30:52.343346033 +0300
@@ -8,7 +8,7 @@
     hash_called=hash_called+1
     m = hashlib.sha224()
     m.update(s.encode('utf-8'))
-    return int.from_bytes(m.digest()[0:5], byteorder='big') # 40-bit, 5 bytes
+    return int.from_bytes(m.digest()[0:6], byteorder='big') # 48-bit, 6 bytes

 hashes={}
collision found
hash: 203312475864873
first string: [Goodbye, cruel world...|10179356]
second string: [Hello, world!|3772087]
hash_called: 20358714
binlog(hash_called): 24.279

You can see some regularity here. During bruteforce, you have enumerate \( 2^{half\_of\_bits} \). Actually, this is \( 2 \cdot \sqrt{search\_space} \).

And there is no remedy against this. All hash functions and symmetrical encryption algorithms are vulnerable. Remember MD5 (outdated long ago), which output is 128 bits. Even if it will still be robust, you'll have to enumerate \( 2^{64} \) inputs. This is doable on a powerful cluster.

If SHA1 wouldn't be cracked, this is 160 bits / 2 = 80 bits. Hard, but it's widely considered that organizations like NSA may have such power.

The only solution is to use hashes with bigger outputs and keep in mind that attacker can try to enumerate all inputs, square root of search space.

And what is in the name? This astonishes youngsters, when they enrolled first time to school or kindergarten. They found coincidences when list their birthdays. Only 23 random persons in groups is enough to have a probability of 50% that there is a pair with the same birthday. Indeed, \( \sqrt{365} \approx 19 \). For a group of 50-60 random persons, it's almost 100%.

(the post first published at 20220515.)


List of my other blog posts.

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.