For this example, I devised my own hash function. It's SHA2-224, but I use only 32 bits. Quick question, what this code will print?
#!/usr/bin/env python3 import hashlib from collections import defaultdict def my_hash(s): m = hashlib.sha224() m.update(s.encode('utf-8')) return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes hashes=defaultdict(int) for ctr in range(0x10000): # 16 bits tmp=my_hash(str(ctr)) hashes[tmp]=hashes[tmp]+1 for h in hashes: if hashes[h]>1: print (h, hashes[h])
It will...
166942807 2
So, you have to enumerate \( 2^{16} \) random inputs so that two 32-bit hash outputs will somewhere collide. This is very counter-intuitive.
Now let's state this: there is a string that contains something like "random string|salt or padding". Your task is to create two such strings. First must be "Hello, world!", second "Goodbye, cruel world...". You can add any salt/padding, but 32-bit hashes of these strings must be the same.
Tell this to a random guy at bus stop and he will say that you probably have to enumerate all 32-bit values in order to find a collision. And this wrong.
#!/usr/bin/env python3 import hashlib, math hash_called=0 def my_hash(s): global hash_called hash_called=hash_called+1 m = hashlib.sha224() m.update(s.encode('utf-8')) return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes hashes={} s1="Hello, world!" s2="Goodbye, cruel world..." ctr=0 while True: # hash both strings s=s1+"|"+str(ctr) h=my_hash(s) # stash this hash in hashes dictionary for future use hashes[h]=s s=s2+"|"+str(ctr) h=my_hash(s) if h in hashes: print ("collision found") print ("hash:", h) print ("first string: ["+s+"]") print ("second string: ["+hashes[h]+"]") print ("hash_called:", hash_called) print ("binlog(hash_called): %2.3f" % math.log(hash_called, 2)) exit(0) ctr=ctr+1
That code need to invoke hashing function only 240326 times, and this is ~18 bits instead of 32.
collision found hash: 237048764 first string: [Goodbye, cruel world...|120162] second string: [Hello, world!|97153] hash_called: 240326 binlog(hash_called): 17.875
Let's go further.
--- 32.py 2022-05-15 22:30:44.511264522 +0300 +++ 40.py 2022-05-15 22:30:48.311304071 +0300 @@ -8,7 +8,7 @@ hash_called=hash_called+1 m = hashlib.sha224() m.update(s.encode('utf-8')) - return int.from_bytes(m.digest()[0:4], byteorder='big') # 32-bit, 4 bytes + return int.from_bytes(m.digest()[0:5], byteorder='big') # 40-bit, 5 bytes hashes={}
The result:
collision found hash: 723356978576 first string: [Goodbye, cruel world...|1977920] second string: [Hello, world!|1339750] hash_called: 3955842 binlog(hash_called): 21.916
Further:
--- 40.py 2022-05-15 22:30:48.311304071 +0300 +++ 48.py 2022-05-15 22:30:52.343346033 +0300 @@ -8,7 +8,7 @@ hash_called=hash_called+1 m = hashlib.sha224() m.update(s.encode('utf-8')) - return int.from_bytes(m.digest()[0:5], byteorder='big') # 40-bit, 5 bytes + return int.from_bytes(m.digest()[0:6], byteorder='big') # 48-bit, 6 bytes hashes={}
collision found hash: 203312475864873 first string: [Goodbye, cruel world...|10179356] second string: [Hello, world!|3772087] hash_called: 20358714 binlog(hash_called): 24.279
You can see some regularity here. During bruteforce, you have enumerate \( 2^{half\_of\_bits} \). Actually, this is \( 2 \cdot \sqrt{search\_space} \).
And there is no remedy against this. All hash functions and symmetrical encryption algorithms are vulnerable. Remember MD5 (outdated long ago), which output is 128 bits. Even if it will still be robust, you'll have to enumerate \( 2^{64} \) inputs. This is doable on a powerful cluster.
If SHA1 wouldn't be cracked, this is 160 bits / 2 = 80 bits. Hard, but it's widely considered that organizations like NSA may have such power.
The only solution is to use hashes with bigger outputs and keep in mind that attacker can try to enumerate all inputs, square root of search space.
And what is in the name? This astonishes youngsters, when they enrolled first time to school or kindergarten. They found coincidences when list their birthdays. Only 23 random persons in groups is enough to have a probability of 50% that there is a pair with the same birthday. For a group of 50-60 random persons, it's almost 100%.
(Update at May-2023.)
This applicable not only to cryptographic hash functions:
I wrote an app that kept track of things by their CRC values for tracking items. We needed to track potentially millions of items and we started with a CRC32 which in real world practice has a collision rate of around 1 in 2^16 which was an unpleasant surprise. We then re-coded to use a CRC64 which had a real world collision rate of about 1 in 2^23. We tested this after the unpleasant surprise of the 32 bit one we started with and accepted the small error rate of the 64 bit one.
( src )
Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.