Another example. Here I published a list of 891,190 unique RSA-2k moduli, in decimal form. How well can xz compress it? The final size is 242704140 bytes. Let's divide:
% bc
scale=3
242704140/891190
272.337
272*8
2176
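Yes, each compressed number (modulus) takes 2176 bits, only about 6% more than its ~2048 bits. RSA moduli exhibit a high level of entropy, so there is almost nothing else to squeeze out. In other words, xz compresses numbers in decimal form almost perfectly.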
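The same arithmetic as a small Python sketch. The counts are taken from the text above. The assumption that every modulus is exactly 2048 bits long and close to uniformly random is a simplification, and the function names are made up for illustration.

```python
import math

MODULI = 891_190                # number of moduli in the list (from the text)
COMPRESSED_BYTES = 242_704_140  # size of the xz output (from the text)
MODULUS_BITS = 2048             # assumption: every modulus is exactly this long


def decimal_digits(bits: int) -> int:
    """Decimal digits of 2**(bits - 1), the smallest number with `bits` bits."""
    return math.floor((bits - 1) * math.log10(2)) + 1


def compressed_bytes_per_modulus() -> float:
    """Average size of one modulus after compression."""
    return COMPRESSED_BYTES / MODULI


def overhead_vs_raw_bits(bits: int = MODULUS_BITS) -> float:
    """How much larger the compressed form is than bits/8 raw bytes."""
    return compressed_bytes_per_modulus() / (bits / 8) - 1


if __name__ == "__main__":
    print(f"decimal digits per modulus:   {decimal_digits(MODULUS_BITS)}")
    print(f"compressed bytes per modulus: {compressed_bytes_per_modulus():.3f}")
    print(f"overhead over {MODULUS_BITS // 8} raw bytes:  {overhead_vs_raw_bits():.1%}")
```

So a modulus that occupies more than 600 characters as decimal text shrinks to within a few percent of the 256 bytes it would take in raw binary form.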
What about a random stream converted to base64 and then compressed?
% dd if=/dev/random of=tmp bs=1024M count=10
% stat tmp
File: tmp
Size: 335544310
% base64 tmp > tmp.base64
% stat tmp.base64
File: tmp.base64
Size: 453279159
% xz tmp.base64
% stat tmp.base64.xz
File: tmp.base64.xz
Size: 348321524
Almost the same size as the original high-entropy file (about 4% larger).
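The experiment can be repeated on a smaller scale without touching the disk. This is only a sketch: it uses Python's `lzma` module (which produces the .xz format) instead of the xz command-line tool, `os.urandom` instead of /dev/random, and a 1 MiB sample by default, so the exact ratios will differ somewhat from the numbers above. The function name is hypothetical.

```python
import base64
import lzma
import os


def base64_xz_ratios(n_bytes: int = 1 << 20) -> tuple[float, float]:
    """Return (base64 size / raw size, xz(base64) size / raw size)."""
    raw = os.urandom(n_bytes)
    # encodebytes() inserts a newline after every 76 output characters,
    # which matches the default line wrapping of base64(1).
    encoded = base64.encodebytes(raw)
    compressed = lzma.compress(encoded, preset=6)  # 6 is the xz default level
    return len(encoded) / len(raw), len(compressed) / len(raw)


if __name__ == "__main__":
    encoded_ratio, compressed_ratio = base64_xz_ratios()
    print(f"base64 / raw:     {encoded_ratio:.3f}")
    print(f"xz(base64) / raw: {compressed_ratio:.3f}")
```

The first ratio should come out near 1.35 (4/3 plus the newlines). The second cannot drop below 1, because random data is incompressible, so the interesting part is how close to 1 the compressor gets.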
It's as if xz has a base64 decoder and/or can recognize decimal numbers!
This can be yet another test for compression algorithms.
% echo passpass | ent
Entropy = 1.836592 bits per byte.
% echo password | ent
Entropy = 2.947703 bits per byte.
% echo pAsswORd | ent
Entropy = 2.947703 bits per byte.
% echo "l33tc0de" | ent
Entropy = 2.947703 bits per byte.
% echo "kewl_l33t_c0der" | ent
Entropy = 3.500000 bits per byte.
% echo "pA\$swORd" | ent
Entropy = 3.169925 bits per byte.
% echo "_coolpA\$sw0Rd" | ent
Entropy = 3.664498 bits per byte.
% echo "_c00lpA\$sw0Rd" | ent
Entropy = 3.467720 bits per byte. ( zero repeats )
% echo "_c00lpA\$sw0Rd%" | ent
Entropy = 3.589898 bits per byte. ( new character at the end )
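The figures above are consistent with the plain Shannon entropy of the byte frequencies, counting the trailing newline that echo appends. A minimal Python sketch of that calculation (the function name is hypothetical, and this is not how ent itself is implemented):

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-frequency distribution, in bits per byte."""
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # echo appends a newline, so it is included here as well.
    for word in (b"passpass\n", b"password\n", b"pA$swORd\n", b"kewl_l33t_c0der\n"):
        print(word, f"{entropy_bits_per_byte(word):.6f}")
```

This also explains why "password", "pAsswORd" and "l33tc0de" score the same: each has one character that occurs twice and seven that occur once. The metric sees only the frequencies, not which characters they are.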
The entropy metric is also used in the Discourse forum software to detect titles and bodies that are too short or uninformative. The corresponding configuration parameters are "body min entropy" and "title min entropy".