[RevEng][Math] Data compression and entropy, part II

(Previously.)

Another example. I published a list of 891,190 unique RSA-2k moduli, in decimal form. How well can xz compress it? The final size is 242704140 bytes. Let's divide:

% bc

scale=3
242704140/891190
272.337

272*8
2176

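Yes, each number (modulus) has ~2048 bits, and the compressed file spends about 2176 bits on each, roughly 6% more. RSA moduli exhibit a high level of entropy, so there is little else to squeeze out. In other words, xz compresses numbers in decimal form almost perfectly.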
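The arithmetic above can be redone in a few lines of Python. This is only a sketch: it assumes each modulus carries close to 2048 bits of entropy (slightly less in reality, since the top and bottom bits are fixed) and that each one is stored as decimal digits plus a newline. The variable names are mine.

```python
import math

compressed_size = 242704140   # bytes, size of the xz output quoted in the text
moduli_count = 891190         # number of moduli in the list

bytes_per_modulus = compressed_size / moduli_count
ideal_bytes = 2048 / 8        # assumption: ~2048 bits of entropy per modulus

# Number of decimal digits in a 2048-bit number, plus one byte
# for the line terminator (assumed file layout).
decimal_digits = len(str(2**2047))
raw_bytes = decimal_digits + 1

overhead = bytes_per_modulus / ideal_bytes - 1

print(f"compressed: {bytes_per_modulus:.3f} bytes per modulus")
print(f"ideal:      {ideal_bytes:.0f} bytes per modulus")
print(f"raw text:   {raw_bytes} bytes per modulus")
print(f"overhead over ideal: {overhead:.1%}")
print(f"bits carried by one decimal digit: {math.log2(10):.3f}")
```

Each decimal digit takes 8 bits on disk but carries only about 3.32 bits of information, and that gap is what xz removes.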

What about a random stream converted to base64 and then compressed?

% dd if=/dev/random of=tmp bs=1024M count=10

% stat tmp
  File: tmp
  Size: 335544310

(The file is far smaller than 10 GiB, apparently because dd got short reads: on Linux a single read from /dev/random seems to return at most 33554431 bytes, and 10 such reads give exactly 335544310 bytes.)

% base64 tmp > tmp.base64

% stat tmp.base64
  File: tmp.base64
  Size: 453279159

% xz tmp.base64

% stat tmp.base64.xz
  File: tmp.base64.xz
  Size: 348321524

Almost the same size as the original high-entropy file: 348321524 vs. 335544310 bytes, or about 3.8% larger.
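The sizes above can also be checked by hand. The following Python sketch assumes the base64 tool wraps its output at 76 characters and ends every line with a newline (the GNU coreutils default, as far as I know). The function name is made up for illustration.

```python
def base64_size(n_bytes, wrap=76):
    """Size of base64 output for n_bytes of input, wrapped at `wrap` columns."""
    chars = 4 * ((n_bytes + 2) // 3)          # every 3 input bytes become 4 characters
    newlines = (chars + wrap - 1) // wrap     # one newline terminates each output line
    return chars + newlines

original = 335544310      # size of tmp
encoded = 453279159       # size of tmp.base64
compressed = 348321524    # size of tmp.base64.xz

print("predicted base64 size matches:", base64_size(original) == encoded)
print(f"xz output vs. original: {compressed / original:.4f}")
print(f"xz output vs. base64:   {compressed / encoded:.4f}")
print(f"best possible vs. base64: {original / encoded:.4f}")
```

A base64 character occupies 8 bits but carries only 6, so a compressor that models the symbol frequencies well can win back close to a quarter of the encoded size without knowing anything about base64 itself.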

It's as if xz had a base64 decoder and/or could recognize decimal numbers!

This can be yet another test for compression algorithms.
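As a sketch of what such a test could look like, the Python snippet below feeds base64-encoded random bytes to the three compressors in the standard library and reports compressed size divided by the size of the raw random input. A ratio close to 1.0 means the compressor removed nearly all of the base64 overhead. The function name and the 1 MiB default are my own choices, not part of the original experiment.

```python
import base64
import bz2
import lzma
import os
import zlib

def compression_ratios(n_bytes=1 << 20):
    """Compress base64-encoded random bytes; return compressed size / raw size."""
    raw = os.urandom(n_bytes)              # high-entropy input
    encoded = base64.encodebytes(raw)      # base64, wrapped at 76 columns
    compressors = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma (xz)": lzma.compress,
    }
    return {name: len(fn(encoded)) / n_bytes for name, fn in compressors.items()}

if __name__ == "__main__":
    for name, ratio in compression_ratios().items():
        print(f"{name:10s} {ratio:.3f}")
```

The same harness could be pointed at decimal or hexadecimal text instead of base64 by swapping the encoding step.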


Entropy level can be a quick-n-dirty metric of how good your password is.

% echo passpass | ent
Entropy = 1.836592 bits per byte.

% echo password | ent
Entropy = 2.947703 bits per byte.

% echo pAsswORd | ent
Entropy = 2.947703 bits per byte.

% echo "l33tc0de" | ent
Entropy = 2.947703 bits per byte.

% echo "kewl_l33t_c0der" | ent
Entropy = 3.500000 bits per byte.

% echo "pA\$swORd" | ent
Entropy = 3.169925 bits per byte.

% echo "_coolpA\$sw0Rd" | ent
Entropy = 3.664498 bits per byte.

% echo "_c00lpA\$sw0Rd" | ent
Entropy = 3.467720 bits per byte.
(lower, because the zero now repeats)

% echo "_c00lpA\$sw0Rd%" | ent
Entropy = 3.589898 bits per byte.
(higher again, thanks to a new character at the end)
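The numbers above are consistent with plain Shannon entropy over byte frequencies, which is what ent appears to report. Here is a minimal Python sketch of that calculation. The function name is mine, and the trailing newline is added because echo emits one and it is counted too.

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-frequency distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# echo appends a newline, so it is part of what gets measured.
for password in ["passpass", "password", "pAsswORd", "kewl_l33t_c0der", "pA$swORd"]:
    data = (password + "\n").encode()
    print(f"{password:16s} {entropy_bits_per_byte(data):.6f}")
```

Since only character frequencies matter, "password" and "pAsswORd" get the same score, which is why this is a quick-n-dirty metric and nothing more.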

An entropy metric is also used in the Discourse forum software to decide whether a title or body is too short or uninformative. The configuration parameters are "body min entropy" and "title min entropy".

(This post was first published on 20220929.)

