[Math][Python] Computing CRC in parallel

CRC is merely a computation of a remainder over GF(2). This is why I wrote my previous blog post: Parallelizing remainder calculation over bignum. Now I'm going to do the same, but over GF(2).

Two GF(2) operations will be used: GF(2) multiplication (also known as carry-less multiplication), and shifting a single 1 bit left by some number of bits to produce a multiplier (again, over GF(2)).

To read more about GF(2) basics, see this and this.

Let's start with the second operation: shifting a 1 left by some number of bits. First, here is a CRC32 function written in Python. It's perfectly workable, but code like this is usually not used in practice, because it's slow compared to table-based functions. However, it's better for understanding.

CRC32_POLY=0xEDB88320

def my_crc32(buf):
    crc=0xffffffff
    for c in buf:
        crc=crc^c
        for _ in range(8):
            if (crc & 1)==1:
                crc=(crc>>1) ^ CRC32_POLY # GF(2) subtraction
            else:
                crc=crc>>1
    crc=crc^0xffffffff
    return crc
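
A quick way to convince yourself this function is correct: compare it against the standard zlib implementation.

import zlib
data=b"The quick brown fox jumps over the lazy dog"
assert my_crc32(data)==zlib.crc32(data)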

I'm going to rework this function a bit. It will produce the CRC value of a bit string like '0b1000000....000' with a specified number of zeroes:

# it takes 1 and shifts it left, cnt_bits times
# this function is no different from the common (but naive, tableless) CRC32 computation function
def crc32_of_1_shifted_left(cnt_bits):
    crc=0x80000000 # low bit (1) reversed
    for i in range(cnt_bits):
        if (crc & 1)==1:
            crc=(crc>>1) ^ CRC32_POLY # GF(2) subtraction
        else:
            crc=crc>>1
    return crc
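
Quick checks: with zero shifts we get \( x^0 \) (a single 1 bit, bit-reversed), and after 32 shifts we get the polynomial constant itself, since \( x^{32} \bmod P \) is just the low 32 terms of \( P \):

assert crc32_of_1_shifted_left(0)==0x80000000 # x^0, i.e., 1
assert crc32_of_1_shifted_left(32)==CRC32_POLY # x^32 mod P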

Another function is GF(2) multiplication, which I already used in my notes:

# yes, both are XORs:
def gf2_add(x, y):
    return x^y

def gf2_sub(x, y):
    return x^y

def gf32_mul(x, y, R):
    z = 0
    for i in range(32-1, -1, -1):

        if (y >> i) & 1:
            z=gf2_add(z, x)

        # shift and also reduce by R if overflow detected
        # IOW, keep x smaller than R or modulo R
        if x & 1:
            x = gf2_sub(x >> 1, R)
        else:
            x = x >> 1

    return z

BTW, the PCLMUL x86 instruction does the same, but without the reduction stage after each iteration.
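
We can also check that gf32_mul() agrees with crc32_of_1_shifted_left(): multiplying \( x^a \) by \( x^b \) must give \( x^{a+b} \), all modulo the CRC32 polynomial:

a=crc32_of_1_shifted_left(100)
b=crc32_of_1_shifted_left(23)
assert gf32_mul(a, b, CRC32_POLY)==crc32_of_1_shifted_left(123)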

Now I split the input buffer and calculate the CRC32 of each (sub)buffer separately, using the (standard) zlib crc32 function. Then I do almost the same as I did with bignum previously, but using GF(2) operations:

import sys, zlib

fname=sys.argv[1]
f=open(fname, "rb")
buf=f.read()
f.close()
print ("whole buf 0x%x" % zlib.crc32(buf))

fsize=len(buf)

CHUNK_SIZE=fsize//3

CHUNK1_size=CHUNK_SIZE
CHUNK2_size=CHUNK_SIZE
CHUNK3_size=fsize-CHUNK_SIZE*2

print ("chunk sizes:", CHUNK1_size, CHUNK2_size, CHUNK3_size)

buf1=buf[0:CHUNK1_size]
buf2=buf[CHUNK1_size:CHUNK1_size+CHUNK2_size]
buf3=buf[CHUNK1_size+CHUNK2_size:]

newbuf=buf1+buf2+buf3

crc_part1=zlib.crc32(buf1)
crc_part2=zlib.crc32(buf2)
crc_part3=zlib.crc32(buf3)
print ("crc_part1 0x%x" % crc_part1)
print ("crc_part2 0x%x" % crc_part2)
print ("crc_part3 0x%x" % crc_part3)

CRC32_POLY=0xEDB88320

def method1():
    x_hi=crc32_of_1_shifted_left((len(buf2)+len(buf3))*8)
    print ("x_hi 0x%x" % x_hi)
    x_mi=crc32_of_1_shifted_left((len(buf3)*8))
    print ("x_mi 0x%x" % x_mi)
    tmp1=gf32_mul(crc_part1, x_hi, CRC32_POLY)
    tmp2=gf32_mul(crc_part2, x_mi, CRC32_POLY)
    print ("my result 0x%x" % (tmp1 ^ tmp2 ^ crc_part3))

method1()

The last step, the XORing, is the GF(2) addition operation.
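
By the way, these two steps (multiply, then XOR) can be packaged into a function equivalent to zlib's crc32_combine(), which I'll mention again at the end of this post. This is just a sketch built from the functions above (the my_crc32_combine name is mine):

def my_crc32_combine(crc1, crc2, len2):
    # 'shift' crc1 left by len2 bytes, i.e., multiply it by x^(len2*8), then add crc2:
    return gf32_mul(crc1, crc32_of_1_shifted_left(len2*8), CRC32_POLY) ^ crc2

assert my_crc32_combine(crc_part1, crc_part2, len(buf2))==zlib.crc32(buf1+buf2)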

method1() works perfectly, without problems. But ouch, a problem: the crc32_of_1_shifted_left() function is very slow, because it spends as many iterations as there are bits in the input buffers. Of course, this is not acceptable. (After all, our goal was a faster CRC computation utility.) In other words, we have to optimize this function somehow.

I'm going to precompute this function for \( 2^x \) arguments. Here is the code in pure C for that:

#include <stdio.h>
#include <stdint.h>

#define CRC32_POLY 0xEDB88320

uint32_t f(uint64_t cnt_bits)
{
        uint32_t crc=0x80000000; // low bit (1) reversed
        for (uint64_t i=0; i<cnt_bits; i++)
        {
                if ((crc & 1)==1)
                        crc=(crc>>1) ^ CRC32_POLY;
                else
                        crc=crc>>1;
        }
        return crc;
}
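
// NB: f(1ULL<<i) performs 2^i loop iterations, so large values of i take
// impractically long to compute; and as the output below shows, the values
// repeat after i=31 anyway, so stopping at i<32 is enough in practice.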

int main(void)
{
        for (int i=0; i<64; i++)
        {
                printf ("0x%x, # 1<<%d\n", f(1ULL<<i), i);
        }
        return 0;
}

But ouch! The CRC state seems to wrap after \( 2^{32} \):

0x40000000, # 1<<0
0x20000000, # 1<<1
0x8000000,  # 1<<2
0x800000,   # 1<<3
0x8000,     # 1<<4
0xedb88320, # 1<<5
0xb1e6b092, # 1<<6
0xa06a2517, # 1<<7
0xed627dae, # 1<<8
0x88d14467, # 1<<9
0xd7bbfe6a, # 1<<10
0xec447f11, # 1<<11
0x8e7ea170, # 1<<12
0x6427800e, # 1<<13
0x4d47bae0, # 1<<14
0x9fe548f,  # 1<<15
0x83852d0f, # 1<<16
0x30362f1a, # 1<<17
0x7b5a9cc3, # 1<<18
0x31fec169, # 1<<19
0x9fec022a, # 1<<20
0x6c8dedc4, # 1<<21
0x15d6874d, # 1<<22
0x5fde7a4e, # 1<<23
0xbad90e37, # 1<<24
0x2e4e5eef, # 1<<25
0x4eaba214, # 1<<26
0xa8a472c0, # 1<<27
0x429a969e, # 1<<28
0x148d302a, # 1<<29
0xc40ba6d0, # 1<<30
0xc4e22c3c, # 1<<31
0x40000000, # 1<<32
0x20000000, # 1<<33
0x8000000,  # 1<<34
0x800000,   # 1<<35
...

No wonder: a CRC over x bits may (though not necessarily) wrap after \( 2^x \) input bits. Here that is \( 2^{32} \) bits = 512 MiB of data, or half of 1 GiB. (If you work with larger buffers, you may switch to CRC64 or a cryptographic hash.)

OK, now instead of multiplying the CRC result by crc32_of_1_shifted_left(), I will decompose the buffer length into a sum of \( 2^x \) numbers, that is, 0x1234 = 0x1000 + 0x200 + 0x20 + 0x10 + 0x4. Since \( x^{a+b} = x^a \cdot x^b \), I can then multiply the CRC result by precomputed values from the table, one for each bit set in the buffer length.

CRC32_POLY=0xEDB88320

precomp=[
# 0x80000000, # 0
0x40000000, # 1<<0
0x20000000, # 1<<1
0x08000000, # 1<<2
0x00800000, # 1<<3
0x00008000, # 1<<4
0xedb88320, # 1<<5
...
0xc40ba6d0, # 1<<30
0xc4e22c3c, # 1<<31
]

def method2():
    first_idx=(len(buf2)+len(buf3))*8
    tmp1=crc_part1
    for i in range(32):
        if ((first_idx>>i) & 1)==1:
            tmp1=gf32_mul(tmp1, precomp[i], CRC32_POLY)

    second_idx=len(buf3)*8
    tmp2=crc_part2
    for i in range(32):
        if ((second_idx>>i) & 1)==1:
            tmp2=gf32_mul(tmp2, precomp[i], CRC32_POLY)

    print ("my result 0x%x" % (tmp1 ^ tmp2 ^ crc_part3))

method2()

It's perfectly workable and much faster than my first method: there are only ~64 calls to the gf32_mul() function plus table lookups. But ouch again: tests showed that it behaves incorrectly on files larger than 4 GiB ( \( 2^{32} \) bytes ). Another problem: the input file is split into only 3 buffers, which doesn't scale.

So this is my final version (unabridged). It splits the input file into equal-sized chunks, and it's been tested on files larger than 4 GiB: the trick is to reuse the same precomputed table again (note the i % 32 below).

#!/usr/bin/env python3

import sys, functools, operator

# yes, both are XORs:
def gf2_add(x, y):
    return x^y

def gf2_sub(x, y):
    return x^y

def gf32_mul(x, y, R):
    z = 0
    for i in range(32-1, -1, -1):

        if (y >> i) & 1:
            z=gf2_add(z, x)

        # shift and also reduce by R if overflow detected
        # IOW, keep x smaller than R or modulo R
        if x & 1:
            x = gf2_sub(x >> 1, R)
        else:
            x = x >> 1

    return z

#CHUNK_SIZE=2**14 # 16KiB
#CHUNK_SIZE=2**20 # 1MiB
CHUNK_SIZE=2**23 # 8MiB
#CHUNK_SIZE=2**30 # 1GiB
chunks=[]

fname=sys.argv[1]
import zlib
f=open(fname, "rb")

while True:
    buf=f.read(CHUNK_SIZE)
    crc=zlib.crc32(buf)
    print ("chunk. crc32 0x%08x, len 0x%x" % (crc, len(buf)))
    chunks.append((crc, len(buf)))
    if len(buf)!=CHUNK_SIZE:
        break

f.close()

CRC32_POLY=0xEDB88320

precomp=[
0x40000000, 0x20000000, 0x08000000, 0x00800000, 0x00008000, 0xedb88320,
0xb1e6b092, 0xa06a2517, 0xed627dae, 0x88d14467, 0xd7bbfe6a, 0xec447f11,
0x8e7ea170, 0x6427800e, 0x4d47bae0, 0x09fe548f, 0x83852d0f, 0x30362f1a,
0x7b5a9cc3, 0x31fec169, 0x9fec022a, 0x6c8dedc4, 0x15d6874d, 0x5fde7a4e,
0xbad90e37, 0x2e4e5eef, 0x4eaba214, 0xa8a472c0, 0x429a969e, 0x148d302a,
0xc40ba6d0, 0xc4e22c3c,
]

def process_chunk(crc, idx_in_bytes):
    # the last chunk (no bytes follow it): return the CRC as is
    if idx_in_bytes==0:
        return crc
    idx=idx_in_bytes*8
    tmp1=crc
    for i in range(64): # as if idx can be up to 2**64
        if ((idx>>i) & 1)==1:
            # why modulo 32? to 'reuse' the lower constants in case of overflow (for files >4GiB):
            tmp1=gf32_mul(tmp1, precomp[i % 32], CRC32_POLY)
    return tmp1

cur_idx=0
crcs=[]
# process chunks backwards:
for crc, len_ in reversed(chunks):
    crcs.append(process_chunk(crc, cur_idx))
    cur_idx=cur_idx+len_

print ("final result", hex(functools.reduce(operator.xor, crcs, 0)))

It's almost perfect, but still a prototype. In fact, I have no need to make CRC computation parallel; I did all this only to get a better understanding of GF(2) math.

So all you have to do is write a function that computes the CRC32 of the chunks in parallel, and then feed that data to my Python script.
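
For example, here is a sketch using worker processes via Python's concurrent.futures (the function names and the one-worker-per-chunk layout are mine, for illustration). Each worker re-opens the file and reads its own chunk, so only small (crc, length) tuples travel between processes:

import os, sys, zlib
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE=2**23 # 8MiB, as before

def crc_of_chunk(args):
    fname, offset=args
    with open(fname, "rb") as f:
        f.seek(offset)
        buf=f.read(CHUNK_SIZE)
    return zlib.crc32(buf), len(buf)

def parallel_chunk_crcs(fname):
    fsize=os.path.getsize(fname)
    offsets=range(0, fsize, CHUNK_SIZE)
    with ProcessPoolExecutor() as ex:
        return list(ex.map(crc_of_chunk, [(fname, o) for o in offsets]))

if __name__=="__main__":
    # these (crc, length) pairs are exactly what the combining code above expects:
    for crc, len_ in parallel_chunk_crcs(sys.argv[1]):
        print ("chunk. crc32 0x%08x, len 0x%x" % (crc, len_))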

All the files

Now about other implementations. Older zlib versions had the crc32_combine_() function:

local uLong crc32_combine_(crc1, crc2, len2)

( src )

You see, it also requires the length of the second block, in order to 'shift' the first CRC left. But this function is harder to understand than what I've explained here.

All the other implementations you can google that use the PCLMUL instruction (including the official PDFs from Intel) may also be cool, but they are harder to grasp.

(the post was first published on 2023-05-18.)

