CRC is merely a computation of remainder over GF(2). This is why I wrote my previous blog post: Parallelizing remainder calculation over bignum. Now I'm going to do the same, but only over GF(2).

Two GF(2) operations will be utilized: GF(2) multiplication (also known as carry-less multiplication) and shifting 1 bit some bits left to produce a multiplier (but over GF(2)).

To read more about GF(2) basics, see this and this.

Let's start with second operation: shifting 1 some bits left. This is a CRC32 function written in Python. It's perfectly workable, but usually not used in practice, because it's slow if compared to table-based functions. But it's better for understanding.

CRC32_POLY=0xEDB88320 def my_crc32(buf): crc=0xffffffff for c in buf: crc=crc^c for _ in range(8): if (crc & 1)==1: crc=(crc>>1) ^ CRC32_POLY # GF(2) subtraction else: crc=crc>>1 crc=crc^0xffffffff return crc

I'm going to rework this function a bit. It will produce a CRC value of a bit string like '0b1000000....000' with specified number of zeroes:

# it takes 1 and shifts it left, $cnt_bits$ times # this function is no different from the common (but naive, tableless) CRC32 computation function def crc32_of_1_shifted_left(cnt_bits): init_data=0x80000000 # low bit (1) reversed crc=0 crc=crc^init_data for i in range(cnt_bits): if (crc & 1)==1: crc=(crc>>1) ^ CRC32_POLY # GF(2) subtraction else: crc=crc>>1 return crc

Another function is GF(2) multiplication, which I already used in my notes:

# yes, both are XORs: def gf2_add(x, y): return x^y def gf2_sub(x, y): return x^y def gf32_mul(x, y, R): z = 0 for i in range(32-1, -1, -1): if (y >> i) & 1: z=gf2_add(z, x) # shift and also reduce by R if overflow detected # IOW, keep x smaller than R or modulo R if x & 1: x = gf2_sub(x >> 1, R) else: x = x >> 1 return z

BTW, the PCLMUL x86 instruction does the same, but without reduction stage after each iteration.

Now here I split the input buffer and calculate CRC32 of each (sub)buffer separately, using the (standard) zlib crc32 function. Then I do almost the same I did with bignum previously, but using GF(2) operations:

fname=sys.argv[1] import zlib f=open(fname, "rb") buf=f.read() f.close() print ("whole buf 0x%x" % zlib.crc32(buf)) fsize=len(buf) CHUNK_SIZE=fsize//3 CHUNK1_size=CHUNK_SIZE CHUNK2_size=CHUNK_SIZE CHUNK3_size=fsize-CHUNK_SIZE*2 print ("chunk sizes:", CHUNK1_size, CHUNK2_size, CHUNK3_size) buf1=buf[0:CHUNK1_size] buf2=buf[CHUNK1_size:CHUNK1_size+CHUNK2_size] buf3=buf[CHUNK1_size+CHUNK2_size:] newbuf=buf1+buf2+buf3 crc_part1=zlib.crc32(buf1) crc_part2=zlib.crc32(buf2) crc_part3=zlib.crc32(buf3) print ("crc_part1 0x%x" % crc_part1) print ("crc_part2 0x%x" % crc_part2) print ("crc_part3 0x%x" % crc_part3) CRC32_POLY=0xEDB88320 def method1(): x_hi=crc32_of_1_shifted_left((len(buf2)+len(buf3))*8) print ("x_hi 0x%x" % x_hi) x_mi=crc32_of_1_shifted_left((len(buf3)*8)) print ("x_mi 0x%x" % x_mi) tmp1=gf32_mul(crc_part1, x_hi, CRC32_POLY) tmp2=gf32_mul(crc_part2, x_mi, CRC32_POLY) print ("my result 0x%x" % (tmp1 ^ tmp2 ^ crc_part3)) method1()

The last step -- XORing as addition operation.

Works perfectly, without problems. But ouch! A problem -- the crc32_of_1_shifted_left() function is so slow, because it spends as many iterations as there are bits in input buffers. Of course, this is not acceptable. (After all, our goal was to make more fast CRC computation utility.) In other words, we have to optimize this function somehow.

I'm going to precompute this function for \( 2^x \) values. This is the code written in pure C for that:

#include <stdio.h> #include <stdint.h> #define CRC32_POLY 0xEDB88320 uint32_t f(uint64_t cnt_bits) { uint32_t crc=0x80000000; // low bit (1) reversed for (uint64_t i=0; i<cnt_bits; i++) { if ((crc & 1)==1) crc=(crc>>1) ^ CRC32_POLY; else crc=crc>>1; }; return crc; }; void main() { for (int i=0; i<64; i++) { printf ("0x%x, # 1<<%d\n", f(1ULL<<i), i); }; };

But ouch! The CRC state seems to wrap after \( 2^{32} \):

0x40000000, # 1<<0 0x20000000, # 1<<1 0x8000000, # 1<<2 0x800000, # 1<<3 0x8000, # 1<<4 0xedb88320, # 1<<5 0xb1e6b092, # 1<<6 0xa06a2517, # 1<<7 0xed627dae, # 1<<8 0x88d14467, # 1<<9 0xd7bbfe6a, # 1<<10 0xec447f11, # 1<<11 0x8e7ea170, # 1<<12 0x6427800e, # 1<<13 0x4d47bae0, # 1<<14 0x9fe548f, # 1<<15 0x83852d0f, # 1<<16 0x30362f1a, # 1<<17 0x7b5a9cc3, # 1<<18 0x31fec169, # 1<<19 0x9fec022a, # 1<<20 0x6c8dedc4, # 1<<21 0x15d6874d, # 1<<22 0x5fde7a4e, # 1<<23 0xbad90e37, # 1<<24 0x2e4e5eef, # 1<<25 0x4eaba214, # 1<<26 0xa8a472c0, # 1<<27 0x429a969e, # 1<<28 0x148d302a, # 1<<29 0xc40ba6d0, # 1<<30 0xc4e22c3c, # 1<<31 0x40000000, # 1<<32 0x20000000, # 1<<33 0x8000000, # 1<<34 0x800000, # 1<<35 ...

No wonder: CRC over x bits *may* (but not always necessary) to wrap after \( 2^x \) input bits.
This is ~500 MiB of data or half of 1 GiB.
(If you work with larger buffers, you may switch to CRC64 or cryptographic hash.)

OK, now instead of multiplying CRC result by crc32_of_1_shifted_left(), I would factor buffer length by \( 2^x \) numbers. That is 0x1234 = 0x1000 + 0x200 + 0x30 + 0x4. And I would multiply CRC result by a precomputed value from table depending of how buffer length was factored. IOW, that depends on bits in buffer length.

CRC32_POLY=0xEDB88320 precomp=[ # 0x80000000, # 0 0x40000000, # 1<<0 0x20000000, # 1<<1 0x08000000, # 1<<2 0x00800000, # 1<<3 0x00008000, # 1<<4 0xedb88320, # 1<<5 ... 0xc40ba6d0, # 1<<30 0xc4e22c3c, # 1<<31 ] def method2(): first_idx=(len(buf2)+len(buf3))*8 tmp1=crc_part1 for i in range(32): if ((first_idx>>i) & 1)==1: tmp1=gf32_mul(tmp1, precomp[i], CRC32_POLY) second_idx=len(buf3)*8 tmp2=crc_part2 for i in range(32): if ((second_idx>>i) & 1)==1: tmp2=gf32_mul(tmp2, precomp[i], CRC32_POLY) print ("my result 0x%x" % (tmp1 ^ tmp2 ^ crc_part3)) method2()

It's perfectly workable and much faster than my first method. There are only ~64 calls to gf32_mul() function + table lookups. But ouch again -- tests shown that it behaves incorrectly on files larger than 4GiB ( \( 2^{32} \) bytes ). Another problem -- input file splitted only on 3 buffers, which is unscalable.

So this is my final version (unabridged). It splits the input file in even chunks. It's tested on files larger than 4GiB: we reuse the same precomputed table again.

#!/usr/bin/env python3 import sys, functools, operator # yes, both are XORs: def gf2_add(x, y): return x^y def gf2_sub(x, y): return x^y def gf32_mul(x, y, R): z = 0 for i in range(32-1, -1, -1): if (y >> i) & 1: z=gf2_add(z, x) # shift and also reduce by R if overflow detected # IOW, keep x smaller than R or modulo R if x & 1: x = gf2_sub(x >> 1, R) else: x = x >> 1 return z #CHUNK_SIZE=2**14 # 16KiB #CHUNK_SIZE=2**20 # 1MiB CHUNK_SIZE=2**23 # 8MiB #CHUNK_SIZE=2**30 # 1GiB chunks=[] fname=sys.argv[1] import zlib f=open(fname, "rb") while True: buf=f.read(CHUNK_SIZE) crc=zlib.crc32(buf) print ("chunk. crc32 0x%08x, len 0x%x" % (crc, len(buf))) chunks.append((crc, len(buf))) if len(buf)!=CHUNK_SIZE: break f.close() CRC32_POLY=0xEDB88320 precomp=[ 0x40000000, 0x20000000, 0x08000000, 0x00800000, 0x00008000, 0xedb88320, 0xb1e6b092, 0xa06a2517, 0xed627dae, 0x88d14467, 0xd7bbfe6a, 0xec447f11, 0x8e7ea170, 0x6427800e, 0x4d47bae0, 0x09fe548f, 0x83852d0f, 0x30362f1a, 0x7b5a9cc3, 0x31fec169, 0x9fec022a, 0x6c8dedc4, 0x15d6874d, 0x5fde7a4e, 0xbad90e37, 0x2e4e5eef, 0x4eaba214, 0xa8a472c0, 0x429a969e, 0x148d302a, 0xc40ba6d0, 0xc4e22c3c, ] def process_chunk(crc, idx_in_bytes): # if the last block... return as is if idx_in_bytes==0: return crc idx=idx_in_bytes*8 tmp1=crc for i in range(64): # as if idx can be up to 2**64 if ((idx>>i) & 1)==1: # why modulo 32? so to 'reuse' lower constants in case of overflow (for files >4GiB): tmp1=gf32_mul(tmp1, precomp[i % 32], CRC32_POLY) return tmp1 cur_idx=0 crcs=[] # process chunks backwards: for crc, len_ in reversed(chunks): crcs.append(process_chunk(crc, cur_idx)) cur_idx=cur_idx+len_ print ("final result", hex(functools.reduce(operator.xor, crcs, 0)))

It's almost perfect. But prototypical. In fact, I have no need to make CRC computation parallel. I did all this only to get better understanding of GF(2) math.

So all you have to do is to write a function that will compute CRC32 on chunks in parallel. That feed that data to my Python script.

Now about other versions. Older zlib version had the crc32_combine_ function:

local uLong crc32_combine_(crc1, crc2, len2)

( src )

You see -- it also requires the length of the second block, so to 'shift' first CRC left. But this function is harder to understand than what I've explained here.

All other versions you can google which use PCLMUL instruction (also, the official PDFs from Intel) may also be cool, but harder to grasp.

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.