CRC32 optimization for ARM Cortex M4
At work I played with an ultra-low-power SoC powered by a single-core ARM Cortex-M4. To check some data integrity, we have to use CRC32, but there is no hardware peripheral to speed up the computation, and the Cortex-M4's ARMv7-M instruction set has no dedicated CRC instructions (those only arrive with Armv8.1). After some research, I was surprised not to find any optimized implementation on the Internet. So I wrote one myself, and the result is quite impressive: my version takes roughly half the time. Some figures (on ~30 KB of data):
- -Os compilation: 19.4 milliseconds
- -O2 compilation: 15.2 milliseconds
- -Os + optimizations: 8.2 milliseconds
Keep in mind that the Cortex-M4 has neither a data cache nor an instruction cache, and that memory accesses go straight to SRAM. Compilation is done in Thumb mode with GCC 9.
The original version is the one from Wang Yaofu, licensed under Apache 2.0. It's quite simple and very academic. Plain C doesn't allow much more optimization of this algorithm because the CRC has to be processed byte by byte, so we have to write some assembly!
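For reference, the core of that version is the classic table-driven loop, one lookup per byte. Here is a minimal sketch reconstructed from the comments kept in the assembly below (the exact declarations of crc32_tab and crc32val are assumptions):

/* assumed declarations, matching the names used in the assembly below */
extern const unsigned int crc32_tab[256];
extern unsigned int crc32val;

static void _update_crc32(const unsigned char *s, unsigned int len)
{
    unsigned int i;
    for (i = 0; i < len; i++) {
        /* one table lookup per input byte */
        crc32val = crc32_tab[(crc32val ^ s[i]) & 0xFF] ^ ((crc32val >> 8) & 0x00FFFFFF);
    }
}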
I used several optimization tricks:
Use all registers available
The idea is to use the registers from r0 up to r9 and not be limited to r0-r5 as commonly done.
Unroll loop
Avoid breaking the CPU pipeline with a check + jump on every byte. The code gets bigger and repetitive, but faster. A macro could shrink the source code; that's not my choice here.
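As a sketch, the loop control of a rolled byte loop versus the unrolled version used below (the byte_loop label and the elided bodies are only illustrative):

@ rolled: the counter update and the branch are paid on every byte
byte_loop:
    ldrb r4, [r0], #1
    @ ... one table lookup ...
    subs r1, r1, #1
    bne  byte_loop

@ unrolled by 16 (as in the function below): the same overhead is paid once per 16 bytes
crc32_opt16_loop:
    @ ... sixteen table lookups ...
    subs r1, r1, #16
    bne  crc32_opt16_loop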
Do memory burst instead of unitary access
Burst accesses are much faster than unitary ones, especially when there is no cache. Here we load 4*32 bits at a time and keep all the data in registers. Bytes outside the burst window are processed by the non-optimized version.
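In other words, instead of four separate word loads, a single load-multiple is used (this is the ldm found at the top of the loop below):

@ unitary accesses: four load instructions, four address updates
ldr r4, [r0], #4
ldr r7, [r0], #4
ldr r8, [r0], #4
ldr r9, [r0], #4

@ burst access: 4*32 bits in one ldm, r0 post-incremented by 16
ldm r0!, {r4, r7, r8, r9}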
Use shifts and rotates within load and eor instructions
ARM allows a register operand to be shifted or rotated directly within load and eor instructions (and some others), without spending a separate instruction on it.
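For example, one byte of the lookup written without and then with the embedded shifts (r10 is only a hypothetical scratch register for the expanded form):

@ without embedded shifts/rotates: seven instructions
ror r10, r4, #8
eor r5, r2, r10
uxtb r5, r5
lsl r10, r5, #2
ldr r6, [r3, r10]
lsr r10, r2, #8
eor r2, r6, r10

@ with embedded shifts/rotates (as used below): four instructions
eor r5, r2, r4, ror #8
uxtb r5, r5
ldr r6, [r3, r5, lsl #2]
eor r2, r6, r2, lsr #8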
Avoid pipeline register lock
When possible, we can swap assembly lines so that consecutive instructions don't work on the same register (and thus don't stall waiting for it).
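The end of the loop below is the concrete case: the subs is slid between the last ldr and the eor that consumes r6, so the eor is not issued right behind the load it depends on:

@ natural order: eor needs r6 immediately after the ldr that produces it
ldr  r6, [r3, r5, lsl #2]
eor  r2, r6, r2, lsr #8
subs r1, r1, #16

@ reordered (as in the function below): an independent instruction fills the gap
ldr  r6, [r3, r5, lsl #2]
subs r1, r1, #16
eor  r2, r6, r2, lsr #8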
Update condition flags in sub instruction
Use the subs variant to update the condition flags and avoid a separate comparison against 0.
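That is, the loop counter is handled like this (the sub/cmp pair is what a naive version would do):

@ naive: separate subtract and compare
sub  r1, r1, #16
cmp  r1, #0
bne  crc32_opt16_loop

@ subs sets the flags itself, so the cmp disappears
subs r1, r1, #16
bne  crc32_opt16_loop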
Do aligned accesses
In the calling function there is some code to hand "s" over as a 32-bit aligned pointer (the extra bytes are processed by the standard code).
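A minimal sketch of what such calling code can look like; the wrapper name crc32_update is hypothetical, and the real file linked below may structure this differently:

#include <stdint.h>

/* hypothetical wrapper: _update_crc32 is the byte-wise routine,
 * _update_crc32_opt16 is the optimized one shown below */
static void crc32_update(const unsigned char *s, unsigned int len)
{
    /* head: feed bytes to the byte-wise code until s is 32-bit aligned */
    unsigned int head = (unsigned int)(-(uintptr_t)s & 3);
    if (head > len)
        head = len;
    _update_crc32(s, head);
    s += head;
    len -= head;

    /* middle: whole aligned 16-byte blocks through the optimized routine */
    unsigned int blocks = len & ~15u;
    if (blocks)
        _update_crc32_opt16(s, blocks);

    /* tail: remaining bytes (fewer than 16) through the byte-wise code */
    _update_crc32(s + blocks, len - blocks);
}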
Here is the optimized function. The whole C file can be found here.
/**
* Optimized version of _update_crc32 for 16 bytes blocks
*/
static void _update_crc32_opt16(const unsigned char *s, unsigned int len)
{
/* unsigned int i; */
/* for (i = 0; i < len; i++) { */
/* crc32val = crc32_tab[(crc32val ^ s[i]) & 0xFF] ^ ((crc32val >> 8) & 0x00FFFFFF); */
/* } */
/*
r0 -> s
r1 -> len
r2 -> crc32val
r3 -> crc32tab
r4 -> curval[0]
r5 -> (crc32val ^ s[i]) & 0xFF
r6 -> crc32_tab[(crc32val ^ s[i]) & 0xFF]
r7 -> curval[1]
r8 -> curval[2]
r9 -> curval[3]
*/
__asm__ volatile (
"mov r0, %1\n"
"mov r1, %2\n"
"mov r2, %3\n"
"mov r3, %4\n"
"push {r7, r8, r9}\n"
"crc32_opt16_loop:\n"
"ldm r0!, {r4, r7, r8, r9}\n"
// curval[0]
"eor r5, r2, r4\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r4, ror #8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r4, ror #16\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r4, ror #24\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
// curval[1]
"eor r5, r2, r7\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r7, ror #8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r7, ror #16\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r7, ror #24\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
// curval[2]
"eor r5, r2, r8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r8, ror #8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r8, ror #16\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r8, ror #24\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
// curval[3]
"eor r5, r2, r9\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r9, ror #8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r9, ror #16\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r9, ror #24\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
// Last two lines swapped: subs fills the slot between the ldr and the eor that consumes r6
"subs r1, r1, #16\n"
"eor r2, r6, r2, lsr #8\n"
"bne crc32_opt16_loop\n"
"pop {r7, r8, r9}\n"
"str r2, %0\n"
: "=m" (crc32val)
: "r" (s), "r" (len), "r" (crc32val), "r" (crc32_tab)
// r7-r9 are not in the clobber list, so they are saved/restored manually
// with push/pop above; "cc" is declared because subs updates the condition flags
: "r0", "r1", "r2", "r3", "r4", "r5", "r6", "cc"
);
}
The code has to be compiled with at least -O1 or -Os.
For comparison, here is the (quite good) code generated by GCC 12 with -Os, working on a single byte per iteration:
570: 4288 cmp r0, r1
572: d100 bne.n 576 <_update_crc32+0x12>
574: bd30 pop {r4, r5, pc}
576: 6814 ldr r4, [r2, #0]
578: f810 3b01 ldrb.w r3, [r0], #1
57c: 4063 eors r3, r4
57e: b2db uxtb r3, r3
580: f855 3023 ldr.w r3, [r5, r3, lsl #2]
584: ea83 2314 eor.w r3, r3, r4, lsr #8
588: 6013 str r3, [r2, #0]
58a: e7f1 b.n 570 <_update_crc32+0xc>
Not so far from mine, but its main weakness is that the result is written back to memory on every iteration (the str at 588, reloaded by the ldr at 576), which is really time consuming.