CRC32 optimization for ARM Cortex M4
At work I played with an ultra-low-power SoC powered by a single-core ARM Cortex-M4. To check some data integrity, we have to use CRC32, but there is no hardware peripheral to speed up the computation, and the Cortex-M4's ARMv7-M instruction set has no dedicated CRC instructions (those only arrive with Armv8.1). After some research, I was surprised not to find any optimized implementation on the Internet. So I wrote one myself, and the result is quite impressive: my version takes roughly half the time. Some figures (on ~30 KB of data):
- -Os compilation: 19.4 milliseconds
- -O2 compilation: 15.2 milliseconds
- -Os + optimizations: 8.2 milliseconds
Keep in mind that the Cortex-M4 has neither a data cache nor an instruction cache, and that memory accesses go straight to SRAM. Compilation is done in Thumb mode with GCC 9.
The original version is the one from Wang Yaofu, licensed under Apache 2.0. It's quite simple and very academic. Plain C doesn't allow much more optimization of this algorithm because the CRC has to be processed byte by byte, so we have to write some assembly!
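For reference, the core of that version is the classic table-driven loop, one lookup per byte. Here is a minimal sketch reconstructed from the comments kept in the assembly below (the exact declarations of crc32_tab and crc32val are assumptions):

/* assumed declarations, matching the names used in the assembly below */
extern const unsigned int crc32_tab[256];
extern unsigned int crc32val;

static void _update_crc32(const unsigned char *s, unsigned int len)
{
    unsigned int i;
    for (i = 0; i < len; i++) {
        /* one table lookup per input byte */
        crc32val = crc32_tab[(crc32val ^ s[i]) & 0xFF] ^ ((crc32val >> 8) & 0x00FFFFFF);
    }
}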
I used several optimization tricks:
Use all registers available
The idea is to use the registers from r0 up to r9 and not be limited to r0-r5 as commonly done.
Unroll loop
Avoid breaking the CPU pipeline with a check + jump on every byte. The code gets bigger and repetitive, but faster. A macro could shrink the source code; that's not my choice here.
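As a sketch, the loop control of a rolled byte loop versus the unrolled version used below (the byte_loop label and the elided bodies are only illustrative):

@ rolled: the counter update and the branch are paid on every byte
byte_loop:
    ldrb r4, [r0], #1
    @ ... one table lookup ...
    subs r1, r1, #1
    bne  byte_loop

@ unrolled by 16 (as in the function below): the same overhead is paid once per 16 bytes
crc32_opt16_loop:
    @ ... sixteen table lookups ...
    subs r1, r1, #16
    bne  crc32_opt16_loop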
Do memory burst instead of unitary access
Burst accesses are much faster than unitary ones, especially when there is no cache. Here we load 4*32 bits at a time and keep all the data in registers. Bytes outside the burst window are processed by the non-optimized version.
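In other words, instead of four separate word loads, a single load-multiple is used (this is the ldm found at the top of the loop below):

@ unitary accesses: four load instructions, four address updates
ldr r4, [r0], #4
ldr r7, [r0], #4
ldr r8, [r0], #4
ldr r9, [r0], #4

@ burst access: 4*32 bits in one ldm, r0 post-incremented by 16
ldm r0!, {r4, r7, r8, r9}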
Use shifts and rotates within load and eor instructions
ARM allows a register operand to be shifted or rotated directly within load and eor instructions (and some others), without spending a separate instruction on it.
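For example, one byte of the lookup written without and then with the embedded shifts (r10 is only a hypothetical scratch register for the expanded form):

@ without embedded shifts/rotates: seven instructions
ror r10, r4, #8
eor r5, r2, r10
uxtb r5, r5
lsl r10, r5, #2
ldr r6, [r3, r10]
lsr r10, r2, #8
eor r2, r6, r10

@ with embedded shifts/rotates (as used below): four instructions
eor r5, r2, r4, ror #8
uxtb r5, r5
ldr r6, [r3, r5, lsl #2]
eor r2, r6, r2, lsr #8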
Avoid pipeline register lock
When possible, we can swap assembly lines so that consecutive instructions don't work on the same register (and thus don't stall waiting for it).
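The end of the loop below is the concrete case: the subs is slid between the last ldr and the eor that consumes r6, so the eor is not issued right behind the load it depends on:

@ natural order: eor needs r6 immediately after the ldr that produces it
ldr  r6, [r3, r5, lsl #2]
eor  r2, r6, r2, lsr #8
subs r1, r1, #16

@ reordered (as in the function below): an independent instruction fills the gap
ldr  r6, [r3, r5, lsl #2]
subs r1, r1, #16
eor  r2, r6, r2, lsr #8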
Update condition flags in sub instruction
Use the subs variant to update the condition flags and avoid a separate comparison against 0.
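That is, the loop counter is handled like this (the sub/cmp pair is what a naive version would do):

@ naive: separate subtract and compare
sub  r1, r1, #16
cmp  r1, #0
bne  crc32_opt16_loop

@ subs sets the flags itself, so the cmp disappears
subs r1, r1, #16
bne  crc32_opt16_loop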
Do aligned accesses
In the calling function there is some code to hand "s" over as a 32-bit aligned pointer (the extra bytes are processed by the standard code).
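A minimal sketch of what such calling code can look like; the wrapper name crc32_update is hypothetical, and the real file linked below may structure this differently:

#include <stdint.h>

/* hypothetical wrapper: _update_crc32 is the byte-wise routine,
 * _update_crc32_opt16 is the optimized one shown below */
static void crc32_update(const unsigned char *s, unsigned int len)
{
    /* head: feed bytes to the byte-wise code until s is 32-bit aligned */
    unsigned int head = (unsigned int)(-(uintptr_t)s & 3);
    if (head > len)
        head = len;
    _update_crc32(s, head);
    s += head;
    len -= head;

    /* middle: whole aligned 16-byte blocks through the optimized routine */
    unsigned int blocks = len & ~15u;
    if (blocks)
        _update_crc32_opt16(s, blocks);

    /* tail: remaining bytes (fewer than 16) through the byte-wise code */
    _update_crc32(s + blocks, len - blocks);
}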
Here is the optimized function. The whole C file can be found here.
/**
* Optimized version of _update_crc32 for 16 bytes blocks
*/
static void _update_crc32_opt16(const unsigned char *s, unsigned int len)
{
/* unsigned int i; */
/* for (i = 0; i < len; i++) { */
/* crc32val = crc32_tab[(crc32val ^ s[i]) & 0xFF] ^ ((crc32val >> 8) & 0x00FFFFFF); */
/* } */
/*
r0 -> s
r1 -> len
r2 -> crc32val
r3 -> crc32tab
r4 -> curval[0]
r5 -> (crc32val ^ s[i]) & 0xFF
r6 -> crc32_tab[(crc32val ^ s[i]) & 0xFF]
r7 -> curval[1]
r8 -> curval[2]
r9 -> curval[3]
*/
__asm__ volatile (
"mov r0, %1\n"
"mov r1, %2\n"
"mov r2, %3\n"
"mov r3, %4\n"
"push {r7, r8, r9}\n"
"crc32_opt16_loop:\n"
"ldm r0!, {r4, r7, r8, r9}\n"
// curval[0]
"eor r5, r2, r4\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r4, ror #8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r4, ror #16\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r4, ror #24\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
// curval[1]
"eor r5, r2, r7\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r7, ror #8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r7, ror #16\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r7, ror #24\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
// curval[2]
"eor r5, r2, r8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r8, ror #8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r8, ror #16\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r8, ror #24\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
// curval[3]
"eor r5, r2, r9\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r9, ror #8\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r9, ror #16\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
"eor r2, r6, r2, lsr #8\n"
"eor r5, r2, r9, ror #24\n"
"uxtb r5, r5\n"
"ldr r6, [r3, r5, lsl #2]\n"
// Last two lines swapped: subs fills the slot between the ldr and the eor that consumes r6
"subs r1, r1, #16\n"
"eor r2, r6, r2, lsr #8\n"
"bne crc32_opt16_loop\n"
"pop {r7, r8, r9}\n"
"str r2, %0\n"
: "=m" (crc32val)
: "r" (s), "r" (len), "r" (crc32val), "r" (crc32_tab)
// r7-r9 are not in the clobber list, so they are saved/restored manually
// with push/pop above; "cc" is declared because subs updates the condition flags
: "r0", "r1", "r2", "r3", "r4", "r5", "r6", "cc"
);
}
The code has to be compiled with at least -O1 or -Os.
For comparison, here is the (quite good) code generated by GCC 12 with -Os, working on a single byte per iteration:
570: 4288 cmp r0, r1
572: d100 bne.n 576 <_update_crc32+0x12>
574: bd30 pop {r4, r5, pc}
576: 6814 ldr r4, [r2, #0]
578: f810 3b01 ldrb.w r3, [r0], #1
57c: 4063 eors r3, r4
57e: b2db uxtb r3, r3
580: f855 3023 ldr.w r3, [r5, r3, lsl #2]
584: ea83 2314 eor.w r3, r3, r4, lsr #8
588: 6013 str r3, [r2, #0]
58a: e7f1 b.n 570 <_update_crc32+0xc>
Not so far from mine, but its main weakness is that the result is written back to memory on every iteration (the str at 588, reloaded by the ldr at 576), which is really time consuming.