Programming

The cost of artificial intelligence

Sunday, 09 February 2025 | Written by Grégory Soutadé

Artificial intelligence is a recurring topic these days. Between realistic image generation, the (more or less great) relevance of ChatGPT, DeepSeek's supposedly ultra-efficient and low-cost model, the Stargate mega data-center project, and the failure of Lucie's first public release, competition in the field is fierce, with investments to match.

Actually, we shouldn't speak of one artificial intelligence but of several, each with its own model and its own purpose. It's just like the brain, which is divided into different areas with specialized domains (language, memory, vision...). But in the brain's case, we have all the domains at once!

Yet, whatever the domain, artificial intelligence is not just a computer model. Before it can be effective, a whole training phase must be carried out on existing data. On this point, the web is a real windfall, a gold mine of data of all kinds. It's no coincidence that Microsoft bought GitHub in 2018 for the modest sum of 7.5 billion dollars... It can now access 500 million projects! For multimedia data (images in particular), training is more complex because people are needed to "annotate" what an image contains, that is, to mark inside a box the nature of an element (a person, a car...). The first large-scale annotations were performed through CAPTCHAs, as a way to "prove" that the user is not a robot. Today, in countries where labor is cheap (notably in South-East Asia), there are professional centers where people are paid to annotate millions of images.

All of this is relatively transparent to end users, who simply send requests through the "prompt".

In the latest version of IWLA, I implemented merging robots by their identifier ("compatible"). What a surprise to see that in barely 6 days, the GPTBot/1.2 robot downloaded 19.5 GB of content from my website, even though it changes very little. I therefore assume that GPTBot performs a full re-indexing on every change. All this bandwidth usage is, again, invisible to the end user, but, together with data storage and processing, it represents a very significant energy consumption.

That's why I decided to purely and simply block every request coming from OpenAI, as I had already done for Microsoft's very greedy robot (bingbot).

Here is the script I use, run from my crontab. It analyzes the Apache server logs and blocks the IPs whose lines match a token:

#!/bin/bash

function ban_ip()
{
    echo "Ban $1"
    nft add rule inet BAN_RU input ip saddr $1 drop
}

function ban_file()
{
    file=$1
    token=$2
    prefix_len=$3
    fields=""
    postfix=""
    case $prefix_len in
        "8")  fields="1";       postfix=".0.0.0";;
        "16") fields="1,2";     postfix=".0.0";;
        "24") fields="1,2,3";   postfix=".0";;
        "32") fields="1,2,3,4"; postfix="";;
        *) echo "Error: prefix_len must be 8, 16, 24 or 32"
           return 1;;
    esac
    # In my log format, the client IP is the second field of each line
    ips=$(grep "$token" "$file" | cut -f2 -d' ' | sort -u | cut -d'.' -f"$fields" | uniq)
    for ip in $ips ; do
        ban_ip "${ip}${postfix}/${prefix_len}"
    done
}

if [ -z "$1" ] || [ "$1" = "-h" ] || [ "$1" = "--help" ] ; then
    echo "$0 token prefix_len (8,16,24,32)"
    exit 1
fi

ban_file /var/log/apache2/access.log.1 "$1" "$2"

It relies on the Linux netfilter firewall, configured as follows:

nft create table inet BAN_RU
nft add chain inet BAN_RU input '{ type filter hook input priority filter; }'

The table in question is BAN_RU, which I already use to block IPs coming from Russia. The script only supports IPv4, but I haven't seen any IPv6 connections from such greedy robots.
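For reference, this is how the script can be invoked (the ban_bots.sh name is just an example, not the real file name), followed by a check of the resulting ruleset:

```shell
# Hypothetical script name: ban every /24 network that sent a request
# matching "GPTBot" in yesterday's Apache log
./ban_bots.sh GPTBot 24

# List the rules that were added to the table
nft list table inet BAN_RU
```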

IWLA 0.8

Tuesday, 04 February 2025 | Written by Grégory Soutadé

IWLA screenshot

Version 0.8 of IWLA (Intelligent Web Log Analyzer written in Python) is now released. While looking at the Changelog, I realized that IWLA is now 10 years old! It's crazy to see that it's still so useful to me: I use it every day. There is no real alternative to it if we consider statistical analysis without cookies/embedded JavaScript. Only AWSTATS does the same, but it's one big unmaintainable Perl file. This is why I created my own tool, which is, I think, more accurate and fine-tuned. IWLA is my most actively developed personal project, but it's not the only old one. I also developed gPass (12 years old), KissCount (15 years old), Denote (10 years old) and Dynastie (13 years old). For these projects I only do maintenance, but I still use them a lot. KissCount, I think, should be rewritten from scratch with a cleaner architecture. For Dynastie, there is no real alternative: every current static site generator has a dynamic/web frontend, but I think the backend should be migrated to something else (faster). Denote is quite simple and perfect for my needs.

Going back to IWLA, the changelog is:

Core

  • Awstats data updated (8.0)
  • Sanitize HTTP requests before analysis
  • Fix potential division by 0

Plugins

  • Add rule for robots: forbid "1 page and 1 hit"
  • Try to detect robots by "compatible" strings
  • Move feeds and reverse_dns plugins from post_analysis to pre_analysis
  • Move reverse DNS core management into iwla.py

HTML

  • Add domain and number of subscribers for feed parser

Config

  • Add "multimedia_re" filter to detect multimedia files by regular expression
  • Add "no_merge_feeds_parsers_list" configuration value
  • Add "robot_domains" configuration value
  • Add "ignore_url" configuration value

A demo instance (for forge.soutade.fr) is available here

How to do polymorphism in C?

Thursday, 31 October 2024 | Written by Grégory Soutadé

At work, I had to design a code architecture with type polymorphism in the C language. The idea is very basic: one header with common functions and multiple backend implementations. At compile time, we decide which implementation is used. This can be achieved in a very elegant way using a not so well known C feature: forward declaration.

First, a quick recap:

Here is a declaration of a function (usually in a header):

int my_func(void); 

Here is a definition of a function (usually in a .c file):

int my_func(void) { return 4; }

This is the same for structures.

Good solution

When compiling, the compiler checks that types match their declarations, but it needs the definition only when the object itself is manipulated. So we can create an opaque structure (let's say struct my_struct_s) that can have multiple implementations through its pointer type:

public_header.h

#ifndef _PUBLIC_HEADER_H_
#define _PUBLIC_HEADER_H_

/* Opaque type "my_struct_s" */
struct my_struct_s;
typedef struct my_struct_s* my_struct_t;

my_struct_t init(void);
void do_something(my_struct_t param);
void print_my_struct_t(my_struct_t param);
void delete(my_struct_t param);

my_struct_t init2(void);
void do_something2(my_struct_t param);
void print_my_struct_t2(my_struct_t param);
void delete2(my_struct_t param);

#endif

And two private implementations:

private.c

#include "stdlib.h"
#include "stdio.h"

#include "public_header.h"

/* Private implementation */
struct my_struct_s
{
    int member_i;
};

my_struct_t init(void)
{
    my_struct_t res;

    res = malloc(sizeof(*res));
    res->member_i = 0;

    return res;
}

void do_something(my_struct_t param)
{
    param->member_i++;
}

void print_my_struct_t(my_struct_t param)
{
    printf("I'm an integer with value %d\n",
           param->member_i);
}

void delete(my_struct_t param)
{
    free(param);
}

private2.c

#include "stdlib.h"
#include "stdio.h"

#include "public_header.h"

/* Private implementation */
struct my_struct_s
{
    char member_c;
};

my_struct_t init2(void)
{
    my_struct_t res;

    res = malloc(sizeof(*res));
    res->member_c = 'a';

    return res;
}

void do_something2(my_struct_t param)
{
    param->member_c++;
}

void print_my_struct_t2(my_struct_t param)
{
    printf("I'm a character with value '%c'\n",
           param->member_c);
}

void delete2(my_struct_t param)
{
    free(param);
}

In main.c

#include "public_header.h"

int main()
{
    my_struct_t var;

    var = init();
    do_something(var);
    print_my_struct_t(var);
    delete(var);

    var = init2();
    do_something2(var);
    print_my_struct_t2(var);
    delete2(var);

    return 0;
}

In this example, both implementations are present in the output program. But we can also keep only one implementation, selected at compile time, and thus use the same function names in both private.c and private2.c.

This example works because my_struct_t is a pointer to struct my_struct_s. So the type is checked correctly, and the pointed-to value doesn't matter unless an operation like increment, decrement or dereference is performed on it. For example, in main.c:

struct my_struct_s var2;

Will generate an error:

error: storage size of ‘var2’ isn’t known

Bad solution

Another solution for polymorphism is

typedef void* my_struct_t;

But I do not recommend writing this, because in that case the pointer type is not checked: void* is too generic and matches all pointer types. The following code compiles without warnings and can lead to type confusion errors!

#include "stdio.h"
#include "public_header.h"

typedef void* my_struct_t2;

static void print_string(char* a)
{
    printf("Value of param '%s'\n", a);
}

int main()
{
    my_struct_t2 var;
    var = init();
    do_something(var);

    print_string(var);

    delete(var);

    return 0;
}

NB: In my examples, I use "stdio.h" only because "<" and ">" are stripped by the syntax highlighter.

IWLA 0.7

Sunday, 17 March 2024 | Written by Grégory Soutadé

IWLA screenshot

Here is the work done for version 0.7 of IWLA (Intelligent Web Log Analyzer written in Python) over the last year and a half:

Core

  • Awstats data updated (7.9)
  • Remove detection from awstats dataset for browser
  • Don't analyze referer for non viewed hits/pages
  • Remove all trailing slashes of URL before starting analysis
  • Improve page/hit detection
  • Main key for visits is now "remote_ip" and not "remote_addr"
  • Add IP type plugin to support IPv4 and IPv6
  • --display-only switch now takes an argument (month/year); a new analysis is not necessary
  • Add --disable-display option

Plugins

  • Geo IP plugin updated (use of ip-api.com)
  • Update robot detection
  • Display visitor IP is now a filter
  • Add subdomains plugin

HTML

  • Generate HTML part in dry run mode (but don't write it to disk)
  • Set lang value in generated HTML page
  • Bugfix: flags management for feeds display
  • New way to display global statistics: links in month names instead of a "Details" button

Config

  • Add no_referrer_domains list to defaut_conf for websites that define this policy
  • Add excluded domain option
  • Set count_hit_only_visitors to False by default

A demo instance (for forge.soutade.fr) is available here

Tip: Fight against SPAM in comments

Sunday, 29 October 2023 | Written by Grégory Soutadé

What a surprise, when you manage a server, to find your mailbox full of mails in the morning telling you there is an issue! It starts like a bad day... First thing: connect to the server and blacklist the attacker's IP. This one was not from China or Russia, but from Germany! It tried code and SQL injection on all web pages, with a delay between each bunch of requests to stay under the radar. Fortunately, all dynamic web pages on my server are behind a login form. The public part uses only statically generated HTML pages. This is done by my static blog generator, Dynastie. It's a 10 year old project written in Python/Django. If I had to write it again, I would use Python templates and not XML, but it has been written especially for my needs and fits them perfectly. So, even if the UI is very basic, the rendering is aux petits oignons.

One of its great features (not available in other static generators), besides dynamic post management, is dynamic comment support. Unfortunately, a website offering public comments without registration is a target for spammers. My automatic comment filtering had worked well since 2014, but it was bypassed this week. Here is how I fixed it.

As robots don't load CSS and JavaScript resources, we can play with hidden fields in the comment form and do checks on the web server side. So I added a hidden field which is filled in by JavaScript when the user presses the "Comment" button. The value set is a timestamp plus a magic number, which is then checked by the server. So, if the spammer doesn't run JavaScript, it'll be blocked! Of course, this trick is very easy to break, and a spammer can easily bypass it with a smart/targeted robot or by doing manual SPAM. In that case, the only solutions are a complex captcha/registration step and/or manual comment validation. But those require more complex modules and more work from both parties (user and webmaster), which is overkill for a small website.
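For illustration, the server-side check boils down to a few lines. The sketch below uses made-up values (the real magic number and validity window are private) and shell only to show the logic from both sides:

```shell
#!/bin/sh
# Sketch only: hypothetical magic number and validity window.
MAGIC=421337
WINDOW=3600   # accept tokens at most one hour old

# Server side: recover the timestamp from the hidden field and make sure
# it is plausible (not in the future, not too old).
check_token()
{
    ts=$(( $1 - MAGIC ))
    now=$(date +%s)
    if [ $(( now - ts )) -ge 0 ] && [ $(( now - ts )) -le $WINDOW ]; then
        echo "accept"
    else
        echo "reject"
    fi
}

# What the JavaScript side computes when the user clicks "Comment":
token=$(( $(date +%s) + MAGIC ))

check_token "$token"   # prints "accept": a browser running JavaScript passes
check_token 1234       # prints "reject": a robot leaving garbage is blocked
```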