
/data // Training Datasets

Three datasets, one lesson: the model is only as good as the features it trains on.

How data becomes product

Synthetic data (generate_data.py)
  → NDJSON: 700 rows, 8 features, 7 classes
  → Snake SAT training (train.py): 15 layers, industrial profile
  → dispatch_model.json, loaded at startup
  → POST /comprendre → extraction → Snake classifies → trust_score → response

Every response from /comprendre is shaped by the training data. The model's 0.726 AUROC, the trust_score breakdown, the confusion between negociation and benchmark — all trace back to how the 700 rows distribute across the 8-feature space.

The three datasets

v1: Label-Leaking Features [DEPRECATED]

10 files · ~7,090 rows · 25 features · data/v1_leaking/

The first attempt. 25 features including mentions_reclamation, mentions_devis, mentions_negociation — booleans that are 1-to-1 with the target class. Snake got 100% accuracy across all 10 cycles because the answer was in the features.

| Metric | Value | Why |
|---|---|---|
| Accuracy | 100% | Label leaking, trivial |
| AUROC | 1.000 | No real learning |
| Real-life value | Zero | These features don't exist in production |

Lesson: Inspect feature distributions before celebrating perfect scores. A model that memorizes labels isn't classifying.
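A check of this kind can be sketched in a few lines. The version below is a minimal illustration, not the project's tooling: it flags any boolean feature that is set for every row of one class and for no row of any other class. The target column name `label` and the toy rows are assumptions made for the example.

```python
from collections import defaultdict

def find_leaking_features(rows, target="label"):
    """Return {feature: class} for features that are 1 for exactly one class
    and for every row of that class (a 1-to-1 leak)."""
    classes_when_on = defaultdict(set)  # feature -> classes seen with value 1
    on_count = defaultdict(int)         # (feature, class) -> rows with value 1
    class_count = defaultdict(int)      # class -> total rows
    for row in rows:
        cls = row[target]
        class_count[cls] += 1
        for name, value in row.items():
            if name != target and value == 1:
                classes_when_on[name].add(cls)
                on_count[(name, cls)] += 1
    leaks = {}
    for name, classes in classes_when_on.items():
        if len(classes) == 1:
            (cls,) = classes
            if on_count[(name, cls)] == class_count[cls]:
                leaks[name] = cls
    return leaks

# Toy rows, made up for illustration; "label" is an assumed column name.
rows = [
    {"mentions_reclamation": 1, "mentions_devis": 0, "has_montant": 1, "label": "reclamation_client"},
    {"mentions_reclamation": 1, "mentions_devis": 0, "has_montant": 0, "label": "reclamation_client"},
    {"mentions_reclamation": 0, "mentions_devis": 1, "has_montant": 1, "label": "demande_devis"},
    {"mentions_reclamation": 0, "mentions_devis": 1, "has_montant": 0, "label": "demande_devis"},
]
leaks = find_leaking_features(rows)
print(leaks)
```

On these toy rows the two mentions_* booleans are flagged and has_montant is not, because has_montant appears in both classes.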

Impact on product: none

Not used. Not deployed. Kept as a cautionary example.

v2: Text + Structural Features [EXPLORATION]

10 files · ~7,090 rows · 10 features · data/v2_text/

Label-leaking booleans removed. Features derived from content: has_montant, has_ref, has_greeting, has_urgency, word_count. Plus keywords (extracted text) and contenu (full generated text). Ambiguity ramped from 0% to 60% across cycles.

| Metric | Value | Why |
|---|---|---|
| Accuracy | 100% | keywords still contain unique headers per class |
| AUROC | 1.000 | Snake finds FACTURE/RECLAMATION via substring |
| Real-life value | Low | Feature engineering pattern is reusable, data is not |

Lesson: Text features from templates carry the template's fingerprint, not the domain's signal. Generated text with fixed headers (FACTURE, RECLAMATION) is still leaking — just through substrings instead of booleans.

Impact on product: indirect

The feature derivation pattern (has_montant from EUR regex, has_ref from FA-/CMD- patterns) became the regex fallback extractor in extraction.py. The data didn't train the model, but the engineering informed the extraction layer.
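A regex fallback of this shape might look like the sketch below. The two patterns are assumptions written for illustration; the actual expressions in extraction.py are not shown on this page and may differ.

```python
import re

# ASSUMED patterns, for illustration only.
# An amount: digits, optional decimals, then "EUR" or the euro sign.
MONTANT_RE = re.compile(r"\d+(?:[.,]\d{1,2})?\s?(?:EUR\b|€)", re.IGNORECASE)
# A reference: "FA-" or "CMD-" followed by digits.
REF_RE = re.compile(r"\b(?:FA|CMD)-\d+\b")

def regex_features(text: str) -> dict:
    """Derive has_montant and has_ref from raw text using regexes only."""
    return {
        "has_montant": int(bool(MONTANT_RE.search(text))),
        "has_ref": int(bool(REF_RE.search(text))),
    }

print(regex_features("Facture FA-2024 pour 1250,00 EUR"))
print(regex_features("Bonjour, merci de me rappeler"))
```

Returning 0/1 integers rather than booleans keeps the output in the same encoding as the training features.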

dispatch_production: The Live Model [ACTIVE]

1 file · 700 rows · 8 features · data/dispatch_production/

No text. No keywords. Eight categorical/boolean features with hand-tuned probability distributions that create genuine overlap between classes. This is the dataset behind the live dispatch model.

| Feature | Type | Why it matters |
|---|---|---|
| document_type | facture/email/message/formulaire | Primary structural signal |
| has_refs_fournisseur | 0/1 | Separates facture/benchmark from email/devis |
| has_montant | 0/1 | Financial presence: facture/negociation vs prospect |
| has_client_ref | 0/1 | Order/invoice reference, strong for reclamation |
| has_produit_mention | 0/1 | Glass products named: devis, facture, benchmark |
| ton | formel/informel/urgent | Separates reclamation (urgent) from benchmark (formel) |
| has_objection | 0/1 | Key discriminator: reclamation + negociation |
| canal | portail/email/api_erp | Entry channel: factures come via ERP, prospects via portail |
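To make "hand-tuned probability distributions" concrete, here is a minimal sketch of how one synthetic row could be sampled and serialized as an NDJSON line. Every probability below is hypothetical, only two of the seven classes and four of the eight features are shown, and the target column name `label` is an assumption; the real values live in generate_data.py.

```python
import json
import random

# HYPOTHETICAL probabilities for two of the seven classes; the real hand-tuned
# values are not shown on this page.
PROFILES = {
    "reclamation_client": {
        "flags": {"has_objection": 0.85, "has_client_ref": 0.80, "has_montant": 0.40},
        "ton": {"formel": 0.2, "informel": 0.2, "urgent": 0.6},
    },
    "negociation_offre": {
        "flags": {"has_objection": 0.70, "has_client_ref": 0.35, "has_montant": 0.75},
        "ton": {"formel": 0.6, "informel": 0.3, "urgent": 0.1},
    },
}

def sample_row(label, rng):
    """Draw one row: each 0/1 flag is a Bernoulli draw, ton is a weighted choice."""
    profile = PROFILES[label]
    row = {name: int(rng.random() < p) for name, p in profile["flags"].items()}
    tones = list(profile["ton"])
    row["ton"] = rng.choices(tones, weights=[profile["ton"][t] for t in tones])[0]
    row["label"] = label  # assumed name for the target column
    return row

rng = random.Random(0)  # fixed seed so the output is reproducible
ndjson_lines = [json.dumps(sample_row(label, rng)) for label in PROFILES for _ in range(3)]
print("\n".join(ndjson_lines))
```

Because both profiles give has_objection a high probability, many sampled rows from the two classes look alike. That is the kind of overlap that keeps a classifier from reaching 100%.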
Accuracy 47.9% · AUROC 0.726 · F1 0.487 · 3.3x random

| Class | Precision | Recall | F1 | AUROC |
|---|---|---|---|---|
| controle_facture | 0.625 | 0.500 | 0.556 | 0.747 |
| classification_email | 0.529 | 0.450 | 0.487 | 0.680 |
| demande_devis | 0.375 | 0.450 | 0.409 | 0.719 |
| reclamation_client | 0.765 | 0.650 | 0.703 | 0.826 |
| scoring_prospect | 0.529 | 0.450 | 0.487 | 0.659 |
| benchmark_offre | 0.409 | 0.450 | 0.429 | 0.721 |
| negociation_offre | 0.296 | 0.400 | 0.340 | 0.730 |

Lesson: Honest features produce honest metrics. 47.9% accuracy on 7 classes with 8 categorical and boolean features reflects the real difficulty of the problem. The model is learning signal (3.3x random), but the feature space is too sparse to fully separate the classes.
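The headline numbers can be cross-checked from the per-class figures. The sketch below recomputes F1 from the reported precision and recall and compares accuracy against a uniform random guess over 7 classes. Treating the overall F1 as a macro average is an assumption; it happens to match the reported 0.487.

```python
# (precision, recall) per class, copied from the per-class results on this page.
per_class = {
    "controle_facture": (0.625, 0.500),
    "classification_email": (0.529, 0.450),
    "demande_devis": (0.375, 0.450),
    "reclamation_client": (0.765, 0.650),
    "scoring_prospect": (0.529, 0.450),
    "benchmark_offre": (0.409, 0.450),
    "negociation_offre": (0.296, 0.400),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Macro average: unweighted mean of the per-class F1 scores (assumed averaging).
macro_f1 = sum(f1(p, r) for p, r in per_class.values()) / len(per_class)

random_baseline = 1 / len(per_class)  # uniform guess over 7 classes, about 14.3%
lift = 0.479 / random_baseline        # about 3.35, reported on this page as 3.3x

print(f"macro F1 = {macro_f1:.3f}, lift over random = {lift:.2f}x")
```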

Impact on product: direct

This dataset trains the model that runs every /comprendre call. Here's the chain:

User sends text
  → extraction.py derives the same 8 features (from Haiku or regex)
    → classification.py loads dispatch_model.json (trained on this data)
      → Snake.get_prediction() returns the task type
      → Snake.get_probability() feeds the trust_score
      → Snake.get_audit() becomes the XAI dispatch_audit
        → response includes prediction + probabilities + trust + audit

The trust_score reflects data quality. When Snake is confident (routing_confidence 40/40), it means the input's 8 features land in a region where the training data had a clear majority. When it's uncertain (routing_confidence 20/40), the features land in an overlap zone between classes — exactly the zones where negociation and benchmark share the same profile in the training data.
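The classification part of that chain can be sketched with a stand-in model. Only the three method names (get_prediction, get_probability, get_audit) come from this page; the signatures, return shapes, and all values in the stub are assumptions, and trust scoring is left out of the sketch.

```python
class StubSnake:
    """Stand-in for the loaded model. Only the three method names come from
    this page; their signatures and return shapes are assumptions."""

    def get_prediction(self, features):
        return "reclamation_client"

    def get_probability(self, features):
        return {"reclamation_client": 0.62, "negociation_offre": 0.23, "benchmark_offre": 0.15}

    def get_audit(self, features):
        return "has_objection=1, ton=urgent"


def build_response(features, model):
    """Assemble the classification part of a /comprendre response."""
    return {
        "prediction": model.get_prediction(features),
        "probabilities": model.get_probability(features),
        "dispatch_audit": model.get_audit(features),
    }

# Hypothetical extracted features for one incoming message.
features = {"document_type": "email", "has_objection": 1, "ton": "urgent", "canal": "email"}
response = build_response(features, StubSnake())
print(response["prediction"])
```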

From synthetic to real

All three datasets are synthetic. The production model works because the 8 features are realistic — but the distributions are hand-tuned, not learned from real Monce data.

The upgrade path

Today:  700 synthetic rows, 8 features → AUROC 0.726
        trust_score range: 60-95

Step 1: 1000+ real labeled interactions from Monce ERP/email/portail
        Same 8 features, real distributions
        Expected: AUROC 0.80-0.85, trust calibration improves

Step 2: Feature expansion 8 → 20
        Add: word_count, sender_domain, n_product_lines, montant_range,
             has_deadline, has_attachment, n_entities, sentiment
        Expected: AUROC 0.85-0.90, trust_score becomes more granular

Step 3: Ensemble (Claude + Snake)
        Claude's tache_detectee + Snake's prediction
        Agreement → high trust, disagreement → flag for review
        Expected: AUROC 0.90-0.95

The bottleneck is labeled data. The model, the extraction, the trust scoring — all scale automatically once real data flows in.
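The agreement rule in Step 3 is simple enough to sketch directly. The function name, the output field names, and the choice to return no task on disagreement are all assumptions; the page only specifies that agreement means high trust and disagreement means a flag for review.

```python
def combine(claude_task: str, snake_task: str) -> dict:
    """Compare the two predictions: agreement means high trust, otherwise review."""
    agree = claude_task == snake_task
    return {
        "task": snake_task if agree else None,  # assumed: no task when they disagree
        "candidates": sorted({claude_task, snake_task}),
        "trust": "high" if agree else "review",
        "needs_review": not agree,
    }

print(combine("reclamation_client", "reclamation_client"))
print(combine("negociation_offre", "benchmark_offre"))
```

Keeping both candidates in the output gives a human reviewer the two competing labels rather than just a flag.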

How trust_score connects to data

| Trust component | What it measures | Data dependency |
|---|---|---|
| routing_confidence (40 pts) | Snake's top-class probability | Directly from model trained on this data. Higher with cleaner class boundaries. |
| probability_margin (20 pts) | Gap between #1 and #2 class | Wide margin = input falls in a pure region of training data. Narrow = overlap zone. |
| extraction_quality (20 pts) | Haiku vs regex | Not data-dependent. Measures extraction reliability. |
| feature_richness (10 pts) | How many features are informative | Training data defines what "informative" looks like. More features activated = more signal for the model. |
| input_signal (10 pts) | Text length + structure | Proxy for extraction coverage. Longer text = more features extracted = model sees richer input. |

Better training data → sharper class boundaries → higher routing_confidence on real inputs → higher trust_scores across the board. The trust_score is a thermometer for data quality as much as prediction quality.
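As a rough illustration of how the five components add up to a 100-point score, the sketch below scales each component linearly from a signal in [0, 1] to its point budget. The point budgets come from this page; the linear scaling and the example signal values are assumptions, and the real scoring may use thresholds or tiers instead.

```python
# Point budgets per component, as listed on this page (they sum to 100).
MAX_POINTS = {
    "routing_confidence": 40,
    "probability_margin": 20,
    "extraction_quality": 20,
    "feature_richness": 10,
    "input_signal": 10,
}

def trust_score(signals: dict) -> dict:
    """Scale each signal (assumed to lie in [0, 1]) to its point budget and sum.
    Linear scaling is an assumption made for this sketch."""
    breakdown = {
        name: round(points * min(max(signals[name], 0.0), 1.0))
        for name, points in MAX_POINTS.items()
    }
    return {"trust_score": sum(breakdown.values()), "breakdown": breakdown}

# Hypothetical signals for an input that falls in an overlap zone.
result = trust_score({
    "routing_confidence": 0.5,
    "probability_margin": 0.5,
    "extraction_quality": 1.0,
    "feature_richness": 0.5,
    "input_signal": 0.8,
})
print(result)
```

Returning the breakdown alongside the total keeps the score auditable: a low total can be traced to the component that caused it.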
