/data // Training Datasets

Three datasets, one lesson: the model is only as good as the features it trains on.

How data becomes product

Synthetic data (generate_data.py)
↓
NDJSON — 700 rows, 8 features, 7 classes
↓
Snake SAT training (train.py) — 15 layers, industrial profile
↓
dispatch_model.json — loaded at startup
↓
POST /comprendre → extraction → Snake classifies → trust_score → response

Every response from /comprendre is shaped by the training data. The model's 0.726 AUROC, the trust_score breakdown, the confusion between negociation and benchmark — all trace back to how the 700 rows distribute across the 8-feature space.

The three datasets

v1 — Label-Leaking Features DEPRECATED

10 files · ~7,090 rows · 25 features · data/v1_leaking/

The first attempt. 25 features including mentions_reclamation, mentions_devis, mentions_negociation — booleans that are 1-to-1 with the target class. Snake got 100% accuracy across all 10 cycles because the answer was in the features.

Metric	Value	Why
Accuracy	100%	Label leaking — trivial
AUROC	1.000	No real learning
Real-life value	Zero	These features don't exist in production

Lesson: Inspect feature distributions before celebrating perfect scores. A model that memorizes labels isn't classifying.
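
One way to run that inspection is to look for feature values that only ever occur with a single class. The sketch below is a minimal illustration, not the project's tooling: the helper `pure_values`, the `label` column name, and the toy rows are all hypothetical.

```python
from collections import Counter, defaultdict

def pure_values(rows, feature, label_key="label", min_support=2):
    """Return (value, label) pairs where a feature value always co-occurs with one label.

    A pair in the result means that seeing this value gives the class away,
    which is the signature of a label-leaking feature.
    """
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[label_key]] += 1
    leaks = []
    for value, counts in by_value.items():
        # Exactly one label seen for this value, with enough rows to matter.
        if len(counts) == 1 and sum(counts.values()) >= min_support:
            leaks.append((value, next(iter(counts))))
    return leaks

# Toy rows (hypothetical): mentions_reclamation leaks, has_montant does not.
rows = [
    {"mentions_reclamation": 1, "has_montant": 1, "label": "reclamation_client"},
    {"mentions_reclamation": 1, "has_montant": 0, "label": "reclamation_client"},
    {"mentions_reclamation": 0, "has_montant": 1, "label": "controle_facture"},
    {"mentions_reclamation": 0, "has_montant": 0, "label": "demande_devis"},
    {"mentions_reclamation": 0, "has_montant": 1, "label": "demande_devis"},
    {"mentions_reclamation": 0, "has_montant": 0, "label": "controle_facture"},
]

print(pure_values(rows, "mentions_reclamation"))
print(pure_values(rows, "has_montant"))
```

A non-empty result on a well-supported value is a prompt to ask whether that feature would exist at inference time, not proof of leakage by itself.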

Impact on product: none

Not used. Not deployed. Kept as a cautionary example.

v2 — Text + Structural Features EXPLORATION

10 files · ~7,090 rows · 10 features · data/v2_text/

Label-leaking booleans removed. The features are now derived from content and include has_montant, has_ref, has_greeting, has_urgency, and word_count, plus keywords (extracted text) and contenu (full generated text). Ambiguity ramps from 0% to 60% across the cycles.

Metric	Value	Why
Accuracy	100%	keywords still contain unique headers per class
AUROC	1.000	Snake finds FACTURE/RECLAMATION via substring
Real-life value	Low	Feature engineering pattern is reusable, data is not

Lesson: Text features from templates carry the template's fingerprint, not the domain's signal. Generated text with fixed headers (FACTURE, RECLAMATION) is still leaking — just through substrings instead of booleans.

Impact on product: indirect

The feature derivation pattern (has_montant from EUR regex, has_ref from FA-/CMD- patterns) became the regex fallback extractor in extraction.py. The data didn't train the model, but the engineering informed the extraction layer.
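
A regex fallback of this kind can be sketched as follows. The source only says "EUR regex" and "FA-/CMD- patterns", so the exact patterns and the function name `regex_features` below are assumptions, not the contents of extraction.py.

```python
import re

# Assumed patterns: an amount followed by EUR or the euro sign,
# and a reference that starts with FA- or CMD- followed by digits.
MONTANT_RE = re.compile(r"\d+(?:[.,]\d+)?\s*(?:EUR|€)", re.IGNORECASE)
REF_RE = re.compile(r"\b(?:FA|CMD)-\d+\b")

def regex_features(text):
    """Derive two boolean features from raw text without calling an LLM."""
    return {
        "has_montant": int(bool(MONTANT_RE.search(text))),
        "has_ref": int(bool(REF_RE.search(text))),
    }

print(regex_features("Facture FA-1042, total 1250,00 EUR"))
print(regex_features("Bonjour, pouvez-vous me rappeler ?"))
```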

dispatch_production — The Live Model ACTIVE

1 file · 700 rows · 8 features · data/dispatch_production/

No text. No keywords. Eight categorical/boolean features with hand-tuned probability distributions that create genuine overlap between classes. This is the dataset behind the live dispatch model.

Feature	Values	Why it matters
`document_type`	facture/email/message/formulaire	Primary structural signal
`has_refs_fournisseur`	0/1	Separates facture/benchmark from email/devis
`has_montant`	0/1	Financial presence — facture/negociation vs prospect
`has_client_ref`	0/1	Order/invoice reference — strong for reclamation
`has_produit_mention`	0/1	Glass products named — devis, facture, benchmark
`ton`	formel/informel/urgent	Separates reclamation (urgent) from benchmark (formel)
`has_objection`	0/1	Key discriminator: reclamation + negociation
`canal`	portail/email/api_erp	Entry channel — factures come via ERP, prospects via portail
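
The "hand-tuned probability distributions" behind these features can be pictured as one profile per class: a probability for each boolean and a set of weights for each categorical. The sketch below shows the idea for two classes and three features only. Every number, the `label` key, and the function names are illustrative assumptions, not the values or code in generate_data.py.

```python
import json
import random

# Hypothetical profiles: P(feature = 1) for booleans, value weights for categoricals.
# The two classes are deliberately close, so sampled rows overlap.
PROFILES = {
    "negociation_offre": {
        "has_montant": 0.7,
        "has_objection": 0.6,
        "ton": {"formel": 0.6, "informel": 0.3, "urgent": 0.1},
    },
    "benchmark_offre": {
        "has_montant": 0.6,
        "has_objection": 0.3,
        "ton": {"formel": 0.7, "informel": 0.25, "urgent": 0.05},
    },
}

def sample_row(label, rng):
    """Draw one row for a class from its profile."""
    row = {"label": label}
    for feature, spec in PROFILES[label].items():
        if isinstance(spec, dict):
            values, weights = zip(*spec.items())
            row[feature] = rng.choices(values, weights=weights, k=1)[0]
        else:
            row[feature] = int(rng.random() < spec)
    return row

def generate(n_per_class, seed=0):
    """Generate a class-balanced list of rows; a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    return [sample_row(label, rng) for label in PROFILES for _ in range(n_per_class)]

rows = generate(100)
ndjson = "\n".join(json.dumps(row) for row in rows)  # one JSON object per line
```

Because the profiles overlap, identical feature vectors appear under both labels, which is what keeps a classifier from reaching a trivial 100%.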

Accuracy 47.9% · AUROC 0.726 · F1 0.487 · 3.3x random

Class	Precision	Recall	F1	AUROC
controle_facture	0.625	0.500	0.556	0.747
classification_email	0.529	0.450	0.487	0.680
demande_devis	0.375	0.450	0.409	0.719
reclamation_client	0.765	0.650	0.703	0.826
scoring_prospect	0.529	0.450	0.487	0.659
benchmark_offre	0.409	0.450	0.429	0.721
negociation_offre	0.296	0.400	0.340	0.730

Lesson: Honest features produce honest metrics. 47.9% accuracy on 7 classes with 8 categorical/boolean features reflects how hard the problem really is. The model is learning signal (3.3x random, where random guessing on 7 classes is about 14.3%), but the feature space is too sparse to fully separate the classes.
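
The headline figures can be cross-checked against the per-class table. The short script below takes the unweighted mean of each column and compares accuracy with the 1-in-7 chance level. The mean recall matching the reported accuracy is what one would expect from a class-balanced evaluation split; that balance is an inference from the numbers, not something the source states.

```python
# (precision, recall, f1, auroc) per class, copied from the table above.
per_class = {
    "controle_facture":     (0.625, 0.500, 0.556, 0.747),
    "classification_email": (0.529, 0.450, 0.487, 0.680),
    "demande_devis":        (0.375, 0.450, 0.409, 0.719),
    "reclamation_client":   (0.765, 0.650, 0.703, 0.826),
    "scoring_prospect":     (0.529, 0.450, 0.487, 0.659),
    "benchmark_offre":      (0.409, 0.450, 0.429, 0.721),
    "negociation_offre":    (0.296, 0.400, 0.340, 0.730),
}

n_classes = len(per_class)

def macro(column):
    """Unweighted mean of one metric column across classes."""
    return sum(values[column] for values in per_class.values()) / n_classes

macro_recall = macro(1)
macro_f1 = macro(2)
macro_auroc = macro(3)

chance = 1 / n_classes   # accuracy of uniform random guessing
lift = 0.479 / chance    # reported accuracy relative to chance

print(f"macro recall {macro_recall:.3f}, macro F1 {macro_f1:.3f}, macro AUROC {macro_auroc:.3f}")
print(f"chance {chance:.3f}, lift {lift:.2f}x")
```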

Impact on product: direct

This dataset trains the model that runs every /comprendre call. Here's the chain:

User sends text
  → extraction.py derives the same 8 features (from Haiku or regex)
    → classification.py loads dispatch_model.json (trained on this data)
      → Snake.get_prediction() returns the task type
      → Snake.get_probability() feeds the trust_score
      → Snake.get_audit() becomes the XAI dispatch_audit
        → response includes prediction + probabilities + trust + audit

The trust_score reflects data quality. When Snake is confident (routing_confidence 40/40), it means the input's 8 features land in a region where the training data had a clear majority. When it's uncertain (routing_confidence 20/40), the features land in an overlap zone between classes — exactly the zones where negociation and benchmark share the same profile in the training data.

From synthetic to real

All three datasets are synthetic. The production model works because the 8 features are realistic — but the distributions are hand-tuned, not learned from real Monce data.

The upgrade path

Today:  700 synthetic rows, 8 features → AUROC 0.726
        trust_score range: 60-95

Step 1: 1000+ real labeled interactions from Monce ERP/email/portail
        Same 8 features, real distributions
        Expected: AUROC 0.80-0.85, trust calibration improves

Step 2: Feature expansion 8 → 20
        Add: word_count, sender_domain, n_product_lines, montant_range,
             has_deadline, has_attachment, n_entities, sentiment
        Expected: AUROC 0.85-0.90, trust_score becomes more granular

Step 3: Ensemble (Claude + Snake)
        Claude's tache_detectee + Snake's prediction
        Agreement → high trust, disagreement → flag for review
        Expected: AUROC 0.90-0.95
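
The agreement rule in Step 3 is simple enough to sketch. This is only an illustration of the planned behavior: the function name `ensemble_decision` and the shape of the returned dictionary are hypothetical.

```python
def ensemble_decision(claude_task, snake_task):
    """Keep the label when both sources agree; otherwise flag the input for review."""
    agree = claude_task == snake_task
    return {
        "task": snake_task if agree else None,   # no automatic routing on disagreement
        "candidates": sorted({claude_task, snake_task}),
        "agreement": agree,
        "needs_review": not agree,
    }

print(ensemble_decision("negociation_offre", "negociation_offre"))
print(ensemble_decision("negociation_offre", "benchmark_offre"))
```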

The bottleneck is labeled data. The model, the extraction, the trust scoring — all scale automatically once real data flows in.

How trust_score connects to data

Trust component	What it measures	Data dependency
routing_confidence (40pts)	Snake's top-class probability	Directly from model trained on this data. Higher with cleaner class boundaries.
probability_margin (20pts)	Gap between #1 and #2 class	Wide margin = input falls in a pure region of training data. Narrow = overlap zone.
extraction_quality (20pts)	Haiku vs regex	Not data-dependent. Measures extraction reliability.
feature_richness (10pts)	How many features are informative	Training data defines what "informative" looks like. More features activated = more signal for the model.
input_signal (10pts)	Text length + structure	Proxy for extraction coverage. Longer text = more features extracted = model sees richer input.

Better training data → sharper class boundaries → higher routing_confidence on real inputs → higher trust_scores across the board. The trust_score is a thermometer for data quality as much as prediction quality.
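
To make the point budget concrete, here is a minimal sketch of how the five components could add up to a score out of 100. Only the component names and their maximum points come from the table above. The linear scaling, the half credit for regex extraction, and the function name `trust_breakdown` are assumptions for illustration, not the service's actual formula.

```python
# Maximum points per component, as listed in the table.
MAX_POINTS = {
    "routing_confidence": 40,
    "probability_margin": 20,
    "extraction_quality": 20,
    "feature_richness": 10,
    "input_signal": 10,
}

def trust_breakdown(probabilities, extractor, n_informative, n_features=8, input_signal=1.0):
    """Score each component as a fraction of its maximum (assumed linear), then sum."""
    ranked = sorted(probabilities.values(), reverse=True)
    top, second = ranked[0], ranked[1]
    fractions = {
        "routing_confidence": top,                  # top-class probability
        "probability_margin": top - second,         # gap between #1 and #2
        "extraction_quality": 1.0 if extractor == "haiku" else 0.5,  # assumed credit
        "feature_richness": n_informative / n_features,
        "input_signal": input_signal,               # assumed to arrive as a 0..1 score
    }
    breakdown = {
        name: round(MAX_POINTS[name] * min(max(fraction, 0.0), 1.0), 1)
        for name, fraction in fractions.items()
    }
    breakdown["trust_score"] = round(sum(breakdown.values()), 1)
    return breakdown

probs = {"negociation_offre": 0.5, "benchmark_offre": 0.25, "demande_devis": 0.25}
print(trust_breakdown(probs, extractor="haiku", n_informative=4, input_signal=0.5))
```

Under these assumptions, an input in an overlap zone loses points twice: once through a lower top-class probability and again through a narrower margin.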
