Every response from /comprendre is shaped by the training data. The model's 0.726 AUROC, the trust_score breakdown, the confusion between negociation and benchmark — all trace back to how the 700 rows distribute across the 8-feature space.
10 files · ~7,090 rows · 25 features · data/v1_leaking/
The first attempt. 25 features including mentions_reclamation, mentions_devis, mentions_negociation — booleans that are 1-to-1 with the target class. Snake got 100% accuracy across all 10 cycles because the answer was in the features.
| Metric | Value | Why |
|---|---|---|
| Accuracy | 100% | Label leaking — trivial |
| AUROC | 1.000 | No real learning |
| Real-life value | Zero | These features don't exist in production |
Lesson: Inspect feature distributions before celebrating perfect scores. A model that memorizes labels isn't classifying.
Not used. Not deployed. Kept as a cautionary example.
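A check along these lines would have flagged the v1 problem before training. This is a minimal sketch, not the project's actual tooling: the function name `leaking_features`, the `task` column name, and the sample rows are all hypothetical. It reports any feature value that appears for exactly one class and covers every row of that class, which is what "1-to-1 with the target class" means in practice.

```python
from collections import defaultdict

def leaking_features(rows, target):
    """Map each leaking feature to the (value, label) pairs that give the label away."""
    leaks = {}
    features = [name for name in rows[0] if name != target]
    for feat in features:
        labels_by_value = defaultdict(set)
        values_by_label = defaultdict(set)
        for row in rows:
            labels_by_value[row[feat]].add(row[target])
            values_by_label[row[target]].add(row[feat])
        for value, labels in labels_by_value.items():
            if len(labels) != 1:
                continue  # this value is shared by several classes: no leak
            (label,) = labels
            # The value is pure; it leaks only if the class never takes another value.
            if values_by_label[label] == {value}:
                leaks.setdefault(feat, []).append((value, label))
    return leaks

# Hypothetical rows: mentions_reclamation mirrors the label, has_montant does not.
sample_rows = [
    {"mentions_reclamation": 1, "has_montant": 0, "task": "reclamation_client"},
    {"mentions_reclamation": 1, "has_montant": 1, "task": "reclamation_client"},
    {"mentions_reclamation": 0, "has_montant": 1, "task": "controle_facture"},
    {"mentions_reclamation": 0, "has_montant": 0, "task": "demande_devis"},
    {"mentions_reclamation": 0, "has_montant": 1, "task": "demande_devis"},
]

print(leaking_features(sample_rows, "task"))
```

On the sample rows only `mentions_reclamation` is reported, because `has_montant` takes both values within several classes.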
10 files · ~7,090 rows · 10 features · data/v2_text/
Label-leaking booleans removed. Features derived from content: has_montant, has_ref, has_greeting, has_urgency, word_count. Plus keywords (extracted text) and contenu (full generated text). Ambiguity ramped from 0% to 60% across cycles.
| Metric | Value | Why |
|---|---|---|
| Accuracy | 100% | keywords still contain unique headers per class |
| AUROC | 1.000 | Snake finds FACTURE/RECLAMATION via substring |
| Real-life value | Low | Feature engineering pattern is reusable, data is not |
Lesson: Text features from templates carry the template's fingerprint, not the domain's signal. Generated text with fixed headers (FACTURE, RECLAMATION) is still leaking — just through substrings instead of booleans.
The feature derivation pattern (has_montant from EUR regex, has_ref from FA-/CMD- patterns) became the regex fallback extractor in extraction.py. The data didn't train the model, but the engineering informed the extraction layer.
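The derivation described above can be sketched as follows. The two regular expressions are assumptions written for illustration; the real patterns in extraction.py are not shown here and may differ (in particular, the digits-after-prefix reference format is a guess). The function name `regex_features` is also hypothetical.

```python
import re

# Assumed pattern: a number (with optional space/dot/comma groups) followed by EUR or €.
MONTANT_RE = re.compile(r"\d+(?:[ .,]\d+)*\s*(?:EUR|€)", re.IGNORECASE)
# Assumed pattern: an FA- or CMD- prefix followed by digits.
REF_RE = re.compile(r"\b(?:FA|CMD)-\d+\b")

def regex_features(text: str) -> dict:
    """Derive 0/1 features from raw text using regex matches only."""
    return {
        "has_montant": int(bool(MONTANT_RE.search(text))),
        "has_ref": int(bool(REF_RE.search(text))),
    }

print(regex_features("Facture FA-2024 pour 1 250,00 EUR"))
print(regex_features("Bonjour, pouvez-vous me rappeler ?"))
```

The first call sets both flags to 1, the second leaves both at 0. A fallback of this kind is cheap and deterministic, which is why it suits the case where the Haiku extraction is unavailable.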
1 file · 700 rows · 8 features · data/dispatch_production/
No text. No keywords. Eight categorical/boolean features with hand-tuned probability distributions that create genuine overlap between classes. This is the dataset behind the live dispatch model.
| Feature | Type | Why it matters |
|---|---|---|
| document_type | facture/email/message/formulaire | Primary structural signal |
| has_refs_fournisseur | 0/1 | Separates facture/benchmark from email/devis |
| has_montant | 0/1 | Financial presence: facture/negociation vs prospect |
| has_client_ref | 0/1 | Order/invoice reference, strong for reclamation |
| has_produit_mention | 0/1 | Glass products named: devis, facture, benchmark |
| ton | formel/informel/urgent | Separates reclamation (urgent) from benchmark (formel) |
| has_objection | 0/1 | Key discriminator: reclamation + negociation |
| canal | portail/email/api_erp | Entry channel: factures come via ERP, prospects via portail |
| Class | Precision | Recall | F1 | AUROC |
|---|---|---|---|---|
| controle_facture | 0.625 | 0.500 | 0.556 | 0.747 |
| classification_email | 0.529 | 0.450 | 0.487 | 0.680 |
| demande_devis | 0.375 | 0.450 | 0.409 | 0.719 |
| reclamation_client | 0.765 | 0.650 | 0.703 | 0.826 |
| scoring_prospect | 0.529 | 0.450 | 0.487 | 0.659 |
| benchmark_offre | 0.409 | 0.450 | 0.429 | 0.721 |
| negociation_offre | 0.296 | 0.400 | 0.340 | 0.730 |
Lesson: Honest features produce honest metrics. 47.9% accuracy on 7 classes with 8 categorical/boolean features reflects how hard the problem really is. The model is learning signal (roughly 3.3x the 14.3% random baseline), but the feature space is too sparse to fully separate the classes.
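The headline numbers can be recomputed from the per-class table. The sketch below takes the unweighted (macro) mean of recall and AUROC over the seven classes. One assumption is worth flagging: macro recall equals overall accuracy only when every class has the same number of test examples, which the table does not state.

```python
# (recall, AUROC) per class, copied from the table above.
per_class = {
    "controle_facture": (0.500, 0.747),
    "classification_email": (0.450, 0.680),
    "demande_devis": (0.450, 0.719),
    "reclamation_client": (0.650, 0.826),
    "scoring_prospect": (0.450, 0.659),
    "benchmark_offre": (0.450, 0.721),
    "negociation_offre": (0.400, 0.730),
}

n_classes = len(per_class)
macro_recall = sum(recall for recall, _ in per_class.values()) / n_classes
macro_auroc = sum(auroc for _, auroc in per_class.values()) / n_classes
random_baseline = 1 / n_classes          # uniform guessing over 7 classes
lift = macro_recall / random_baseline    # how far above chance the model sits

print(f"macro recall {macro_recall:.3f}, macro AUROC {macro_auroc:.3f}, lift {lift:.2f}x")
```

The means come out at 0.479 and 0.726, matching the 47.9% and the 0.726 AUROC quoted on this page, with a lift of about 3.35 over chance.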
This dataset trains the model that runs every /comprendre call. Here's the chain:
User sends text
→ extraction.py derives the same 8 features (from Haiku or regex)
→ classification.py loads dispatch_model.json (trained on this data)
→ Snake.get_prediction() returns the task type
→ Snake.get_probability() feeds the trust_score
→ Snake.get_audit() becomes the XAI dispatch_audit
→ response includes prediction + probabilities + trust + audit
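The chain above can be sketched in a few lines. Only the three method names (`get_prediction`, `get_probability`, `get_audit`) come from the text; their signatures, the `StubSnake` class standing in for the loaded model, and the `comprendre` function are assumptions made for illustration. The trust computation is left out of this sketch.

```python
class StubSnake:
    """Stand-in for the trained model. Method signatures are assumed, not documented."""

    def __init__(self, probabilities):
        self._probabilities = probabilities

    def get_probability(self, features):
        # A real model would compute this from the features; the stub returns fixed values.
        return dict(self._probabilities)

    def get_prediction(self, features):
        probabilities = self.get_probability(features)
        return max(probabilities, key=probabilities.get)

    def get_audit(self, features):
        return "features used: " + ", ".join(sorted(features))


def comprendre(features, model):
    """Assemble a response from already-extracted features (hypothetical handler)."""
    return {
        "prediction": model.get_prediction(features),
        "probabilities": model.get_probability(features),
        "dispatch_audit": model.get_audit(features),
    }


features = {"ton": "urgent", "has_objection": 1, "has_client_ref": 1}
model = StubSnake({"reclamation_client": 0.6, "negociation_offre": 0.3, "benchmark_offre": 0.1})
response = comprendre(features, model)
print(response["prediction"])
```

Keeping extraction and classification behind separate calls means the same handler works whether the features came from Haiku or from the regex fallback.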
The trust_score reflects data quality. When Snake is confident (routing_confidence 40/40), it means the input's 8 features land in a region where the training data had a clear majority. When it's uncertain (routing_confidence 20/40), the features land in an overlap zone between classes — exactly the zones where negociation and benchmark share the same profile in the training data.
All three datasets are synthetic. The production model works because the 8 features are realistic — but the distributions are hand-tuned, not learned from real Monce data.
Today: 700 synthetic rows, 8 features → AUROC 0.726
trust_score range: 60-95
Step 1: 1000+ real labeled interactions from Monce ERP/email/portail
Same 8 features, real distributions
Expected: AUROC 0.80-0.85, trust calibration improves
Step 2: Feature expansion 8 → 20
Add (among others): word_count, sender_domain, n_product_lines, montant_range,
has_deadline, has_attachment, n_entities, sentiment
Expected: AUROC 0.85-0.90, trust_score becomes more granular
Step 3: Ensemble (Claude + Snake)
Claude's tache_detectee + Snake's prediction
Agreement → high trust, disagreement → flag for review
Expected: AUROC 0.90-0.95
The bottleneck is labeled data. The model, the extraction, the trust scoring — all scale automatically once real data flows in.
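The Step 3 rule (agreement means high trust, disagreement means review) is simple enough to sketch. The function name `ensemble_decision` and the shape of the returned dictionary are hypothetical; the plan above only states the rule itself.

```python
from typing import Optional, TypedDict


class EnsembleResult(TypedDict):
    task: Optional[str]
    trust: str
    needs_review: bool


def ensemble_decision(claude_task: str, snake_task: str) -> EnsembleResult:
    """Combine the two predictions: accept on agreement, flag on disagreement."""
    agree = claude_task == snake_task
    return {
        "task": snake_task if agree else None,  # no automatic routing when they differ
        "trust": "high" if agree else "low",
        "needs_review": not agree,
    }


print(ensemble_decision("demande_devis", "demande_devis"))
print(ensemble_decision("negociation_offre", "benchmark_offre"))
```

Returning no task on disagreement is a design choice in this sketch: it forces the review path instead of silently picking one of the two models.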
| Trust component | What it measures | Data dependency |
|---|---|---|
| routing_confidence (40pts) | Snake's top-class probability | Directly from model trained on this data. Higher with cleaner class boundaries. |
| probability_margin (20pts) | Gap between #1 and #2 class | Wide margin = input falls in a pure region of training data. Narrow = overlap zone. |
| extraction_quality (20pts) | Haiku vs regex | Not data-dependent. Measures extraction reliability. |
| feature_richness (10pts) | How many features are informative | Training data defines what "informative" looks like. More features activated = more signal for the model. |
| input_signal (10pts) | Text length + structure | Proxy for extraction coverage. Longer text = more features extracted = model sees richer input. |
Better training data → sharper class boundaries → higher routing_confidence on real inputs → higher trust_scores across the board. The trust_score is a thermometer for data quality as much as prediction quality.
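The point budget in the table sums to 100 (40 + 20 + 20 + 10 + 10). The sketch below shows one way the components could combine. The linear scaling of each component is an assumption, as are the function name `trust_components` and the convention that the three non-model inputs arrive as fractions between 0 and 1; the actual scoring rules may be bucketed or nonlinear.

```python
# Point budget per component, taken from the table above.
WEIGHTS = {
    "routing_confidence": 40,
    "probability_margin": 20,
    "extraction_quality": 20,
    "feature_richness": 10,
    "input_signal": 10,
}

def trust_components(probabilities, extraction_quality, feature_richness, input_signal):
    """Return points per component, assuming each scales linearly with its fraction."""
    ranked = sorted(probabilities.values(), reverse=True)
    top, runner_up = ranked[0], ranked[1]
    fractions = {
        "routing_confidence": top,             # top-class probability
        "probability_margin": top - runner_up, # gap between #1 and #2 class
        "extraction_quality": extraction_quality,
        "feature_richness": feature_richness,
        "input_signal": input_signal,
    }
    return {name: WEIGHTS[name] * fraction for name, fraction in fractions.items()}

points = trust_components(
    {"negociation_offre": 0.5, "benchmark_offre": 0.25, "demande_devis": 0.25},
    extraction_quality=1.0,
    feature_richness=0.5,
    input_signal=0.5,
)
trust_score = sum(points.values())
print(points, trust_score)
```

In this example the top class holds only half the probability mass, so the model-dependent components contribute 25 of their 60 possible points, which is the overlap-zone behaviour described above.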