tfrere (HF Staff) committed
Commit 1f59ee1 · 2 Parent(s): 3050a37 dc8d285
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: 'Evaluation guidebook'
- short_desc: 'Understanding the tips and tricks of evaluating an LLM in 2025'
+ title: 'Evaluation Guidebook'
+ short_desc: 'How to properly evaluate LLMs in the modern age'
  emoji: 📝
  colorFrom: blue
  colorTo: indigo
app/src/content/article.mdx CHANGED
@@ -28,9 +28,6 @@ import TroubleshootingInference from "./chapters/troubleshooting/troubleshooting
  import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
  import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";

- - https://arxiv.org/abs/2109.02550
- - https://arxiv.org/abs/2511.21140
-
  <Intro />

  ## LLM basics to understand evaluation
@@ -94,8 +91,6 @@ Best (but rarest) metrics are functional or based on rule based verifiers (thoug

  ## Creating your own evaluation

-
-
  <DesigningAutomaticEvaluation />
app/src/content/assets/image/chat-templates-and-tokenisation.png ADDED

Git LFS Details

  • SHA256: 0a3e4762ba6d5ecb79519b2533eb739af571c284a1cae8332a28e81906fe018c
  • Pointer size: 131 Bytes
  • Size of remote file: 209 kB
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -20,8 +20,6 @@ When aggregating datasets, pay attention to whether

  <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>

- #### Creating a dataset manually
-
  <UsingHumanAnnotators />

  #### Creating a dataset synthetically
@@ -45,33 +43,6 @@ Once this is done, you can do an automatic validation by using a model from a di
  No matter how tempting it is to do everything automatically, you should always check your data at every step, to make sure your evaluations are of high quality. Evaluation is the name of the game and you need to use extremely good data.
  </Note>

- #### Choosing a prompt
- The prompt is going to define:
- - how much information is given to your model about the task
- - how this information is presented to your model.
-
- A prompt for a general MCQA or QA is usually made of some of the following:
- - a task prompt (optional): introduces your task.
- - a context: provides additional context for your question.
-   - *Eg: For a summarization or information extraction task, you could provide a content source*
- - a question: the actual core of your prompt.
- - in case of a multi choice evaluation, you can add options
- - connector words (`Question`, `Context`, `Choice`, ...)
-
- When defining your prompt, you need to be aware that:
- - even small changes in semantically equivalent prompts can make the results vary by quite a lot (see Section `Different prompt` in [Troubleshooting reproducibility](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-reproducibility.md)), and prompt formats might advantage or disadvantage specific models
-   - How to mitigate this:
-     - A costly way is to re-run the evaluation several times with prompt variations
-     - A less costly way is to run your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty
- - you can provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall
- - for a number of metrics, you want a very constrained generation or output.
-
- <Note title="Models can overfit prompt formats" emoji="⚠️" variant="warning">
-
- Recent research shows models can overfit specific prompt formats rather than learning the underlying task. [This paper](https://arxiv.org/abs/2407.07890) is great on the topic, showing notably how some models can be over-evaluated because they have overfitted the test set **format**.
- On the Open LLM Leaderboard 2, we've notably observed that Llama 3.2 and Qwen 2.5 are no longer following the format of the prompt provided in a few-shot setup for this reason.
- </Note>
-
  #### Managing contamination
  In general, you should assume that a dataset publicly available on the internet is or will be contaminated.

@@ -83,6 +54,21 @@ Solutions to mitigate this include:

  However, it's not because a dataset is contaminated that it won't still be interesting and have signal during training, as we saw in the ablations section.

+ ### Choosing a prompt
+ The prompt is going to define how much information is given to your model about the task, and how this information is presented to the model.
+
+ A prompt for a general MCQA or QA is usually made of some of the following:
+ - a task prompt (optional): introduces your task.
+ - a context: provides additional context for your question.
+   - *Eg: For a summarization or information extraction task, you could provide a content source*
+ - a question: the actual core of your prompt.
+ - in case of a multi choice evaluation, you can add options
+ - connector words (`Question`, `Context`, `Choice`, ...)
+
+ When defining your prompt, you need to be aware that even small changes in semantically equivalent prompts can make the results vary by quite a lot, and prompt formats might advantage or disadvantage specific models (see [this section](https://huggingface.co/spaces/OpenEvals/evaluation-guidebook#different-prompt)).
+
+ ➡️ This can be mitigated by re-running the evaluation several times with prompt variations (but it can be costly), or simply running your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty.
+ ➡️ You can also provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall.
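A minimal sketch of how these pieces can be assembled into an MCQA prompt (the field names and layout are illustrative, not a required format):

```python
# Illustrative MCQA prompt builder: task prompt + context + question + options,
# glued together with connector words ("Context", "Question", "Choice").
def build_mcqa_prompt(question: str, options: list[str], context: str = "", task_prompt: str = "") -> str:
    parts = []
    if task_prompt:
        parts.append(task_prompt)            # optional task introduction
    if context:
        parts.append(f"Context: {context}")  # optional source material
    parts.append(f"Question: {question}")
    for letter, option in zip("ABCD", options):
        parts.append(f"Choice {letter}: {option}")
    parts.append("Answer:")                  # constrains the expected output
    return "\n".join(parts)

print(build_mcqa_prompt(
    question="Which metric measures n-gram overlap with a reference translation?",
    options=["BLEU", "MCC", "Perplexity", "pass@k"],
    task_prompt="The following is a multiple choice question about evaluation.",
))
```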
 
  ### Choosing an inference method for your model
  You'll need to choose what kind of inference method you need.
@@ -122,8 +108,7 @@ However, nowadays most evaluations are generative: using generations (QA, questi

  If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
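To make the length normalization concrete, here is a minimal sketch of a length-normalized log-probability accuracy (the per-choice log-probabilities are assumed to already be summed over the choice tokens by your inference code):

```python
# Sketch: pick the choice with the best length-normalized log-probability,
# then average correctness over the whole set.
def lognorm_accuracy(samples: list[dict]) -> float:
    correct = 0
    for sample in samples:
        # each sample: {"choices": [...], "logprobs": [...], "gold_index": int}
        scores = [
            logprob / max(len(choice), 1)  # character-length normalization
            for choice, logprob in zip(sample["choices"], sample["logprobs"])
        ]
        predicted = scores.index(max(scores))
        correct += int(predicted == sample["gold_index"])
    return correct / len(samples)

samples = [
    {"choices": [" Paris", " London"], "logprobs": [-2.1, -7.3], "gold_index": 0},
    # here normalization flips the prediction compared to raw log-probabilities
    {"choices": [" yes", " no"], "logprobs": [-1.0, -0.8], "gold_index": 1},
]
print(lognorm_accuracy(samples))  # 0.5
```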

- If you're looking at generative evaluations, you'll need to decide if you compare generations as they are, or first normalize them with something. Then, you'll need to select what to use to score your prediction, and this is where it gets trickyyy, so let's jump to the next chapter specifically on this!
-
+ If you're looking at generative evaluations, this is where it gets trickyyy, so the next chapter is specifically on this!

  ## The hardest part of evaluation: Scoring free form text

@@ -137,15 +122,16 @@ The easiest but least flexible match based metrics are **exact matches** of toke
  The translation and summarisation fields have introduced automatic metrics which compare similarity through overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with humans at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap.
  Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).

- Once you have an accuracy score per sample, you can aggregate it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
+ Once you have an accuracy score per sample, you can **aggregate** it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).

  If your score is **binary**, look at the **precision** (critical when false positives are costly), **recall** (critical when missing positives is costly), **F1 score** (balances precision and recall, good for imbalanced data), or **MCC** (Matthews Correlation Coefficient, which works well with imbalanced datasets by considering all confusion matrix elements).
- If your score is **continuous**, you can use **mean squared error** (penalizes large errors but heavily weights outliers), **mean absolute error** (more balanced than MSE), or if you assume your data should follow a specific linear regression model, you can look at measures like the **R²** or correlation coefficients like **Pearson** (for linear relationships, assumes normality) or **Spearman** (for monotonic relationships without normality assumptions).
+ If your score is **continuous** (less likely though), you can use **mean squared error** (penalizes large errors but heavily weights outliers) or **mean absolute error** (more balanced than MSE). <Sidenote> If you assume your data should follow a specific linear regression model (for example if you are studying model calibration), you can look at measures like the **R²** or correlation coefficients like **Pearson** (for linear relationships, assumes normality) or **Spearman** (for monotonic relationships without normality assumptions). However, it's a bit out of scope here. </Sidenote>

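For instance, a per-sample exact-match score averaged over the whole set could look like this minimal sketch (the normalization step is just an example):

```python
# Sketch: per-sample exact match after a light normalization, then a mean aggregation.
def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def exact_match(prediction: str, reference: str) -> int:
    return int(normalize(prediction) == normalize(reference))

predictions = ["The answer is 42", "Paris ", "blue"]
references = ["the answer is 42", "Paris", "red"]

per_sample = [exact_match(p, r) for p, r in zip(predictions, references)]
aggregated = sum(per_sample) / len(per_sample)  # simple average over the set
print(per_sample, aggregated)                   # [1, 1, 0] 0.666...
```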
  More generally, when picking your metric and its aggregation, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc).
- <Sidenote>
- To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/). You'll also find a complete list of metrics and their uses in [this organisation](https://huggingface.co/evaluate-metric).
- </Sidenote>
+ <Note title="To go further">
+ - This [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/) covers some of the challenges of evaluating LLMs.
+ - If you're looking for metrics, you'll also find a good list with descriptions, score ranges and use cases in [this organisation](https://huggingface.co/evaluate-metric).
+ </Note>


  <Note title="Pros and cons of using automated metrics">
@@ -183,13 +169,14 @@ Normalizations can easily [be unfair if not designed well](https://huggingface.c

  They are also important for evaluation of predictions generated with chain of thought, or reasoning, as you'll need to remove the reasoning trace (which is not part of the final answer) from the output to get the actual answer.

- #### Adding sampling
+ #### Sampling

  When models generate outputs, sampling multiple times and aggregating results can provide a more robust signal than a single greedy generation.
  This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.

  Common sampling-based metrics are:
- - **pass@k over n**: Given n generated samples, measures whether at least k passes the test. <Sidenote> You'll find two functions for this metric: computed as: $\text{pass}@k = \mathbb{E}[\text{at least 1 correct among k samples}]$, or computed with an unbiased estimator with: $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$ where c is the number of correct samples among n total samples. </Sidenote>
+ - **pass@k over n**: Given n generated samples, measures whether at least k passes the test (a sketch is given after this list).
+ <Sidenote> You'll find two functions for this metric: computed trivially as $\text{pass}@k = \mathbb{1}[c \geq k]$, or computed with an unbiased estimator as $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$, where $c$ is the number of correct samples among $n$ total samples. </Sidenote>
  - **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
  - **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or a pass@k (sample n reasoning chains, extract final answers, take majority or a threshold).
  - **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using "best" or "most common" case.
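A minimal sketch of these sampling-based aggregations; the unbiased pass@k follows the binomial formula from the sidenote above (names are illustrative):

```python
from collections import Counter
from math import comb

# Unbiased pass@k estimator from the sidenote: 1 - C(n-c, k) / C(n, k),
# with c correct samples among n generations.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# maj@n: most frequent extracted answer across n generations.
def maj_at_n(answers: list[str]) -> str:
    return Counter(answers).most_common(1)[0][0]

# avg@n: mean of the per-generation scores.
def avg_at_n(scores: list[float]) -> float:
    return sum(scores) / len(scores)

answers = ["42", "41", "42", "42", "7"]  # extracted answers from 5 samples
scores = [1, 0, 1, 1, 0]                 # correctness of each sample
print(pass_at_k(n=5, c=3, k=2))  # 0.9
print(maj_at_n(answers))         # "42"
print(avg_at_n(scores))          # 0.6
```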
@@ -204,7 +191,7 @@ When you use sampling evaluations, make sure to always report all sampling param
  However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
  </Note>

- #### Using functional testing
+ #### Functional scorers
  Instead of comparing generated text to a reference through fuzzy string matching, functional testing evaluates whether outputs satisfy specific verifiable constraints. This approach is extremely promising because it's more flexible and allows "infinite" updates of the test case through rule-based generation (which reduces overfitting).

  **IFEval and IFBench** are excellent examples of this approach for instruction following evaluation. Rather than asking "does this text match a reference answer?", they ask "does this text satisfy formatting constraints given in the instructions?"
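A sketch of what such a rule-based verifier can look like; the constraints themselves are made up for illustration:

```python
import re

# Each check verifies one instruction-level constraint on the raw output;
# no reference answer is needed, only rules.
def check_constraints(output: str) -> dict[str, bool]:
    stripped = output.strip()
    return {
        "is_json_like": stripped.startswith("{") and stripped.endswith("}"),
        "has_three_bullets": len(re.findall(r"^- ", output, flags=re.MULTILINE)) >= 3,
        "under_100_words": len(output.split()) <= 100,
        "no_forbidden_word": "lorem" not in output.lower(),
    }

output = "- first point\n- second point\n- third point"
results = check_constraints(output)
print(results)
print("score:", sum(results.values()) / len(results))  # fraction of satisfied constraints
```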
@@ -223,7 +210,6 @@ This functional approach works particularly well for instruction following, but

  Functional testing is inspired by code evaluation, where functional testing through unit tests is standard practice (checking if generated code produces correct outputs for given inputs).
  </Sidenote>

-
  ### With humans
  Human evaluation is simply asking humans to score predictions.
@@ -347,7 +333,7 @@ Provide some additional "reasoning" evaluation steps:
  - *To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...*

  Specify the desired output format (adding fields will help consistency)
- - *Your answer should be provided in JSON, with the following format \{"Score": Your score, "Reasoning": The reasoning which led you to this score\}*
+ - *Your answer should be provided in JSON, with the following format \{"Score": Your score, "Reasoning": The reasoning which led you to this score\}*
  </Note>

  You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.py) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
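If you go with a JSON output format like the one above, parsing and validating the judge's answer can be sketched as follows (field names follow the example format above):

```python
import json

# Sketch: extract the {"Score": ..., "Reasoning": ...} object from a judge answer,
# and return None when the judge did not follow the requested format.
def parse_judge_output(raw: str, min_score: int = 1, max_score: int = 5):
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        payload = json.loads(raw[start : end + 1])
        score = int(payload["Score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    if not (min_score <= score <= max_score):
        return None
    return {"score": score, "reasoning": payload.get("Reasoning", "")}

print(parse_judge_output('Sure! {"Score": 4, "Reasoning": "Mostly correct."}'))
print(parse_judge_output("I refuse to answer."))  # None -> count it as a failed judgment
```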
@@ -365,9 +351,7 @@ You can also improve accuracy using the following, possibly more costly, techniq
  - **Reference**: you can also enhance your prompt with a reference if present, which increases accuracy
  - **CoT**: [improves accuracy for older gen models](https://arxiv.org/abs/2212.08073), if you ask the model to output its chain of thought **before** the score (also observed [here](https://x.com/seungonekim/status/1749289437165769177))
  - **Multiturn analysis**: can improve [factual error detection](https://arxiv.org/abs/2305.13281)
- - Using **a jury** (many judges, where you pick an aggregate of the answers): [gives better results](https://arxiv.org/abs/2404.18796) than using a single model.
-   - It can be made considerably less costly by leveraging many smaller models instead of one big expensive model.
-   - You can also experiment with using one model with variations on temperature
+ - Using **a jury** (many judges, where you pick an aggregate of the answers): [gives better results](https://arxiv.org/abs/2404.18796) than using a single model. It can be made considerably less costly by leveraging many smaller models instead of one big expensive model. You can also experiment with using one model with variations on temperature (see the sketch after this list).
  - Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
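A small sketch of the jury idea mentioned above: aggregate several judges' scores per sample (a mean here, but a majority vote works too; the judge names are placeholders):

```python
from statistics import mean

# Sketch: each judge returns one score per sample; the jury score is the
# per-sample aggregate across judges.
judge_scores = {
    "small_judge_a": [4, 2, 5],
    "small_judge_b": [5, 2, 4],
    "small_judge_c": [4, 3, 4],
}

per_sample_jury = [mean(scores) for scores in zip(*judge_scores.values())]
print(per_sample_jury)        # [4.33..., 2.33..., 4.33...]
print(mean(per_sample_jury))  # overall jury score
```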

  If you are working on critical tasks (medical domain for example), make sure to use methodologies transferred from the humanities, and 1) compute inter-annotator agreement metrics to make sure your evaluators are as unbiased as possible, 2) use proper survey design methodology when creating your scoring grid to mitigate bias. However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! It just depends on the consequences attached.)
@@ -511,3 +495,10 @@ On the other hand they:
  - For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
  - Tracking of win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
  </Note>
+
+ ### Calibration and confidence
+
+ When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
+
+ These confidence intervals can be obtained from standard deviations over the scores, or via [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). For automatic metrics this is relatively trivial; for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human based evaluations, you should report agreement.
+
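A minimal sketch of a percentile-bootstrap confidence interval over per-sample scores (the resample count is arbitrary):

```python
import random

# Sketch: percentile bootstrap over per-sample scores (e.g. 0/1 correctness).
def bootstrap_ci(scores: list[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0):
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / len(scores), (lower, upper)

scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # toy per-sample results
point, (low, high) = bootstrap_ci(scores)
print(f"accuracy = {point:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```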
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx CHANGED
@@ -5,6 +5,7 @@ title: "Model inference and evaluation"
  import llmTk1 from '../../assets/image/llm_tk_1.png';
  import llmLogprob from '../../assets/image/llm_logprob.png';
  import llmGen from '../../assets/image/llm_gen.png';
+ import chatTemplatesTokenisation from '../../assets/image/chat-templates-and-tokenisation.png';
  import Image from '../../../components/Image.astro';
  import Note from "../../../components/Note.astro";
  import Sidenote from "../../../components/Sidenote.astro";
@@ -71,8 +72,15 @@ However, if you want to allow your tokenizer to correctly split text in other la

  This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.

+ <iframe
+   src="https://OpenEvals-tokenizers-languages.hf.space"
+   frameborder="0"
+   width="850"
+   height="450"
+ ></iframe>
+
  <Note title="Going further: Language and tokenization" emoji="📚" variant="warning">
- - ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and it's worth playing around with the [demo space](https://huggingface.co/spaces/yenniejun/tokenizers-languages)
+ - ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and the embedded space comes from her work.
  - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/): I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages
  </Note>

@@ -97,7 +105,7 @@ This means a number of models are going to perform terribly if you do not make s

  <Note title="Critical: Chat templates and tokenization" emoji="⚡" variant="danger">

- ![Spacing, tokenization and template](https://pbs.twimg.com/media/GPANfpiasAA9b6F?format=png&name=medium)
+ <Image src={chatTemplatesTokenisation} alt="Spacing, tokenization and template" />

  Different tokenizers behave differently with spacing and special tokens. See this [visualization](https://x.com/danielhanchen/status/1796952220619157694) showing how spacing, tokenization, and templates interact. Never assume tokenizers behave identically!
  </Note>
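In practice, the safest way to apply a chat template is to let the tokenizer do it; a sketch with `transformers` (the model name is just an example):

```python
from transformers import AutoTokenizer

# Sketch: let the tokenizer apply its own chat template instead of hand-crafting
# special tokens and spacing.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # includes the model-specific special tokens and spacing
```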
@@ -106,24 +114,28 @@ Different tokenizers behave differently with spacing and special tokens. See thi

  When looking at an MCQA evaluation, in general, you want to tokenize the context together with the choices, as it creates a succession of tokens which is likely/natural for the model.

- However, some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the logprobabilities of the choices is not easy, as the context tokens can "bleed out" into them, messing up the comparison.
-
- <Sidenote>
-
- The [Llama tokenizer](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257) doesn't satisfy `enc(context + choice) = enc(context) + enc(choice)`, making log probability comparisons tricky. Tokenize separately and concatenate, removing special tokens.
- </Sidenote>
-
- So if this is the case for your model, you might want to compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
+ <Note title="Should you tokenize the context with the choices always?">
+ Some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the logprobabilities of the choices only is not trivial, as the context tokens can "bleed out" into them, messing up the comparison.
+
+ To give a concrete example, say you have characters `C1`, `C2`, and `C3` as base tokens of your vocabulary, and `C1C2` also happens to be a single token learned during BPE.
+
+ Say your context is C1, and the choices C2 and C3.
+ If you tokenize the context with the choices, you compare `C1C2` (one token) with `C1+C3` (two tokens). Even if you normalize the logprobs by length, you are not comparing the same thing.
+ Comparing after tokenizing the context and choices separately means you compare `C1+C2` and `C1+C3`. But since `C1C2` is a token, the occurrence of `C1+C2` is likely rare in the data your encoder saw, so it is an unlikely succession for your model, which can mess up your logprobabilities.
+
+ If this is the case for your model, the solution is usually to go for the least worst option, comparing the comparable: compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
+ </Note>
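A quick way to check whether your tokenizer has this property is to compare the two tokenizations directly; a sketch with `transformers` (the tokenizer name is just an example):

```python
from transformers import AutoTokenizer

# Sketch: does enc(context + choice) == enc(context) + enc(choice) for your tokenizer?
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # swap in the tokenizer you evaluate with

context = "Question: What is the capital of France?\nAnswer:"
choice = " Paris"

together = tokenizer.encode(context + choice, add_special_tokens=False)
separate = (
    tokenizer.encode(context, add_special_tokens=False)
    + tokenizer.encode(choice, add_special_tokens=False)
)

print("context bleeds into the choice:", together != separate)
```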

  **Paying attention to start and end of sentence tokens**

- Some models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.
+ Some pretrained models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.

  You can also encounter some issues where your model won't stop on an end of sentence token like you would expect (for example, on `\n`), because your model will not predict this token alone but included in a higher level token (for example, `\n\n`, which can be a single token, especially for code models). In this case, you might need to add a specific check to "backtrack" on generated text to make sure you're cutting your generated sentence at the proper spot before computing metrics.

  **Multilinguality and tokenization**

- When looking at multilingual evaluations, you'll also need to see how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to cite a few), they will require language specific tokenizers to be split properly, else it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc.
+ When looking at multilingual evaluations, you'll also need to see how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to cite a few), they will require language specific tokenizers to be split properly, else it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc. The number of tokens that the model is allowed to generate for an evaluation should also be language dependent, as not all languages are tokenized into a similar number of tokens (go back to the tokenization section to see why).

  **Code evaluations and end of sentence tokens**

app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx CHANGED
@@ -9,14 +9,12 @@ import Sidenote from "../../../components/Sidenote.astro";

  ### My results are very bad

- The first thing to do is always to inspect your model generations in detail. Some frequent things to look for when troubleshooting are:
- - too strict model output parsing (before computing the metric) which leads to the answer being lost
-   - Fixing: adapt your parsing
- - unability of the models to follow your output format in few shot (frequent in recent models trained with instructions data, like llama 3.2 or Qwen 2.5)
-   - Fixing: either adapt your prompt format, or just assume that models should be able to follow it in few shot
- - exceedingly verbose model which never gets to the correct answer (more frequent in long context models and something we observed with Qwen and CommandR models)
-   - Fixing: either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly
+ The first thing to do is always to inspect your model generations in detail.
+
+ Some frequent problems you should look for when troubleshooting are:
+ - Is your model output parsing too strict before computing the metric? It can lead to the answer being lost (the obvious fix is to make it less strict, but you'll get more false positives!). A sketch of a more lenient extraction is given after this list.
+ - Is your model struggling to follow your output format in few shot? This frequently happens in recent models trained on too specific evaluation formats, and you can either adapt your prompt format, or just state that models should be able to follow it and that the ones struggling are not good enough for the task you are considering.
+ - Is your model exceedingly verbose? In this case, it likely never gets to the correct answer. This is more frequent in long context models (we observed it with Qwen and Command R models in 2024) and reasoning models, especially if the task stops generation too soon. You can either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly.
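For the parsing point in the list above, a sketch of a more lenient answer extraction than a strict match (the patterns are illustrative):

```python
import re

# Sketch: progressively more lenient extraction of a final answer from a generation.
def extract_answer(generation: str):
    # 1. explicit "answer is X" / "answer: X" pattern
    match = re.search(r"answer\s*(?:is|:)\s*([A-D]|-?\d+(?:\.\d+)?)", generation, re.IGNORECASE)
    if match:
        return match.group(1)
    # 2. fall back to the last option letter or number appearing in the text
    candidates = re.findall(r"\b([A-D]|-?\d+(?:\.\d+)?)\b", generation)
    return candidates[-1] if candidates else None

print(extract_answer("Let's think step by step... so the answer is 42."))  # "42"
print(extract_answer("I believe option B fits best."))                     # "B"
print(extract_answer("I cannot answer this."))                             # None
```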

  ### My model is very slow!
  ➡️ Changing the batch size
@@ -48,21 +46,6 @@ And that's it!

  I would actually recommend using `<memory (in GB)> = <number of parameters (in G)> * (<precision factor> * 110%)`, to be on the safer side, as inference will require a bit more memory than just loading the model (you'll also need to load the batches).
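A sketch of this rule of thumb in code (precision factors are bytes per parameter):

```python
# Rule of thumb from above: memory (GB) ≈ params (in billions) * bytes per param * 1.10
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "8bit": 1, "4bit": 0.5}

def estimate_memory_gb(n_params_billions: float, precision: str = "bfloat16") -> float:
    return n_params_billions * BYTES_PER_PARAM[precision] * 1.10

for precision in ("float32", "bfloat16", "8bit", "4bit"):
    print(precision, f"{estimate_memory_gb(8, precision):.1f} GB")  # e.g. for an 8B model
```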

- <Note title="Estimating GPU memory requirements" emoji="💾" variant="info">
-
- **Quick formula:**
- `Memory (GB) = Params (billions) × Precision factor × 1.1`
-
- **Precision factors:**
- - float32: 4
- - float16/bfloat16: 2
- - 8-bit: 1
- - 4-bit: 0.5
-
- The 1.1 multiplier accounts for batch loading overhead. Example: A 7B model in float16 needs ~15.4GB (7 × 2 × 1.1).
-
- </Note>
-
  ### My model does not fit on a GPU
  ➡️ Quantization

@@ -73,18 +56,11 @@ However, using too low a precision can give worse results, so for some models (e

  Model parallelism includes a range of techniques which cut your model in smaller sub-model pieces, to load and run each of these smaller pieces on a single different GPU. This requires less memory since you never load the full model at once, but can be slower.

- The 2 main types of model parallelism are
- - Pipeline parallelism, where the model is split at the whole layer level, and the layers are dispatched on different GPUs. Since layer 1's output is layer 2's input, this leads to a slower execution, as GPUs will be idle while waiting, which is called a "bubble" (and data must be transferred from one GPU to the next). The bubble can be reduced by splitting the inputs into smaller batches. It's being natively added to PyTorch with the `PiPPy` [lib](https://github.com/pytorch/PiPPy), and this is what `accelerate` uses under the hood for parallelism.
- - Tensor parallelism, where the model is split at the matrix computation level. This means that the matrices will be split on rows or columns, and the total result aggregated. This is incredibly efficient as long as all GPUs are on the same node (to avoid inter node network bottlenecks), but can be hard to code. You'll find cool implementations of this in the `vllm` lib. It provides **insane speedups**.
-
  <Note title="Model parallelism strategies" emoji="🔀" variant="info">

- **Two main approaches to split models across GPUs:**
-
- - **Pipeline parallelism**: Split by layers, dispatch to different GPUs. Simpler but creates "bubbles" (idle GPU time waiting for previous layer). Reduce bubbles by using smaller micro-batches. Used by PyTorch PiPPy and Accelerate.
-
- - **Tensor parallelism**: Split matrix operations across GPUs within each layer. Much faster (insane speedups!) but requires all GPUs on same node to avoid network bottlenecks. Check out `vllm` for implementations.
-
+ The 2 main types of model parallelism are
+ - **Pipeline parallelism**, where the model is split at the whole layer level, and the layers are dispatched on different GPUs. Since layer 1's output is layer 2's input, this leads to a slower execution, as GPUs will be idle while waiting, which is called a "bubble" (and data must be transferred from one GPU to the next). The bubble can be reduced by splitting the inputs into smaller batches. It's being natively added to PyTorch with the `PiPPy` [lib](https://github.com/pytorch/PiPPy), and this is what `accelerate` uses under the hood for parallelism.
+ - **Tensor parallelism**, where the model is split at the matrix computation level. This means that the matrices will be split on rows or columns, and the total result aggregated. This is incredibly efficient as long as all GPUs are on the same node (to avoid inter node network bottlenecks), but can be hard to code. You'll find cool implementations of this in the `vllm` lib. It provides **insane speedups**.

  </Note>

  The best document on the different kinds of parallelism (including data parallelism, for speedups) is [here](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).
app/src/content/embeds/d3-evaluation-decision-tree.html ADDED
@@ -0,0 +1,333 @@
1
+ <div class="d3-evaluation-tree"></div>
2
+ <style>
3
+ .d3-evaluation-tree {
4
+ position: relative;
5
+ width: 100%;
6
+ min-height: 500px;
7
+ overflow: visible;
8
+ }
9
+ .d3-evaluation-tree svg {
10
+ display: block;
11
+ width: 100%;
12
+ height: auto;
13
+ }
14
+ .d3-evaluation-tree .node-rect {
15
+ stroke-width: 2;
16
+ rx: 8;
17
+ ry: 8;
18
+ cursor: pointer;
19
+ transition: all 0.2s ease;
20
+ }
21
+ .d3-evaluation-tree .decision-node {
22
+ stroke: var(--border-color);
23
+ }
24
+ .d3-evaluation-tree .result-node {
25
+ stroke: var(--border-color);
26
+ }
27
+ .d3-evaluation-tree .warning-node {
28
+ stroke: var(--border-color);
29
+ }
30
+ .d3-evaluation-tree .node-text {
31
+ fill: var(--text-color);
32
+ font-size: 12px;
33
+ font-weight: 500;
34
+ pointer-events: none;
35
+ user-select: none;
36
+ }
37
+ .d3-evaluation-tree .link {
38
+ fill: none;
39
+ stroke: var(--border-color);
40
+ stroke-width: 1.5;
41
+ opacity: 0.5;
42
+ }
43
+ .d3-evaluation-tree .link-label {
44
+ fill: var(--muted-color);
45
+ font-size: 10px;
46
+ font-weight: 500;
47
+ }
48
+ .d3-evaluation-tree .node-rect:hover {
49
+ filter: brightness(1.05);
50
+ stroke-width: 3;
51
+ }
52
+ .d3-evaluation-tree .d3-tooltip {
53
+ position: absolute;
54
+ top: 0;
55
+ left: 0;
56
+ transform: translate(-9999px, -9999px);
57
+ pointer-events: none;
58
+ padding: 8px 10px;
59
+ border-radius: 8px;
60
+ font-size: 12px;
61
+ line-height: 1.35;
62
+ border: 1px solid var(--border-color);
63
+ background: var(--surface-bg);
64
+ color: var(--text-color);
65
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
66
+ opacity: 0;
67
+ transition: opacity .12s ease;
68
+ max-width: 250px;
69
+ }
70
+ </style>
71
+ <script>
72
+ (() => {
73
+ const ensureD3 = (cb) => {
74
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
75
+ let s = document.getElementById('d3-cdn-script');
76
+ if (!s) {
77
+ s = document.createElement('script');
78
+ s.id = 'd3-cdn-script';
79
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
80
+ document.head.appendChild(s);
81
+ }
82
+ const onReady = () => {
83
+ if (window.d3 && typeof window.d3.select === 'function') cb();
84
+ };
85
+ s.addEventListener('load', onReady, { once: true });
86
+ if (window.d3) onReady();
87
+ };
88
+
89
+ const bootstrap = () => {
90
+ const scriptEl = document.currentScript;
91
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
92
+ if (!(container && container.classList && container.classList.contains('d3-evaluation-tree'))) {
93
+ const candidates = Array.from(document.querySelectorAll('.d3-evaluation-tree'))
94
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
95
+ container = candidates[candidates.length - 1] || null;
96
+ }
97
+ if (!container) return;
98
+ if (container.dataset) {
99
+ if (container.dataset.mounted === 'true') return;
100
+ container.dataset.mounted = 'true';
101
+ }
102
+
103
+ // Tooltip setup
104
+ container.style.position = container.style.position || 'relative';
105
+ let tip = container.querySelector('.d3-tooltip');
106
+ let tipInner;
107
+ if (!tip) {
108
+ tip = document.createElement('div');
109
+ tip.className = 'd3-tooltip';
110
+ tipInner = document.createElement('div');
111
+ tipInner.className = 'd3-tooltip__inner';
112
+ tipInner.style.textAlign = 'left';
113
+ tip.appendChild(tipInner);
114
+ container.appendChild(tip);
115
+ } else {
116
+ tipInner = tip.querySelector('.d3-tooltip__inner') || tip;
117
+ }
118
+
119
+ // Get colors from ColorPalettes with fallback
120
+ const getColors = () => {
121
+ if (window.ColorPalettes && window.ColorPalettes.getColors) {
122
+ return {
123
+ decision: window.ColorPalettes.getColors('sequential', 3)[0],
124
+ result: window.ColorPalettes.getColors('sequential', 3)[2],
125
+ warning: window.ColorPalettes.getColors('diverging', 3)[1]
126
+ };
127
+ }
128
+ // Fallback colors
129
+ return {
130
+ decision: '#60A5FA',
131
+ result: '#34D399',
132
+ warning: '#FBBF24'
133
+ };
134
+ };
135
+
136
+ // Define the decision tree structure
137
+ const treeData = {
138
+ name: "What are you\nevaluating?",
139
+ type: "decision",
140
+ tooltip: "Starting point: Identify your evaluation task",
141
+ children: [
142
+ {
143
+ name: "Have gold\nstandard?",
144
+ edgeLabel: "Start",
145
+ type: "decision",
146
+ tooltip: "Do you have a clear, correct reference answer?",
147
+ children: [
148
+ {
149
+ name: "Objective &\nverifiable?",
150
+ edgeLabel: "Yes",
151
+ type: "decision",
152
+ tooltip: "Is the answer factual and unambiguous?",
153
+ children: [
154
+ {
155
+ name: "Format\nconstrained?",
156
+ edgeLabel: "Yes",
157
+ type: "decision",
158
+ tooltip: "Can you verify output structure programmatically?",
159
+ children: [
160
+ {
161
+ name: "Functional\nTesting",
162
+ edgeLabel: "Yes",
163
+ type: "result",
164
+ tooltip: "Use IFEval-style functional tests or unit tests"
165
+ },
166
+ {
167
+ name: "Automated\nMetrics",
168
+ edgeLabel: "No",
169
+ type: "result",
170
+ tooltip: "Use exact match, F1, BLEU, etc."
171
+ }
172
+ ]
173
+ }
174
+ ]
175
+ },
176
+ {
177
+ name: "Human Eval\nor Judges",
178
+ edgeLabel: "Subjective",
179
+ type: "warning",
180
+ tooltip: "Multiple valid answers exist; need human judgment or model judges"
181
+ }
182
+ ]
183
+ },
184
+ {
185
+ name: "Budget &\nscale?",
186
+ edgeLabel: "No gold",
187
+ type: "decision",
188
+ tooltip: "No reference answer available",
189
+ children: [
190
+ {
191
+ name: "Expert Human\nAnnotators",
192
+ edgeLabel: "High",
193
+ type: "result",
194
+ tooltip: "Best for critical use cases (medical, legal)"
195
+ },
196
+ {
197
+ name: "Model Judges\n(validate!)",
198
+ edgeLabel: "Medium",
199
+ type: "warning",
200
+ tooltip: "Validate judge quality against human baseline"
201
+ },
202
+ {
203
+ name: "Arena or\nVibe-checks",
204
+ edgeLabel: "Low",
205
+ type: "warning",
206
+ tooltip: "Crowdsourced or exploratory evaluation"
207
+ }
208
+ ]
209
+ }
210
+ ]
211
+ };
212
+
213
+ // SVG setup
214
+ const svg = d3.select(container).append('svg');
215
+ const g = svg.append('g').attr('transform', 'translate(40, 30)');
216
+
217
+ let width = container.clientWidth || 900;
218
+ const nodeWidth = 140;
219
+ const nodeHeight = 50;
220
+
221
+ function render() {
222
+ const colors = getColors();
223
+ width = container.clientWidth || 900;
224
+
225
+ const treeLayout = d3.tree()
226
+ .size([width - 80, 500])
227
+ .separation((a, b) => (a.parent === b.parent ? 1.3 : 1.6));
228
+
229
+ const root = d3.hierarchy(treeData);
230
+ const treeNodes = treeLayout(root);
231
+
232
+ const maxDepth = root.height;
233
+ const height = (maxDepth + 1) * 120 + 60;
234
+
235
+ svg.attr('viewBox', `0 0 ${width} ${height}`)
236
+ .attr('preserveAspectRatio', 'xMidYMin meet');
237
+
238
+ // Clear previous
239
+ g.selectAll('*').remove();
240
+
241
+ // Links
242
+ g.selectAll('.link')
243
+ .data(treeNodes.links())
244
+ .join('path')
245
+ .attr('class', 'link')
246
+ .attr('d', d3.linkVertical()
247
+ .x(d => d.x)
248
+ .y(d => d.y)
249
+ );
250
+
251
+ // Link labels
252
+ g.selectAll('.link-label')
253
+ .data(treeNodes.links().filter(d => d.target.data.edgeLabel))
254
+ .join('text')
255
+ .attr('class', 'link-label')
256
+ .attr('x', d => d.target.x)
257
+ .attr('y', d => (d.source.y + d.target.y) / 2 - 5)
258
+ .attr('text-anchor', 'middle')
259
+ .text(d => d.target.data.edgeLabel);
260
+
261
+ // Node groups
262
+ const nodes = g.selectAll('.node')
263
+ .data(treeNodes.descendants())
264
+ .join('g')
265
+ .attr('class', 'node')
266
+ .attr('transform', d => `translate(${d.x},${d.y})`)
267
+ .on('mouseenter', function(event, d) {
268
+ if (d.data.tooltip) {
269
+ const [mx, my] = d3.pointer(event, container);
270
+ tip.style.opacity = '1';
271
+ tip.style.transform = `translate(${mx + 10}px, ${my - 10}px)`;
272
+ tipInner.textContent = d.data.tooltip;
273
+ }
274
+ })
275
+ .on('mouseleave', function() {
276
+ tip.style.opacity = '0';
277
+ tip.style.transform = 'translate(-9999px, -9999px)';
278
+ });
279
+
280
+ // Rectangles
281
+ nodes.append('rect')
282
+ .attr('class', d => {
283
+ if (d.data.type === 'result') return 'node-rect result-node';
284
+ if (d.data.type === 'warning') return 'node-rect warning-node';
285
+ return 'node-rect decision-node';
286
+ })
287
+ .attr('x', -nodeWidth / 2)
288
+ .attr('y', -nodeHeight / 2)
289
+ .attr('width', nodeWidth)
290
+ .attr('height', nodeHeight)
291
+ .attr('fill', d => {
292
+ if (d.data.type === 'result') return colors.result;
293
+ if (d.data.type === 'warning') return colors.warning;
294
+ return colors.decision;
295
+ });
296
+
297
+ // Text (multiline support)
298
+ nodes.each(function(d) {
299
+ const nodeG = d3.select(this);
300
+ const lines = d.data.name.split('\n');
301
+ const lineHeight = 14;
302
+ const startY = -(lines.length - 1) * lineHeight / 2;
303
+
304
+ lines.forEach((line, i) => {
305
+ nodeG.append('text')
306
+ .attr('class', 'node-text')
307
+ .attr('text-anchor', 'middle')
308
+ .attr('y', startY + i * lineHeight)
309
+ .attr('dy', '0.35em')
310
+ .text(line);
311
+ });
312
+ });
313
+ }
314
+
315
+ // Initial render
316
+ render();
317
+
318
+ // Responsive resize
319
+ if (window.ResizeObserver) {
320
+ const ro = new ResizeObserver(() => render());
321
+ ro.observe(container);
322
+ } else {
323
+ window.addEventListener('resize', render);
324
+ }
325
+ };
326
+
327
+ if (document.readyState === 'loading') {
328
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
329
+ } else {
330
+ ensureD3(bootstrap);
331
+ }
332
+ })();
333
+ </script>