merge

- README.md +2 -2
- app/src/content/article.mdx +0 -5
- app/src/content/assets/image/chat-templates-and-tokenisation.png +3 -0
- app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx +35 -44
- app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx +21 -9
- app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx +8 -32
- app/src/content/embeds/d3-evaluation-decision-tree.html +333 -0
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
-title: 'Evaluation
-short_desc: '
+title: 'Evaluation Guidebook'
+short_desc: 'How to properly evaluate LLMs in the modern age'
 emoji: 📝
 colorFrom: blue
 colorTo: indigo
app/src/content/article.mdx CHANGED
@@ -28,9 +28,6 @@ import TroubleshootingInference from "./chapters/troubleshooting/troubleshooting
 import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
 import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
 
-- https://arxiv.org/abs/2109.02550
-- https://arxiv.org/abs/2511.21140
-
 <Intro />
 
 ## LLM basics to understand evaluation
@@ -94,8 +91,6 @@ Best (but rarest) metrics are functional or based on rule based verifiers (thoug
 
 ## Creating your own evaluation
 
-
-
 <DesigningAutomaticEvaluation />
 
 
app/src/content/assets/image/chat-templates-and-tokenisation.png ADDED

Git LFS Details
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -20,8 +20,6 @@ When aggregating datasets, pay attention to whether
 
 <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
 
-#### Creating a dataset manually
-
 <UsingHumanAnnotators />
 
 #### Creating a dataset synthetically
@@ -45,33 +43,6 @@ Once this is done, you can do an automatic validation by using a model from a di
 No matter how tempting it is to do everything automatically, you should always check your data at every step, to make sure your evaluations are qualitative. Evaluation is the name of the game and you need to use extremely good data.
 </Note>
 
-#### Choosing a prompt
-The prompt is going to define:
-- how much information is given to your model about the task
-- how this information is presented to your model.
-
-A prompt for a general MCQA or QA is usually made of some of the following:
-- a task prompt (optional): introduces your task.
-- a context: provides additional context for your question.
-  - *Eg: For a summarization or information extraction task, you could provide a content source*
-- a question: the actual core of your prompt.
-- in case of a multi choice evaluation, you can add options
-- connector words (`Question`, `Context`, `Choice`, ...)
-
-When defining your prompt, you need to be aware that:
-- even small changes in semantically equivalent prompts can make the results vary by quite a lot (see Section `Different prompt` in [Troubleshooting reproducibility](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-reproducibility.md)), and prompt formats might advantage or disadvantage specific models
-  - How to mitigate this:
-    - A costly way is to re-run the evaluation several times with prompt variations
-    - A less costly way is to run your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty
-- you can provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall
-- for a number of metrics, you want a very constrained generation or output.
-
-<Note title="Models can overfit prompt formats" emoji="⚠️" variant="warning">
-
-Recent research shows models can overfit specific prompt formats rather than learning the underlying task. [This paper](https://arxiv.org/abs/2407.07890) is great on the topic, showing notably how some models can be over-evaluated because they have overfitted the test set **format**.
-On the Open LLM Leaderboard 2, we've notably observed that Llama 3.2 and Qwen 2.5 are no longer following the format of the prompt provided in a few-shot setup for this reason.
-</Note>
-
 #### Managing contamination
 In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
 
@@ -83,6 +54,21 @@ Solutions to mitigate this include:
 
 However, it's not because a dataset is contaminated that it won't still be interesting and have signal during training, as we saw in the ablations section.
 
+### Choosing a prompt
+The prompt is going to define how much information is given to your model about the task, and how this information is presented to the model.
+
+A prompt for a general MCQA or QA is usually made of some of the following:
+- a task prompt (optional): introduces your task.
+- a context: provides additional context for your question.
+  - *Eg: For a summarization or information extraction task, you could provide a content source*
+- a question: the actual core of your prompt.
+- in case of a multi choice evaluation, you can add options
+- connector words (`Question`, `Context`, `Choice`, ...)
+
+When defining your prompt, you need to be aware that even small changes in semantically equivalent prompts can make the results vary by quite a lot, and prompt formats might advantage or disadvantage specific models (see [this section](https://huggingface.co/spaces/OpenEvals/evaluation-guidebook#different-prompt)).
+
+➡️ This can be mitigated by re-running the evaluation several times with prompt variations (but it can be costly), or simply running your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty.
+➡️ You can also provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall.
 
 ### Choosing an inference method for your model
 You'll need to choose what kind of inference method you need.
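The prompt pieces described in the added section above are easy to assemble programmatically when you build your own task. Below is a minimal, hypothetical sketch of one way to do it; the connector words and field layout are illustrative choices, not a prescribed format.

```python
# Minimal sketch of an MCQA prompt built from the pieces listed above
# (task prompt, context, question, options, connector words).
# The exact wording and connectors are illustrative assumptions, not a standard.

def build_mcqa_prompt(question, choices, context="", task_prompt=""):
    letters = "ABCDEFGH"
    parts = []
    if task_prompt:
        parts.append(task_prompt)            # optional task introduction
    if context:
        parts.append(f"Context: {context}")  # connector word "Context"
    parts.append(f"Question: {question}")    # connector word "Question"
    for letter, choice in zip(letters, choices):
        parts.append(f"{letter}. {choice}")  # enumerated options
    parts.append("Answer:")                  # constrains the expected output
    return "\n".join(parts)

print(build_mcqa_prompt(
    question="Which planet is known as the Red Planet?",
    choices=["Venus", "Mars", "Jupiter"],
    task_prompt="The following are multiple choice questions (with answers).",
))
```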
@@ -122,8 +108,7 @@ However, nowadays most evaluations are generative: using generations (QA, questi
 
 If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
 
-If you're looking at generative evaluations,
-
+If you're looking at generative evaluations, this is where it gets trickyyy, so the next chapter is specifically on this!
 
 ## The hardest part of evaluation: Scoring free form text
 
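As a toy illustration of the length normalization mentioned for log-probability metrics above, here is a short sketch comparing raw, token-normalized, and character-normalized scores for two choices (all numbers are made up, and PMI normalization is left out).

```python
# Sketch: picking the "best" choice from summed log-probabilities, with and
# without length normalization. Scores and lengths are made-up toy values.

choices = ["Paris", "The city of Paris, in France"]
logprobs = [-4.2, -7.9]           # summed logprob of each choice given the context
token_lengths = [2, 8]            # number of tokens per choice
char_lengths = [len(c) for c in choices]

def argmax(scores):
    return max(range(len(scores)), key=lambda i: scores[i])

raw_pick = argmax(logprobs)  # raw sums tend to favor shorter choices
tok_norm_pick = argmax([lp / n for lp, n in zip(logprobs, token_lengths)])
char_norm_pick = argmax([lp / n for lp, n in zip(logprobs, char_lengths)])

print("raw:", choices[raw_pick])
print("token-normalized:", choices[tok_norm_pick])
print("char-normalized:", choices[char_norm_pick])
```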
@@ -137,15 +122,16 @@ The easiest but least flexible match based metrics are **exact matches** of toke
 The translation and summarisation fields have introduced automatic metrics which compare similarity through overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with humans at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap.
 Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).
 
-Once you have an accuracy score per sample, you can aggregate it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
+Once you have an accuracy score per sample, you can **aggregate** it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
 
 If your score is **binary**, look at the **precision** (critical when false positives are costly), **recall** (critical when missing positives is costly), **F1 score** (balances precision and recall, good for imbalanced data), or **MCC** (Matthews Correlation Coefficient, which works well with imbalanced datasets by considering all confusion matrix elements).
-If your score is **continuous
+If your score is **continuous** (less likely though), you can use **mean squared error** (penalizes large errors but heavily weights outliers) or **mean absolute error** (more balanced than MSE). <Sidenote> If you assume your data should follow a specific linear regression model (for example if you are studying model calibration), you can look at measures like the **R²** or correlation coefficients like **Pearson** (for linear relationships, assumes normality) or **Spearman** (for monotonic relationships without normality assumptions). However, it's a bit out of scope here. </Sidenote>
 
 More generally, when picking your metric and its aggregation, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc).
-<
-
-
+<Note title="To go further">
+- This [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/) covers some of the challenges of evaluating LLMs.
+- If you're looking for metrics, you'll also find a good list with description, score ranges and use cases in [this organisation](https://huggingface.co/evaluate-metric).
+</Note>
 
 
 <Note title="Pros and cons of using automated metrics">
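If you end up with binary per-sample scores, the aggregations listed above can be computed directly from the confusion matrix. A small self-contained sketch with toy predictions (in practice you may prefer an existing metric implementation such as the ones linked above):

```python
# Sketch: precision, recall, F1 and MCC from binary predictions vs references.
# Toy data; plug in your own per-sample scores.
import math

refs  = [1, 0, 1, 1, 0, 1, 0, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(p == 1 and r == 1 for p, r in zip(preds, refs))
tn = sum(p == 0 and r == 0 for p, r in zip(preds, refs))
fp = sum(p == 1 and r == 0 for p, r in zip(preds, refs))
fn = sum(p == 0 and r == 1 for p, r in zip(preds, refs))

precision = tp / (tp + fp) if tp + fp else 0.0
recall    = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / den if den else 0.0  # uses all four confusion matrix cells

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} mcc={mcc:.2f}")
```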
@@ -183,13 +169,14 @@ Normalizations can easily [be unfair if not designed well](https://huggingface.c
 
 They are also important for evaluation of predictions generated with chain of thought, or reasoning, as you'll need to remove the reasoning trace (which is not part of the final answer) from the output to get the actual answer.
 
-####
+#### Sampling
 
 When models generate outputs, sampling multiple times and aggregating results can provide a more robust signal than a single greedy generation.
 This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
 
 Common sampling-based metrics are:
-- **pass@k over n**: Given n generated samples, measures whether at least k passes the test.
+- **pass@k over n**: Given n generated samples, measures whether at least k pass the test.
+<Sidenote> You'll find two functions for this metric: computed trivially as $\text{pass}@k = (c \geq k)$, or computed with an unbiased estimator as $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$, where c is the number of correct samples among n total samples. </Sidenote>
 - **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
 - **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or a pass@k (sample n reasoning chains, extract final answers, take majority or a threshold).
 - **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using "best" or "most common" case.
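The sampling metrics above are straightforward to implement. Here is a minimal sketch of the unbiased pass@k estimator and of maj@n majority voting over sampled answers (toy inputs):

```python
# Sketch: unbiased pass@k estimator and maj@n over sampled generations.
# Toy values: n sampled answers per problem, c of them correct.
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def maj_at_n(answers):
    """Most frequent extracted answer among n samples."""
    return Counter(answers).most_common(1)[0][0]

print(pass_at_k(n=16, c=3, k=4))                 # unbiased estimate over 16 samples
print(maj_at_n(["42", "41", "42", "42", "7"]))   # -> "42"
```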
@@ -204,7 +191,7 @@ When you use sampling evaluations, make sure to always report all sampling param
 However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
 </Note>
 
-####
+#### Functional scorers
 Instead of comparing generated text to a reference through fuzzy string matching, functional testing evaluates whether outputs satisfy specific verifiable constraints. This approach is extremely promising because it's more flexible and allows "infinite" updates of the test case through rule-based generation (which reduces overfitting).
 
 **IFEval and IFBench** are excellent examples of this approach for instruction following evaluation. Rather than asking "does this text match a reference answer?", they ask "does this text satisfy formatting constraints given in the instructions?"
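A functional scorer in the spirit described above checks verifiable constraints instead of string overlap. The sketch below uses two made-up constraints (a word budget and a JSON schema check); it illustrates the idea and is not IFEval's actual implementation.

```python
# Sketch: rule-based verification of instruction-following constraints,
# in the spirit of IFEval-style checks. The constraints are illustrative.
import json

def check_word_count(text, max_words):
    return len(text.split()) <= max_words

def check_is_json_with_keys(text, keys):
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in keys)

output = '{"title": "Eval report", "score": 4}'
constraints = [
    lambda o: check_word_count(o, max_words=50),
    lambda o: check_is_json_with_keys(o, ["title", "score"]),
]
print(all(c(output) for c in constraints))  # True only if every constraint is satisfied
```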
@@ -223,7 +210,6 @@ This functional approach works particularly well for instruction following, but
 Functional testing is inspired by code evaluation, where functional testing through unit tests is standard practice (checking if generated code produces correct outputs for given inputs).
 </Sidenote>
 
-
 ### With humans
 Human evaluation is simply asking humans to score predictions.
 
@@ -347,7 +333,7 @@ Provide some additional "reasoning" evaluation steps:
 - *To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...*
 
 Specify the desired output format (adding fields will help consistency)
-- *Your answer should be provided in JSON, with the following format \{"Score": Your score, "Reasoning": The reasoning which led you to this score\}*
+- *Your answer should be provided in JSON, with the following format \{"Score": Your score, "Reasoning": The reasoning which led you to this score\}*
 </Note>
 
 You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
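If you request JSON output from a judge as in the format above, you still have to parse its answer defensively, since judges do not always comply. A small sketch of one possible approach; the fallback regex is a simple heuristic, not part of any judge framework.

```python
# Sketch: defensively parsing a judge answer expected in the format
# {"Score": ..., "Reasoning": ...}. The fallback extraction is a heuristic.
import json
import re

def parse_judge_answer(raw):
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fallback: grab the first {...} block in case the judge added extra text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None

raw_answer = 'Sure! Here is my verdict: {"Score": 4, "Reasoning": "Mostly correct, one factual slip."}'
print(parse_judge_answer(raw_answer))
```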
@@ -365,9 +351,7 @@ You can also improve accuracy using the following, possibly more costly, techniq
 - **Reference**: you can also enhance your prompt with a reference if present, which increases accuracy
 - **CoT**: [improves accuracy for older gen models](https://arxiv.org/abs/2212.08073), if you ask the model to output its chain of thought **before** the score (also observed [here](https://x.com/seungonekim/status/1749289437165769177))
 - **Multiturn analysis**: can improve [factual error detection](https://arxiv.org/abs/2305.13281)
-- Using **a jury** (many judges, where you pick an aggregate of the answers): [gives better results](https://arxiv.org/abs/2404.18796) than using a single model.
-  - It can be made considerably less costly by leveraging many smaller models instead of one big expensive model.
-  - You can also experiment with using one model with variations on temperature
+- Using **a jury** (many judges, where you pick an aggregate of the answers): [gives better results](https://arxiv.org/abs/2404.18796) than using a single model. It can be made considerably less costly by leveraging many smaller models instead of one big expensive model. You can also experiment with using one model with variations on temperature.
 - Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
 
 If you are working on critical tasks (medical domain for example), make sure to use methodologies transferred from the humanities, and 1) compute inter-annotator agreement metrics to make sure your evaluators are as unbiased as possible, 2) use proper survey design methodology when creating your scoring grid to mitigate bias. However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! Just depends on the consequences attached).
@@ -511,3 +495,10 @@ On the other hand they:
 - For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
 - Tracking of win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
 </Note>
+
+### Calibration and confidence
+
+When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
+
+These confidence intervals can be obtained from standard deviations over the scores or [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)): for automatic metrics, this is relatively trivial; for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human-based evaluations, you should report agreement.
+
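For automatic metrics, the bootstrapped confidence interval mentioned above only takes a few lines to compute over per-sample scores. A minimal sketch with toy scores and a 95% percentile interval:

```python
# Sketch: bootstrap confidence interval over per-sample scores (toy data).
import random

random.seed(0)
scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # per-sample accuracy, toy values

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05):
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(values) for _ in values]  # sample with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

mean = sum(scores) / len(scores)
low, high = bootstrap_ci(scores)
print(f"accuracy = {mean:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```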
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx CHANGED
@@ -5,6 +5,7 @@ title: "Model inference and evaluation"
 import llmTk1 from '../../assets/image/llm_tk_1.png';
 import llmLogprob from '../../assets/image/llm_logprob.png';
 import llmGen from '../../assets/image/llm_gen.png';
+import chatTemplatesTokenisation from '../../assets/image/chat-templates-and-tokenisation.png';
 import Image from '../../../components/Image.astro';
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
@@ -71,8 +72,15 @@ However, if you want to allow your tokenizer to correctly split text in other la
 
 This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.
 
+<iframe
+  src="https://OpenEvals-tokenizers-languages.hf.space"
+  frameborder="0"
+  width="850"
+  height="450"
+></iframe>
+
 <Note title="Going further: Language and tokenization" emoji="📚" variant="warning">
-- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and
+- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and the embedded space comes from her work.
 - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/): I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages
 </Note>
@@ -97,7 +105,7 @@ This means a number of models are going to perform terribly if you do not make s
 
 <Note title="Critical: Chat templates and tokenization" emoji="⚡" variant="danger">
 
-
+<Image src={chatTemplatesTokenisation} alt="Spacing, tokenization and template" />
 
 Different tokenizers behave differently with spacing and special tokens. See this [visualization](https://x.com/danielhanchen/status/1796952220619157694) showing how spacing, tokenization, and templates interact. Never assume tokenizers behave identically!
 </Note>
@@ -106,24 +114,28 @@ Different tokenizers behave differently with spacing and special tokens. See thi
 
 When looking at an MCQA evaluation, in general, you want to tokenize the context together with the choices, as it creates a succession of tokens which is likely/natural for the model.
 
-
-
-
-
-So if this is the case for your model, you might want to compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
+<Note title="Should you tokenize the context with the choices always?">
+Some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the log-probabilities of the choices only is not trivial, as the context tokens can "bleed out" into them, messing up the comparison.
+
+To give a concrete example, say you have characters `C1`, `C2`, and `C3` as base tokens of your vocabulary, and `C1C2` also happens to be a single token learned during BPE.
+
+Say your context is C1, and the choices C2 and C3.
+If you tokenize the context with the choices, you compare `C1C2` (one token) with `C1+C3` (two tokens). Even if you normalize the logprobs by length, you are not comparing the same thing.
+Comparing after tokenizing the context and choices separately means you compare `C1+C2` and `C1+C3`. But since `C1C2` is a token, the occurrence of `C1+C2` is likely rare in the data your encoder saw, so it is an unlikely succession for your model, which can mess up your log-probabilities.
+
+If this is the case for your model, the solution is usually to go for the least worst option, comparing the comparable: compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
+</Note>
 
 **Paying attention to start and end of sentence tokens**
 
-Some models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.
+Some pretrained models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.
 
 You can also encounter some issues where your model won't stop on an end of sentence token like you would expect (for example, on `\n`), because your model will not predict this token alone but included in a higher level token (for example, `\n\n`, which can be a single token, especially for code models). In this case, you might need to add a specific check to "backtrack" on generated text to make sure you're cutting your generated sentence at the proper spot before computing metrics.
 
 **Multilinguality and tokenization**
 
-When looking at multilingual evaluations, you'll also need to see how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to cite a few), they will require language specific tokenizers to be split properly, else it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc.
+When looking at multilingual evaluations, you'll also need to see how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to cite a few), they will require language specific tokenizers to be split properly, else it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc. The number of tokens that the model is allowed to generate for an evaluation should also be language dependent, as not all languages are tokenized into similar amounts of tokens (go back to the tokenization section to see why).
 
 **Code evaluations and end of sentence tokens**
 
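To check whether your own tokenizer satisfies the `enc(context + choice) = enc(context) + enc(choice)` property discussed in the note above, you can compare the two tokenizations directly. A small sketch using a 🤗 `transformers` tokenizer; the `gpt2` checkpoint is only an example, and special tokens are disabled so they don't pollute the comparison.

```python
# Sketch: checking whether enc(context + choice) == enc(context) + enc(choice)
# for your tokenizer. "gpt2" is only an example checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

context = "Question: What is the capital of France?\nAnswer:"
choice = " Paris"

joint = tokenizer.encode(context + choice, add_special_tokens=False)
separate = (
    tokenizer.encode(context, add_special_tokens=False)
    + tokenizer.encode(choice, add_special_tokens=False)
)

print("joint   :", joint)
print("separate:", separate)
print("identical:", joint == separate)  # if False, context tokens "bleed" into the choice
```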
app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx CHANGED
@@ -9,14 +9,12 @@ import Sidenote from "../../../components/Sidenote.astro";
 
 ### My results are very bad
 
-The first thing to do is always to inspect your model generations in detail.
-- too strict model output parsing (before computing the metric) which leads to the answer being lost
-  - Fixing: adapt your parsing
-- unability of the models to follow your output format in few shot (frequent in recent models trained with instructions data, like llama 3.2 or Qwen 2.5)
-  - Fixing: either adapt your prompt format, or just assume that models should be able to follow it in few shot
-- exceedingly verbose model which never gets to the correct answer (more frequent in long context models and something we observed with Qwen and CommandR models)
-  - Fixing: either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly
+The first thing to do is always to inspect your model generations in detail.
+
+Some frequent problems you should look for when troubleshooting are:
+- Is your model output parsing too strict before computing the metric? It can lead to the answer being lost (obvious fix is to make it less strict, but you'll get more false positives!)
+- Is your model struggling to follow your output format in few shot? This frequently happens in recent models trained on too specific evaluation formats, and you can either adapt your prompt format, or just state that models should be able to follow it and that the ones struggling are not good enough for the task you are considering.
+- Is your model exceedingly verbose? In this case, it likely never gets to the correct answer. This is more frequent in long context models (we observed it with Qwen and Command R models in 2024) and reasoning models, especially if the task stops generation too soon. You can either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly.
 
 ### My model is very slow!
 ➡️ Changing the batch size
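As an illustration of the first point above (output parsing that is too strict), here is a hypothetical sketch contrasting a strict and a lenient answer extraction; the regular expressions are examples to adapt to your task, and loosening them does trade false negatives for false positives.

```python
# Sketch: strict vs lenient extraction of a final letter answer from a
# generation. The patterns are illustrative; tune them to your own task.
import re

generation = "Let's think step by step... so the best option is (B), final answer: B."

def strict_parse(text):
    # Only accepts an answer on its own line, e.g. "Answer: B"
    m = re.search(r"^Answer:\s*([A-D])\s*$", text, flags=re.MULTILINE)
    return m.group(1) if m else None

def lenient_parse(text):
    # Accepts the last standalone A-D letter anywhere in the text.
    matches = re.findall(r"\b([A-D])\b", text)
    return matches[-1] if matches else None

print(strict_parse(generation))   # None -> the answer would be scored as wrong
print(lenient_parse(generation))  # "B"
```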
@@ -48,21 +46,6 @@ And that's it!
 
 I would actually recommend using `<memory (in GB)> = <number of parameters (in G)> * (<precision factor> * 110%)`, to be on the safer side, as inference will require a bit more memory than just loading the model (you'll also need to load the batches).
 
-<Note title="Estimating GPU memory requirements" emoji="💾" variant="info">
-
-**Quick formula:**
-`Memory (GB) = Params (billions) × Precision factor × 1.1`
-
-**Precision factors:**
-- float32: 4
-- float16/bfloat16: 2
-- 8-bit: 1
-- 4-bit: 0.5
-
-The 1.1 multiplier accounts for batch loading overhead. Example: A 7B model in float16 needs ~15.4GB (7 × 2 × 1.1).
-
-</Note>
-
 ### My model does not fit on a GPU
 ➡️ Quantization
 
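The memory rule of thumb above fits in a couple of lines of code. A small sketch, using the usual bytes-per-parameter values for each precision and the 110% overhead factor recommended above:

```python
# Sketch: rough GPU memory estimate for inference,
# memory (GB) ~= params (billions) * bytes per parameter * 1.1 overhead.
PRECISION_FACTOR = {"float32": 4, "float16": 2, "bfloat16": 2, "8bit": 1, "4bit": 0.5}

def estimate_memory_gb(params_billions, precision):
    return params_billions * PRECISION_FACTOR[precision] * 1.1

for precision in ("float32", "float16", "8bit", "4bit"):
    print(f"7B model in {precision}: ~{estimate_memory_gb(7, precision):.1f} GB")
```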
@@ -73,18 +56,11 @@ However, using too low a precision can give worse results, so for some models (e
 
 Model parallelism includes a range of techniques which cut your model in smaller sub-model pieces, to load and run each of these smaller pieces on a single different GPU. This requires less memory since you never load the full model at once, but can be slower.
 
-The 2 main types of model parallelism are
-- Pipeline parallelism, where the model is split at the whole layer level, and the layers are dispatched on different GPUs. Since layer 1's output is layer 2's input, this leads to a slower execution, as GPUs will be idle while waiting, which is called a "bubble" (and data must be transferred from one GPU to the next). The bubble can be reduced by splitting the inputs into smaller batches. It's being natively added to PyTorch with the `PiPPy` [lib](https://github.com/pytorch/PiPPy), and this is what `accelerate` uses under the hood for parallelism.
-- Tensor parallelism, where the model is split at the matrix computation level. This means that the matrices will be split on rows or columns, and the total result aggregated. This is incredibly efficient as long as all GPUs are on the same node (to avoid inter node network bottlenecks), but can be hard to code. You'll find cool implementations of this in the `vllm` lib. It provides **insane speedups**.
-
 <Note title="Model parallelism strategies" emoji="🔀" variant="info">
 
-
-
-- **
-
-- **Tensor parallelism**: Split matrix operations across GPUs within each layer. Much faster (insane speedups!) but requires all GPUs on same node to avoid network bottlenecks. Check out `vllm` for implementations.
-
+The 2 main types of model parallelism are
+- **Pipeline parallelism**, where the model is split at the whole layer level, and the layers are dispatched on different GPUs. Since layer 1's output is layer 2's input, this leads to a slower execution, as GPUs will be idle while waiting, which is called a "bubble" (and data must be transferred from one GPU to the next). The bubble can be reduced by splitting the inputs into smaller batches. It's being natively added to PyTorch with the `PiPPy` [lib](https://github.com/pytorch/PiPPy), and this is what `accelerate` uses under the hood for parallelism.
+- **Tensor parallelism**, where the model is split at the matrix computation level. This means that the matrices will be split on rows or columns, and the total result aggregated. This is incredibly efficient as long as all GPUs are on the same node (to avoid inter node network bottlenecks), but can be hard to code. You'll find cool implementations of this in the `vllm` lib. It provides **insane speedups**.
 </Note>
 
 The best document on the different kinds of parallelism (including data parallelism, for speedups) is [here](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).
app/src/content/embeds/d3-evaluation-decision-tree.html ADDED
@@ -0,0 +1,333 @@
<div class="d3-evaluation-tree"></div>
<style>
  .d3-evaluation-tree {
    position: relative;
    width: 100%;
    min-height: 500px;
    overflow: visible;
  }
  .d3-evaluation-tree svg {
    display: block;
    width: 100%;
    height: auto;
  }
  .d3-evaluation-tree .node-rect {
    stroke-width: 2;
    rx: 8;
    ry: 8;
    cursor: pointer;
    transition: all 0.2s ease;
  }
  .d3-evaluation-tree .decision-node {
    stroke: var(--border-color);
  }
  .d3-evaluation-tree .result-node {
    stroke: var(--border-color);
  }
  .d3-evaluation-tree .warning-node {
    stroke: var(--border-color);
  }
  .d3-evaluation-tree .node-text {
    fill: var(--text-color);
    font-size: 12px;
    font-weight: 500;
    pointer-events: none;
    user-select: none;
  }
  .d3-evaluation-tree .link {
    fill: none;
    stroke: var(--border-color);
    stroke-width: 1.5;
    opacity: 0.5;
  }
  .d3-evaluation-tree .link-label {
    fill: var(--muted-color);
    font-size: 10px;
    font-weight: 500;
  }
  .d3-evaluation-tree .node-rect:hover {
    filter: brightness(1.05);
    stroke-width: 3;
  }
  .d3-evaluation-tree .d3-tooltip {
    position: absolute;
    top: 0;
    left: 0;
    transform: translate(-9999px, -9999px);
    pointer-events: none;
    padding: 8px 10px;
    border-radius: 8px;
    font-size: 12px;
    line-height: 1.35;
    border: 1px solid var(--border-color);
    background: var(--surface-bg);
    color: var(--text-color);
    box-shadow: 0 4px 24px rgba(0,0,0,.18);
    opacity: 0;
    transition: opacity .12s ease;
    max-width: 250px;
  }
</style>
<script>
(() => {
  const ensureD3 = (cb) => {
    if (window.d3 && typeof window.d3.select === 'function') return cb();
    let s = document.getElementById('d3-cdn-script');
    if (!s) {
      s = document.createElement('script');
      s.id = 'd3-cdn-script';
      s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
      document.head.appendChild(s);
    }
    const onReady = () => {
      if (window.d3 && typeof window.d3.select === 'function') cb();
    };
    s.addEventListener('load', onReady, { once: true });
    if (window.d3) onReady();
  };

  const bootstrap = () => {
    const scriptEl = document.currentScript;
    let container = scriptEl ? scriptEl.previousElementSibling : null;
    if (!(container && container.classList && container.classList.contains('d3-evaluation-tree'))) {
      const candidates = Array.from(document.querySelectorAll('.d3-evaluation-tree'))
        .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
      container = candidates[candidates.length - 1] || null;
    }
    if (!container) return;
    if (container.dataset) {
      if (container.dataset.mounted === 'true') return;
      container.dataset.mounted = 'true';
    }

    // Tooltip setup
    container.style.position = container.style.position || 'relative';
    let tip = container.querySelector('.d3-tooltip');
    let tipInner;
    if (!tip) {
      tip = document.createElement('div');
      tip.className = 'd3-tooltip';
      tipInner = document.createElement('div');
      tipInner.className = 'd3-tooltip__inner';
      tipInner.style.textAlign = 'left';
      tip.appendChild(tipInner);
      container.appendChild(tip);
    } else {
      tipInner = tip.querySelector('.d3-tooltip__inner') || tip;
    }

    // Get colors from ColorPalettes with fallback
    const getColors = () => {
      if (window.ColorPalettes && window.ColorPalettes.getColors) {
        return {
          decision: window.ColorPalettes.getColors('sequential', 3)[0],
          result: window.ColorPalettes.getColors('sequential', 3)[2],
          warning: window.ColorPalettes.getColors('diverging', 3)[1]
        };
      }
      // Fallback colors
      return {
        decision: '#60A5FA',
        result: '#34D399',
        warning: '#FBBF24'
      };
    };

    // Define the decision tree structure
    const treeData = {
      name: "What are you\nevaluating?",
      type: "decision",
      tooltip: "Starting point: Identify your evaluation task",
      children: [
        {
          name: "Have gold\nstandard?",
          edgeLabel: "Start",
          type: "decision",
          tooltip: "Do you have a clear, correct reference answer?",
          children: [
            {
              name: "Objective &\nverifiable?",
              edgeLabel: "Yes",
              type: "decision",
              tooltip: "Is the answer factual and unambiguous?",
              children: [
                {
                  name: "Format\nconstrained?",
                  edgeLabel: "Yes",
                  type: "decision",
                  tooltip: "Can you verify output structure programmatically?",
                  children: [
                    {
                      name: "Functional\nTesting",
                      edgeLabel: "Yes",
                      type: "result",
                      tooltip: "Use IFEval-style functional tests or unit tests"
                    },
                    {
                      name: "Automated\nMetrics",
                      edgeLabel: "No",
                      type: "result",
                      tooltip: "Use exact match, F1, BLEU, etc."
                    }
                  ]
                }
              ]
            },
            {
              name: "Human Eval\nor Judges",
              edgeLabel: "Subjective",
              type: "warning",
              tooltip: "Multiple valid answers exist; need human judgment or model judges"
            }
          ]
        },
        {
          name: "Budget &\nscale?",
          edgeLabel: "No gold",
          type: "decision",
          tooltip: "No reference answer available",
          children: [
            {
              name: "Expert Human\nAnnotators",
              edgeLabel: "High",
              type: "result",
              tooltip: "Best for critical use cases (medical, legal)"
            },
            {
              name: "Model Judges\n(validate!)",
              edgeLabel: "Medium",
              type: "warning",
              tooltip: "Validate judge quality against human baseline"
            },
            {
              name: "Arena or\nVibe-checks",
              edgeLabel: "Low",
              type: "warning",
              tooltip: "Crowdsourced or exploratory evaluation"
            }
          ]
        }
      ]
    };

    // SVG setup
    const svg = d3.select(container).append('svg');
    const g = svg.append('g').attr('transform', 'translate(40, 30)');

    let width = container.clientWidth || 900;
    const nodeWidth = 140;
    const nodeHeight = 50;

    function render() {
      const colors = getColors();
      width = container.clientWidth || 900;

      const treeLayout = d3.tree()
        .size([width - 80, 500])
        .separation((a, b) => (a.parent === b.parent ? 1.3 : 1.6));

      const root = d3.hierarchy(treeData);
      const treeNodes = treeLayout(root);

      const maxDepth = root.height;
      const height = (maxDepth + 1) * 120 + 60;

      svg.attr('viewBox', `0 0 ${width} ${height}`)
        .attr('preserveAspectRatio', 'xMidYMin meet');

      // Clear previous
      g.selectAll('*').remove();

      // Links
      g.selectAll('.link')
        .data(treeNodes.links())
        .join('path')
        .attr('class', 'link')
        .attr('d', d3.linkVertical()
          .x(d => d.x)
          .y(d => d.y)
        );

      // Link labels
      g.selectAll('.link-label')
        .data(treeNodes.links().filter(d => d.target.data.edgeLabel))
        .join('text')
        .attr('class', 'link-label')
        .attr('x', d => d.target.x)
        .attr('y', d => (d.source.y + d.target.y) / 2 - 5)
        .attr('text-anchor', 'middle')
        .text(d => d.target.data.edgeLabel);

      // Node groups
      const nodes = g.selectAll('.node')
        .data(treeNodes.descendants())
        .join('g')
        .attr('class', 'node')
        .attr('transform', d => `translate(${d.x},${d.y})`)
        .on('mouseenter', function(event, d) {
          if (d.data.tooltip) {
            const [mx, my] = d3.pointer(event, container);
            tip.style.opacity = '1';
            tip.style.transform = `translate(${mx + 10}px, ${my - 10}px)`;
            tipInner.textContent = d.data.tooltip;
          }
        })
        .on('mouseleave', function() {
          tip.style.opacity = '0';
          tip.style.transform = 'translate(-9999px, -9999px)';
        });

      // Rectangles
      nodes.append('rect')
        .attr('class', d => {
          if (d.data.type === 'result') return 'node-rect result-node';
          if (d.data.type === 'warning') return 'node-rect warning-node';
          return 'node-rect decision-node';
        })
        .attr('x', -nodeWidth / 2)
        .attr('y', -nodeHeight / 2)
        .attr('width', nodeWidth)
        .attr('height', nodeHeight)
        .attr('fill', d => {
          if (d.data.type === 'result') return colors.result;
          if (d.data.type === 'warning') return colors.warning;
          return colors.decision;
        });

      // Text (multiline support)
      nodes.each(function(d) {
        const nodeG = d3.select(this);
        const lines = d.data.name.split('\n');
        const lineHeight = 14;
        const startY = -(lines.length - 1) * lineHeight / 2;

        lines.forEach((line, i) => {
          nodeG.append('text')
            .attr('class', 'node-text')
            .attr('text-anchor', 'middle')
            .attr('y', startY + i * lineHeight)
            .attr('dy', '0.35em')
            .text(line);
        });
      });
    }

    // Initial render
    render();

    // Responsive resize
    if (window.ResizeObserver) {
      const ro = new ResizeObserver(() => render());
      ro.observe(container);
    } else {
      window.addEventListener('resize', render);
    }
  };

  if (document.readyState === 'loading') {
    document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
  } else {
    ensureD3(bootstrap);
  }
})();
</script>