e5-base-v2-code-search
A fine-tuned code search embedding model based on intfloat/e5-base-v2 (110M parameters). Optimized for natural language → code retrieval across 9 programming languages.
Built for cqs, a semantic code search tool for AI agents.
Key Results
| Benchmark | Metric | Score | vs Base E5 |
|---|---|---|---|
| CodeSearchNet (6 lang avg) | NDCG@10 | 0.707 | +8.0pp |
| CoIR (9 tasks avg) | NDCG@10 (×100) | 49.19 | -0.29pp |
| CosQA (out-of-distribution) | NDCG@10 | 0.354 | +2.5pp |
Per-Language NDCG@10 (CodeSearchNet)
| Language | Base E5 | This Model | Delta |
|---|---|---|---|
| Go | 0.680 | 0.785 | +0.105 |
| Java | 0.560 | 0.656 | +0.096 |
| JavaScript | 0.474 | 0.545 | +0.071 |
| Ruby | 0.533 | 0.613 | +0.080 |
| Python | 0.915 | 0.975 | +0.060 |
| PHP | 0.601 | 0.667 | +0.066 |
Full CoIR Breakdown (9 tasks, NDCG@10)
| Task | Base E5 | This Model | Delta |
|---|---|---|---|
| stackoverflow-qa | 0.879 | 0.882 | +0.003 |
| codefeedback-st | 0.745 | 0.737 | -0.008 |
| codesearchnet (6 lang) | 0.627 | 0.707 | +0.080 |
| synthetic-text2sql | 0.554 | 0.558 | +0.004 |
| codesearchnet-ccr | 0.569 | 0.508 | -0.061 |
| codefeedback-mt | 0.416 | 0.382 | -0.034 |
| cosqa | 0.329 | 0.354 | +0.025 |
| codetrans-dl | 0.219 | 0.194 | -0.025 |
| apps | 0.115 | 0.105 | -0.010 |
| Average (×100) | 49.48 | 49.19 | -0.29 |
Wins on 4/9 tasks. Gains concentrated on code search (CSN +8.0pp, CosQA +2.5pp). Losses on cross-code retrieval (CCR -6.1pp) and code translation (-2.5pp).
CoIR Leaderboard Context
Among 110M-class models on the CoIR benchmark (ACL 2025), E5-base-v2 is already #7 out of 13 entries. This LoRA fine-tune nearly matches the base model's generalist performance (49.19 vs 49.48) while dramatically improving code search (+8.0pp CSN). It extends the lead over other 110M models (BGE-Base 42.77, GTE-Base 36.75, UniXcoder 37.33) while remaining CPU-runnable in <100ms per query.
Training
- Method: LoRA (rank 16) fine-tuning of E5-base-v2
- Data: 200,000 pairs from 9 languages (CSN: Go/Java/JS/Ruby/Python/PHP + Stack: Rust/TS/C++), with CoRNStack hard negatives
- Loss: CachedGISTEmbedLoss (guide model: base E5, margin 0.05) + MatryoshkaLoss (768/384/192/128)
- Training time: ~7 hours on A6000
- Epochs: 1
- Hardware: NVIDIA A6000 48GB
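The rank-16 LoRA setup above can be pictured as a low-rank additive update to each adapted weight matrix. A toy numpy sketch (dimensions match E5-base's hidden size; the alpha value and zero-init convention are illustrative, not taken from the actual training run):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 16, 32                # hidden size, LoRA rank; alpha is illustrative

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialized

delta = (alpha / r) * (B @ A)            # update has rank at most r = 16
W_adapted = W + delta                    # at zero-init the adapter is a no-op
```

Only `A` and `B` (2·d·r ≈ 24.6k parameters per matrix, vs. d² ≈ 590k for full fine-tuning) receive gradients; the base weights stay frozen.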
Why v7?
We trained 10+ LoRA variants with different data sizes, ranks, epochs, losses, and data mixes. v7 uses GISTEmbedLoss with hard negatives to prevent the generalist degradation seen in earlier versions:
| Config | Data | Loss | CSN NDCG@10 | CoIR Overall |
|---|---|---|---|---|
| Base E5 | — | — | 0.627 | 49.48 |
| v4 | 166k CSN | MNR, 3 epochs | 0.695 | — (over-specialized) |
| v5 (previous) | 166k CSN | MNR | 0.683 | 48.67 |
| v7 (current) | 200k 9-lang, hard negs | GIST+Matryoshka | 0.707 | 49.19 |
v5 improved CSN but degraded generalist tasks (-0.81pp overall). v7 uses GISTEmbedLoss (false negative filtering via frozen guide model) and hard negatives (CoRNStack recipe) to recover most of the generalist loss while pushing CSN even higher.
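The false-negative filtering idea behind GISTEmbedLoss can be sketched in a few lines: a frozen guide model scores every in-batch "negative", and any negative the guide rates nearly as similar as the labeled positive is masked out of the contrastive loss, since it probably answers the query too. A toy numpy version with made-up guide scores (the real loss operates on full similarity matrices inside sentence-transformers):

```python
import numpy as np

# Guide-model similarities for one query vs. a batch of 5 passages.
# Index 0 is the labeled positive; the rest are in-batch "negatives".
guide_sim = np.array([0.82, 0.30, 0.81, 0.45, 0.10])
margin = 0.05  # same margin as in the training config above

pos_sim = guide_sim[0]
# Any "negative" scoring above (positive - margin) is a likely false
# negative — a passage that actually matches the query — so mask it out.
is_false_negative = guide_sim > (pos_sim - margin)
is_false_negative[0] = False          # never mask the positive itself

keep_as_negative = ~is_false_negative # passage 2 (0.81 > 0.77) is filtered
```

Here passage 2 survives as a false negative candidate and is excluded, so the model is never penalized for ranking it highly.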
Usage
With sentence-transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/e5-base-v2-code-search")

# E5 convention: "query: " / "passage: " prefixes are required
queries = ["query: find all files matching a glob pattern"]
passages = ["passage: def glob_match(pattern, path): ..."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
similarity = q_emb @ p_emb.T
```
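Because training included MatryoshkaLoss at 768/384/192/128, embeddings can be truncated to one of those prefix lengths and re-normalized for a smaller index, with modest quality loss. A numpy sketch (the vectors here are random stand-ins for real model output):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((2, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)      # full 768-d, unit-norm

dim = 384                                              # any of 768/384/192/128
small = emb[:, :dim].copy()                            # keep the leading prefix
small /= np.linalg.norm(small, axis=1, keepdims=True)  # re-normalize after truncation
```

Truncating to 384 dims halves index size and dot-product cost, which matters for the CPU-only deployment described above.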
With cqs (semantic code search CLI)
```shell
# Uses this model by default
cqs "find files matching glob" --json

# Override with base E5
CQS_EMBEDDING_MODEL=intfloat/e5-base-v2 cqs "find files matching glob"
```
With ONNX Runtime
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jamie8johnson/e5-base-v2-code-search")
session = ort.InferenceSession("model.onnx")

inputs = tokenizer("query: sort array by key", return_tensors="np", padding=True, truncation=True)
outputs = session.run(None, dict(inputs))

# Mean pooling (E5 convention — NOT CLS token)
token_embeddings = outputs[0]  # (batch, seq_len, hidden)
mask = inputs["attention_mask"][..., None]
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2-normalize before cosine similarity (matches normalize_embeddings=True above)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
```
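The masked mean in the pooling step ignores padding positions entirely. A toy numpy check with fake token embeddings (not actual model output) makes this concrete:

```python
import numpy as np

token_embeddings = np.array([[[1.0, 2.0],
                              [3.0, 4.0],
                              [9.0, 9.0]]])   # (1, 3, 2); last row is padding
attention_mask = np.array([[1, 1, 0]])        # padding position masked out

mask = attention_mask[..., None]
pooled = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
# pooled averages only the two real tokens: [[2.0, 3.0]]
```

Averaging all three rows instead would give [[4.33, 5.0]], so skipping the mask silently corrupts embeddings for any padded batch.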
Enrichment Pipeline
This model achieves its best results when combined with cqs's NL enrichment pipeline, which prepends structured metadata to code before embedding:
- Type-aware signatures (free) — append `fn foo(x: i32) -> Result<()>` to the NL description
- Call graph context (free) — append caller/callee function names
- LLM discriminating summaries (optional, ~$0.15/3k functions) — "what makes this function unique"
- Doc comment generation (optional, ~$1.50/3k functions) — structured parameter/return docs
The enrichment transforms code into richer text before the embedding model sees it. This is why a 110M model can compete with specialized models 3-20x larger — the intelligence is in the text transformation, not the model size.
Limitations
- Trained on 9 languages (Go, Java, JavaScript, Ruby, Python, PHP, Rust, TypeScript, C++). Other languages rely on E5's base capability.
- English queries only. E5-base-v2's multilingual capability is untested for code search.
- 512 token max sequence length. Long functions are truncated.
- Adversarial confusable-function pairs (e.g., 6 sorting variants) regress vs base model. Real-world NL→code retrieval improves.
Citation
```bibtex
@software{cqs2026,
  title={cqs: Semantic Code Search with Local Embeddings},
  author={Johnson, Jamie},
  url={https://github.com/jamie8johnson/cqs},
  year={2026}
}
```
License
Apache 2.0 (same as base E5-base-v2).