e5-base-v2-code-search

A fine-tuned code search embedding model based on intfloat/e5-base-v2 (110M parameters). Optimized for natural language → code retrieval across 9 programming languages.

Built for cqs, a semantic code search tool for AI agents.

Key Results

| Benchmark | Metric | Score | vs Base E5 |
|---|---|---|---|
| CodeSearchNet (6-lang avg) | NDCG@10 | 0.707 | +8.0pp |
| CoIR (9-task avg) | NDCG@10 | 49.19 | -0.29pp |
| CosQA (out-of-distribution) | NDCG@10 | 0.354 | +2.5pp |

CoIR averages follow the leaderboard convention of NDCG@10 × 100; all other scores are on the 0–1 scale.
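NDCG@10, the metric used throughout, rewards placing the true match near the top of the first 10 results, with a logarithmic discount by rank. A minimal sketch of the computation for a single query (the helper name is illustrative, not from the cqs codebase):

```python
import math

def ndcg_at_10(relevances):
    """NDCG@10 for one query: `relevances` lists the graded relevance
    of returned passages in ranked order (1 = the true match)."""
    dcg = sum(rel / math.log2(rank + 2)          # rank 0 -> discount log2(2)
              for rank, rel in enumerate(relevances[:10]))
    ideal = sorted(relevances, reverse=True)     # best possible ordering
    idcg = sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(ideal[:10]))
    return dcg / idcg if idcg > 0 else 0.0

# True match retrieved at rank 4 of 10 candidates:
print(round(ndcg_at_10([0, 0, 0, 1, 0, 0, 0, 0, 0, 0]), 3))  # -> 0.431
```

Benchmark scores are the mean of this value over all queries, so +8.0pp means the true snippet lands meaningfully higher on average.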

Per-Language NDCG@10 (CodeSearchNet)

| Language | Base E5 | This Model | Delta |
|---|---|---|---|
| Go | 0.680 | 0.785 | +0.105 |
| Java | 0.560 | 0.656 | +0.096 |
| JavaScript | 0.474 | 0.545 | +0.071 |
| Ruby | 0.533 | 0.613 | +0.080 |
| Python | 0.915 | 0.975 | +0.060 |
| PHP | 0.601 | 0.667 | +0.066 |

Full CoIR Breakdown (9 tasks, NDCG@10)

| Task | Base E5 | This Model | Delta |
|---|---|---|---|
| stackoverflow-qa | 0.879 | 0.882 | +0.003 |
| codefeedback-st | 0.745 | 0.737 | -0.008 |
| codesearchnet (6 lang) | 0.627 | 0.707 | +0.080 |
| synthetic-text2sql | 0.554 | 0.558 | +0.004 |
| codesearchnet-ccr | 0.569 | 0.508 | -0.061 |
| codefeedback-mt | 0.416 | 0.382 | -0.034 |
| cosqa | 0.329 | 0.354 | +0.025 |
| codetrans-dl | 0.219 | 0.194 | -0.025 |
| apps | 0.115 | 0.105 | -0.010 |
| Average (×100) | 49.48 | 49.19 | -0.29pp |

The fine-tune wins on 5 of 9 tasks. Gains concentrate on code search (CSN +8.0pp, CosQA +2.5pp); losses concentrate on cross-code retrieval (CCR -6.1pp) and code translation (-2.5pp).

CoIR Leaderboard Context

Among 110M-class models on the CoIR benchmark (ACL 2025), E5-base-v2 is already #7 out of 13 entries. This LoRA fine-tune nearly matches the base model's generalist performance (49.19 vs 49.48) while dramatically improving code search (+8.0pp CSN). It extends the lead over other 110M models (BGE-Base 42.77, GTE-Base 36.75, UniXcoder 37.33) while remaining CPU-runnable in <100ms per query.

Training

  • Method: LoRA (rank 16) fine-tuning of E5-base-v2
  • Data: 200,000 pairs from 9 languages (CSN: Go/Java/JS/Ruby/Python/PHP + Stack: Rust/TS/C++), with CoRNStack hard negatives
  • Loss: CachedGISTEmbedLoss (guide model: base E5, margin 0.05) + MatryoshkaLoss (768/384/192/128)
  • Training time: ~7 hours
  • Epochs: 1
  • Hardware: NVIDIA A6000 (48 GB)
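Because training used MatryoshkaLoss with nested dimensions, embeddings can be truncated to 384, 192, or 128 dimensions for a smaller index at some accuracy cost. A sketch of the truncate-and-renormalize step in plain NumPy (the 768-d input here is random stand-in data, not real model output):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize
    so cosine similarity still works on the shortened vectors."""
    head = emb[..., :dim]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

full = np.random.randn(4, 768).astype(np.float32)  # stand-in for model output
small = truncate_embedding(full, 128)              # 6x smaller index
assert small.shape == (4, 128)
assert np.allclose(np.linalg.norm(small, axis=-1), 1.0, atol=1e-5)
```

The leading dimensions carry the most information under this loss, so slicing from the front (rather than pooling or projecting) is the intended way to shrink the vectors.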

Why v7?

We trained 10+ LoRA variants with different data sizes, ranks, epochs, losses, and data mixes. v7 uses GISTEmbedLoss with hard negatives to prevent the generalist degradation seen in earlier versions:

| Config | Data | Loss | CSN NDCG@10 | CoIR Overall |
|---|---|---|---|---|
| Base E5 | — | — | 0.627 | 49.48 |
| v5 (previous) | 166k CSN | MNR | 0.683 | 48.67 |
| v7 (current) | 200k, 9 languages, hard negatives | GIST + Matryoshka | 0.707 | 49.19 |
| v4 | 166k CSN | MNR, 3 epochs | 0.695 | — (over-specialized) |

v5 improved CSN but degraded generalist tasks (-0.81pp overall). v7 uses GISTEmbedLoss (false negative filtering via frozen guide model) and hard negatives (CoRNStack recipe) to recover most of the generalist loss while pushing CSN even higher.
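The loss wiring can be sketched with the sentence-transformers API. This is a configuration sketch under stated assumptions, not the exact training script: the margin-0.05 setting mentioned above is omitted because its parameter name varies across library versions, and data loading/trainer setup are elided.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss, MatryoshkaLoss

model = SentenceTransformer("intfloat/e5-base-v2")  # trainable student
guide = SentenceTransformer("intfloat/e5-base-v2")  # frozen guide model

# GIST loss uses the guide's similarity scores to filter in-batch
# "negatives" that are actually near-duplicates of the positive
# (false negatives); the cached variant trades compute for memory.
inner_loss = CachedGISTEmbedLoss(model=model, guide=guide)

# Matryoshka wrapping applies the loss at nested truncation widths so
# shortened embeddings remain useful.
loss = MatryoshkaLoss(model=model, loss=inner_loss,
                      matryoshka_dims=[768, 384, 192, 128])
```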

Usage

With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/e5-base-v2-code-search")

# E5 convention: prefix queries with "query: " and documents with "passage: "
queries = ["query: find all files matching a glob pattern"]
passages = ["passage: def glob_match(pattern, path): ..."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Embeddings are normalized, so the dot product is cosine similarity
similarity = q_emb @ p_emb.T
```

With cqs (semantic code search CLI)

```bash
# Uses this model by default
cqs "find files matching glob" --json

# Override with base E5
CQS_EMBEDDING_MODEL=intfloat/e5-base-v2 cqs "find files matching glob"
```

With ONNX Runtime

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jamie8johnson/e5-base-v2-code-search")
session = ort.InferenceSession("model.onnx")

inputs = tokenizer("query: sort array by key", return_tensors="np",
                   padding=True, truncation=True)
outputs = session.run(None, dict(inputs))

# Mean pooling (E5 convention, not the CLS token)
token_embeddings = outputs[0]  # (batch, seq_len, hidden)
mask = inputs["attention_mask"][..., None]
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2-normalize so dot products give cosine similarity
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
```

Enrichment Pipeline

This model achieves its best results when combined with cqs's NL enrichment pipeline, which prepends structured metadata to code before embedding:

  1. Type-aware signatures (free) — append fn foo(x: i32) -> Result<()> to NL description
  2. Call graph context (free) — append caller/callee function names
  3. LLM discriminating summaries (optional, ~$0.15/3k functions) — "what makes this function unique"
  4. Doc comment generation (optional, ~$1.50/3k functions) — structured parameter/return docs

The enrichment transforms code into richer text before the embedding model sees it. This is why a 110M model can compete with specialized models 3-20x larger: the intelligence is in the text transformation, not the model size.
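At its core the enrichment step is plain text assembly. A hypothetical sketch of what the embedding model might see for one function — the field layout and helper name here are illustrative, and cqs's actual format may differ:

```python
def enrich(code, signature, callers, summary=None):
    """Prepend structured metadata to code before embedding,
    using the E5 'passage: ' prefix convention.
    Illustrative field layout, not cqs's actual format."""
    parts = [f"signature: {signature}",
             f"called by: {', '.join(callers)}"]
    if summary:  # optional LLM-written discriminating summary
        parts.append(f"summary: {summary}")
    parts.append(code)
    return "passage: " + "\n".join(parts)

text = enrich("def glob_match(pattern, path): ...",
              "glob_match(pattern: str, path: str) -> bool",
              ["walk_tree", "filter_paths"],
              summary="matches a single path against one glob pattern")
print(text.splitlines()[0])
# -> passage: signature: glob_match(pattern: str, path: str) -> bool
```

The free tiers (signature, call graph) need only static analysis; the paid tiers add an LLM call per function, which is where the quoted per-3k-function costs come from.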

Limitations

  • Trained on 9 languages (Go, Java, JavaScript, Ruby, Python, PHP, Rust, TypeScript, C++). Other languages rely on E5's base capability.
  • English queries only. The base E5-base-v2 model is English-focused, and non-English queries are untested for code search.
  • 512 token max sequence length. Long functions are truncated.
  • Adversarial confusable-function pairs (e.g., 6 sorting variants) regress vs base model. Real-world NL→code retrieval improves.

Citation

```bibtex
@software{cqs2026,
  title={cqs: Semantic Code Search with Local Embeddings},
  author={Johnson, Jamie},
  url={https://github.com/jamie8johnson/cqs},
  year={2026}
}
```

License

Apache 2.0 (same as base E5-base-v2).
