e5-base-v2-code-search

A fine-tuned code search embedding model based on intfloat/e5-base-v2 (110M parameters). Optimized for natural language → code retrieval across 9 programming languages.

Built for cqs, a semantic code search tool for AI agents.

Key Results

| Benchmark | Metric | Score | vs Base E5 |
|---|---|---|---|
| CodeSearchNet (6-lang avg) | NDCG@10 | 0.707 | +8.0pp |
| CoIR (9-task avg) | NDCG@10 | 49.19 | -0.29pp |
| CosQA (out-of-distribution) | NDCG@10 | 0.354 | +2.5pp |

CoIR averages follow the leaderboard convention of NDCG@10 × 100; all other scores are on the 0–1 scale.
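NDCG@10, the metric used throughout, rewards placing the true match near the top of the first 10 results, with a logarithmic discount by rank. A minimal sketch of the computation for a single query (the helper name is illustrative, not from the cqs codebase):

```python
import math

def ndcg_at_10(relevances):
    """NDCG@10 for one query: `relevances` lists the graded relevance
    of returned passages in ranked order (1 = the true match)."""
    dcg = sum(rel / math.log2(rank + 2)          # rank 0 -> discount log2(2)
              for rank, rel in enumerate(relevances[:10]))
    ideal = sorted(relevances, reverse=True)     # best possible ordering
    idcg = sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(ideal[:10]))
    return dcg / idcg if idcg > 0 else 0.0

# True match retrieved at rank 4 of 10 candidates:
print(round(ndcg_at_10([0, 0, 0, 1, 0, 0, 0, 0, 0, 0]), 3))  # -> 0.431
```

Benchmark scores are the mean of this value over all queries, so +8.0pp means the true snippet lands meaningfully higher on average.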

Per-Language NDCG@10 (CodeSearchNet)

| Language | Base E5 | This Model | Delta |
|---|---|---|---|
| Go | 0.680 | 0.785 | +0.105 |
| Java | 0.560 | 0.656 | +0.096 |
| JavaScript | 0.474 | 0.545 | +0.071 |
| Ruby | 0.533 | 0.613 | +0.080 |
| Python | 0.915 | 0.975 | +0.060 |
| PHP | 0.601 | 0.667 | +0.066 |

Full CoIR Breakdown (9 tasks, NDCG@10)

| Task | Base E5 | This Model | Delta |
|---|---|---|---|
| stackoverflow-qa | 0.879 | 0.882 | +0.003 |
| codefeedback-st | 0.745 | 0.737 | -0.008 |
| codesearchnet (6 lang) | 0.627 | 0.707 | +0.080 |
| synthetic-text2sql | 0.554 | 0.558 | +0.004 |
| codesearchnet-ccr | 0.569 | 0.508 | -0.061 |
| codefeedback-mt | 0.416 | 0.382 | -0.034 |
| cosqa | 0.329 | 0.354 | +0.025 |
| codetrans-dl | 0.219 | 0.194 | -0.025 |
| apps | 0.115 | 0.105 | -0.010 |
| Average (×100) | 49.48 | 49.19 | -0.29pp |

The fine-tune wins on 5 of 9 tasks. Gains concentrate on code search (CSN +8.0pp, CosQA +2.5pp); losses concentrate on cross-code retrieval (CCR -6.1pp) and code translation (-2.5pp).

CoIR Leaderboard Context

Among 110M-class models on the CoIR benchmark (ACL 2025), E5-base-v2 is already #7 out of 13 entries. This LoRA fine-tune nearly matches the base model's generalist performance (49.19 vs 49.48) while dramatically improving code search (+8.0pp CSN). It extends the lead over other 110M models (BGE-Base 42.77, GTE-Base 36.75, UniXcoder 37.33) while remaining CPU-runnable in <100ms per query.

Training

  • Method: LoRA (rank 16) fine-tuning of E5-base-v2
  • Data: 200,000 pairs from 9 languages (CSN: Go/Java/JS/Ruby/Python/PHP + Stack: Rust/TS/C++), with CoRNStack hard negatives
  • Loss: CachedGISTEmbedLoss (guide model: base E5, margin 0.05) + MatryoshkaLoss (768/384/192/128)
  • Training time: ~7 hours
  • Epochs: 1
  • Hardware: NVIDIA A6000 (48 GB)
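Because training used MatryoshkaLoss with nested dimensions, embeddings can be truncated to 384, 192, or 128 dimensions for a smaller index at some accuracy cost. A sketch of the truncate-and-renormalize step in plain NumPy (the 768-d input here is random stand-in data, not real model output):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize
    so cosine similarity still works on the shortened vectors."""
    head = emb[..., :dim]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

full = np.random.randn(4, 768).astype(np.float32)  # stand-in for model output
small = truncate_embedding(full, 128)              # 6x smaller index
assert small.shape == (4, 128)
assert np.allclose(np.linalg.norm(small, axis=-1), 1.0, atol=1e-5)
```

The leading dimensions carry the most information under this loss, so slicing from the front (rather than pooling or projecting) is the intended way to shrink the vectors.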

Why v7?

We trained 10+ LoRA variants with different data sizes, ranks, epochs, losses, and data mixes. v7 uses GISTEmbedLoss with hard negatives to prevent the generalist degradation seen in earlier versions:

| Config | Data | Loss | CSN NDCG@10 | CoIR Overall |
|---|---|---|---|---|
| Base E5 | — | — | 0.627 | 49.48 |
| v5 (previous) | 166k CSN | MNR | 0.683 | 48.67 |
| v7 (current) | 200k, 9 languages, hard negatives | GIST + Matryoshka | 0.707 | 49.19 |
| v4 | 166k CSN | MNR, 3 epochs | 0.695 | — (over-specialized) |

v5 improved CSN but degraded generalist tasks (-0.81pp overall). v7 uses GISTEmbedLoss (false negative filtering via frozen guide model) and hard negatives (CoRNStack recipe) to recover most of the generalist loss while pushing CSN even higher.
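The loss wiring can be sketched with the sentence-transformers API. This is a configuration sketch under stated assumptions, not the exact training script: the margin-0.05 setting mentioned above is omitted because its parameter name varies across library versions, and data loading/trainer setup are elided.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss, MatryoshkaLoss

model = SentenceTransformer("intfloat/e5-base-v2")  # trainable student
guide = SentenceTransformer("intfloat/e5-base-v2")  # frozen guide model

# GIST loss uses the guide's similarity scores to filter in-batch
# "negatives" that are actually near-duplicates of the positive
# (false negatives); the cached variant trades compute for memory.
inner_loss = CachedGISTEmbedLoss(model=model, guide=guide)

# Matryoshka wrapping applies the loss at nested truncation widths so
# shortened embeddings remain useful.
loss = MatryoshkaLoss(model=model, loss=inner_loss,
                      matryoshka_dims=[768, 384, 192, 128])
```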

Usage

With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/e5-base-v2-code-search")

# E5 convention: prefix queries with "query: " and documents with "passage: "
queries = ["query: find all files matching a glob pattern"]
passages = ["passage: def glob_match(pattern, path): ..."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Embeddings are normalized, so the dot product is cosine similarity
similarity = q_emb @ p_emb.T
```

With cqs (semantic code search CLI)

```bash
# Uses this model by default
cqs "find files matching glob" --json

# Override with base E5
CQS_EMBEDDING_MODEL=intfloat/e5-base-v2 cqs "find files matching glob"
```

With ONNX Runtime

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jamie8johnson/e5-base-v2-code-search")
session = ort.InferenceSession("model.onnx")

inputs = tokenizer("query: sort array by key", return_tensors="np",
                   padding=True, truncation=True)
outputs = session.run(None, dict(inputs))

# Mean pooling (E5 convention, not the CLS token)
token_embeddings = outputs[0]  # (batch, seq_len, hidden)
mask = inputs["attention_mask"][..., None]
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2-normalize so dot products give cosine similarity
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
```

Enrichment Pipeline

This model achieves its best results when combined with cqs's NL enrichment pipeline, which prepends structured metadata to code before embedding:

  1. Type-aware signatures (free) — append fn foo(x: i32) -> Result<()> to NL description
  2. Call graph context (free) — append caller/callee function names
  3. LLM discriminating summaries (optional, ~$0.15/3k functions) — "what makes this function unique"
  4. Doc comment generation (optional, ~$1.50/3k functions) — structured parameter/return docs

The enrichment transforms code into richer text before the embedding model sees it. This is why a 110M model can compete with specialized models 3-20x larger: the intelligence is in the text transformation, not the model size.
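At its core the enrichment step is plain text assembly. A hypothetical sketch of what the embedding model might see for one function — the field layout and helper name here are illustrative, and cqs's actual format may differ:

```python
def enrich(code, signature, callers, summary=None):
    """Prepend structured metadata to code before embedding,
    using the E5 'passage: ' prefix convention.
    Illustrative field layout, not cqs's actual format."""
    parts = [f"signature: {signature}",
             f"called by: {', '.join(callers)}"]
    if summary:  # optional LLM-written discriminating summary
        parts.append(f"summary: {summary}")
    parts.append(code)
    return "passage: " + "\n".join(parts)

text = enrich("def glob_match(pattern, path): ...",
              "glob_match(pattern: str, path: str) -> bool",
              ["walk_tree", "filter_paths"],
              summary="matches a single path against one glob pattern")
print(text.splitlines()[0])
# -> passage: signature: glob_match(pattern: str, path: str) -> bool
```

The free tiers (signature, call graph) need only static analysis; the paid tiers add an LLM call per function, which is where the quoted per-3k-function costs come from.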

Limitations

  • Trained on 9 languages (Go, Java, JavaScript, Ruby, Python, PHP, Rust, TypeScript, C++). Other languages rely on E5's base capability.
  • English queries only. The base E5-base-v2 model is English-focused, and non-English queries are untested for code search.
  • 512 token max sequence length. Long functions are truncated.
  • Adversarial confusable-function pairs (e.g., 6 sorting variants) regress vs base model. Real-world NL→code retrieval improves.

Citation

```bibtex
@software{cqs2026,
  title={cqs: Semantic Code Search with Local Embeddings},
  author={Johnson, Jamie},
  url={https://github.com/jamie8johnson/cqs},
  year={2026}
}
```

License

Apache 2.0 (same as base E5-base-v2).
