NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization
NA-SapBERT is a biomedical sentence embedding model designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.
This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:
- abbreviations (e.g., "NAD", "DM")
- misspellings
- shorthand / telegraphic clinical text
- surface variation in real-world clinical notes
What This Model Is
NA-SapBERT is an encoder only.
It maps input text → 768-dimensional, L2-normalized embedding vectors.
It does NOT include:
- retrieval logic
- FAISS index
- exact-match lookup
- rewrite modules
- reranking
These belong to downstream pipelines.
Key Idea
The model is trained using contrastive learning to align:
- noisy clinical mentions
- clean ontology concept names and synonyms
This improves embedding robustness and semantic consistency.
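For intuition, training aligns pairs like the ones below in embedding space. These pairs are illustrative only and are not drawn from the released training data.

# Illustrative (noisy mention, clean concept name) pairs -- not from the actual training set.
pairs = [
    ("pt w/ DM", "Diabetes mellitus"),
    ("c/o SOB", "Dyspnea"),
    ("hx of htn", "Hypertensive disorder"),
]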
Model Architecture
- Backbone: PubMedBERT
- Pooling: Mean pooling (attention-mask aware)
- Output: 768-dim normalized embeddings
- Max sequence length: 32 tokens (optimized for short clinical mentions)
Training Summary
- Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
- Data:
  - SNOMED CT concepts (a subset of key semantic types)
  - synthetic noisy variants (LLM-generated and abbreviation-based)
- Training pairs:
  - clean → clean
  - noisy → clean
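The exact training script is not released with this card. As a rough sketch of the stated objective, training with MultipleNegativesRankingLoss from the sentence-transformers library on (noisy, clean) pairs would look roughly like this; the backbone path, pairs, batch size, and epoch count below are placeholders and assumptions, not the released recipe.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Backbone + attention-mask-aware mean pooling, mirroring the architecture above.
# "path/to/pubmedbert" is a placeholder for a PubMedBERT checkpoint.
backbone = models.Transformer("path/to/pubmedbert", max_seq_length=32)
pooling = models.Pooling(backbone.get_word_embedding_dimension(), pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[backbone, pooling])

# (anchor, positive) = (noisy mention, clean concept name); illustrative pairs only.
train_pairs = [("pt w/ DM", "Diabetes mellitus"), ("c/o SOB", "Dyspnea")]
train_examples = [InputExample(texts=[noisy, clean]) for noisy, clean in train_pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: every other positive in the batch serves as a negative (InfoNCE-style).
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)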
Usage (Recommended)
Use Hugging Face Transformers together with the custom mean pooling shown below.
Encoding Example
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel


class Encoder:
    def __init__(self, model_name, device="cuda", max_length=32):
        self.device = device
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        if device == "cuda":
            self.model = self.model.cuda()
        self.model.eval()

    def encode(self, texts, batch_size=256):
        all_vecs = []
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]
                tokens = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=self.max_length,
                    return_tensors="pt",
                )
                if self.device == "cuda":
                    tokens = {k: v.cuda() for k, v in tokens.items()}
                out = self.model(**tokens)
                hidden = out.last_hidden_state
                # Attention-mask-aware mean pooling over token embeddings
                mask = tokens["attention_mask"].unsqueeze(-1)
                pooled = (hidden * mask).sum(1) / mask.sum(1)
                # IMPORTANT: L2-normalize embeddings for similarity search
                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)
                all_vecs.append(pooled.cpu().numpy())
        return np.vstack(all_vecs).astype("float32")
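For example, assuming the checkpoint is available at a local path or Hub ID (the path below is a placeholder):

encoder = Encoder("path/to/NA-SapBERT")  # placeholder path / Hub ID for this checkpoint
vecs = encoder.encode(["pt w/ DM", "Diabetes mellitus"])
# Because the vectors are L2-normalized, cosine similarity is just a dot product.
similarity = float(vecs[0] @ vecs[1])
print(vecs.shape, similarity)  # (2, 768) and a score in [-1, 1]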
Important Notes
- Mean pooling is required (CLS token is NOT used)
- L2 normalization is critical for similarity search
- Designed for short clinical mentions (max_length=32)
Intended Use
This model is intended for:
- clinical concept normalization pipelines
- dense retrieval over medical ontologies (SNOMED CT, UMLS); see the sketch after this list
- embedding generation for biomedical text
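As an illustration of such a pipeline (not part of this release), the Encoder from the Encoding Example above could feed a FAISS inner-product index built over concept-name embeddings. The concept names, IDs, model path, and top-k value below are assumptions for the sketch, not a shipped component.

import faiss

# Hypothetical concept dictionary; a real pipeline would load names/synonyms from the ontology.
concepts = [("concept_id_1", "Diabetes mellitus"), ("concept_id_2", "Hypertensive disorder")]
names = [name for _, name in concepts]

encoder = Encoder("path/to/NA-SapBERT")   # placeholder path / Hub ID
concept_vecs = encoder.encode(names)      # float32, L2-normalized, shape (n_concepts, 768)

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(concept_vecs.shape[1])
index.add(concept_vecs)

query_vecs = encoder.encode(["pt w/ DM"])  # noisy clinical mention
k = min(5, len(concepts))
scores, ids = index.search(query_vecs, k)
candidates = [concepts[j] for j in ids[0]]  # candidate concepts for downstream reranking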
Not Intended For
- general-purpose sentence similarity
- long document encoding
- non-biomedical domains
Limitations
- Does not encode:
  - negation
  - temporality
  - broader context
- Abbreviations remain ambiguous without external context
- Performance depends on the downstream retrieval pipeline