NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization

NA-SapBERT is a biomedical sentence embedding model designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.

This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:

  • abbreviations (e.g., "NAD", "DM")
  • misspellings
  • shorthand / telegraphic clinical text
  • surface variation in real-world clinical notes

What This Model Is

NA-SapBERT is only an encoder.

It maps input text → 768-dimensional normalized embedding vectors.

It does NOT include:

  • retrieval logic
  • FAISS index
  • exact match
  • rewrite modules
  • reranking

These belong to downstream pipelines.


Key Idea

The model is trained using contrastive learning to align:

  • noisy clinical mentions
  • clean ontology concept names and synonyms

This improves embedding robustness and semantic consistency.


Model Architecture

  • Backbone: PubMedBERT
  • Pooling: Mean pooling (attention-mask aware)
  • Output: 768-dim normalized embeddings
  • Max sequence length: 32 (optimized for short clinical mentions)
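In symbols, the attention-mask-aware mean pooling followed by L2 normalization (implemented in the Usage example below) is:

$$
e = \frac{\sum_{t=1}^{T} m_t h_t}{\sum_{t=1}^{T} m_t},
\qquad
\hat{e} = \frac{e}{\lVert e \rVert_2}
$$

where $h_t$ is the final-layer hidden state of token $t$ and $m_t \in \{0,1\}$ is its attention-mask value.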

Training Summary

  • Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
  • Data:
    • SNOMED CT concepts (subset of key semantic types)
    • synthetic noisy variants (LLM + abbreviation-based)

Training pairs:

  • clean → clean
  • noisy → clean
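The exact training data, hyperparameters, and base checkpoint id are not released here. The sketch below only illustrates how such pairs could be fed to a MultipleNegativesRankingLoss objective using the sentence-transformers library; the pair strings and the PubMedBERT repo id are assumptions, not the actual training setup.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint id; loading a plain BERT model this way attaches mean pooling by default.
model = SentenceTransformer("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
model.max_seq_length = 32

# Illustrative (anchor, positive) pairs; the real pairs come from SNOMED CT plus synthetic noise.
train_examples = [
    InputExample(texts=["dm2", "Diabetes mellitus type 2"]),                      # noisy -> clean
    InputExample(texts=["Diabetes mellitus type 2", "Type 2 diabetes mellitus"]), # clean -> clean
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other positive in the batch serves as a negative for the anchor.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)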

Usage (Recommended)

Use with Hugging Face Transformers + custom pooling.

Encoding Example

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel


class Encoder:
    """Encodes short clinical texts into L2-normalized 768-dim vectors."""

    def __init__(self, model_name, device="cuda", max_length=32):
        self.device = device
        self.max_length = max_length

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.model.eval()

    def encode(self, texts, batch_size=256):
        all_vecs = []

        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]

                tokens = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=self.max_length,
                    return_tensors="pt"
                )
                tokens = {k: v.to(self.device) for k, v in tokens.items()}

                out = self.model(**tokens)

                # Attention-mask-aware mean pooling over token embeddings
                hidden = out.last_hidden_state
                mask = tokens["attention_mask"].unsqueeze(-1)
                pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)

                # IMPORTANT: normalize embeddings so dot product == cosine similarity
                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)

                all_vecs.append(pooled.cpu().numpy())

        return np.vstack(all_vecs).astype("float32")
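Example usage, assuming the checkpoint is published as Tao-AI-Informatics/NA-SapBERT (the mention and concept strings are illustrative):

encoder = Encoder("Tao-AI-Informatics/NA-SapBERT", device="cpu")

mentions = ["sob on exertion", "dm2"]
concepts = ["Dyspnea on exertion", "Diabetes mellitus type 2"]

mention_vecs = encoder.encode(mentions)   # shape (2, 768), float32
concept_vecs = encoder.encode(concepts)   # shape (2, 768), float32

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
similarity = mention_vecs @ concept_vecs.T
print(similarity)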

Important Notes

  • Mean pooling is required (CLS token is NOT used)
  • L2 normalization is critical for similarity search
  • Designed for short clinical mentions (max_length=32)

Intended Use

This model is intended for:

  • clinical concept normalization pipelines
  • dense retrieval over medical ontologies (SNOMED CT, UMLS); a minimal retrieval sketch follows this list
  • embedding generation for biomedical text
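As a sketch of such a pipeline (using the Encoder class from the Usage section), the snippet below indexes a handful of concept names with FAISS. The choice of FAISS and the concept strings are assumptions for illustration only; they are not part of this model.

import faiss

concept_names = ["Diabetes mellitus type 2", "Dyspnea", "Hypertensive disorder"]
concept_vecs = encoder.encode(concept_names)   # L2-normalized, float32

index = faiss.IndexFlatIP(768)                 # inner product == cosine on unit vectors
index.add(concept_vecs)

scores, ids = index.search(encoder.encode(["dm2"]), k=3)
best = [(concept_names[j], float(s)) for j, s in zip(ids[0], scores[0])]
print(best)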

Not Intended For

  • general-purpose sentence similarity
  • long document encoding
  • non-biomedical domains

Limitations

  • Does not encode:
    • negation
    • temporality
    • broader context
  • Abbreviations remain ambiguous without external context
  • Performance depends on downstream retrieval pipeline