NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization

NA-SapBERT is a biomedical sentence embedding model designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.

This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:

  • abbreviations (e.g., "NAD", "DM")
  • misspellings
  • shorthand / telegraphic clinical text
  • surface variation in real-world clinical notes

What This Model Is

NA-SapBERT is only an encoder.

It maps input text → 768-dimensional normalized embedding vectors.

It does NOT include:

  • retrieval logic
  • FAISS index
  • exact match
  • rewrite modules
  • reranking

These belong to downstream pipelines.


Key Idea

The model is trained using contrastive learning to align:

  • noisy clinical mentions
  • clean ontology concept names and synonyms

This improves embedding robustness and semantic consistency.


Model Architecture

  • Backbone: PubMedBERT
  • Pooling: Mean pooling (attention-mask aware)
  • Output: 768-dim normalized embeddings
  • Max sequence length: 32 (optimized for short clinical mentions)
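In symbols, the attention-mask-aware mean pooling followed by L2 normalization (implemented in the Usage example below) is:

$$
e = \frac{\sum_{t=1}^{T} m_t h_t}{\sum_{t=1}^{T} m_t},
\qquad
\hat{e} = \frac{e}{\lVert e \rVert_2}
$$

where $h_t$ is the final-layer hidden state of token $t$ and $m_t \in \{0,1\}$ is its attention-mask value.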

Training Summary

  • Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
  • Data:
    • SNOMED CT concepts (subset of key semantic types)
    • synthetic noisy variants (LLM + abbreviation-based)

Training pairs:

  • clean → clean
  • noisy → clean
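The exact training data, hyperparameters, and base checkpoint id are not released here. The sketch below only illustrates how such pairs could be fed to a MultipleNegativesRankingLoss objective using the sentence-transformers library; the pair strings and the PubMedBERT repo id are assumptions, not the actual training setup.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint id; loading a plain BERT model this way attaches mean pooling by default.
model = SentenceTransformer("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
model.max_seq_length = 32

# Illustrative (anchor, positive) pairs; the real pairs come from SNOMED CT plus synthetic noise.
train_examples = [
    InputExample(texts=["dm2", "Diabetes mellitus type 2"]),                      # noisy -> clean
    InputExample(texts=["Diabetes mellitus type 2", "Type 2 diabetes mellitus"]), # clean -> clean
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other positive in the batch serves as a negative for the anchor.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)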

Usage (Recommended)

Use with Hugging Face Transformers + custom pooling.

Encoding Example

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel


class Encoder:
    """Encodes short clinical texts into L2-normalized 768-dim vectors."""

    def __init__(self, model_name, device="cuda", max_length=32):
        self.device = device
        self.max_length = max_length

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.model.eval()

    def encode(self, texts, batch_size=256):
        all_vecs = []

        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]

                tokens = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=self.max_length,
                    return_tensors="pt"
                )
                tokens = {k: v.to(self.device) for k, v in tokens.items()}

                out = self.model(**tokens)

                # Attention-mask-aware mean pooling over token embeddings
                hidden = out.last_hidden_state
                mask = tokens["attention_mask"].unsqueeze(-1)
                pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)

                # IMPORTANT: normalize embeddings so dot product == cosine similarity
                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)

                all_vecs.append(pooled.cpu().numpy())

        return np.vstack(all_vecs).astype("float32")
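Example usage, assuming the checkpoint is published as Tao-AI-Informatics/NA-SapBERT (the mention and concept strings are illustrative):

encoder = Encoder("Tao-AI-Informatics/NA-SapBERT", device="cpu")

mentions = ["sob on exertion", "dm2"]
concepts = ["Dyspnea on exertion", "Diabetes mellitus type 2"]

mention_vecs = encoder.encode(mentions)   # shape (2, 768), float32
concept_vecs = encoder.encode(concepts)   # shape (2, 768), float32

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
similarity = mention_vecs @ concept_vecs.T
print(similarity)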

Important Notes

  • Mean pooling is required (CLS token is NOT used)
  • L2 normalization is critical for similarity search
  • Designed for short clinical mentions (max_length=32)

Intended Use

This model is intended for:

  • clinical concept normalization pipelines
  • dense retrieval over medical ontologies (SNOMED CT, UMLS); a minimal retrieval sketch follows this list
  • embedding generation for biomedical text
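As a sketch of such a pipeline (using the Encoder class from the Usage section), the snippet below indexes a handful of concept names with FAISS. The choice of FAISS and the concept strings are assumptions for illustration only; they are not part of this model.

import faiss

concept_names = ["Diabetes mellitus type 2", "Dyspnea", "Hypertensive disorder"]
concept_vecs = encoder.encode(concept_names)   # L2-normalized, float32

index = faiss.IndexFlatIP(768)                 # inner product == cosine on unit vectors
index.add(concept_vecs)

scores, ids = index.search(encoder.encode(["dm2"]), k=3)
best = [(concept_names[j], float(s)) for j, s in zip(ids[0], scores[0])]
print(best)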

Not Intended For

  • general-purpose sentence similarity
  • long document encoding
  • non-biomedical domains

Limitations

  • Does not encode:
    • negation
    • temporality
    • broader context
  • Abbreviations remain ambiguous without external context
  • Performance depends on downstream retrieval pipeline