---
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- embeddings
- sentence-transformers
- vllm
- inference-optimized
- inference
license: mit
base_model: cointegrated/rubert-tiny2
---
# rubert-tiny2-vllm

**vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference.

This model produces **numerically equivalent embeddings** to the original (differences on the order of 1e-7 in float32; see Validation Results) while enabling faster inference through vLLM's optimized kernels and batching.
## Modifications

- **No weight changes** - uses the original query/key/value weights directly
- vLLM automatically converts Q/K/V to the fused `qkv_proj` format during loading
- Removed pretraining heads (MLM/NSP) - not needed for embeddings
- Changed architecture to `BertModel` for vLLM compatibility
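
The Q/K/V fusion mentioned above can be illustrated with a toy example. This is a minimal sketch of the general idea, not vLLM's actual loading code: the helper `linear`, the toy hidden size of 4, and the random weights are all illustrative assumptions. It shows that stacking the rows of separate query, key, and value matrices into one matrix, then slicing the output, gives the same projections as applying the three matrices one by one.

```python
import random


def linear(weight, x):
    """Compute W @ x, with W stored as a list of rows (out_features x in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


hidden = 4  # toy size for illustration only
rng = random.Random(0)


def random_matrix(rows, cols):
    """Build a rows x cols matrix of pseudo-random floats in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical stand-ins for one attention layer's separate projection weights
w_q = random_matrix(hidden, hidden)
w_k = random_matrix(hidden, hidden)
w_v = random_matrix(hidden, hidden)
x = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]

# Fused weight: rows of Q, then K, then V, giving a (3 * hidden) x hidden matrix
w_qkv = w_q + w_k + w_v
fused_out = linear(w_qkv, x)

# Slice the fused output back into the three projections
q = fused_out[:hidden]
k = fused_out[hidden:2 * hidden]
v = fused_out[2 * hidden:]

print(q == linear(w_q, x), k == linear(w_k, x), v == linear(w_v, x))
```

In this pure-Python sketch the results match exactly because every output row is computed the same way in both paths. Optimized GPU kernels may accumulate sums in a different order, which is one plausible source of rounding-level differences between backends.
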
## Usage

### vLLM Server

```bash
# IMPORTANT: use float32 so the embeddings match the original model to within rounding error
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
```
### OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm"
)

print(response.data[0].embedding[:5])
```
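
Each embedding returned by the endpoint is a plain list of floats, so sentence similarity can be computed without extra dependencies. The helper below is a minimal sketch; `cosine_similarity` is a name introduced here for illustration, and the toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two returned embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

With the client above, passing a list of strings as `input` returns one entry in `response.data` per string, so `cosine_similarity(response.data[0].embedding, response.data[1].embedding)` would compare the first two inputs.
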
### Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    # Tokenize, then run the model with inputs on the model's device
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Take the [CLS] token's hidden state as the sentence embedding and L2-normalize it
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```
### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)

print(embeddings.shape)
```
## Validation Results

Comparison between vLLM and SentenceTransformers on identical inputs:

```
Max embedding difference: 3.375e-7
Mean embedding difference: 1.136e-7
Cosine similarity matrices: Identical (np.allclose with default tolerances)
```

These differences are on the order of float32 machine epsilon (about 1.19e-7), so the two backends are **numerically equivalent** within float32 precision, though not bit-for-bit identical.
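
For readers who want to run a similar check on their own outputs, the sketch below computes the three reported quantities for two small matrices. It assumes "difference" means the element-wise absolute difference, which this card does not spell out; the linked conversion notebook is the authoritative source. The function name `compare_embeddings` and the sample numbers are made up for illustration. The closeness test uses the same criterion and default tolerances as `np.allclose` (`rtol=1e-5`, `atol=1e-8`).

```python
def compare_embeddings(a, b, rtol=1e-5, atol=1e-8):
    """Return (max_abs_diff, mean_abs_diff, all_close) for two equal-shaped 2D lists."""
    diffs = []
    all_close = True
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            d = abs(x - y)
            diffs.append(d)
            # Same element-wise criterion as np.allclose: |x - y| <= atol + rtol * |y|
            if d > atol + rtol * abs(y):
                all_close = False
    return max(diffs), sum(diffs) / len(diffs), all_close


# Made-up numbers for illustration only, not the model's real outputs
reference = [[0.5, -0.25], [0.125, 0.75]]
candidate = [[0.5 + 2e-7, -0.25], [0.125, 0.75 - 1e-7]]

max_diff, mean_diff, close = compare_embeddings(candidate, reference)
print(f"max={max_diff:.3e} mean={mean_diff:.3e} allclose={close}")
```

Differences of this size pass the default `np.allclose` tolerances easily, which is why the check can succeed even though the values are not bit-identical.
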
## Conversion

Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)

**Conversion process:**

1. Load original `cointegrated/rubert-tiny2` weights
2. Remove `bert.` prefix from weight names
3. Remove unused heads (`cls.*`, `bert.pooler.*`)
4. Keep query/key/value weights as-is (vLLM handles fusion automatically)
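
The renaming and filtering steps above can be sketched on a plain dictionary of weight names. This is not the notebook's code: `convert_state_dict_keys` is a hypothetical helper, the integers stand in for tensors, and the sketch filters on the original names before stripping the prefix (an assumption made so that the `bert.pooler.` pattern still matches).

```python
def convert_state_dict_keys(state_dict):
    """Drop head weights and strip the `bert.` prefix from the remaining names."""
    dropped_prefixes = ("cls.", "bert.pooler.")
    converted = {}
    for name, tensor in state_dict.items():
        # Filter on the original names so the `bert.pooler.` pattern still matches
        if name.startswith(dropped_prefixes):
            continue
        # Query/key/value weights are kept untouched; only the name prefix changes
        new_name = name[len("bert."):] if name.startswith("bert.") else name
        converted[new_name] = tensor
    return converted


# Hypothetical checkpoint: integers stand in for weight tensors
example = {
    "bert.embeddings.word_embeddings.weight": 0,
    "bert.encoder.layer.0.attention.self.query.weight": 1,
    "bert.pooler.dense.weight": 2,
    "cls.predictions.bias": 3,
}

for name in convert_state_dict_keys(example):
    print(name)
```

With a real checkpoint, the same function could be applied to the dictionary returned by a model's `state_dict()` before saving the converted weights.
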

Tested on Google Colab Tesla T4 with:

- vLLM 0.11.2
- Transformers 4.57.2
- PyTorch 2.9.0+cu126

## Original Model

For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)