---
language:
- zh
- en
- de
- fr
license: mit
pipeline_tag: feature-extraction
library_name: transformers
tags:
- embeddings
- lora
- sociology
- retrieval
- feature-extraction
- sentence-transformers
---
# THETA: Textual Hybrid Embedding-based Topic Analysis

## Model Description

THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B). It is designed to generate dense vector representations for texts in the sociology and social science domain.

The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

**Base Models:**
- [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
- [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)

**Fine-tuning Methods:**
- **Unsupervised:** SimCSE (contrastive learning)
- **Supervised:** Label-guided contrastive learning with LoRA
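
To make the unsupervised objective concrete, here is a minimal sketch of the InfoNCE loss that SimCSE-style contrastive learning optimizes, written with plain Python lists in place of tensors. It is an illustration of the general technique, not THETA's training code: the function names and the temperature of 0.05 are assumptions, since this card does not state the hyperparameters used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(views_a, views_b, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    views_a[i] and views_b[i] are two embeddings of the same text (the
    positive pair); every views_b[j] with j != i is an in-batch negative.
    The temperature value is an assumed default, not taken from this card.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(views_a)

# Toy 2-d vectors standing in for real embeddings.
views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives match their anchors
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives are mismatched

loss_aligned = info_nce_loss(views_a, aligned)
loss_swapped = info_nce_loss(views_a, swapped)
```

The loss is small when each anchor is closest to its own positive and large when the pairs are mismatched, which is the signal that pulls two views of the same text together during training.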
## Intended Use

This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations.

It is **not** designed for text generation or decision-making in high-risk scenarios.
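
As a rough illustration of the similarity and retrieval use cases, the sketch below ranks documents by cosine similarity to a query. The three-dimensional vectors are toy stand-ins for real THETA embeddings, and `rank_documents` is a hypothetical helper, not part of any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors; in practice these would come from the embedding model.
query = [0.2, 0.9, 0.1]
documents = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
]

ranking = rank_documents(query, documents, top_k=2)
for index, score in ranking:
    print(f"document {index}: {score:.3f}")
```

For larger collections, the same ranking is usually delegated to a vector index rather than a full scan.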
## Model Architecture

| Component | Detail |
|---|---|
| Base model | Qwen3-Embedding (0.6B / 4B) |
| Fine-tuning | LoRA (Low-Rank Adaptation) |
| Output dimension | 1024 (0.6B) / 2560 (4B) |
| Framework | Transformers (PyTorch) |
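
For readers unfamiliar with LoRA, the general formulation is shown below. This is the standard definition of the method rather than a THETA-specific detail: the rank $r$, the scaling factor $\alpha$, and which layers were adapted are not stated in this card. A frozen pretrained weight $W_0 \in \mathbb{R}^{d \times k}$ is augmented with a trainable low-rank update:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

Only $A$ and $B$ are trained, which is why the adapters can be distributed separately from the base model and loaded on top of it.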
## Repository Structure

```
CodeSoulco/THETA/
├── 0.6B/
│   ├── supervised/
│   └── unsupervised/
├── 4B/
│   ├── supervised/
│   └── unsupervised/
└── logs/
```

Pre-computed embeddings are available in a separate dataset repo: [CodeSoulco/THETA-embeddings](https://huggingface.co/datasets/CodeSoulco/THETA-embeddings)
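
The tree shows two levels, while the usage example in this card loads an adapter from `0.6B/unsupervised/germanCoal`, which suggests one folder per dataset under each training mode. Assuming that `<size>/<mode>/<dataset>` layout holds for every combination (this is not verified here), a small hypothetical helper can build the `subfolder` argument and catch typos early:

```python
VALID_SIZES = ("0.6B", "4B")
VALID_MODES = ("supervised", "unsupervised")

def adapter_subfolder(size, mode, dataset):
    """Build a '<size>/<mode>/<dataset>' path (assumed repository layout)."""
    if size not in VALID_SIZES:
        raise ValueError(f"size must be one of {VALID_SIZES}, got {size!r}")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if not dataset or "/" in dataset:
        raise ValueError("dataset must be a single non-empty folder name")
    return f"{size}/{mode}/{dataset}"

subfolder = adapter_subfolder("0.6B", "unsupervised", "germanCoal")
print(subfolder)
```

Check the repository file listing before relying on a given combination, since not every size, mode, and dataset may have a published adapter.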
## Training Details

- **Fine-tuning method:** LoRA
- **Training domain:** Sociology and social science texts
- **Datasets:** germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- **Objective:** Improve domain-specific semantic representation
- **Hardware:** Dual NVIDIA GPUs
## How to Use

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch
import torch.nn.functional as F

# Load the base model and tokenizer. Left padding keeps the final real token
# at position -1, which is where Qwen3-Embedding pools from.
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True, padding_side="left"
)

# Load the LoRA adapter from the chosen subfolder
model = PeftModel.from_pretrained(
    base_model,
    "CodeSoulco/THETA",
    subfolder="0.6B/unsupervised/germanCoal",
)
model.eval()

# Generate embeddings
text = "Social structure and individual behavior"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Last-token pooling (Qwen3-Embedding is decoder-only and has no CLS token),
# then L2 normalization so dot products equal cosine similarities.
embeddings = F.normalize(outputs.last_hidden_state[:, -1, :], p=2, dim=1)
```
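
Qwen3-Embedding models are decoder-only, and their official usage pools the hidden state of the last non-padding token. When several texts are encoded in one batch, padding changes where that token sits. The sketch below uses plain Python lists in place of tensors to show the index logic for both padding sides; the function names are illustrative and not part of any library.

```python
def last_token_indices(attention_mask):
    """Index of the last real (non-padding) token in each row.

    attention_mask is a list of rows of 0/1 flags, as a tokenizer produces.
    Works for both left-padded and right-padded batches.
    """
    indices = []
    for row in attention_mask:
        if 1 not in row:
            raise ValueError("row contains no real tokens")
        # Scan from the right to find the position of the final 1.
        indices.append(len(row) - 1 - row[::-1].index(1))
    return indices

def last_token_pool(hidden_states, attention_mask):
    """Select each sequence's last-token vector from [batch][seq][dim] lists."""
    picks = last_token_indices(attention_mask)
    return [sequence[i] for sequence, i in zip(hidden_states, picks)]

right_padded = [[1, 1, 1, 0], [1, 1, 1, 1]]
left_padded = [[0, 1, 1, 1], [1, 1, 1, 1]]

# Toy hidden states: batch of 2, sequence length 4, dimension 1.
hidden = [[[0.1], [0.2], [0.3], [0.4]], [[1.0], [2.0], [3.0], [4.0]]]

pooled_right = last_token_pool(hidden, right_padded)
pooled_left = last_token_pool(hidden, left_padded)
```

With left padding the answer is always the final position, which is why setting `padding_side="left"` on the tokenizer lets a simple `[:, -1, :]` slice do the pooling.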
## Limitations

- Fine-tuned for sociology/social science domain; may not generalize well to unrelated topics.
- Performance depends on input text length and quality.
- Does not generate text and should not be used for generative tasks.
## License

This model is released under the **MIT License**.
## Citation

```bibtex
@misc{theta2026,
  title={THETA: Textual Hybrid Embedding--based Topic Analysis},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/CodeSoulco/THETA}
}
```