AI & ML interests
Santali / Adivasi Dataset
🌾 Sohr.ai — Santali AI Platform
Sohr.ai (pronounced Sohr-ai, inspired by the Santali harvest festival Sohrai) is an initiative to build a complete AI platform for the Santali language (ISO 639 code sat, written in the Ol Chiki script).
We focus on models, datasets, and tools that make it easy to build Santali-first applications.
🎯 What We Do
- Develop language models for Santali text generation, chat, and understanding.
- Build embedding models for search, retrieval, and recommendations.
- Provide spellcheck, normalization, and translation tools for real-world Santali content.
- (Planned) Release speech models for automatic speech recognition (ASR) and text-to-speech (TTS).
- Host open datasets and training scripts for transparent, reproducible research.
- Offer a simple API layer and examples so developers can integrate Santali AI quickly.
🌍 Why Santali?
Santali is a major Adivasi language with millions of speakers, but it remains severely underrepresented in modern AI systems.
Our goal is to help make Santali a first-class language in AI by:
- Publishing open models and data.
- Documenting limitations, ethics, and use cases through model and dataset cards.
- Enabling researchers, builders, and communities to collaborate around a shared stack.
📚 What You’ll Find Here
Within the Sohr.ai organization on the Hugging Face Hub:
Models
- Base and instruction-tuned Santali language models
- Embedding models for semantic search
- Task-specific models (spellcheck, classification, translation)
Datasets
- Curated Santali text corpora (Ol Chiki and Roman)
- Parallel English ↔ Santali resources
- Domain-specific subsets (education, government, jobs, culture)
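Because the corpora mix Ol Chiki and Roman text, a common first step when working with them is to tag each line by script. The following is a minimal sketch, not part of any Sohr.ai tooling: the function name `classify_script` and its labels are hypothetical, and it relies only on the fact that the Ol Chiki Unicode block spans U+1C50 to U+1C7F.

```python
import unicodedata

# Ol Chiki Unicode block (digits, letters, and punctuation).
OL_CHIKI_START = 0x1C50
OL_CHIKI_END = 0x1C7F


def classify_script(text: str) -> str:
    """Label text as 'ol_chiki', 'roman', 'mixed', or 'other'.

    Hypothetical helper for illustration only. Roman letters are detected
    by Unicode name so that letters with diacritics still count as Latin.
    """
    ol_chiki = 0
    roman = 0
    for ch in text:
        if OL_CHIKI_START <= ord(ch) <= OL_CHIKI_END:
            ol_chiki += 1
        elif ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN"):
            roman += 1
    if ol_chiki and roman:
        return "mixed"
    if ol_chiki:
        return "ol_chiki"
    if roman:
        return "roman"
    return "other"


print(classify_script("ᱥᱟᱱᱛᱟᱲᱤ"))          # ol_chiki
print(classify_script("Sohrai"))           # roman
print(classify_script("ᱥᱟᱱᱛᱟᱲᱤ Santali"))  # mixed
```

A per-line label like this makes it straightforward to split a corpus into script-specific subsets or to flag mixed-script lines for manual review; real pipelines would likely need finer rules for digits, punctuation, and other Indic scripts.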
Spaces
- Demos for proofreading, translation, and chat
- Tools showcasing how to use our models via API / SDKs
Each model and dataset includes a detailed model card or dataset card describing training data, intended use, limitations, and ethical considerations.
🤝 Collaborate With Us
We welcome:
- Contributions of new or cleaned Santali data (with clear licensing).
- Improvements to tokenization, training, and evaluation.
- Demo apps and Spaces built on top of Sohr.ai models.
- Research collaborations around low-resource and Adivasi languages.
If you are interested in contributing, please open an Issue or Discussion on the relevant repository, or contact the maintainer listed in the model/dataset cards.
📩 Contact
- Organization: Sohr.ai — Santali AI Platform
- Focus: Santali (sat, Ol Chiki) models, datasets, and tools
- GitHub: github.com/sohr-ai
🌾 From Sohrai walls to Sohr.ai models — giving Santali a visible place in the AI ecosystem.