SohrAI

non-profit

AI & ML interests

Santali / Adivasi Dataset

Organization Card

🌾 Sohr.ai — Santali AI Platform

Sohr.ai (pronounced "Sohr-ai", inspired by the Santali harvest festival Sohrai) is an initiative to build a complete AI platform for the Santali language (ISO 639 code sat, written in the Ol Chiki script).

We focus on models, datasets, and tools that make it easy to build Santali-first applications.


🎯 What We Do

  • Develop language models for Santali text generation, chat, and understanding.
  • Build embedding models for search, retrieval, and recommendations.
  • Provide spellcheck, normalization, and translation tools for real-world Santali content.
  • (Planned) Release speech models for automatic speech recognition (ASR) and text-to-speech (TTS).
  • Host open datasets and training scripts for transparent, reproducible research.
  • Offer a simple API layer and examples so developers can integrate Santali AI quickly.
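
As a small illustration of the normalization work listed above, the sketch below shows one way to clean a line of text and guess whether it is written in Ol Chiki or Roman script. It is not Sohr.ai code: the function names, the labels, and the 0.5 threshold are hypothetical choices made for this example. The only external fact it relies on is that the Ol Chiki Unicode block spans U+1C50 to U+1C7F.

```python
import unicodedata

# Ol Chiki Unicode block (U+1C50 to U+1C7F).
OL_CHIKI_START = 0x1C50
OL_CHIKI_END = 0x1C7F


def is_ol_chiki(ch: str) -> bool:
    """Return True if the character lies in the Ol Chiki block."""
    return OL_CHIKI_START <= ord(ch) <= OL_CHIKI_END


def is_latin(ch: str) -> bool:
    """Return True for Latin letters, including precomposed accented ones."""
    return unicodedata.name(ch, "").startswith("LATIN")


def normalize_text(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def detect_script(text: str, threshold: float = 0.5) -> str:
    """Guess the dominant script (hypothetical labels, assumed threshold)."""
    letters = [ch for ch in normalize_text(text) if ch.isalpha()]
    if not letters:
        return "unknown"
    ol_share = sum(is_ol_chiki(ch) for ch in letters) / len(letters)
    latin_share = sum(is_latin(ch) for ch in letters) / len(letters)
    if ol_share >= threshold:
        return "ol_chiki"
    if latin_share >= threshold:
        return "roman"
    return "other"


if __name__ == "__main__":
    # "Santali" written in Ol Chiki, given as escapes to avoid encoding issues.
    sample = "\u1c65\u1c5f\u1c71\u1c5b\u1c5f\u1c72\u1c64"
    print(detect_script(sample))
    print(detect_script("Santali"))
```

A check like this could be used to route a mixed corpus into Ol Chiki and Roman subsets before further cleaning; a real pipeline would likely need finer handling of mixed-script lines, digits, and punctuation.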

🌍 Why Santali?

Santali is a major Adivasi language with millions of speakers, but it remains severely underrepresented in modern AI systems.
Our goal is to help make Santali a first-class language in AI by:

  • Publishing open models and data.
  • Documenting limitations, ethics, and use cases through model and dataset cards.
  • Enabling researchers, builders, and communities to collaborate around a shared stack.

📚 What You’ll Find Here

Within the Sohr.ai organization on the Hub:

  • Models

    • Base and instruction-tuned Santali language models
    • Embedding models for semantic search
    • Task-specific models (spellcheck, classification, translation)
  • Datasets

    • Curated Santali text corpora (Ol Chiki and Roman)
    • Parallel English ↔ Santali resources
    • Domain-specific subsets (education, government, jobs, culture)
  • Spaces

    • Demos for proofreading, translation, and chat
    • Tools showcasing how to use our models via API / SDKs

Each model and dataset includes a detailed model card or dataset card describing its training or source data, intended use, limitations, and ethical considerations.


🤝 Collaborate With Us

We welcome:

  • Contributions of new or cleaned Santali data (with clear licensing).
  • Improvements to tokenization, training, and evaluation.
  • Demo apps and Spaces built on top of Sohr.ai models.
  • Research collaborations around low-resource and Adivasi languages.

If you are interested in contributing, please open an Issue or Discussion on the relevant repository, or contact the maintainer listed in the model/dataset cards.


📩 Contact

  • Organization: Sohr.ai — Santali AI Platform
  • Focus: Santali (sat, Ol Chiki) models, datasets, and tools
  • GitHub: github.com/sohr-ai

🌾 From Sohrai walls to Sohr.ai models — giving Santali a visible place in the AI ecosystem.