Language Detection
A lightweight language detection tool that uses character-level n-gram features and logistic regression to identify the language of a given text.
Supported languages out of the box: English, French, German, Turkish.
Model repository: https://huggingface.co/Isa0/language-detection/
Installation
Requires Python 3.11 or higher. Install dependencies with uv:
uv sync
Usage
Train
Train the model on the datasets in the datasets/ directory:
uv run main.py --train
You can point it to a different directory with --dir:
uv run main.py --train --dir path/to/datasets
Each .txt file in the directory should contain one sentence per line. The filename (without extension) is used as the language label.
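The file layout described above can be read with a few lines of Python. This is a minimal sketch, not the project's actual loader: the function name load_datasets is hypothetical, and skipping blank lines and reading files as UTF-8 are assumptions.

```python
from pathlib import Path


def load_datasets(directory: str) -> tuple[list[str], list[str]]:
    """Read every .txt file in `directory` and return (sentences, labels).

    Hypothetical helper for illustration; the project's loader may differ.
    """
    sentences: list[str] = []
    labels: list[str] = []
    # Sort the paths so the result order does not depend on the filesystem.
    for path in sorted(Path(directory).glob("*.txt")):
        label = path.stem  # filename without extension, e.g. "french"
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:  # assumption: blank lines carry no sentence
                sentences.append(line)
                labels.append(label)
    return sentences, labels
```

Calling load_datasets("datasets") would return two parallel lists, one sentence and one label per non-empty line.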
Detect
Detect the language of a text string:
uv run main.py --detect "Bonjour, comment allez-vous?"
Output includes the predicted language and a confidence score.
Adding Languages
Add a new .txt file to the datasets/ directory named after the language (e.g. spanish.txt), with one sentence per line, then retrain.
How It Works
Text is converted into counts of character-level n-grams (sequences of 1 to 3 characters), which capture language-specific patterns such as accented characters, letter combinations, and suffixes. A logistic regression classifier is trained on these features and saved to disk, so detection can reuse the trained model.
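To make the feature step concrete, here is a minimal sketch of character n-gram counting using only the standard library. It is an illustration, not the project's code: the function name char_ngram_counts is hypothetical, and details such as lowercasing or word-boundary handling are not stated in this README, so the sketch leaves the text untouched.

```python
from collections import Counter


def char_ngram_counts(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count every character n-gram of length n_min..n_max in `text`.

    Hypothetical helper for illustration only.
    """
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


counts = char_ngram_counts("allez")
print(counts["l"])   # the single character "l" occurs twice
print(counts["ll"])  # the bigram "ll" occurs once
```

Each distinct n-gram becomes one feature, and its count is the feature value passed to the classifier.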