metadata
language:
  - id
license: mit
tags:
  - spacy
  - token-classification
  - named-entity-recognition
  - indonesian
  - ner
datasets:
  - grit-id/id_nergrit_corpus
metrics:
  - precision
  - recall
  - f1
model-index:
  - name: id_nergrit_indonesian_spacy
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: Nergrit Corpus
          type: grit-id/id_nergrit_corpus
        metrics:
          - type: f1
            value: 0.7484
            name: F1 Score
          - type: precision
            value: 0.7748
            name: Precision
          - type: recall
            value: 0.7237
            name: Recall
widget:
  - text: Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023.
  - text: Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.
  - text: >-
      Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000
      triliun.

Indonesian Named Entity Recognition Model

This is a spaCy model trained on the Nergrit Corpus for Indonesian Named Entity Recognition.

Model Description

This model recognizes 19 entity types in Indonesian text:

  • PER (Person): Names of people
  • ORG (Organization): Companies, institutions
  • GPE (Geopolitical Entity): Countries, cities, states
  • LOC (Location): Non-GPE locations, facilities
  • DAT (Date): Absolute or relative dates
  • MON (Money): Monetary values
  • PRC (Percent): Percentages
  • TIM (Time): Times of day
  • QTY (Quantity): Measurements and quantities
  • CRD (Cardinal): Cardinal numbers
  • ORD (Ordinal): Ordinal numbers
  • EVT (Event): Named events
  • FAC (Facility): Buildings, airports, stations
  • LAW (Law): Legal documents, laws
  • LAN (Language): Named languages
  • NOR (Political Organization): Political entities
  • PRD (Product): Products, brands
  • REG (Religion): Religious groups
  • WOA (Work of Art): Titles of books, songs, etc.

Performance

| Metric    | Score  |
|-----------|--------|
| F1 Score  | 74.84% |
| Precision | 77.48% |
| Recall    | 72.37% |
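
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below assumes the three scores are aggregated the same way (for example, all micro-averaged over entities), which the card does not state explicitly.

```python
# Overall scores as reported in this model card.
precision = 0.7748
recall = 0.7237

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")  # F1 = 0.7484
```

The recomputed value agrees with the 74.84% listed above.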

Top Performing Entities

| Entity         | F1 Score |
|----------------|----------|
| PRC (Percent)  | 93.72%   |
| MON (Money)    | 92.56%   |
| DAT (Date)     | 92.41%   |
| TIM (Time)     | 88.51%   |
| CRD (Cardinal) | 86.23%   |

Usage

Installation

pip install spacy
pip install https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy/resolve/main/id_nergrit_indonesian_spacy-1.0.0-py3-none-any.whl

Basic Usage

import spacy

# Load the model
nlp = spacy.load("id_nergrit_indonesian_spacy")

# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")

Output:

Joko Widodo -> PER
Jakarta -> GPE
17 Agustus 2023 -> DAT
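
Downstream code often wants entities grouped by type rather than printed one per line. The following is a minimal sketch of such a helper; `group_by_label` is a hypothetical name, not part of spaCy or this model. It works on plain `(text, label)` pairs so it can be tried without the model installed.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into {label: [text, ...]}, keeping input order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the sample output above. With the model loaded they
# would come from [(ent.text, ent.label_) for ent in doc.ents].
sample = [
    ("Joko Widodo", "PER"),
    ("Jakarta", "GPE"),
    ("17 Agustus 2023", "DAT"),
]

grouped = group_by_label(sample)
print(grouped)
```

Keeping the helper independent of spaCy objects also makes it easy to unit-test and to reuse on entities that have already been serialized.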

Batch Processing

import spacy

nlp = spacy.load("id_nergrit_indonesian_spacy")

texts = [
    "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.",
    "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
]

for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])

Using with Hugging Face Hub

import spacy

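# spacy.load() reads the locally installed package, not the Hub itself;
# install the wheel from Hugging Face first (see Installation)
nlp = spacy.load("id_nergrit_indonesian_spacy")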
doc = nlp("Universitas Indonesia terletak di Depok, Jawa Barat.")

for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

Training Data

The model was trained on the Nergrit Corpus dataset:

  • Training examples: 12,532
  • Validation examples: 2,521
  • Test examples: 2,399
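
For reference, the split sizes above can be turned into proportions with a few lines of arithmetic. This is only a worked calculation on the counts listed in this card.

```python
# Example counts as listed in this model card.
splits = {"train": 12_532, "validation": 2_521, "test": 2_399}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(f"total: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The split works out to roughly 72% training, 14% validation, and 14% test.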

Dataset source: grit-id/id_nergrit_corpus

Training Procedure

Model Architecture

  • Base: spaCy's Tok2Vec + Transition-based NER
  • Embedding: MultiHashEmbed with 96-dimensional vectors
  • Encoder: MaxoutWindowEncoder (depth=4, window=1)
  • Parser: Transition-based with 64 hidden units

Training Configuration

  • Optimizer: Adam (lr=0.001)
  • Batch size: Dynamic (100-1000 words)
  • Max steps: 20,000
  • Dropout: 0.1
  • Evaluation frequency: Every 200 steps
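
The settings above can be written as dotted config keys in the style of a spaCy v3 training config. This is a sketch under assumptions: the author's actual config file is not shown here, and the `training.batcher.size.start` and `training.batcher.size.stop` keys assume that "Dynamic (100-1000 words)" refers to a word-based batcher with a growing size schedule.

```python
# Dotted keys follow spaCy v3 config naming. This layout is an assumption,
# not the author's actual config file.
overrides = {
    "training.max_steps": 20000,
    "training.dropout": 0.1,
    "training.eval_frequency": 200,
    "training.optimizer.learn_rate": 0.001,
    "training.batcher.size.start": 100,   # assumed: smallest batch, in words
    "training.batcher.size.stop": 1000,   # assumed: largest batch, in words
}

# Upper bound on the number of evaluation passes during training.
eval_passes = overrides["training.max_steps"] // overrides["training.eval_frequency"]
print(eval_passes)  # 100
```

With evaluation every 200 steps and a cap of 20,000 steps, training runs at most 100 evaluation passes; it may run fewer if an early-stopping criterion ends training sooner.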

Limitations

  • The model performs best on formal Indonesian text
  • Some entity types (WOA, FAC, PRD) have lower performance due to limited training data
  • May not generalize well to informal/colloquial Indonesian or social media text
  • Performance may vary on domain-specific texts

Citation

If you use this model, please cite:

@misc{id_nergrit_indonesian_spacy,
  author = {nerdv2},
  title = {Indonesian Named Entity Recognition Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy}
}

Acknowledgments

License

MIT License

Contact

For questions or issues, please open an issue on the Hugging Face model page.