# Korean Language Model
A ~100M parameter Korean language model trained from scratch on Korean Wikipedia.
## Quick Start
```python
from transformers import AutoModelForCausalLM
from huggingface_hub import hf_hub_download
import sentencepiece as spm
import torch

# Load model (trust_remote_code is required for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "shopkeeper/korean-wiki-120125",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to("cuda")

# Load the SentencePiece tokenizer
tokenizer_path = hf_hub_download(
    repo_id="shopkeeper/korean-wiki-120125",
    filename="ko_wiki_tokenizer.model",
)
tokenizer = spm.SentencePieceProcessor()
tokenizer.load(tokenizer_path)

# Generate text
prompt = "한국의 역사는"
input_ids = torch.tensor([[tokenizer.bos_id()] + tokenizer.encode(prompt)]).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_id(),
    )

print(tokenizer.decode(output[0].tolist()))
```
Example output:
ํ๊ตญ์ ์ญ์ฌ๋ ์ผ๊ตญ์๋๋ก๋ถํฐ ์์ํ๋ค. ๊ณ ๊ตฌ๋ ค์ ๋ฐฑ์ , ์ ๋ผ์ ์ธ ์๊ตญ, ์ ๋ผ์ ์ ๋ผ๋ ์ผ๊ตญํต์ผ์ ์ด๋ฃฉํ์๋ค. ์ดํ ์ ๋ผ๋ ๊ณ ๊ตฌ๋ ค์ ๊ณ์น๊ตญ๊ฐ์ธ ๊ณ ๊ตฌ๋ ค๋ฅผ ๊ณ์นํ์ฌ ์ผ๊ตญํต์ผ์ ๋ฌ์ฑํ์๋ค. ๋จ๋ถ๊ตญ ์๋๊ฐ ๋์ ๋จ๋ถ๊ตญ ์๋๋ฅผ ๊ฑฐ์น๋ฉฐ, ์ผ๊ตญ ์๋๊ฐ ๋ง๋ฌด๋ฆฌ ๋์๋ค. ์ ๋ผ์ ์ผ๊ตญํต์ผ์ ์ ๋ผ๋ก๋ถํฐ ์ ๋ผ๋ก ๋์ด๊ฐ๋ ๊ณผ์ ์ ํตํด ์ด๋ฃจ์ด์ ธ์ผ ํ ์ ์น์ , ๊ฒฝ์ ์ , ์ฌํ์ , ๋ฌธํ์ , ์ ์น์ , ๋ฌธํ์ , ๋ฌธํ์ ๋ณํ๋ฅผ ๊ฒช๊ฒ ๋์๋ค. ์ ๋ผ์ ์ผ๊ตญํต์ผ์ ์ผ๊ตญ ํต์ผ์ ์ฒซ ๋ฒ์งธ ๋จ๊ณ์ด๋ค. ์ ๋ผ. ์
## Model Details
| Property | Value |
|---|---|
| Parameters | ~100M |
| Hidden dimension | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Max sequence length | 1024 |
| Vocabulary size | 32000 |
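
For orientation, these hyperparameters correspond roughly to a configuration like the sketch below. The field names are illustrative assumptions, not the exact keys used in this repository's `config.json`.

```python
from dataclasses import dataclass

@dataclass
class KoreanWikiConfig:
    """Hypothetical config mirroring the table above; field names are assumptions."""
    hidden_size: int = 768    # hidden dimension
    num_layers: int = 12      # transformer blocks
    num_heads: int = 12       # attention heads (head dim = 768 // 12 = 64)
    max_seq_len: int = 1024   # maximum sequence length
    vocab_size: int = 32000   # SentencePiece vocabulary size
```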
### Architecture Features
- RMSNorm (pre-normalization; sketched together with SwiGLU after this list)
- Rotary Position Embeddings (RoPE)
- SwiGLU activation
- Multi-Head Attention with KV Cache
- No bias terms
- Weight tying (embedding & output)
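
The model's actual implementation ships with the repository and is loaded via `trust_remote_code`. As a rough, stand-alone illustration of two of the features above (not the model's source code), RMSNorm and a bias-free SwiGLU feed-forward block can be written in PyTorch like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: W2(SiLU(W1 x) * W3 x), with no bias terms."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```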
## Training
- Data: Korean Wikipedia (120125 dump)
- Epochs: 3
- Precision: bfloat16
- Optimizer: AdamW with a cosine learning-rate schedule (sketched after this list)
- Tokenizer: SentencePiece (32k vocab)
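
A minimal sketch of that optimizer setup is shown below; the learning rate, betas, weight decay, and step count are assumed placeholder values, not the actual training configuration, and `model` refers to the model loaded in the Quick Start section.

```python
import torch

# Placeholder: roughly (batches per epoch) * 3 epochs; not the real number.
num_training_steps = 10_000

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # assumed value, not stated on this card
    betas=(0.9, 0.95),  # assumed value
    weight_decay=0.1,   # assumed value
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_training_steps
)
# Call scheduler.step() once per optimizer step so the learning rate
# follows a cosine decay over the full training run.
```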
## License
MIT