# Korean Language Model
A ~100M parameter Korean language model trained from scratch on Korean Wikipedia.
## Quick Start
```python
from transformers import AutoModelForCausalLM
from huggingface_hub import hf_hub_download
import sentencepiece as spm
import torch

# Load model (trust_remote_code is required for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "shopkeeper/korean-wiki-120125",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to("cuda")

# Load the SentencePiece tokenizer
tokenizer_path = hf_hub_download(
    repo_id="shopkeeper/korean-wiki-120125",
    filename="ko_wiki_tokenizer.model",
)
tokenizer = spm.SentencePieceProcessor()
tokenizer.load(tokenizer_path)

# Generate text
prompt = "한국의 역사는"
input_ids = torch.tensor([[tokenizer.bos_id()] + tokenizer.encode(prompt)]).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_id(),
    )

print(tokenizer.decode(output[0].tolist()))
```
Example output:
ํ๊ตญ์ ์ญ์ฌ๋ ์ผ๊ตญ์๋๋ก๋ถํฐ ์์ํ๋ค. ๊ณ ๊ตฌ๋ ค์ ๋ฐฑ์ , ์ ๋ผ์ ์ธ ์๊ตญ, ์ ๋ผ์ ์ ๋ผ๋ ์ผ๊ตญํต์ผ์ ์ด๋ฃฉํ์๋ค. ์ดํ ์ ๋ผ๋ ๊ณ ๊ตฌ๋ ค์ ๊ณ์น๊ตญ๊ฐ์ธ ๊ณ ๊ตฌ๋ ค๋ฅผ ๊ณ์นํ์ฌ ์ผ๊ตญํต์ผ์ ๋ฌ์ฑํ์๋ค. ๋จ๋ถ๊ตญ ์๋๊ฐ ๋์ ๋จ๋ถ๊ตญ ์๋๋ฅผ ๊ฑฐ์น๋ฉฐ, ์ผ๊ตญ ์๋๊ฐ ๋ง๋ฌด๋ฆฌ ๋์๋ค. ์ ๋ผ์ ์ผ๊ตญํต์ผ์ ์ ๋ผ๋ก๋ถํฐ ์ ๋ผ๋ก ๋์ด๊ฐ๋ ๊ณผ์ ์ ํตํด ์ด๋ฃจ์ด์ ธ์ผ ํ ์ ์น์ , ๊ฒฝ์ ์ , ์ฌํ์ , ๋ฌธํ์ , ์ ์น์ , ๋ฌธํ์ , ๋ฌธํ์ ๋ณํ๋ฅผ ๊ฒช๊ฒ ๋์๋ค. ์ ๋ผ์ ์ผ๊ตญํต์ผ์ ์ผ๊ตญ ํต์ผ์ ์ฒซ ๋ฒ์งธ ๋จ๊ณ์ด๋ค. ์ ๋ผ. ์
## Model Details
| Property | Value |
|---|---|
| Parameters | ~100M |
| Hidden dimension | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Max sequence length | 1024 |
| Vocabulary size | 32000 |
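
For orientation, these hyperparameters correspond roughly to a configuration like the sketch below. The field names are illustrative assumptions, not the exact keys used in this repository's `config.json`.

```python
from dataclasses import dataclass

@dataclass
class KoreanWikiConfig:
    """Hypothetical config mirroring the table above; field names are assumptions."""
    hidden_size: int = 768    # hidden dimension
    num_layers: int = 12      # transformer blocks
    num_heads: int = 12       # attention heads (head dim = 768 // 12 = 64)
    max_seq_len: int = 1024   # maximum sequence length
    vocab_size: int = 32000   # SentencePiece vocabulary size
```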
### Architecture Features
- RMSNorm (pre-normalization; sketched together with SwiGLU after this list)
- Rotary Position Embeddings (RoPE)
- SwiGLU activation
- Multi-Head Attention with KV Cache
- No bias terms
- Weight tying (embedding & output)
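
The model's actual implementation ships with the repository and is loaded via `trust_remote_code`. As a rough, stand-alone illustration of two of the features above (not the model's source code), RMSNorm and a bias-free SwiGLU feed-forward block can be written in PyTorch like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: W2(SiLU(W1 x) * W3 x), with no bias terms."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```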
## Training
- Data: Korean Wikipedia (120125 dump)
- Epochs: 3
- Precision: bfloat16
- Optimizer: AdamW with a cosine learning-rate schedule (sketched after this list)
- Tokenizer: SentencePiece (32k vocab)
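
A minimal sketch of that optimizer setup is shown below; the learning rate, betas, weight decay, and step count are assumed placeholder values, not the actual training configuration, and `model` refers to the model loaded in the Quick Start section.

```python
import torch

# Placeholder: roughly (batches per epoch) * 3 epochs; not the real number.
num_training_steps = 10_000

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # assumed value, not stated on this card
    betas=(0.9, 0.95),  # assumed value
    weight_decay=0.1,   # assumed value
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_training_steps
)
# Call scheduler.step() once per optimizer step so the learning rate
# follows a cosine decay over the full training run.
```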
## License
MIT