synthocr-gen: A synthetic OCR dataset generator for low-resource languages - breaking the data barrier Paper • 2601.16113 • Published 12 days ago
ks-lit-3m: A 3.1 million word Kashmiri text dataset for large language model pretraining Paper • 2601.01091 • Published Jan 3
600k-ks-ocr: A large-scale synthetic dataset for optical character recognition in Kashmiri script Paper • 2601.01088 • Published Jan 3