Evaluation Results for PleIAs/Monad & Suggested Model Card Update

#5
by GODELEV - opened

TO: PleIAs Team

DATE: June 2, 2026

SUBJECT: Evaluation Results for PleIAs/Monad & Suggested Model Card Update

First of all, I wanted to reach out and congratulate you on building such a fabulous model! The work you've done with PleIAs/Monad is truly impressive, and it's exciting to see its capabilities in action.

I recently ran an evaluation suite on the model and wanted to share the results directly with you. They provide baseline numbers across a diverse set of language and reasoning benchmarks, with the strongest results on BLiMP and BoolQ.

Below is the structured breakdown of the evaluation metrics extracted from my testing logs.

Benchmark Evaluation Results

| Benchmark | Task | Metric | Score / Value | Few-Shot | Status |
| --- | --- | --- | --- | --- | --- |
| BLiMP | Linguistics / Grammar | Accuracy | 70.46% | 0-shot | Success |
| BoolQ | Reading Comprehension | Accuracy | 61.25% | 0-shot | Success |
| PIQA | Physical Commonsense | Normalized Accuracy | 54.79% | 0-shot | Success |
| WinoGrande | Coreference Resolution | Accuracy | 52.25% | 5-shot | Success |
| ARC-Easy | General Science QA | Normalized Accuracy | 44.40% | 25-shot | Success |
| HellaSwag | Commonsense Reasoning | Normalized Accuracy | 30.20% | 10-shot | Success |
| OpenBookQA | Scientific QA | Normalized Accuracy | 28.40% | 0-shot | Success |
| ARC-Challenge | Hard Science QA | Normalized Accuracy | 25.09% | 25-shot | Success |
| MMLU | Multi-task Knowledge | Accuracy | 24.85% | 5-shot | Success |
| CommonsenseQA | Commonsense Reasoning | Accuracy | 19.82% | 7-shot | Success |
| LAMBADA | Language Modeling | Accuracy | 19.39% | 0-shot | Success |
| WikiText-2 | Language Modeling | Word Perplexity | 57.59 | 0-shot | Success |
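
One way to put the accuracy scores in context is to compare each against the accuracy a random guesser would get. The sketch below does that subtraction. The chance levels are assumptions based on the usual number of answer options per benchmark (two for BLiMP, BoolQ, PIQA and WinoGrande, four for ARC, HellaSwag, OpenBookQA and MMLU, five for CommonsenseQA); they are not part of my logs. LAMBADA and WikiText-2 are left out because they have no fixed chance level.

```python
# Scores are copied from the results table. Chance levels are ASSUMPTIONS
# derived from the typical number of answer options per benchmark.
RESULTS = [
    # (benchmark, score in %, assumed chance level in %)
    ("BLiMP", 70.46, 50.0),
    ("BoolQ", 61.25, 50.0),
    ("PIQA", 54.79, 50.0),
    ("WinoGrande", 52.25, 50.0),
    ("ARC-Easy", 44.40, 25.0),
    ("HellaSwag", 30.20, 25.0),
    ("OpenBookQA", 28.40, 25.0),
    ("ARC-Challenge", 25.09, 25.0),
    ("MMLU", 24.85, 25.0),
    ("CommonsenseQA", 19.82, 20.0),
]


def margin_over_chance(results):
    """Return {benchmark: score minus assumed chance level} in percentage points."""
    return {name: round(score - chance, 2) for name, score, chance in results}


MARGINS = margin_over_chance(RESULTS)

if __name__ == "__main__":
    # Print benchmarks from the largest margin to the smallest.
    for name, margin in sorted(MARGINS.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<14} {margin:+6.2f} pp vs. assumed chance")
```

Under these assumptions the largest margins are on BLiMP and ARC-Easy, while ARC-Challenge, MMLU and CommonsenseQA sit within a fraction of a point of the assumed chance level.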

Note on Failed Tasks: The arithmetic benchmark failed during this run because of a script compatibility error ("Dataset scripts are no longer supported, but found arithmetic.py"), so no score was recorded for it.
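
A likely cause, stated here as an assumption rather than something confirmed in my logs, is that the installed Hugging Face `datasets` package is version 4.0 or later, where loading datasets through a Python script was dropped. The sketch below checks the installed version; the helper names `supports_dataset_scripts` and `installed_datasets_version` are made up for illustration.

```python
from importlib import metadata


def supports_dataset_scripts(version: str) -> bool:
    """ASSUMPTION: script-based datasets load only on `datasets` major versions below 4."""
    major = int(version.split(".")[0])
    return major < 4


def installed_datasets_version():
    """Return the installed `datasets` version string, or None if it is not installed."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_datasets_version()
    if version is None:
        print("The `datasets` package is not installed in this environment.")
    elif supports_dataset_scripts(version):
        print(f"datasets {version}: script-based tasks should still load.")
    else:
        print(f"datasets {version}: script-based tasks such as arithmetic.py will likely fail.")
```

If the check reports a 4.x version, rerunning the arithmetic task in an environment with an older `datasets` release may recover the missing score.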


Recommendation

Given how thoroughly this maps out the model's performance, particularly its standout accuracy on BLiMP (70.46%) and BoolQ (61.25%), I highly recommend adding these evaluation results to your official model card.

Publishing these benchmarks openly will give the open-source community a much clearer picture of PleIAs/Monad’s strengths, especially in grammatical knowledge and reading comprehension.
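
To make the update easy to paste in, here is a minimal sketch that renders the results as a Markdown table for the model card. The `## Evaluation` heading and the column choice are only suggestions, and the names `ROWS` and `to_markdown_table` are hypothetical.

```python
# Rows copied from the results table: (benchmark, metric, score, few-shot count).
ROWS = [
    ("BLiMP", "Accuracy", "70.46%", 0),
    ("BoolQ", "Accuracy", "61.25%", 0),
    ("PIQA", "Normalized Accuracy", "54.79%", 0),
    ("WinoGrande", "Accuracy", "52.25%", 5),
    ("ARC-Easy", "Normalized Accuracy", "44.40%", 25),
    ("HellaSwag", "Normalized Accuracy", "30.20%", 10),
    ("OpenBookQA", "Normalized Accuracy", "28.40%", 0),
    ("ARC-Challenge", "Normalized Accuracy", "25.09%", 25),
    ("MMLU", "Accuracy", "24.85%", 5),
    ("CommonsenseQA", "Accuracy", "19.82%", 7),
    ("LAMBADA", "Accuracy", "19.39%", 0),
    ("WikiText-2", "Word Perplexity", "57.59", 0),
]


def to_markdown_table(rows):
    """Render the rows as a Markdown pipe table, one line per benchmark."""
    lines = [
        "| Benchmark | Metric | Score | Few-Shot |",
        "| --- | --- | --- | --- |",
    ]
    for name, metric, score, shots in rows:
        lines.append(f"| {name} | {metric} | {score} | {shots}-shot |")
    return "\n".join(lines)


# Suggested section text; the heading name is an assumption, not an existing card section.
MODEL_CARD_SECTION = "## Evaluation\n\n" + to_markdown_table(ROWS) + "\n"

if __name__ == "__main__":
    print(MODEL_CARD_SECTION)
```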

Kudos again on a fantastic model release! I look forward to seeing how PleIAs continues to evolve.

Best regards,

Akshit
