ginipick
701 followers · 156 following
AI & ML interests
None yet
Recent Activity
liked a Space about 9 hours ago: ginigen-ai/site-agent
reacted to SeaWolf-AI's post 23 days ago:
FINAL Bench Released: The Real Bottleneck to AGI Is Self-Correction

We release FINAL Bench, the first benchmark for measuring functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Every existing benchmark measures final-answer accuracy. None measures whether AI knows it is wrong.

Dataset: [FINAL-Bench/Metacognitive](https://huggingface.co/datasets/FINAL-Bench/Metacognitive) | 100 Tasks | 15 Domains | 8 TICOS Types | Apache 2.0
Leaderboard: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/metacognitive

Core Innovation

Our 5-axis rubric separates what no prior benchmark could: MA (Metacognitive Accuracy), the ability to say "I might be wrong", and ER (Error Recovery), the ability to actually fix it. This maps directly to the monitoring-control model of Nelson & Narens (1990) in cognitive psychology.

Three Findings Across 9 SOTA Models

We evaluated GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and others across 100 expert-level tasks:

1. ER Dominance. 94.8% of the MetaCog gain comes from Error Recovery alone. The bottleneck to AGI is not knowledge or reasoning; it is self-correction.
2. Declarative-Procedural Gap. All 9 models can verbalize uncertainty (MA = 0.694) but cannot act on it (ER = 0.302). They sound humble but fail to self-correct, the most dangerous AI safety profile.
3. Difficulty Effect. Harder tasks benefit dramatically more from metacognition (Pearson r = -0.777, p < 0.001).

```python
from datasets import load_dataset

dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
```

Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in LLMs

FINAL Bench is the first tool to tell apart what AI truly knows from what it merely pretends to know.
updated a Space about 2 months ago: ginipick/retane
Organizations
ginipick
ginipick's datasets · 9 · Sort: Recently updated
- ginipick/awesome-chatgpt-prompts · Viewer · Updated Nov 2, 2025 · 203 · 35
- ginipick/Toucan-1.5M · Viewer · Updated Nov 2, 2025 · 1.65M · 328
- ginipick/finewiki · Viewer · Updated Nov 2, 2025 · 61.4M · 155
- ginipick/darwin-a2ap-analysis · Viewer · Updated Nov 2, 2025 · 1k · 28
- ginipick/market · Updated Oct 25, 2025 · 82
- ginipick/darwin-a2ap-analysis-20250915_163155 · Viewer · Updated Sep 15, 2025 · 20 · 8
- ginipick/darwin-a2ap-analysis-20250915_151507 · Viewer · Updated Sep 15, 2025 · 100 · 7
- ginipick/pdf-test · Viewer · Updated May 28, 2025 · 5 · 13
- ginipick/autotrain-data-autotrain-7u119-vc77x · Viewer · Updated May 9, 2024 · 8 · 15