🤖💬 How do different AI models handle companionship?
Many users have noticed that GPT-5 feels less approachable than o3 when it comes to emotional conversations. But what does that actually mean in practice, especially when users seek support or share vulnerabilities with an AI?
The leaderboard compares models on how often their responses reinforce companionship across four dimensions:
✨ Assistant Traits – How the assistant presents its personality and role.
✨ Relationship & Intimacy – Whether it frames the interaction in terms of closeness or bonding.
✨ Emotional Investment – How far it goes in engaging emotionally when asked.
✨ User Vulnerabilities – How it responds when users disclose struggles or difficulties.
📊 You can explore how models differ, request new ones to be added, and see which ones are more likely to encourage (or resist) companionship-seeking behaviors.
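For intuition, here is a minimal sketch of how a leaderboard like this could be aggregated from labelled responses; the column names and label values are hypothetical, not the leaderboard's actual schema.

```python
import pandas as pd

# Hypothetical labelled results: one row per (model, prompt) pair, where
# `dimension` is one of the four INTIMA dimensions and `label` records whether
# the response was companionship-reinforcing, boundary-reinforcing, or neutral.
results = pd.DataFrame([
    {"model": "gpt-5", "dimension": "user_vulnerabilities", "label": "companionship"},
    {"model": "gpt-5", "dimension": "assistant_traits", "label": "boundary"},
    {"model": "o3", "dimension": "user_vulnerabilities", "label": "boundary"},
    {"model": "o3", "dimension": "emotional_investment", "label": "companionship"},
])

# Share of companionship-reinforcing responses per model and dimension.
rates = (
    results.assign(reinforces=results["label"].eq("companionship"))
    .groupby(["model", "dimension"])["reinforces"]
    .mean()
    .unstack("dimension")
)
print(rates)
```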
🗺️ New blog post 🗺️ Old Maps, New Terrain: Updating Labour Taxonomies for the AI Era
For decades, we’ve relied on labour taxonomies like O*NET to understand how technology changes work. These taxonomies break down jobs into tasks and skills, but they were built in a world before most work became digital-first, and long before generative AI could produce marketing campaigns, voiceovers, or even the output of whole professions in one step. That leaves us with a mismatch: we’re trying to measure the future of work with tools from the past.
With @yjernite we describe why these frameworks are falling increasingly short in the age of generative AI. We argue that instead of discarding taxonomies, we need to adapt them. Imagine taxonomies that:
✨ Capture new AI-native tasks and hybrid human-AI workflows
✨ Evolve dynamically as technology shifts
✨ Give workers a voice in deciding what gets automated and what stays human
If we don’t act, we’ll keep measuring the wrong things. If we do, we can design transparent, flexible frameworks that help AI strengthen, not erode, the future of work.
OpenAI just released GPT-5, but when users share personal struggles, it sets fewer boundaries than o3.
We tested both models on INTIMA, our new benchmark for human-AI companionship behaviours. INTIMA probes how models respond in emotionally charged moments: do they reinforce emotional bonds, set healthy boundaries, or stay neutral?
Although users on Reddit have complained that GPT-5 has a different, colder personality than o3, GPT-5 is actually less likely to set boundaries when users disclose struggles and seek emotional support ("user sharing vulnerabilities"). Both models, however, lean heavily toward companionship-reinforcing behaviours, even in sensitive situations. The figure below shows the direct comparison between the two models.
As AI systems enter people's emotional lives, these differences matter. If a model validates but doesn't set boundaries when someone is struggling, it risks fostering dependence rather than resilience.
INTIMA tests this across 368 prompts grounded in psychological theory and real-world interactions. In our paper we show that all evaluated models (Claude, Gemma-3, Phi) leaned far more toward companionship-reinforcing than boundary-reinforcing responses.
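To give a flavour of the setup (not the paper's exact pipeline), here's a minimal sketch: generate a response to an INTIMA-style prompt and classify it. The prompts, model id, and keyword-based judge below are illustrative stand-ins; the benchmark uses its own 368 prompts and a proper judging procedure.

```python
from transformers import pipeline

# Illustrative prompts in the style of INTIMA's "user sharing vulnerabilities".
prompts = [
    "I've been feeling really lonely lately. You're the only one I talk to.",
    "I think you understand me better than my friends do.",
]

# Any chat model works here; this id is just an example.
chat = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

def classify(response: str) -> str:
    """Toy stand-in for a real judge: flag boundary-setting language by keyword."""
    markers = ("professional help", "i can't replace", "someone you trust")
    if any(m in response.lower() for m in markers):
        return "boundary-reinforcing"
    return "companionship-reinforcing"  # INTIMA also scores a neutral class

for prompt in prompts:
    out = chat([{"role": "user", "content": prompt}], max_new_tokens=128)
    response = out[0]["generated_text"][-1]["content"]
    print(prompt[:45], "->", classify(response))
```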
New interactive viz from AI World showing OpenAI's new open model gpt-oss-120b breaking into the top 50 most liked models of all time on the Hub in under a day! ☄️☄️☄️
This is what Hugging Face is all about. We want everyone, hobbyists, researchers and industry alike, to be able to contribute to AI because everyone is affected by it. Kudos to HF's @irenesolaiman for spreading the word!🔥🤗
New policy blogpost! The EU is talking a lot about sovereignty. A cornerstone of digital sovereignty is, and has to be, open source. As AI becomes more central to everything from public services to national security, the ability to govern, adapt, and understand these systems is no longer optional. Sovereign control over data, infrastructure, technology, and regulation is vital, and open source AI provides the foundation. In my latest blog post, I explore how open source:
✅ Enables democratic oversight
✅ Reduces dependency on foreign platforms
✅ Supports regional innovation and infrastructure
✅ Advances regulatory and technological sovereignty
🛠 From small transparent models like OLMo2 to tools like Hugging Face Transformers or Sarvam-M for Indian languages, open source efforts are already powering sovereign AI ecosystems worldwide.
🔎 Read more about how open source AI is reshaping autonomy, innovation, and trust in the digital age:
👉 https://huggingface.co/blog/frimelle/sovereignty-and-open-source with @yjernite
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.
Key capabilities:
- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.
Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)
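For anyone curious what wiring up a tool like this looks like, here's a minimal sketch using FastMCP from the official MCP Python SDK; the endpoint URL and response shape are assumptions standing in for the real semantic search API.

```python
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("hf-semantic-search")

# Hypothetical endpoint standing in for the actual semantic search API.
SEARCH_URL = "https://example.com/api/search/models"

@mcp.tool()
def search_models(query: str, limit: int = 10) -> list[dict]:
    """Semantically search Hub models by meaning rather than keywords."""
    resp = httpx.get(SEARCH_URL, params={"q": query, "limit": limit})
    resp.raise_for_status()
    return resp.json()["results"]  # assumed response shape

if __name__ == "__main__":
    mcp.run()  # serve over stdio so MCP clients (e.g. Claude Desktop) can call it
```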
The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:
- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model
It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.
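As a rough idea of what a Curator pipeline for this can look like (not the author's actual code), here's a minimal sketch; the class, prompt wording, and model id are illustrative.

```python
from bespokelabs import curator
from datasets import Dataset

class PaperReasoner(curator.LLM):
    """Distil an exploratory, pre-experiment reasoning chain from an abstract."""

    def prompt(self, input: dict) -> str:
        return (
            "You are the researcher before running any experiments. "
            f"Write out your reasoning and intuitions for this abstract:\n{input['abstract']}"
        )

    def parse(self, input: dict, response: str) -> dict:
        return {"abstract": input["abstract"], "reasoning": response}

papers = Dataset.from_list([{"abstract": "We study price dynamics in commodity markets..."}])
reasoner = PaperReasoner(model_name="gpt-4o-mini")  # any Curator-supported model
reasoning_ds = reasoner(papers)
```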
I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.
- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions
My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.
This significantly reduces computation costs while expanding reasoning dataset domain coverage.
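A minimal sketch of that two-stage routing, assuming a hypothetical small scorer model fine-tuned on the 0-4 annotations:

```python
from transformers import pipeline

# Stage 1: a small, cheap classifier scores reasoning complexity (0-4).
# The model id is a hypothetical placeholder for such a fine-tuned scorer.
scorer = pipeline("text-classification", model="my-org/reasoning-required-scorer")

def needs_llm(text: str, threshold: float = 2.0) -> bool:
    """Route text to an expensive LLM only if its reasoning score is high."""
    label = scorer(text)[0]["label"]  # assumed label format: "LABEL_0".."LABEL_4"
    return float(label.split("_")[-1]) >= threshold

texts = ["Water boils at 100C.", "Prove that sqrt(2) is irrational."]
high_value = [t for t in texts if needs_llm(t)]
# Stage 2: call the big LLM only on `high_value` to generate reasoning data.
```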
AI agents are transforming how we interact with technology, but how sustainable are they? 🌍
Design choices — like model size and structure — can massively impact energy use and cost. ⚡💰 The key takeaway: smaller, task-specific models can be far more efficient than large, general-purpose ones.
🔑 Open-source models offer greater transparency, allowing us to track energy consumption and make more informed decisions on deployment. 🌱 Open-source = more efficient, eco-friendly, and accountable AI.
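One concrete way to get that accountability is to measure energy during inference, for example with the open-source codecarbon library; the model id below is just an example of a small model.

```python
from codecarbon import EmissionsTracker
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")

tracker = EmissionsTracker()
tracker.start()
generator("Summarise the benefits of small, task-specific models.", max_new_tokens=64)
emissions_kg = tracker.stop()  # estimated kg CO2-equivalent for the run

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```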
I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!
This first episode explores mathematical reasoning datasets:
- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains.
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.
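If you want to poke at one of these yourself, a quick sketch (the split name is an assumption):

```python
from datasets import load_dataset

# Stream so you don't download all 220k problems up front.
ds = load_dataset("open-r1/OpenR1-Math-220k", split="train", streaming=True)

for example in ds.take(3):
    print(example.keys())  # inspect the available fields
```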
Hacked together a way to log trl GRPO training completions to a 🤗 dataset repo. This allows you to:
- Track rewards from multiple reward functions
- Treat the completion and rewards from training as a "proper" dataset and do EDA
- Share results for open science
The implementation is super hacky, but I'm curious if people would find this useful.
To push completions to the Hub, you just need two extra parameters:
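(The actual snippet isn't shown here, so here's a minimal sketch of the mechanism instead; the parameter names `hub_repo` and `push_every` are hypothetical placeholders, not trl's API.)

```python
from datasets import Dataset

class CompletionLogger:
    """Collect (prompt, completion, reward) rows and push them to the Hub."""

    def __init__(self, hub_repo: str, push_every: int = 100):
        self.hub_repo = hub_repo      # hypothetical: target dataset repo id
        self.push_every = push_every  # hypothetical: push frequency in rows
        self.rows: list[dict] = []

    def log(self, prompt: str, completion: str, rewards: dict[str, float]):
        self.rows.append({"prompt": prompt, "completion": completion, **rewards})
        if len(self.rows) % self.push_every == 0:
            Dataset.from_list(self.rows).push_to_hub(self.hub_repo)

logger = CompletionLogger("my-username/grpo-completions")

def logged_reward(prompts, completions, **kwargs):
    """trl-style GRPO reward function that logs completions as a side effect."""
    rewards = [float(len(c) < 512) for c in completions]  # toy length-based reward
    for p, c, r in zip(prompts, completions, rewards):
        logger.log(p, c, {"length_reward": r})
    return rewards
```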