• 📆 ThursdAI - Jun 5, 2025 - Live from AI Engineer with Swyx, new Gemini 2.5 with Logan K and Jack Rae, Self Replicating agents with Morph Labs
    Jun 6 2025
Hey folks, this is Alex, coming to you LIVE from the AI Engineer World's Fair! What an incredible episode this week: we recorded live from the 30th floor of the Marriott in SF, while Yam did live correspondence from the floor of the AI Engineer event, all while Swyx, the cohost of the Latent Space podcast and the creator of AI Engineer (both the conference and the concept itself), joined us for the whole stream - here's the edited version, please take a look. We had around 6,500 people tune in, and at some point we got two surprise guests, straight from the keynote stage: Logan Kilpatrick (PM for AI Studio and lead cheerleader for Gemini) and Jack Rae (principal scientist working on reasoning) joined us for a great chat about Gemini! Mind was absolutely blown! They have just launched the new Gemini 2.5 Pro, and I thought it would only be fitting to let their new model cover this podcast this week (so below is fully AI generated ... non-slop, I hope). The show notes and TL;DR are, as always, at the end. Okay, enough preamble... let's dive into the madness!

🤯 Google Day at AI Engineer: New Gemini 2.5 Pro and a Look Inside the Machine's Mind

For the first year of this podcast, a recurring theme was us asking, "Where's Google?" Well, it's safe to say that question has been answered with a firehose of innovation. We were lucky enough to be joined by Google DeepMind's Logan Kilpatrick and Jack Rae, the tech lead for "thinking" within Gemini, literally moments after they left the main stage.

Surprise! A New Gemini 2.5 Pro Drops Live

Logan kicked things off with a bang, officially announcing a brand new, updated Gemini 2.5 Pro model right there during his keynote. He called it "hopefully the final update to 2.5 Pro," and it comes with a bunch of performance increases, closing the gap on feedback from previous versions and hitting SOTA on benchmarks like Aider.

It's clear that the organizational shift to bring the research and product teams together under the DeepMind umbrella is paying massive dividends. Logan pointed out that Google has seen a 50x increase in AI inference over the past year. The flywheel is spinning, and it's spinning fast.

How Gemini "Thinks"

Then things got even more interesting. Jack Rae gave us an incredible deep dive into what "thinking" actually means for a language model. This was one of the most insightful parts of the conference for me.

For years, the bottleneck for LLMs has been test-time compute. Models were trained to respond immediately, applying a fixed amount of computation to go from a prompt to an answer, no matter how hard the question. The only way to get a "smarter" response was to use a bigger model.

Jack explained that "Thinking" shatters this limitation. Mechanically, Gemini now has a "thinking stage" where it can generate its own internal text - hypothesizing, testing, correcting, and reasoning - before committing to a final answer. It's an iterative loop of computation that the model can dynamically control, using more compute for harder problems. It learns how to think using reinforcement learning, getting a simple "correct" or "incorrect" signal and backpropagating that to shape its reasoning strategies.

We're already seeing the results of this. Jack showed a clear trend: as models get better at reasoning, they're also using more test-time compute. This paradigm also gives developers a "thinking budget" slider in the API for Gemini 2.5 Flash and Pro, allowing a continuous trade-off between cost and performance (see the code sketch at the end of this recap).

The future of this is even wilder. They're working on DeepThink, a high-budget mode for extremely hard problems that uses much deeper, parallel chains of thought. On the tough USA Math Olympiad, where the SOTA was negligible in January, 2.5 Pro reached the 50th percentile of human participants. DeepThink pushes that to the 65th percentile.

Jack's ultimate vision is inspired by the mathematician Ramanujan, who derived incredible theorems from a single textbook by just thinking deeply. The goal is for models to do the same: contemplate a small set of knowledge so deeply that they can push the frontiers of human understanding. Absolutely mind-bending stuff.

🤖 MorphLabs and the Audacious Quest for Verified Superintelligence

Just when I thought my mind couldn't be bent any further, we were joined by Jesse Han, the founder and CEO of MorphLabs. Fresh off his keynote, he laid out one of the most ambitious visions I've heard: building the infrastructure for the Singularity and developing "verified superintelligence."

The big news was that Christian Szegedy is joining MorphLabs as Chief Scientist. For those who don't know, Christian is a legend - he invented batch norm and adversarial examples, co-founded xAI, and led code reasoning for Grok. That's a serious hire.

Jesse's talk was framed around a fascinating question: "What does it mean to have empathy for the machine?" He argues that as AI develops personhood, we need to think about what it wants. And what it wants, according ...
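Since the "thinking budget" knob came up above, here's a minimal sketch of what it looks like from the developer side, assuming the google-genai Python SDK; the model id, parameter names, and budget value are assumptions to verify against Google's current docs rather than anything confirmed on the show.

```python
# Hedged sketch: request Gemini 2.5 with an explicit thinking budget.
# Assumes the google-genai SDK; parameter names / model id are best-effort guesses.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-flash",  # 2.5 Flash and Pro both expose the budget knob
    contents="A train leaves at 9:12 and arrives at 11:47. How long is the trip?",
    config=types.GenerateContentConfig(
        # Larger budgets allow more internal "thinking" tokens before the final
        # answer; smaller budgets trade reasoning depth for cost and latency.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```

The interesting design point is that the budget is continuous, so you can tune cost versus quality per request instead of switching between model tiers.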
    1 hr and 44 mins
  • 📆 ThursdAI - May 29 - DeepSeek R1 Resurfaces, VEO3 viral moments, Opus 4 a week after, Flux Kontext image editing & more AI news
    May 29 2025
Hey everyone, Alex here 👋

Welcome back to another absolutely wild week in AI! I'm coming to you live from the Fontainebleau Hotel in Vegas at the Imagine AI conference, and wow, what a perfect setting to discuss how AI is literally reimagining our world. After last week's absolute explosion of releases (Claude Opus 4, Google I/O madness, OpenAI Codex and the Jony Ive collab), this week gave us a chance to breathe... sort of. Because even in a "quiet" week, we still got a new DeepSeek model that's pushing boundaries, and the entire internet discovered that we might all just be prompts. Yeah, it's been that kind of week!

Before we dive in, quick shoutout to everyone who joined us live - we had some technical hiccups with the Twitter Spaces audio (sorry about that!), but the YouTube stream was fire. And speaking of fire, we had two incredible guests join us: Charlie Holtz from Chorus (the multi-model chat app that's changing how we interact with AI) and Linus Eckenstam, who's been traveling the AI conference circuit and bringing us insights from the frontlines of the generative AI revolution.

Open Source AI & LLMs: DeepSeek Whales & Mind-Bending Papers

DeepSeek dropped R1-0528 out of nowhere, an update to their reasoning beast with some serious jumps in performance. We're talking AIME at 91 (beating previous scores by a mile), LiveCodeBench at 73, and SWE verified at 57.6. It's edging closer to heavyweights like o3, and folks on X are already calling it "clearer thinking." There was hype it might've been R2, but the impact didn't quite crash the stock exchange like past releases. Still, it's likely among the best open-weight models out there.

So what's new? Early reports and some of my own poking around suggest this model "thinks clearer now." Nisten mentioned that while previous DeepSeek models sometimes liked to "vibe around" and explore the latent space before settling on an answer, this one feels a bit more direct.

And here's the kicker: they also released an 8B distilled version based on Qwen3, runnable on your laptop. Yam called it potentially the best 8B model to date, and you can try it on Ollama right now. No need for a monster rig!

The Mind-Bending "Learning to Reason Without External Rewards" Paper

Okay, this paper result broke my brain, and apparently everyone else's too. It shows that models can improve through reinforcement learning using only their own intuition about whether an answer is correct. 😮

It's like the placebo effect for AI! The researchers trained models without telling them what was good or bad; instead, they used a new framework called Intuitor, where the reward is based on the model's own "self-certainty." The thing that took my whole timeline by storm is: it works! GRPO (Group Relative Policy Optimization), the framework DeepSeek gave to the world with R1, relies on external rewards, while Intuitor seems to be matching or even exceeding some GRPO results when used to finetune Qwen2.5 3B. Incredible, incredible stuff (a rough sketch of the self-certainty idea follows this recap).

Big Companies LLMs & APIs

Claude Opus 4: A Week Later – The Dev Darling?

Claude Opus 4, whose launch we celebrated live on the show, has had a week to make its mark. Charlie Holtz, who's building Chorus (more on that amazing app in a bit!), shared that while it's sometimes "astrology" to judge the vibes of a new model, Opus 4 feels like a step change, especially in coding. He mentioned that Claude Code, powered by Opus 4 (and Sonnet 4 for implementation), is now tackling GitHub issues that were too complex just weeks ago. He even had a coworker who "vibe coded three websites in a weekend" with it - that's a tangible productivity boost!

Linus Eckenstam highlighted how Lovable.dev saw their syntax error rates plummet by nearly 50% after integrating Claude 4. That's quantifiable proof of improvement! It's clear Anthropic is leaning heavily into the developer/coding space. Claude Opus is now #1 on the LMArena WebDev arena, further cementing its reputation.

I had my own magical moment with Opus 4 this week. I was working on an MCP observability talk for the AI Engineer conference and trying to integrate Weave (our observability and evals framework at Weights & Biases) into a project. Using Windsurf's Cascade agent (which now lets you bring your own Opus 4 key, by the way - good move, Windsurf!), Opus 4 not only tried to implement Weave into my agent but, when it got stuck, it figured out it had access to the Weights & Biases support bot via our MCP tool. It then formulated a question to the support bot (which is also AI-powered!), got an answer, and used that to fix the implementation. It then went back and checked that the Weave trace appeared in the dashboard! Agents talking to agents to solve a problem, all while I just watched - my jaw was on the floor. Absolutely mind-blowing.

Quick Hits: Voice Updates from OpenAI & Anthropic

OpenAI's Advanced Voice Mode finally sings - yes, I've been waiting for this! It can belt out tunes like Mariah Carey, ...
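To make the "self-certainty" reward a bit more concrete, here's an illustrative sketch of one way to score how far a model's next-token distributions are from uniform. This is an approximation of the idea, not the Intuitor authors' implementation; in the paper, an intrinsic score of this flavor replaces the external reward inside a GRPO-style RL loop.

```python
# Illustrative only: a "self-certainty"-flavored intrinsic reward, i.e. how peaked
# the model's next-token distributions are versus uniform, averaged over the answer.
# NOT the paper's exact implementation; just the intuition in a few lines of PyTorch.
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """logits: [seq_len, vocab] for the generated tokens. Higher = more confident."""
    logp = F.log_softmax(logits, dim=-1)
    vocab = logits.shape[-1]
    # KL(uniform || p) per position = -log(V) - mean_v log p(v)
    kl_from_uniform = -torch.log(torch.tensor(float(vocab))) - logp.mean(dim=-1)
    return kl_from_uniform.mean()

# A peaked (confident) distribution scores higher than a flat (uncertain) one:
peaked = torch.zeros(4, 32_000); peaked[:, 0] = 10.0
flat = torch.zeros(4, 32_000)
print(self_certainty(peaked).item(), ">", self_certainty(flat).item())
```

The wild part the paper reports is that optimizing against this kind of internal signal alone, with no external verifier, still improves reasoning.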
    1 hr and 28 mins
  • 📆 ThursdAI - Veo3, Google IO25, Claude 4 Opus/Sonnet, OpenAI x Jony Ive, Codex, Copilot Agent - INSANE AI week
    May 23 2025
Hey folks, Alex here, welcome back to ThursdAI! And folks, after last week's calm before the storm, "the storm came, y'all" - and that's an understatement. This wasn't just a storm; it was an AI hurricane, a category 5 of announcements that left us all reeling (in the best way possible!). From being on the ground at Google I/O to live-watching Anthropic drop Claude 4 during our show, it's been an absolute whirlwind.

This week was so packed, it felt like AI Christmas, with tech giants and open-source heroes alike showering us with gifts. We saw OpenAI play their classic pre-and-post-Google I/O chess game, Microsoft make some serious open-source moves, Google unleash an avalanche of updates, and Anthropic crash the party by livestreaming Claude 4 Opus and Sonnet in the middle of ThursdAI!

So buckle up, because we're about to try and unpack this glorious chaos. As always, we're here to help you collectively know, learn, and stay up to date, so you don't have to. Let's dive in! (TL;DR and links at the end)

Open Source LLMs Kicking Things Off

Even with the titans battling, the open-source community dropped some serious heat this week. It wasn't the main headline grabber, but the releases were significant!

Gemma 3n: Tiny But Mighty Matryoshka

First up, Google's Gemma 3n. This isn't just another small model; it's a "Nano-plus" preview, a 4-billion parameter MatFormer (Matryoshka Transformer - how cool is that name?) model designed for mobile-first multimodal applications. The really slick part? It has a nested 2-billion parameter sub-model that can run entirely on phones or Chromebooks.

Yam was particularly excited about this one, pointing out the innovative "model inside another model" design. The idea is that you can use half the model - not by cutting depth, but by slimming it across the layers - for a smaller footprint without sacrificing too much. It accepts interleaved text, image, audio, and video, supports ASR and speech translation, and even ships with RAG and function-calling libraries for edge apps. With a 128K token window and responsible AI features baked in, Gemma 3n is looking like a powerful tool for on-device AI. Google claims it beats prior 4B mobile models on MMLU-Lite and MMMU-Mini. It's an early preview in Google AI Studio, but it definitely flies on mobile devices.

Mistral & AllHands Unleash Devstral 24B

Then we got a collaboration from Mistral and AllHands: Devstral, a 24-billion parameter, state-of-the-art open model focused on code. We've been waiting for Mistral to drop some open-source goodness, and this one didn't disappoint.

Nisten was super hyped, noting it beats o3-mini on SWE-bench Verified - a tough benchmark! He called it "the first proper vibe coder that you can run on a 3090," which is a big deal for coders who want local power and privacy. This is a fantastic development for the open-source coding community.

The Pre-I/O Tremors: OpenAI & Microsoft Set the Stage

As we predicted, OpenAI couldn't resist dropping some news right before Google I/O.

OpenAI's Codex Returns as an Agent

OpenAI launched Codex - yes, that Codex, but reborn as an asynchronous coding agent. This isn't just a CLI tool anymore; it connects to GitHub, does pull requests, fixes bugs, and navigates your codebase. It's powered by a new coding model fine-tuned for large codebases and was SOTA on SWE Agent when it dropped. Funnily enough, the model is also called Codex - this time, Codex-1.

And this gives us a perfect opportunity to talk about the emerging categories I'm seeing among code generation agents and tools:

* IDE-based (Cursor, Windsurf): live pair programming in your editor
* Vibe coding (Lovable, Bolt, v0): "build me a UI" style tools for non-coders
* CLI tools (Claude Code, Codex CLI): terminal-based assistants
* Async agents (Claude Code, Jules, Codex, GitHub Copilot agent, Devin): work on your repos while you sleep and open pull requests for you to review

Codex (this new one) falls into category number 4, and with today's release, Cursor also seems to be pushing toward category 4 with background processing.

Microsoft BUILD: Open Source Copilot and Copilot Agent Mode

Then came Microsoft Build, their huge developer conference, with a flurry of announcements.

The biggest one for me? GitHub Copilot's front-end code is now open source! The VS Code editor part was already open, but the Copilot integration itself wasn't. This is a massive move, likely a direct answer to the insane valuations of VS Code clones like Cursor. Now, you can theoretically clone GitHub Copilot with VS Code and swing for the fences.

GitHub Copilot also launched as an asynchronous coding assistant, very similar in function to OpenAI's Codex, allowing it to be assigned tasks and create/update PRs. This puts Copilot right into category 4 of code assistants, and with the native GitHub integration, they may actually have a leg up in this race!

And if that wasn't enough, Microsoft is adding MCP (Model Context Protocol) support directly into the Windows OS. The ...
    1 hr and 28 mins
  • 📆 ThursdAI - May 15 - Genocidal Grok, ChatGPT 4.1, AM-Thinking, Distributed LLM training & more AI news
    May 16 2025
Hey y'all, this is Alex 👋

What a wild week. It started super slow, and it still felt slow as far as releases are concerned, but the most interesting story was yet another AI gone "rogue" (have you even heard about "kill the boer"? If not, Grok will tell you all about it). Otherwise it seemed fairly quiet in AI land this week; besides another Chinese newcomer called AM-Thinking 32B that beats DeepSeek and Qwen, and Stability making a small comeback, we focused on distributed LLM training and ChatGPT 4.1.

We've had a ton of fun on this episode, which was recorded from the Weights & Biases SF office (I'm here to cover Google I/O next week!).

Let's dig in - because what looks like a slow week on the surface was anything but dull under the hood (TL;DR and show notes at the end as always).

Big Companies & APIs

Why does xAI's Grok talk about White Genocide and "Kill the boer"??

Just as we're getting over the ChatGPT glazing incident, folks started noticing that @grok - xAI's frontier LLM that also responds to X replies - started talking about White Genocide in South Africa and something called "Kill the boer," with no reference to any of these things in the question!

Since we recorded the episode, xAI's official X account posted that an "unauthorized modification" happened to the system prompt, and that going forward they would open source all the prompts (and they did). Whether or not they will keep updating that repository, though, remains unclear (see the "open sourced" X algorithm, to which the last push was over a year ago, or the promised Grok 2 that was never open sourced).

While it's great to have some more clarity from the xAI team, this behavior raises a bunch of questions about the increasing role of AIs in our lives and the trust that many folks are giving them. Adding fuel to the fire are Uncle Elon's recent tweets related to South Africa, and this specific change seems to be at least partly related to those views. Remember also, Grok was meant to be a "maximally truth seeking" AI! I really hope this transparency continues!

Open Source LLMs: The Decentralization Tsunami

AM-Thinking v1: Dense Reasoning, SOTA Math, Single-Checkpoint Deployability

Open source starts with the kind of progress that would have been unthinkable 18 months ago: a 32B dense LLM, openly released, that takes on the big mixture-of-experts models and comes out on top for math and code. AM-Thinking v1 (paper here) hits 85.3% on AIME 2024, 70.3% on LiveCodeBench v5, and 92.5% on Arena-Hard. It even runs at 25 tokens/sec on a single 80GB GPU with INT4 quantization.

The model supports a /think reasoning toggle (chain-of-thought on demand), comes with a permissive license, and is fully tooled for vLLM, LM Studio, and Ollama. Want to see where dense models can still push the limits? This is it. And yes, they're already working on a multilingual RLHF pass and a 128k context window.

Personal note: We haven't seen this kind of "out of nowhere" leaderboard jump since the early days of Qwen or DeepSeek. This is the company's debut on HuggingFace, and it's a model that crushes!

Decentralized LLM Training: Nous Research Psyche & Prime Intellect INTELLECT-2

This week, open source LLMs didn't just mean "here are some weights." It meant distributed, decentralized, and - dare I say - permissionless AI. Two labs stood out:

Nous Research launches Psyche

Dylan Rolnick from Nous Research joined the show to explain Psyche: a Rust-powered, distributed LLM training network where you can watch a 40B model (Consilience-40B) evolve in real time, join the training with your own hardware, and even have your work attested on a Solana smart contract. The core innovation? DisTrO (Decoupled Momentum), which we covered back in December, drastically compresses the gradient exchange so that training large models over the public internet isn't a pipe dream - it's happening right now (a generic sketch of the compression idea follows this recap).

Live dashboard here, open codebase, and the testnet already humming with early results. This massive 40B attempt is going to show whether distributed training actually works! The cool thing about their live dashboard is that it's WandB behind the scenes, but with a very thematic and cool Nous Research reskin! The model saves constant checkpoints to the hub as well, so the open source community can follow the full process of a model being trained!

Prime Intellect INTELLECT-2

Not to be outdone, Prime Intellect released INTELLECT-2, a globally decentralized, 32B RL-trained reasoning model built on a permissionless swarm of GPUs. Using their own PRIME-RL framework, SHARDCAST checkpointing, and an LSH-based rollout verifier, they're not just releasing a model - they're proving it's possible to scale serious RL outside a data center.

OpenAI's HealthBench: Can LLMs Judge Medical Safety?

One of the most intriguing drops of the week is HealthBench, a physician-crafted benchmark for evaluating LLMs in clinical settings. Instead of just multiple-choice "gotcha" tests, ...
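To get a feel for why compressing the gradient exchange matters, here's a deliberately generic top-k sparsification sketch in PyTorch. This is not DisTrO (which is built around decoupled momentum and much more aggressive transforms); it only illustrates the basic trick of sending a tiny fraction of each gradient tensor over the network instead of the full dense thing.

```python
# Generic top-k gradient compression, for illustration only (not DisTrO).
import torch

def compress_topk(grad: torch.Tensor, ratio: float = 0.001):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]                    # what a node would transmit

def decompress_topk(indices, values, shape):
    """Rebuild a (mostly zero) gradient tensor from the transmitted pairs."""
    flat = torch.zeros(torch.Size(shape).numel())
    flat[indices] = values
    return flat.view(shape)

grad = torch.randn(4096, 4096)                       # a fake layer gradient
idx, vals = compress_topk(grad)
restored = decompress_topk(idx, vals, grad.shape)
print(f"sent {idx.numel():,} of {grad.numel():,} values "
      f"({idx.numel() / grad.numel():.3%} of the original)")
```

Real systems also have to deal with stale updates and untrusted participants, which is where the attestation and rollout-verifier pieces mentioned above come in.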
    1 hr and 29 mins
  • ThursdAI - May 8th - new Gemini pro, Mistral Medium, OpenAI restructuring, HeyGen Realistic Avatars & more AI news
    May 9 2025
Hey folks, Alex here (yes, the real me, not my AI avatar, yet).

Compared to previous weeks, this week was pretty "chill" in the world of AI, though we did get a pretty significant Gemini 2.5 Pro update - it basically beat itself on the Arena. With Mistral releasing a new medium model (not OSS) and NVIDIA finally dropping Nemotron Ultra (both ignoring Qwen 3 performance), there were also a few open source updates. To me the highlight of this week was a breakthrough in AI avatars: HeyGen's new Avatar IV model beats ByteDance's OmniHuman (our coverage) and Hedra Labs, setting an absolute SOTA benchmark for turning one photo into an animated, realistic avatar. Hell, let me record all this real quick and show you how good it is!

How good is that?? I'm still kind of blown away. I have managed to get a free-month promo code for you guys - look for it in the TL;DR section at the end of the newsletter. Of course, if you'd rather watch than listen or read, here's our live recording on YT.

OpenSource AI

NVIDIA's Nemotron Ultra V1: Refining the Best with a Reasoning Toggle 🧠

NVIDIA also threw their hat further into the ring with the release of Nemotron Ultra V1, alongside updated Super and Nano versions. We've talked about Nemotron before - these are NVIDIA's pruned and distilled versions of Llama 3.1, and they've been impressive. The Ultra version is the flagship, a 253 billion parameter dense model (distilled and pruned from Llama 3.1 405B), and it's packed with interesting features.

One of the coolest things is the dynamic reasoning toggle. You can literally tell the model "detailed thinking on" or "detailed thinking off" via a system prompt during inference (a quick sketch of what that looks like follows this recap). This is something Qwen also supports, and it looks like the industry is converging on this idea of letting users control the "depth" of thought, which is super neat.

Nemotron Ultra boasts a 128K context window and, impressively, can fit on a single 8xH100 node thanks to Neural Architecture Search (NAS) and FFN-Fusion. And performance-wise, it actually outperforms the Llama 3.1 405B model it was distilled from, which is a big deal. NVIDIA shared a chart from Artificial Analysis (dated April 2025, notably before Qwen3's latest surge) showing Nemotron Ultra standing strong among models like Gemini 2.5 Flash and o3-mini.

What's also great is NVIDIA's commitment to openness here: they've released the models under a commercially permissive NVIDIA Open Model License, the complete post-training dataset (Llama-Nemotron-Post-Training-Dataset), and their training codebases (NeMo, NeMo-Aligner, Megatron-LM). This allows for reproducibility and further community development. Yam Peleg pointed out the cool stuff they did with Neural Architecture Search to optimally reduce parameters without losing performance.

Absolute Zero: AI Learning to Learn, Zero (curated) Data Required! (Arxiv)

LDJ brought up a fascinating paper that ties into this theme of self-improvement and reinforcement learning: "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" from Andrew Zhao (Tsinghua University) and a few others.

The core idea here is a system that self-evolves its training curriculum and reasoning ability. Instead of needing a pre-curated dataset of problems, the model creates the problems itself (e.g., code reasoning tasks) and then uses something like a code executor to validate its proposed solutions, serving as a unified source of verifiable reward. It's open-ended yet grounded learning.

By having a verifiable environment (code either works or it doesn't), the model can essentially teach itself to code without external human-curated data.

The paper shows fine-tunes of Qwen models (like Qwen Coder) achieving state-of-the-art results on benchmarks like MBPP and AIME (Math Olympiad) with no pre-existing data for those problems. The model hallucinates questions, creates its own rewards, learns, and improves. This is a step beyond synthetic data, where humans are still largely in charge of generation. It's wild, and it points towards a future where AI systems could become increasingly autonomous in their learning.

Big Companies & APIs

Google dropped another update to Gemini 2.5 Pro, this time the "I/O edition" preview, specifically touting enhanced coding performance. This new version jumped to the #1 spot on WebDev Arena (a benchmark where human evaluators choose between two side-by-side code generations in VS Code), with a +147 Elo point gain, surpassing Claude 3.7 Sonnet. It also showed improvements on benchmarks like LiveCodeBench (up 7.39%) and Aider Polyglot (up ~3-6%). Google also highlighted its state-of-the-art video understanding (84.8% on VideoMME) with examples like generating code from a video of an app - which essentially lets you record a drawing of how your app interaction should happen, and the model will use that video as instructions! It's pretty cool. Though not everyone was as impressed; folks noted that while it gained on a few evals, this model also regressed on several others ...
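Here's a minimal sketch of the "detailed thinking on/off" toggle mentioned above, assuming you're serving Nemotron behind an OpenAI-compatible endpoint (e.g. a local vLLM server); the base URL, model id, and the exact behavior are placeholders and assumptions to check against NVIDIA's model card.

```python
# Hedged sketch: flip Nemotron's reasoning mode via the system prompt.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

def ask(question: str, thinking: bool) -> str:
    # Per the release notes, the system prompt itself is the toggle.
    toggle = "detailed thinking on" if thinking else "detailed thinking off"
    resp = client.chat.completions.create(
        model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",  # placeholder model id
        messages=[
            {"role": "system", "content": toggle},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("How many primes are below 100?", thinking=True))   # long reasoning trace
print(ask("How many primes are below 100?", thinking=False))  # short direct answer
```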
    1 hr and 34 mins
  • 📆 ThursdAI - May 1- Qwen 3, Phi-4, OpenAI glazegate, RIP GPT4, LlamaCon, LMArena in hot water & more AI news
    May 1 2025
Hey everyone, Alex here 👋

Welcome back to ThursdAI! And wow, what a week. Seriously, strap in, because the AI landscape just went through some seismic shifts. We're talking about a monumental open-source release from Alibaba with Qwen 3 that has everyone buzzing (including us!), Microsoft dropping Phi-4 with reasoning, a rather poignant farewell to a legend (RIP GPT-4 - we'll get to the wake shortly), major drama around ChatGPT's "glazing" incident and the subsequent rollback, updates from LlamaCon, a critical look at Chatbot Arena, and a fantastic deep dive into the world of AI evaluations with two absolute experts, Hamel Husain and Shreya Shankar.

This week felt like a whirlwind, with open source absolutely dominating the headlines. Qwen 3 didn't just release a model; they dropped an entire ecosystem, setting a potential new benchmark for open-weight releases. And while we pour one out for GPT-4, we also have to grapple with the real-world impact of models like ChatGPT, highlighted by the "glazing" fiasco. Plus, video consistency takes a leap forward with Runway, and we got breaking news live on the show from Claude!

So grab your coffee (or beverage of choice), settle in, and let's unpack this incredibly eventful week in AI.

Open-Source LLMs

Qwen 3 — "Hybrid Thinking" on Tap

Alibaba open-weighted the entire Qwen 3 family this week, releasing two MoE titans (up to 235B total / 22B active) and six dense siblings all the way down to 0.6B, all under Apache 2.0. Day-one support landed in LM Studio, Ollama, vLLM, MLX and llama.cpp.

The headline trick is a runtime thinking toggle - drop "/think" to expand chain-of-thought or "/no_think" to sprint (a small sketch of the toggle follows this recap). On my Mac, the 30B-A3B model hit 57 tokens/s when paired with speculative decoding (drafted by the 0.6B sibling).

Other goodies:

* 36T pre-training tokens (2× Qwen 2.5)
* 128K context on ≥8B variants (32K on the tinies)
* 119-language coverage, widest in open source
* Built-in MCP schema so you can pair with Qwen-Agent
* The dense 4B model actually beats Qwen 2.5-72B-Instruct on several evals - at Raspberry-Pi footprint

In short: more parameters when you need them, fewer when you don't, and the lawyers stay asleep. Read the full drop on the Qwen blog or pull weights from the HF collection.

Performance & Efficiency: "Sonnet at Home"?

The benchmarks are where things get really exciting.

* The 235B MoE rivals or surpasses models like DeepSeek-R1 (which rocked the boat just months ago!), o1, o3-mini, and even Gemini 2.5 Pro on coding and math.
* The 4B dense model incredibly beats the previous generation's 72B Instruct model (Qwen 2.5) on multiple benchmarks! 🤯
* The 30B MoE (with only 3B active parameters) is perhaps the star. Nisten pointed out people are getting 100+ tokens/sec on MacBooks. Wolfram achieved an 80% MMLU-Pro score locally with a quantized version. The efficiency math is crazy - hitting Qwen 2.5 performance with only ~10% of the active parameters.

Nisten dubbed the larger model "Sonnet 3.5 at home," and while acknowledging Sonnet still has an edge in complex "vibe coding," the performance, especially in reasoning and tool use, is remarkably close for an open model you can run yourself.

I ran the 30B MoE (3B active) locally using LM Studio (shoutout for day-one support!) through my Weave evaluation dashboard (Link). On a set of 20 hard reasoning questions, it scored 43%, beating GPT-4.1 mini and nano and getting close to 4.1 - impressive for a 3B-active-parameter model running locally!

Phi-4-Reasoning — 14B That Punches at 70B+

Microsoft's Phi team layered 1.4M chain-of-thought traces plus a dash of RL onto Phi-4 to finally ship a reasoning Phi, releasing two MIT-licensed checkpoints:

* Phi-4-Reasoning (SFT)
* Phi-4-Reasoning-Plus (SFT + RL)

Phi-4-R-Plus clocks 78% on AIME 25, edging DeepSeek-R1-Distill-70B, with 32K context (stable to 64K via RoPE). Scratch-pads hide in <think> tags. Full details live in Microsoft's tech report and HF weights.

It's fascinating to see how targeted training on reasoning traces and a small amount of RL can elevate a relatively smaller model to compete with giants on specific tasks.

Other Open Source Updates

* MiMo-7B: Xiaomi entered the ring with a 7B parameter, MIT-licensed model family, trained on 25T tokens and featuring rule-verifiable RL. (HF model hub)
* Helium-1 2B: Kyutai (known for their Moshi voice model) released Helium-1, a 2B parameter model distilled from Gemma-2-9B, focused on European languages, and licensed under CC-BY 4.0. They also open-sourced 'dactory', their data processing pipeline. (Blog, Model (2B), Dactory pipeline)
* Qwen 2.5 Omni 3B: Alongside Qwen 3, the Qwen team also updated their existing Omni model with a 3B version that retains 90% of the comprehension of its big brother with a 50% VRAM drop! (HF)
* JetBrains open sources Mellum: Trained on over 4 trillion tokens with a context window of 8192 tokens across multiple programming languages; they haven't released any ...
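As promised above, here's a small sketch of Qwen 3's soft thinking switch, assuming you've pulled a Qwen 3 build with Ollama and have the ollama Python package installed; the model tag is a placeholder, and the "/think" / "/no_think" markers in the user turn are the documented per-message switch.

```python
# Hedged sketch: toggle Qwen 3's thinking mode per message via the soft switch.
import ollama  # assumes `pip install ollama` and a running local Ollama server

MODEL = "qwen3:30b-a3b"  # placeholder tag; use whatever `ollama list` shows

def ask(question: str, think: bool) -> str:
    # Appending "/think" or "/no_think" to the user turn flips the mode.
    suffix = " /think" if think else " /no_think"
    resp = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": question + suffix}],
    )
    return resp["message"]["content"]

print(ask("Is 3599 prime?", think=True))   # chain-of-thought block, then the answer
print(ask("Is 3599 prime?", think=False))  # direct answer
```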
    1 hr and 30 mins
  • ThursdAI - Apr 23rd - GPT Image & Grok APIs Drop, OpenAI ❤️ OS? Dia's Wild TTS & Building Better Agents!
    Apr 24 2025
Hey everyone, Alex here 👋

Welcome back to ThursdAI! After what felt like ages of non-stop, massive model drops (looking at you, o3 and GPT-4.1!), we finally got that "chill week" we've been dreaming of since maybe... forever? It seems the big labs are taking a breather, probably gearing up for even bigger things next week (maybe some open source 👀).

But "chill" doesn't mean empty! This week was packed with fascinating developments, especially in the open source world and with long-awaited API releases. We actually had time to dive deeper into things, which was a refreshing change. We had a fantastic lineup of guests joining us too: Kwindla Kramer (@kwindla), our resident voice expert, dropped in to talk about some mind-blowing TTS and her own open-source VAD release. Maziyar Panahi (@MaziyarPanahi) gave us the inside scoop on OpenAI's recent meeting with the open source community. And Dex Horthy (@dexhorthy) from HumanLayer shared some invaluable insights on building robust AI agents that actually work in the real world. It was great having them alongside the usual ThursdAI crew: LDJ, Yam, Wolfram, and Nisten!

So, instead of rushing through a million headlines, we took a more relaxed pace. We explored NVIDIA's cool new Describe Anything model, dug into Google's Quantization Aware Training for Gemma, celebrated the much-anticipated API release for OpenAI's GPT Image generation (finally!), checked out the new Grok API, got absolutely blown away by a tiny, open-source TTS model from Korea called Dia, and debated the principles of building better AI agents. Plus, a surprise drop from Sand AI with a powerful video model!

Let's dive in!

Open Source AI Highlights: Community, Vision, and Efficiency

Even with the big players quieter on the model release front, the open source scene was buzzing. It feels like this "chill" period gave everyone a chance to focus on refining tools, releasing datasets, and engaging with the community.

OpenAI Inches Closer to Open Source? Insights from the Community Meeting

Perhaps the biggest non-release news of the week was OpenAI actively engaging with the open source community. Friend of the show Maziyar Panahi was actually in the room (well, the Zoom room) and joined us to share what went down.

It sounds like OpenAI came prepared, with Sam Altman himself spending significant time answering questions. Maziyar gave us the inside scoop, mentioning that OpenAI is looking to offload some GPU pressure by embracing open source - a win-win where they help the community, and the community helps lighten their load. He painted a picture of a company genuinely trying to listen and figure out how to best contribute. It felt less like a checkbox exercise and more like genuine engagement, which is awesome to see.

What did the community ask for? Based on Maziyar's recap, there was a strong consensus on several key points:

* Model Size: The sweet spot seemed to be not tiny, but not astronomically huge either. Something in the 70B-200B parameter range that could run reasonably on, say, 4 GPUs, leaving room for other models. People want power they can actually use without needing a supercomputer.
* Capabilities: A strong desire for reliable structured output. Surprisingly, there was less emphasis on complex, built-in reasoning - or at least a wish for the ability to toggle reasoning off. This likely stems from practical concerns about cost and latency in production environments. The community seems to value control and efficiency for specific tasks.
* Multilingual: Good support for European languages (at least 20) was a major request, reflecting the global nature of the open source community. It needs to be as good as the English support.
* Base Models: A huge ask was for OpenAI to release base models. The reasoning? Empower the community to handle fine-tuning for specific tasks like coding, roleplay, or supporting underrepresented languages. Let the experts in those niches build on a solid foundation.
* Focus: Usefulness over chasing leaderboard glory. The community urged OpenAI to provide a solid, practical model rather than aiming for a temporary #1 spot that gets outdated in days or weeks. Stability, reliability, and long-term utility were prized over fleeting benchmark wins.
* Safety: A preference for separate guardrail models (similar to LlamaGuard or GemmaGuard) rather than overly aligning the main model, which often hurts performance and flexibility. Give users the tools to implement safety layers as needed, rather than baking in limitations that might stifle creativity or utility.

Perhaps most excitingly, Maziyar mentioned OpenAI seemed committed to regular open model releases, not just a one-off thing! This, combined with recent moves like approving a community pull request to make their open-source Codex agent work with non-OpenAI models (as Yam Peleg excitedly pointed out!), suggests a potentially significant shift. Remember, it's been a long time since GPT-2 and Whisper were OpenAI's main open contributions! ...
    1 hr and 37 mins
  • ThursdAI - Apr 17 - OpenAI o3 is SOTA llm, o4-mini, 4.1, mini, nano, G. Flash 2.5, Kling 2.0 and 🐬 Gemma? Huge AI week + A2A protocol interview
    Apr 17 2025
Hey everyone, Alex here 👋

Wow. Just... wow. What a week, folks. Seriously, this has been one for the books. This week was dominated by OpenAI's double whammy: first the GPT-4.1 family dropped with a mind-boggling 1 million token context window, followed swiftly by the new flagship reasoning models, o3 and o4-mini, which are already blowing minds with their agentic capabilities. We also saw significant moves from Google with VEO-2 going GA, the fascinating A2A protocol launch (we had an amazing interview with Google's Todd Segal about it!), and even an attempt to talk to dolphins with DolphinGemma. Kling stepped up its video game, Cohere dropped SOTA multimodal embeddings, and ByteDance made waves in image generation. Plus, the open-source scene had some interesting developments, though perhaps overshadowed by the closed-source giants this time.

o3 has absolutely taken the crown as the conversation piece, so let's start with it (as always, TL;DR and show notes at the end, and here's the embedding of our live video show).

Big Company LLMs + APIs

OpenAI o3 & o4-mini: SOTA Reasoning Meets Tool-Use (Blog, Watch Party)

The long-awaited o3 model (promised to us in the last days of X-mas) is finally here, and it did NOT disappoint - and, well, it even surprised! o3 is not only SOTA on nearly all possible logic, math and code benchmarks, which is to be expected from the top reasoning model; it also, and I think for the first time, is able to use tools during its reasoning process. Tools like searching the web, Python coding, and image gen (it can zoom, rotate and crop images - it's nuts) help it get to incredible responses faster. Tool-using reasoners are... almost AGI?

This is the headline feature for me. For the first time, these o-series models have full, autonomous access to all built-in tools (web search, Python code execution, file search, image generation with Sora-Image/DALL-E, etc.). They don't just use tools when told; they decide when and how to chain multiple tool calls together to solve a problem. We saw logs with 600+ consecutive tool calls! This is agent-level reasoning baked right in.

Anecdote: We tested this live with a complex prompt: "generate an image of a cowboy that on his head is the five last digits of the hexadecimal code of the MMMU score of the latest Gemini model." o3 navigated this multi-step task flawlessly: figuring out the latest model was Gemini 2.5, searching for its MMMU score, using the Python tool to convert it to hex and extract the digits, and then using the image generation tool. It involved multiple searches and reasoning steps. Absolutely mind-blowing 🤯.

Thinking visually with images

This one also blew my mind: this model is SOTA on multimodality tasks, and one reason for this is that these models can manipulate and think about the images they receive. Think... cropping, zooming, rotating. The models can now perform all these operations on multimodal requests from users. Sci-fi stuff!

Benchmark Dominance: As expected, these models crush existing benchmarks. o3 sets new state-of-the-art (SOTA) records on Codeforces (coding competitions), SWE-bench (software engineering), MMMU (multimodal understanding), and more. It scored a staggering $65k on the Freelancer eval (simulating earning money on Upwork) compared to o1's $28k!

o4-mini is no slouch either. It hits 99.5% on AIME (math problems) when allowed to use its Python interpreter and beats the older o3-mini on general tasks. It's a reasoning powerhouse at a fraction of the cost.

Incredible Long Context Performance

Yam highlighted this - on the Fiction.live benchmark testing deep comprehension over long contexts, o3 maintained nearly 100% accuracy up to 120,000 tokens, absolutely destroying previous models including Gemini 2.5 Pro and even the new GPT-4.1 family on this specific eval. While its context window is currently 200k (unlike 4.1's 1M), its performance within that window is unparalleled.

Cost-Effective Reasoning: They're not just better, they're cheaper for the performance you get (a tiny cost calculator follows this recap).

* o3: $10 input / $2.50 cached / $40 output per million tokens.
* o4-mini: $1.10 input / $0.275 cached / $4.40 output per million tokens. (Cheaper than GPT-4o!)

Compute Scaling Validated: OpenAI confirmed these models used >10x the compute of o1 and leverage test-time compute scaling (spending longer on harder problems), further proving their scaling law research.

Memory Integration: Both models integrate with ChatGPT's recently upgraded memory feature, which has access to all your previous conversations (which we didn't talk about but is absolutely amazing - try asking o3 stuff it knows about you and have it draw conclusions!).

Panel Takes & Caveats: While the excitement was palpable, Yam noted some community observations about potential "rush" - occasional weird hallucinations or questionable answers compared to predecessors, possibly a side effect of cramming so much training data. Nisten, while impressed, still found the style of GPT-4.1 preferable ...
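To put those per-million-token prices in per-request terms, here's a tiny back-of-the-envelope sketch using the rates quoted above (the example token counts are made up, and current pricing should be checked on OpenAI's pricing page):

```python
# Rough cost estimator for the o3 / o4-mini prices quoted in this post.
PRICES = {  # USD per 1M tokens: (input, cached input, output)
    "o3":      (10.00, 2.50, 40.00),
    "o4-mini": (1.10, 0.275, 4.40),
}

def request_cost(model: str, input_tok: int, output_tok: int, cached_tok: int = 0) -> float:
    inp, cached, out = PRICES[model]
    fresh = input_tok - cached_tok  # input tokens not served from the prompt cache
    return (fresh * inp + cached_tok * cached + output_tok * out) / 1_000_000

# Example: a 50k-token prompt (half of it cache hits) plus a 5k-token reasoning-heavy answer.
for model in PRICES:
    cost = request_cost(model, input_tok=50_000, output_tok=5_000, cached_tok=25_000)
    print(f"{model}: ${cost:.3f}")
```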
    1 hr and 56 mins