Latest

Fresh from the feed

Filter by timeframe and category to zero in on the moves that matter.

Phi-4 proves that a 'data-first' SFT methodology is the new differentiator
news
VentureBeat – AI · 4 days ago

AI engineers often chase performance by scaling up LLM parameters and data, but the trend toward smaller, more efficient, better-focused models has accelerated. The Phi-4 fine-tuning methodology is the cleanest public example of a training approach that smaller enterprise teams can copy: it shows how a carefully chosen dataset and fine-tuning strategy can make a 14B model compete with much larger ones. Phi-4 was trained on just 1.4 million carefully chosen prompt-response pairs. Instead of brute force, the Microsoft Phi-4 research team focused on “teachable” examples at the edge of the model’s abilities and on rigorous data curation, a smart-data playbook in which strategic curation plus replicable SFT and RL elevates a 14B model beyond much larger counterparts.

Why Phi-4 stands apart

Smaller reasoning models, such as OpenAI’s o1-mini and Google’s Gemma, are becoming more common, and models like Alibaba’s Qwen3 (8B and 14B) are seeing wide adoption across use cases. That adoption is important, but it doesn’t displace the value of Phi-4 as an experimental proof: Phi-4 was designed as a testbed for a data-first training methodology, and its documentation reads like a smart-data playbook for teams that want to replicate the approach.

The Phi-4 team has shared a repeatable SFT playbook built around a 1.4-million-pair prompt-response set. Its core is “teachable” edge examples: questions that are neither too easy nor too difficult, chosen to push the model’s reasoning. Each topic, such as math or code, is tuned separately and then combined with synthetic rewrites that turn complex tasks into forms that can be checked automatically. The paper outlines the data selection and filtering process in enough detail for smaller teams to reproduce it with open-source models and evaluators. For enterprise teams, that level of transparency turns a research result into a practical, copyable training recipe they can implement and measure quickly.

The data-first philosophy: Why less can be more

Traditional approaches to LLM reasoning have often relied on scaling datasets massively to encourage generalization. Phi-4 reasoning takes a different path, showing that carefully curated data can achieve similar or even better results with far less. The team assembled a dataset covering STEM, coding, and safety. Despite its small size, it outperformed models trained on orders of magnitude more data. In benchmarks, the 14B Phi-4 reasoning model outperformed OpenAI’s o1-mini and DeepSeek’s 70B distilled model across most reasoning tasks, and approached the full DeepSeek-R1 (671B) on challenging math (AIME) questions. With just 14 billion parameters, Phi-4 reasoning delivers the following results compared to other leading models:

| Benchmark (task) | Phi-4 reasoning | Comparison model (size) | Comparison score | Date / Source |
|---|---|---|---|---|
| AIME 2024 (math olympiad) | 75.3% | o1-mini | 63.6% | Microsoft Phi-4 model card, Apr 2025 (Hugging Face) |
| AIME 2025 (math olympiad) | 62.9% | DeepSeek-R1-Distill-70B | 51.5% | Microsoft Phi-4 model card, Apr 2025 (Hugging Face) |
| OmniMath | 76.6% | DeepSeek-R1-Distill-70B | 63.4% | Microsoft Phi-4 model card, Apr 2025 (Hugging Face) |
| GPQA-Diamond (graduate-level science) | 65.8% | o1-mini | 60.0% | Microsoft Phi-4 model card, Apr 2025 (Hugging Face) |
| OmniMath (same benchmark, different comparison) | 76.6% | Claude-3.7-Sonnet | 54.6% | Microsoft Phi-4 model card, Apr 2025 (Hugging Face) |

Table: Phi-4 reasoning performance across benchmarks compared to other models. Source: Microsoft

The key to this is filtering for quality over quantity. Much generic data is either too easy (the base model already knows it) or too hard (no learning signal), and the Phi-4 team explicitly discards such examples. “Given the strong baseline reasoning capabilities of Phi-4, many initial seed questions are already handled competently,” they note. “To make further learning impactful, we specifically target seeds situated at the edge of Phi-4’s current abilities.”

In practice, they rely on LLM-based evaluation. For each candidate question, a strong reference model (like GPT-4) generates an “answer key,” and the answers from weaker models are compared against it. If the weaker model disagrees often enough, that indicates a teachable gap: those questions are retained, while trivially solved or utterly unsolvable questions are dropped. For example, a simple arithmetic problem might be dropped (too easy), an extremely obscure theorem proof might also be dropped (too hard), but a moderately challenging geometry problem that Phi-4 gets wrong is included. This “sweet spot” approach ensures every example forces the model to stretch its reasoning. By focusing on multi-step problems rather than rote recall, they pack maximum learning into 1.4M examples. As the authors explain, training on these carefully chosen seeds “leads to broad generalization across both reasoning-specific and general-purpose tasks.” In effect, Phi-4 reasoning demonstrates that intelligent data selection can outperform brute-force scaling.
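To make that filtering step concrete, here is a minimal sketch of an answer-key filter of this kind. It is not the Phi-4 team's actual pipeline: `sample_answers` stands in for your own base-model sampling wrapper, and the solve-rate thresholds are illustrative assumptions.

```python
# Minimal sketch of an edge-of-ability filter (illustrative, not the
# Phi-4 pipeline). `sample_answers(question, n)` is a placeholder for
# your own base-model sampling wrapper; thresholds are assumptions.

MIN_SOLVE_RATE = 0.2   # below this: too hard, no learning signal
MAX_SOLVE_RATE = 0.8   # above this: too easy, already mastered

def solve_rate(question, reference, sample_answers, n=8):
    """Fraction of n sampled base-model answers matching the answer key."""
    answers = sample_answers(question, n=n)
    return sum(a.strip() == reference.strip() for a in answers) / n

def filter_teachable(seed_pairs, sample_answers):
    """Keep (question, reference) pairs in the 'teachable' band.

    seed_pairs: iterable of (question, reference_answer), where the
    reference came from a stronger model (e.g., GPT-4) acting as key.
    """
    teachable = []
    for question, reference in seed_pairs:
        rate = solve_rate(question, reference, sample_answers)
        if MIN_SOLVE_RATE <= rate <= MAX_SOLVE_RATE:
            teachable.append((question, reference))
    return teachable
```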
Independent domain optimization

Phi-4 reasoning’s data are grouped by domain (math, coding, puzzles, safety, etc.). Rather than blending everything at once, the team tunes each domain’s mix separately and then merges them. This relies on an “additive property”: optimizing math data in isolation and code data in isolation yields data mixes that, when concatenated, still give gains in both areas. In practice, they first tuned the math dataset to saturation on math benchmarks, then did the same for code, and finally simply added the code data into the math recipe. The result was improved performance on both math and coding tasks, without retraining from scratch.

This modular approach offers clear practical advantages: a small team can first refine just the math dataset, achieve strong math performance, and later add the coding data without redoing the math tuning. However, the Phi-4 authors caution that scaling this method to many domains remains an open question. While the approach “worked very well” for their math-plus-code mix, they note, “it is not known whether this method can scale to dozens or hundreds of domains,” a direction they acknowledge as a valuable area for future research. In short, the additive strategy is effective, but expanding into new domains must be approached carefully, as it may introduce unforeseen interactions.

Despite these potential pitfalls, the additive strategy proved effective in Phi-4 reasoning. By treating each domain independently, the team avoided complex joint optimization and narrowed the search space for data mixtures. The approach also allows incremental scaling of domains: teams can begin by tuning the math SFT, then incorporate the code dataset, and later expand to additional specialized tasks, all while maintaining prior performance gains. This is a practical advantage for resource-constrained teams; instead of requiring a large group of experts to manage a complex, multi-domain dataset, a small team can focus on one data silo at a time.
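The control flow of that additive loop is easy to express in code. Below is a deliberately toy sketch: the fine-tune-and-evaluate step is stubbed out so the flow runs as-is, and the pools, benchmark names, and stub scorer are all illustrative assumptions rather than Microsoft's tooling.

```python
# Toy sketch of the additive strategy: tune each domain's mix in
# isolation, freeze it, then concatenate. The stub scorer stands in
# for a real SFT run plus benchmark evaluation.

math_pool = [f"math-example-{i}" for i in range(100)]
code_pool = [f"code-example-{i}" for i in range(100)]

def finetune_and_score(mix, benchmark):
    """Stub: fine-tune on `mix`, return a benchmark score in [0, 1]."""
    return min(1.0, 0.5 + 0.005 * len(mix))  # placeholder learning curve

def tune_domain(pool, benchmark):
    """Grow one domain's mix until its benchmark saturates, then freeze it."""
    mix, best = [], 0.0
    for example in pool:
        score = finetune_and_score(mix + [example], benchmark)
        if score <= best:
            break              # saturated: stop adding data to this mix
        mix.append(example)
        best = score
    return mix

math_mix = tune_domain(math_pool, "math-benchmark")  # optimized alone
code_mix = tune_domain(code_pool, "code-benchmark")  # optimized alone
combined = math_mix + code_mix  # the 'additive' step: plain concatenation
```

The design point is that `combined` is built by concatenation, not by a joint search over mixtures, which is what keeps the approach tractable for a small team.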
Synthetic data transformation

Some reasoning problems, such as abstract proofs or creative tasks, are difficult to verify automatically, yet automated verification (for RL reward shaping) is very valuable. Phi-4 reasoning tackled this by transforming hard prompts into easier-to-check forms. For example, the team rewrote a subset of coding problems as word puzzles and converted some math problems to have concise numeric answers. These “synthetic seed data” preserve the underlying reasoning challenge but make correctness easier to test. Think of it as giving the model a simplified version of the riddle that still teaches the same logic. This engineering hack enables downstream RL to use clear reward signals on tasks that would otherwise be too open-ended. Here’s an example of synthetic data transformation:

| Raw web data | Synthetic data |
|---|---|
| On the sides AB and BC of triangle ABC, points M and N are taken, respectively. It turns out that the perimeter of △AMC is equal to the perimeter of △CNA, and the perimeter of △ANB is equal to the perimeter of △CMB. Prove that △ABC is isosceles. | ABC is a triangle with AB=13 and BC=10. On the sides AB and BC of triangle ABC, points M and N are taken, respectively. It turns out that the perimeter of △AMC is equal to the perimeter of △CNA, and the perimeter of △ANB is equal to the perimeter of △CMB. What is AC? |

Table: Rewriting seed data from the web (left) into verifiable synthetic questions for SFT and RL (right). Source: Microsoft

Note that by assigning numeric values (AB=13, BC=10) and asking “What is AC?”, the answer becomes a single number that can easily be checked for correctness.

Other teams have applied similar domain-specific tricks. For example, chemistry LLMs like FutureHouse’s ether0 model generate molecules under strict pKa or structural constraints, using crafted reward functions to ensure valid chemistry. In mathematics, the Kimina-Prover model by Numina translates natural-language theorems into the Lean formal system so that reinforcement learning can verify correct proofs. These examples highlight how synthetic augmentation, when paired with verifiable constraints, can push models to perform well in highly specialized domains.

In practical terms, engineers should embrace synthetic data but keep it grounded. Heuristics like “convert to numeric answers” or “decompose a proof into checkable steps” can make training safer and more efficient. At the same time, maintain a pipeline of real (organic) problems to ensure breadth. The key is balance: use synthetic transformations to unlock difficult verification problems, but don’t rely on them exclusively, because real-world diversity still matters. Following this approach, the model is guided toward a clearly defined, discrete objective.

Practical implementation for enterprises

AI teams looking to apply Phi-4 reasoning’s insights can follow a series of concrete steps to implement the approach effectively.

Identifying the model’s edge

Detect your model’s “edge” by identifying where the base LLM struggles. One way is to use its confidence or agreement scores: generate several answers per prompt (using an open-source inference engine like vLLM for fast sampling) and see where consensus breaks. The prompts at the margin of confidence are your teachable examples. By focusing on these low-confidence questions rather than the questions the model already gets right, you ensure each new example is worth learning.
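Here is a sketch of that consensus check using vLLM's offline batch API. The model id, the agreement threshold, and the last-line answer-extraction heuristic are assumptions to adapt per task, not a prescribed recipe.

```python
from collections import Counter

from vllm import LLM, SamplingParams

# Sketch: sample several completions per prompt and keep the prompts
# where self-consistency is weakest. Model id, threshold, and the
# answer-extraction heuristic below are illustrative assumptions.
llm = LLM(model="microsoft/phi-4")
params = SamplingParams(n=8, temperature=0.8, max_tokens=512)

def extract_answer(text):
    """Toy extraction: take the last non-empty line as the final answer."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def edge_prompts(prompts, max_agreement=0.6):
    """Return prompts where fewer than `max_agreement` of samples agree."""
    kept = []
    for prompt, output in zip(prompts, llm.generate(prompts, params)):
        answers = [extract_answer(c.text) for c in output.outputs]
        _, top = Counter(answers).most_common(1)[0]
        if top / len(answers) < max_agreement:
            kept.append(prompt)  # low consensus: likely a teachable example
    return kept
```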
Isolating domains for targeted tuning

Tune one domain at a time rather than mixing all data genres upfront. Pick the highest-value domain for your application (math, code, legal, etc.) and craft a small SFT dataset for just that. Iterate on the mix (balancing difficulty, source types, etc.) until performance saturates on domain-specific benchmarks. Then freeze that mix and add the next domain. This modular tuning follows Phi-4 reasoning’s “additive” strategy: it avoids cross-talk, since you preserve gains in domain A even as you improve domain B.

Expanding with synthetic augmentation

Leverage synthetic augmentation when gold-standard answers are scarce or unverifiable. For instance, if you need to teach a proof assistant but can’t autocheck proofs, transform them into arithmetic puzzles or shorter proofs that can be verified. Use your LLM to rewrite or generate these variants (Phi-4 used this to turn complex word problems into numeric ones). Synthetic augmentation also lets you expand data cheaply: once you have a validated small set, you can “multiply” it by having the LLM generate paraphrases, variations, or intermediate reasoning steps.
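As a sketch of what the rewrite step can look like with an off-the-shelf API: the prompt wording, model choice, and two-part output format below are our own assumptions, not Phi-4's actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any capable LLM works

# Our own rewrite prompt, echoing the AB=13/BC=10 example above: keep
# the reasoning, pin down free quantities, demand one numeric answer.
REWRITE_TEMPLATE = (
    "Rewrite the following problem so it keeps the same underlying "
    "reasoning but has a single short numeric answer. Assign concrete "
    "values to any underdetermined quantities. Reply with the new "
    "problem first, then the numeric answer alone on the last line.\n\n"
    "Problem: {p}"
)

def make_verifiable(problem):
    """Ask a rewriter model for a numeric-answer variant plus its key."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder rewriter; swap in your own choice
        messages=[{"role": "user",
                   "content": REWRITE_TEMPLATE.format(p=problem)}],
    )
    text = resp.choices[0].message.content.strip()
    new_problem, answer = text.rsplit("\n", 1)  # sketch: assumes 2+ lines
    return new_problem.strip(), answer.strip()

def exact_match_reward(model_answer, gold):
    """The binary RL reward the rewrite enables: 1.0 on exact match."""
    return float(model_answer.strip() == gold.strip())
```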
Scaling through a two-phase strategy

Use a two-phase training strategy that begins with exploration and is followed by scaling. In Phase 1 (exploration), run short fine-tuning experiments on a focused dataset (e.g., one domain) with limited compute. Track a few key metrics (benchmarks or held-out tasks) each run, and rapidly iterate on hyperparameters and data mixes. The Phi-4 paper demonstrates that this speeds up progress: small experiments helped the team discover a robust recipe before scaling up. Only once you see consistent gains do you move to Phase 2 (scaling), where you combine your verified recipes across domains and train longer (in Phi-4’s case, about 16 billion tokens). Although this stage is more compute-intensive, the risk is significantly reduced by the prior experimentation. Monitor for trigger points such as a significant uplift on validation tasks or stable metric trends; when those appear, it’s time to scale. If not, refine the recipe first. This disciplined two-phase loop saves resources and keeps the team agile.

In practice, many teams at Hugging Face and elsewhere have followed similar advice. For example, while developing the conversational model SmolLM2, the team noticed poor chat performance in Phase 1. They then generated roughly 500K synthetic multi-turn dialogues and re-trained, which “significantly improved both downstream performance and its overall ‘vibes,’” as one researcher reports. That is a concrete win achieved through a targeted synthetic-data injection based on an initial feedback loop.

How to do this now

Here is a simple checklist you can follow to put these ideas into action:

1. Pick a target domain/task. Choose one area (e.g., math, coding, or a specific application) where you need better performance. This keeps the project focused.
2. Collect a small seed dataset. Gather, say, a few thousand prompt-answer pairs in that domain from existing sources (textbooks, GitHub, etc.).
3. Filter for edge-of-ability examples. Use a strong model (e.g., GPT-4) to create an answer key for each prompt, then run your base model on those prompts. Keep examples the base model often misses; discard ones it already solves or is hopeless on. This yields “teachable” examples.
4. Fine-tune your model (Phase 1). Run a short SFT job on this curated data. Track performance on a held-out set or benchmark.
5. Iterate. Refine the data mix, remove easy questions, and add new teachable ones until gains taper off.
6. Add synthetic examples if needed. If some concepts lack auto-verifiable answers (like long proofs), create simpler numeric or single-answer variants using your LLM. This gives clear rewards for RL. Keep a balance with real problems.
7. Expand to the next domain. Once one domain is tuned, “freeze” its dataset. Pick a second high-value domain and repeat steps 3 to 5 on that data mix. Finally, merge the data for both domains and do a final, longer training run (Phase 2).
8. Monitor benchmarks carefully. Use a consistent evaluation methodology (like majority-voting runs) to avoid misleading results. Only proceed to full-scale training if the small experiments show clear improvements.

Limits and trade-offs

Despite the effectiveness of the Phi-4 training method, several limitations and practical considerations remain. One key challenge is domain scaling: while Phi-4’s additive method worked well for math and code, it has yet to be proven across many domains, and the authors acknowledge that it remains an open question whether the approach can scale smoothly to dozens of topics. Another concern is the use of synthetic data: relying too heavily on synthetic rewrites can reduce the diversity of the dataset, so it is crucial to maintain a balance between real and synthetic examples to preserve the model's ability to reason effectively. Lastly, while the repeatable SFT method helps reduce computational costs, it doesn’t eliminate the need for thoughtful curation; even though the approach is more efficient than brute-force scaling, it still requires careful data selection and iteration.

Lessons from Phi-4

The Phi-4 reasoning story is clear: bigger isn’t always better for reasoning models. Instead of blindly scaling, the team asked where learning happens and engineered their data to hit that sweet spot. They show that “the benefit of careful data curation for supervised fine-tuning extends to reasoning models.” In other words, with a smart curriculum, you can squeeze surprising capability out of modest models.

For engineers, the takeaway is actionable: you don’t need a billion-dollar cluster or an endless internet crawl to improve reasoning. For resource-strapped teams, this is good news, because a careful data strategy lets you punch above your weight. Phi-4 reasoning shows that methodical data and training design, not sheer parameter count, drives advanced reasoning. By focusing on teachable data and iterative tuning, even a 14B model surpassed much larger rivals. For AI teams today, this offers a practical blueprint: refine the data, iterate fast, and scale only when the signals are right. These steps can unlock breakthrough reasoning performance without breaking the bank.

#ai
#llm
#research
#open_source
Score · 2.35
Decoding Binary Numeric Expressions, (Mon, Nov 17th)
news
SANS Internet Storm Center, InfoCON: green · 4 days ago

In diary entry "Formbook Delivered Through Multiple Scripts", Xavier mentions that the following line:

Score · 2.43
Crowded EM Trades Draw Warnings From Global Money Managers - Bloomberg.com
news
Bloomberg (Google News) · 4 days ago


Score · 2.29
Paul McCartney joins music industry protest against AI with silent track
news
The Guardian · 4 days ago

Former Beatle and artists including Sam Fender, Kate Bush and Hans Zimmer record silent LP Is This What We Want

At two minutes 45 seconds it’s about the same length as With a Little Help From My Friends. But Paul McCartney’s first new recording in five years lacks the sing-along tune and jaunty guitar chops, because there’s barely anything there. The former Beatle, arguably Britain’s greatest living songwriter, is releasing a track of an almost completely silent recording studio as part of a music industry protest against copyright theft by artificial intelligence companies.

#ai
Score · 2.33
How to hide from a bubble
news
Financial Times · 4 days ago

It ain’t easy

#ai
Score · 2.37
Rust Adoption Drives Android Memory Safety Bugs Below 20% for First Time
news
The Hacker News · 4 days ago

Google has disclosed that the company's continued adoption of the Rust programming language in Android has resulted in the number of memory safety vulnerabilities falling below 20% of total vulnerabilities for the first time. "We adopted Rust for its security and are seeing a 1000x reduction in memory safety vulnerability density compared to Android’s C and C++ code. But the biggest surprise was

Score · 2.37
Morning Bid: Nvidia earnings likely to overshadow delayed US data - Reuters
news
Reuters (Google News) · 4 days ago


Score · 2.27
Korean Giants Pledge Record Investments to Spur Growth at Home - Bloomberg.com
news
Bloomberg (Google News) · 5 days ago


Score · 2.26
ISC Stormcast For Monday, November 17th, 2025 https://isc.sans.edu/podcastdetail/9702, (Mon, Nov 17th)
news
SANS Internet Storm Center, InfoCON: green · 5 days ago


#ai
Score · 2.38
Your Amazon driver may soon deliver with these smart glasses on - why that's good news - ZDNET
news
ZDNET (Google News) · 5 days ago


Score · 2.19
Easily Build and Share ROCm Kernels with Hugging Face
news
Hugging Face · 5 days ago
Score · 2.42
Warburg-Backed Princeton Enters Korea With First Data Center - Bloomberg.com
news
Bloomberg (Google News) · 5 days ago


Score · 2.16
US Sanctions Propel Chinese AI Prodigy to $23 Billion Fortune - Bloomberg.com
news
Bloomberg (Google News) · 5 days ago


#ai
Score · 2.20
Google DeepMind Introduces SIMA 2, A Gemini Powered Generalist Agent For Complex 3D Virtual Worlds
news
MarkTechPost · 5 days ago

Google DeepMind has released SIMA 2 to test how far generalist embodied agents can go inside complex 3D game worlds. SIMA’s (Scalable Instructable Multiworld Agent) new version upgrades the original instruction follower into a Gemini-driven system that reasons about goals, explains its plans, and improves from self-play in many different environments. From SIMA […]

#ai
#product
Score · 2.25
AI Interview Series #2: Explain Some of the Common Model Context Protocol (MCP) Security Vulnerabilities
news
MarkTechPost · 5 days ago

In this part of the Interview Series, we’ll look at some of the common security vulnerabilities in the Model Context Protocol (MCP) — a framework designed to let LLMs safely interact with external tools and data sources. While MCP brings structure and transparency to how models access context, it also introduces new security risks if […]

#ai
#llm
Score · 2.25
Physicists reveal a new quantum state where electrons run wild
news
ScienceDaily · 5 days ago

Electrons can freeze into strange geometric crystals and then melt back into liquid-like motion under the right quantum conditions. Researchers identified how to tune these transitions and even discovered a bizarre “pinball” state where some electrons stay locked in place while others dart around freely. Their simulations help explain how these phases form and how they might be harnessed for advanced quantum technologies.

#ai
#research
Score · 2.12
#301 Hemant Banavar & Ryan Paulson: The AI Safety System Driving Toward Zero Harm
podcast
Eye on AI · 5 days ago

How are AI and telematics changing safety for fleets in the real world, and what does it take to get from basic recordings to true accident prevention?

In this episode of Eye on AI, host Craig Smith speaks with Hemant Banavar, Chief Product Officer at Motive, and Ryan Paulson, CIO at Fusion Site Services, to explore how AI-powered cameras and telematics are transforming safety, productivity, and profitability across the physical economy, from trucking and construction to field services. We look at what makes safety AI trustworthy at scale, how to reduce false alerts that drivers ignore, and how to combine in-cab coaching, human review, and rich telematics data to drive down risky behaviors.

Learn how Fusion Site Services cut unsafe events by more than ninety percent while tripling in size, slashed insurance claims and premiums, and used real-time insights to tackle idling, under-utilized assets, and the hidden costs of unsafe operations. You will also hear how leading fleets run side-by-side vendor tests, design incentive programs that get drivers on board with cameras, and build a culture around zero preventable accidents. If you are responsible for safety, operations, or risk, this episode will show you how to evaluate AI and telematics platforms, which benchmarks to demand, and how to turn your data into safer roads and stronger unit economics.

Stay Updated: Craig Smith on X: https://x.com/craigss Eye on A.I. on X: https://x.com/EyeOn_AI

#ai
#product
Score · 2.10
How Google’s DeepMind tool is ‘more quickly’ forecasting hurricane behavior
news
The Guardian · 5 days ago

‘Less expensive and time consuming’ model helps with fast and accurate predictions, possibly saving lives and property

When then-Tropical Storm Melissa was churning south of Haiti, Philippe Papin, a National Hurricane Center (NHC) meteorologist, had confidence it was about to grow into a monster hurricane. As the lead forecaster on duty, he predicted that in just 24 hours the storm would become a category 4 hurricane and begin a turn towards the coast of Jamaica. No NHC forecaster had ever issued such a bold forecast for rapid strengthening.

#ai
Score · 2.20
Use Google Gemini and ChatGPT to Organize Your Life With Scheduled Actions
news
WIRED · 5 days ago

The AI’s latest trick is following the schedule you set for it.

#ai
Score · 2.19
Comparing the Top 4 Agentic AI Browsers in 2025: Atlas vs Copilot Mode vs Dia vs Comet
news
MarkTechPost · 5 days ago

Agentic AI browsers are moving the model from ‘answering about the web’ to operating on the web. In 2025, four AI browsers define this space: OpenAI’s ChatGPT Atlas, Microsoft Edge with Copilot Mode, The Browser Company’s Dia, and Perplexity’s Comet. Each makes different design choices around autonomy, memory, and privacy. This article compares their architectures, […]

#ai
Score · 2.17