
Muse Spark: Scaling towards personal superintelligence

Today, we’re excited to introduce Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.

Muse Spark is the first step on our scaling ladder and the first product of a ground-up overhaul of our AI efforts. To support further scaling, we are making strategic investments across the entire stack — from research and model training to infrastructure, including the Hyperion data center.

In this post, we'll first explore Muse Spark's new capabilities and applications. Then we’ll look behind the curtain at the scaling axes driving our progress toward personal superintelligence.

Muse Spark is available today at meta.ai and in the Meta AI app. We’re also opening a private API preview to select users.

Capabilities for Personal Superintelligence

Muse Spark offers competitive performance in multimodal perception, reasoning, health, and agentic tasks. We continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows.

With larger models in development, these results demonstrate that our stack is scaling effectively.

We’re also releasing Contemplating mode, which orchestrates multiple agents that reason in parallel. This allows Muse Spark to compete with the extreme reasoning modes of frontier models such as Gemini Deep Think and GPT Pro. Contemplating mode delivers significant capability improvements on challenging tasks, achieving 58% on Humanity’s Last Exam and 38% on FrontierScience Research.
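The orchestration pattern behind parallel-agent reasoning can be sketched as fan-out plus aggregation. This is a minimal illustration, not Contemplating mode's actual implementation: `run_agent` is a hypothetical stand-in for sampling an independent reasoning trace from the model, and majority voting is only one of several possible aggregation schemes.

```python
# Sketch: fan out independent reasoning attempts in parallel, then
# aggregate final answers by majority vote. Illustrative only.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_agent(problem: str, seed: int) -> str:
    # Hypothetical placeholder: a real implementation would sample an
    # independent reasoning trace from the model. Here we fake a set of
    # mostly-agreeing answers so the sketch is runnable.
    candidates = ["42", "42", "41"]
    return candidates[seed % len(candidates)]


def contemplate(problem: str, n_agents: int = 8) -> str:
    # Run n_agents attempts concurrently; wall-clock time tracks a single
    # attempt rather than the sum of all attempts.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: run_agent(problem, s), range(n_agents)))
    # Return the most common final answer across agents.
    return Counter(answers).most_common(1)[0][0]
```

In practice the aggregation step could also be a learned verifier or a debate between agents; majority voting is simply the easiest scheme to show.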

Applications

Muse Spark is the first step toward a personal superintelligence that understands your world. From analyzing your immediate environment to supporting your wellness, the advanced reasoning capabilities of Muse Spark enable powerful, highly personal use cases.

Multimodal. Muse Spark is built from the ground up to integrate visual information across domains and tools. It achieves strong performance on visual STEM questions, entity recognition, and localization. These capabilities come together to enable interactive experiences like creating fun minigames or troubleshooting your home appliances with dynamic annotations.

Health. One major application of personal superintelligence is to help people learn about and improve their health. To improve Muse Spark's health reasoning capabilities, we collaborated with over 1,000 physicians to curate training data that enables more factual and comprehensive responses. Muse Spark can generate interactive displays that unpack and explain health information such as the nutritional content of various foods or muscles activated during exercise.

Prompt: Can you turn this into a sudoku game that I can play in the web?

Prompt: Identify the key components of the coffee machine and grinder, and create an interactive tutorial of using this machine to make a latte with a simple webpage, when I hover on the steps, it will highlight bounding boxes of the components.

Prompt: I am pescatarian with high cholesterol. Put green dots on recommended food and red dots on not recommended food. Don’t duplicate dots and make sure the dots are localized properly. When hovering over the dot, show personalized justification and “health score” out of 10, along with calories and carbs, protein, and fat. Health score numbers should appear right above the dot without hovering. The description that shows when hovering should go above all other dots.

Prompt: For both images, show me which muscles are being stretched and its difficulty. When hovering over the dot, tell me more about the muscle group with how to fix my form. I want to get better at yoga. Make a side by side with my partner, and rate both of us on a scale of 1 to 10.

Scaling Axes

To build personal superintelligence, our model’s capabilities should scale predictably and efficiently. Below, we share how we study and track Muse Spark's scaling properties along three axes: pretraining, reinforcement learning, and test-time reasoning.

Pretraining. The pretraining phase is where Muse Spark acquires its core multimodal understanding, reasoning, and coding abilities — the foundation that reinforcement learning and test-time compute build upon.

Over the last nine months, we rebuilt our pretraining stack with improvements to model architecture, optimization, and data curation. Together, these advancements increase the capability we can extract from every unit of compute. To rigorously evaluate our new recipe, we fit a scaling law to a series of small models and compare the training FLOPs required to hit a specific level of performance. The results are clear: we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.
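An iso-performance comparison like the one above can be sketched with a simple power-law fit: fit L(C) = a · C^(−b) to each recipe's small-model runs, invert the fit at a target loss, and take the ratio of required compute. The function names and loss numbers below are illustrative assumptions, not Meta's actual measurements.

```python
# Sketch: fit a power-law scaling curve to small training runs and compare
# the compute two recipes need to reach the same loss. Toy numbers only.
import math


def fit_power_law(compute, loss):
    # Linear least squares in log-log space: log L = log a - b * log C.
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
        (x - xbar) ** 2 for x in xs
    )
    b = -slope
    a = math.exp(ybar + b * xbar)
    return a, b


def compute_to_reach(a, b, target_loss):
    # Invert L = a * C**(-b)  =>  C = (a / L)**(1 / b).
    return (a / target_loss) ** (1.0 / b)


# Synthetic runs: the "new" recipe hits the same losses with 10x less
# compute, so the iso-loss compute ratio comes out to ~10.
old_fit = fit_power_law([1e18, 1e19, 1e20], [3.2, 2.9, 2.6])
new_fit = fit_power_law([1e17, 1e18, 1e19], [3.2, 2.9, 2.6])
ratio = compute_to_reach(*old_fit, 2.7) / compute_to_reach(*new_fit, 2.7)
```

Comparing recipes at iso-loss rather than iso-compute is what makes the "order of magnitude less compute" framing well-defined.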

Reinforcement Learning. After pretraining, reinforcement learning (RL) leverages compute to scalably amplify model capabilities. Even though large-scale RL is notoriously prone to instability, our new stack delivers smooth, predictable gains.

The plots below show the benefits of scaling RL compute (measured in steps) for Muse Spark. On the left, we see log-linear growth in pass@1 and pass@16 (at least one success across 16 attempts) on the training data. This indicates that RL is improving model reliability without compromising reasoning diversity. On the right, accuracy growth on a held-out evaluation set establishes that the gains from RL predictably generalize: Muse Spark smoothly improves on tasks that were not seen in training.
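Curves like pass@1 and pass@16 are typically computed with the standard unbiased pass@k estimator (Chen et al., 2021): given n samples per problem of which c are correct, pass@k = 1 − C(n−c, k) / C(n, k). We don't know Meta's exact methodology, but this is the conventional formula.

```python
# Unbiased pass@k estimator: the probability that at least one of k
# samples drawn without replacement from n attempts (c correct) succeeds.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # attempts must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 correct out of 16 attempts, pass@1 is 0.25 while pass@16 is 1.0, which is why tracking both reveals whether RL is improving reliability without collapsing diversity.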

Test-Time Reasoning. RL trains our models to "think" before they answer — a process known as test-time reasoning. Serving this capability to billions of users requires efficient use of reasoning tokens. To achieve this, we rely on two key levers: thinking time penalties to optimize token use, and multi-agent orchestration that boosts performance without slowing down response times.

To deliver the most intelligence per token, our RL training maximizes correctness subject to a penalty on thinking time. On a subset of evaluations such as AIME, this causes a phase transition. After an initial period where the model improves by thinking longer, the length penalty causes thought compression — Muse Spark compresses its reasoning to solve problems using significantly fewer tokens. After compressing, the model again extends its solutions to achieve stronger performance.
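A correctness objective with a thinking-time penalty can be sketched as a one-line reward. The linear penalty form and the coefficient below are assumptions for illustration; Muse Spark's actual objective is not public.

```python
# Sketch: RL reward that maximizes correctness subject to a penalty on
# thinking tokens. Penalty form and coefficient are assumed, not Meta's.
def reward(correct: bool, thinking_tokens: int, penalty: float = 1e-4) -> float:
    # A correct but verbose solution scores below a correct, concise one,
    # pushing the model toward thought compression.
    return float(correct) - penalty * thinking_tokens
```

Under such a reward, once the model saturates on correctness the only way to keep improving is to shorten its reasoning, which is consistent with the compression-then-extension dynamic described above.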

To spend more test-time reasoning without drastically increasing latency, we can scale the number of parallel agents that collaborate to solve hard problems. The figure below illustrates the benefits of this approach. While standard test-time scaling has a single agent think for longer, scaling Muse Spark with multi-agent thinking enables superior performance with comparable latency.
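The latency argument can be made concrete with a back-of-envelope model: if agents decode concurrently at the same rate and aggregation cost is negligible, wall-clock time tracks the per-agent token budget rather than the total. The decode rate below is an arbitrary illustrative number.

```python
# Back-of-envelope latency model for single-agent vs. multi-agent
# test-time scaling, assuming concurrent decoding at equal tokens/sec
# and negligible aggregation overhead. Numbers are illustrative.
def wall_clock_seconds(total_tokens: int, n_agents: int,
                       tokens_per_sec: float = 50.0) -> float:
    # The token budget is split evenly across parallel agents, so
    # latency scales with the per-agent share, not the total spend.
    per_agent = total_tokens / n_agents
    return per_agent / tokens_per_sec


single = wall_clock_seconds(16000, 1)  # one agent thinks for all 16k tokens
multi = wall_clock_seconds(16000, 8)   # eight agents share the same budget
```

Under these assumptions, eight agents spend the same total reasoning compute in one-eighth the wall-clock time, which is the core of the latency claim.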

Safety

Muse Spark has broad reasoning capabilities across dual-use scientific domains, so we conducted extensive safety evaluations before deployment. Our process follows the updated Advanced AI Scaling Framework, which defines threat models, evaluation protocols, and deployment thresholds for our most advanced models. We evaluated Muse Spark both before and after applying safety mitigations across frontier risk categories, behavioral alignment, and adversarial robustness.

We found that Muse Spark demonstrates strong refusal behavior across high-risk domains such as biological and chemical weapons, enabled by pretraining data filtering, safety-focused post-training, and system-level guardrails. In the Cybersecurity and Loss of Control domains, Muse Spark does not exhibit the autonomous capability or hazardous tendencies needed to realize threat scenarios. Our evaluations show Muse Spark falls within safe margins across all frontier risk categories we measured given its deployment context. Full results will be available in our upcoming Safety & Preparedness Report.

In third-party evaluations on a near-launch checkpoint, Apollo Research found that Muse Spark demonstrated the highest rate of evaluation awareness of any model they have observed. The model frequently identified scenarios as "alignment traps" and reasoned that it should behave honestly because it was being evaluated. This matters because models that recognize evaluation contexts may behave differently during testing than in deployment. These results do not confirm that awareness directly alters behavior; however, our own follow-up investigation found initial evidence that evaluation awareness may affect model behavior on a small subset of alignment evaluations, none related to the hazardous capabilities or propensities that inform launch decisions. We concluded this was not a blocking concern for release, though it warrants further research. Read more in our upcoming Safety & Preparedness Report.

Conclusion

With Muse Spark, we're on a predictable and efficient scaling trajectory. We look forward to sharing increasingly capable models on the path to personal superintelligence soon.