Meta Introduces SAM Audio for Advanced Sound Isolation Using Multimodal Prompts

Tony Kim
Dec 16, 2025 16:47

Meta’s SAM Audio separates individual sounds from an audio mixture using text, visual, or time-span prompts. Meta reports state-of-the-art results across a range of audio separation tasks.

Meta AI has unveiled SAM Audio, a model that isolates individual sounds from complex audio mixtures using multimodal prompts. Users can separate audio components by typing text, selecting visual cues, or marking time segments, according to Meta AI.

Revolutionizing Audio Processing

SAM Audio is built on Perception Encoder Audiovisual (PE-AV), the encoder that drives its performance across audio separation tasks. The model mirrors the Segment Anything Model (SAM), which revolutionized object segmentation in images and videos. SAM Audio aims to make audio separation more accessible and practical by matching the way people naturally interact with sound.

Technical Innovations

SAM Audio accepts prompts in several modalities (text, visual, and temporal cues), giving users precise control over which sound is separated. It supports three prompting methods:

Text Prompting: Allows users to type specific sounds, like “dog barking,” to isolate them.
Visual Prompting: Enables clicking on objects or speakers in videos to isolate their audio.
Span Prompting: An innovative approach allowing users to mark time segments for target audio isolation.
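
The three prompt types above can be pictured as simple data records. The sketch below is illustrative only: the class names, fields, and the validate helper are hypothetical and are not taken from Meta’s released code or API.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TextPrompt:
    """A short description of the target sound, e.g. "dog barking"."""
    description: str


@dataclass
class VisualPrompt:
    """A click on an object or speaker in one video frame (pixel coordinates)."""
    frame_index: int
    x: int
    y: int


@dataclass
class SpanPrompt:
    """Time segments, in seconds, in which the target sound is audible."""
    spans: List[Tuple[float, float]]


# Hypothetical union type: any one of the three prompt kinds.
Prompt = Union[TextPrompt, VisualPrompt, SpanPrompt]


def validate(prompt: Prompt, duration_s: float) -> None:
    """Raise ValueError if the prompt is malformed for a clip of this length."""
    if isinstance(prompt, TextPrompt):
        if not prompt.description.strip():
            raise ValueError("text prompt must not be empty")
    elif isinstance(prompt, VisualPrompt):
        if prompt.frame_index < 0 or prompt.x < 0 or prompt.y < 0:
            raise ValueError("frame index and click coordinates must be non-negative")
    elif isinstance(prompt, SpanPrompt):
        if not prompt.spans:
            raise ValueError("span prompt needs at least one time segment")
        for start, end in prompt.spans:
            if not 0.0 <= start < end <= duration_s:
                raise ValueError(f"span ({start}, {end}) does not fit the clip")
    else:
        raise TypeError(f"unsupported prompt type: {type(prompt).__name__}")


# Example: one prompt of each kind for a 10-second clip.
for p in (TextPrompt("dog barking"), VisualPrompt(0, 320, 180), SpanPrompt([(1.0, 2.5)])):
    validate(p, duration_s=10.0)
```

Keeping the prompt kinds as separate records makes it easy to check each one before it reaches a model, and to combine several of them in one request.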

The model’s architecture leverages a flow-matching diffusion transformer, encoding audio mixtures and prompts into a shared representation to generate target and residual audio tracks. This is supported by a robust data engine that synthesizes large-scale, high-quality separation data, enhancing the model’s applicability in real-world scenarios.
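
To give a feel for the flow-matching idea, the toy sketch below integrates a velocity field from random noise toward a target signal with fixed-step Euler updates. It is a minimal sketch under stated assumptions, not SAM Audio’s implementation: the oracle velocity function stands in for the learned transformer (which would be conditioned on the mixture and the prompt), the signals are four-sample lists, and the residual is simply computed as mixture minus estimate for illustration.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the oracle never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a learned network (an assumption of this sketch).

    On the straight path x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t to the target at t=1 is (target - x_t) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
mixture = [0.5, -0.2, 0.9, 0.1]   # toy "audio mixture"
target = [0.4, -0.1, 0.3, 0.0]    # toy isolated sound the oracle points toward
noise = [random.gauss(0.0, 1.0) for _ in mixture]

estimate = euler_sample(make_oracle_velocity(target), noise, steps=8)
# Illustration only: here the residual is whatever the estimate leaves behind.
residual = [m - e for m, e in zip(mixture, estimate)]
```

In a real system the velocity would come from a trained network rather than from the known target, and the quality of the result would depend on how well that network was trained.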

PE-AV: The Engine Behind SAM Audio

PE-AV, built on Meta’s open-source Perception Encoder model, extends advanced computer vision capabilities to audio. It aligns video features with audio, allowing accurate separation of visually grounded sources and inferring off-screen events. This temporal alignment supports high-precision multimodal audio separation, crucial for flexible and perceptually accurate outcomes.
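
The article does not describe how PE-AV performs this alignment. As a generic illustration of what putting video and audio on a common timeline can involve, the sketch below maps each audio feature frame to the video frame being shown at the same instant. The function name, the frame rates, and the nearest-preceding-frame rule are all assumptions made for this example.

```python
from typing import List


def video_index_per_audio_frame(num_audio_frames: int, audio_rate_hz: float,
                                num_video_frames: int, video_fps: float) -> List[int]:
    """For each audio feature frame, return the index of the video frame on screen.

    Audio frame i starts at time i / audio_rate_hz seconds; the video frame
    displayed at that time is floor(time * video_fps), clamped to the last frame.
    """
    indices = []
    for i in range(num_audio_frames):
        t = i / audio_rate_hz
        idx = int(t * video_fps)
        indices.append(min(idx, num_video_frames - 1))
    return indices


# Example: audio features at 4 frames per second, video at 2 frames per second.
mapping = video_index_per_audio_frame(8, 4.0, 4, 2.0)
```

With a mapping like this, each audio frame can be paired with the visual features of the frame it co-occurs with, which is the kind of pairing a temporally aligned audiovisual encoder relies on.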

Benchmarking and Evaluation

Meta has introduced SAM Audio Judge and SAM Audio-Bench to evaluate and benchmark audio separation models. SAM Audio Judge offers a reference-free, objective metric for assessing audio separation quality, while SAM Audio-Bench provides a comprehensive benchmark covering speech, music, and general sound effects using multimodal prompts.
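
To see why a reference-free metric matters, it helps to look at a conventional reference-based one. The sketch below computes scale-invariant signal-to-distortion ratio (SI-SDR), a widely used separation metric that requires the clean target signal, something real-world recordings rarely provide. The article does not mention SI-SDR, and this is not how SAM Audio Judge works; it is shown only as a contrast.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant SDR in dB; needs the clean reference signal.

    Assumes the reference is not silent and the estimate is not an exact
    scaled copy of it (otherwise the ratio is undefined).
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    # Part of the estimate that lies along the reference.
    projection = [scale * r for r in reference]
    # Everything else counts as distortion.
    error = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    error_energy = sum(x * x for x in error)
    return 10.0 * math.log10(signal_energy / error_energy)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
score_db = si_sdr(estimate, reference)
```

Because the function cannot run without the reference argument, it cannot score a separation of a recording whose clean sources were never captured, which is the gap a reference-free judge is meant to fill.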

According to Meta, SAM Audio achieves state-of-the-art results across various audio separation tasks and outperforms previous models in efficiency and quality. Challenges remain, such as separating similar audio events, but the model’s ability to handle mixed-modality prompts marks a clear advance for the field.

Looking Ahead

Meta envisions SAM Audio as a tool for empowering creators, researchers, and developers to explore new forms of expression and application development. The collaboration with partners like Starkey and 2gether-International highlights the model’s potential in advancing accessibility. SAM Audio marks a step towards more inclusive and creative AI, paving the way for future innovations in audio-aware technologies.


Source: https://blockchain.news/news/meta-introduces-sam-audio-for-advanced-sound-isolation
