Meta has officially announced SAM Audio, its next-generation audio model, bringing the Segment Anything Model (SAM) approach from the visual domain to audio. Aimed at professional audio editing workflows, the model can isolate a desired source from complex, overlapping audio mixtures using multi-modal prompts.
Meta Announces New Audio Model
Unlike traditional approaches, the model lets users isolate specific sounds through natural interaction methods such as text prompts, visual annotations, or time-span selection. This makes it possible to isolate the sound of an object in a video simply by clicking on it, or to remove unwanted sounds with a short text prompt such as “dog barking.”
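Meta has not published a programmatic interface in the announcement, so the sketch below is purely illustrative: the separate function, file handling, and return values are all assumptions, used only to make the text-prompted “remove dog barking” workflow concrete.

```python
import numpy as np


def separate(mixture: np.ndarray, sample_rate: int, text_prompt: str):
    """Hypothetical stand-in for a text-prompted separation call.

    A real model would return the isolated target and the residual
    (everything else); here we return placeholders of the same shape
    so the surrounding workflow is concrete.
    """
    target = np.zeros_like(mixture)   # sound matching the prompt
    residual = mixture - target       # everything that remains
    return target, residual


# Five seconds of a stand-in "mixture" (white noise) at 16 kHz.
sr = 16_000
mixture = np.random.default_rng(0).standard_normal(5 * sr).astype(np.float32)

# "Remove dog barking": keep the residual, discard the isolated target.
barking, cleaned = separate(mixture, sr, text_prompt="dog barking")
print(cleaned.shape)  # (80000,) -- same length as the input mixture
```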
The model’s technical infrastructure is built on the Perception Encoder Audiovisual (PE-AV) engine, which precisely aligns audio and visual data along the time axis. SAM Audio is available in sizes ranging from roughly 500 million to 3 billion parameters and runs faster than real time, reaching a real-time factor (RTF) of about 0.7.
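The real-time factor is the ratio of processing time to audio duration, so a value below 1.0 means the model finishes before the clip would finish playing. A quick arithmetic check, assuming the reported RTF ≈ 0.7 holds across clip lengths:

```python
# Real-time factor (RTF) = processing_time / audio_duration.
# RTF < 1.0 means processing is faster than playback.
RTF = 0.7  # figure quoted in the announcement

for audio_seconds in (10, 60, 300):
    processing_seconds = RTF * audio_seconds
    print(f"{audio_seconds:>4}s of audio -> ~{processing_seconds:.0f}s to process")
# 10s -> ~7s, 60s -> ~42s, 300s -> ~210s
```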
Architecturally, the generator is built on a flow-matching diffusion transformer and is backed by a massive training set of both real and synthetic data. Given a mixed audio file as input, the system simultaneously produces the target sound and the remaining “residual” audio track.
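Because the system emits both the isolated target and the residual track, the two outputs should sum back to (approximately) the input mixture. A minimal NumPy check of that invariant, using synthetic stand-in signals rather than real model output:

```python
import numpy as np

rng = np.random.default_rng(42)
sr, seconds = 16_000, 3

# Synthetic stand-ins: a "target" source plus background noise.
t = np.arange(seconds * sr) / sr
target_true = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # 440 Hz tone
background = 0.1 * rng.standard_normal(t.shape)

mixture = target_true + background

# Pretend the model returned these two stems.
target_est = target_true              # isolated target track
residual_est = mixture - target_est   # residual ("everything else") track

# The two stems should reconstruct the original mixture.
assert np.allclose(target_est + residual_est, mixture)
print("max reconstruction error:",
      np.max(np.abs(target_est + residual_est - mixture)))
```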
SAM Audio offers three basic prompting methods, giving users considerable flexibility. With text prompting, the user specifies the target directly by typing “piano” or “vocals”; with visual prompting, clicking on an instrument or speaker in the video is enough. The industry-first span-prompting method lets the user mark a specific time interval as an example of the target sound and isolate that sound across the entire file.
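The span prompt itself is just a time interval. The sketch below, assuming a simple (start, end) representation in seconds (not an official format), shows how such an interval maps onto sample indices of the mixture that would then serve as the reference for isolation across the full file:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SpanPrompt:
    """A time interval (in seconds) where the target sound is audible."""
    start_s: float
    end_s: float

    def to_samples(self, sample_rate: int) -> tuple[int, int]:
        return int(self.start_s * sample_rate), int(self.end_s * sample_rate)


sr = 16_000
mixture = np.zeros(30 * sr, dtype=np.float32)   # stand-in 30 s recording

# Mark seconds 12.0-14.5 as an example of the sound to isolate.
prompt = SpanPrompt(start_s=12.0, end_s=14.5)
lo, hi = prompt.to_samples(sr)
exemplar = mixture[lo:hi]   # excerpt the model would use as its reference

print(len(exemplar) / sr, "seconds of reference audio")  # 2.5
```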