Article reprint source: AIGC Open Community
On September 14, the well-known open-source AI company Stability AI released Stable Audio, an audio-generation product, on its official website. (Free to use at: https://www.stableaudio.com/generate)
Through text prompts, users can directly generate background music in more than 20 styles, including rock, jazz, electronic, hip-hop, heavy metal, folk, pop, punk, and country.
For example, entering keywords such as disco, drum machine, synthesizer, bass, piano, guitar, cheerful, and 115 BPM will produce matching background music.
Stable Audio currently offers free and paid tiers. The free tier can generate 20 tracks per month, each at most 45 seconds long, and the output cannot be used commercially. The paid tier costs US$11.99 (about RMB 87) per month, can generate 500 tracks, extends the maximum length to 90 seconds, and permits commercial use.
If you don't want to pay, you can register several accounts and stitch the generated clips together in an audio editor such as Adobe Audition or Premiere Pro to achieve a similar result.
Stable Audio Introduction
Over the past few years, diffusion models have advanced rapidly in image, video, and audio generation. In the audio domain, however, they have a limitation: a diffusion model typically produces output of a fixed size. For example, an audio diffusion model trained on 30-second clips can only generate 30-second clips. To break this bottleneck, Stable Audio uses a more advanced model.
It is an audio latent diffusion model conditioned on text metadata as well as the duration and start time of the audio file, giving control over both the content and the length of the generated audio. This additional timing condition lets users generate audio of a specified length.
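The timing-conditioning idea can be sketched as follows. This is an illustrative Python sketch, not the actual Stability AI API: the function names, dictionary keys, and the assumed latent frame rate are all hypothetical. It shows the two roles the duration plays: it is passed to the model as conditioning metadata, and it determines how many latent frames the diffusion process must produce.

```python
import math

# Assumed latent frame rate after downsampling 44.1 kHz audio; the real
# downsampling factor is not stated in the article, so this is illustrative.
LATENT_RATE_HZ = 21.5

def build_conditioning(prompt: str, seconds_start: float, seconds_total: float) -> dict:
    """Bundle the text and timing metadata the model is conditioned on (hypothetical schema)."""
    return {
        "prompt": prompt,                # text metadata describing the audio
        "seconds_start": seconds_start,  # where the clip begins within its source file
        "seconds_total": seconds_total,  # total length of audio requested
    }

def latent_frames_for(seconds_total: float) -> int:
    """Map the requested duration to the number of latent time steps to generate."""
    return math.ceil(seconds_total * LATENT_RATE_HZ)

cond = build_conditioning("disco, drum machine, synthesizer, 115 BPM",
                          seconds_start=0.0, seconds_total=45.0)
print(latent_frames_for(cond["seconds_total"]))  # 968
```

Because the duration is an explicit input rather than a fixed property of the training clips, the same model can serve requests for different lengths.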
Working on a heavily downsampled latent representation rather than the raw audio makes inference much faster: the latest Stable Audio model can render 95 seconds of 44.1 kHz stereo audio in under one second on an NVIDIA A100 GPU.
For training data, Stable Audio uses a dataset of more than 800,000 audio files covering music, sound effects, and individual instruments. The dataset totals more than 19,500 hours of audio and was built in collaboration with the music service provider AudioSparx, which is why the generated music can be used commercially.
Latent Diffusion Model
The latent diffusion model used by Stable Audio is a diffusion-based generative model that operates in the latent encoding space of a pre-trained autoencoder; in other words, it combines an autoencoder with a diffusion model.
An autoencoder is first used to learn a low-dimensional latent representation of the input data (e.g., images or audio). This latent representation captures the important features of the input data and can be used to reconstruct the original data.
The diffusion model is then trained in this latent space, gradually denoising latent variables to generate new samples.
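The two-stage pipeline described above can be sketched with a toy NumPy example. This is not Stability AI's code: the "autoencoder" here is just a random linear map and its pseudo-inverse, standing in for a trained network, but it shows the structure: encode to a small latent space, run the diffusion (noising) step on the latents, and decode back to data space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "autoencoder": a random linear encoder and its pseudo-inverse decoder.
# A real system would train these networks; the shapes are what matter here.
data_dim, latent_dim = 1024, 64  # diffusion runs over 64 values, not 1024
W = rng.standard_normal((latent_dim, data_dim)) / np.sqrt(data_dim)

def encode(x):
    return W @ x

def decode(z):
    return np.linalg.pinv(W) @ z

def diffuse(z, alpha):
    """One forward-diffusion step in latent space: z_t = sqrt(a)*z + sqrt(1-a)*noise."""
    return np.sqrt(alpha) * z + np.sqrt(1 - alpha) * rng.standard_normal(z.shape)

x = rng.standard_normal(data_dim)   # a "data" sample (e.g. an audio clip)
z = encode(x)                       # compress to the latent space
z_noisy = diffuse(z, alpha=0.5)     # diffusion operates on the compact latents
x_rec = decode(z)                   # decoder maps latents back to data space

print(z.shape, z_noisy.shape, x_rec.shape)  # (64,) (64,) (1024,)
```

Every diffusion step touches 64 numbers instead of 1024, which is the source of the speedup discussed below.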
The main advantage of this approach is that it significantly improves the training and inference speed of diffusion models: because the diffusion process runs in a comparatively small latent space rather than the original data space, new data can be generated far more efficiently.
In addition, by operating in the latent space, such models can also provide better control over the generated data. For example, one can change certain characteristics of the generated data by manipulating the latent variables, or guide the data generation process by imposing constraints on the latent variables.
Stable Audio usage and case demonstration
「AIGC Open Community」 tried the free version of Stable Audio. Usage is similar to ChatGPT: you simply type a text prompt. Prompt content falls into four categories: details, mood, instruments, and tempo.
Note that generating more delicate music with richer rhythm and tempo requires a correspondingly detailed prompt; in other words, the more descriptive the text prompt, the better the result tends to be.
Stable Audio User Interface
The following is a sample of generated audio.
Trance, Island, Beach, Sun, 4am, Progressive, Synth, 909, Dramatic Chords, Chorus, Upbeat, Nostalgic, Dynamic.
Soft hug, comfort, low synth, shimmer, wind and leaves, ambient, peaceful, relaxing, water.
Pop electronic, big reverb synth, controlled drum machine, atmospheric, moody, nostalgic, cool, pop instrumental, 100 BPM.
3/4, 3 beats, guitar, drums, bright, happy, clapping.
The material in this article comes from the official website of Stability AI. If there is any infringement, please contact us for removal.
END