
DDSP paper and code summary

Here I summarize a very exciting differentiable digital signal processing tool proposed by Google’s team in January 2020. The tool is especially useful for audio waveform reconstruction and related tasks.

paper summary

abstract

The paper and the code are the most exciting ones I’ve seen in a long time! The DDSP paper successfully combines DSP and neural networks at a totally new level: by making some audio-related signal processing operations differentiable, they achieve amazing audio reconstruction and style transfer results. To me, it is really exciting and inspiring. I have been trying to do (audio) spectrogram reconstruction for several months. The WaveNet-based and Tacotron-based methods, together with GANs, all failed to produce good reconstructions. For a long time I thought it was the VAE’s fault that the reconstructions were blurry, and introducing GAN/BiGAN models could not save the results (there seems to be a trade-off between VAE and GAN). I also counted on Tacotron-based models (like Parrotron), but they turned out to be really hard to train and did not work on a relatively small dataset.

The paper first criticizes two kinds of previous audio generation models, which seems reasonable:

The DDSP framework, however, can utilize strong domain knowledge while integrating these elements with deep learning. The framework achieves good performance even without large autoregressive models (WaveNet-based) or adversarial losses (GAN-based). The authors call this utilizing strong inductive biases without losing the expressive power of neural networks. They show that combining interpretable modules permits manipulation of each separate model component, because we can independently manipulate F0 (fundamental frequency), loudness, and residual (timbre) features. The applications include independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning.

By the way, the DDSP tools are also easy to use, and the model is really small compared with large WaveNet- or GAN-based models, which makes it possible to train on a small dataset. DDSP does not propose one specific network; it provides a set of differentiable signal processing tools that let us assemble our own powerful models. And the most exciting thing is that I could use a plain autoencoder (without KL regularization or adversarial losses) to achieve nearly perfect reconstruction results.

introduction

The introduction first talks about neural networks:

Here, we increase the size of that toolbox by introducing the Differentiable Digital Signal Processing (DDSP) library, which integrates interpretable signal processing elements into modern automatic differentiation software (TensorFlow). It then talks about the periodic property of sound: objects have a natural tendency to vibrate periodically, yet neural synthesis models often do not exploit this periodic structure for generation and perception.

Then the authors point out the drawbacks of three different classes of audio synthesis models.

Since audio oscillates at many frequencies, all with different periods from the fixed frame hop size, the model must precisely align waveforms between different frames and learn filters to cover all possible phase variations.

sinusoids at multiple neighboring frequencies and phases must be combined to represent a single sinusoid when Fourier basis frequencies do not perfectly match the audio

Autoregressive models can avoid these issues by generating the waveform a single sample at a time. They are not constrained by the bias toward generating wave packets and can express arbitrary waveforms.

However, they require larger and more data-hungry networks, as they do not take advantage of a bias over oscillation, and a waveform’s shape does not perfectly correspond to perception.


OSCILLATOR MODELS

Rather than predicting waveforms or Fourier coefficients, a third model class (vocoders or synthesizers) directly generates audio with oscillators.

These “analysis/synthesis” models use expert knowledge and hand-tuned heuristics to extract synthesis parameters (analysis) that are interpretable (loudness and frequencies) and can be used by the generative algorithm (synthesis). Neural networks have played some role in oscillator models, mainly in modeling pre-extracted synthesis parameters. As the authors say: the analysis parameters must still be tuned by hand and gradients cannot flow through the synthesis procedure. As a result, small errors in parameters can lead to large errors in the audio that cannot propagate back to the network. Crucially, the realism of vocoders is limited by the expressivity of a given analysis/synthesis pair.

contributions

DDSP models combine the strengths of the above approaches, benefiting from the inductive bias of using oscillators while retaining the expressive power of neural networks and end-to-end training. DDSP components are capable of generating high-fidelity audio without autoregressive or adversarial losses. DDSP can also perform several different tasks:

To understand why Google’s team could build such a good tool, it is important to look at the papers that inspired them.

Source-filter/subtractive models are inspired by the human vocal tract and dynamically filter a harmonically rich source signal, while sinusoidal/additive models generate sound as the combination of a set of time-varying sine waves. Additive models are strictly more expressive than subtractive models but have more parameters, as each sinusoid has its own time-varying loudness and frequency. Spectral Modeling Synthesis (Serra & Smith, 1990; Beauchamp, 2007) builds an additive synthesizer that combines sinusoids in harmonic (integer) ratios of a fundamental frequency alongside a time-varying filtered noise signal.

The NSF model can be seen as a specific DDSP model that uses convolutional waveshaping of a sinusoidal oscillator to create harmonic content, rather than the additive synthesis explored in this work. Both works also generate audio in the time domain and impose multi-scale spectrogram losses in the frequency domain.

DDSP components

Now for the most exciting part: how they make DSP operations differentiable. The idea is straightforward: write the operations as functions in modern automatic differentiation software. They express core components as feedforward functions, allowing efficient implementation on parallel hardware such as GPUs and TPUs, and generation of samples during training. These components include oscillators, envelopes, and filters (linear time-varying finite impulse response, LTV-FIR).
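To make this concrete, here is a toy sketch (my own, not from the paper) showing that a DSP operation written as an ordinary TensorFlow function is automatically differentiable, so an audio-domain loss can update synthesis parameters directly:

```python
import tensorflow as tf

# Toy illustration: a simple oscillator written as a TensorFlow function.
# Gradients of an audio-domain loss flow back to the synthesis parameters.
amplitude = tf.Variable(0.5)
frequency = tf.Variable(300.0)
t = tf.linspace(0.0, 1.0, 16000)  # 1 second of time stamps at 16 kHz
pi = 3.141592653589793

with tf.GradientTape() as tape:
    audio = amplitude * tf.sin(2.0 * pi * frequency * t)   # simple sinusoidal oscillator
    target = 0.8 * tf.sin(2.0 * pi * 440.0 * t)            # "ground truth" audio
    loss = tf.reduce_mean(tf.square(audio - target))

grads = tape.gradient(loss, [amplitude, frequency])  # both gradients are defined
```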

SPECTRAL MODELING SYNTHESIS

They implement a differentiable version of Spectral Modeling Synthesis (SMS) (Serra & Smith, 1990).

This model generates sound by combining an additive synthesizer (adding together many sinusoids) with a subtractive synthesizer (filtering white noise). It is parametric and highly expressive; importantly, it has more parameters than other parametric models. For example, for the 4 seconds of 16 kHz audio in the datasets considered here, the synthesizer coefficients actually have ~2.5 times more dimensions than the audio waveform itself ((1 amplitude + 100 harmonics + 65 noise band magnitudes) * 1000 timesteps = 166,000 dimensions, vs. 64,000 audio samples). It would be hard to hand-design so many features, but they are amenable to control by a neural network.
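A quick check of that dimensionality arithmetic:

```python
# Sanity check of the dimension counts quoted above.
n_frames = 1000
synth_dims = (1 + 100 + 65) * n_frames   # amplitude + harmonics + noise bands per frame
audio_dims = 4 * 16000                   # 4 seconds of 16 kHz audio
print(synth_dims, audio_dims, synth_dims / audio_dims)  # 166000 64000 ~2.6
```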

As we only consider monophonic sources in these experiments, we use the Harmonic plus Noise model, which further constrains the sinusoids to be integer multiples of a fundamental frequency.

HARMONIC OSCILLATOR / ADDITIVE SYNTHESIZER

At the heart of the synthesis techniques is the sinusoidal oscillator.


This part clearly illustrates how to generate a monophonic waveform from a time-varying fundamental frequency (f0) and loudness: f0 determines the frequencies of the harmonic multiples, and loudness scales the harmonic amplitudes.
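Here is a minimal sketch of that harmonic oscillator in plain TensorFlow, under my own simplifications (the library’s version is ddsp.synths.Additive, which additionally handles harmonics above the Nyquist frequency):

```python
import tensorflow as tf

def harmonic_synth(f0_hz, amplitudes, harmonic_distribution, sample_rate=16000):
    """Additive synthesis sketch. All inputs are already upsampled to audio rate:
      f0_hz:                 [n_samples]    fundamental frequency per sample
      amplitudes:            [n_samples]    overall loudness envelope
      harmonic_distribution: [n_samples, K] relative amplitude of each harmonic
    """
    n_harmonics = harmonic_distribution.shape[-1]
    k = tf.range(1, n_harmonics + 1, dtype=tf.float32)   # harmonic numbers 1..K
    freqs = f0_hz[:, tf.newaxis] * k                      # integer multiples of f0
    # Instantaneous phase is the cumulative sum of angular frequency over time.
    phases = 2.0 * 3.141592653589793 * tf.cumsum(freqs / sample_rate, axis=0)
    harmonic_amps = amplitudes[:, tf.newaxis] * harmonic_distribution
    return tf.reduce_sum(harmonic_amps * tf.sin(phases), axis=-1)  # [n_samples]
```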

ENVELOPES

The neural networks operate at a slower frame rate, but the oscillator formulation above requires time-varying amplitudes and frequencies at the audio sample rate (1,000 frames vs. 64,000 samples). For the instantaneous frequencies they upsample with bilinear interpolation; the amplitude envelopes require smoothing to prevent artifacts, which is achieved by adding overlapping Hamming windows.
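A rough NumPy sketch of the overlap-add smoothing idea for the amplitude envelopes (a simplification of what I understand the library’s upsampling utilities do; edge handling and window alignment are glossed over):

```python
import numpy as np

def upsample_with_hamming_windows(frame_amps, n_samples):
    """Upsample frame-rate amplitudes to the audio rate by overlap-adding
    Hamming windows, one per frame, then normalizing by the window sum."""
    n_frames = len(frame_amps)
    hop = n_samples // n_frames
    window = np.hamming(2 * hop)               # each window spans two hops (50% overlap)
    out = np.zeros(n_samples + 2 * hop)
    norm = np.zeros(n_samples + 2 * hop)
    for i, a in enumerate(frame_amps):
        out[i * hop: i * hop + 2 * hop] += a * window
        norm[i * hop: i * hop + 2 * hop] += window
    return (out / np.maximum(norm, 1e-8))[:n_samples]
```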

FILTER DESIGN : FREQUENCY SAMPLING METHOD

Standard convolutional layers are equivalent to linear time-invariant finite impulse response (LTI-FIR) filters. However, to ensure interpretability and prevent phase distortion, they employ the frequency sampling method to convert network outputs into impulse responses of linear-phase filters, applied in a time-varying fashion (LTV-FIR).
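A sketch of the frequency-sampling idea for a single frame, with the time-varying (LTV) part and careful window handling left out:

```python
import numpy as np

def fir_from_magnitudes(magnitudes, ir_size=64):
    """Turn a network-predicted magnitude response (e.g. 65 frequency bins)
    into a windowed, approximately linear-phase FIR impulse response."""
    ir = np.fft.irfft(magnitudes)         # zero-phase impulse response, length 2*(bins-1)
    ir = np.roll(ir, ir_size // 2)        # shift so the main lobe is centered in the window
    return ir[:ir_size] * np.hamming(ir_size)  # truncate and window to limit time support
```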

FILTERED NOISE / SUBTRACTIVE SYNTHESIZER

Natural sounds contain both harmonic and stochastic components. The Harmonic plus Noise model captures this by combining the output of an additive synthesizer with a stream of filtered noise (Serra & Smith, 1990; Beauchamp, 2007). We are able to realize a differentiable filtered noise synthesizer by simply applying the LTV-FIR filter from above to a stream of uniform noise: $Y_l = H_l N_l$, where $N_l$ is the IDFT of uniform noise in the domain [-1, 1].
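A per-frame sketch of that subtractive synthesizer (frame size and bin count are hypothetical; in the real model the filtering is applied frame by frame and overlap-added):

```python
import numpy as np

def filtered_noise_frame(filter_magnitudes, frame_size=256, seed=0):
    """Filter one frame of uniform noise by a predicted magnitude response.
    `filter_magnitudes` must have frame_size // 2 + 1 frequency bins here."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-1.0, 1.0, frame_size)              # uniform noise in [-1, 1]
    noise_spectrum = np.fft.rfft(noise)                     # N_l
    filtered_spectrum = filter_magnitudes * noise_spectrum  # Y_l = H_l N_l
    return np.fft.irfft(filtered_spectrum, n=frame_size)
```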

The detailed usage can be found in the solo instrument example below.

Experiments and Model details


A demo autoencoder model: F0 can be extracted with a pretrained CREPE model or learned with a ResNet encoder. Note that the authors say, with some pride:

we find DDSP components are able to dramatically improve autoencoder performance in the audio domain. Introducing stochastic latents (such as in GAN, VAE, and Flow models) will likely further improve performance, but we leave that to future work as it is orthogonal to the core question of DDSP component performance that we investigate in this paper. I can verify their words: the model really does a great job, and there are definitely improvements we could make.

Encoder:
- The loudness l(t) is extracted directly from the audio. (See also http://archives.ismir.net/ismir2019/paper/000063.pdf, where extracted F0 and loudness features are used to condition a WaveRNN.)
- A pretrained CREPE model with fixed weights (Kim et al., 2018) is used as the f(t) encoder to extract the fundamental frequency, and an optional encoder extracts a time-varying latent encoding z(t) of the residual information. (For the unsupervised autoencoder, the pretrained CREPE model is replaced with a ResNet architecture (He et al., 2016) that extracts f(t) from a mel-scaled log spectrogram of the audio and is jointly trained with the rest of the network; details in Appendix B.1.)
- For the z(t) encoder, MFCC coefficients (30 per frame) are first extracted from the audio, which correspond to the smoothed spectral envelope of the harmonics (Beauchamp, 2007), and then transformed by a single GRU layer into 16 latent variables per frame.

z-encoder: As shown in Figure 8, the encoder first calculates MFCCs (Mel Frequency Cepstral Coefficients) from the audio. The MFCCs are computed from the log-mel-spectrogram of the audio with an FFT size of 1024, 128 bins covering a frequency range of 20 Hz to 8000 Hz, and 75% overlap. Only the first 30 MFCCs, which correspond to a smoothed spectral envelope, are used. The MFCCs are then passed through a normalization layer (with learnable shift and scale parameters) and a 512-unit GRU. The GRU outputs (over time) are fed to a 512-unit linear layer to obtain z(t). The z embedding reported in this model has 16 dimensions across 250 time steps.
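A hedged Keras sketch of that z-encoder path (my reading of the description, not the actual ddsp code; MFCC extraction is assumed to happen beforehand, and projecting directly to 16 dimensions per frame is my simplification of the “512-unit linear layer” plus 16-dim z):

```python
import tensorflow as tf

n_frames = 250   # time steps
n_mfccs = 30     # first 30 MFCCs per frame
z_dims = 16      # residual latent size per frame

mfccs = tf.keras.Input(shape=(n_frames, n_mfccs))
x = tf.keras.layers.LayerNormalization(center=True, scale=True)(mfccs)  # learnable shift/scale
x = tf.keras.layers.GRU(512, return_sequences=True)(x)                  # single GRU over time
z = tf.keras.layers.Dense(z_dims)(x)                                    # z(t): [n_frames, 16]
z_encoder = tf.keras.Model(mfccs, z, name="z_encoder")
```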


Decoder: The decoder network maps the tuple (f(t), l(t), z(t)) to control parameters for the additive and filtered noise synthesizers. The synthesizers generate audio based on these parameters, and a reconstruction loss between the synthesized and original audio is minimized.

The decoder’s input is the latent tuple (f(t), l(t), z(t)) (250 timesteps). Its outputs are the parameters required by the synthesizers. For example, in the case of the harmonic synthesizer and filtered noise synthesizer setup, the decoder outputs a(t) (amplitudes of the harmonics) for the harmonic synthesizer (note that f(t) is fed directly from the latent), and H (transfer function of the FIR filter) for the filtered noise synthesizer.


Separate MLPs for different features

Note that the latent f(t) is also fed directly to the additive synthesizer, since it has structural meaning for the synthesizer outside the context of any given dataset. This disentangled representation enables the model to both interpolate within and extrapolate outside the data distribution. Indeed, recent work supports the incorporation of strong inductive biases as a prerequisite for learning disentangled representations. This means that with loudness and pitch explicitly controlled by (f(t), l(t)), the model should use the residual z(t) to encode timbre. Although the architecture and training do not strictly enforce this encoding, the authors qualitatively demonstrate how varying z leads to a smooth change in timbre. Figure 3 shows that the features (f0 and loudness) of the resynthesized audio (dashed lines) closely follow the conditioning, while as the latent vectors z(t) are interpolated, the spectral centroid of the resulting audio (thin solid lines) smoothly varies between the original samples (dark solid lines). This is what enables timbre transfer.


An example: converting the singing voice of one of the authors into a violin. F0 and loudness features are extracted from the singing voice, and the DDSP autoencoder trained on solo violin is used for resynthesis. To better match the conditioning features, they first shift the fundamental frequency of the singing up by two octaves to fit a violin’s typical register. Next, they transfer the room acoustics of the violin recording (as described in Section 5.3) to the voice before extracting loudness, to better match the loudness contours of the violin recordings. The resulting audio captures many subtleties of the singing with the timbre and room acoustics of the violin dataset.

key questions to be answered

The brain stem faithfully represents the three main features of speech and music: timing (onsets/offsets and envelope of the response), pitch (encoding of the fundamental frequency), and timbre (harmonics) http://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC3989107&blobtype=pdf

code details

AE model

ddsp/ae.gin (at master · magenta/ddsp · GitHub): an autoencoder that decodes from (loudness, f0, z).

Encoder

z = encoder(audio). How is z generated?

Decoder

ZRnnFcDecoder() with channels = 512 and append_f0_loudness = True, which means concatenating conditioning['f0_scaled'], conditioning['ld_scaled'], and conditioning['z'] together (f0 and loudness are 1-dimensional each). An RNN and dense layers then generate the outputs: output_splits = (('amps', 1), ('harmonic_distribution', 60), ('noise_magnitudes', 65))
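A hedged Keras sketch of what this decoder configuration describes (my reading of the notes above, not the actual ZRnnFcDecoder implementation):

```python
import tensorflow as tf

n_frames = 250

f0_scaled = tf.keras.Input(shape=(n_frames, 1))
ld_scaled = tf.keras.Input(shape=(n_frames, 1))
z = tf.keras.Input(shape=(n_frames, 16))

# append_f0_loudness = True: concatenate the conditioning along the channel axis.
x = tf.keras.layers.Concatenate(axis=-1)([f0_scaled, ld_scaled, z])
x = tf.keras.layers.GRU(512, return_sequences=True)(x)
x = tf.keras.layers.Dense(512, activation="relu")(x)

# output_splits = (('amps', 1), ('harmonic_distribution', 60), ('noise_magnitudes', 65))
amps = tf.keras.layers.Dense(1)(x)
harmonic_distribution = tf.keras.layers.Dense(60)(x)
noise_magnitudes = tf.keras.layers.Dense(65)(x)

decoder = tf.keras.Model([f0_scaled, ld_scaled, z],
                         [amps, harmonic_distribution, noise_magnitudes])
```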

processorgroup

ProcessorGroup.dag = [
  (@synths.Additive(),
    ['amps', 'harmonic_distribution', 'f0_hz']),
  (@synths.FilteredNoise(),
    ['noise_magnitudes']),
  (@processors.Add(),
    ['filtered_noise/signal', 'additive/signal']),
]

what is the loss:

$$L_{i}=\left\|S_{i}-\hat{S}_{i}\right\|_{1}+\alpha\left\|\log S_{i}-\log \hat{S}_{i}\right\|_{1}$$

where $\alpha$ is a weighting term set to 1.0 in our experiments. The total reconstruction loss is then the sum of all the spectral losses, $L_{\text{reconstruction}} = \sum_{i} L_{i}$. In our experiments, we used FFT sizes (2048, 1024, 512, 256, 128, 64), and the neighboring frames in the Short-Time Fourier Transform (STFT) overlap by 75%. Therefore, the $L_i$’s cover differences between the original and synthesized audio at different spatial-temporal resolutions.
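A sketch of this multi-scale spectral loss in TensorFlow (the library ships its own version in ddsp.losses; here I use mean rather than sum reductions and a small epsilon inside the log):

```python
import tensorflow as tf

def multi_scale_spectral_loss(audio, audio_hat,
                              fft_sizes=(2048, 1024, 512, 256, 128, 64),
                              alpha=1.0):
    """L1 distance between STFT magnitudes (and log magnitudes) at several
    resolutions; `audio` and `audio_hat` are [batch, n_samples] tensors."""
    loss = 0.0
    for fft_size in fft_sizes:
        hop = fft_size // 4  # 75% overlap between neighboring frames
        s = tf.abs(tf.signal.stft(audio, frame_length=fft_size,
                                  frame_step=hop, fft_length=fft_size, pad_end=True))
        s_hat = tf.abs(tf.signal.stft(audio_hat, frame_length=fft_size,
                                      frame_step=hop, fft_length=fft_size, pad_end=True))
        linear_term = tf.reduce_mean(tf.abs(s - s_hat))
        log_term = tf.reduce_mean(tf.abs(tf.math.log(s + 1e-7) - tf.math.log(s_hat + 1e-7)))
        loss += linear_term + alpha * log_term
    return loss
```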

solo instrument as example

The encoder is None: f0 and loudness are precomputed features (f0 from the pretrained CREPE model), and there is no z here. Each sample in the batch includes: batch['audio'].shape, batch['f0_confidence'].shape, batch['f0_hz'].shape, batch['loudness_db'].shape = 64000, 1000, 1000, 1000.

After the encoder, we get 'audio', 'f0_confidence', 'f0_hz', 'loudness_db', 'f0_scaled', 'ld_scaled'.


Note that the noise_magnitudes are processed with the filtered_noise core function.


Usage

Colab is easy to use

installation

Requires tensorflow >= 2.1.0:
pip install --upgrade tensorflow-gpu
You should also install libsndfile-dev, but there is no conda package for it:
module load libsndfile/intel/1.0.28
!pip install soundfile --upgrade
conda install ffmpeg

ddsp-master/ddsp/colab/tutorials/0_processor.ipynb demonstrates how to process 'amplitudes', 'harmonic_distribution', and 'f0_hz'. get_signal() synthesizes audio from controls, while calling the processor directly (__call__) synthesizes audio from the raw inputs (amps, harmonic_distribution, f0_hz). The function is defined in wavetable_synthesis.
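A rough usage sketch based on my reading of that tutorial; exact argument names and input shapes may differ between ddsp versions:

```python
import numpy as np
import ddsp

n_frames, n_harmonics, n_samples, sample_rate = 1000, 60, 64000, 16000

amps = np.ones([1, n_frames, 1], dtype=np.float32)                        # overall amplitude
harmonic_distribution = np.ones([1, n_frames, n_harmonics], np.float32)   # per-harmonic weights
f0_hz = 440.0 * np.ones([1, n_frames, 1], dtype=np.float32)               # fundamental frequency

additive = ddsp.synths.Additive(n_samples=n_samples, sample_rate=sample_rate)
controls = additive.get_controls(amps, harmonic_distribution, f0_hz)  # scaled/normalized controls
audio = additive.get_signal(**controls)
# Equivalently, synthesize directly from the raw inputs:
# audio = additive(amps, harmonic_distribution, f0_hz)
```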


Prepare data with !ddsp_prepare_tfrecord. Note that the audio should not be too short; merge short audio clips into a single audio file before feeding it in.

module load cuda/9.0.176
module load cudnn/9.0v7.0.5

Note that GPU usage is a little tricky when running on your own machine.

Results in the Jupyter notebook: nearly perfect reconstruction, PCC 0.8-0.9. Analyzed results:
- SHTOOKA whole dataset: https://drive.google.com/file/d/1DgjxlMLd-hYtYq4_O99oclqgfliL3Cqx/view?usp=sharing
- SHTOOKA data with phonemes <= 4: https://drive.google.com/file/d/1kPQEVD5OBjeliH2rsLGNjPcJExJv5sSe/view?usp=sharing
- Other files: https://drive.google.com/drive/folders/1A7lT2dMrZf9r-EoRM2c61oQt3CGnaikM?usp=sharing

Github Issues:

Q: Hey, I just got a really good reconstruction result which is almost too good to be true. I have a sense that the idea behind the model is really good, but it is still amazing to me. I just used your demo autoencoder to reconstruct audio of the human voice, and the result is really good. But I could not understand how this can be achieved using only f0 and loudness information. For example, the vowels ‘a’ and ‘e’ are definitely different; how is that reflected in f0 and loudness? I thought there might be some difference between musical instruments and the human voice; I just couldn’t understand that these features are enough. By the way, if I want to add z as a latent space besides f0 and loudness, how can I tell the model to use it? I thought you mentioned in the paper that z may correspond to timbre information, but I couldn’t find it in timbre_transfer.ipynb; can you achieve timbre transfer without z? Another question I am really curious about: if we’d like to do human voice reconstruction from multiple sources (different people), should we consider timbre and include z in the model? Also, since the model is really doing a good job on waveform reconstruction, have you considered using it for TTS? Can we use an encoder to generate features like f0 and loudness from text or some other signal to generate the waveform?

A: My guess is that the model is probably overfitting quite a lot to a small dataset. In that case, a segment of loudness and f0 corresponds to a specific phoneme because the dataset doesn’t have enough variation. For a large dataset, there will be one-to-many mappings that the model can’t handle without more conditioning (latent or labels).

Q:

  1. I put the reconstruction result analysis here: https://drive.google.com/file/d/1DgjxlMLd-hYtYq4_O99oclqgfliL3Cqx/view
  2. On the overfitting issue: I use the SHTOOKA dataset, which contains around 1 hour and 30 minutes of audio. I think that is not too small for the model to overfit? I am still amazed that the model handles the data so well, since I have tried the Parrotron model for spectrogram reconstruction on the SHTOOKA dataset and it could not converge…
  3. I am not sure I understood “more conditioning (latent or labels)” here:

For a large dataset, there will be one-to-many mappings that the model can’t handle without more conditioning (latent or labels). Do you mean we can add conditioning besides z, f0, and loudness? You also mentioned I could add grapheme or phoneme conditioning for a TTS task; do you mean using an encoder to extract phoneme, grapheme, or other conditioning, concatenating it with z, f0, and loudness, and then feeding them to the decoder?

  4. I am also curious whether I can further improve the result by adding z conditioning and using a ResNet instead of the CREPE model, or will that be harder to train? Have you tried more complicated models like VAEs or GANs using DDSP?

Plan

Not included here

