WaveTransfer: A Flexible End-to-end Multi-instrument Timbre Transfer with Diffusion

Teysir Baoueb, Xiaoyu Bie, Hicham Janati, Gaël Richard

[Code | Paper]

Abstract: As diffusion-based deep generative models gain prevalence, researchers are actively investigating their potential applications across various domains, including music synthesis and style alteration. Within this work, we are interested in timbre transfer, a process that involves seamlessly altering the instrumental characteristics of musical pieces while preserving essential musical elements. This paper introduces WaveTransfer, an end-to-end diffusion model designed for timbre transfer. We specifically employ the bilateral denoising diffusion model (BDDM) for noise scheduling search. Our model is capable of conducting timbre transfer between audio mixtures as well as individual instruments. Notably, it exhibits versatility in that it accommodates multiple types of timbre transfer between unique instrument pairs in a single model, eliminating the need for separate model training for each pairing. Furthermore, unlike recent works limited to 16 kHz, WaveTransfer can be trained at various sampling rates, including the industry-standard 44.1 kHz, a feature of particular interest to the music community.

A. Subjective Evaluation

I. Test Description

The subjective test, which you can access via this link (please note that your responses will not be saved), evaluates mixture-to-mixture timbre transfer at 16 kHz. It consists of 10 questions: half cover the transfer from piano-strings to vibraphone-clarinet, and the other half the transfer from vibraphone-clarinet to piano-strings. Each question includes 7 conditions: a hidden reference, the outputs of DiffTransfer and Music-STAR, and the outputs of WTglobal16 and WTmix16, each with BDDM-20 and with WG-6.

Participants rate the conditions on a scale from 0 (worst score) to 100 (best score) based on their perception of the quality of the timbre transfer, without knowing that a hidden reference is included in each set. The order of the questions and the conditions within each question are randomized for each participant.
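
The per-participant randomization described above can be sketched as follows. This is only an illustration, not the code behind the test page: the function name `presentation_order` and the use of a per-participant seed are assumptions made for the example.

```python
import random

def presentation_order(questions, conditions, participant_seed):
    """Return a shuffled list of (question, shuffled conditions) pairs.

    A dedicated random generator per participant (hypothetical seeding
    scheme) keeps each participant's order reproducible.
    """
    rng = random.Random(participant_seed)
    # Shuffle the question order without modifying the input list.
    question_order = rng.sample(questions, k=len(questions))
    # Draw an independent condition order for every question.
    return [(q, rng.sample(conditions, k=len(conditions))) for q in question_order]

# Placeholder identifiers: 10 questions, 7 conditions per question.
questions = list(range(1, 11))
conditions = ["A", "B", "C", "D", "E", "F", "G"]
order = presentation_order(questions, conditions, participant_seed=0)
```

Drawing a fresh condition order for each question means the hidden reference does not sit in a predictable slot.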

II. Results and Analysis

Responses were collected from 10 participants, 80% of whom have several years of musical experience. We show hereafter the results computed using the raw scores.

Table 1: 95% Confidence Intervals of the Subjective Evaluation for Both Timbre Transfer Types (Raw Scores)
Model Mean ± Margin of Error
Reference 87.57 ± 2.71
DiffTransfer 29.81 ± 3.74
Music-STAR 23.46 ± 3.20
WTglobal16 with BDDM-20 53.72 ± 3.78
WTglobal16 with WG-6 55.12 ± 3.58
WTmix16 with BDDM-20 64.84 ± 4.01
WTmix16 with WG-6 57.48 ± 3.91
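
For readers who want to reproduce a "mean ± margin of error" entry from their own ratings, here is a minimal sketch. The page does not state how the intervals were computed, so the normal approximation used here is an assumption (a Student t critical value would give slightly wider intervals for small samples), and the example scores are made up.

```python
import math
import statistics

def mean_with_margin(scores, confidence=0.95):
    """Return (mean, margin) of a normal-approximation confidence interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(scores)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std / math.sqrt(n)
    return mean, margin

# Hypothetical ratings for one condition (not data from the study).
example_scores = [60, 70, 80, 90, 100]
mean, margin = mean_with_margin(example_scores)
print(f"{mean:.2f} ± {margin:.2f}")
```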

The boxplot presented below summarizes five key statistics: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The central box represents the interquartile range (IQR), containing the middle 50% of the data, with the bottom of the box marking Q1, the top marking Q3, and the line inside indicating the median. The whiskers extend from the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3, respectively. Data points outside the whiskers are considered outliers and are typically shown as individual dots.
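
The boxplot statistics described above can be computed as in the sketch below. It is an illustration on made-up numbers, and the quartile method (`method="inclusive"`) is an assumption: plotting libraries may interpolate quartiles differently, which can shift the whiskers slightly.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points inside the fences.
    inliers = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical scores; the value 5 falls below the lower fence.
example = [5, 40, 45, 50, 55, 60, 65, 70, 75]
summary = boxplot_summary(example)
```

In this example the lower whisker ends at 40 instead of the true minimum 5, which is plotted as an outlier.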

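The raw scores show all four WaveTransfer configurations rated well above both baselines: their means range from 53.72 to 64.84, against 29.81 for DiffTransfer and 23.46 for Music-STAR. WTmix16 with BDDM-20 obtains the highest mean among the transferred conditions (64.84).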

Since some participants did not assign a score of 100 to the condition they considered the best for certain questions, we rescaled the scores. This rescaling ensures that, for each participant and each question, the highest score is normalized to 100.
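
A minimal sketch of this rescaling is given below. It assumes a multiplicative rescaling (every score in a participant-question group is multiplied by 100 divided by the group's highest score); the data layout, the function name `rescale_ratings`, and the handling of an all-zero group are choices made for the example, not details taken from the study.

```python
def rescale_ratings(ratings):
    """Rescale so that each (participant, question) group peaks at 100.

    `ratings` maps (participant, question) to a dict {condition: raw score}.
    """
    rescaled = {}
    for key, by_condition in ratings.items():
        top = max(by_condition.values())
        if top == 0:
            # All-zero group: left unchanged (assumed handling).
            rescaled[key] = dict(by_condition)
            continue
        rescaled[key] = {
            cond: score * 100 / top for cond, score in by_condition.items()
        }
    return rescaled

# Hypothetical raw ratings for one participant and one question.
raw = {("p1", "q1"): {"Reference": 90, "DiffTransfer": 45, "Music-STAR": 30}}
rescaled = rescale_ratings(raw)
```

Here the reference moves from 90 to 100 and DiffTransfer from 45 to 50, so the ratios between conditions within a group are preserved.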

Table 2: 95% Confidence Intervals of the Subjective Evaluation for Both Timbre Transfer Types (Rescaled Scores)
Model Mean ± Margin of Error
Reference 97.94 ± 1.29
DiffTransfer 33.74 ± 4.14
Music-STAR 27.73 ± 3.96
WTglobal16 with BDDM-20 60.06 ± 3.82
WTglobal16 with WG-6 61.93 ± 3.81
WTmix16 with BDDM-20 72.12 ± 3.78
WTmix16 with WG-6 63.98 ± 3.85

B. Listening Samples

I. Timbre Transfer at 16 kHz

In this section, we consider the timbre transfer task at a sampling rate of 16 kHz. For both DiffTransfer [2] and Music-STAR [1], six separate models are trained, one for each type of timbre transfer. Their mixture outputs come from the models trained specifically for mixture-to-mixture timbre transfer, not from combining individual tracks generated by the single-instrument models.

1. Piano → Vibraphone

Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

2. Vibraphone → Piano

Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

3. Strings → Clarinet

Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

4. Clarinet → Strings

Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

5. (Piano + Strings) → (Vibraphone + Clarinet)

Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20 | WT16mix with WG-6 | WT16mix with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

6. (Vibraphone + Clarinet) → (Piano + Strings)

Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20 | WT16mix with WG-6 | WT16mix with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

II. Timbre Transfer at 44.1 kHz

1. Piano → Vibraphone

Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

2. Vibraphone → Piano

Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

3. Strings → Clarinet

Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

4. Clarinet → Strings

Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

5. (Piano + Strings) → (Vibraphone + Clarinet)

Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19 | WT44mix with WG-6 | WT44mix with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

6. (Vibraphone + Clarinet) → (Piano + Strings)

Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19 | WT44mix with WG-6 | WT44mix with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

References

[1] Mahshid Alinoori and Vassilios Tzerpos, “Music-star: a style translation system for audio-based re-instrumentation,” in Proc. ISMIR, 2022.
[2] Luca Comanducci, Fabio Antonacci, and Augusto Sarti, “Timbre transfer using image-to-image denoising diffusion implicit models,” in Proc. ISMIR, 2023.
[3] Mahshid Alinoori and Vassilios Tzerpos, “Starnet,” Available at https://doi.org/10.5281/zenodo.6917099, August 2022.