WaveTransfer: A Flexible End-to-end Multi-instrument Timbre Transfer with Diffusion

Teysir Baoueb, Xiaoyu Bie, Hicham Janati, Gaël Richard

Abstract: As diffusion-based deep generative models gain prevalence, researchers are actively investigating their potential applications across various domains, including music synthesis and style alteration. Within this work, we are interested in timbre transfer, a process that involves seamlessly altering the instrumental characteristics of musical pieces while preserving essential musical elements. This paper introduces WaveTransfer, an end-to-end diffusion model designed for timbre transfer. We specifically employ the bilateral denoising diffusion model (BDDM) for noise scheduling search. Our model is capable of conducting timbre transfer between audio mixtures as well as individual instruments. Notably, it exhibits versatility in that it accommodates multiple types of timbre transfer between unique instrument pairs in a single model, eliminating the need for separate model training for each pairing. Furthermore, unlike recent works limited to 16 kHz, WaveTransfer can be trained at various sampling rates, including the industry-standard 44.1 kHz, a feature of particular interest to the music community.

A. Subjective Evaluation

I. Test Description
II. Results and Analysis

B. Listening Samples

I. Timbre Transfer at 16 kHz
II. Timbre Transfer at 44.1 kHz

References

A. Subjective Evaluation

I. Test Description

The subjective test, which you can access via this link (please note that your responses will not be saved), is designed to evaluate mixture-to-mixture timbre transfer at 16 kHz. It consists of 10 questions, with half focusing on the timbre transfer from piano-strings to vibraphone-clarinet, and the other half on the transfer from vibraphone-clarinet to piano-strings. Each question includes 7 conditions:

Reference audio
DiffTransfer
Music-STAR
WT¹⁶_global with both WG-6 and BDDM-20-generated noise schedules (2 conditions)
WT¹⁶_mix with both WG-6 and BDDM-20-generated noise schedules (2 conditions)

Participants rate the conditions on a scale from 0 (worst score) to 100 (best score) based on their perception of the quality of the timbre transfer, without knowing that a hidden reference is included in each set. The order of the questions and the conditions within each question are randomized for each participant.

II. Results and Analysis

Responses were collected from 10 participants, 80% of whom have several years of musical experience. The results computed from the raw scores are reported first.

**Table 1:** 95% Confidence Intervals of the Subjective Evaluation for Both Timbre Transfer Types (Raw Scores)
Model	Mean ± Margin of Error
Reference	87.57 ± 2.71
DiffTransfer	29.81 ± 3.74
Music-STAR	23.46 ± 3.20
WT¹⁶_global with BDDM-20	53.72 ± 3.78
WT¹⁶_global with WG-6	55.12 ± 3.58
WT¹⁶_mix with BDDM-20	64.84 ± 4.01
WT¹⁶_mix with WG-6	57.48 ± 3.91
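The "Mean ± Margin of Error" entries can be read as 95% confidence intervals around the mean rating. The following is a minimal sketch of how such an interval could be computed from a list of ratings. It assumes a normal approximation (critical value of about 1.96) and the sample standard deviation; the exact procedure used to produce the table is not stated on this page, and the function name `mean_with_margin` and the example ratings are hypothetical.

```python
import math
import statistics


def mean_with_margin(scores, confidence=0.95):
    """Return (mean, margin) for a normal-approximation confidence interval.

    Assumption: the margin is z * s / sqrt(n), with s the sample standard
    deviation. This is an illustrative choice, not the page's stated method.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(scores)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std / math.sqrt(n)
    return mean, margin


# Hypothetical ratings on the 0-100 scale, for illustration only.
example_scores = [60, 72, 55, 80, 65, 70, 58, 75, 62, 68]
mean, margin = mean_with_margin(example_scores)
print(f"{mean:.2f} ± {margin:.2f}")
```

With few ratings per condition, a Student's t critical value would give a slightly wider interval than the normal approximation used in this sketch.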

The boxplot of the scores summarizes five key statistics: the minimum and maximum (excluding outliers), the first quartile (Q1), the median, and the third quartile (Q3). The central box represents the interquartile range (IQR), containing the middle 50% of the data, with the bottom of the box marking Q1, the top marking Q3, and the line inside indicating the median. The whiskers extend from the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3, respectively. Data points outside the whiskers are considered outliers and are shown as individual dots.
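The boxplot quantities described above can be sketched as follows. This is only an illustration: the function name `boxplot_summary` and the example ratings are hypothetical, and the quartiles use linear interpolation (the "inclusive" method of Python's `statistics.quantiles`), which may differ slightly from the convention of the plotting tool used for the figure.

```python
import statistics


def boxplot_summary(scores):
    """Compute quartiles, whisker ends, and outliers with the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that lie inside the fences.
    inside = [s for s in scores if low_fence <= s <= high_fence]
    outliers = sorted(s for s in scores if s < low_fence or s > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical ratings for one condition, for illustration only.
summary = boxplot_summary([10, 50, 55, 60, 65, 70, 75])
print(summary)
```

In this made-up example, the rating of 10 falls below Q1 - 1.5 × IQR and would be drawn as an individual dot, while the lower whisker stops at 50.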

As Table 1 shows, all four WaveTransfer variants score well above DiffTransfer and Music-STAR, with non-overlapping 95% confidence intervals. Among the transfer models, WT¹⁶_mix with BDDM-20 obtains the highest mean score (64.84).

Since some participants did not assign a score of 100 to the condition they considered the best for certain questions, we rescaled the scores. This rescaling ensures that, for each participant and each question, the highest score is normalized to 100.
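A minimal sketch of this rescaling is given below. It assumes a multiplicative rescaling, in which every score of a (participant, question) pair is multiplied by 100 divided by that pair's highest score; the page only states that the highest score is normalized to 100, so the exact formula is an assumption. The function name `rescale_scores`, the data layout, and the example ratings are hypothetical.

```python
def rescale_scores(ratings):
    """Rescale scores so the best condition of each (participant, question) is 100.

    ratings maps (participant, question) -> {condition: raw score}.
    Assumption: scores are scaled multiplicatively by 100 / max.
    """
    rescaled = {}
    for key, scores in ratings.items():
        top = max(scores.values())
        if top == 0:
            # All scores are zero: nothing to rescale, keep them unchanged.
            rescaled[key] = dict(scores)
            continue
        rescaled[key] = {cond: s * 100.0 / top for cond, s in scores.items()}
    return rescaled


# Hypothetical raw ratings from one participant on one question.
raw = {("p1", "q1"): {"Reference": 80, "DiffTransfer": 20, "Music-STAR": 40}}
print(rescale_scores(raw))
```

Under this assumption the ranking of the conditions within each question is preserved, and only the scale used by each participant changes.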

**Table 2:** 95% Confidence Intervals of the Subjective Evaluation for Both Timbre Transfer Types (Rescaled Scores)
Model	Mean ± Margin of Error
Reference	97.94 ± 1.29
DiffTransfer	33.74 ± 4.14
Music-STAR	27.73 ± 3.96
WT¹⁶_global with BDDM-20	60.06 ± 3.82
WT¹⁶_global with WG-6	61.93 ± 3.81
WT¹⁶_mix with BDDM-20	72.12 ± 3.78
WT¹⁶_mix with WG-6	63.98 ± 3.85

B. Listening Samples

I. Timbre Transfer at 16 kHz

In this section, we consider the timbre transfer task at a sampling rate of 16 kHz. Note that for both DiffTransfer and Music-STAR, six different models are trained, one for each type of timbre transformation. Their mixture outputs are obtained from the models trained specifically for mixture-to-mixture timbre transfer, rather than by combining individual tracks generated by the single-instrument models.

1. Piano → Vibraphone

Name	Input (ground truth)	Target (ground truth)	Music-STAR	DiffTransfer	WT¹⁶_global with WG-6	WT¹⁶_global with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

2. Vibraphone → Piano

Name	Input (ground truth)	Target (ground truth)	Music-STAR	DiffTransfer	WT¹⁶_global with WG-6	WT¹⁶_global with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

3. Strings → Clarinet

Name	Input (ground truth)	Target (ground truth)	Music-STAR	DiffTransfer	WT¹⁶_global with WG-6	WT¹⁶_global with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

4. Clarinet → Strings

Name	Input (ground truth)	Target (ground truth)	Music-STAR	DiffTransfer	WT¹⁶_global with WG-6	WT¹⁶_global with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

5. (Piano + Strings) → (Vibraphone + Clarinet)

Name	Input (ground truth)	Target (ground truth)	Music-STAR	DiffTransfer	WT¹⁶_global with WG-6	WT¹⁶_global with BDDM-20	WT¹⁶_mix with WG-6	WT¹⁶_mix with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String

6. (Vibraphone + Clarinet) → (Piano + Strings)

Name	Input (ground truth)	Target (ground truth)	Music-STAR	DiffTransfer	WT¹⁶_global with WG-6	WT¹⁶_global with BDDM-20	WT¹⁶_mix with WG-6	WT¹⁶_mix with BDDM-20
Pirates of Caribbean
My Heart Will Go On
Beethoven's String