Abstract: As diffusion-based deep generative models gain prevalence, researchers are actively investigating their potential applications across various domains, including music synthesis and style alteration. Within this work, we are interested in timbre transfer, a process that involves seamlessly altering the instrumental characteristics of musical pieces while preserving essential musical elements. This paper introduces WaveTransfer, an end-to-end diffusion model designed for timbre transfer. We specifically employ the bilateral denoising diffusion model (BDDM) for noise scheduling search. Our model is capable of conducting timbre transfer between audio mixtures as well as individual instruments. Notably, it exhibits versatility in that it accommodates multiple types of timbre transfer between unique instrument pairs in a single model, eliminating the need for separate model training for each pairing. Furthermore, unlike recent works limited to 16 kHz, WaveTransfer can be trained at various sampling rates, including the industry-standard 44.1 kHz, a feature of particular interest to the music community.
Contents
- A. Subjective Evaluation
- B. Listening Samples
- References
A. Subjective Evaluation
I. Test Description
The subjective test, which you can access via this link (please note that your responses will not be saved), is designed to evaluate mixture-to-mixture timbre transfer at 16 kHz. It consists of 10 questions, with half focusing on the timbre transfer from piano-strings to vibraphone-clarinet, and the other half on the transfer from vibraphone-clarinet to piano-strings. Each question includes 7 conditions:
- Reference audio
- DiffTransfer
- Music-STAR
- WT16global with the WG-6 and the BDDM-20-generated noise schedules (two conditions)
- WT16mix with the WG-6 and the BDDM-20-generated noise schedules (two conditions)
Participants rate the conditions on a scale from 0 (worst score) to 100 (best score) based on their perception of the quality of the timbre transfer, without knowing that a hidden reference is included in each set. The order of the questions and the conditions within each question are randomized for each participant.
II. Results and Analysis
Responses were collected from 10 participants, 80% of whom have several years of musical experience. The table below reports the mean score and margin of error for each condition, computed from the raw scores.
| Model | Mean ± Margin of Error | 
|---|---|
| Reference | 87.57 ± 2.71 | 
| DiffTransfer | 29.81 ± 3.74 | 
| Music-STAR | 23.46 ± 3.20 | 
| WTglobal16 with BDDM-20 | 53.72 ± 3.78 | 
| WTglobal16 with WG-6 | 55.12 ± 3.58 | 
| WTmix16 with BDDM-20 | 64.84 ± 4.01 | 
| WTmix16 with WG-6 | 57.48 ± 3.91 | 
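As an illustration of how a "mean ± margin of error" entry can be obtained from a list of ratings, here is a minimal sketch. The confidence level and the method used for the table above are not stated on this page; the 95% normal approximation (`z = 1.96`) and the function name `mean_with_margin` are assumptions made for the example only.

```python
import math
import statistics


def mean_with_margin(scores, z=1.96):
    """Return (mean, margin) for a list of ratings.

    margin = z * s / sqrt(n), where s is the sample standard deviation.
    z = 1.96 corresponds to a 95% normal-approximation interval; this
    choice is an assumption, not something stated in the source.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return mean, z * standard_error


if __name__ == "__main__":
    # Hypothetical ratings for one condition, pooled over participants and questions.
    ratings = [50, 60, 70, 80, 90]
    mean, margin = mean_with_margin(ratings)
    print(f"{mean:.2f} ± {margin:.2f}")
```

With few ratings, a Student's t critical value would give a slightly wider interval than the normal approximation used here.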
The boxplot presented below summarizes five key statistics: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The central box represents the interquartile range (IQR), containing the middle 50% of the data, with the bottom of the box marking Q1, the top marking Q3, and the line inside indicating the median. The whiskers extend from the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3, respectively. Data points outside the whiskers are considered outliers and are typically shown as individual dots.
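The boxplot statistics described above can be sketched as follows. This is a minimal illustration using Python's standard library; the quartile convention (`method="inclusive"`) and the function name `boxplot_stats` are assumptions, since several quartile definitions exist and the page does not say which one was used.

```python
import statistics


def boxplot_stats(scores):
    """Compute quartiles, whisker ends, and outliers for a list of ratings."""
    data = sorted(scores)
    # Three cut points: Q1, median, Q3 (inclusive quartile convention, assumed).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that lie inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Hypothetical ratings; the value 10 falls below the lower fence.
    print(boxplot_stats([10, 50, 55, 60, 65, 70, 75]))
```

Note that the whiskers stop at actual data points, not at the fence values themselves, which is why a whisker can be shorter than 1.5 times the IQR.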
 
In the raw scores, every WaveTransfer configuration is rated higher on average than both DiffTransfer and Music-STAR, with WTmix16 with BDDM-20 obtaining the highest mean among the models (64.84).
Since some participants did not assign a score of 100 to the condition they considered the best for certain questions, we rescaled the scores. This rescaling ensures that, for each participant and each question, the highest score is normalized to 100.
| Model | Mean ± Margin of Error | 
|---|---|
| Reference | 97.94 ± 1.29 | 
| DiffTransfer | 33.74 ± 4.14 | 
| Music-STAR | 27.73 ± 3.96 | 
| WTglobal16 with BDDM-20 | 60.06 ± 3.82 | 
| WTglobal16 with WG-6 | 61.93 ± 3.81 | 
| WTmix16 with BDDM-20 | 72.12 ± 3.78 | 
| WTmix16 with WG-6 | 63.98 ± 3.85 | 
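The rescaling described above can be sketched as follows. The page only states that the highest score of each participant on each question is normalized to 100; the proportional (multiplicative) scaling used here, and the function name `rescale_to_best`, are assumptions for illustration.

```python
def rescale_to_best(scores):
    """Rescale one participant's scores for one question.

    `scores` maps a condition name to a raw rating in [0, 100].
    Every rating is multiplied by 100 / max so that the best-rated
    condition ends up at exactly 100 (proportional scaling, assumed).
    """
    top = max(scores.values())
    if top == 0:
        # All ratings are zero: nothing to rescale.
        return dict(scores)
    return {condition: 100.0 * value / top for condition, value in scores.items()}


if __name__ == "__main__":
    # Hypothetical raw ratings from one participant on one question.
    raw = {"Reference": 80, "DiffTransfer": 20, "WTmix16 with BDDM-20": 60}
    print(rescale_to_best(raw))
```

Applying this per participant and per question, and then averaging, preserves the ranking of conditions within each response while removing differences in how much of the 0 to 100 scale each participant used.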
 
B. Listening Samples
I. Timbre Transfer at 16 kHz
In this section, we consider the timbre transfer task at a sampling rate of 16 kHz. Note that for both DiffTransfer and Music-STAR, six different models are trained, one for each type of timbre transformation, and that their mixtures are obtained from the models designed specifically for mixture-to-mixture timbre transfer, rather than by combining individual tracks generated by single-instrument models.
1. Piano → Vibraphone
| Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20 | 
|---|---|---|---|---|---|---|
2. Vibraphone → Piano
| Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20 | 
|---|---|---|---|---|---|---|
3. Strings → Clarinet
| Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20 | 
|---|---|---|---|---|---|---|
4. Clarinet → Strings
| Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20 | 
|---|---|---|---|---|---|---|
5. (Piano + strings) → (Vibraphone + Clarinet)
| Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20 | WT16mix with WG-6 | WT16mix with BDDM-20 | 
|---|---|---|---|---|---|---|---|---|
6. (Vibraphone + Clarinet) → (Piano + strings)
| Name | Input (ground truth) | Target (ground truth) | Music-STAR | DiffTransfer | WT16global with WG-6 | WT16global with BDDM-20 | WT16mix with WG-6 | WT16mix with BDDM-20 | 
|---|---|---|---|---|---|---|---|---|
II. Timbre Transfer at 44.1 kHz
1. Piano → Vibraphone
| Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19 | 
|---|---|---|---|---|
2. Vibraphone → Piano
| Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19 | 
|---|---|---|---|---|
3. Strings → Clarinet
| Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19 | 
|---|---|---|---|---|
4. Clarinet → Strings
| Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19 | 
|---|---|---|---|---|
5. (Piano + strings) → (Vibraphone + Clarinet)
| Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19 | WT44mix with WG-6 | WT44mix with BDDM-20 | 
|---|---|---|---|---|---|---|
6. (Vibraphone + Clarinet) → (Piano + strings)
| Name | Input (ground truth) | Target (ground truth) | WT44global with WG-6 | WT44global with BDDM-19 | WT44mix with WG-6 | WT44mix with BDDM-20 | 
|---|---|---|---|---|---|---|
References
[1] Mahshid Alinoori and Vassilios Tzerpos, “Music-STAR: a style translation system for audio-based re-instrumentation,” in Proc. ISMIR, 2022.
[2] Luca Comanducci, Fabio Antonacci, and Augusto Sarti, “Timbre transfer using image-to-image denoising diffusion implicit models,” in Proc. ISMIR, 2023.
[3] Mahshid Alinoori and Vassilios Tzerpos, “Starnet,” Available at https://doi.org/10.5281/zenodo.6917099, August 2022.