An End-to-End Unsupervised Network for Timbre Transfer in Musical Applications

A successful application of end-to-end computational intelligence in the field of image processing is so-called “style transfer”, i.e., the creative modification of a “content” image by applying the textures, strokes, and colors of a reference image that provides the stylistic information. In the audio field, however, and specifically for music signals, the task has yet to be properly defined and addressed. In this paper we pose the problem in more rigorous terms, provide an overview of state-of-the-art algorithms proposed in neighboring research areas, and discuss timbre learning as a first step toward style representation. A dedicated neural network architecture is proposed that yields promising first results in timbre transfer compared to previous work.

Sound examples: female voice to distorted guitar

Original
Target timbre
MFCC k-nn matching
Spectral flattening
Proposed approach trained on distorted guitar
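The “MFCC k-nn matching” baseline above is not described in detail on this page; a minimal numpy-only sketch of one plausible reading, replacing each source frame with the target-corpus frame whose MFCCs are nearest (all parameter choices here are illustrative assumptions):

```python
import numpy as np

def frames(x, win=1024, hop=256):
    """Slice a signal into overlapping Hann-windowed frames."""
    n = 1 + max(0, (len(x) - win) // hop)
    w = np.hanning(win)
    return np.stack([x[i * hop:i * hop + win] * w for i in range(n)])

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank evaluated on FFT bin centres."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(0.0, hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(x, sr, n_mels=26, n_ceps=13, win=1024, hop=256):
    """Per-frame MFCCs: power spectrum -> mel energies -> log -> DCT-II."""
    P = np.abs(np.fft.rfft(frames(x, win, hop), axis=1)) ** 2
    E = np.log(P @ mel_filterbank(n_mels, win, sr).T + 1e-10)
    k = np.arange(n_mels)
    D = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_mels)
    return E @ D.T

def knn_match(src, tgt, sr, win=1024, hop=256):
    """Replace each source frame by the target frame with the closest MFCCs."""
    cs, ct = mfcc(src, sr, win=win, hop=hop), mfcc(tgt, sr, win=win, hop=hop)
    Ft = frames(tgt, win, hop)
    out = np.zeros(len(src))
    for i, c in enumerate(cs):
        j = np.argmin(np.sum((ct - c) ** 2, axis=1))  # 1-NN in MFCC space
        out[i * hop:i * hop + win] += Ft[j]           # overlap-add
    return out
```

This frame-concatenation baseline preserves the source's coarse spectral trajectory while borrowing the target's frames outright, which is why it serves as a natural point of comparison for a learned transfer model.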

Sound examples: female voice to breathy female voice

Original
Target timbre
MFCC k-nn matching
Spectral flattening
Proposed approach trained on breathy female voice
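The “spectral flattening” baseline is likewise not specified here; a hypothetical numpy-only sketch of one common interpretation, whitening each STFT frame by dividing out a smooth spectral envelope estimated from the low-quefrency cepstrum (window, hop, and lifter values are assumptions):

```python
import numpy as np

def spectral_flatten(x, win=1024, hop=256, lifter=30):
    """Whiten a signal frame by frame: estimate a smooth spectral envelope
    via low-quefrency cepstral liftering and divide it out of the STFT,
    keeping only the fine spectral structure."""
    w = np.hanning(win)
    n = 1 + max(0, (len(x) - win) // hop)
    out = np.zeros(len(x))
    for i in range(n):
        seg = x[i * hop:i * hop + win] * w
        spec = np.fft.rfft(seg)
        logmag = np.log(np.abs(spec) + 1e-10)
        ceps = np.fft.irfft(logmag)
        ceps[lifter:-lifter] = 0.0            # keep only the slow envelope part
        env = np.exp(np.fft.rfft(ceps).real)  # smooth spectral envelope
        flat = spec / (env + 1e-10)           # remove envelope, keep fine structure
        out[i * hop:i * hop + win] += np.fft.irfft(flat, n=win) * w
    return out
```

Because the envelope carries much of what is perceived as timbre, flattening it gives a timbre-neutralized signal, a useful reference point when judging how much timbral character a transfer method actually imposes.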

Figures: (todo)