05/07/2026
By Danielle Fretwell

The Francis College of Engineering, Department of Electrical and Computer Engineering, invites you to attend a Master's Thesis defense by Emmanuel Karakatsanis titled: "Expressive MIDI-to-Audio AI Synthesis Using Differentiable DSP and Transformer Networks."

Candidate Name: Emmanuel Karakatsanis
Degree: Master’s
Defense Date: Friday, May 22, 2026
Time: 10 a.m. - noon
Location: This will be a virtual defense via Zoom. Those interested in attending should contact the committee chair, Dalila Megherbi, at Dalila_Megherbi@uml.edu at least 24 hours before the defense to request access to the meeting.

Committee:

  • Advisor: Dalila Megherbi, Ph.D., Professor, Electrical and Computer Engineering, UMass Lowell
  • Xuejun Lu, Ph.D., Professor, Electrical and Computer Engineering, UMass Lowell
  • Hengyong Yu, Ph.D., Professor, Electrical and Computer Engineering, UMass Lowell

Abstract: The ability to generate realistic musical audio from symbolic representations has significant implications for music production, film scoring, game audio, and accessibility. Current digital audio workstations rely on static sample libraries and manual editing to bridge the gap between Musical Instrument Digital Interface (MIDI) notation and expressive audio, a process that is time-consuming and limited in its ability to capture the nuance of live performance. 

Advances in artificial intelligence and differentiable signal processing now make it possible to learn the complex relationship between symbolic music input and realistic instrument audio directly from data, opening new possibilities for automated and controllable music synthesis. 

This thesis presents two complementary approaches for generating realistic instrument audio from MIDI input, both built on the Differentiable Digital Signal Processing (DDSP) framework. The first proposed approach, referred to as the base transformer, uses a single end-to-end causal transformer to predict DDSP synthesis parameters from seven MIDI-derived input features. The model is trained in two phases: first on the GoodSounds dataset for clean timbre learning, then fine-tuned on the URMP dataset to model musical context. A proposed expression module bridges the gap between flat MIDI features at inference time and the expressive features the transformer expects, using rule-based vibrato combined with a learned loudness model. The second proposed approach, the timbre-conditioned transformer, extends the base transformer by adding a 16-dimensional latent embedding extracted from a reference recording as an additional input. This allows users to control the output's specific timbral character. An autoencoder parameter blending scheme further combines the timbre fidelity of the reference recording with the musical context learned by the transformer. 
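For intuition only, the sketch below illustrates the general shape of such a model: a causal transformer that maps per-frame MIDI-derived features, concatenated with a 16-dimensional timbre embedding taken from a reference recording, to DDSP-style synthesis parameters (harmonic amplitudes and filtered-noise magnitudes). This is not the candidate's code; all layer sizes, head names, and the choice of PyTorch are assumptions, with only the seven input features and 16-dimensional embedding taken from the abstract.

```python
import torch
import torch.nn as nn

class MidiToDDSPTransformer(nn.Module):
    """Hypothetical sketch of a timbre-conditioned MIDI-to-DDSP transformer.

    Maps per-frame MIDI-derived features plus a 16-d timbre embedding to
    harmonic amplitudes and noise-filter magnitudes for a DDSP-style
    harmonic-plus-noise synthesizer. Dimensions other than the 7 input
    features and 16-d embedding are illustrative guesses.
    """

    def __init__(self, n_midi_feats=7, timbre_dim=16, d_model=256,
                 n_layers=4, n_harmonics=60, n_noise_bands=65):
        super().__init__()
        self.input_proj = nn.Linear(n_midi_feats + timbre_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8,
                                           dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Output heads for synthesizer controls; a real DDSP decoder would
        # also rescale these to positive amplitude ranges (e.g., exp-sigmoid).
        self.harm_head = nn.Linear(d_model, n_harmonics)
        self.noise_head = nn.Linear(d_model, n_noise_bands)

    def forward(self, midi_feats, timbre_embed):
        # midi_feats: (batch, frames, 7); timbre_embed: (batch, 16)
        frames = midi_feats.shape[1]
        timbre = timbre_embed.unsqueeze(1).expand(-1, frames, -1)
        x = self.input_proj(torch.cat([midi_feats, timbre], dim=-1))
        # Causal (upper-triangular) mask so each frame attends only to
        # current and past frames.
        mask = torch.triu(torch.full((frames, frames), float("-inf")),
                          diagonal=1)
        h = self.encoder(x, mask=mask)
        return self.harm_head(h), self.noise_head(h)

# Minimal usage example with random data.
model = MidiToDDSPTransformer()
midi = torch.randn(2, 250, 7)   # 2 clips, 250 frames, 7 MIDI-derived features
timbre = torch.randn(2, 16)     # 16-d embedding from a reference recording
harm_amps, noise_mags = model(midi, timbre)
print(harm_amps.shape, noise_mags.shape)  # (2, 250, 60) and (2, 250, 65)
```

A base-transformer variant of this sketch would simply omit the timbre embedding and project the seven MIDI-derived features alone.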

A proposed multi-scale context training strategy improves the model’s awareness of phrase-level musical structure. The approach is evaluated across four instruments: violin, cello, clarinet, and saxophone. Quantitative evaluation demonstrates that the timbre-conditioned transformer achieves an 18.8% reduction in reconstruction loss compared to the base transformer on held-out URMP data, and both transformer variants outperform a frame-independent autoencoder baseline. A four-way instrument classification experiment confirms that each instrument model produces a distinct, identifiable timbre, with 99.5% classification accuracy. Detailed experimental results, including spectral comparisons, reconstruction accuracy metrics, blend parameter analysis, and synthesis method comparisons, are presented to show the potential value of the proposed methods.
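As a rough illustration of the kind of parameter blending analyzed above, the snippet below shows a simple convex combination of per-frame synthesis parameters from the autoencoder and the transformer. The abstract does not specify the actual blending rule or how the blend weight is chosen, so the function name, the single scalar weight, and the tensor shapes here are assumptions.

```python
import torch

def blend_parameters(autoencoder_params: torch.Tensor,
                     transformer_params: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical convex blend of per-frame DDSP synthesis parameters.

    alpha near 1.0 keeps more of the autoencoder's reconstruction of the
    reference recording (timbre fidelity); alpha near 0.0 keeps more of the
    transformer's context-aware prediction. The thesis's actual blending
    scheme may use per-parameter or learned weights instead.
    """
    alpha = min(max(alpha, 0.0), 1.0)  # keep the blend weight in [0, 1]
    return alpha * autoencoder_params + (1.0 - alpha) * transformer_params

# Usage example with dummy harmonic-amplitude tensors
# (2 clips, 250 frames, 60 harmonics).
ae_out = torch.rand(2, 250, 60)
tf_out = torch.rand(2, 250, 60)
blended = blend_parameters(ae_out, tf_out, alpha=0.7)
print(blended.shape)  # torch.Size([2, 250, 60])
```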