Towards Real-Time Musical Agents:
Instrumental Accompaniment with Latent Diffusion Models and MAX/MSP

Tornike Karchkhadze  •  Shlomo Dubnov

University of California San Diego

Links: Paper · LDM model · Python Code · MAX/MSP and C++ Code

Abstract

We propose a framework for a real-time instrumental accompaniment and improvisation system. The project is twofold: we develop a diffusion-based generative model for musical accompaniment, and we build a hybrid system that enables real-time interaction with this model by combining MAX/MSP with a remote Python server. We train our latent diffusion model with look-ahead conditioning and deploy it on a Python server. The MAX/MSP frontend handles real-time audio input, buffering, and playback, communicating with the server via OSC messages.
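The MAX/MSP-to-server link described above exchanges OSC messages. As a minimal sketch of the framing such a link can use, the following encodes and decodes an OSC 1.0 message carrying a single blob argument (e.g. an audio buffer). The `/audio/chunk` address and the choice of a blob payload are illustrative assumptions, not taken from the project code:

```python
import struct

def _pad4(b: bytes) -> bytes:
    # OSC strings and blobs are null-padded to a multiple of 4 bytes
    return b + b"\x00" * (-len(b) % 4)

def encode_osc_blob_message(address: str, payload: bytes) -> bytes:
    """Encode a minimal OSC 1.0 message with one blob argument."""
    addr = _pad4(address.encode() + b"\x00")
    tags = _pad4(b",b\x00")                      # type tag string: one blob
    blob = struct.pack(">i", len(payload)) + _pad4(payload)
    return addr + tags + blob

def decode_osc_blob_message(msg: bytes):
    """Decode a message produced by encode_osc_blob_message."""
    zero = msg.index(b"\x00")                    # end of address string
    address = msg[:zero].decode()
    off = (zero + 1 + 3) & ~3                    # skip padding to 4-byte boundary
    zero = msg.index(b"\x00", off)               # end of type tag string
    assert msg[off:zero] == b",b"
    off = (zero + 1 + 3) & ~3
    (size,) = struct.unpack_from(">i", msg, off) # big-endian blob length
    payload = msg[off + 4 : off + 4 + size]
    return address, payload
```

In the real system these messages would travel over UDP between the MAX/MSP patch and the Python server; the sketch only shows the wire framing.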

This setup enables a musician to plug in and play live within MAX/MSP, while the ML model listens and responds with complementary instrumental parts. We demonstrate inherent trade-offs between offline and real-time regimes: while offline generation offers better coherence, real-time operation requires faster inference and suffers from reduced quality. To address latency bottlenecks, we apply consistency distillation to enable faster, low-latency inference. The fastest configuration achieves real-time performance by generating 2×X seconds of look-ahead audio in X seconds.

System Overview

Latent diffusion model
Accompaniment generation with a latent diffusion model. Input audio is encoded into a 64×64 latent representation via Music2Latent encoder E. A U-Net diffusion model conditioned on the mixture latent and a one-hot instrument label performs iterative denoising. Decoder D reconstructs the target stem waveform. Temporal masking (striped regions) enables look-ahead generation during both training and inference.
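The conditioning described in the caption can be sketched as follows. The 64×64 latent shape and the one-hot instrument label come from the caption; the placement of the mask and the zero fill value are illustrative assumptions:

```python
import numpy as np

N_FRAMES = 64        # time frames in the 64x64 Music2Latent latent
N_INSTRUMENTS = 4    # drums, bass, guitar, piano

def mask_future(latent: np.ndarray, visible: int) -> np.ndarray:
    """Zero the frames past the visible horizon (the striped region)."""
    out = latent.copy()
    out[:, visible:] = 0.0
    return out

def build_conditioning(mix_latent: np.ndarray, instrument: int, visible: int):
    """Assemble the U-Net conditioning: masked mixture latent + one-hot label."""
    one_hot = np.zeros(N_INSTRUMENTS, dtype=np.float32)
    one_hot[instrument] = 1.0
    return mask_future(mix_latent, visible), one_hot

mix = np.random.randn(64, N_FRAMES).astype(np.float32)  # latent from encoder E
cond, label = build_conditioning(mix, instrument=1, visible=48)
```

Varying `visible` at training time is what lets the model outpaint different look-ahead depths at inference.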
Sliding-window streaming protocol
Sliding-window protocol for real-time accompaniment generation. Three regimes are illustrated corresponding to look-ahead depths w ∈ {−1, 0, 1}: Retrospective — prediction on already observed content (offline upper bound); Immediate — prediction without look-ahead (impractical under non-negligible inference latency); Look-ahead — future audio generated ahead of playback time, enabling real-time streaming.
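The three regimes reduce to a small timing model. A sketch, under the simplifying assumption that each sliding-window step (of length r times the context window) must be fully generated before playback reaches it:

```python
def step_seconds(context_seconds: float, r: float) -> float:
    """One sliding-window step advances by r times the context length."""
    return r * context_seconds

def streams_in_realtime(w: int, inference_seconds: float, step: float) -> bool:
    """Only the look-ahead regime (w >= 1) can stream: the next step must be
    ready before playback needs it. Retrospective (w = -1) is offline;
    immediate (w = 0) leaves no time budget at all."""
    return w >= 1 and inference_seconds < step

step = step_seconds(6.0, 0.25)            # 1.5 s per step
ok = streams_in_realtime(1, 0.981, step)  # diffusion model: 981 ms < 1500 ms
```

With these numbers, the diffusion model streams only in the look-ahead regime; the same inference time with w = 0 fails because generation and playback would contend for the same instant.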

Audio Examples

All examples are generated from a single held-out test track from the Slakh2100 dataset, using a step ratio of r = 0.25 and a 6-second context window. For each condition, we show the generated stem alongside the full mix (generated stem + input context). The ground-truth stems are shown once below for reference.

The excerpt is a MIDI rendition of I Will Survive by Gloria Gaynor, seconds 24–55. We deliberately chose this passage for its musical contrast: it opens with a sparse moment where only the piano plays, then transitions into a full rhythmic section with all instruments. This makes it a challenging and revealing test for the model across all three conditions.

Full Mix (Ground Truth)

The complete mixture of all four stems — this is the piece the models are learning to accompany.

[Audio player: Full Mix]

Ground Truth Stems

[Audio players: Drums, Bass, Guitar, Piano]
Diffusion Model (LDM, 5 denoising steps · ~981 ms cycle)
Retrospective (w = −1) · Offline upper bound · Full context available

Model generates each instrument with complete past and future context visible — best achievable quality.

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Immediate (w = 0) · No look-ahead · Causal outpainting

Model generates the next segment with no future visibility — impractical for live use as inference time exceeds audio buffer duration.

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Look-ahead (w = 1) · Real-time streaming · Future audio generated ahead of playback

Model predicts future segments ahead of playback time, enabling uninterrupted real-time accompaniment. Satisfies real-time constraint at r = 0.25 (981 ms < 1500 ms step).

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Consistency Distillation Model (CD, 2 steps · ~589 ms cycle · 1.66× faster)
Retrospective (w = −1) · Offline upper bound · Full context available

CD model with full context. Slightly lower quality than diffusion in this regime, but still strong.

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Immediate (w = 0) · No look-ahead · Causal outpainting

CD model without look-ahead. Thanks to reduced inference time, this condition is closer to practical real-time use than the diffusion model.

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Look-ahead (w = 1) · Real-time streaming · Future audio generated ahead of playback

CD model with look-ahead. Satisfies real-time constraint even at r = 0.125 (497 ms < 750 ms), giving finer-grained temporal updates than the diffusion model.
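The budget behind this claim checks out arithmetically (numbers from the text; the 6-second context window is assumed to carry over from the diffusion setup):

```python
# Real-time budget for the CD model at the finer step ratio r = 0.125
context, r = 6.0, 0.125
step_ms = 1000 * r * context   # 750 ms per sliding-window step
cd_cycle_ms = 497              # reported CD inference cycle at this setting
realtime_ok = cd_cycle_ms < step_ms   # CD generation stays ahead of playback
```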

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]

MAX/MSP Live Demo

The following video demonstrates the real-time system in operation: a musician performs live within the MAX/MSP patch while the AI model listens and generates complementary instrumental parts.

Video coming soon.