Abstract
We propose a framework for real-time instrumental accompaniment and improvisation. The project is twofold: we develop a latent diffusion model for musical accompaniment, trained with look-ahead conditioning, and we build a hybrid system that enables real-time interaction with it by pairing MAX/MSP with a remote Python inference server. The MAX/MSP frontend handles real-time audio input, buffering, and playback, communicating with the server via OSC messages.
This setup enables a musician to plug in and play live within MAX/MSP, while the ML model listens and responds with complementary instrumental parts. We demonstrate inherent trade-offs between offline and real-time regimes: while offline generation offers better coherence, real-time operation requires faster inference and suffers from reduced quality. To address latency bottlenecks, we apply consistency distillation to enable faster, low-latency inference. The fastest configuration achieves real-time performance by generating 2×X seconds of look-ahead audio in X seconds.
System Overview
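The frontend and the inference server exchange OSC messages, typically carried over UDP. As a rough illustration of the wire format only (the project's actual message schema and addresses are not shown here, and the `/accompany/start` address below is hypothetical), here is a minimal OSC 1.0 message encoder in pure Python:

```python
import struct

def _pad(data: bytes) -> bytes:
    """OSC strings are null-terminated, then padded to a 4-byte boundary."""
    return data + b"\x00" * (4 - len(data) % 4)

def encode_osc(address: str, *args: float) -> bytes:
    """Encode one OSC 1.0 message whose arguments are all float32."""
    msg = _pad(address.encode("ascii"))                 # address pattern
    msg += _pad(("," + "f" * len(args)).encode("ascii"))  # type tag string
    for value in args:
        msg += struct.pack(">f", value)  # OSC floats are big-endian IEEE 754
    return msg

# Hypothetical address, for illustration only:
packet = encode_osc("/accompany/start", 0.25)
assert len(packet) % 4 == 0 and packet.startswith(b"/accompany/start\x00")
```

In practice a library such as python-osc handles this framing; the sketch is only meant to show that each message is an address pattern, a type tag string, and 4-byte-aligned arguments.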
Audio Examples
All examples are generated from a single held-out test track from the Slakh2100 dataset, using a step ratio of r = 0.25 and a 6-second context window. For each condition, we show the generated stem alongside the full mix (generated stem + input context). The ground truth stems are shown once below for reference.
The excerpt is a MIDI rendition of I Will Survive by Gloria Gaynor, seconds 24–55. We deliberately chose this passage for its musical contrast: it opens with a sparse moment where only the piano plays, then transitions into a full rhythmic section with all instruments. This makes it a challenging and revealing test for the model across all three conditions.
Full Mix (Ground Truth)
The complete mixture of all four stems — this is the piece the models are learning to accompany.
Ground Truth Stems
Diffusion: Offline (Full Context)
The model generates each instrument with complete past and future context visible: the best achievable quality in this setup.
| Instrument | Ground Truth | Generated Stem | Mix (Generated + Context) |
|---|---|---|---|
| *(audio players for each of the four stems appear here on the interactive page)* | | | |
Diffusion: Online (No Look-Ahead)
The model generates the next segment with no future visibility. This is impractical for live use, since inference time exceeds the audio buffer duration.
| Instrument | Ground Truth | Generated Stem | Mix (Generated + Context) |
|---|---|---|---|
| *(audio players for each of the four stems appear here on the interactive page)* | | | |
Diffusion: Online (With Look-Ahead)
The model predicts future segments ahead of playback time, enabling uninterrupted real-time accompaniment. This satisfies the real-time constraint at r = 0.25 (981 ms inference < 1500 ms step).
| Instrument | Ground Truth | Generated Stem | Mix (Generated + Context) |
|---|---|---|---|
| *(audio players for each of the four stems appear here on the interactive page)* | | | |
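The real-time figures quoted on this page reduce to a simple budget check: one generation step must be produced faster than it plays back, where a step's playback duration is the step ratio times the context window. A minimal sketch, using the 6 s window and the inference timings reported here:

```python
def step_duration_s(context_s: float, r: float) -> float:
    """Playback time covered by one generation step: step ratio × context window."""
    return r * context_s

def meets_realtime(inference_s: float, context_s: float, r: float) -> bool:
    """Real-time iff a step is generated faster than it takes to play back."""
    return inference_s < step_duration_s(context_s, r)

# Figures from this page, with a 6 s context window:
print(meets_realtime(0.981, 6.0, r=0.25))   # diffusion + look-ahead: 981 ms < 1500 ms -> True
print(meets_realtime(0.497, 6.0, r=0.125))  # CD + look-ahead: 497 ms < 750 ms -> True
```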
Consistency Distillation: Offline (Full Context)
The CD model with full context. Slightly lower quality than the diffusion model in this regime, but still strong.
| Instrument | Ground Truth | Generated Stem | Mix (Generated + Context) |
|---|---|---|---|
| *(audio players for each of the four stems appear here on the interactive page)* | | | |
Consistency Distillation: Online (No Look-Ahead)
The CD model without look-ahead. Thanks to its reduced inference time, this condition comes closer to practical real-time use than the corresponding diffusion model.
| Instrument | Ground Truth | Generated Stem | Mix (Generated + Context) |
|---|---|---|---|
| *(audio players for each of the four stems appear here on the interactive page)* | | | |
Consistency Distillation: Online (With Look-Ahead)
The CD model with look-ahead. It satisfies the real-time constraint even at r = 0.125 (497 ms < 750 ms step), giving finer-grained temporal updates than the diffusion model.
| Instrument | Ground Truth | Generated Stem | Mix (Generated + Context) |
|---|---|---|---|
| *(audio players for each of the four stems appear here on the interactive page)* | | | |
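In the look-ahead conditions, generation of step n+1 overlaps with playback of step n, so playback stays gap-free over a whole performance exactly when each step is generated faster than it plays back. A toy simulation of that schedule (generation is sequential, playback begins once the first step is ready; this deadline model is an assumption for illustration, not the project's exact scheduler):

```python
def gap_free(inference_s: float, step_s: float, n_steps: int = 64) -> bool:
    """Step n finishes generating at (n+1)*inference_s and must start playing
    at inference_s + n*step_s (playback begins once step 0 is ready)."""
    return all(
        (n + 1) * inference_s <= inference_s + n * step_s
        for n in range(n_steps)
    )

print(gap_free(0.981, 1.5))   # look-ahead at r = 0.25: stays ahead -> True
print(gap_free(0.981, 0.75))  # same 981 ms model at r = 0.125: falls behind -> False
```

Under this model the schedule either stays ahead indefinitely or falls behind by a growing margin, which is why the per-step inequality is the relevant real-time criterion.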
MAX/MSP Live Demo
The following video demonstrates the real-time system in operation: a musician performs live within the MAX/MSP patch while the AI model listens and generates complementary instrumental parts.
Video coming soon.