Towards Real-Time Musical Agents:
Instrumental Accompaniment with Latent Diffusion Models and MAX/MSP

Tornike Karchkhadze  •  Shlomo Dubnov

University of California San Diego

Links: Paper · LDM model · Python Code · MAX/MSP and C++ Code

Abstract

We propose a framework for a real-time instrumental accompaniment and improvisation system. The project is twofold: we develop a diffusion-based generative model for musical accompaniment, and we build a hybrid system that enables real-time interaction with this model by combining MAX/MSP with a remote Python server. We train our latent diffusion model with look-ahead conditioning and deploy it on a Python server. The MAX/MSP frontend handles real-time audio input, buffering, and playback, communicating with the server via OSC messages.
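The MAX/MSP-to-server link described above exchanges OSC messages. As a minimal sketch of the framing such a link can use, the following encodes and decodes an OSC 1.0 message carrying a single blob argument (e.g. an audio buffer). The `/audio/chunk` address and the choice of a blob payload are illustrative assumptions, not taken from the project code:

```python
import struct

def _pad4(b: bytes) -> bytes:
    # OSC strings and blobs are null-padded to a multiple of 4 bytes
    return b + b"\x00" * (-len(b) % 4)

def encode_osc_blob_message(address: str, payload: bytes) -> bytes:
    """Encode a minimal OSC 1.0 message with one blob argument."""
    addr = _pad4(address.encode() + b"\x00")
    tags = _pad4(b",b\x00")                      # type tag string: one blob
    blob = struct.pack(">i", len(payload)) + _pad4(payload)
    return addr + tags + blob

def decode_osc_blob_message(msg: bytes):
    """Decode a message produced by encode_osc_blob_message."""
    zero = msg.index(b"\x00")                    # end of address string
    address = msg[:zero].decode()
    off = (zero + 1 + 3) & ~3                    # skip padding to 4-byte boundary
    zero = msg.index(b"\x00", off)               # end of type tag string
    assert msg[off:zero] == b",b"
    off = (zero + 1 + 3) & ~3
    (size,) = struct.unpack_from(">i", msg, off) # big-endian blob length
    payload = msg[off + 4 : off + 4 + size]
    return address, payload
```

In the real system these messages would travel over UDP between the MAX/MSP patch and the Python server; the sketch only shows the wire framing.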

This setup enables a musician to plug in and play live within MAX/MSP, while the ML model listens and responds with complementary instrumental parts. We demonstrate inherent trade-offs between offline and real-time regimes: while offline generation offers better coherence, real-time operation requires faster inference and suffers from reduced quality. To address latency bottlenecks, we apply consistency distillation to enable faster, low-latency inference. The fastest configuration achieves real-time performance by generating 2×X seconds of look-ahead audio in X seconds.

System Overview

Latent diffusion model
Accompaniment generation with a latent diffusion model. Input audio is encoded into a 64×64 latent representation via Music2Latent encoder E. A U-Net diffusion model conditioned on the mixture latent and a one-hot instrument label performs iterative denoising. Decoder D reconstructs the target stem waveform. Temporal masking (striped regions) enables look-ahead generation during both training and inference.
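The conditioning described in the caption can be sketched as follows. The 64×64 latent shape and the one-hot instrument label come from the caption; the placement of the mask and the zero fill value are illustrative assumptions:

```python
import numpy as np

N_FRAMES = 64        # time frames in the 64x64 Music2Latent latent
N_INSTRUMENTS = 4    # drums, bass, guitar, piano

def mask_future(latent: np.ndarray, visible: int) -> np.ndarray:
    """Zero the frames past the visible horizon (the striped region)."""
    out = latent.copy()
    out[:, visible:] = 0.0
    return out

def build_conditioning(mix_latent: np.ndarray, instrument: int, visible: int):
    """Assemble the U-Net conditioning: masked mixture latent + one-hot label."""
    one_hot = np.zeros(N_INSTRUMENTS, dtype=np.float32)
    one_hot[instrument] = 1.0
    return mask_future(mix_latent, visible), one_hot

mix = np.random.randn(64, N_FRAMES).astype(np.float32)  # latent from encoder E
cond, label = build_conditioning(mix, instrument=1, visible=48)
```

Varying `visible` at training time is what lets the model outpaint different look-ahead depths at inference.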
Sliding-window streaming protocol
Sliding-window protocol for real-time accompaniment generation. Three regimes are illustrated corresponding to look-ahead depths w ∈ {−1, 0, 1}: Retrospective — prediction on already observed content (offline upper bound); Immediate — prediction without look-ahead (impractical under non-negligible inference latency); Look-ahead — future audio generated ahead of playback time, enabling real-time streaming.
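The three regimes reduce to a small timing model. A sketch, under the simplifying assumption that each sliding-window step (of length r times the context window) must be fully generated before playback reaches it:

```python
def step_seconds(context_seconds: float, r: float) -> float:
    """One sliding-window step advances by r times the context length."""
    return r * context_seconds

def streams_in_realtime(w: int, inference_seconds: float, step: float) -> bool:
    """Only the look-ahead regime (w >= 1) can stream: the next step must be
    ready before playback needs it. Retrospective (w = -1) is offline;
    immediate (w = 0) leaves no time budget at all."""
    return w >= 1 and inference_seconds < step

step = step_seconds(6.0, 0.25)            # 1.5 s per step
ok = streams_in_realtime(1, 0.981, step)  # diffusion model: 981 ms < 1500 ms
```

With these numbers, the diffusion model streams only in the look-ahead regime; the same inference time with w = 0 fails because generation and playback would contend for the same instant.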

Audio Examples

All examples are generated from a single held-out test track from the Slakh2100 dataset, using a step ratio of r = 0.25 and a 6-second context window. For each condition, we show the generated stem alongside the full mix (generated stem + input context). The ground-truth stems are shown once below for reference.

The excerpt is a MIDI rendition of I Will Survive by Gloria Gaynor, seconds 24–55. We deliberately chose this passage for its musical contrast: it opens with a sparse moment where only the piano plays, then transitions into a full rhythmic section with all instruments. This makes it a challenging and revealing test for the model across all three conditions.

Full Mix (Ground Truth)

The complete mixture of all four stems — this is the piece the models are learning to accompany.

[Audio player: Full Mix]

Ground Truth Stems

[Audio players: Drums, Bass, Guitar, Piano]
Diffusion Model (LDM, 5 denoising steps · ~981 ms cycle)
Retrospective (w = −1) · Offline upper bound · Full context available

Model generates each instrument with complete past and future context visible — best achievable quality.

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Immediate (w = 0) · No look-ahead · Causal outpainting

Model generates the next segment with no future visibility — impractical for live use as inference time exceeds audio buffer duration.

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Look-ahead (w = 1) · Real-time streaming · Future audio generated ahead of playback

Model predicts future segments ahead of playback time, enabling uninterrupted real-time accompaniment. Satisfies real-time constraint at r = 0.25 (981 ms < 1500 ms step).

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Consistency Distillation Model (CD, 2 steps · ~589 ms cycle · 1.66× faster)
Retrospective (w = −1) · Offline upper bound · Full context available

CD model with full context. Slightly lower quality than diffusion in this regime, but still strong.

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Immediate (w = 0) · No look-ahead · Causal outpainting

CD model without look-ahead. Thanks to reduced inference time, this condition is closer to practical real-time use than the diffusion model.

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]
Look-ahead (w = 1) · Real-time streaming · Future audio generated ahead of playback

CD model with look-ahead. Satisfies real-time constraint even at r = 0.125 (497 ms < 750 ms), giving finer-grained temporal updates than the diffusion model.
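The budget behind this claim checks out arithmetically (numbers from the text; the 6-second context window is assumed to carry over from the diffusion setup):

```python
# Real-time budget for the CD model at the finer step ratio r = 0.125
context, r = 6.0, 0.125
step_ms = 1000 * r * context   # 750 ms per sliding-window step
cd_cycle_ms = 497              # reported CD inference cycle at this setting
realtime_ok = cd_cycle_ms < step_ms   # CD generation stays ahead of playback
```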

[Audio players: Drums, Bass, Guitar, and Piano, each with Ground Truth, Generated Stem, and Mix (Generated + Context)]

MAX/MSP Live Demo

The following video demonstrates the real-time system in operation: a musician performs live within the MAX/MSP patch while the AI model listens and generates complementary instrumental parts.

Video coming soon.