Brain2Music: Reconstructing Music from Human Brain Activity

Authors anonymized

Abstract  The process of reconstructing experiences from human brain activity offers a unique lens into how the brain interprets and represents the world. In this paper, we introduce a method for reconstructing music from brain activity, captured using functional magnetic resonance imaging (fMRI). Our approach uses either music retrieval or the MusicLM music generation model conditioned on embeddings derived from fMRI data. The generated music resembles the musical stimuli that human subjects experienced, with respect to semantic properties like genre, instrumentation, and mood. We investigate the relationship between different components of MusicLM and brain activity through a voxel-wise encoding modeling analysis. Furthermore, we discuss which brain regions represent information derived from purely textual descriptions of music stimuli.

Music Reconstruction with MusicLM (Highlights)

This section contains three manually selected highlights (the best out of 10). The leftmost column contains the stimulus, i.e., the music that our human test subjects listened to while their brain activity was recorded. The following three columns contain three samples from MusicLM that aim to reconstruct the original music.
An overview of our Brain2Music pipeline: high-dimensional fMRI responses are condensed into the semantic, 128-dimensional music embedding space of MuLan (Huang et al., 2022). The music generation model MusicLM (Agostinelli et al., 2023) is then conditioned on this embedding to generate a music reconstruction resembling the original stimulus. As an alternative, we consider retrieving music from a large database instead of generating it.
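
To make the pipeline concrete, the sketch below shows one possible implementation of the fMRI-to-MuLan mapping and the generation step. This is a minimal illustration, not our released code: the ridge-regression decoder is an assumed choice of linear map, and music_lm_generate is a hypothetical stand-in for conditioning MusicLM on a predicted embedding.

    import numpy as np
    from sklearn.linear_model import Ridge

    def fit_decoder(fmri_train, mulan_train):
        """Learn a linear map from fMRI responses (n_clips, n_voxels)
        to the 128-d MuLan music embedding space (n_clips, 128)."""
        decoder = Ridge(alpha=1.0)
        decoder.fit(fmri_train, mulan_train)
        return decoder

    def reconstruct(decoder, fmri_test):
        """Predict MuLan embeddings for held-out fMRI responses and
        condition the music generator on them."""
        predicted_embeddings = decoder.predict(fmri_test)  # (n_test, 128)
        # Hypothetical conditioning call; no such public MusicLM function exists.
        # audio_clips = [music_lm_generate(embedding=e) for e in predicted_embeddings]
        return predicted_embeddings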

Comparison of Retrieval and Generation

Below we compare retrieval from the Free Music Archive (FMA) with generation using MusicLM (three generated samples per stimulus). The results are randomly selected samples from test subject 1, with one row for each of the 10 GTZAN genres.
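
As a rough sketch of the retrieval variant, assuming the predicted MuLan embedding is compared against precomputed MuLan embeddings of FMA clips by cosine similarity (the function and array names below are illustrative, not from our codebase):

    import numpy as np

    def retrieve_nearest(predicted, fma_embeddings, top_k=1):
        """Return indices of the FMA clips whose MuLan embeddings are most
        cosine-similar to each predicted embedding.
        predicted: (n_test, 128); fma_embeddings: (n_fma, 128)."""
        p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
        f = fma_embeddings / np.linalg.norm(fma_embeddings, axis=1, keepdims=True)
        similarity = p @ f.T                      # (n_test, n_fma)
        return np.argsort(-similarity, axis=1)[:, :top_k]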

Comparison Across Subjects

Below we compare retrieval from FMA with generation using MusicLM across all five subjects for whom fMRI data was collected. The results are randomly selected samples, with one row for each of the 10 GTZAN genres.

Encoding: Whole-brain Voxel-wise Modeling

By constructing a brain encoding model, we find that two components of MusicLM (MuLan and w2v-BERT) have some degree of correspondence with human brain activity in the auditory cortex.
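
A minimal sketch of such a voxel-wise encoding analysis, under the assumption that a ridge regression maps stimulus features (e.g., MuLan or w2v-BERT embeddings of the music clips) to each voxel's response, scored by per-voxel correlation on held-out clips:

    import numpy as np
    from sklearn.linear_model import Ridge

    def encoding_scores(features_train, fmri_train, features_test, fmri_test):
        """Fit a multi-output ridge regression (one linear model per voxel) and
        return the Pearson correlation between predicted and measured responses
        for every voxel on held-out data.
        features_*: (n_clips, n_features); fmri_*: (n_clips, n_voxels)."""
        model = Ridge(alpha=1.0)
        model.fit(features_train, fmri_train)
        pred = model.predict(features_test)               # (n_test, n_voxels)
        pred_c = pred - pred.mean(axis=0)
        meas_c = fmri_test - fmri_test.mean(axis=0)
        corr = (pred_c * meas_c).sum(axis=0) / (
            np.linalg.norm(pred_c, axis=0) * np.linalg.norm(meas_c, axis=0) + 1e-8)
        return corr                                       # (n_voxels,)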

We also find that the brain regions representing information derived from text and music overlap.

GTZAN Music Captions

We release a music caption dataset for the subset of GTZAN clips for which there are fMRI scans. Below are ten examples from the dataset.