Brain2Music: Reconstructing Music from Human Brain Activity

Authors anonymized

Abstract

The process of reconstructing experiences from human brain activity offers a unique lens into how the brain interprets and represents the world. In this paper, we introduce a method for reconstructing music from brain activity, captured using functional magnetic resonance imaging (fMRI). Our approach uses either music retrieval or the MusicLM music generation model conditioned on embeddings derived from fMRI data. The generated music resembles the musical stimuli that human subjects experienced, with respect to semantic properties like genre, instrumentation, and mood. We investigate the relationship between different components of MusicLM and brain activity through a voxel-wise encoding modeling analysis. Furthermore, we discuss which brain regions represent information derived from purely textual descriptions of music stimuli.

Music Reconstruction with MusicLM (Highlights)

This section contains three manually selected highlights (the best out of 10). The left-most column contains the stimulus, i.e., the music that our human test subjects were exposed to while their brain activity was recorded. The following three columns contain three samples from MusicLM, each of which aims to reconstruct the original music.

Stimulus (GTZAN music)	Reconstructions (generated by MusicLM)

An overview of our Brain2Music pipeline: High-dimensional fMRI responses are condensed into the semantic, 128-dimensional music embedding space of MuLan (Huang et al., 2022). Subsequently, the music generation model, MusicLM (Agostinelli et al., 2023), is conditioned on this embedding to generate a music reconstruction that resembles the original stimulus. As an alternative, we consider retrieving music from a large database instead of generating it.
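The first step of the pipeline, mapping fMRI responses to a 128-dimensional embedding, can be illustrated with a regularized linear map. The following is a minimal sketch under stated assumptions, not the authors' implementation: this page does not specify the regression method, so closed-form ridge regression is an assumed choice, the data are synthetic, and all names (`fit_ridge`, `fmri_train`, `emb_train`, the voxel count of 50) are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

# Synthetic stand-ins for real data (assumption: no real fMRI is used here).
rng = np.random.default_rng(0)
n_train, n_test, n_voxels, emb_dim = 200, 20, 50, 128
true_weights = rng.normal(size=(n_voxels, emb_dim)) / np.sqrt(n_voxels)

fmri_train = rng.normal(size=(n_train, n_voxels))   # one row per music clip
fmri_test = rng.normal(size=(n_test, n_voxels))
emb_train = fmri_train @ true_weights + 0.1 * rng.normal(size=(n_train, emb_dim))
emb_test = fmri_test @ true_weights                  # noise-free targets for checking

# Fit the voxel-to-embedding map, then predict embeddings for held-out scans.
weights = fit_ridge(fmri_train, emb_train, alpha=1.0)
emb_pred = fmri_test @ weights

print(emb_pred.shape)  # one 128-dimensional embedding per test scan
```

In a pipeline of this shape, the predicted embedding would then serve either as the conditioning signal for the generation model or as the query for retrieval.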

Comparison of Retrieval and Generation

Below we compare retrieval from the Free Music Archive (FMA) dataset with generation using MusicLM (three samples). The results are random samples from test subject 1. There is one row for each of the 10 GTZAN genres.
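The retrieval alternative can be sketched as a nearest-neighbor search in the embedding space. This is a hedged illustration only: the similarity measure is not stated on this page, so cosine similarity is an assumption, the database is random synthetic data rather than real FMA embeddings, and `retrieve_top_k` is a hypothetical helper name.

```python
import numpy as np

def retrieve_top_k(query, database, k=1):
    """Return indices and cosine similarities of the k database rows closest to query."""
    query_unit = query / np.linalg.norm(query)
    db_unit = database / np.linalg.norm(database, axis=1, keepdims=True)
    similarities = db_unit @ query_unit
    top = np.argsort(-similarities)[:k]   # sort by descending similarity
    return top, similarities[top]

# Synthetic stand-in for a database of 128-dimensional clip embeddings.
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 128))

# Pretend the embedding predicted from fMRI is a noisy copy of clip 42.
query = database[42] + 0.1 * rng.normal(size=128)

indices, scores = retrieve_top_k(query, database, k=3)
print(indices[0])  # → 42
```

Unlike generation, retrieval can only return clips that already exist in the database, so its quality is bounded by how well the database covers the stimulus.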

Stimulus	Retrieval	Generation with MusicLM
GTZAN music	From FMA	Gen #1	Gen #2	Gen #3

Comparison Across Subjects

Below we compare retrieval from FMA with generation using MusicLM across all five subjects for which fMRI data has been collected. The results are random samples. There is one row for each of the 10 GTZAN genres.

Stimulus	Retrieval (FMA)					Generation (MusicLM)
Stimulus	Subject 1	Subject 2	Subject 3	Subject 4	Subject 5	Subject 1	Subject 2	Subject 3	Subject 4	Subject 5

Encoding: Whole-brain Voxel-wise Modeling

By constructing a brain encoding model, we find that two components of MusicLM (MuLan and w2v-BERT) have some degree of correspondence with human brain activity in the auditory cortex.

We also find that the brain regions representing information derived from text and music overlap.
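A voxel-wise encoding model runs in the opposite direction to decoding: it predicts each voxel's response from model features and scores every voxel by how well held-out responses are predicted. The sketch below is illustrative and rests on assumptions not stated on this page: ridge regression as the estimator, Pearson correlation as the score, synthetic data in which only the first 10 of 40 voxels depend on the features, and hypothetical names throughout.

```python
import numpy as np

def fit_encoding_model(features, responses, alpha=1.0):
    """Ridge regression from stimulus features to all voxels at once."""
    gram = features.T @ features + alpha * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ responses)

def voxelwise_correlation(predicted, actual):
    """Pearson correlation between predicted and actual responses, per voxel."""
    pred_c = predicted - predicted.mean(axis=0)
    act_c = actual - actual.mean(axis=0)
    numerator = (pred_c * act_c).sum(axis=0)
    denominator = np.sqrt((pred_c ** 2).sum(axis=0) * (act_c ** 2).sum(axis=0))
    return numerator / denominator

rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels, n_responsive = 300, 100, 16, 40, 10

# Only the first n_responsive voxels are driven by the features (synthetic assumption).
true_weights = np.zeros((n_features, n_voxels))
true_weights[:, :n_responsive] = rng.normal(size=(n_features, n_responsive))

feat_train = rng.normal(size=(n_train, n_features))
feat_test = rng.normal(size=(n_test, n_features))
resp_train = feat_train @ true_weights + rng.normal(size=(n_train, n_voxels))
resp_test = feat_test @ true_weights + rng.normal(size=(n_test, n_voxels))

weights = fit_encoding_model(feat_train, resp_train, alpha=1.0)
scores = voxelwise_correlation(feat_test @ weights, resp_test)

print("responsive voxels:", scores[:n_responsive].mean())
print("other voxels:", np.abs(scores[n_responsive:]).mean())
```

Mapping such per-voxel scores back onto the cortex is what allows a statement about which regions correspond to a given feature set; comparing maps obtained from two feature sets (for example, text-derived and music-derived) is one way to examine overlap.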

GTZAN Music Captions

We release a music caption dataset for the subset of GTZAN clips for which there are fMRI scans. Below are ten examples from the dataset.

Clip name	GTZAN music (15s slice)	Human-written text caption
blues.00017		It is lazy blues with a laid-back tempo and relaxed atmosphere. The band structure is simple, with the background rhythm punctuated by bass and guitar cutting. The impressive phrasing of the lead guitar gives the piece a nostalgic impression.
classical.00008		Several violins play the melody. The melody is simple and almost unison, but it moves between minor and major keys and changes expression from one to the other.
country.00012		This is a classic country song. You can hear clear singing and crisp acoustic guitar cutting. The wood bass provides a solid groove with a two-beat rhythm. This is country music at its best. Ideal for nature scenes and homely atmospheres.
disco.00004		This music piece has a disco sound. Vocals and chorus create extended harmonies. The synthesiser creates catchy melodies, while the drumming beats rhythmically. Effective tambourine sounds accentuate the rhythms and add further dynamism. This music is perfect for dance parties, club floors and other scenes of dancing and fun.
hiphop.00014		This is a rap-rock piece with a lot of energy. The distorted guitars are impressive and provide an energetic sound. The bass is an eight beat, creating a dynamic groove. The drums provide the backbone of the rhythm section with their powerful hi-hats. The vocal and chorus interaction conveys tension and passion and draws the audience in.
jazz.00040		This is medium-tempo old jazz with female vocals. The band is a small band similar to a Dixie Jazz formation, including clarinet, trumpet and trombone. The vocal harmonies are supported by a piano and brass ensemble on a four beat with drums and bass.
metal.00026		This is a metal instrumental piece with technical guitar solos and distortion effects. The heavy, powerful bass creates a sense of speed, and the snare, bass and guitar create a sense of unity in unison at the end. It is full of over-the-top playing techniques and intense energy.
pop.00032		Passionate pops piece with clear sound and female vocals. The synth accompaniment spreads out pleasantly and the tight bass grooves along. The beat-oriented drums drive the rhythm, creating a strong and lively feeling. Can be used as background music in cafés and lounges to create a relaxed atmosphere.
reggae.00013		This reggae piece combines smooth, melodic vocals with a clear, high-pitched chorus. The bass is swingy and supports the rhythm, while whistles and samplers of life sounds can be heard. It is perfect for relaxing situations, such as reading in a laid-back café or strolling around town.
rock.00032		This rock piece is characterised by its extended vocals. The guitar plays scenically, while the bass enhances the melody with rhythmic fills. The drums add dynamic rhythms to the whole piece. This music is ideal for scenes with a sense of expansiveness and freedom, such as mountainous terrain with spectacular natural scenery or driving scenes on the open road.