Abstract

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting.
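As a rough illustration of how a 2D RoPE can encode both axes of a time-frequency token grid, the sketch below uses one common formulation: the first half of each query/key vector is rotated by the time index and the second half by the band index. This split, the function names `rope_1d` and `rope_2d`, and the `base` value are assumptions made for illustration; the abstract does not specify the exact variant used.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "vector length must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, t, b, base=10000.0):
    """Assumed 2D variant: first half encodes time t, second half encodes band b."""
    half = len(vec) // 2
    assert half % 2 == 0, "each half must have even length"
    return rope_1d(vec[:half], t, base) + rope_1d(vec[half:], b, base)


def dot(u, v):
    """Plain dot product, standing in for an attention score."""
    return sum(a * b for a, b in zip(u, v))
```

With this construction, the score between a rotated query and a rotated key depends only on the offsets in time and band, not on the absolute positions, which is the property that lets attention see relative structure along both axes.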

Framework Overview

BandTok framework overview
Comparison between RVQ tokenizers and BandTok. Panel (b) shows BandTok, which patchifies the Mel spectrogram into 2D latents and quantizes them with a single codebook. The vertical axis of the resulting token grid corresponds to Mel-frequency bands.
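The band-wise quantization in panel (b) can be sketched in a few lines. This is a simplified toy, not the BandTok implementation: it skips the learned encoder and quantizes raw Mel values directly, uses equal-width bands, and flattens the grid in time-major order. The names `split_into_bands`, `nearest_code`, `tokenize`, and `flatten` are hypothetical.

```python
def split_into_bands(frame, num_bands):
    """Split one Mel frame (a list of bin values) into equal-width bands."""
    assert len(frame) % num_bands == 0, "bins must divide evenly into bands"
    width = len(frame) // num_bands
    return [frame[i * width:(i + 1) * width] for i in range(num_bands)]


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to `vec` (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize(mel, num_bands, codebook):
    """Map a Mel spectrogram (list of frames) to a [time][band] token grid.

    Every band is looked up in the same shared codebook.
    """
    return [
        [nearest_code(band, codebook) for band in split_into_bands(frame, num_bands)]
        for frame in mel
    ]


def flatten(grid):
    """Flatten the token grid frame by frame (assumed time-major order)."""
    return [token for frame in grid for token in frame]


# Toy data: 2 frames, 8 Mel bins, 4 bands of width 2, 3 codebook entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
mel = [
    [0.1, 0.0, 0.9, 1.1, 0.0, 0.8, 0.2, 0.1],
    [1.0, 0.9, 0.1, 0.9, 0.0, 0.1, 1.2, 1.0],
]
grid = tokenize(mel, num_bands=4, codebook=codebook)
print(grid)           # [[0, 1, 2, 0], [1, 2, 0, 1]]
print(flatten(grid))  # [0, 1, 2, 0, 1, 2, 0, 1]
```

Because every token comes from the same codebook and indexes a fixed (time, band) cell, the flattened sequence has no residual hierarchy: no token is a correction to an earlier one.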
NMI and perplexity profile analysis
Comparison between residual and band-wise tokens. Normalized mutual information (NMI) and language-model perplexity (PPL) are used to analyze token dependence and autoregressive prediction difficulty, respectively.
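For readers unfamiliar with the two metrics, the sketch below shows one way to compute them from token sequences. The normalization of mutual information by the geometric mean of the two entropies is an assumption (other normalizations exist, and the page does not say which one is used), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (in nats) of an empirical distribution given by counts."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def nmi(xs, ys):
    """Normalized mutual information between two aligned token streams.

    Assumed normalization: MI divided by sqrt(H(X) * H(Y)).
    """
    assert len(xs) == len(ys) and len(xs) > 0
    n = len(xs)
    hx = entropy(Counter(xs), n)
    hy = entropy(Counter(ys), n)
    hxy = entropy(Counter(zip(xs, ys)), n)
    mi = hx + hy - hxy
    denom = math.sqrt(hx * hy)
    return mi / denom if denom > 0 else 0.0


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

An NMI near 1 means one token stream almost determines the other, while an NMI near 0 means the streams are close to independent. Higher perplexity means the language model finds the tokens harder to predict.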

Generation Demos

Non-Cherry-Picked Results

Text prompts from the SongDescriber Dataset.

Citation

@inproceedings{cheng2026modeling,
  title     = {Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation},
  author    = {Cheng, Yuqing and Ma, Xingyu and Yu, Guochen and Gu, Xiaotao},
  booktitle = {IEEE ICME 2026 Challenge Papers},
  year      = {2026}
}