Browser support
demucs-onnx runs fully in the browser tab via
onnxruntime-web.
The audio never leaves the user's machine.
This page shows three flavors of integration, from least to most production-shaped:
- Zero-build vanilla: drop two files on a static host, double-click index.html, done.
- Vite + React + TS: the typical SPA shape.
- Server-side / Next.js: Next.js needs an SSR carve-out and custom headers; we cover both.
For each one we publish a demucs-onnx browser-config snippet you can
copy verbatim into your project's bundler config.
Use the fp16weights variant in the browser
The default browser demo loads
htdemucs_ft_vocals_fp16weights.onnx (~166 MB) instead of the fp32
file (~316 MB). The graph still computes in fp32 at runtime, so
latency and accuracy are unchanged; only the download shrinks
1.91×. After the first load the browser can serve the file from its HTTP cache until the entry is evicted.
Pick a model
| Model file (from HF, fp16weights variant) | Download | Best for |
|---|---|---|
| htdemucs_ft_vocals_fp16weights.onnx | 166 MB | Vocal removal / karaoke (default). |
| htdemucs_ft_drums_fp16weights.onnx | 166 MB | Drum extraction. |
| htdemucs_fp16weights.onnx | 166 MB | All 4 stems, single session. |
| htdemucs_6s_fp16weights.onnx | 136 MB | 6-stem with guitar + piano. |
Each htdemucs_ft specialist file still outputs all four stem rows, but only
the row it was fine-tuned for is meaningful (the drums row of the drums file,
the vocals row of the vocals file, and so on). The htdemucs and
htdemucs_6s files predict every row, which makes them a slightly better
fit for a browser demo when you want all stems in one download.
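The row rules above can be captured in a small lookup. The following is a minimal sketch, not part of demucs-onnx: `STEM_ROWS`, `MODEL_STEMS`, and `stemRow` are hypothetical names, and the row order is the one this page uses (drums=0, bass=1, other=2, vocals=3, guitar=4, piano=5).

```typescript
// Row order as documented on this page.
const STEM_ROWS = ["drums", "bass", "other", "vocals", "guitar", "piano"] as const;
type Stem = (typeof STEM_ROWS)[number];

// Hypothetical catalogue: the rows each file predicts meaningfully.
const MODEL_STEMS: Record<string, readonly Stem[]> = {
  "htdemucs_ft_vocals_fp16weights.onnx": ["vocals"],
  "htdemucs_ft_drums_fp16weights.onnx": ["drums"],
  "htdemucs_fp16weights.onnx": ["drums", "bass", "other", "vocals"],
  "htdemucs_6s_fp16weights.onnx": ["drums", "bass", "other", "vocals", "guitar", "piano"],
};

// Returns the output row index for `stem`, or throws if the chosen file
// does not produce a usable prediction for that stem.
function stemRow(modelFile: string, stem: Stem): number {
  const usable = MODEL_STEMS[modelFile];
  if (!usable) throw new Error(`unknown model file: ${modelFile}`);
  if (!usable.includes(stem)) {
    throw new Error(`${modelFile} has no usable ${stem} row`);
  }
  return STEM_ROWS.indexOf(stem);
}
```

Failing early here avoids silently reading a row that a specialist file was never trained to fill.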
1. Zero-build vanilla demo
# scaffold the demo files into /tmp/demo
demucs-onnx browser-demo /tmp/demo
cd /tmp/demo
python -m http.server 8080
# open http://localhost:8080/
You get three files:
/tmp/demo/
├── index.html ~2.5 KB file-picker UI + "Extract vocals" button
├── demo.js ~6 KB ORT session, chunked overlap-add, WAV encode
└── README.md
index.html works under file:// for the inference itself (the HF CDN's
CORS policy is permissive enough for the model fetch). You only need an
HTTP server if you want multi-threaded WASM (which needs the COOP/COEP
headers shown in the next section).
The vanilla demo doesn't import onnxruntime-web from npm — it loads
the prebuilt ort.min.mjs from cdn.jsdelivr.net. That keeps the demo
zero-dependency, but if you want to pin ORT to a version locally, fork
the file and swap the import.
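The file listing above mentions a WAV encode step. As a rough illustration of what that step involves, here is a minimal sketch of a 16-bit PCM stereo encoder. It is not the code shipped in demo.js; `encodeWav16` is a hypothetical helper name, and the 44.1 kHz default is taken from the sample rate used elsewhere on this page.

```typescript
// Hypothetical helper: pack two float channels into a 16-bit PCM WAV file.
function encodeWav16(
  left: Float32Array,
  right: Float32Array,
  sampleRate = 44100,
): Uint8Array {
  const frames = Math.min(left.length, right.length);
  const dataBytes = frames * 2 * 2; // frames x 2 channels x 2 bytes
  const view = new DataView(new ArrayBuffer(44 + dataBytes));

  const writeTag = (offset: number, tag: string) => {
    for (let i = 0; i < tag.length; i++) view.setUint8(offset + i, tag.charCodeAt(i));
  };
  // Clamp to [-1, 1], then scale to the signed 16-bit range.
  const toInt16 = (x: number) => {
    const c = Math.max(-1, Math.min(1, x));
    return Math.round(c < 0 ? c * 0x8000 : c * 0x7fff);
  };

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // remaining file size
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // format 1 = PCM
  view.setUint16(22, 2, true);              // channel count
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 4, true); // byte rate
  view.setUint16(32, 4, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeTag(36, "data");
  view.setUint32(40, dataBytes, true);

  // Samples are interleaved: L0, R0, L1, R1, ...
  let offset = 44;
  for (let i = 0; i < frames; i++) {
    view.setInt16(offset, toInt16(left[i]), true);
    view.setInt16(offset + 2, toInt16(right[i]), true);
    offset += 4;
  }
  return new Uint8Array(view.buffer);
}
```

In a page, the returned bytes can be wrapped in a Blob with type audio/wav and offered as a download link.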
2. Vite + React + TS
demucs-onnx browser-demo /tmp/react-demo --react
cd /tmp/react-demo
npm install
npm run dev
# open http://localhost:5173/
You get a standard Vite + React + TypeScript app:
/tmp/react-demo/
├── package.json
├── vite.config.ts
├── tsconfig.json
├── index.html
├── README.md
└── src/
├── App.tsx ~5 KB React UI + chunked separation loop
└── main.tsx tiny mount point
The important parts of vite.config.ts:
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";

export default defineConfig({
  plugins: [react()],
  // 1) Do not pre-bundle ORT's WASM-touching entry.
  optimizeDeps: { exclude: ["onnxruntime-web"] },
  // 2) COOP/COEP enable multi-threaded WASM EP (3-5× speedup).
  server: {
    headers: {
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    },
  },
});
You can print the snippet for any other bundler:
demucs-onnx browser-config --bundler vite # default
demucs-onnx browser-config --bundler webpack
demucs-onnx browser-config --bundler esbuild
demucs-onnx browser-config --bundler next
demucs-onnx browser-config --bundler rollup
3. Next.js
Next.js needs an SSR carve-out (an isServer branch in the webpack config)
plus the COOP/COEP headers, both set in next.config.js. Print the exact
snippet with demucs-onnx browser-config --bundler next.
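For orientation, a config with those two pieces might look roughly like the sketch below. This is an assumption about the general shape, not the snippet the CLI prints: the alias-to-false carve-out is one plausible way to keep onnxruntime-web out of the server bundle, and the loose inline types are only there so the file reads on its own. It is written as next.config.ts, which newer Next.js versions accept; for next.config.js, drop the type annotations.

```typescript
// next.config.ts (sketch under the assumptions stated above)
const nextConfig = {
  // Send COOP/COEP on every route so SharedArrayBuffer is available.
  async headers() {
    return [
      {
        source: "/:path*",
        headers: [
          { key: "Cross-Origin-Opener-Policy", value: "same-origin" },
          { key: "Cross-Origin-Embedder-Policy", value: "require-corp" },
        ],
      },
    ];
  },
  // SSR carve-out: on the server build, resolve onnxruntime-web to nothing.
  // Assumption: the package is only ever imported from client components.
  webpack(config: any, { isServer }: { isServer: boolean }) {
    if (isServer) {
      config.resolve.alias = {
        ...config.resolve.alias,
        "onnxruntime-web$": false,
      };
    }
    return config;
  },
};

export default nextConfig;
```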
Then dynamically import ORT on the client:
"use client";
import { useEffect, useState } from "react";

export default function VocalsButton() {
  const [ort, setOrt] = useState<typeof import("onnxruntime-web") | null>(null);
  useEffect(() => {
    import("onnxruntime-web").then(setOrt);
  }, []);
  // ... render once `ort` is loaded ...
}
The 30-line chunked overlap-add loop
The htdemucs ONNX graph is fixed at exactly 7.8 s of stereo 44.1 kHz
(mix shape (1, 2, 343980)). For inputs longer than that, you need
overlap-add chunking with a triangular window. This is the loop both
demos use, distilled to its essence:
import * as ort from "onnxruntime-web";

const SAMPLE_RATE = 44100;
const N_SAMPLES = Math.round(7.8 * SAMPLE_RATE); // 343,980
const OVERLAP = Math.floor(N_SAMPLES / 4);
const STRIDE = N_SAMPLES - OVERLAP;

async function separate(
  session: ort.InferenceSession,
  mix: [Float32Array, Float32Array], // [L, R]
  stemRow: number, // 0=drums, 1=bass, 2=other, 3=vocals
): Promise<[Float32Array, Float32Array]> {
  const total = mix[0].length;
  const nChunks = Math.ceil(total / STRIDE);
  const out: [Float32Array, Float32Array] = [
    new Float32Array(total),
    new Float32Array(total),
  ];
  const weight = new Float32Array(total);
  // Named `win` so it does not shadow the browser's global `window`.
  const win = makeTransitionWindow(N_SAMPLES, OVERLAP);
  const chunkBuf = new Float32Array(2 * N_SAMPLES);

  for (let i = 0; i < nChunks; i++) {
    const start = i * STRIDE;
    const end = Math.min(start + N_SAMPLES, total);
    const clen = end - start;

    // Copy the chunk into a zero-padded (2, N_SAMPLES) buffer.
    chunkBuf.fill(0);
    for (let c = 0; c < 2; c++) {
      chunkBuf
        .subarray(c * N_SAMPLES, c * N_SAMPLES + clen)
        .set(mix[c].subarray(start, end));
    }

    const result = await session.run({
      mix: new ort.Tensor("float32", chunkBuf, [1, 2, N_SAMPLES]),
    });
    const stems = result.stems.data as Float32Array; // (1, 4, 2, N) flat
    const rowOffset = stemRow * 2 * N_SAMPLES;

    // Accumulate the windowed chunk and the window weights.
    for (let c = 0; c < 2; c++) {
      for (let s = 0; s < clen; s++) {
        out[c][start + s] += stems[rowOffset + c * N_SAMPLES + s] * win[s];
      }
    }
    for (let s = 0; s < clen; s++) weight[start + s] += win[s];
  }

  // Normalize by the accumulated weight.
  for (let c = 0; c < 2; c++) {
    for (let s = 0; s < total; s++) {
      out[c][s] /= Math.max(weight[s], 1e-8);
    }
  }
  return out;
}

function makeTransitionWindow(seg: number, overlap: number): Float32Array {
  const w = new Float32Array(seg).fill(1);
  for (let i = 0; i < overlap; i++) {
    // (i + 1) / (overlap + 1) keeps the edge weights above zero, so the first
    // and last samples of the track are not silenced by a zero weight.
    const ramp = (i + 1) / (overlap + 1);
    w[i] = ramp;
    w[seg - 1 - i] = ramp;
  }
  return w;
}
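To make the chunk arithmetic concrete, the sketch below lists the [start, end) sample ranges the loop visits for a given track length. `planChunks` is a hypothetical helper, not part of the demos; the constants repeat the ones above so the block stands alone.

```typescript
const SAMPLE_RATE = 44100;
const N_SAMPLES = Math.round(7.8 * SAMPLE_RATE); // 343,980
const OVERLAP = Math.floor(N_SAMPLES / 4);       // 85,995
const STRIDE = N_SAMPLES - OVERLAP;              // 257,985

// Hypothetical helper: the [start, end) range of every chunk, in samples.
function planChunks(total: number): Array<[number, number]> {
  const plan: Array<[number, number]> = [];
  for (let start = 0; start < total; start += STRIDE) {
    plan.push([start, Math.min(start + N_SAMPLES, total)]);
  }
  return plan;
}

// A 30 s track is 1,323,000 samples and is covered by 6 chunks.
const plan = planChunks(30 * SAMPLE_RATE);
console.log(plan.length, plan[plan.length - 1]);
```

Each chunk starts STRIDE samples after the previous one, so neighbouring chunks share OVERLAP samples, and only the final chunks are shorter than N_SAMPLES and need zero padding.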
For htdemucs_6s the output has six rows instead of four, so stems is laid
out as (1, 6, 2, N). The same rowOffset formula applies; pass whichever
stemRow you want
(drums=0, bass=1, other=2, vocals=3, guitar=4, piano=5).
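When all stems are wanted from one chunk, the flat output can be sliced by row. The following is a minimal sketch assuming the flat (1, nStems, 2, N) layout described above; `splitStems` is a hypothetical helper name.

```typescript
// Hypothetical helper: slice a flat (1, nStems, 2, nSamples) buffer into
// one [left, right] pair per stem row. The subarrays are views, not copies.
function splitStems(
  flat: Float32Array,
  nStems: number,
  nSamples: number,
): Array<[Float32Array, Float32Array]> {
  if (flat.length !== nStems * 2 * nSamples) {
    throw new Error(`expected ${nStems * 2 * nSamples} values, got ${flat.length}`);
  }
  const stems: Array<[Float32Array, Float32Array]> = [];
  for (let row = 0; row < nStems; row++) {
    const base = row * 2 * nSamples; // same formula as rowOffset in the loop
    stems.push([
      flat.subarray(base, base + nSamples),                // left channel
      flat.subarray(base + nSamples, base + 2 * nSamples), // right channel
    ]);
  }
  return stems;
}
```

With nStems set to 4 this matches the htdemucs layout, and with 6 it matches htdemucs_6s.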
Performance notes
Single-threaded WASM on an Apple M4 Pro processes a 7.8 s chunk in ~6 s (RTF ~0.77) — slower than the same hardware running ORT natively (~1.6 s) because the WASM EP runs on the main thread by default.
To unlock multi-threaded WASM (3-5× speedup):
- Serve under HTTPS or localhost.
- Set the COOP/COEP headers shown above so SharedArrayBuffer is available.

ort.env.wasm.numThreads = navigator.hardwareConcurrency ?? 4;
We've measured ~2.5 s per 7.8 s chunk on the same Apple M4 Pro with 4 threads. Mobile devices are slower (~5-10 s per chunk on iPhone 15 Pro).
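Because threads only help when the page is cross-origin isolated, it can be worth choosing the thread count defensively. The sketch below is one way to do that; `pickNumThreads` is a hypothetical helper, and the default cap of 4 is an assumption that mirrors the 4-thread measurement above.

```typescript
// Hypothetical helper: choose a WASM thread count.
// `isolated` should come from the browser's `crossOriginIsolated` global,
// which is true only when the COOP/COEP headers took effect.
function pickNumThreads(
  isolated: boolean,
  hardwareConcurrency: number | undefined,
  cap = 4, // assumption: matches the 4-thread measurement on this page
): number {
  if (!isolated) return 1; // no SharedArrayBuffer, so stay single-threaded
  const cores =
    hardwareConcurrency !== undefined && hardwareConcurrency > 0
      ? hardwareConcurrency
      : cap;
  return Math.max(1, Math.min(cores, cap));
}

// In a page, before creating the session:
//   ort.env.wasm.numThreads = pickNumThreads(
//     globalThis.crossOriginIsolated === true,
//     navigator.hardwareConcurrency,
//   );
```

The single-thread fallback when isolation is missing also makes a misconfigured header easy to spot during testing.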
Frequently asked
Q: Why fixed 7.8 s segments instead of dynamic length?
The exported ONNX graph bakes the segment length in to keep the model ~3× smaller (no dynamic-shape blowup in attention). Chunked overlap-add recovers any length.
Q: Can I run this on mobile (iOS / Android)?
Yes. Pair the same ONNX file with onnxruntime-mobile (iOS / Android
native) or with React Native via onnxruntime-react-native. The 4-blocker
patches make the model bit-identical across runtimes.
Q: Does this leak my user's audio to a server?
No. The browser fetches the .onnx file from Hugging Face's CDN, then
runs everything locally. The audio never leaves the tab.