🐦 warbler about

what is this?

warbler is a small audio classifier that answers one question: is this person singing or speaking?

upload a clip and it slices the audio into chunks a few seconds long, converts each chunk to a mel spectrogram, and runs it through a small convolutional neural network. each chunk gets its own probability of "singing", which you can see broken down in the per-chunk chart on the inference page.
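a rough sketch of the chunking step, in plain python. this is not warbler's actual code: the 3-second chunk length, the zero-padding of the last chunk, and the names `split_into_chunks`, `per_chunk_probabilities`, and `predict` are all assumptions made for illustration. `predict` stands in for "mel spectrogram + CNN" and is supplied by the caller.

```python
def split_into_chunks(samples, sample_rate, chunk_seconds=3.0):
    """Split a 1-D sequence of samples into fixed-length, non-overlapping chunks.

    Returns a list of (start_time_in_seconds, chunk) pairs. The last chunk is
    zero-padded to full length (an assumption; the real demo may drop or
    overlap chunks instead).
    """
    chunk_len = int(round(sample_rate * chunk_seconds))
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")

    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        # Pad a short final chunk with silence so every chunk has equal length.
        chunk.extend([0.0] * (chunk_len - len(chunk)))
        chunks.append((start / sample_rate, chunk))
    return chunks


def per_chunk_probabilities(samples, sample_rate, predict, chunk_seconds=3.0):
    """Apply a caller-supplied `predict(chunk) -> float` to every chunk.

    `predict` is a hypothetical stand-in for the spectrogram + model step.
    """
    return [
        (start_time, predict(chunk))
        for start_time, chunk in split_into_chunks(samples, sample_rate, chunk_seconds)
    ]
```

the output is one (start time, probability) pair per chunk, which is the shape of data a per-chunk chart would need.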

how was it trained?

the model was trained on roughly 65 hours of multilingual audio (japanese, chinese, english, and korean) - a mix of singing and speech, including same-speaker singing/speech pairs. the idea is to push the model toward learning genuine acoustic differences between singing and speech, rather than shortcuts based on language or speaker identity.

architecture: a few conv blocks -> global pooling -> small fully-connected head -> sigmoid. tiny - not even half a megabyte.
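to get a feel for why a network of this shape stays so small, here is a back-of-the-envelope parameter count. the layer sizes below (three 3x3 conv blocks with 16/32/64 channels, a 32-unit hidden layer) are made up for illustration and are not warbler's real configuration. the sketch also assumes float32 weights and ignores things like batch norm.

```python
def conv2d_params(in_channels, out_channels, kernel_size=3):
    """Weights plus biases for one square-kernel 2-D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases for one fully-connected layer."""
    return in_features * out_features + out_features


def model_param_count(channels, hidden_units):
    """Count parameters for: conv blocks -> global pooling -> FC head -> sigmoid.

    `channels` lists the channel count at each stage, starting with the input
    (1 for a single-channel mel spectrogram). Global pooling and the sigmoid
    have no parameters, so they contribute nothing.
    """
    conv_total = sum(
        conv2d_params(c_in, c_out) for c_in, c_out in zip(channels, channels[1:])
    )
    head_total = linear_params(channels[-1], hidden_units) + linear_params(hidden_units, 1)
    return conv_total + head_total


# Hypothetical layer sizes, chosen only to illustrate the arithmetic.
total_params = model_param_count(channels=[1, 16, 32, 64], hidden_units=32)
size_bytes = total_params * 4  # assuming 4 bytes per float32 parameter
print(total_params, size_bytes)
```

with these made-up sizes the count comes to 25,409 parameters, or about 100 KB. global pooling is what keeps the head small: it collapses each feature map to a single number, so the first fully-connected layer sees 64 inputs regardless of clip length.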

what was it trained on?

a mix of public datasets and a few small private collections, covering singing and speech in japanese, chinese, english, and korean.

why does it exist?

mostly curiosity - it's a fun, self-contained problem with a clean binary label, real edge cases (rap, recitative, humming, spoken-word), and a model small enough to run comfortably on a CPU. this demo exists so you can throw your own clips at it and see how it holds up.

don't expect perfection on the hard cases! rap and melodic speech in particular tend to confuse it.
