🐦 warbler about

what is this?

warbler is a small audio classifier that answers one question: is this person singing or speaking?

upload a clip and warbler slices the audio into chunks a few seconds long, converts each chunk to a mel spectrogram, and runs it through a small convolutional neural network. each chunk gets its own probability of "singing", which is broken down in the per-chunk chart on the inference page.
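a rough sketch of the chunk-then-score flow, in plain python. the 3-second chunk length, the handling of the leftover tail, and the averaged clip-level score are all assumptions made for illustration, not warbler's actual settings. `predict_chunk` is a hypothetical stand-in for the "mel spectrogram + CNN" step.

```python
from typing import Callable, Dict, List, Optional, Sequence


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int,
    chunk_seconds: float = 3.0,  # assumed value; the page only says "a few seconds"
) -> List[Sequence[float]]:
    """Split audio into consecutive, non-overlapping fixed-length chunks."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")
    # Full chunks only; a trailing partial chunk is dropped (an assumption).
    chunks = [
        samples[start:start + chunk_len]
        for start in range(0, len(samples) - chunk_len + 1, chunk_len)
    ]
    # A clip shorter than one chunk is kept as a single short chunk.
    if not chunks and len(samples) > 0:
        chunks = [samples]
    return chunks


def classify_clip(
    samples: Sequence[float],
    sample_rate: int,
    predict_chunk: Callable[[Sequence[float]], float],  # hypothetical model wrapper
    chunk_seconds: float = 3.0,
) -> Dict[str, object]:
    """Score every chunk and return the per-chunk probabilities of "singing"."""
    per_chunk = [
        predict_chunk(chunk)
        for chunk in split_into_chunks(samples, sample_rate, chunk_seconds)
    ]
    # Averaging into one clip-level score is an assumption, not documented behavior.
    mean: Optional[float] = sum(per_chunk) / len(per_chunk) if per_chunk else None
    return {"per_chunk": per_chunk, "mean": mean}
```

in a real setup, `predict_chunk` would compute the mel spectrogram for the chunk and return the network's sigmoid output. here it can be any function that maps a chunk to a number between 0 and 1.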

how was it trained?

the model was trained on roughly 65 hours of multilingual audio (japanese, chinese, english, and korean) - a mix of singing and speech, including same-speaker singing/speech pairs. the idea is to push the model toward learning genuine acoustic differences between singing and speech, rather than shortcuts based on language or speaker identity.

architecture: a few conv blocks -> global pooling -> small fully-connected head -> sigmoid. the whole model is tiny: under half a megabyte.
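to get a feel for how small "under half a megabyte" is, here is a back-of-the-envelope parameter count for a network of that shape. the channel widths, 3x3 kernels, head sizes, and float32 storage are all made-up assumptions for illustration; the real layer sizes are not stated here, and normalization layers are ignored.

```python
from typing import Sequence


def conv2d_params(in_channels: int, out_channels: int, kernel: int = 3) -> int:
    """Weights plus biases of one 2D convolution with a square kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully-connected layer."""
    return in_features * out_features + out_features


def count_params(conv_channels: Sequence[int], head_sizes: Sequence[int]) -> int:
    """Total parameters for conv blocks -> global pooling -> FC head.

    Global pooling and the final sigmoid have no parameters.
    """
    conv_total = sum(
        conv2d_params(c_in, c_out)
        for c_in, c_out in zip(conv_channels, conv_channels[1:])
    )
    head_total = sum(
        linear_params(f_in, f_out)
        for f_in, f_out in zip(head_sizes, head_sizes[1:])
    )
    return conv_total + head_total


# Hypothetical sizes: 1 input channel (the spectrogram), three conv blocks,
# then a 64 -> 32 -> 1 head feeding the sigmoid.
total_params = count_params(conv_channels=[1, 16, 32, 64], head_sizes=[64, 32, 1])
size_bytes = total_params * 4  # 4 bytes per float32 parameter

print(f"{total_params} parameters, about {size_bytes / 1_000_000:.2f} MB as float32")
```

with these made-up sizes the count comes to 25,409 parameters, or roughly 0.10 MB. at 4 bytes per parameter, half a megabyte leaves room for about 125,000 parameters.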

what was it trained on?

a mix of public datasets and a few small private collections, covering singing and speech in japanese, chinese, english, and korean:

CSD
korean, english · singing, speech · CC BY-NC-SA
KAIST Music and Audio Computing Lab - Soonbeom Choi, Wonil Kim, Saebyul Park, Sangeon Yong, Juhan Nam

NIT-SONG070-F001
japanese · singing · CC BY 3.0
Sinsy Working Group, Nagoya Institute of Technology - Keiichi Tokuda, Yoshihiko Nankaku, Keiichiro Oura, Kazuhiro Nakamura, Shinji Sako

GTSinger
chinese, korean · singing, speech · CC BY-NC-SA 4.0
Yu Zhang, Changhao Pan, Wenxiang Guo, Ruiqi Li, Zhiyuan Zhu, Jialei Wang, Wenhao Xu, Jingyu Lu, Zhiqing Hong, Chuxin Wang, LiChao Zhang, Jinzheng He, Ziyue Jiang, Yuxun Chen, Chen Yang, Jiecheng Zhou, Xinyu Cheng, Zhou Zhao

ACV-001
chinese · singing · CC BY-SA 4.0
ArchiVoice

ACV-002
chinese · singing · CC BY-SA 4.0
ArchiVoice

ACV-003
chinese · singing · CC BY-SA 4.0
ArchiVoice

ACV Speech
chinese · speech · CC BY-SA 4.0
ArchiVoice

M4Singer (subset)
chinese · singing · CC BY-NC-SA 4.0
Lichao Zhang, Ruiqi Li, Shoutong Wang, Liqun Deng, Jinglin Liu, Yi Ren, Jinzheng He, Rongjie Huang, Jieming Zhu, Xiao Chen, Zhou Zhao

several small, private datasets
chinese, english, japanese · singing, speech · private

why does it exist?

mostly curiosity - it's a fun, self-contained problem with a clean binary label, real edge cases (rap, recitative, humming, spoken-word), and a model small enough to run comfortably on a CPU. this demo exists so you can throw your own clips at it and see how it holds up.

don't expect perfection on the hard cases! rap and melodic speech in particular tend to confuse it.
