Somewhat silly question: to what extent does analyzing music/sound via spectrogram images provide "enough" information for use in deep learning systems (like ResNet), compared to something like MusicNet?
They're pretty fundamental. Speech-to-text networks like DeepSpeech convert audio to an MFCC power spectrogram; others use an STFT magnitude spectrogram.
It's a 2-D (amplitude by frequency and time) representation of something that's usually 1-D (amplitude over time).
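For concreteness, here's a minimal sketch of both representations, assuming librosa as the feature-extraction library (my choice, not something named above) and a placeholder audio file:

```python
# Sketch: turning a 1-D waveform into the 2-D representations above.
# librosa and the path "audio.wav" are illustrative assumptions.
import numpy as np
import librosa

# Mono waveform: shape (n_samples,) -- amplitude over time.
y, sr = librosa.load("audio.wav", sr=16000, mono=True)

# STFT magnitude spectrogram: shape (n_freq_bins, n_frames) --
# amplitude over frequency and time.
stft_mag = np.abs(librosa.stft(y, n_fft=512, hop_length=160))

# MFCC features of the kind DeepSpeech-style front ends use:
# shape (n_mfcc, n_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=26)

# Either array can be treated as a single-channel image and fed to
# an image network such as ResNet.
print(stft_mag.shape, mfcc.shape)
```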
I think some networks have tried using the discrete wavelet transform too, but that's outside my knowledge area.
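A similarly hedged sketch of the wavelet route, using PyWavelets (again my pick, with a synthetic signal standing in for real audio):

```python
# Sketch: multi-level discrete wavelet transform of a waveform.
# PyWavelets (pywt) and the random signal are illustrative assumptions.
import numpy as np
import pywt

# Dummy 1-second waveform at 16 kHz.
y = np.random.randn(16000).astype(np.float32)

# wavedec returns [approximation, detail_n, ..., detail_1] coefficient
# arrays, one per scale -- a multi-resolution view of the signal.
coeffs = pywt.wavedec(y, wavelet="db4", level=5)
for i, c in enumerate(coeffs):
    print(f"coefficient array {i}: {len(c)} samples")
```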
Thanks for the explanation. I checked out your HN profile and then your SoundCloud: were many of the tracks you've posted generated as part of your research into "adversarial audio examples"?