This is almost the exact approach I took for my project Neptunely (https://neptunely.com/). Working on bringing it to a VST at the moment so it's more portable.
Abstract: We introduce MusicLM, a model for generating high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar riff”. MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text descriptions. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
I've been building a procedural music generation engine for the last 2 years. It's been a passion project. There are a handful of videos on YouTube that showcase it. It can generate pretty good songs and transitions based on established rules. I'm hoping to make it more generally accessible soon, but so far I've just been using it to help me make my own music: https://open.spotify.com/artist/3Xtq9IlfA0l3dNPe3lhGAY
I don't think it's possible to make a compelling song in quite the same way. At Neptunely, we are pursuing a route that keeps the human in the picture. It's more of a collaboration. https://neptunely.com