Generating MIDI Music with GPT-2 (gwern.net)
143 points by gwern on April 25, 2020 | 43 comments



The author mentions that only about 5% of the generated music is worth a listen. Even the best-sounding ones shared don't sound that good to me.


That aspect reminds me of previous experiences with fractal art. If you're evaluating generated sets, there's always going to be some part of any given piece that's off. "Hey, it looks like the headlights of cars driving around in the fog! But this other part makes zero sense and is distracting."

If you can use a piece as inspiration, browsing around until you find the 5% (or whatever percentage) that you like, there's real benefit to be found. Last time I did this, I felt a bit like Simon Cowell, auditioning works and being really picky. But in the end you can discard an item, go with it as-is, adapt it somehow, or use it as reference material. Eventually you build a gallery.


"LMD: catchy ambient-esque piano piece" is pretty good. I could see it as background music in a game. I like "Pop MIDI, rapid jazz piano?" too.


You're right. I listened to a few and didn't like them much, but when I saw your comment I went back to that one, and it's pretty good. It could do with a related but different section in the middle, but apart from that I enjoyed it.

Do people ever train at different hierarchical levels? I've done very little ML, but it seems to me that it'd be beneficial to train a net on "plans" and then separately train one to interpret plans.


Bear in mind that these exact same MIDI notes can be fed into better-sounding instruments (such as lush pads, sharp lead synths, acoustic guitars, etc.), thus improving the overall end product.


Yeah. I think a lot of people are ignoring that the output target, MIDI, is pretty limited. A skilled producer could pretty easily take these exact notes and make a track that sounds great. Sound design makes a huge difference.


And the constant velocity doesn't help either. Almost any MIDI score could be made to sound twice as good with a little variation in velocity and slight timing errors. Speaking of which, some friends who had a video production company were picking an audio track from a commercial DVD package, and there were song choices in a few dozen different styles to pick from: country, progressive rock, etc. At some point I realized that the same score, played in different styles and textures and at different cadences, sounded like a different song altogether.
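For instance, with the pretty_midi library you could humanize a generated file roughly like this (a sketch only; the file paths and jitter amounts are placeholders you'd tune by ear):

    import random
    import pretty_midi

    # Load a generated MIDI file (path is a placeholder).
    pm = pretty_midi.PrettyMIDI("generated.mid")

    for instrument in pm.instruments:
        for note in instrument.notes:
            # Vary velocity around its current value (MIDI velocities are 0-127).
            note.velocity = max(1, min(127, note.velocity + random.randint(-15, 15)))
            # Nudge the timing slightly to loosen the rigid grid.
            jitter = random.uniform(-0.01, 0.01)
            note.start = max(0.0, note.start + jitter)
            note.end = max(note.start + 0.05, note.end + jitter)

    pm.write("humanized.mid")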


Most of it sounds terrible. I guess that it's because there is no way to give it any feedback. A chess program knows what winning means. And IIRC "this person doesn't exist" uses human opinion to train.

Music has many rules, not only theoretical ones but unwritten rules about what works. You must incorporate them into the program somehow, either in code or by giving the program something from which to deduce them.


I previously tried an approach which uses DRL for feedback but I couldn't quite get it to work: https://gwern.net/GPT-2-preference-learning

At the moment, it would probably be more practical to train a model to predict ratings and use that to screen generated samples or possible completions, throwing out the too-low-scoring ones (the 'ranker' approach worked out very well for the Meena chatbot recently: https://arxiv.org/abs/2001.09977 )
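A minimal sketch of that ranker idea, assuming you already have a generator and some learned rating model (both stand-ins here, not anything from the post): generate a batch, score it, keep only the top fraction.

    def filter_samples(generate, score, n_candidates=100, keep_fraction=0.05):
        """Generate n_candidates samples, score each with a learned rating
        model, and keep only the highest-scoring fraction (Meena-style ranking)."""
        candidates = [generate() for _ in range(n_candidates)]
        ranked = sorted(candidates, key=score, reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        return ranked[:n_keep]

    # Usage (both arguments are hypothetical models):
    #   best = filter_samples(generate=sample_from_gpt2, score=rating_model.predict)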


What do you think about training a binary classifier to distinguish between human and generated samples? E.g., choose a state-of-the-art model designed to classify composers or styles, and fine-tune it for this task.


I think that could potentially work as rejection sampling, but you also run the risk that it will simply find some small discriminative detail and be unhelpfully good at the classification; that's why you do it in a loop as a GAN, but, as you mention, GANs still work really badly on sequences, so...

If you wanted to improve my ABC-MIDI GPT-2, the most straightforward ways would be to do data cleaning (I'm sure there are tens of thousands of awful MIDI files which should be removed! Data cleaning with RNNs or GPT-2 or GANs always makes a large difference) and to increase the model size (the fact that the loss bottomed out at 0.20, which is still quite bad, suggests that MIDI is hard enough that GPT-2 is struggling). More interesting would be to use Reformer or another long-range Transformer and try to operate directly on a more raw representation, like the piano roll representation of MIDI. I think GPT-2 makes a lot of syntax errors which cripple outputs when a 'voice' goes silent, and a piano roll representation would be a lot more robust (at the cost of being something like 10x larger).


How would you present a piano roll to a Transformer (e.g. what would be a sample of the sequence)? You could try using a tuple of pitch integers for each time step. I'm not sure how big a "vocabulary" would need to be to capture most of the chords (note combinations); it might actually be comparable in size to a language vocabulary (tens of thousands of words). You could use two channels to capture note onset/offset info (as was done in the biaxial RNN paper), or the encoding used for MuseNet (with explicit timing info), but somehow I like the idea of "chords as words" better.
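A rough sketch of the "chords as words" idea (names and thresholds are just illustrative): each timestep's set of active pitches becomes one token, and the vocabulary is whatever chord combinations actually occur in the corpus.

    from collections import Counter

    def chord_tokens(piano_roll, velocity_threshold=1):
        """piano_roll: 2D array of shape (128 pitches, n_timesteps).
        Each timestep becomes a token: the sorted tuple of active pitches."""
        tokens = []
        n_steps = piano_roll.shape[1]
        for t in range(n_steps):
            active = tuple(p for p in range(128) if piano_roll[p, t] >= velocity_threshold)
            tokens.append(active)   # the empty tuple () acts as the "rest" token
        return tokens

    def build_vocab(token_sequences, min_count=2):
        """Keep only chords seen at least min_count times; rare ones map to <unk>."""
        counts = Counter(tok for seq in token_sequences for tok in seq)
        vocab = {tok: i for i, (tok, c) in enumerate(counts.most_common()) if c >= min_count}
        vocab["<unk>"] = len(vocab)
        return vocab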


You would simply have a vector of 128 for each timestep (I'm not sure how discrete MIDI goes), and any of the 128 pitches turned on would be non-zero. It's not a dense encoding, by a long shot, but it's quite literal and explicit, so you don't have to worry about indirectness making it hard to learn (assuming you have a model which can handle such long-range inputs to begin with).
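pretty_midi can produce roughly that representation directly; a sketch (the file path and sampling rate are guesses you'd tune for the corpus):

    import pretty_midi

    pm = pretty_midi.PrettyMIDI("example.mid")

    # 2D array of shape (128 pitches, n_timesteps); entries are summed velocities,
    # so sounding notes are non-zero and silence is zero, as described above.
    roll = pm.get_piano_roll(fs=16)   # 16 columns per second

    # Feed each column (a 128-vector) to the model as one timestep.
    timesteps = roll.T                # shape (n_timesteps, 128)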

Personally, I'd rather try redoing BPE encoding for ABC-MIDI specifically.
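That retraining could look something like this with the Hugging Face tokenizers library (a sketch only; the corpus paths, vocab size, and special tokens are placeholders):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer

    # Character-level BPE over raw ABC text; no whitespace pre-tokenization,
    # since spacing and punctuation are part of the ABC notation itself.
    tokenizer = Tokenizer(BPE(unk_token="<unk>"))
    trainer = BpeTrainer(vocab_size=5000, special_tokens=["<unk>", "<s>", "</s>"])

    # abc_files is a placeholder list of paths to the ABC training corpus.
    abc_files = ["corpus/tunes_part1.abc", "corpus/tunes_part2.abc"]
    tokenizer.train(files=abc_files, trainer=trainer)
    tokenizer.save("abc-bpe.json")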


That's a tiny step away from a GAN, which works very well for image generation, so seems promising here too.


GANs have been tried many times already for music generation, without much success. GPT-2 works very well for text generation so it seemed promising here too.

Music falls somewhere in between text (as a sequence of chords or PCM samples) and image (as a piano roll or a spectrogram), so maybe some hybrid of image and text generators is needed.


Music is neither, which is why these naive approaches don't work.


> And IIRC "this person doesn't exist" uses human opinion to train.

It doesn't, incidentally. It's just a standard StyleGAN dumping random images, trained to model the average image/distribution, and not optimizing for human ratings or anything. Almost all the 'X Does Not Exist' things operate that way, including my own https://www.thiswaifudoesnotexist.net/


I thought it was interesting, even if the overall state of most of them wasn't great. Some of them had decent sections, but then there would be some periodically repeated jarring note. Definitely better than I would've expected.


Yeah... I'm now mildly curious what could be done using this plus a genetic algorithm, with fitness judged either by a human listener or by a suitability function encoding basic ideas in music.


The rate among humans who are learning to compose is lower; it's just that the ideas you don't end up using are never heard.


Went in with high hopes, but it’s not remotely listenable.


Many comments here seem somewhat dismissive, but I heard a number of terrific melodies & counter-melodies and would certainly consider working many of them into part of a song. Also, a lot of the notes or chords which sound 'jarring' to the average listener might work great in a jazz or contemporary classical context, especially with added layering.

It's definitely interesting to me. Not sure I see a product in there... but if it could be made more efficient and set up to generate smaller, tighter clips with better instrumentation (after being trained a bit more on what people like), and had a few key features like time-stretching, it could prove useful to creative types. Writing a melody or progression doesn't always come easily, and a little push can do wonders for writer's block.


This is my take as well. There are a lot of lines that would be right at home in anything moderately jazzy or progressive, and plenty more that I would have a great time using as a starting point.


Is there technology available through which we can change voices in real time on phone calls? I assume that for it to work, latency has to be low.

There is this project https://github.com/CorentinJ/Real-Time-Voice-Cloning

But I found it pretty hard to run, and it doesn't have voice input.


(Disclaimer: we work on voice synthesis/style transfer and cloning)

Depends on how much fidelity you want and how much lag you are willing to accept. Our current state of the art in voice style transfer is sufficiently capable, though the results may still need anywhere from six months to two years of development to be considered "production ready" (i.e. think poor quality, noise, and artefacts in the output audio with existing tech). Pasini has a pretty good blog post and paper on this:

https://towardsdatascience.com/voice-translation-and-audio-s...


One thing I was hoping to see from GPT-2 is phrasing. Most of the folk music they're using as input has a structure of internal repetition that lets you know where you are in the tune. Most commonly: AABB structure, 64 beats total, with every 2^n beats clearly grouped. Some of the examples seemed to have this at the 8-beat level, but I couldn't find any with it at the 16, 32, or 64 level.

Here's a random snippet of a real tune where you can clearly hear the AABB structure: https://www.jefftk.com/contras/tunes/cast64__starabovethegar...


You don't really need any AI to do phrase repetition patterns. Typically you simply repeat the phrase as-is, or maybe change the pitch. This can be done algorithmically after the phrase-generation stage.
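A toy sketch of that post-processing step: take two generated phrases and lay them out as AABB, optionally transposing the repeats (the (pitch, duration) phrase representation here is purely illustrative, not the post's encoding).

    def transpose(phrase, semitones):
        """phrase: list of (midi_pitch, duration_beats) pairs."""
        return [(pitch + semitones, dur) for pitch, dur in phrase]

    def aabb(phrase_a, phrase_b, repeat_shift=0):
        """Assemble a four-phrase AABB tune from two generated phrases.
        repeat_shift optionally transposes the repeats (e.g. up an octave = 12)."""
        return (phrase_a
                + transpose(phrase_a, repeat_shift)
                + phrase_b
                + transpose(phrase_b, repeat_shift))

    # Usage with two hypothetical 16-beat phrases from the generator:
    #   tune = aabb(generated_phrase_1, generated_phrase_2)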

It would be nice if the network learned these patterns (and transitions) on its own, but it's far more important to generate quality phrases. That is the hard part and as you can see from the samples we are not there yet.


GPT-2 for text is impressive because it does a surprisingly good job with larger-scale patterns. If you give it a play, it generates things with actor marks, its sentences are a reasonable length, etc. I had been hoping this would carry over to learning the structure of tunes, but it seems that requires something we haven't figured out yet.

I do disagree, though, that phrase repetition is as simple as you say. A good 16-count phrase has patterns within it, and a good AABB tune has relationships between the A and B parts.


For state of the art in computer generated music check out https://aiva.ai


the state of this technology after being called 'too dangerous to release'


1. GPT-2 was built for generating text, not music or programming languages. So this is pretty impressive.

2. The "too dangerous to release" refers to the text generation mode of GPT-2 and was mainly targeted at spammers etc.

3. Of course there was a hype component to it. They wanted people to start thinking about the question before it's too late, which they did. I doubt that we'll get any kind of attempted world takeover by AI (unless someone explicitly tells an AI to do it, at which point it's a person trying to seize world control with the help of an AI tool), but even if the chance is very small, if you multiply it by the damage, you should at least start thinking about countermeasures.


As mentioned in the other comment, this is a very deceitful comment. GPT-2 was made for generating text, and the specific thing it's dangerous for is disinformation bots on twitter/reddit as well as fake generated articles, two things the original model quite excels at. It's like claiming that nuclear bombs aren't dangerous because they're not good at taking us to space.


What I meant by state is status. As in, this type of humble retro styled application is where we find GPT-2 in 2020 instead of dominating the news cycle with what you claim it is so good at. If it excels at text, where are the fake articles and tweets? It's just not that good.


Have you been reading the mass comments/tweets on political stories lately? Some of it does seem bot-like to me. I don’t have a strong opinion about how much this is happening. But I know I could personally run a bot that generates political spam with GPT-2. I know a lot of people want to influence the political conversation. I mean, people get paid salaries to write comments online. So I can’t help but suspect that someone is using tools that allow it to be done a lot more cheaply.


Saying that it could be in wide use but you wouldn't be able to detect it is an impossible-to-falsify claim, though.


The way to falsify it is to track down the identity of the human author of every internet comment. Simple!

To get philosophical I won’t say that I “know” it’s happening. Most of the world consists of things I don’t know about. The best I can do is build a mental model based on what I do know.


It does work, though. Here's an example: https://techscience.org/a/2019121801/ There's also https://arxiv.org/pdf/1908.09203.pdf#page=48

Why aren't bad actors using it in the wild? I think it's a combination of them being technically unsophisticated and conservative, propaganda not actually working nearly as well as people like to think it does, and lack of detection of competent actors (things like StyleGAN being used for fake FB profiles are detected through carelessness like leaving the faces exactly aligned).


By definition, if the tweets and articles are good, they are indistinguishable from real ones. That's exactly what makes them dangerous. If you could detect them, then they wouldn't be dangerous anymore.


You should see the too-dangerous-to-release AI model my team built. It actually governs nations in a manner indistinguishable from a human. There are nations currently under its thrall that you wouldn't even imagine. The world leaders of these nations merely speak what speeches it writes for them.


> the state of this technology after being called 'too dangerous to release'

One trained model is not enough to judge transformers. Listen to this:

https://magenta.tensorflow.org/piano-transformer


Probably should have left the title alone. There's no such thing as 'MIDI music'. MIDI is a communications protocol to transfer encoded performance data between electronic sources and sinks.

Algorithmic performance data is a thing, but generally not very musical. There've been a couple of exceptions.


> There's no such thing as 'MIDI music'. MIDI is a communications protocol to transfer encoded performance data between electronic sources and sinks.

Why can't I just as easily say that

> There's no such thing as 'sheet music'. Sheets are a communications protocol to transfer encoded performance data between human sources and sinks.


The .mid file format is also part of the official MIDI standard. Since that's the format that's ultimately being generated, I don't see what's incorrect or misleading about describing it as "MIDI music".

https://en.wikipedia.org/wiki/MIDI#Standard_files



