Is there a good explanation of how to train this from scratch with a custom dataset[0]?
I've been looking around the documentation on Huggingface, but all I could find was either how to train unconditional U-Nets[1], or how to use the pretrained Stable Diffusion model to process image prompts (which I already know how to do). Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working. I'm pretty sure I also need some other trainables at some point, too.
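For concreteness, this is roughly the shape of the contrastive training step I've been attempting - a sketch assuming an open_clip-style model that returns normalized image/text embeddings plus a learned logit scale, not working code, evidently:

    import torch
    import torch.nn.functional as F

    def clip_step(model, images, texts, optimizer):
        # open_clip-style forward: L2-normalized embeddings for each
        # modality, plus a learned temperature (logit_scale).
        img_emb, txt_emb, logit_scale = model(images, texts)

        # Scaled cosine-similarity matrix; matching pairs sit on the diagonal.
        logits = logit_scale * img_emb @ txt_emb.t()
        labels = torch.arange(len(images), device=logits.device)

        # Symmetric cross-entropy over rows (image->text) and columns (text->image).
        loss = (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.t(), labels)) / 2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()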
[0] Specifically, Wikimedia Commons images in the PD-Art-100 category, because the images will be public domain in the US and the labels CC-BY-SA. This would rule out a lot of the complaints people have about living artists' work getting scraped into the machine; and probably satisfy Debian's ML guidelines.
[1] Which actually does work
Ah I am glad to see someone else talking about using public domain images!
Honestly it baffles me that in all this discussion, I rarely see people discussing how to do this with appropriately licensed images. There are some pretty large datasets of openly licensed images out there, and using them might even help encourage more people to contribute to open datasets.
Also, if the big ML companies HAD to use open images, they would be forced to figure out sample efficiency for these models. Which is good for the ML community! They would also be motivated to encourage the creation of larger openly licensed datasets, which would be great. I still think if we got Twitter and other social media sites to add image license options, then people who want to contribute to open datasets could do so in an easy and socially contagious way. Maybe this would be a good project for Mastodon contributors, since that is something we actually have control over. I'd be happy to license my photography with an open license!
It is really a wonderful idea to try to do this with open data. Maybe it won't work very well with current techniques, but that just becomes an engineering problem worth looking at (sample efficiency).
Human artists derive their inspiration and styles from a large set of copyrighted works, but they are free to produce new art despite that. Art would have developed much more slowly and be much poorer if, for example, Impressionism or Cubism had been entangled in long ownership confrontations in courts.
Then there's the fact that humanity has been able to develop and share art and literary works for thousands of years without the modern copyright system.
It would be interesting to see if this technology can erode the copyright concept a bit. Maybe not remove it completely, but perhaps influence people to create wider definitions for "fair use", and undo the extensions that Disney lobbyists have created.
That is a very apropos reference. If you're familiar with Cubism, you know that there's Picasso, and then there's Braque. The one is an art celebrity beyond almost any other, and the other isn't.
But they developed Cubism in parallel. There were periods where their work was almost indistinguishable. "Houses at l'Estaque", the trope namer for Cubism thanks to the remarks of a critic, was in fact by Braque.
You can generate infinite recognizable Basquiat from an AI, but is it Basquiat? No, of course not, because Basquiat's style operates within the context of a specific individual human making a point about expectations and the interface between his race and his artistic boldness and audacity as experienced by his wealthy audience. Making an AI 'ape' (!) his art style is itself quite the artistic statement, but it's not the same thing in the slightest.
You can generate infinite Rothko as 512x512 squares, but if you don't understand how the gallery hangings work and their ability to fill your entire visual field first with carefully chosen color, and then with a great deal of detail at the threshold of perception - distinctions between color shades meant to further drive home the reaction to the basic color's moods - what you generate is basically arbitrary and nothing. Rothko isn't 'just a random color', Rothko is about giving you a feeling through means that aren't normal or representational, and the unusualness of this (reasonably successful) effort is what gave the work its valuation.
Ownership of the experience by a particular artist isn't the point. Rothko isn't solely celebrity worship and speculation. Picasso isn't all of Cubism. Art is things other than property of particular artists.
What makes it awkward is the great ease by which AI can blindly and unhelpfully wear the mask of an artist, such as Basquiat, to the detriment of art. It's HOW you use the tools, and it's possible to abuse such tools.
> You can generate infinite recognizable Basquiat from an AI, but is it Basquiat? No, of course not, because Basquiat's style operates within the context of a specific individual human making a point about expectations and the interface between his race and his artistic boldness and audacity as experienced by his wealthy audience.
I'm not sure how I feel about this - I agree with the conclusion, but not the reasoning. For me, AI-generated Basquiat is not Basquiat simply because he had no ownership or agency in the process of its creation.
It feels like an overly romantic notion that art requires specific historical/cultural context at the moment of its creation to be valid.
If I could hypothetically pay Basquiat $100 to put his own work into a stable diffusion model that created a Basquiat-esque work, that's still a Basquiat. If I could pay him to draw a circle with a pencil, that's his work - and if I used it in an AI model, then it's not.
It's about who held the paintbrush, or who delegated holding the paintbrush, not a retrospectively applied critical theory.
On reflection, I'm going to say 'nope'. Because it's Basquiat, I'm pretty sure you couldn't get him to make a model of himself (maybe he would, and call it 'samo'?). I don't think you could pay him to draw a circle with a pencil: I think he'd have been offended and angry. And so that is not 'his work'. It trips over what makes him Basquiat, so doing these things is not Basquiat (though it's very, very Warhol).
Even more than that, you couldn't do Rothko that way: the man would be beyond offended and would not deal with you at all. But by contrast, you ABSOLUTELY are doing a Warhol if you train an AI on him and have it generate infinite works, and furthermore I think he'd be absolutely delighted at the notion, and would love exploring the unexplored conceptual space inside the neural net.
In a sense, an AI Warhol is megaWarhol, an unexplored level of Warholiness that wasn't attainable within his lifetime.
Context and intent matter. All of modern art ended up exploring these questions AS the artform itself, so boiling it down to 'did a specific person make a mark on a thing' won't work here.
This seems to me to confuse agency with interpretation - romanticising the life and character of the artist after their heyday and death, talking about what they would have done.
Any drawing Basquiat did is a piece of art by Basquiat, whether or not it fits into the narrative of a book/thesis/lecture/exhibition. The circle metaphor isn't important - replace it with anything else. Artists regularly throw their own work away. Some of this is saved and celebrated posthumously, some never sees the light of day in accordance with their wishes. Scraps that fell on Picasso's floor sell for huge amounts of money.
Does everything he did fit the "brand" that some art historians have labelled him with, or the "brand" that auction houses promote to increase value, or the "brand" which a fashion label licenses for t-shirts? No, but I suspect this is probably what you are talking about, i.e. a "classic" Basquiat™ with certificate of authenticity?
Human artists cannot produce thousands of works in a few hours.
These arguments come up in every thread, and I'm baffled that people don't think the scale matters.
You may also be observed in public areas by police, but it would be an Orwellian dystopia to have millions of cameras in public spaces analyzing everyone's behavior.
Scale matters.
(But I'm indeed in favor of weaker copyright laws! Preferably in a way that takes power away from the copyright monopolies rather than from the individual artists who barely get by on their profits.)
> It would be interesting to see if this technology can erode the copyright concept a bit
Copyright law (especially in US) only ever changes in the direction that suits corporations. So - no.
What I expect instead is artists being sued by a big tech company for copyright violations, because that big tech company used the artist's public domain image to train its copyrighted AI, which then created a copyrighted copy of the original artist's image.
My bet is that big corporations won't risk suing anyone over a supposed copyright on generated images, as there is a good chance that a court ends up stating that all AI-generated images are in fact public domain (no author, not from the original intent and idea of a human).
You can already see the quite strange and toned-down language they use on their sites. (And, for some, the revealing reversal from "we license to you" to "you license to us".)
Some smaller AI companies might believe they own a clear-cut copyright and sue, but it would make sense that they would either be thrown out or lose.
So, the US Copyright Office will already refuse to issue a copyright for text-prompt-generated AI art, at least if you try a stunt like naming the artist to be the AI program itself.
However, even if an image is not copyrightable, it can still infringe copyright. For example, mechanical reproductions of images are not copyrightable in the US[0] - which is why you even can have public domain imagery on the web. However, if I scan a copyrighted image into my computer, that doesn't launder the copyright away, and I can still be sued for having that image on my website.
Likewise, if I ask an AI to give me someone else's copyrighted work[1], it will happily regurgitate its training set and do that, and that's infringement. This is separate from the question of training the AI itself; even if that is fair use[2], that does nothing for the people using the AI because fair use is not transitive. If I, say, take every YouTube video essay and review on a particular movie and just clip out and re-edit all the movie clips in those reviews, that doesn't make my re-edit fair use. You cannot "reach through" a fair use to infringe copyright.
[0] In Europe there's a concept of neighboring rights, where instead of a full copyright you get 20 years of ownership. This is intended for things like databases and the like. It also applies to images; copyright over there distinguishes between artistic photography (full copyright) and other kinds of photography (a 20-year neighboring right only). This is also why Wikimedia Commons has a hilarious amount of Italian photos from the 80s in a special PD-Italy category.
[1] Which is not too difficult to do
[2] My current guess is that it is fair use, because the AI can generate novel works if you give it novel input.
> So, the US Copyright Office will already refuse to issue a copyright for text-prompt-generated AI art, at least if you try a stunt like naming the artist to be the AI program itself.
That’s because only humans can own copyrights. People can and have registered copyrights for Midjourney outputs.
> Copyright law (especially in US) only ever changes in the direction that suits corporations. So - no.
There are certainly arguments to be made in this direction - for example, corporations tend to have the most money and can afford to spend it on lobbying to get their way - but the attitude of "it hasn't been good up 'til now so it definitely can't ever be good" is pretty defeatist and would imply that positive change is impossible in any area.
In this situation, it would seem like the suit would end up at "comparing the timestamp at which the public domain and copyrighted versions were published", wouldn't it?
There is nothing that the generative AI can do in this process that's legally different from copy-pasting the image, editing it a bit by hand, and somehow claiming intellectual property of the _initial_ image, no?
In theory yes; in practice you have to pay your own legal expenses in the US even if you win the case. Which means you can go bankrupt because a big company thought you infringed on their rights, even if you didn't - simply because you can't afford the costs.
Doesn't your argument in the first paragraph assume that the methods by which humans derive new works from past experiences are equivalent to the way statistical models iteratively remove noise from images based on a set of abstract features derived from an input prompt?
That seems to be the core of the issue, and a much more interesting conversation to have. So why do I keep seeing a version of your first paragraph everywhere, and not an explanation of why the assumption can be made?
The problem is not that people aren't owning ideas hard enough, ideas shouldn't be ownable in this way, the problem is that we've created a system that's obsessed with scarcity and collecting rents. Being able to own and trade ideas a la copyright/patents helps people who can buy copyrights and patents stifle creativity more than it helps artists gather reward for their creation (though it does both).
Human endeavor is inherently collaborative. The idea that my art is my virgin creation is an illusion perpetuated by capitalists. My art is the work of thousands who came before me with my slight additions and tweaks.
Your (and in general, our) suggestion that we should be concerned with respecting or even expanding these protections is incorrect if you want human creativity to flourish.
You misunderstand me. I am strongly in favor of abolishing all intellectual property restrictions. Here is me arguing just that two days ago:
https://news.ycombinator.com/item?id=33697341
But I am absolutely not in favor of keeping IP restrictions in place and then letting big corporations scoop up the works of small independent artists for their ML models.
Think of it in terms of software licenses. The people who write GPL-protected software are leveraging existing copyright laws to enforce distribution of their code. They would probably be in favor of abolishing the entire IP rights system. But if a big corporation were copying a GPL-licensed project from an independent creator, they'd sure as hell want to prosecute.
I believe strongly that IP restrictions are harmful. But keeping them in place while letting big corporations benefit from the work of independent artists who don’t want their work used in this way seems wrong to me. As long as artists wouldn’t expect anyone else to be able to copy their works, I’d like them to be able to consent to their work being used in these systems.
Ahh, I don't think that stance is evident from the GP but fair enough. I may even have a less fervent hate for IP protections than you do.
> But keeping them in place while letting big corporations benefit from the work of independent artists who don’t want their work used in this way seems wrong to me.
I see what you're saying here. My concern is that should copyright style protection be extended to the "vibe" or "style" of a painting it is going to be twisted in a way that ends up being used to silence/abuse artists in the same way that copyright strikes are already.
I think the idea that art is mostly individually creative vs mostly drawing upon the work of all the artists and art-appreciators around you and before you is already really problematic. The corrupting power of the idea is what I worry about. Similarly to crypto/NFTs, the idea that scarcity should exist in the digital world is the most dangerous thing, most of the other bad stems from that.
IMO the most important thing to work on is getting people to reject the idea itself as harmful.
I worry that any short term fix to try to prop up artists' rights in response to this changing landscape will become a long term anchor on our society's equity and cultural progress in the exact same way copyright is.
When I was younger, I also thought that way. I also felt that being an artist has nothing to do with money: a true artist will always create out of their internal need, not for money.
Then came the brutal reality: creating high-quality artwork needs time. Some can be created after work, but not that much. Some forms of art require expensive instruments. Some, like filmmaking, require collaboration and coordination of many people. So yes, I could do some forms of art part-time using the money from my day job, but I knew it was a far cry from what I could do when working on it full time. It's not capitalism, it's just reality.
Yeah, if you want artists to be able to devote their lives to their craft and reach the highest possible levels, they have to get paid enough to do that.
If all artists are "weekend warriors", they will still produce a lot of art, and some of it will be the best in that world. But the quality will be far from what we enjoy today.
That said, there are of course other ways to pay artists than the capitalist way of having customers pay for what they like. But I think the track record firmly favors a capitalist system.
It's almost like "capitalism" isn't something that needs to be created and forced upon people; it's just the way a world works where energy isn't free and cannot be created from thin air. Capitalism is just that: the realization that there are no free lunches, and that no UBIs are possible without some serious unintended consequences. I pirate everything I consume, but I would never be such a hypocrite as to say that all copyright must be abolished.
What? No. Capitalism is a more specific system for organizing goods and services, wherein the means of production and distribution of those goods and services (buildings, land, machines and other tools, vehicles etc) are privately owned and operated by workers (who are paid a wage) for the profit of the owners. That's only been the norm for a few hundred years, and only in certain places. Also, capitalism is separate from copyright and other IP, though IP as currently implemented is pretty obviously a capitalist concept.
At the moment I'd rather not get involved in an online argument about which economic systems are better than which other ones... especially not on a forum run by a startup accelerator, with a constraint that my preferred system has to be more than 300 years old.
I just wanted to point out that capitalism is in fact a specific economic system. It's not a law of nature, or another word for "markets" or "freedom", or a realization that some other system doesn't work.
That's one of the great victories of capitalism: somehow it has convinced people that a 300 year-old economic system originating in north-western Europe is as natural as the air we breathe, and as inevitable as gravity or any natural law.
You have to threaten to shoot people to get them to practice any other -ism.
So, yes, capitalism in the sense of the freedom to trade one's labor does appear to be naturally and universally emergent in advanced human societies, in the absence of violent interference.
Capitalism has violent coercion at its core, in order to enforce its property rights. You simply think that that violence is legitimate and unproblematic because you believe the system it upholds is "natural" and legitimate, but at this point you're arguing in circles. But to say that capitalism is not violent is laughable.
Yes, it is. The violence comes in when you interfere with capitalism. It's not imposed upon you forcefully, you just aren't allowed to get in the way.
To the extent that certain aspects of capitalism lead to violence, those are elements that other parties -- generally corporations or governments rather than writers or philosophers -- added to the ideology.
People die trying to break out of non-capitalist countries, while they die trying to break in to capitalist ones. That's one possible way to tell the good guys from the bad guys.
(Shrug) Taking people's rights away, including their economic rights, is likely to get the hurt put on you. Ric Romero has more on this late-breaking story at 11.
It sounds funny but he may have a point.
It's not a quality of capitalism per se; had it been communism instead, then communism would have been the best system for the present moment.
But capitalism prevails and may be the best system there is for now because I cannot fathom a change in system overnight that would not result in mass suffering for (almost) everyone.
The restrictions on creating art are the product of the society you live in, which means they are the product of capitalism if you live in a capitalist society. The way society is organised determines the cost of people's time, the cost of the tools, and the cost of the materials.
Yeah, I find when people say "ideas shouldn't be ownable" it's really the more general "deriving profit from private ownership was a mistake". Like you kinda point out, most of the reason I can think of that a person would want control of their intellectual property is to derive profit from it.
That reason has nothing to do with intellectual property or how it's created, it's a consequence of living in a capitalist society.
So anybody who just wants a thing to exist, and doesn't care who gets the credit, isn't a "real artist"? You must not work on any large art projects that involve other people.
99%? You might have it in reverse, because most art is not produced by "fulltime" artists. I would even go so far as to say 99% of art is not produced to earn money.
I've seen many arguments about getting laws on the books around ML learning. I would suggest people create a project that creates movies using ML and train it using existing Hollywood movies. I realize this isn't easy but the issue needs to be pushed to people that have the means to force change.
If you can't process/digest copyrighted content with algorithms/machine learning then Google Search (the whole thing, not just Image Search) is dead.
So no, it's not at all clear where the legal lines are drawn. There have been no court cases yet, regarding the training of ML models. People are trying to draw analogies from other types of cases, but this has not been tried in court yet. And then the answer will likely differ based on country.
> If you can't process/digest copyrighted content with algorithms/machine learning then Google Search (the whole thing, not just Image Search) is dead.
Not if Google honors the robots.txt like they say they do. Hosting content with a robots.txt saying "index me please" is essentially an implicit contract with Google for full access to your content in return for showing up in their search results.
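You can check the mechanism yourself with Python's standard library; a minimal sketch, with example.com standing in for a real site:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # hypothetical site
    rp.read()
    # A well-behaved crawler asks this before fetching anything:
    print(rp.can_fetch("Googlebot-Image", "https://example.com/img/photo.jpg"))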
Hosting an image/code repository with a very specific license attached, and then having that license ignored by someone who repackages and redistributes the content, is not the same as sites explicitly telling Google to index their content.
A much closer comparison IMO would be someone compressing a massive library of copyrighted content and then redistributing it and arguing it's legal because "the content has been processed and can't be recovered without a specific setup". I don't think we'd need prior court cases to argue that would most likely be illegal, so I don't see how machine learning models differ.
LAION/StableDiffusion is already legal under the same exemptions as Google Image Search and does respect robots.txt. It was also created in Germany so US court cases wouldn’t apply to it.
Well, you can learn about generative models from MOOCs like the ones taught at UMich, Universität Tübingen, or New York University (taught by Yann LeCun), and can gain knowledge there.
You can also watch the fast.ai MOOC titled Deep Learning from Scratch to Stable Diffusion [0].
You can also look at open source implementations of text2image models like Dall-E Mini or the works of lucidrains.
I worked on the Dall-E Mini project, and the technical know-how that you need isn't really taught in MOOCs. You need to know, on top of Deep Learning theory, many tricks, gotchas, workarounds, etc.
You could follow the work of EleutherAI, and follow Boris Dayma (project leader of Dall-E Mini) and Horace He on Twitter - and any such people who have significant experience in practical AI and regularly share their tricks. The PyTorch forums are also a good place.
If you're talking about training from scratch and not fine tuning, that won't be cheap or easy to do. You need thousands upon thousands of dollars of GPU compute [1] and a gigantic data set.
I trained something nowhere near the scale of Stable Diffusion on Lambda Labs, and my bill was $14,000.
[1] Assuming you rent GPUs hourly, because buying the hardware outright will be prohibitively expensive.
I have... ~11TB of free disk space and a 1080 Ti. Obviously nowhere close to being able to crunch all of Wikimedia Commons, but I'm also not trying to beat Stability AI at their own game. I just want to move the arguments people have about art generators beyond "this is unethical copyright laundering" and "the model is taking reference just like a real human".
To put things in perspective, the dataset it's trained on is ~240TB, and Stability has over 4,000 Nvidia A100s (each much faster than a 1080 Ti). Without those ingredients, you're highly unlikely to get a model that's worth using (it'll produce mostly useless outputs).
That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".
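Back-of-envelope, assuming ~2.3 billion image-text pairs and a ~4GB checkpoint; both figures are rough:

    images = 2_300_000_000       # ~2.3B pairs in LAION-2B-en (rough)
    model_bytes = 4 * 1024**3    # ~4 GB Stable Diffusion checkpoint
    print(model_bytes / images)  # ~1.9 bytes of model capacity per image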
As pointed out in [1], it seems machine learning is taking the same path physics already did. In the mid-20th century there was a "break" in physics: before it, individuals were making groundbreaking discoveries in their private/personal labs (think Newton, Maxwell, Curie, Roentgen, Planck, Einstein, and many others); later, huge collaborations (LHC/CERN, IceCube, EHT, et al.) were required, since the machinery, simulations, and models are so complex that groups of people are needed to create, comprehend, and use them.
P.S. To counteract that (unintentionally, actually - likely because of a simple optimization of the instruments' duty cycle), astronomers came up with the concept of an "observatory" (like Hubble, JWST) instead of an "experiment" (like LHC, HESS telescopes), where outside people can submit their proposals and, if selected, get observational time. Along with the raw data, authors of the proposals get the required expertise from the collaboration to process and analyze that data.
The point is that there is no practical limit on compression. You don't need "AI" or anything besides very basic statistics to get astronomical compression ratios. (See: "zip bomb".)
The only practical limit is the amount of information entropy in the source material, and if you're going to claim that internet pictures are particularly information-dense I'd need some evidence, because I don't believe you.
Correct; however, "compression is equivalent to general intelligence" (http://prize.hutter1.net/hfaq.htm#compai), and so in a sense all learning is compression. In this case, SD applies a level of compression that is so high that the only way it can retain information from its inputs is by capturing their underlying structure. This is a fundamentally deeper level of understanding than image codecs, which merely capture short-range visual features.
Most human behavior is easy to describe with only a few underlying parameters, but there are outlier behaviors where the number of parameters grows unboundedly.
("AI" hasn't even come close to modeling these outliers.)
Internet pictures fall squarely into the "few underlying parameters" bucket.
Because we made the algorithms and can confirm these theories apply to them.
We can speculate they apply to certain models of slices of human behaviour based on our vague understanding of how we work, but not nearly to the same degree.
Hang on, but: plagiarism is a copyright violation, and that passes through the human brain.
When a human looks at a picture and then creates a duplicate, even from memory, we consider that a copyright violation. But when a human looks at a picture and then paints something in the style of that picture, we don't consider that a copyright violation. However we don't know how the brain does it in either case.
How is this different to Stable Diffusion imitating artists?
Well, that would be ~4000 people each with an Nvidia A100 equivalent, or more people with lesser hardware; this would be an open effort, after all. Something similar to folding@home could be used. Obviously the software for that would need to be written, but I don't think the idea is unlikely. The power of the commons shouldn't be underestimated.
It's not super clear whether the training task can be scaled in a manner similar to protein folding. It's a bit trickier to optimise ML workflows across computation nodes because you need more real time aggregation and decision making (for the algorithms).
An A100 costs 10-12k USD for the 40GB/80GB VRAM versions, and it's not even targeted at the individual gamer (not effective at gaming) - they don't even give these things to big YouTube reviewers (LTT). So 4k people will be hard to find. A 3090 you can find; that's a 24GB VRAM card. But it's expensive too, and it's a power guzzler compared to the A100 series.
AFAIK this is not possible at the moment and would need some breakthrough in training algorithms; the required bandwidth between the GPUs is much higher than internet speeds.
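Rough numbers, assuming a ~1B-parameter model and fp32 gradients:

    params = 1_000_000_000            # SD's U-Net alone is ~0.86B parameters
    grad_gbit = params * 4 * 8 / 1e9  # fp32 gradients: ~32 Gbit per sync
    home_uplink_gbps = 0.04           # ~40 Mbit/s typical home uplink
    print(grad_gbit / home_uplink_gbps)  # ~800 s per gradient exchange,
                                         # vs. milliseconds over NVLink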
> That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".
The matter is really very nuanced and trivialising it that way is unhelpful.
If I recompress 240TB as super low quality jpgs and manage to zip them up as a single file that is significantly smaller than 240TB (because you can), does the fact that they are not pixel-perfect matches for the original images mean you're not violating copyright?
If an AI model can generate statistically significantly similar images from the training data, with a trivial guessable prompt (“a picture by xxx” or whatever) then it’s entirely arguable that the model is similarly infringing.
The exact compression algorithm, be it model or jpg or zip is irrelevant to that point.
It's entirely reasonable to say: if this is so good at learning, why don't you train it without the ArtStation dataset?
…because if it's just learning techniques, generic public domain art should be fine, right? Can't you just engineer the prompting better so that it generates "by Greg Rutkowski" images without being trained on actual images by Greg?
If not, then it’s not just learning technique, it’s copying.
So; tldr: there’s plenty of scope for trying to train a model on an ethically sourced dataset, and investigation of techniques vs copying in generative models.
> If I recompress 240TB as super low quality jpgs and manage to zip them up as a single file that is significantly smaller than 240TB (because you can), does the fact that they are not pixel-perfect matches for the original images mean you're not violating copyright?
If you compress them down to two or three bytes each, which is what the process effectively does, then yes, I would argue that we stand to lose a LOT as a technological society by enforcing existing copyright laws on IP that has undergone such an extreme transformation.
Does that mean it’s worthless to try to train an ethical art model?
Is it not helpful to show that you can train a model that can generate art without training it on copyrighted material?
Maybe it’s good. Maybe not. Who cares if people waste their money doing it? Why do you care?
It certainly feels awfully convenient that there are no ethically trained models, because it means no one can say "you should be using these; you have a choice to do the right thing, if you want to".
I'm not judging; but what I will say is that there's only one group that benefits from avoiding and discouraging the training of ethical models:
…and that is the people currently making and using unethically trained models.
We don't agree on what "ethical" means here, so I don't see a lot of room for discussion until that happens. Why do you care if people waste computing time programming their hardware to study art and create new art based on what it learns? Who is being harmed? More art in the world is a good thing.
> Can't you just engineer the prompting better so that it generates "by Greg Rutkowski" images without being trained on actual images by Greg?
You couldn't teach a human to do that without them having seen Greg's art. There are elements of stroke, palette, lighting and composition that can't be fully captured by natural language (short of encoding an ML model, which defeats the point).
Copyright says you cannot reproduce, distribute, etc. a work without consent from the author, whatever the means. The copy doesn't need to be exact, only sufficiently close.
However, copyright doesn't prevent someone from looking at the work and studying it, even studying it by heart. Infringement comes only if that someone makes a reproduction of that work. Also, there are provisions for fair use, etc.
> …because if it's just learning techniques, generic public domain art should be fine, right? Can't you just engineer the prompting better so that it generates "by Greg Rutkowski" images without being trained on actual images by Greg?
Is it fair to hold it to a higher standard than humans, though? To some degree it's the whole "xxx..... on a computer!" thing all over again if we go that way.
> The matter is really very nuanced and trivialising it that way is unhelpful.
Harping on copyright in the Age of Diffusion Models is as unhelpful (for artists) as protesting against a tsunami. It's time to move up the ladder.
ML engineers face a similar predicament - GPT-3-like models can solve, on the first try and without specialised training, tasks that took a whole team a few years of work. Who dares still use LSTMs now like it's 2017? Moving up the ladder - learning to prompt and fine-tune ready-made models - is the only solution for ML engineers.
The reckoning is coming for programmers and for writers as well. Even scientific papers can be generated by LLMs now - see the Galactica scandal where some detractors said it will empower people to write fake papers. It also has the best ability to generate appropriate citations.
The conclusion is that we need to give up some of the human-only tasks and hop on the new train.
I think it's a great idea regardless of practicality/implementation, which I think is generally understood to be largely a matter of time, money and hardware. I feel like you should write it up so the idea gets out there, or so you can pitch it to someone if the opportunity arises.
Oh, and I also second the fast.ai suggestion: part 2 is 100% focused on implementing Stable Diffusion from scratch (starting from the Python standard library) and it's amazing all around. The course is still actively coming out, but the first few lessons are freely available already and the rest sounds like it will be made freely available soon.
Can you go into a bit more detail?
What architecture did you use? Is the month training time really just training with mini batches with a constant learning rate? Or are these many failed attempts until you trained a successful model for a few days in the end?
I'm particularly interested in the image generation part (the DDPM/SGM).
Yeah, I did have a few false starts. Total time is more like 3 months vs 1 month for the final model. For small-scale training I found it's necessary to use a long lr warmup period, followed by a constant lr.
There’s code on my GitHub (glid3)
edit: The architecture is identical to SD except I trained on 256px images with a cosine noise schedule instead of linear. Using the cosine schedule makes the U-Net converge faster, but it can overfit if overtrained.
edit 2: Just tried it again and my model is also pretty bad at hands actually. It does get lucky once in a while though.
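A sketch of both pieces, assuming the standard improved-DDPM cosine schedule (illustrative, not the exact glid3 code):

    import math
    import torch
    from torch.optim.lr_scheduler import LambdaLR

    def cosine_betas(T: int, s: float = 0.008) -> torch.Tensor:
        # Nichol & Dhariwal cosine schedule: define the cumulative signal
        # level alpha_bar(t), then derive per-step betas from ratios of
        # consecutive alpha_bars, clipped for numerical stability.
        t = torch.linspace(0, T, T + 1) / T
        alpha_bar = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
        betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
        return betas.clamp(max=0.999)

    def warmup_then_constant(optimizer, warmup_steps: int) -> LambdaLR:
        # Linear warmup to the base lr, then hold it constant (no decay).
        return LambdaLR(optimizer,
                        lambda step: min(1.0, (step + 1) / warmup_steps))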
What kind of form factor do you use for 4x3090? Don't people usually use the datacenter product line when they're trying to get more than one into a box?
The datacenter cards are 3-4x the price for the same speed plus double the VRAM. Gaming cards are a lot more cost effective if your model fits in under 24GB.
I use an open-air rig like the ones used for crypto mining. 4x3090 would normally trip the breakers without mods, but if you undervolt the cards the power draw is just under the limit for a home AC outlet.
> Specifically, Wikimedia Commons images in the PD-Art-100 category, because the images will be public domain in the US and the labels CC-BY-SA.
Doesn't the "BY" part of the license mean you have to provide attribution along with your model's output[0]? I feel you'll have the equivalent of the GitHub Copilot problem: it might be prohibitive to correctly attribute each output, and listing the entire dataset in the attribution section won't fly either. And if you don't attribute, your model is no different from Stable Diffusion, Copilot and other hot models/tools: it's still a massive copyright violation and copyright laundering tool.
I feel quite strongly that there is a large difference between Stable Diffusion and Copilot: given the size of the training set vs the number of parameters, it should be very difficult if not impossible for Stable Diffusion to memorize and, by extension, copy-paste to produce its outputs. Copilot is trained on text and outputs text. Coding is also inherently more difficult for an AI model to do. I expect it memorizes large portions of its input and is copy-pasting in many cases to produce output. I therefore believe Copilot is doing "copyright laundering" but Stable Diffusion is not. Furthermore, I do not believe, for example, that artists should be able to copyright a "style" - but I would like to see them not be negatively impacted by this. It's complicated.
Let me guess that you write more code than visual art?
Isn't it a bit anthropomorphic to compare the two algorithms by "how a human believes they work" instead of "what they're actually doing differently to the inputs to create the outputs"?
These are algorithms and we can look at how they work, so it feels like a cop-out to not do that.
If I was generating image labels I absolutely would need to worry about that. However, since we're only generating images alone, we don't need to worry about bits of the labels getting into the output images.
The attribution requirement would absolutely apply to the model weights themselves, and if I ever get this thing to train at all I plan to have a script that extracts attribution data from the Wikimedia Commons dataset and puts it in the model file. This is cumbersome, but possible. A copyright maximalist might also argue that the prompts you put into the model - or at least ones you've specifically engineered for the particular language the labels use - are derivative works of the original label set and need to be attributed, too. However, that's only a problem for people who want to share text prompts, and the labels themselves probably only have thin copyright[0].
Also, there's a particular feature of art generators that makes the attribution problem potentially tractable: CLIP itself was originally designed to do image classification; guiding an image diffuser is just a cool hack. This means that we actually have a content ID system baked into our image generator! If you have a list of which images were fed into the CLIP trainer and their image-side outputs[1], then you can feed a generated image back into CLIP, compare distances in the output space against the original training set, and list the closest examples (sketched below, after the footnotes).
[0] A US copyright doctrine in which courts have argued that collections of uncopyrightable elements can become copyrightable, but the resulting protection is said to be "thin".
[1] CLIP uses a "dual headed" model architecture, in which both an image and text classifier are co-trained to output data into the same output parameter space. This is what makes art generators work, and it can even do things like "zero-shot classification" where you ask it to classify things it was never trained on.
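A sketch of that content ID check, assuming open_clip and a precomputed matrix of L2-normalized embeddings for the training images (the model tag is just an example):

    import torch
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")

    @torch.no_grad()
    def closest_training_images(pil_image, train_embeds, train_ids, k=5):
        # Embed the generated image with CLIP's image tower, then rank the
        # training set by cosine similarity in the shared output space.
        q = model.encode_image(preprocess(pil_image).unsqueeze(0))
        q = q / q.norm(dim=-1, keepdim=True)
        sims = (train_embeds @ q.T).squeeze(1)
        top = sims.topk(k)
        return [(train_ids[i], sims[i].item()) for i in top.indices]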
> If I was generating image labels I absolutely would need to worry about that. However, since we're only generating images alone, we don't need to worry about bits of the labels getting into the output images.
Just to be accurate: SD does sometimes generate label-like text on images, so we do need to worry ;)
This is not possible, because the model is smaller than the input data. Just as any new image it generates is something it made up, any attributions it generated would also be made up.
CLIP can provide “similarity” scores but those are based on an arbitrary definition of “similarity”. Diffusion models don’t make collages.
> Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working.
But training multi-modal text-to-image models is still a _very_ new thing in the software world. Given that, my experience has been that it's never been easier to get to work on this stuff from the software POV. The hardware is the tricky bit (along with preventing bandwidth issues on distributed systems).
That isn't to say that there isn't code out there for training - just that you're going to run into issues, and learning how to solve those issues as you encounter them is going to be a highly valuable skill soon.
edit:
I'm seeing in a sibling comment that you're hoping to train your own model from scratch on a single GPU. Currently, at least, scaling laws for transformers [0] mean that the only models that perform much of anything at all need a lot of parameters. The bigger the better - as far as we can tell.
Very simply: researchers start by making a model big enough to fill a single GPU. Then they replicate the model across hundreds/thousands of GPUs, but feed each one a different shard of the data. Model updates are then synchronized, hopefully taking advantage of some sort of pipelining to avoid bottlenecks. This is referred to as data-parallel training.
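A minimal single-node sketch of that in PyTorch, assuming a model whose forward returns a scalar loss; launched via torchrun with one process per GPU:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def train(model, dataset, epochs):
        dist.init_process_group("nccl")        # one process per GPU
        rank = dist.get_rank()
        torch.cuda.set_device(rank)
        model = DDP(model.cuda(), device_ids=[rank])
        sampler = DistributedSampler(dataset)  # each rank sees its own shard
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for epoch in range(epochs):
            sampler.set_epoch(epoch)           # reshuffle shards each epoch
            for batch in loader:
                opt.zero_grad()
                loss = model(batch.cuda())     # assumed: forward returns scalar loss
                loss.backward()                # gradients all-reduced here by DDP
                opt.step()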
All this horsepower deployed to image generation is interesting, but somebody wake me up when there is a Stable Diffusion for SQL, or when on-demand generative user interfaces are spun up on the fly to suit the purpose.
It would be worthwhile to use images from Commons. I have found that my photography is used in the Stable Diffusion dataset. What was funny is that they took the images from URLs other than my flickr account.