1 week of Stable Diffusion (multimodal.art)
457 points by victormustar on Aug 30, 2022 | 166 comments


I think most are vastly underestimating the impact of Synthetic AI media - this is at least as big as the invention of film/photography (century-level shift) and maybe as big as the invention of writing (millennia-level shift). Once you really think through the consequences of the collapse of the distance between idea and execution, you tend to think it's the latter...

What we're seeing now are toy-era seeds for what's possible - e.g. I've been making a completely Midjourney-generated "interactive" film called SALT: https://twitter.com/SALT_VERSE/status/1536799731774537733

That would have been completely impossible just a few months ago. Incredibly exciting to think what we'll be able to do just one year from now..


Is it? I seriously doubt it.

Other than "you can't trust anything you don't see with your own eyes", what kind of shift is it? People lived like that for literally millennia before photography, audio, and video recording.

At absolute worst, we are only undoing about 150 years of development, and only "kind of" and only in certain scenarios.

Moreover, people were making convincing edits of street signs, etc. literally 20 years ago using just Photoshop. What does this really change at a fundamental level? Okay, so you can mimic voices and generate video, rather than just static images. But people have been making hoax recordings and videos for longer than we've had computers.

I think the effects of this stuff will be: 1) making it easier/cheaper to create certain forms of art and entertainment media, 2) making it easier to create hoaxes, and 3) we will eventually need to contend with challenges to IP law. That's about it. I think it will create a lot of value for a lot of people (a sibling comment makes a good point about this being equivalent to CGI), but I don't see the big societal shift you're claiming.


> people were making convincing edits of street signs, etc. literally 20 years ago using just Photoshop

You needed to have tools, skill, resources and time to do such things. You don't need to have that anymore. Anyone can do anything on any scale.

It's similar to what SpaceX did. Of course it was possible to launch a rocket before SpaceX, but few could really afford it. Now that prices are low, it opens an infinite number of new space exploration possibilities.


"opens infinite number of new space exploration possibilities"

In practice, "infinite" translates mainly to a handful of hyper-competitive guys checking off "went to space" off their achievements list. There is a very good reason for that: space is very inhospitable, much more inhospitable than Antarctica. Nothing much has happened in Antarctica for 100 years beyond the occasional hyper-competitive athlete and a few research stations. Perhaps a natural resource gold rush might liven up the place for a few decades, until exhaustion and falling back to inhospitable status, dotted with the rare ghost town remains.

Something similar happens in the "creative" space: the Internet unleashed a massive tidal wave of "content", yet the vast, vast majority of it is rather trite and devoid of any (spiritual) meaning. Personally, I'm much more inclined to stick with the classics than even 20 years ago, simply because it's not worth my time wading through the deluge of poor-quality "content" out there. To wrap up the analogy, I'd rather inhabit a nutritionally rich environment than get lost in the vast, but mostly empty, expanse of the Internet.


Even if the vast majority of content on the internet is "rather trite and devoid of meaning", there's so much out there that even if just 0.05% of it is any good then that's a vast amount of new high quality content to enjoy and learn from.

I would take today's internet-fuelled media landscape over the landscape of 20 years ago in a heartbeat.


I am fairly torn on this topic. That statement is mostly an admission that I'm personally too weak to avoid wasting too much time on trite Internet content.

The question I often ask myself: is spending time with this content, while entertaining in the short term, perhaps via the novelty factor, also nourishing in the long term? The answer is, sadly, much more frequently NO than in the time of printed books.

The best I can hope for is to use the Internet as an encyclopedia for laser-focused lookups. Sadly, I am too often caught browsing random content only loosely related to the original lookup topic.


Nothing happens in Antarctica because the world powers signed a treaty in 1959 that made it so[1]. If it were open land that was allowed to be inhabited and owned by people forming new governments, I would bet you would see settlements sprout up there quite quickly. In space you might be able to set up new sovereign entities. That is one major reason people will want to go there.

[1] https://www.nsf.gov/geo/opp/antarct/anttrty.jsp


I see your point, and I raise you SoundCloud rap, tiktok, and virality.

A lot of the “trite” internet creations have gone on to become absolutely massive songs or artists.


I wouldn't say "a lot". A handful at most.


New music is funneled through the internet now, and that’s how things get launched. The old mode is dying.


> You needed to have tools, skill, resources and time to do such things.

Downloading a cracked copy of Photoshop and checking out a book from the library on how to edit photos is only somewhat more difficult than learning to use Python and write programs that generate art from some model. And only because learning anything is extremely easy today with so many free resources and help forums.

> Anyone can do anything on any scale.

I'll believe it when I see it.

> Now that prices are low, it opens an infinite number of new space exploration possibilities.

Except SpaceX is still in the "crawl" phase of "crawl, walk, run", and they only got even that far because an eccentric billionaire has staked his reputation on the problem and thrown a huge amount of money at it, without having to worry about things like "reporting to Congress" and "making sure the space program creates jobs in such-and-such voting district". And after all that effort and truly astounding engineering (the rocket lands itself back on the launch pad!!), space launches are still expensive, risky, and complicated, and will remain so into the foreseeable future (~decades).


> Python

Most people using those models aren't writing Python code. Check out https://www.reddit.com/r/dalle2/, https://www.reddit.com/r/midjourney/, https://www.reddit.com/r/StableDiffusion/, https://www.reddit.com/r/bigsleep/ etc

I expect that once the technology matures, a smaller and smaller niche of users will be doing any kind of programming


> they only got even that far because an eccentric billionaire has staked his reputation on the problem and thrown a huge amount of money at it

That's what Bezos did with his Blue Origin. If you check where Blue Origin is in the space race, you'll quickly realize it's not enough.

Edit: ah, and is space still expensive? If one of the universities in my mid-sized country with no space engineering background could afford to launch a cubesat via SpaceX, then yes, I think it has become cheap.


Did Bezos stake his reputation on Blue Origin? I think it'd be a lot less embarrassing for him if Blue Origin folded than it would be for Musk if SpaceX folded.


> Downloading a cracked copy of Photoshop and checking out a book from the library on how to edit photos is only somewhat more difficult than learning to use Python and write programs that generate art from some model.

How much practice would one need after doing that, before they're able to match the quality of some of the AI generated art? Not all of the AI generated artwork is perfect, but some of the art would take the average person years of practice to be able to match. Some art requires more than a cracked copy of Photoshop and a weekend of reading a book you borrowed from a library. You may be surprised to find that some people spend years honing their craft.


> You don't need to have that anymore. Anyone can do anything on any scale.

I have yet to see an example of "Synthetic AI media" that was both realistic and not immediately recognizable as being synthetically generated.

And if you think being 99% there means we're very close to 100% just remember how long it's taken self driving cars to close the gap (we actually don't know how long since they still haven't succeeded in this).


>I have yet to see an example of "Synthetic AI media" that was both realistic and not immediately recognizable as being synthetically generated.

Would you know if you had?


> I have yet to see an example of "Synthetic AI media" that was both realistic and not immediately recognizable as being synthetically generated.

Man, am I a good photographer or what

https://imgur.com/mUoY4b1

I mean, probably if you're familiar enough with squirrels something gives it away, but I'm not.


That said, I do think it does better with less realistic images right now, like

https://imgur.com/mcJsg0n and https://imgur.com/41eENUO

It also took a number of much worse images to get those ones.


Okay, but Imgur marked that second one as erotic, so for every two steps forward… haha


Lol, it didn't tell me that.

In case anyone is worried, it isn't remotely erotic or otherwise nsfw.


You're right. I guess if you can't tell the difference, that means everyone who claims to be able to tell the difference is lying!

I'm not particularly familiar with squirrels and something about that "photo" looks very off. If you showed it to me in a vacuum I'd just assume someone was trying to make a highly stylized version of something they had a photo reference for, but under no circumstances would I believe that's a real photo.


I'm not accusing anyone of lying, I'm suggesting that they might not have fully seen what this technology is capable of yet. I'm sure that I haven't. The whole point of the post we are discussing this under is the rate of progress in the space.


> I have yet to see an example of "Synthetic AI media" that was both realistic and not immediately recognizable as being synthetically generated.

Three months ago I'd probably have agreed with you. Things have changed.


Being able to just make a mental storyboard of ideas, and have that be trivially easy to turn into a finished product, will transform who can express, and how they will express, ideas and stories and art. Now you can do images. Will we also be able to do voice acting, video, 3D assets, and even story beats and dialogue? It seems quite possible. Everyone just became, or is becoming, a director with 1000 cast members, concept artists, and crew at their disposal.


Not to mention being able to generate movies from a script and a few key frames. This tech is a revolution for opening up creativity.


Photoshop has existed for decades, yet we haven't entered an era of not trusting pictures. What is new?

I think we're making the mistake of comparing AI-generated art to human-generated art from the 90s. Humans have "advanced" much further since then, to a level that DALL-E is nowhere close to.

https://youtu.be/iKBs9l8jS6Q


And even earlier than Photoshop or computers. The concern over this reminds me of the era shortly after the development of photography where it was discovered that doing things like multiple exposure allowed the creation of "trick" photographs where you could combine images to create pictures of giants towering over buildings or tiny people playing in teacups. Society managed not to be falsely convinced of the existence of such beings despite the worry that people wouldn't be able to tell fact from fiction in them.


I think your perspective is really sharp and hits the important points. I think you might be right but I also think we are witnessing a sort of merger between an internet history and physical history. It’s playing out in a lot of ways but mostly through the impact of social media on civil discourse. We are watching technology collide with the real world.

It just feels to me like the internet has arrived in a way that can best be expressed in an Adam Curtis documentary.


Ever see the movie "Fantastic Planet"?

This isn't just drawing. It's the start of telling a computer you want something and it making it. Not just pictures: anything. Physical things, computational things, ...

Hey computer make me a sandwich, a desk, a computer virus, a paperclip, a gun, a bomb, a little brother.



As big as the invention of CGI

Still, humans use art to communicate intent, and we still consider AIs to be 'things', with no agency or intent. Being an artist just became a lot harder, because no amount of technical prowess can make you stand out. It's all about the narrative now.


Art has always been about more than technical prowess; it's fundamentally an exploration of new ways to tickle neurons.

For more practical purposes like product design, anyone will tell you that actually drawing stuff is akin to typing in code: it can take a while, but it's not the hard part.


> we still consider AIs to be 'things'

Do we, though? You and I do, sure. Most people here will, probably. But at least one counter-example was on display a few weeks ago, the guy from Google that told the press that their text completion engine was "alive" and "had agency".

Of the friends I talked to about this (who are not in IT), most believed him. YMMV, but I seriously believe a good chunk of the population thinks we already have thinking A(G)I. I don't think there is a "we" here, anymore. :/


Do you think you'll see an AI in jail anytime soon? If not, then it's nowhere near "a good chunk of the population".


Do we see pets in jail? No, but most people wouldn't say they are things (even though, in German law, they literally are), they are more or less intelligent beings with agency.

I don't see the correlation, to be honest. A good chunk (but still a minority) of the people believing something doesn't automatically change the law, anyway, does it? :/


Pets do get euthanized/muzzled if they are found 'guilty' of acting against humans, because we do think they have agency. Sometimes we blame the owner for things they could have prevented, but there are cases where it's beyond their control. For the same reason we don't jail kids, but sometimes they do get punished. We do think kids and pets have agency, just not full agency.

At the current stage I don't think there is any AI that can be punished, or anyone who would credibly claim that an AI must be punished. Its maker will always be punished instead.

Well, true, but I think that chunk is quite small. It's one thing to nonchalantly say "this is alive" and a very different thing when you have to deal with the consequences.


> It's one thing to nonchalantly say "this is alive" and a very different thing when you have to deal with the consequences.

Yeah, fully agreed! Chatbots and "creative" ML systems are in the weird spot where they can't physically kill or hurt people (unlike, e.g., a self-driving car), yet they perform tasks that "feel" like they need intelligence.

It's also absolutely quite possible that the "chunk" is way smaller than I think, I'm just blindly extrapolating from my social bubble :D


> toy-era seeds

I think what we have is a toy and will remain a toy, just like Eliza was 60 years ago. Academically fascinating, and given the constraints of the era, genuinely remarkable, but still a long way from really being useful.

I'm already getting bored of seeing 95%-amazing, 5%-wtf AI-generated images; I can't fathom how anyone else remains excited about this stuff for so long. My Slack is filled with impressive-but-not-quite-right images of all sorts of outrageous scenarios.

But that's the catch. These diffusion models are stuck creating wacky or surreal images because those contexts essentially allow you to ignore how much these generators miss the mark.

Synthetic AI media won't even be as disruptive as Photoshop, let alone the creation of written language.


No matter what happens, it sure is thrilling to witness this debate. You might be right, you might be wrong.

Personally I think a line can now be drawn that starts at the first cave drawing and ends in 2022. Something has fundamentally shifted, a true paradigm shift before our eyes.


This thought occurred to me recently while skimming the formulaic and indistinguishable programming on Netflix. It won't be long before a GPT-3 script is fed to an image generator and out comes the components of a movie or TV show. The product will undoubtedly need some human curation and voice acting, but the possibility of a one-person production studio is on the horizon.


And it will probably suck just as badly as any of the low-effort, formulaic movies or music that humans like to produce.


It makes me wonder if all these musicians' catalogs that have been purchased wholesale lately are even more powerful than before. Owning a piece of every bit of media that contains a slice of David Bowie, for example, would be extremely valuable.

Consider the gaming world’s concept of “whales”. Customers willing to spend disproportionately enormous amounts of money in game. Can you sell these whales a unique, personalized David Bowie album that is about, I don’t know, maybe the customer’s own life story?


You're right, it's already on Show HN right now.


Agreed, it really does not seem far off now to imagine a world where I can request artifacts like

"This episode of Law & Order, but if Jerry Orbach never left the show"

"Final Fantasy VII as an FPS taking place in the Call of Duty universe"

"A 3D printable part that will enable automatic firing mode for {a given firearm}"


All without AGI!


> I've been making a completely Midjourney-generated "interactive" film called SALT

I stumbled over Midjourney the other day through these music videos[1][2] generated by Midjourney from the songs' lyrics, and I immediately thought we're not far away from this being viable for a cartoon-like film.

Interesting times ahead.

[1]: https://www.youtube.com/watch?v=bulNXhYXgFI

[2]: https://www.youtube.com/watch?v=KVj_AEhpVbA


It doesn't bring anything new, just enhancements on top of what already exists; it's not even close to photography or film.


Doubt it - but it will become another great tool for artists to use.


It’s really crazy how Stable Diffusion seems to be very on par with DALL-E and you can run it on “most” hardware. Is there an equivalent for GPT-3? I don’t even think I can run the 2M lite GPT-J on my computer…


Tangential: I've set up a Discord Bot that turns your text prompt into images using Stable Diffusion.

You can invite the bot to your server via https://discord.com/api/oauth2/authorize?client_id=101337304...

Talk to it using the /draw Slash Command.

It's very much a quick weekend hack, so no guarantees whatsoever. Not sure how long I can afford the AWS g4dn instance, so get it while it's hot.

Oh and get your prompt ideas from https://lexica.art if you want good results.

PS: Anyone knows where to host reliable NVIDIA-equipped VMs at a reasonable price?


One thing I noticed is that on GCP, if you create an a2-ultragpu instance (NVIDIA A100 80GB) and you select a spot instance, the price estimate goes down to $0.33 hourly ($240/month), which sounds really good if it's not a mistake. I was wondering if you could then turn a single A100 into 7 GPUs using Multi-Instance GPU (MIG). So on an 80GB card you get 7 10GB GPUs (you can't have 8 due to yield issues on those cards). I'm pretty sure that will run much slower than on the full instance, but not 7x slower, so if you're running a larger service at scale this could be an option to parallelize things. If someone is able to get that running, please let me know how it performs.

The next thing I considered was just buying up a ton of 3060 12gb cards (saw a few new ones for $330) and just hosting a server from my house. This might be a good option if you don't care about speed but care about throughput.

RTX 3090s are also decent in terms of price per iteration of Stable Diffusion. If you want to build a fast service like Dreamstudio I think it's the only option to be able to do it at a reasonable price. If you want to host these in the cloud using consumer RTX cards, you'll have to go with less reputable hosts since Nvidia doesn't allow it. I don't want to name any since I can't vouch for them, but there are some if you search. The cheapest option will be to buy them and host it yourself.

I'm still researching what the best price/performance is for hosting this so if you have any findings please share.


I'm experimenting with your Discord bot right now. It would be great to have a command that shows where your processes are currently in the queue or maybe the discord bot can update on queue position.


You can now ask the bot for the queue position of your draw tasks via /draw-status (see https://github.com/manuelkiessling/stable-diffusion-discord-...).


BTW thanks for putting this bot together! I've been playing with it. Very comparable results to Midjourney, no surprise. Would love the ability to do variations and upscaling, but I know that's very GPU resource intensive and you're doing this out of your own pocket (how are you affording this? How much is it costing?)


Right now it's $0.65 per hour for an AWS g4dn.xlarge instance, which features an NVIDIA T4.

I'm not really affording this, to be honest. I'm looking to switch to a Spot instance tomorrow, which could bring costs down to about $0.20 per hour, but even then I will have to switch it off in a couple of days.

I'm working on a significant speed improvement. If that works out, and users get a result in under 1 minute when they are first in line, then maybe it's possible to make the bot finance itself through a credits system.


I assume all the code is open source on GitHub in case I'd like to run this myself at my own expense? An installation/configuration guide, if there isn't already one, would be helpful.



Good idea, I'll look into it.


I submitted 2 /draw requests with prompts, got quoted 15-30 min for the first one and then 17-34 min for the 2nd, submitted about 5 minutes apart, but it's now past the upper limit of the quoted time without any results. I'm assuming that the image generation has failed or perhaps the bot got stuck. Having some way of knowing would be helpful.


Not stuck, just a full queue. Results will come back sooner or later. Time is really just a guesstimate.


Just got one of the images back. Looks like you might want to double your time estimates. Also I got a Rick Roll meme image back as one of the results. I assume this is some sort of failure mode response?


Rick chimes in when the AI thinks the image might be NSFW.


Well that would have been a fun interpretation of my prompt ;)


I got this with the prompt "A cartoon backdrop of an empty hospital bed surrounded by medical equipment" Racy.


Any chance of releasing the source? I'd like to host my own instance so my discord server doesn't have to worry about queue times


Yeah, sure: https://github.com/manuelkiessling/stable-diffusion-discord-...

I quickly polished things and created a useful README - hopefully it's all correct. If not, let me know!


Awesome, tysm!



Consider joining our Discord server if you would like to discuss the project further: https://discord.gg/nsfeutx35z


GPT-3 isn't really all that optimized in terms of size. Later studies have shown that you don't need that many parameters to get the same results, so it should be possible to train a model that could run at least on something like an RTX 4090 Ti with 48GB ram.


Full GPT-J-6B can run if you have 22GB of RAM (CPU or GPU, depending on where you run it).

You can also run an 8-bit quantized version pretty easily. This takes ~6GB of RAM.

The results seem far off from GPT-3 but apparently it can get good results when fine tuned.

Bigger models like OPT 66B can run on cloud machines (or a really big local system)

OPT 175B weights are not open but can be applied for.

175B would require something like 500GB of RAM if not quantized. That's a lot, but it's possible to build that locally if you have a couple of tens of thousands of dollars.

Wait a few years and 175B on a GPU will be no problem.


What does 'quantized' mean in this context?


Basically, stuff a 32-bit value into an 8-bit value (and lose some precision).

Apparently it doesn't affect the results significantly.

More info:

https://github.com/huggingface/transformers/pull/17901
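For anyone curious what that looks like in practice, here's a minimal sketch of 8-bit loading via the transformers/bitsandbytes integration from that PR. The model ID is the public EleutherAI checkpoint; package versions, memory figures, and generation settings are my assumptions, so treat it as illustrative rather than definitive:

    # Hedged sketch: load GPT-J 6B with int8 weights via transformers + bitsandbytes.
    # Assumes recent transformers/accelerate/bitsandbytes; exact memory use will vary.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "EleutherAI/gpt-j-6B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",  # spread layers across available GPU/CPU memory
        load_in_8bit=True,  # quantize weights to 8 bit at load time
    )

    inputs = tokenizer("The future of open language models is", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))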


Personally I don't find it on-par or even close to DALL-E... stylistically its output is a lot more plain (Midjourney does really well here) and it can't handle complicated prompts well (it will pick the one thing in the prompt it does know about and run with it, ignoring all else)

Plus, there are huge gaps in training. Ask it to draw something simple, like "a penis" and you get nightmare fuel....


Does DALL-E let you output penises? I thought openAI was forbidding many 'unseemly' prompts.


It definitely doesn't let you and you may get permanently banned for trying.


This is the killer aspect of it. Generating an image in under 5 minutes on a Mac is amazing when you consider the alternatives at the moment.


I worked on the Stable Diffusion and GPT-J integrations on NLP Cloud (https://nlpcloud.com/). Both can be used in FP16 without any noticeable quality drop (in my opinion). Stable Diffusion requires 7GB of VRAM on a Tesla T4 GPU. GPT-J requires 12GB of VRAM (but if you really try to use the full 2048-token context, usage will go up and reach something like 20GB of VRAM).
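For context, here's a minimal FP16 sketch using the Hugging Face diffusers pipeline (not necessarily what NLP Cloud runs; the model ID, auth requirements, and exact output attribute are assumptions that may differ between diffusers versions):

    # Hedged sketch: Stable Diffusion in half precision via diffusers.
    # torch_dtype=torch.float16 roughly halves VRAM versus FP32.
    # Downloading the weights may require accepting the model license on the Hub.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")

    result = pipe("a watercolor painting of a lighthouse at dawn")
    result.images[0].save("lighthouse.png")  # recent versions expose .images; older ones used ["sample"]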


Stable Diffusion seems hyper-trained on digital art and faces. DALL-E feels a lot more "intelligent" and can create a far greater and more comprehensive diversity of images from different prompts.


I have 64GB of RAM and an NVIDIA card with 24GB of VRAM; which projects would be the limit of what I can run locally?


They forgot to mention the porn one..

https://www.vice.com/en/article/xgygy4/stable-diffusion-stab...

Why'd they "overlook" it? Probably more culturally significant and controversial than any of the others. It's the natural elephant.


Hi, I'm the creator of multimodal.art. I didn't overlook it, but there's no "specialized" NSFW content maker to be highlighted - this Vice article just shows people using the model in different iterations to generate NSFW content; you don't need a specialized notebook/tool for that, a few of the ones in the post can do it (others have an NSFW filter that comes in by default).

Additionally, it is important to note that the model was licensed under the OpenRAIL-M license, which is not as permissive as an MIT license and forbids certain outputs from being shared and certain purposes from being built into apps.


I thought only the "derivatives of the model" fall under the use restrictions in the license, and those are very permissive. The outputs of the model are only very briefly covered in the license text:

> You are accountable for the Output you generate and its subsequent uses. No use of the output can contravene any provision as stated in the License.


>there's no "specialized" NSFW content maker to be highlighted

Unless I'm misunderstanding you, yes there is, and it was even posted on HN last week:

https://news.ycombinator.com/item?id=32572770

(And yes, many of its results are horrifying)


Goes without saying, but enthusiasts are working on it and making progress. Here's a quote from a discussion on it yesterday:

as far as I am aware, I may have the only semi-tuned nsfw model, though it's not all that great. There are a lot of concepts that SD needs to learn, and they're difficult to teach. It's very likely that by the time I get something usable, it's going to destroy all of the other concepts. @<another_user> also has a fine tuned model, but it's only tuned on generating closeups of female genitals. If that's all you want to generate, then he's produced some really good results. As for distribution, I wouldn't know where to begin, considering the model is 11GB


No, this is just Stable Diffusion without any modification, no fine-tuning is needed to create nudes.


Stable diffusion trained on a nude-free data set might struggle somewhat.


I will not even be slightly surprised when in 10 years we get stats like

"60% of all image generation compute power used for making NSFW material"


I would expect the % of image generation compute power used for making NSFW material to come down over time, just like the % of home video minutes or digital photographs used for making NSFW material went down over time.


Or maybe something like "60% of all power on earth".

I can't yet decide if it's going to be extremely appealing or quickly get [even more] boring and repetitive.


My 3 year old GPU generates one image every ~5 seconds, and according to the documentation draws 215 watts at a maximum steady state load (TDP). That's very roughly a kilowatt second per image.

The internet tells me that a 2022 Honda Civic takes about 0.07 liters of gas per km. And it also tells me that that is equivalent to 2394 kilowatt-seconds. I.e. 2394 images on a 3-year-old GPU per km travelled using a new and fuel-efficient model of car...
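Spelled out as a rough sanity check (the ~34.2 MJ per litre energy density of gasoline is an assumed reference value, not something from this thread):

    # Rough sanity check of the arithmetic above.
    gpu_joules_per_image = 215 * 5       # 215 W for ~5 s = 1075 J, roughly 1 kWs per image
    car_joules_per_km = 0.07 * 34.2e6    # 0.07 L/km * ~34.2 MJ/L = ~2.39 MJ per km
    print(car_joules_per_km / 1000)      # ~2394 kilowatt-seconds per km
    print(car_joules_per_km / gpu_joules_per_image)  # ~2200 images per km driven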

I'm not worried about this consuming a significant fraction of the power on earth.


I appreciate your energy comparison, and it makes sense to me.


I mean, it's not hard to imagine where this goes next: video (and eventually 3D/VR scenes). Say 60fps -> 2394 frames / 60 fps = 40 seconds. That's equivalent to driving your car at (3600 seconds per hour / 40 seconds per km) == 90 km/hr. Yes, your GPU is older, but there will also be pressure to increase the resolution and fidelity of the generated content to match.


Ok, sure, maybe there's demand there. But how would the logistics work out so that it managed to become a problem?

Are people

a) Waiting 5 seconds/frame * 60 fps = 5 minutes / second for a video to generate on their personal computer, and doing this constantly enough that it manages to become a problem?

b) Buying computers that can do it real-time, but therefore output vastly more heat, requiring thermal management system akin to a car driving at highway speed?

c) Renting these computers at considerable cost to make these videos?

As long as enough people watch each video (or one person watches it enough times), the energy usage washes out to become negligible compared to the amount of human time invested. I just can't see a world where enough people are managing to consume a kw minute/second producing videos for themselves to watch only once or twice that it becomes an issue.

Personally I'm optimistic that energy/compute is going to continue going down substantially (in which case even real-time video generation might not be an issue). If it doesn't and we don't become substantially better at efficiently synthesizing video, I can't see personalized single use video generation being a thing.


> a) Waiting 5 seconds/frame * 60 fps = 5 minutes / second for a video to generate on their personal computer, and doing this constantly enough that it manages to become a problem?

There will be much more compute resources thrown at it to make it render in real time. We're not there yet, but I can see a path to that happening in the next few years.

> b) Buying computers that can do it real-time, but therefore output vastly more heat, requiring thermal management system akin to a car driving at highway speed?

Why not? We already have billions of cars driving around outputting heat. It's an incredible expenditure of energy, sure, but perhaps the value of generated-content entertainment will match the value of car transportation.

> c) Renting these computers at considerable cost to make these videos?

I imagine longer term, the opex (i.e energy costs) will dominate the capex (GPU HW). The price of going into a generated world could be similar to going for a drive.

> As long as enough people watch each video (or one person watches it enough times), the energy usage washes out to become negligible compared to the amount of human time invested. I just can't see a world where enough people are managing to consume a kw minute/second producing videos for themselves to watch only once or twice that it becomes an issue.

This is where I strongly disagree. The democratization of skills and tools in creating content will break the one to many media model. You saw this in a large way in what the internet did to content distribution, in how the number of independent people creating content skyrocketed. These models will do the same for content creation. I predict most people will consume content personally generated for themselves or in small groups.

Here's an example: a group of friends puts on their VR headsets for their weekly DnD session. The DM begins describing the scene, which autogenerates around them. Each character can then respond with their own actions / path, and the scenes react dynamically. The hour session costs them $10 in compute/energy.

I'm mostly spitballing. I would imagine that we still have a couple of orders of magnitude reduction in energy costs that can be squeezed out of these models with improvements in specialized HW. But it will be matched against the insatiable demand of consumers for richer interactivity in content.


It really feels like for 3D the better quality/compute trade-off would be to have the ML model generate 3D models and animations, and then use a more traditional 3D rendering pipeline (by then with ray tracing and denoising).


I agree with your analysis, but I don't think that's how most people interpret global compute power consumption. The stored energy in the gasoline is not counted in global energy production figures, but the electricity used to power your GPU is.


As someone who stopped looking at porn about two months ago, after years of porn use, I can tell you: it’s going to be highly subjective.

I used to watch porn basically daily. But then after finally deciding to stop watching porn, the idea of porn itself is downright off putting to me. I don’t even quite know how or why. It just is.

And I imagine it will be the same for others with AI generated porn.


Technically you could get porn of that one exact weird turn-on you have, for which there is literally zero porn in existence now (unless you pay for a custom video/photoshoot).

Is this good? Maybe... maybe not. Since most of the "normal" stuff already exists, it'll either be something "too extreme" for classic porn studios, or stuff that turns non-porn people into AI porn stars.


I also wouldn't be surprised if 60%+ of the training data is NSFW when it's not filtered out.


Idiocracy is becoming increasingly prophetic.


> "60% of all image generation compute power used for making NSFW material"

Or 80% of all NSFW viewing happens at work.


> Why'd they "overlook" it?

It appears to be a site for AI art, so there's that.


This will be the majority use case for tools like this. I suppose this also extends rule 34 - even if it does not exist, there will be porn of it.


Funny how Reddit banned the mentioned subs in such a short amount of time.

Some years ago, the pendulum was very much on the other side.


Reddit's ban hammer is significantly more liberal than it used to be.

(I got suspended just for saying "fuck the king of Thailand" - in reference to a politically abused law prohibiting insults to the monarch.)


Reddit wants to become a public company, so it's very much an expected result.


Strange how the same species who killed its cofounder are the ones who lead it now.


He was a co-founder in name only. He was forced on the actual founders by Paul Graham.


Oh come on, it's the same with Twitter: he liked the prospect of making lots of money and now he screams foul.


He's not doing that much screaming: https://en.wikipedia.org/wiki/Aaron_Swartz#Death.


who?


I don't understand the harm in generated images of people that don't exist. It might even reduce our CSAM problem.

Obviously if they're made to look like some celebrity, that's problematic.


Is it known what killed those subs? Was it content based on actual people (celebrities)?


I thought I saw someone mention that a Vice article linked to them, and possibly reddit didn't want people thinking they're going to be hosting a repository of "fake nudes of non-consenting people"

I took a quick look at the subreddit before it was banned and I don't think I saw any real people represented. It was a lot of video game or anime style characters. And one of Shrek with a massive dong.


Mentioned where? The linked article only mentions Reddit once and that link resolves fine.


I think this comment was meant to be a reply to the Vice article posted elsewhere in this thread


7 days and already that many UIs, plugins and integrations released. To be fair, developer/researcher access was a bit earlier but that is impressive adoption speed.


> 7 days and already that many UIs, plugins and integrations released.

That’s because you can use it to make porn. Don’t underestimate the motivational power of being able to easily create porn.


This is the easy answer, but I don't think this is the right answer.

The right answer, I'd argue, is that this was Prometheus giving fire to the mortals, and then the mortals quickly discovered everything that could be possible with fire.


Haha, yes it is the energy that sustained the internet after all


Tangent discussion: What are people's here experiences with running Stable Diffusion locally? I've installed it and haven't had time to play around, but I also have a RTX 3060 8GB GPU -- IIRC, the official SD docs say that 10GB is the minimum, but I've seen posts/articles saying it could be done with 8GB.

Mostly I'm interested in the processing time. Like, using a midrange desktop, what's the average time to expect SD to produce an image from a prompt? Minutes/Tens of minutes/Hours?


RTX 3080 (10GB) here

Keep the batch size low (probably equal to 1); that was my main issue when I first installed this.

Then, there are already lots of great forks which add an interactive REPL or web UI [0][1]. They also run with half precision, which saves a significant amount of memory. Additionally, they optionally integrate with upscaling neural networks, which means you can generate 512x512 images with Stable Diffusion and then scale them up to 1024x1024 easily. Moreover, they optionally integrate with face-fixing neural networks, which can also drastically improve the quality of images.

There's also this ultra-optimized repo, but it's a fair bit slower [2].

[0]: https://github.com/lstein/stable-diffusion

[1]: https://github.com/hlky/stable-diffusion

[2]: https://github.com/basujindal/stable-diffusion


ASUS Zephyrus G15 (GA503QM) with a laptop 3060 (95W, I think) with 6GB of VRAM, basujindal fork, does 512×512 at about 3.98 iterations per second in turbo mode (for which there’s plenty of memory at that size). That’s under 15 seconds per image on even small batches at the default 50 steps, and I think it was only using around 4.5GB of VRAM.

(I say “I think” because I’ve uninstalled the nvidia-dkms package again while I’m not using it because having a functional NVIDIA dual-GPU system in Linux is apparently too annoying: Alacritty takes a few seconds to start because it blocks on spinning up the dGPU for a bit for some reason even though it doesn’t use it, wake from sleep takes five or ten seconds instead of under one second, Firefox glyph and icon caches for individual windows occasionally (mostly on wake) get blatted (that’s actually mildly concerning, though so long as the memory corruption is only in GPU memory it’s probably OK), and if the nvidia modules are loaded at boot time Sway requires --unsupported-gpu and my backlight brightness keys break because the device changes in the /sys tree and I end up with an 0644 root:root brightness file instead of the usual 0664 root:video, and I can’t be bothered figuring it out or arranging a setuid wrapper or whatever. Yeah, now I’m remembering why I would have preferred a single-GPU laptop, to say nothing of the added expense of a major component that had gone completely unused until this week. But no one sells what I wanted without a dedicated GPU for some reason.)


Removing the NSFW and watermark modules from the model will easily allow you to run it with 8 GB VRAM (usually takes around 6.9 GB for 512x512 generations).

With an RTX 3060, your average image generation time is going to be around 7-11 seconds if I recall correctly. This swings wildly based on how you adjust different settings, but I doubt you'll ever require more than 70 seconds to generate an image.
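If you're on the diffusers pipeline rather than the original CompVis scripts, one hedged way to drop the safety checker looks roughly like this (attribute and argument names are assumptions that may change between library versions; the invisible-watermark step lives in some forks' scripts rather than in the pipeline itself):

    # Hypothetical sketch: replace the NSFW checker with a pass-through.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    def no_op_safety_checker(images, **kwargs):
        return images, [False] * len(images)  # pass images through, flag nothing

    pipe.safety_checker = no_op_safety_checker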


I'm using the fork at https://github.com/basujindal/stable-diffusion which is optimized for lower VRAM usage. My RTX 2070 (8 GB) takes about 90 seconds to generate a batch of 4 images.


Just did a thousand images (different seeds for the same prompt) with the base (not forked) model at 50 iterations in a couple of hours (RTX 3080). GPU VRAM usage seemed to hover around 9 GB. This model is pretty damn fast for what it does.

I quickly threw together a folder structure where I have the md5'd prompt as a folder name; into that goes _prompt.txt with the actual text of the prompt, plus the images I generate in a loop, with the seed used and the iteration number in each image's file name. That way I can generate like 20 seed-based images for a prompt, and if the model bites I let it run with a much higher number of seed-based images. When you have 1000 to pick from, some of the results are freaking amazing.
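A rough sketch of that layout (the generate_image(prompt, seed=...) call is a placeholder for whichever Stable Diffusion script or pipeline you actually run; everything else follows the scheme described above):

    # Sketch of the md5-named folder scheme; generate_image() is hypothetical.
    import hashlib
    from pathlib import Path

    def generate_batch(prompt: str, seeds, out_root: str = "outputs") -> None:
        folder = Path(out_root) / hashlib.md5(prompt.encode("utf-8")).hexdigest()
        folder.mkdir(parents=True, exist_ok=True)
        (folder / "_prompt.txt").write_text(prompt)    # keep the human-readable prompt
        for i, seed in enumerate(seeds):
            image = generate_image(prompt, seed=seed)  # placeholder for your pipeline/fork
            image.save(folder / f"seed{seed}_{i:04d}.png")

    # e.g. generate_batch("a lich on a throne, oil painting", seeds=range(20))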


It's pretty fast on a RTX 3070 (8GB), a few seconds per image.

My first impression is it seems a lot more useful than DALL-E, because you can quickly iterate on prompts, and also generate many batches, picking the best ones. To get something that's actually usable, you'll have to tinker around a bit and give it a few tries. With DALL-E, feedback is slower, and there's reluctance to just hammer prompts because of credits.


I was blown away when I got DALL-E access, but now it seems almost silly by comparison. I really wonder why the DALL-E team chose to expose so few controls.


I have a dated 1070 with 8GB of VRAM, some of which is also used to render my desktop.

I was able to obtain 256x512 images with this card using the standard model, but ran into OOM issues.

I don't mind waiting, so now I am using the "fast" repo:

https://github.com/basujindal/stable-diffusion

With this, it takes 30s to generate a 768x512 image (any larger and I experience OOM issues again). I think you should expect slightly faster times at the same resolution with your 3060, because it's a faster card with the same amount of memory.


I have a Titan X (Pascal) from around 2015 with 12GB of VRAM and I've had no trouble running it locally. I'd say it takes me about 30 seconds to generate a single image at 30 DDIM steps (which is about the bare minimum I consider for quick iterations). When I want higher quality images, after I settle on a proper prompt, I set it to 100 or 200 DDIM steps, and that takes maybe 1 minute per picture (I didn't measure accurately). I usually just let it run in bulk batches of 10 or 20 pictures while I go do something else, then come back 15-20 minutes later.

It runs pretty well; the most I can get is a 768x512 image, but that's pretty good for stuff like visual novel background art[0] and similar things.

[0] - https://twitter.com/xMorgawr/status/1564271156462440448


I had to get a different repo with "optimized commands" on the first day, but my 3070 8GB has been happily processing images in decent time.


I decided to set up a local instance yesterday with my 3070 Ti 8GB and had similar success, about 10 seconds per image at the default settings. Like you, I also opted for a different repo [0] which emphasizes adding a GUI but I think also opts out of the watermark addition and other checks. Sounds like it reduces memory usage, from what others have said. Surprisingly (to me anyway), I had more trouble coming up with creative prompts than getting set up.

[0] https://github.com/hlky/stable-diffusion


This helped me with prompt generation: https://promptomania.com/stable-diffusion-prompt-builder/


I found it very easy to set up, too. A couple of previous things I tried were a lot harder to set up. Stable Diffusion has been dreamy. I'm already tempted to upgrade my setup to one of these with the GUIs, but I think if I wait just a bit longer, it's going to get even better. So I'm resisting the urge.


Take a moment to appreciate the fact that in 4.2GB (less than that, actually) you have the English language somehow encoded.

This is mind blowing.


I've been playing with it a bit, and I also find the information theory aspect absolutely amazing. It's more than just the English language that's encoded there. It also encodes information about characters and the styles of countless artists. I just cannot fathom how all this information fits in that space.


Agreed. My wife plays Animal Crossing, so I had Stable Diffusion do some Animal Crossing prompts and was blown away. This 4GB file understands how all the textures on the objects in this game should look. Then I turned around and was generating liches on thrones in the styles of the painting masters. This is absolutely mind blowing to me.


AI generated art is interesting and will probably be helpful.

I see it as a cheap and fast alternative to paying a concept artist.

But not a revolution. Creating precise and coherent assets is going to be a challenge, at least with the current architecture.

From a research perspective this is, I think, much more than a toy, those models can help us better understand the nature of our minds, especially related to their processing of text, images and abstraction.


This is what a revolution looks like when it is happening.

Did you learn about the "Industrial revolution" or the "agricultural revolution" in class? That didn't take a week, or a year, or a decade to happen. Even the Internet revolution took more than a decade.

This is a revolution. And you're seeing it happen in real time.


I think what it shows us is that activities we think of as "human", like getting drunk, saying silly things that sound brilliant, or painting things that look stunning, are actually the things a machine has the least trouble copying.

Whereas things we associate more with computers, such as hard thinking, mathematics, etc., turn out to be more difficult for a machine to copy, and therefore perhaps more "human".


I've dismissed DALL-E - very cool, but won't really replace everyone. After playing with Stable Diffusion, as an artist, this is the most profound experience I've ever had with a computer. Check this out https://andys.page/posts/how-to-draw/


After playing with it for a few hours I'm sold on it soon replacing all blog spam media and potentially flooding Etsy with "artists" trying to pass off the renders as their own artwork.

Here's some of the stuff I generated: https://imgur.com/a/mfjHNgO


It must be interesting being a graphic artist in 2020-2022. First NFTs, which enabled some to make millions of dollars. Less than 2 years later, Stable Diffusion, which will probably shrink the market significantly for human graphic artists.


can someone recommend a good paper or blog post with an overview of the technical architecture of training and running stable diffusion?



I guess we know where the new market for all those Ethereum miners' GPUs will come from. I have always been sort of bear-ish on the trend towards throwing GPU power at neural nets and their descendants, but clearly there are amazing applications for this tech. I still think it's morally kinda wrong to copy an artist's style using an automated tool like this, but I guess we'll have to deal with that because there's no putting this genie back in the bottle.


Just imagine - you could write your own script for a series and have it realistically generated, especially cartoons, complete with voice acting. Popular generated SpongeBob episodes could form canonical entries in the mind of the general public - after some information fallout, original episodes couldn't even be told apart. Postmodern pastiche will accelerate and become total.


7 days trying to install Python and its packages, and I failed. I had to remove that garbage, the global dependencies, from my machine. Such a waste of an ecosystem.


Someone on reddit made a single self contained .exe with a GUI (haven't tested it) https://old.reddit.com/r/StableDiffusion/comments/wwh1s9/jus...


I'm using the Docker one, so much easier and no worries of polluting my real environment (all the installation scripts tend to download a variety of things from a variety of places).


OpenVINO Stable Diffusion (a CPU port of SD) is an easy install on Linux within a venv. Be sure to update pip before installing the required packages from the list. The lack of GPU acceleration and its associated baggage makes this much easier to set up and run.

https://github.com/bes-dev/stable_diffusion.openvino


Try this one with the docker image instead: https://github.com/AbdBarho/stable-diffusion-webui-docker


I know several semi-non-technical people that have got this running locally.


Yes, I wish I had the same luck as them. Sometimes I think they're geniuses!


If you're on Windows this is by far the easiest way: https://softology.pro/tutorials/tensorflow/tensorflow.htm

Mostly-automated installer.


It took me some time to get the OpenVINO distribution running on my Windows box. It turns out that it wasn't compatible with Python 3.10; I had to go back to 3.9. Maybe that'll help you?


I found it surprisingly easy to run it on a 2015 MacBook Pro.


How long does it take you to generate an image? What setup/fork are you using for this? Thanks


That Figma plugin is mind blowing to me. I'm also curious to see how the Blender integration pans out


In 30 years everything AI generates will be a red circle, because at that point it will have just trained on itself repeatedly.

Instead of labeling data for what things are, we'll have to label things as being generated or not.


The collaboration, pace, and progress are stunning. If only this could be applied to other fields such as climate change, etc.

Great write up


One use case I have in mind is manga drawing. I wonder if anybody has tested manga related generation.


You can coax it to generate whole manga pages. The only downside is that the text and story are incoherent, and the characters are inconsistent.


Right, consistency would be an issue. Thank you for replying.



