I can't speak for others obviously, but this sort of caption is nauseating:
> In the heart of a vibrant skatepark, a skateboarder is caught in a moment of pure exhilaration. The skateboarder, dressed in a black t-shirt adorned with a yellow graphic and black pants, is suspended in mid-air, performing an impressive trick on a concrete ramp. The skateboarder's arms are outstretched, adding balance to the daring stunt. The skatepark itself is a concrete playground, with the skateboarder's ramp being the main focus. In the background, palm trees sway gently, adding a touch of nature to the urban setting. A few spectators can be seen in the distance, their attention riveted on the airborne skateboarder. The image captures not just a moment, but a story of skill, courage, and the joy of skateboarding.
This seems a lot more like a puff piece from a local publisher trying to fill space, or description of a stock photo to an advertiser, than a description I'd describe as accurate from a human to another human.
It's clearly the edge of a skatepark. Not "the heart."
etc. etc... others have gone through it extensively so I won't do that again, but it's full of gratuitous added content that does not match what's in the picture at all. It seems to be aping some writing style, not going for accuracy.
What's more remarkable to me is that the authors of this do not seem to notice.
It's bizarre that you would create a project using an approach and then when assessing that project prior to publication, you would just glance at the paragraph without even reading it, and say "looks great" and move on. Details matter. This is absolute crap and we need researchers who can discern crap, not just accept anything.
Further down in the post they ask why an image showing a dog as the Mona Lisa is funny. It's actually not funny. It's an old trope, and, as such, super dull. They should realize it's funny only to a subset of people, but they don't seem to even realize that much. This team needs to get out more.
Came to say the same. It might be the task prompt "describe the picture" that puts it into that mode; even so, I'd hope no human being would really write such tosh.
I really am very pro ai describing images, but it's the editorializations like "The image captures not just a moment, but a story of skill, courage, and the joy of skateboarding." that struck me as—idk, quite odd and uncanny-valley like.
It's not only nauseating to read but it's also too speculative, e.g. how do you know the trick is impressive? How do you know the skateboarder is exhilarated?
Yeah, this is my number one complaint with all recent open source vision models, and it seems like it is only getting worse. It's verbose to the point of parody, making it extremely difficult to evaluate what it can actually _see_, and what it's just dumbly markov-chaining based on previous text tokens.
In GPT4V, you can prompt around this if you know about it, but none of the people collecting datasets for open models appear to know or care to apply that, and so we just get this default GPT4V contamination everywhere.
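To illustrate the "prompt around it" point: a minimal sketch of what a terse-caption prompt might look like. The model name, endpoint shape, and prompt wording here are my assumptions, modeled on OpenAI's chat-completions vision format, not something taken from the thread:

```python
# Hypothetical sketch: steering a vision model toward terse, literal captions
# via the system prompt. Model name and payload shape are assumptions based
# on OpenAI's chat-completions vision format.

TERSE_CAPTION_PROMPT = (
    "Describe only what is visibly present in the image, in one or two "
    "plain sentences. Do not speculate about emotions, atmosphere, skill, "
    "or anything outside the frame. No flowery language."
)

def build_caption_request(image_url: str, model: str = "gpt-4-vision-preview") -> dict:
    """Build a chat-completions payload that asks for a no-nonsense caption."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": TERSE_CAPTION_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Caption this image."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        # A short token cap also discourages padded, editorializing prose.
        "max_tokens": 100,
    }

payload = build_caption_request("https://example.com/skatepark.jpg")
```

The system prompt does most of the work; the low `max_tokens` is a belt-and-suspenders guard against the model padding out a paragraph anyway.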
The only vision model I enjoy is Google Gemini, simply because it will give you a no-nonsense caption. Of course it still hallucinates things that are not there, but getting a color or object wrong is orders of magnitude less bad than having 3 sentences that have nothing to do with the image.
That's the price of getting an image that accurately represents what you had in mind. Otherwise you could just prompt it with "skateboarder in a skatepark".
"vibrant" could apply to the activity, not the physical structure. There's 4 other people in the back of the photo - if you assumed it was a tiny slice of the park, you could say it was vibrant.
(to me it doesn't look particularly vibrant re: activity but this is just one small corner and I will allow the ML some leeway in its floridity.)
The paragraph said it is the heart of the park, not a "tiny slice."
If you're going to praise the paragraph, at least choose a word that's defensible. Like ok, it got "the" right.
Surprising that you would defend that particular word. Far from being vibrant, the place looks dead, frankly. You can see it in the bored faces of the three people staring off in random directions with disinterested stances. Even the guy who's walking looks like he's just shuffling along.
"I can see how, if you look at it in just the right light, you might think there is a little blue in there."
It doesn't accurately represent the photo, hence the issue at hand. In many ways "skateboarder in a park" is more accurate than the large number of small inaccuracies this description manages to accrue. (Like many humans, but still the verbosity is very odd for the little detail it actually conveys!)
I'm not trying to argue against the idea of AI-generated captioning, just that the product is very inferior to what even mild dilettantes of the field might expect from the advertised capabilities.
This isn't a text-to-image model; it's an image captioning model. The images in the figures are confusingly labeled, since it's the caption that's generated, not the image, I think.