I can't speak for others obviously, but this sort of caption is nauseating:
> In the heart of a vibrant skatepark, a skateboarder is caught in a moment of pure exhilaration. The skateboarder, dressed in a black t-shirt adorned with a yellow graphic and black pants, is suspended in mid-air, performing an impressive trick on a concrete ramp. The skateboarder's arms are outstretched, adding balance to the daring stunt. The skatepark itself is a concrete playground, with the skateboarder's ramp being the main focus. In the background, palm trees sway gently, adding a touch of nature to the urban setting. A few spectators can be seen in the distance, their attention riveted on the airborne skateboarder. The image captures not just a moment, but a story of skill, courage, and the joy of skateboarding.
This seems a lot more like a puff piece from a local publisher trying to fill space, or description of a stock photo to an advertiser, than a description I'd describe as accurate from a human to another human.
It's clearly the edge of a skatepark. Not "the heart."
etc. etc... others have gone through it extensively so I won't do that again, but it's full of gratuitous added content that does not match what's in the picture at all. It seems to be aping some writing style, not going for accuracy.
What's more remarkable to me is that the authors of this do not seem to notice.
It's bizarre that you would create a project using an approach and then when assessing that project prior to publication, you would just glance at the paragraph without even reading it, and say "looks great" and move on. Details matter. This is absolute crap and we need researchers who can discern crap, not just accept anything.
Further down in the post they ask why an image showing a dog as the Mona Lisa is funny. It's actually not funny. It's an old trope, and, as such, super dull. They should realize it's funny only to a subset of people, but they don't seem to even realize that much. This team needs to get out more.
Came to say the same. It might be the task prompt "describe the picture" that puts it into that mode; even so, I'd hope no human being would really write such tosh.
I really am very pro ai describing images, but it's the editorializations like "The image captures not just a moment, but a story of skill, courage, and the joy of skateboarding." that struck me as—idk, quite odd and uncanny-valley like.
It's not only nauseating to read but it's also too speculative, e.g. how do you know the trick is impressive? How do you know the skateboarder is exhilarated?
Yeah, this is my number one complaint with all recent open source vision models, and it seems like it is only getting worse. It's verbose to the point of parody, making it extremely difficult to evaluate what it can actually _see_, and what it's just dumbly markov-chaining based on previous text tokens.
In GPT4V, you can prompt around this if you know about it, but none of the people collecting datasets for open models appear to know or care to apply that, and so we just get this default GPT4V contamination everywhere.
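To illustrate the "prompt around it" point: a minimal sketch of what a terse-caption prompt might look like. The model name, endpoint shape, and prompt wording here are my assumptions, modeled on OpenAI's chat-completions vision format, not something taken from the thread:

```python
# Hypothetical sketch: steering a vision model toward terse, literal captions
# via the system prompt. Model name and payload shape are assumptions based
# on OpenAI's chat-completions vision format.

TERSE_CAPTION_PROMPT = (
    "Describe only what is visibly present in the image, in one or two "
    "plain sentences. Do not speculate about emotions, atmosphere, skill, "
    "or anything outside the frame. No flowery language."
)

def build_caption_request(image_url: str, model: str = "gpt-4-vision-preview") -> dict:
    """Build a chat-completions payload that asks for a no-nonsense caption."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": TERSE_CAPTION_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Caption this image."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        # A short token cap also discourages padded, editorializing prose.
        "max_tokens": 100,
    }

payload = build_caption_request("https://example.com/skatepark.jpg")
```

The system prompt does most of the work; the low `max_tokens` is a belt-and-suspenders guard against the model padding out a paragraph anyway.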
The only vision model I enjoy is Google Gemini, simply because it will give you a no-nonsense caption. Of course it still hallucinates things that are not there, but getting a color or object wrong is orders of magnitude less bad than having 3 sentences that have nothing to do with the image.
That's the price of getting an image that accurately represents what you had in mind. Otherwise you could just prompt it with "skateboarder in a skatepark".
"vibrant" could apply to the activity, not the physical structure. There's 4 other people in the back of the photo - if you assumed it was a tiny slice of the park, you could say it was vibrant.
(to me it doesn't look particularly vibrant re: activity but this is just one small corner and I will allow the ML some leeway in its floridity.)
The paragraph said it is the heart of the park, not a "tiny slice."
If you're going to praise the paragraph, at least choose a word that's defensible. Like ok, it got "the" right.
Surprising that you would defend that particular word. Far from being vibrant, the place looks dead, frankly. You can see it in the bored faces of the three people staring off in random directions with disinterested stances. Even the guy who's walking looks like he's just shuffling along.
"I can see how, if you look at it in just the right light, you might think there is a little blue in there."
It doesn't accurately represent the photo, hence the issue at hand. In many ways "skateboarder in a park" is more accurate than the large number of small inaccuracies this description manages to accrue. (Like many humans, but still the verbosity is very odd for the little detail it actually conveys!)
I'm not trying to argue against the idea of AI-generated captioning, just that the product is very inferior to what even mild dilettantes of the field might expect from the advertised capabilities.
This isn't a text-to-image model; it's an image captioning model. The images in the figures are confusingly labeled, since it's the caption that's generated, not the image, I think.