GPT-4V(ision) Unsuitable for Clinical Care and Education: An Evaluation (arxiv.org)
75 points by PaulHoule on March 26, 2024 | 52 comments


I'm not specifically referring to this article, but in general I've noticed a frustrating pattern:

> AI company releases generalist model for testing/experimentation

> Users unwisely treat it like a universal oracle and give it tasks far outside its training domain

> It doesn't perform well

> People are shocked and warn about the "dangers of AI"

This happens every time. Why can't we treat AI tools as what they actually are: interesting demonstrations of emergent intelligent properties that are a few versions away from production-ready capabilities?


I spoke w/ Marvin Minsky once back in the 1990s and he told me that he thought "emergent properties" were bunk.

As for the future I am certain LLMs will become more efficient in terms of resource consumption and easier to train, but I am not so certain that they're going to fundamentally solve the problems that LLMs have now.

Try to train one to tell you what kind of shoes a person is wearing in an image and it will likely "short circuit" and conclude that a person with fancy clothes is wearing fancy shoes (true much more often than not) even if you can't see their shoes at all. (Is a person wearing a sports jersey and holding a basketball on a basketball court necessarily wearing sneakers?) This is one of those cases where bias looks like it is giving better performance, but show a system like that a basketball player wearing combat boots and it will look stupid. So much of the apparent high performance of LLMs comes out of this bias, and I'm not sure the problem can really be fixed.


> I spoke w/ Marvin Minsky once back in the 1990s and he told me that he thought "emergent properties" were bunk.

He said the same thing about perceptrons in general. When it comes to bunk, Minsky was... let's just say he was a subject matter expert.

He caused a lot of people to waste a lot of time. I've got your XOR right here, Marvin...


Minsky is an example of the power of credentialism. He was an MIT professor who researched AI. I think history has pretty much demonstrated he didn't have any special insight into the question of AI, but for decades he was the most famous researcher in the field.


Hard to say. Perceptrons was an early mathematically rigorous book in CS that asked questions about what you could accomplish with a family of algorithms. That said, it is easy to imagine that neural networks could have gained 5 years or more of progress if people had had some good luck early on or taken them more seriously.


I doubt it would take very many suitable examples in the training dataset to fix problems like that.


I think this problem is mostly already solved with more capable models.

Here's the shoe-identification experiment with GPT-4 and it's not confused by a suit jacket:

https://imgur.com/a/cPP6pH7

It's also not confused by a tuxedo:

https://imgur.com/a/mWMxZbA

Also tried the same examples with Google Gemini 1.5 Pro and it doesn't hallucinate either.

https://imgur.com/a/3Yv7wuP


If it sees 90% of photos of basketball players wearing sneakers it is still going to get the idea that basketball players wear sneakers. I guess I could make up some synthetic data where I mix and match people's shoes but it's a strange way to get at the problem: train on some image set that has completely unrealistic statistics to try to make it pay attention to particular features as opposed to other ones.

Problems like that have been bugging me for a while. If you had some Bayesian model you could adjust the priors to make the statistics match a particular set of cases, and it would be nice if you could do the same with a wide range of machine learning approaches. For instance, you might find that 90% of the cases are "easy" cases that seem like a waste to include in the training data; keeping them there gives the model the right idea about the statistics, but it may burn up the model's learning capacity such that it can't really learn from the other 10%.

I talked w/ some contractors who made intelligence models for three-letter agencies and they told me all about ways of dealing with this, which come down to building multi-stage models: you build a model that separates the hard cases from the easy cases, then train specialized models on specialized training sets for each. It's one of those things that some people have forgotten in the Research Triangle Park area but Silicon Valley never knew.
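A minimal sketch of that staged setup, with scikit-learn standing in for whatever those contractors actually used; the choice of classifiers, the gate threshold, and how `hard_mask` gets produced are all assumptions, not anything described above:

```python
# Two-stage model: a "gate" routes easy cases to a simple classifier and
# hard cases to a model trained only on the hard subset. How hard_mask is
# obtained (first-pass errors, annotator disagreement, etc.) is assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def train_two_stage(X, y, hard_mask):
    """X, y, hard_mask: numpy arrays; hard_mask is a boolean flag per example."""
    gate = LogisticRegression(max_iter=1000).fit(X, hard_mask)                 # easy vs. hard
    easy_clf = LogisticRegression(max_iter=1000).fit(X[~hard_mask], y[~hard_mask])
    hard_clf = GradientBoostingClassifier().fit(X[hard_mask], y[hard_mask])
    return gate, easy_clf, hard_clf

def predict_two_stage(gate, easy_clf, hard_clf, X, threshold=0.5):
    is_hard = gate.predict_proba(X)[:, 1] > threshold   # column 1 = P(hard)
    preds = np.empty(len(X), dtype=int)
    if (~is_hard).any():
        preds[~is_hard] = easy_clf.predict(X[~is_hard])
    if is_hard.any():
        preds[is_hard] = hard_clf.predict(X[is_hard])
    return preds
```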


Your example is really simple to fix.

Just add another model asking it if there are shoes visible or not.


That's the right track but you have to go a little further.

You probably need to segment out the feet and then have a model that just looks at the feet. Just throwing out images without feet isn't going to tell the system that it is only supposed to look at the feet. And even if you do that, there is also the inconvenient truth that a system that looks at everything else could still beat a "proper" model, because there are going to be cases where the view of the feet is not so good and exploiting the bias is going to help.

This weekend I might find the time to mate my image sorter to a CLIP-then-classical-ML classifier and I'll see how well it does. I expect it to do really well on "indoors vs outdoors" but would not expect it to do well on shoes (other than by cheating) unless I put a lot of effort into something better.
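For what it's worth, the CLIP-then-classical-ML part of that plan is only a few lines. This is a rough sketch, not the commenter's actual image sorter; the checkpoint name and the indoors/outdoors labels are assumptions:

```python
# Embed images with CLIP, then train an ordinary classifier on the embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import LogisticRegression

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats.numpy()

# Hypothetical data: train_paths / train_labels could be "indoors" vs. "outdoors".
# clf = LogisticRegression(max_iter=1000).fit(embed(train_paths), train_labels)
# print(clf.predict(embed(test_paths)))
```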


There are lots of papers about using some kind of LLM to cut costs and staffing in hospitals. Big companies believe there is a lot of money to be made here, despite the obvious dangers.

A quick Google search found this paper, here's a quote:

"Our evaluation shows that GPT-4V excels in understanding medical images and is able to generate high-quality radiology reports and effectively answer questions about medical images. Meanwhile, it is found that its performance for medical visual grounding needs to be substantially improved. In addition, we observe the discrepancy between the evaluation outcome from quantitative analysis and that from human evaluation. This discrepancy suggests the limitations of conventional metrics in assessing the performance of large language models like GPT-4V and the necessity of developing new metrics for automatic quantitative analysis."

https://arxiv.org/html/2310.20381v5

For sure there's some waffling at the end, but many people will come away with the feeling that this is something GPT-4V can do.


There is actually a lot of work involved in tracking everything for an operation.

If an AI is able to record an operation with 9x% accuracy but you save on humans, the insurers might just accept this.

Nonetheless, the chance that AI will continually get better and better at it is very high.

Our society will switch. Instead of rewriting or updating software, we will fine-tune models and add more examples.

Because this is actually sustainable (you can reuse the old data), this will win in the end.

The only thing changing in the future will be the model architecture; training data will only be added.


Same reason we can't trust people to drive 35 MPH when the road is straight and wide, no matter how many signs are posted to declare the speed limit. It's just too tempting and easy to become complacent.

That, and these companies have a substantial financial interest in pushing the omniscience/omnipotence narrative. OpenAI trying to encourage responsible AI usage is like Philip Morris trying to encourage responsible tobacco use. Fundamental conflict of interest.


It's not in human nature. A friend of mine just asked me how to make ChatGPT create a PowerPoint presentation. He meant the pptx file. It can't. Googling for him, I learned that of course it can create the text and, a little more surprisingly, even the VBA program that creates the slides. That's out of scope for that friend of mine. He was very surprised. He was like, with all it can do, why not the slides?
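As an aside, turning LLM-generated outline text into an actual .pptx takes only a few lines of Python with the python-pptx library. This is a sketch under my own assumptions (the thread mentions VBA, not Python, and the slide content below is made up):

```python
# Build a .pptx from outline text, e.g. text an LLM produced.
from pptx import Presentation

# Hypothetical outline; in practice this would come from the LLM's output.
outline = [
    ("Project status", ["Milestone 1 shipped", "Milestone 2 in review"]),
    ("Next steps", ["Collect feedback", "Plan the next release"]),
]

prs = Presentation()
layout = prs.slide_layouts[1]  # built-in "Title and Content" layout
for title, bullets in outline:
    slide = prs.slides.add_slide(layout)
    slide.shapes.title.text = title
    body = slide.placeholders[1].text_frame
    body.text = bullets[0]               # first bullet
    for bullet in bullets[1:]:
        body.add_paragraph().text = bullet
prs.save("deck.pptx")
```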


If you don't mind Marp (Markdown for slides), it can do a pretty good job of generating that.


Did you look at Office 365's Copilot? It can do a lot of stuff and it can aid in creating a PowerPoint presentation for sure. Technically, Copilot would be using Office 365's PowerPoint software to create the slideshow file, but it would insert the content to the best of its ability based on the prompts.


Last week, ChatGPT literally generated a pptx file that I downloaded, with the content that I asked for.


ChatGPT-4? Did you use some plugin?


Can it output LaTeX? If so, you could try beamer.


Two reasons:

1. Because you've got one or more of the below spinning it into either a butterfly to chase or a product to buy:

- 'Research Groups', e.g. Gartner

- Startups with an 'AI' product

- Startups that add an 'AI' feature

- OpenAI [0]

2. I'm currently working on a theory that a reasonable portion of the population in certain circles is viewing ChatGPT and its ilk as the perfect way to mask their long COVID symptoms and is thus embracing it blindly. [1]

[0] - The level of hype in some articles about ChatGPT3 reminded me a little too much of the fake viral news around the launch of Pokemon Go, adjusted for fake viral news producers having improved the quality of their tactics. Especially because it flares up when -they- do things... but others?

[1] - Whoever needs to read this probably won't, but: I know when you had ChatGPT write the JIRA requirements, and more importantly I know when you didn't sanity-check what it spat out.


Because companies like OpenAI market the shit out of them to get people to believe that ChatGPT can do anything.


Every piece of marketing coming out of Google and Microsoft is about how AI is coming and it's the future, and there are still people asking why others have unrealistic expectations of these models.


OpenAI is very conservative with their marketing. All they do is release blog posts once in a while and have Sam Altman talk to journalists sometimes.


There are production-ready AI tools in hospitals, with FDA approval. The writers of this article just decided to try a non-FDA-approved tool and ignore the approved ones.


Because none of us really know what we mean by "intelligent".

When we see a thing with coherent writing about any subject we're not experts in, even when we notice the huge and dramatic "wet pavements cause rain" level flaws when it's writing about our own speciality, we forget all those examples of flaws the moment the page changes and we revert once more to thinking it is a font of wisdom.

We've been doing this with newspapers for a century or two before Michael Crichton coined the Gell-Mann Amnesia effect.


It's called Artificial Intelligence.


Because the average user of AI is not a Hacker News user who understands its limitations, and the ones who do understand its limitations tend to exaggerate and overhype it to make people think it can do anything. The only real fix is for companies like OpenAI to encourage better usage (e.g. tutorials), but there's no incentive for them to do so yet.

I wrote a rant a few months back about how the greatest threat to generative AI is people using it poorly: https://minimaxir.com/2023/10/ai-sturgeons-law/


Then there are the people who should know better but get seduced by the things, which they are really good at.


Because everybody is shit scared of the implications of it being good enough to replace them. This is just a protective defense mechanism, because it threatens the status quo upon which entire careers have been built.

"It is difficult to get a man to understand something when his salary depends on his not understanding it."


Last week, a tweet went viral of a Claude user fighting with radiologists claiming the LLM found a tumor when the radiologists did not (most of the replies were rightfully dunking on it):

> A friend sent me MRI brain scan results and I put it through Claude.

> No other AI would provide a diagnosis, Claude did.

> Claude found an aggressive tumour.

> The radiologist report came back clean.

> I annoyed the radiologists until they re-checked. They did so with 3 radiologists and their own AI. Came back clean, so looks like Claude was wrong.

> But looks how convincing Claude sounds! We're still early...

https://twitter.com/misha_saul/status/1771019329737462232


This is super dangerous. It's WebMD all over again but much worse. It's hard enough to diagnose but now you have to fight against some phantom model that the patient thinks is smarter than their doctor.


WebLLMD


There are probably disproportionately more texts describing the presence of a tumor on a scan than texts describing what a normal scan would look like. So this is expected. But I think that is not the purpose of an LLM. There should be a classification algorithm detecting the presence or absence of a tumor, and then an LLM can be used to generate a description of it or of why the scan is normal.

And I don't know of any, but there are probably already tools to help radiologists, which is what they call their "own AI".
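A sketch of that split, purely as an illustration: the backbone, the hypothetical checkpoint file, and the prompt are all stand-ins, and the only point is that the language model narrates the classifier's output rather than reading the pixels itself.

```python
# The classifier makes the tumor/no-tumor call; the LLM only turns the
# structured result into prose. All names here are assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

classifier = models.resnet50(weights=None)
classifier.fc = torch.nn.Linear(classifier.fc.in_features, 2)  # tumor / no tumor
# classifier.load_state_dict(torch.load("tumor_classifier.pt"))  # hypothetical weights
classifier.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

def classify(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(classifier(x), dim=1)[0]
    return {"tumor_probability": round(float(probs[1]), 3)}

def report_prompt(result):
    # Hand the classifier's verdict to whatever LLM you use; it never sees the image.
    return f"Write a short plain-language note for this finding: {result}"
```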


After reading the diagnosis I caught myself wanting to examine the MRI to see if the bright area really exists, which means I fell for this too. Imagine being the person who received this diagnosis. Of course you're going to be concerned, even if it is LLM garbage.


Wish we'd get more articles from actual practitioners using generative AI to do things. Nearly all the articles you see on the subject are on the level of existential threats or press releases, or dunking on mistakes made by LLMs. I'd really rather hear a detailed writeup from professionals who used generative AI to accomplish something. The only such article I've run across in the wild is this one [0] from JetBrains. Anyway, if anyone has any article suggestions like this, please share!

https://blog.jetbrains.com/blog/2023/10/16/ai-graphics-at-je...


I'm using ChatGPT right now to create a Hugo page.

For whatever reason, Hugo's documentation is weird to get into, while ChatGPT is shockingly good at telling me what I'm actually looking for.


This is expected, right? It would be surprising if something as general as GPT-4V was trained on a diverse and nuanced set of radiology images, vs., say, a traditional CNN trained and refined to detect specific diseases. It feels akin to concluding that a picnic basket doesn't make a good fishing toolbox after all. Worse would be if someone in power were actually enthusiastically recommending plain GPT-4V as a realistic solution for specialized vision tasks.


The ECG stood out to me because my wife is a cardiologist and worked with a company, iCardiac, to look for specific anomalies in ECGs. They were looking for long QT to ensure clinical trials didn't mess with the heart. There was a team of data scientists who worked to help automate this, and they couldn't, so they just augmented the UI for experts - there was always a person in the loop.

Looking at an ECG as a layperson, it's a problem that seems easy if you know about some tools in your math toolbox, but it's deceptively hard, and a false negative might mean death for the patient. So I'm not going to trust a generic vision transformer model with this task, and until I see overwhelming evidence I won't trust a specifically trained model for this task either.
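To illustrate the "seems easy" part: naive R-peak detection really is a few lines of signal processing, which is exactly the trap. The sampling rate and thresholds below are assumptions, and this is nowhere near what long-QT screening actually requires.

```python
# Naive R-peak detection on a single-lead ECG trace: the "easy-looking"
# version warned about above, not a clinically usable method.
import numpy as np
from scipy.signal import find_peaks

def naive_heart_rate(ecg, fs=500):
    """ecg: 1-D array of samples; fs: sampling rate in Hz (assumed)."""
    peaks, _ = find_peaks(
        ecg,
        height=np.percentile(ecg, 98),   # crude amplitude threshold
        distance=int(0.3 * fs),          # refractory period of roughly 300 ms
    )
    rr = np.diff(peaks) / fs             # R-R intervals in seconds
    bpm = 60.0 / rr.mean() if rr.size else float("nan")
    return peaks, rr, bpm
```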


Have there been many studies like this one that have been judged blind?

I'd trust a study like this a little more if the human evaluators were presented with the output of GPT-4 mixed together with the output from human experts, such that they didn't know if the explanation they were evaluating came from a human or an LLM.

This would reduce the risk that participants in the study, consciously or subconsciously, marked down AI results because they knew them to be AI results.


Soon enough we will train models with the firepower of GPT-4 (5?) that are purpose-built and trained from the ground up to be medical diagnostic tools. A hard focus on medicine with thousands of hours of licensed-physician RLHF. It will happen, and it is almost certainly already underway.

But until it comes to fruition, I think it's largely a waste for people to spend time studying the viability of general models for medical tasks.


Definitely. Saw a video recently mentioning the increase in well-paid gigs for therapists in a metro area, which ask only to record all therapist-patient interactions and treat them as IP. It seems likely that the data would be part of a corpus to train specialist models for psychotherapy AI, and if this kind of product can actually work, I don’t see why every other analytical profession wouldn’t be targeted and well underway. Lots of guesses there though, and personally I hope we aren’t rushing into this.


I question the competence of anyone using any modern AI (not just GPTs) for medical decisions. Beyond being capable of passing a multiple-choice exam (which can be done by trained monkeys) they're not ready for this, and they won't be for years. I guess confirmation of this is still good to know.


This seems like a pretty useless study, as they don't collect any results from human doctors; therefore there is nothing to compare their GPT-4V results to.


Instead of comparing against some "average doctor", they used a few doctors as the "source of truth":

> All images were evaluated by two senior surgical residents (K.R.A, H.S.) and a board-certified internal medicine physician (A.T.). ECGs and clinical photos of dermatologic conditions were additionally evaluated by a board-certified cardiac electrophysiologist (A.H.) and dermatologist (A.C.), respectively


I think the parent comment was referring to something else.

In the paper the tasks are only completed by GPT-4V. For a valid scientific investigation, there should be a control set completed by e.g. qualified doctors. When the panel of experts does their evaluation, they should rate both sets of responses so that the difference in score can be compared in the paper.


Agreed. Those are different evaluations (which is what I meant by "Instead of comparing against"). The paper cannot conclude that "doctors are better/more correct".

It assumes "here are 5 doctors who are always correct", then measures GPT's correctness against them.


Healthcare is stupid expensive. Even for those with government-provided coverage, it still costs a lot of your tax dollars.

Clearly we are not there yet, but under certain conditions we are getting close. If we can enable nurses and doctors to be more productive with these tools, we absolutely should explore that. Keep a doctor in the loop since AI is still error-prone, but you can line things up in a way such that doctors make the decision with AI-gathered information.


IMHO, we need structural change in the way US healthcare works. Slathering some AI on top won't solve our cost problems but it will negatively impact our already declining quality of patient care.


Yeah, I don’t think we’re more than a decade off from GPs being replaceable with AI. But you can bet they will lobby hard against it.


This is deeply wishful thinking. We're not even remotely close to self-driving cars, and that is numerous orders of magnitude easier than trying to diagnose a human being with anything.


I’ve been using self driving cars for a year.


This feels like it was done for the clicks and not actual research. There are plenty of AI startups in the radiology game, and GPT-4V isn't free - so why is that the one being tested?

Run research on models that are actually trained to solve these issues; that is relevant research.

Just as an example, Viz.ai applied[1] and received FDA clearance/approval for their model in hospitals.

Have OpenAI ever submitted a request for use of GPT-4V? What's next, trying autopilot driving with GPT-4V?

[1] https://www.viz.ai/news/viz-ai-receives-fda-510k-clearance-f...



