
The authors did "X - Y vs X," but that's not even the biggest problem. The authors subtracted two measures that had been transformed and bounded from 0 to 1 (think percentiles). What happens at the extremes of those bounds? How much can your top performers overestimate their performance? They're almost at 1 already, so not much. If they overestimated and underestimated at the same rate and by the same magnitude in raw values, the ceiling effect on the transformed values would make the graph look like they underestimate more often. The opposite problem happens for the worst performers.

See "Random Number Simulations Reveal How Random Noise Affects the Measurements and Graphical Portrayals of Self-Assessed Competency." Numeracy 9, Iss. 1 (2016), particularly figures 7, 8, and 9.



Exactly, that was my thought. How would it be possible to get anything other than the D-K effect, even if it wasn't just averaging to the mean?

The lowest quartile can't say they're below the lowest quartile, so any error at all will be counted as "overconfidence." The top quartile can't say they're above the top quartile, so any error at all will be counted as "underconfidence."


> Exactly, that was my thought. How would it be possible to get anything other than the D-K effect, even if it wasn't just averaging to the mean?

Quite easily, with the method demonstrated in figure 11 of that study. There, participants are not rating themselves in population percentiles, but in terms of the percentage of questions they got correct on the test. In that case the test can be designed with a ceiling so high that even the most knowledgeable participants would have trouble reaching it, and a floor so low that even the least knowledgeable participants would still get some answers correct (unless they weren't even trying, which would justify throwing out their data points).

With 20 questions you could have four gimmes and four impossible questions, bounding the worst participants to about 20% and the best to about 80%.
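
As a quick sanity check of that arithmetic (hypothetical test; the skill model is just an assumption):

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical 20-question test: 4 gimmes everyone answers correctly,
    # 4 impossible questions nobody does, and 12 questions answered
    # correctly with probability equal to the participant's skill.
    def score(skill):
        middle = (rng.random(12) < skill).sum()
        return (4 + middle) / 20

    print(score(0.0))  # floor: 4/20 = 0.20
    print(score(1.0))  # ceiling: 16/20 = 0.80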


Right. To clarify, I meant: with the original study design, how could they not have gotten the result they did? (And that's rhetorical.)


It would have been noteworthy in the original design if more than one quartile group had, on average, guessed within its own quartile. I also find it noteworthy that the average guess of the lowest quartile is lower than the average guess of the second-lowest quartile, and so on up the quartiles. On one hand this shows some awareness of relative ability, along a massively smooshed logarithmic scale. On the other hand I wonder whether this laddering holds up when the averages are split into quintiles and deciles.


I wonder if estimating on the logit scale would solve this problem.
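
It would at least remove the hard bounds. A quick sketch of the transform (the eps guard is my own addition, not from any study):

    import numpy as np

    def logit(p, eps=1e-6):
        p = np.clip(p, eps, 1 - eps)  # avoid infinities at exactly 0 or 1
        return np.log(p / (1 - p))

    # The scale stretches near the bounds, so a clipped 0-1 estimate no
    # longer has anywhere to "pile up":
    print(logit(0.5), logit(0.9), logit(0.99))  # 0.0, ~2.2, ~4.6

Whether people can meaningfully self-report on that scale is another question.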


I think if people at all levels of skill were reasonably good at measuring their own ability, we would see two curves that roughly overlap. Instead we see the graph given.

The fact that random noise can generate a mean curve on the Y axis doesn't mean DK doesn't exist. It just means the mean self-assessment in DK resembles a middling random mean, which, if you think about it, makes sense. Most people will probably self-evaluate as average, regardless of their actual skill. This means DK is right as rain.


> I think if people at all levels of skill were reasonably good at measuring their own ability, we would see two curves that roughly overlap. Instead we see the graph given.

Actually, due to the construction of the test, the ability to evaluate your own absolute ability in a subject isn't sufficient for the two lines to overlap.

It's a percentile axis, so you need to be able to reasonably accurately estimate the ability of everyone taking the test, and where you fall in the quartile range of those participants.


Why does it matter if it’s absolute result vs percentile result?

In the former, you’re asked to predict your score.

In the latter, you’re asked to predict your place among others.

Yes, the latter is more difficult to do accurately, but if people were really able to evaluate themselves, they would be able to tell whether they're, on average, below or above the median. The results of DK show that most people think they _are_ the median (we all think other people are like us). This means, as a result, that less capable people overestimate their abilities and more capable people underestimate theirs. It tracks.


> Yes, the latter is more difficult to do accurately, but if people were really able to evaluate themselves, they would be able to tell whether they're, on average, below or above the median.

Let's suppose I ask you "How tall are you?" Would you be able to answer? Good, then you are able to accurately assess your own height.

Now let's suppose I ask you "How tall are you, as a percentile of a group made up of you and 99 people you don't know?" You should realize that you can't answer that nearly as accurately, because you have no knowledge of their heights just from knowing your own.

Now for the even more convoluted, and actual, Dunning-Kruger assessment. I ask you "How tall are you?" Great. Now, at which percentile do you think your deviation from your actual height falls, compared to the deviations of these other 99 people's estimates from their actual heights? How on earth are you supposed to answer that unless you have some knowledge of how they perform? Are people a cm off? Are some people 10 cm off? Are people millimeter-precise?

The problem with the Dunning-Kruger effect is that it effectively says "people who are worse at estimating their own height tend to overestimate their standing, while people who are better at it tend to underestimate theirs," but if you look at people's absolute ability to estimate their own height, it's similar regardless of where they rank. The Dunning-Kruger analysis methodology is set up such that it transforms random noise into an observation of the Dunning-Kruger effect, which is the problem highlighted in the OP. Part of the problem is having participants estimate on a percentile range instead of doing a simple absolute estimation. Ask people "So how far off do you think you are, in cm?" and you'll see that they are fairly consistent in assessing their own ability, and the Dunning-Kruger effect goes away. The effect is a result of the methodology, not of the actual people being tested.
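
You can watch the methodology manufacture the effect in a toy version of exactly this height setup (numpy sketch; all the numbers are my assumptions): give everyone the same error distribution, have everyone guess the only sensible thing, the median, and the quartile plot still comes out looking like Dunning-Kruger.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 10_000

    # Everyone misestimates their height with the same noise: there is
    # no real difference in self-assessment skill at all.
    error = np.abs(rng.normal(0, 2.0, n))  # cm off, identical for all

    # Rank by accuracy: smaller error = higher percentile.
    ranks = error.argsort().argsort()  # 0 = smallest error
    actual_pct = 1 - ranks / n         # most accurate -> top percentile

    # With no information about the other 99 people, everyone guesses
    # around the median, with some unbiased scatter.
    guessed_pct = np.clip(0.5 + rng.normal(0, 0.15, n), 0, 1)

    quartile = np.digitize(actual_pct, [0.25, 0.5, 0.75])
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: actual={actual_pct[m].mean():.2f}, "
              f"guessed={guessed_pct[m].mean():.2f}")

The bottom quartile "overestimates" by roughly 0.37 and the top quartile "underestimates" by about the same, purely because nobody could know their standing.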

But that's a hard sell for most people because they have a bias about "dumb people," and the effect as originally stated confirms that bias, so people hold on to the conclusion even as holes in the methodology become apparent.


This can be dealt with to an extent by truncating the extreme ends. Even the middle quartiles in the graphs in the linked article show the same trends.


Not that simple. This article demonstrates why enforcing bounds results in the changes in slope that you see in the expected grades (figures 2 and 4): https://www.frontiersin.org/articles/10.3389/fpsyg.2022.8401...


Lognormality of data is a killer for the methods of social scientists. If I were to hypothesize the underlying mechanism, it would be that raw skill is lognormally distributed among those taking tests at all (participating in these tests usually entails an implicit lower bound on IQ, but the long tail of high performance in, say, sports points the same way); tests try to measure performance but with a reduction to normality (or to four categories); and then people estimate their own skills based on their test-taking and grading experiences, which are also reduced to a normal or constant distribution ("I was always a B- in math in high school, so I expect my ability to have distribution X and this test to follow it").

It's three places where reductions in dimensionality take place, both implicitly and explicitly. I don't envy researchers trying to unpeel this onion. I do like seeing all these problems unraveled in pretty accessible, well-designed experiments. It makes for better understanding.


The boundedness of the data is also the main argument here https://www.frontiersin.org/articles/10.3389/fpsyg.2022.8401...


Thanks for stating just how much of a statistical minefield this is. The reference does a great job showing just how wrong the DK studies are. Unfortunately, most people have already made up their minds and are happy to link conflicting blog posts as evidence.


> wrong the DK studies are

The DK studies are not wrong; they are misinterpreted by people who don't know what they're talking about (e.g. what the DK effect actually is), like this blogger.

"People have worse self assessment ability as their real ability declines" would be a valid interpretation of the DK data and notably would NOT be a valid conclusion from the random data in the blog post.


You should read the reference we are discussing which makes no such mistakes.


Probably in another year or two they'll find another statistic that will render the old one moot, again and again.


Dunning himself addressed this back in 2011:

> 4.1. Regression to the mean

> The most common critique of our metacognitive account of lack of self-insight into ignorance centers on the statistical notion of regression to the mean. Recall from elementary statistics classes that no two variables are ever perfectly correlated with one another. This means that if one selects the poorest performers along one variable, one will see that their scores on the second variable will not be so extreme. Similarly, if one selects the best performers along a variable, one is guaranteed to see that their scores on the second variable will be lower…

His full response is longer than is appropriate to quote here, but you can easily find the chapter online.

Dunning, David (1 January 2011). "Chapter Five – The Dunning–Kruger Effect: On Being Ignorant of One's Own Ignorance". Advances in Experimental Social Psychology. Vol. 44. Academic Press. pp. 247–296. doi:10.1016/B978-0-12-385522-0.00005-6. ISBN 9780123855220


The author continues,

> Some scholars observe that Fig. 5.2 looks like a regression effect, and then claim that this constitutes a complete explanation for the Dunning–Kruger phenomenon. What these critics miss, however, is that just dismissing the Dunning–Kruger effect as a regression effect is not so much explaining the phenomenon as it is merely relabeling it. What one has to do is to go further to elucidate why perception and reality of performance are associated so imperfectly. Why is the relation so regressive? What drives such a disconnect for top and bottom performers between what they think they have achieved and what they actually have? [...] As can be seen in the figure, correcting for measurement unreliability has only a negligible impact on the degree to which bottom performers overestimate their performance (see also Kruger & Dunning, 2002). The phenomenon remains largely intact.

The DK effect says roughly, "low performers tend to overestimate their abilities." Yet when researchers analyzed the data, they found that high and low performers overestimate and underestimate with the same frequency. [0] It's just that high performers are more accurate than low performers (note how this statement differs from the DK effect). Since random noise combined with the ceiling effect completely explains the "X graph," and since beginners' self-evaluations are noisier than experts', you don't even need regression to the mean to explain it.
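
Those two statements can coexist in a simulation, too (my own noise model, not Nuhfer et al.'s data): give everyone zero-mean noise, make low performers noisier, clip to the bounds, and every quartile over- and underestimates about half the time while the mean gap still traces the "X."

    import numpy as np

    rng = np.random.default_rng(3)
    n = 10_000

    actual = rng.uniform(0, 1, n)

    # Zero-mean noise whose spread shrinks with skill: novices are
    # noisier self-assessors, but nobody is systematically biased.
    sigma = 0.05 + 0.35 * (1 - actual)
    estimate = np.clip(actual + rng.normal(0, 1, n) * sigma, 0, 1)

    quartile = np.digitize(actual, [0.25, 0.5, 0.75])
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: overestimates {(estimate > actual)[m].mean():.0%} "
              f"of the time, mean gap={(estimate - actual)[m].mean():+.2f}")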

0. Nuhfer, Edward, Steven Fleisher, Christopher Cogan, Karl Wirth, and Eric Gaze. "How Random Noise and a Graphical Convention Subverted Behavioral Scientists' Explanations of Self-Assessment Data: Numeracy Underlies Better Alternatives." Numeracy 10, Iss. 1 (2017): Article 4. DOI: http://dx.doi.org/10.5038/1936-4660.10.1.4




