Here's a game from a month ago where Stockfish loses to Lc0, played during the TCEC Cup. https://lichess.org/S9AwOvWn
Chess is a finite two-player game of perfect information, so by Zermelo's theorem, with optimal play either one side always wins or the game is a draw. The argument from the Discord person simply says that Stockfish computationally can't come up with a way to beat itself. Whether that is true (and it really sounds like a question about search depth) is separate from whether the game itself is solved, and it very much is not.
Solving chess would mean producing a table that lists the optimal strategy at every node in the game tree. Since this is computationally infeasible, we will certainly never solve chess absent some as-yet-unknown advance in computation.
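To make concrete what "solving" means here, below is a toy sketch using Nim (take 1-3 stones, whoever takes the last stone wins) instead of chess. The principle is identical, only the table size differs; this is purely illustrative and not anything an engine actually does:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def solve(stones):
    """Return True if the player to move wins with optimal play."""
    if stones == 0:
        return False  # the previous player took the last stone and won
    # A position is winning if some legal move leads to a position
    # that is losing for the opponent (backward induction).
    return any(not solve(stones - take)
               for take in (1, 2, 3) if take <= stones)

# The "solution table": the optimal result at every node up to 10 stones.
table = {n: solve(n) for n in range(11)}
print(table)  # positions divisible by 4 are losses for the player to move
```

For chess the same table would need on the order of 10^40+ legal positions, which is the computational infeasibility being referred to.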
That means that Stockfish's parameters are already optimized as far as practically possible for Rapid chess and Slow chess, not that chess itself is solved, or even that Stockfish is fully optimized for Blitz and Bullet.
Surely it is apparent to you that the first few moves are not independently chosen by the engine, but rather intentionally chosen by the TCEC bookmakers to create a position on the edge between a draw and a decisive result.
Yes, engines would almost certainly never play 2. f4. That's a different question than whether chess is solved, for which the question of interest would be "given optimal play after 1. e4 e5 2. f4 is the result a win for one side or a draw?"
It's also almost certainly the case (though I don't know why you would bother testing it) that Stockfish given the black pieces and extensive pondering time would do meaningfully better than Stockfish playing under a per-move time cap. Since most games are going to be draws, it would take a while to establish this in practice.
I'm of the view that the actual answer for chess is "It's a draw with optimal play."
I achieved these results around 2015, sitting at home, relaxed. I was not in a match situation watched by millions; that kind of pressure can easily lead to blunders like Kramnik's famously overlooked mate in one.
I also sometimes "cheated" by aborting the game when I was tired and continuing it the next day (if at all). That's what the player in a match can not do.
I also sometimes restarted a game from a specific position, which also cannot be done in a match. Finally, they used better hardware in those matches: my old laptop, bought around 2005, had eight threads, and I used four of them. Between 2000 and roughly 2020 I trained every day and was at my peak. I am still around 2400 on Lichess today, without training.
So I hope it no longer sounds that extraordinary. It isn't. Or at least it wasn't then; maybe it would be now.
2015 Stockfish is quite a different beast from 2026 Stockfish. Stockfish didn't even add NNUE until 2020.
Based on what data I can find, the gap between the 2015 Stockfish (Stockfish 6) and today's Stockfish (Stockfish 18) is estimated at nearly 400 Elo points.
That's the difference between Magnus Carlsen at his peak and someone who doesn't even have a high enough rating to qualify for the grandmaster title.
So yes, the fact that you beat Stockfish in 2015 doesn't sound extraordinary, because AI today is vastly stronger than it was when you achieved those results. What sounds extraordinary to people is your belief that you could repeat those results against today's top chess engines.
Out of sheer curiosity, I did a bunch of research to understand just how dramatic a 350-point rating gap is in real-world chess. Magnus Carlsen, for example, has a 98% win rate against players rated more than 350 points below him, with zero recorded losses.
In fact, I could find only one game in all of chess history (Anand vs. Touzane, 2001) where a super GM (rating >2700) dropped a classical game to someone rated more than 350 points below him; the gap there was 402 points. (It's estimated that between 2,000 and 3,000 classical games have been played between super GMs and players more than 350 points below them.) And it could easily be that Anand was ill, or suffering from some other human condition that made his play significantly worse than usual that day, which is something you would never see from a computer engine.
In other words, the Stockfish that you beat in 2015 would itself be expected to score 3-5 points (that is, 6-10 draws and zero wins) over 500 games against the best chess engine of today. The delta in strength is immense, and it is reasonable for everyone else in this comment thread to assert that you would have essentially zero chance of picking up even a draw against Stockfish 18 in a fair game at any time control, regardless of how many games you played.
I do not know the time controls anymore, but I always use the latest Stockfish with all available threads. No opening book, but I do not repeat lines to take advantage of that, because I play to train calculation. I guess hash was the (for my setup) normal 4096 MB.
Latest Stockfish with all available threads and no opening book is still well beyond any human. Elo ratings get a bit silly with computers, but we're talking an Elo of well north of 3000.
For reference: The last serious match between the top human player and an engine was Brains in Bahrain, Kramnik–Fritz 7, in 2002 (already that should tell you something). Well, actually a broken and buggy version of Fritz 7, but that's another story. It was a 4–4 tie. On the latest CCRL list, Stockfish 18 outranks Fritz 8 (the oldest Fritz version on the list) by 947 Elo points _on the same hardware_. (For comparison, Magnus Carlsen's peak rating is 65 points higher than Kramnik's peak rating.)
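For what it's worth, a rating gap like that 947-point figure can be translated into an expected per-game score using the standard Elo logistic formula. A minimal sketch (with the caveat from above that ratings from different pools, such as FIDE vs. CCRL, aren't strictly comparable, so cross-pool numbers are only illustrative):

```python
# Standard Elo model: expected score (win = 1, draw = 0.5) for the
# player rated `gap` points below the opponent.

def expected_score(gap):
    """Expected per-game score for the lower-rated side."""
    return 1.0 / (1.0 + 10.0 ** (gap / 400.0))

for gap in (350, 400, 947):
    print(f"gap {gap:4d}: expected score per game = {expected_score(gap):.4f}")
# At a 947-point gap, the weaker side expects well under 1% per game.
```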
Add to that 24 years of hardware development, and you can imagine why no human player is particularly interested in playing full-strength engines in a non-odds match anymore. Even more so in FRC/Chess960 where you have absolutely zero chance of leading the game into some sort of super-drawish opening to try to hold on to half a point now and then.
Fair use is a justification for why copyright restrictions may not apply in a given scenario, not a license to apply new legal restrictions to work you do not own.
"In the US, XXX are much more likely to be unemployed than are YYY. The unemployment rate is defined as the percentage of jobless people who have actively sought work in the previous four weeks. According to the U.S. Bureau of Labor Statistics, the average unemployment rate for XXX in 2016 was five times higher than the unemployment rate for YYY"
"How much of this difference do you think is due to discrimination?"
In this case you'd fill in XXX and YYY with different values and show those treatments to your participants based on your treatment assignment scheme.
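A minimal sketch of that assignment step, with placeholder group labels standing in for XXX/YYY (the labels, template, and seed below are all hypothetical, just to show the mechanics of randomized treatment assignment):

```python
import random

# Vignette template with placeholders, matching the XXX/YYY
# placeholders in the quoted survey text.
TEMPLATE = ("In the US, {xxx} are much more likely to be unemployed "
            "than are {yyy}. ...")

# Placeholder treatment arms: which group appears in each slot.
ARMS = [("group A", "group B"), ("group B", "group A")]

def assign_vignettes(participant_ids, seed=0):
    rng = random.Random(seed)  # fixed seed so assignments are reproducible
    out = []
    for pid in participant_ids:
        xxx, yyy = rng.choice(ARMS)  # random treatment assignment
        out.append({"id": pid, "xxx": xxx, "yyy": yyy,
                    "text": TEMPLATE.format(xxx=xxx, yyy=yyy)})
    return out

vignettes = assign_vignettes(range(6))
```

Real designs usually block or balance the randomization rather than drawing independently per participant, but the filled-template idea is the same.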
Without delving too much into the utility of the practice, skateboarding as a past time is a rather commonly banned activity in common spaces.
The issue here is not that someone is doing something voluntarily and devoting resources to it, but rather that someone is taking an action that involves the consumption of a rivalrous good. The court's ruling notes this explicitly (from the article) "the very real prospect that devoting such a large proportion of the available electrical power supply to one industry would leave less energy for other uses which might result in increased costs to all other residential and industry customers in B.C.”
I *think* this might help answer your question of where n comes from. At least it helps me think about it.
The variance of the sample mean is V[\bar{X}] = V[X]/n. You can back this out from the definition of the sample mean, the property that V[aX] = a^2 V[X], and the fact that variances of independent draws add: V[\bar{X}] = V[(1/n)\sum_i X_i] = (1/n^2)\sum_i V[X_i] = (1/n^2) \cdot n V[X] = V[X]/n. Take the square root of that and you have the standard error.
Why this "feels" right to me via an example.
Suppose we want to know the average height of the US population. Intuitively, we think that (assuming a representative sample) we'll do "better" in the sense of a tighter distribution around our best guess (mean of sample) of the population value if we sample 1000 people as opposed to sampling 10.
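That intuition is easy to check by simulation. The sketch below uses hypothetical Normal(170, 10) "heights" (made-up parameters, not real data) and compares the spread of the sample mean at n = 10 versus n = 1000:

```python
import random
import statistics

random.seed(42)

def spread_of_means(n, reps=2_000):
    """Empirical sd of the sample mean over many samples of size n."""
    means = [statistics.fmean(random.gauss(170, 10) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

se_10, se_1000 = spread_of_means(10), spread_of_means(1000)
# Theory says sd/sqrt(n): 10/sqrt(10) ~ 3.16 vs 10/sqrt(1000) ~ 0.316,
# so the n = 1000 sample mean is about 10x more tightly distributed.
print(se_10, se_1000)
```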
The standard deviation, by contrast, relates to the distribution of the variable itself: it is our "best guess" about the dispersion of the variable, in the same units as the original. Either way, both samples are trying to estimate the same average. Since \bar{X} is itself a random variable, it has a distribution, and characterizing that distribution should involve something about the sampling process we used, which is where n enters.
Mean absolute error would be E[|X - \mu|], since the true mean of the distribution is a constant.
He doesn't actually make very heavy use of the parody plank of fair use. He credits the original artists. From his own website:
"Does Al get permission to do his parodies?
Al does get permission from the original writers of the songs that he parodies. While the law supports his ability to parody without permission, he feels it’s important to maintain the relationships that he’s built with artists and writers over the years. Plus, Al wants to make sure that he gets his songwriter credit (as writer of new lyrics) as well as his rightful share of the royalties."
The fact that he could rely on fair use is separate from whether he as an artist does rely on fair use.
From reading the paper, and the original paper that the MTurk/Prolific samples are drawn from, this is a convenience sample of 415 humans across two platforms. Each worker received a random sample of the ConceptARC problems, and the average proportion correct is taken as the "Human" benchmark.
Perhaps by "random sample problems" you mean that the study is not representative of all of humanity? If so we can still take the paper as evaluating these 415 humans who speak English against the two models. If as you say, the workers are actually just using LLMs then this implies there is some LLM that your average MTurk worker has access to that out-performs GPT 4 and GPT 4V. That seems *extremely* unlikely to say the least.
There is no need for any complex statistical analysis here since the question is simply comparing the scores on a test. It's a simple difference in means. Arguably, the main place that could benefit from additional statistical procedures would be weighting the sample to be representative of a target population, but that in no way affects the results of the study at hand.
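For concreteness, the comparison amounts to something like the sketch below, using made-up numbers (not the paper's data); the human accuracies and `model_score` are purely hypothetical:

```python
import statistics

# Hypothetical per-worker accuracies on their assigned ConceptARC
# problems, and a hypothetical model accuracy on the same benchmark.
human_scores = [0.9, 0.8, 1.0, 0.7, 0.85, 0.95, 0.75, 0.8]
model_score = 0.65

human_mean = statistics.fmean(human_scores)
diff = human_mean - model_score  # the simple difference in means

# Standard error of the human mean (the model score is treated as fixed).
se = statistics.stdev(human_scores) / len(human_scores) ** 0.5
print(f"difference in means: {diff:.3f} (SE of human mean: {se:.3f})")
```

With real data you would use all 415 workers rather than eight, but the estimand is exactly this simple gap.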