They made a movie to make money. I doubt anyone holding the purse strings cared one iota if that bit were corrected or not. It’s not really a retcon either because they didn’t change anything.
I disagree that evaluation is always a coding task. Evaluation is scrutiny by the person who wants the thing. It’s subjective. So, unless you’re evaluating something purely objective, such as an algorithm, I don’t see how a self-contained, self-“improving” agent satisfies the subjectivity constraint, since by design you are leaving out the subject.
Sure. There will always be subjective tasks where the person who asks for something needs to give feedback. But even there we could come up with ways to make it easier / faster / better UX. (One example I saw my frontend colleagues do is use a fast model to create 9 versions of a component, in a grid. They decide "at a glance" which one is "better", and use that going forward.)
OTOH, there's loads you can do for evaluation before a human even sees the artifact. Things like does the site load, does it behave the same, did anything major change on the happy path, etc etc. There's a recent-ish paper where instead of classic "LLM as a judge" they used LLMs to come up with rubrics, and other instances check original prompt + rubrics on a binary scale. Saw improvements in a lot of evaluations.
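To make the rubric idea concrete, here's a minimal sketch of binary rubric scoring. It is not the paper's method: the `judge` is a plain function standing in for an LLM call, and `keyword_judge`, the rubric items, and the sample artifact are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# A judge answers one yes/no rubric question about an artifact.
# In a real setup this would be an LLM call; here it is a plain function (assumption).
Judge = Callable[[str, str, str], bool]


@dataclass
class RubricResult:
    item: str
    passed: bool


def score_with_rubric(prompt: str, artifact: str, rubric: List[str], judge: Judge) -> List[RubricResult]:
    """Check the artifact against each rubric item on a binary (pass/fail) scale."""
    return [RubricResult(item, judge(prompt, artifact, item)) for item in rubric]


def pass_rate(results: List[RubricResult]) -> float:
    """Fraction of rubric items that passed (0.0 for an empty rubric)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def keyword_judge(prompt: str, artifact: str, item: str) -> bool:
    # Toy stand-in for an LLM judge: passes when the rubric keyword appears in the artifact.
    return item.lower() in artifact.lower()


results = score_with_rubric(
    "Build a login page",
    "<form><input type='password'><button>Sign in</button></form>",
    ["password", "button", "captcha"],  # hypothetical rubric items
    keyword_judge,
)
print([(r.item, r.passed) for r in results], pass_rate(results))
```

The point of the binary scale is that each check is cheap and easy to audit; the aggregate is just the fraction of checks that passed.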
Then there's "evaluate by having an agent do it" for any documentation tracking. Say you have a project, you implement a feature, and document the changes. Then you can have an agent take that documentation and "try it out". Should give you much faster feedback loops.
> Things like does the site load, does it behave the same, did anything major change on the happy path, etc etc.
I asked Claude to build a web app to run locally, polling data from the LAN. It fought me for four rounds of me telling it that the data from the API wasn’t rendered on the page. It created tests with mock data, it validated the API, it tested that the page loaded. It was gaslighting me, telling me that everything worked every time I told it that it didn’t work. I had to tell it to inspect the DOM and take screenshots with Playwright to make it stop effing around. I don’t think it ever would have found the right response on its own.
Even after deliberate intervention, it regressed a few rounds later and stopped caring that tests failed. Whatever, I don’t treat it as anything more than a sometimes-correct random output machine.
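For reference, the kind of check that finally helped looks roughly like this. It's a sketch, not what the agent actually wrote: it loads the page in a real browser via Playwright's sync API, saves a screenshot, and compares the rendered text against values you expect from the API. The URL and the expected values in the usage note are placeholders.

```python
from typing import Iterable, List


def missing_values(page_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected values that do NOT appear in the rendered page text."""
    return [value for value in expected if value not in page_text]


def rendered_text(url: str, screenshot_path: str = "page.png") -> str:
    """Load the page in a real browser and return the text of the rendered DOM."""
    # Imported lazily so missing_values() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Give the page's own API calls a chance to finish before reading the DOM.
        page.wait_for_load_state("networkidle")
        page.screenshot(path=screenshot_path)
        text = page.inner_text("body")
        browser.close()
    return text
```

Usage would be something like `missing_values(rendered_text("http://localhost:8000"), ["sensor-1", "42.5"])`, and you fail the run if the list is non-empty. Unlike tests against mock data, this checks what actually ends up on the page.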
In science there are ways to surface subjectivity (cannot be counted) into observable quantized phenomena. Take opinion polls for instance: "approval" of a political figure can mean many things and is subjective, but experts in the field make "approval" into a number through scientific methods. These methods are just an approximation and have many IFs, they're not perfect (and for presidential campaign analysis in particular they've been failing for reasons I won't clarify here), but they're useful nonetheless.
Another thing that gets quantized is video preferences, to maximize engagement.
Why not use skills? They follow a three-tier loading approach, and you can stick an MCP as part of the toolset for the skills, so it will only load it when the skill is selected.
An odd take on a regime that has known and significant human rights violations. I’m not saying the US is doing great right now, but China is not something to look up to.
The US is doing much, much worse. This is no compliment to China. The US
* murdered a political leader they were negotiating with in Iran after using the military to kidnap the leader of Venezuela.
* is credibly threatening military allies in Greenland, countries with which it has mutual protection treaties and is credibly threatening, without casus belli, to militarily invade another weaker neighbor, Cuba.
* spent months threatening to invade Canada. This wasn't trolling; it has been forgotten amid their strategy of constant chaos, but they really tried, and Canada has made alliances with other countries as a result
* is actively murdering thousands of people in Haiti via a Republican allied private military contractor
* is actively subverting domestic elections
* is building and filling concentration camps with people who have committed no other crime than illegal residency, without due process, and giving them substandard care, leading to many deaths in custody.
* has masked secret police detaining people without due process and deporting them to foreign prison camps, frequently in violation of judicial orders
* has masked secret police arresting citizens because of their nationality and because they are not carrying "their papers"
* is using the power of government to force mergers and ownership changes of corporations to political allies
* is using the power of the government to hide an embarrassing criminal conspiracy involving leadership in the country, in violation of the US Constitution, since its release was ordered by Congress.
* has completely disregarded conflict of interest laws which the leader of the country is using to enrich himself and his family at completely unprecedented levels in US history.
I could go on, but China is a more ethical superpower by a lot of measures and that is a very painful conclusion to state.
This is not even touching on the subject of competency.
The internationally accepted US hegemony and the privileged role of the US dollar was the result of almost a century of goodwill. It is now gone, and then some. The next two decades will not be pleasant for regular Americans who have grown accustomed to, and frankly taken for granted, the level of privilege they had.
edit: and speak of the devil, and competency: https://mitsloan.mit.edu/ideas-made-to-matter/what-happens-w... just made it to the top of the front page. Even with everything I wrote above, I had neglected to include that the US lies about embarrassing economic data now due to political interference.
edit 2: this didn't even make the front page and would have been the biggest scandal in modern American history: https://www.wsj.com/tech/tiktok-deal-fee-trump-administratio... "Trump Administration Set to Receive $10 Billion Fee for Brokering TikTok Deal"
Wow, nativism from the left is wild to see. Obama was the son of an immigrant vs the grandchild of one for Trump. There’s a lot of valid criticism of Melania but claiming it’s because she “can’t” speak English is wild (she speaks with an accent but so what).
I’m tired of attacks on personal characteristics that have no bearing (or are even outside their control) rather than on legitimate things like ideas, temperament, decision making, track record. Do better.
Sweep the dirt under the carpet. For all we know there’s no difference between “throwaway” and “jumping criss cross”. You’re hiding behind a nickname, too!
The US has due process, judicial transparency, and free speech. There are still rich people that operate above the system, but they're largely still accountable and the free press can crucify them.
Authoritarian regimes have execution vans, no freedom of the press, no free speech, and a paranoid leadership that will jail or kill anyone who threatens their power. They lock citizens inside and prohibit capital flight.
No system is perfect, but democracy is strictly better.
I love China and the Chinese people, but the CCP is a drag on both.
I'm no fan of the party in power in the US, but I can campaign and speak out against them. I can raise money to oppose them. I can band together with like minded individuals to protest. That's superior to unilateral oppression.
> I'm no fan of the party in power in the US, but I can campaign and speak out against them. I can raise money to oppose them. I can band together with like minded individuals to protest.
You can. Just not in any way that matters. And you won’t. Because that takes organization and all existing organizations that matter are captured by the system and novel ones would quickly be.
Perfect example: The US just launched a disastrous and illegal (both in their own and the UN system) war at the behest of a foreign power/influential minority against the will of its people and against its geopolitical interest. And the “opposition” does less than nothing. There is little anti-war protest and none of consequence.
Compare it with 2003 and earlier wars: The American public has been all but neutralized as a political force. Not that it could do much even then.
> You can. Just not in any way that matters. And you won’t.
I’ve gotten language I wrote passed into state and federal law. The bottom line is a lot of people are too busy, lazy or nihilistic to call their electeds and show up to create political pressure. That’s unfortunate. But it also means that the payoff for relatively small amounts of effort are huge.
Did that language impinge on the interests of America’s ruling elites, its security apparatus, or the interests of a certain entity in the eastern Mediterranean? No? Then we’re talking about entirely different things.
The “opposition” attempted to assert the authority Congress is supposed to have over war powers, and was defeated because the American public gave majority power to the current majority, who rejected that authority and left the executive branch to do whatever.
Dude there’s only three graphs in there. Do they really bother you that much? The third may be a bit unnecessary but I think the visuals add to the post.
If you’re “helping a kid” then I guess I can help you. Help is criticism delivered with a constructive tone. Criticism can be helpful if you look past the tone.
Fully agreed; it always baffles me how often this is misunderstood. Regardless of whether it's logical or not, tone and attitude in practice do influence whether people are convinced by something, so if your goal is to actually change how someone else acts, you will not be as effective if you don't care about how you come across. Being right is not always enough, so even if the style of communicating doesn't seem like it "should" matter, in practice it genuinely does if success is measured by whether the change happens or not.
Of course, if the goal is just to be right rather than to convince someone else about what's right, how you're saying something doesn't matter, but at that point you've already reached the goal before you started talking to them, so it's worth reexamining what you're actually looking to get out of a conversation at that point.
I liked the graphs. When skimming posts I often stop on graphical elements and decide if I want to understand the context or continue skimming. In this context, all three graphs were useful for me.
Posts with just text are dense and just not nice to read. That's why even text-only blog posts have a tendency to include a loosely related image at the top, to catch the reader's eye.
Based on the post, it seems likely that they'd just delay per the robots.txt policy no matter what, and do a full browser render of the cached page to get the content. Probably overkill for lots and lots of sites. An HTML fetch + readability is really cheap.
Really hard to understand costs here. What is a reasonable pages per second? Should I assume with politeness that I'm basically at 1 page per second == 3600 pages/hour? Seems painfully slow.
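A back-of-the-envelope sketch for the throughput question, assuming the delay is applied per host (so crawling many hosts in parallel multiplies the rate) and that the crawler honors `Crawl-delay` from robots.txt. Nothing here comes from the post itself; the 1-second default, the user agent name, and the sample robots.txt are assumptions.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_seconds(robots_txt: str, user_agent: str, default: float = 1.0) -> float:
    """Read Crawl-delay for this user agent, falling back to an assumed default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


def pages_per_hour(delay_seconds: float, hosts_in_parallel: int = 1) -> float:
    """One request per delay interval per host, summed over hosts crawled in parallel."""
    return 3600.0 / delay_seconds * hosts_in_parallel


robots = "User-agent: *\nCrawl-delay: 2\n"  # made-up robots.txt
delay = crawl_delay_seconds(robots, "example-bot")
print(pages_per_hour(1.0))        # single host, 1 s delay
print(pages_per_hour(delay, 50))  # 50 hosts in parallel, 2 s delay each
```

So 1 page per second is 3600 pages/hour against a single site, but under these assumptions the limit is per host, not global: total throughput depends on how many distinct hosts are in the queue.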
The reasoning for some of these questions is that if you are caught, it’s sometimes easier to charge you with fraud (lying on the form) than the actual thing (such as espionage).
That's why I presume it's asking about previous engagements: if they catch someone they suspect of espionage, dig into their background, and find proof of previous activity, they have a clear fraud charge without having to prove their suspicions about current activities.
There's often also some arbitrage on standard of proof or statutes of limitation or jurisdiction.
Maybe to deport you for espionage requires a jury trial, but to revoke status for misleading answers on an immigration form is administrative and so is deportation for lack of status.
I seem to recall some extraordinary cases where untruthful answers on immigration forms were used to justify denaturalization.
The fact you worked for an intelligence agency doesn’t mean you were an intelligence officer. You could’ve been a cleaner, or an executive assistant, or maybe you were working as a software developer on the payroll system.