We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they care about them far less than about making a model that works well day-to-day.
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?
Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”
It's 2000s PC gaming all over again ("gotta game the benchmark!").
I can confidently say that, anecdotally, you're completely wrong, but I'll also allow that a very different definition of 'simple' and/or attempting to use an unpopular environment could count as a valid anecdotal counterpoint.
The problem with these arguments is that there are data points to support both sides, because both outcomes are possible.
The real question is whether you (or we) are getting an ROI, and the answer is increasingly more yeses on more problems. This trend doesn't look like it will plateau as we step up the complexity ladder to agentic systems.
I don't reach for AI until I'm solidly stuck, and then I use it mostly for inspiration; it has yet to directly point me at a solution, and I'm pretty good at writing prompts. When I throw a bunch of elementary stuff at it, it is super good at identifying issues and solving them (but I could have done that myself). It is just nice to try to delineate where you can and can't trust the thing, though even that fluctuates, sometimes within a single session. Here is a nice example from a slightly more complex test:
Prompt:
"I have a green LED, a 12V power supply, a single NPN transistor, a 100 ohm resistor, a 2.7K resistor and an electrolytic capacitor of 220 microfarads. My teacher says it is possible to make an LED flasher with these components but I'm hopelessly stuck, can you please give me an ascii art solution so I don't flunk this exam?"
"The 2.7 kΩ resistor charges the 220 µF capacitor from the 12 V supply.
The capacitor voltage rises slowly.
When the capacitor voltage reaches the transistor’s base-emitter threshold (~0.6–0.7 V), the transistor suddenly switches ON.
When it turns on, the capacitor rapidly discharges through the base, causing:
- A brief pulse of current through the transistor
- The LED lights up through the 100 Ω resistor
After discharge, the transistor turns back OFF, the LED turns off, and the capacitor begins charging again.
This repeats automatically → LED flasher."
The number of errors in the circuit, the utterly bogus explanation, and the overconfident claim that this is 'working' are so bizarre that I wonder how many slightly more complicated questions are going to yield results comparable to this one.
I have this mental model of LLMs and their capabilities, formed after months of way too much coding with CC and Codex, with 4 recursive problem categories:
1. Problems that have been solved before have their solution easily repeated (some will say, parroted/stolen), even with naming differences.
2. Problems that need only mild amalgamation of previous work are also solved by drawing on training data alone, but hallucinations are frequent (they show up as low-probability tokens, but as consumers we don't see the p values).
3. Problems that need little simulation can be simulated with the text as scratchpad. If evaluation criteria are not in training data -> hallucination.
4. Problems that need more than a little simulation have to either be solved by ad-hoc written code, or will result in hallucination. The code written to simulate is again a fractal of problems 1-4.
Phrased differently: subproblem solutions must be in the training data or it won't work; and combining subproblem solutions must either again be in the training data, or brute forcing plus a success condition is needed, with code being the tool to brute-force.
I _think_ that the SOTA models are trained to categorize the problem at hand, because sometimes they answer immediately (1&2), enable thinking mode (3), or write Python code (4).
My experience with CC and Codex has been that I must steer them away from categories 2 & 3 all the time: either solving those problems myself, asking them to use web research, or splitting them up until they are category-1 problems.
Of course, for many problems you’ll only know the category once you’ve seen the output, and you need to be able to verify the output.
I suspect that if you gave Claude/Codex access to a circuit simulator, it would successfully brute-force the solution. And future models might be capable enough to write their own simulator ad hoc (of course, the simulator code might recursively fall into category 2 or 3 somewhere and fail miserably). But without strong verification I wouldn't put any trust in the outcome.
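The "brute forcing plus a success condition" loop described here is generic generate-and-test. A minimal sketch, where `propose()` and `verify()` are hypothetical stand-ins for the model and the external checker (compiler, tests, circuit simulator), replaced with a toy enumeration and a toy success condition so the loop runs end to end:

```python
def propose(attempt):
    # Stand-in for a model proposing candidate number `attempt`.
    # Here it just enumerates integers so the sketch is runnable.
    return attempt

def verify(candidate):
    # Stand-in for the strong external check (compiler, test suite,
    # circuit simulator). Toy success condition for illustration.
    return candidate * candidate >= 1800

def brute_force(max_tries=10_000):
    for attempt in range(max_tries):
        candidate = propose(attempt)
        if verify(candidate):
            return candidate, attempt + 1  # solution and tries used
    return None, max_tries

solution, tries = brute_force()
```

The point is that the loop is only as trustworthy as `verify()`; with a weak or missing check, you are back to trusting the hallucination.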
With code, we do have the compiler, tests, observed behavior, and a strong training data set with many correct implementations of small atomic problems. That's a lot of out-of-the-box verification to correct hallucinations. I view these models as messy code generators I have to clean up after. They do save a ton of coding work after or while I'm doing the other parts of programming.
I have used Gemini for reading and solving electronics schematics exercises, and its results were good enough for me: it managed to solve roughly 50% of the exercises correctly and got the other 50% wrong. Simple resistor circuits.
One time it messed up the opposite polarity of two voltage sources in series: instead of subtracting their voltages, it added them together. I pointed out the mistake, and Gemini insisted that the voltage sources were not in opposite polarity.
Schematics in general are not AI's strongest point. But when you explain what math you want to calculate from, say, an LRC circuit (no schematic, just describing that part of the circuit in words), GPT will often calculate it correctly. It still makes mistakes here and there, so always verify the calculation.
There is also Mercury LLM, which computes the answer directly as a 2D text representation. I don't know if you're familiar with it, but you read that correctly: 2D text output.
Mercury might work better with an ASCII diagram as input, or generating an ASCII diagram as output; I'm not sure whether both input and output work in 2D.
Plumbing/electrical/electronic schematics are pretty important for AIs to understand if they're going to assist us, but for the moment the success rate is pretty low. A 50% success rate on simple problems is very low; 80-90% on medium-difficulty problems is where they start being really useful.
It's not really the quality of the diagramming that I am concerned with; it is the complete lack of understanding of electronic parts and their usual function. The diagramming is atrocious, but I could live with it if the circuit were at least borderline correct. Extrapolating from this: if we use the electronics schematic as a proxy for the kind of world model these systems have, then that world model has upside-down lanterns and anti-gravity as commonplace elements. Three-legged dogs mate with zebras and produce viable offspring, and short-circuiting transistors brings about entirely new physics.
It's hard for me to tell whether the solution is correct or wrong, because I've got next to no formal theoretical education in electronics and only the most basic 'pay attention to the polarity of electrolytic capacitors' practical knowledge. But given how these things work, you might get much better results by asking it to generate a SPICE netlist first (or instead).
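For what it's worth, a netlist for the classic single-transistor relaxation flasher (the usual answer to that parts list) might look roughly like the sketch below. The topology, node names, and the 2N3904 model are all assumptions, and one big caveat applies: the real circuit relies on reverse emitter-base avalanche breakdown, which stock SPICE BJT models do not reproduce, so a naive simulation may simply never oscillate.

```spice
* One-transistor LED flasher, sketch only -- not verified
V1  vcc 0   DC 12
R1  vcc t   2.7k        ; charges the cap toward 12 V
C1  t   0   220u        ; timing capacitor
* Transistor wired "backwards": emitter on the cap node. The real
* circuit leaves the base open and relies on reverse avalanche.
Q1  a   b   t   2N3904  ; collector base emitter
RB  b   0   10MEG       ; DC path so SPICE doesn't reject the open base
D1  a   k   DLED        ; the green LED
R2  k   0   100         ; current-limiting resistor
.tran 10m 5s uic
.end
```

Even if the simulation fails, asking the model for a netlist at least forces an unambiguous, machine-checkable description of the circuit, which is the point of the suggestion above.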
I wouldn't trust it with 2D ASCII-art diagrams; my guess is there isn't enough focus on these in the training data. A typical jagged-frontier experience.
Sometimes you do need to (as a human) break down a complex thing into smaller simple things, and then ask the LLM to do those simple things. I find it still saves some time.
Or, what will often work is having the LLM break the problem down into simpler steps and then running them one by one. They know how to break down problems fairly well; they just sometimes don't do it unless you explicitly prompt them to.
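That decompose-first pattern is a small loop. A sketch with a stubbed, deterministic model in place of a real API call (`plan()`, `run_step()`, and the prompt wording are all made up for illustration):

```python
def plan(task, model):
    # Ask the model to split the task into small, independent steps.
    return model(f"Break into numbered steps: {task}")

def run_step(step, model):
    # Each step should be small enough to be a "solved before" problem.
    return model(f"Do exactly this one step: {step}")

def decompose_and_run(task, model):
    return [run_step(step, model) for step in plan(task, model)]

def stub_model(prompt):
    # Deterministic stand-in for a real LLM call, so the sketch runs.
    if prompt.startswith("Break into"):
        return ["write the parser", "add the tests"]
    return f"done: {prompt}"

results = decompose_and_run("build the feature", stub_model)
```

The explicit `plan()` call is the "you have to prompt them to break it down" part; left to a single prompt, the model often skips straight to a monolithic answer.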
I'm not sure; here's my anecdotal counterexample: I was able to get gemini-2.5-flash, in two turns, to understand and implement something I had done separately first, and it found another bug (one I had also fixed, but forgot was in this path).
That I was able to have a flash model replicate the same solutions I had, to two problems, in two turns, is just the opposite of your consistency argument. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs). The models are able to do more of them today than they were even 6-12 months ago (pre-thinking models).
And therein lies the rub for why I still approach this technology with caution, rather than charge in full steam ahead: variable outputs based on immensely variable inputs.
I read stories like yours all the time, and it encourages me to keep trying LLMs from almost all the major vendors (Google being a noteworthy exception while I try and get off their platform). I want to see the magic others see, but when my IT-brain starts digging in the guts of these things, I’m always disappointed at how unstructured and random they ultimately are.
Getting back to the benchmark angle though, we're firmly in the era of benchmark gaming, hence my quip about these things failing "the only benchmark that matters." I meant for that to be interpreted along the lines of "trust your own results rather than a spreadsheet matrix of other published benchmarks," but I clearly missed the mark in making that clear. That's on me.
I mean more the guts of the agentic systems. Prompts, tool design, state and session management, agent transfer and escalation. I come from devops and backend dev, so getting in at this level, where LLMs are tasked and composed, is more interesting.
If you are only using the provider LLM experiences, and not something specific to coding like Copilot or Claude Code, that would be the first step to getting the magic, as you say. It is also not instant: it takes time to learn any new tech, and this one has an above-average learning curve, despite the facade and hype that it should just be magic.
Once you find the stupid shit in the vendor coding agents, like all of us IT/devops folks do eventually, you can go a level down and build on something like the ADK to bring your expertise and experience to the building blocks.
For example, I am now implementing environments for agents based on container layers and Dagger, which unlocks the ability to cheaply and reproducibly clone what one agent was doing and have a dozen variations iterate on the next turn. Real useful for long-term training data and eval synthesis, but also for my own experimentation as I learn how to get better at using these things.

Another thing I did was change how filesystem operations look to the agent, in particular file reads. I did this to save context and money (finops), after burning $5 in 60 seconds because of an error in my tool implementation. Instead of having them as message contents, file reads are now injected into the system prompt. Doing so made it trivial to add a key/val "cache" for the fun of it, since I could now inject things into the system prompt and let the agent have some control over that process through tools. Boy, has that been interesting, and it has opened up some research questions in my mind.
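A minimal sketch of that system-prompt injection idea (all names hypothetical; the real setup sits behind agent tools and is more involved): file reads and a small agent-controlled key/value cache are rendered into the system prompt each turn, instead of accumulating as message contents.

```python
class PromptState:
    """Render file reads and a key/value cache into the system prompt
    each turn, instead of piling them up in the message history."""

    def __init__(self, base_prompt):
        self.base_prompt = base_prompt
        self.files = {}   # path -> contents; a re-read replaces the old copy
        self.cache = {}   # scratch space the agent controls via a tool

    def read_file(self, path, contents):
        # Tool-side hook: later reads of the same path overwrite earlier
        # ones, so stale copies stop costing context tokens.
        self.files[path] = contents

    def cache_set(self, key, value):
        # Exposed to the agent as a tool, giving it some control over
        # what persists in its own system prompt.
        self.cache[key] = value

    def render(self):
        parts = [self.base_prompt]
        for path, contents in self.files.items():
            parts.append(f"<file path={path!r}>\n{contents}\n</file>")
        for key, value in self.cache.items():
            parts.append(f"<note key={key!r}>{value}</note>")
        return "\n\n".join(parts)

state = PromptState("You are a coding agent.")
state.read_file("app.py", "print('v1')")
state.read_file("app.py", "print('v2')")   # replaces the first read
state.cache_set("todo", "fix the auth bug")
prompt = state.render()
```

The overwrite-on-re-read behavior is where the token savings come from: only the latest copy of each file ever reaches the model.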
Any particular papers or articles you've been reading that helped you devise this? Your experiments sound interesting and possibly relevant to what I'm doing.
Building a good model generally means it will do well on benchmarks too. The point of the speculation is that Anthropic is not focused on benchmaxxing which is why they have models people like to use for their day-to-day.
I use Gemini. Anthropic stole $50 from me (expired and kept my prepaid credits) and I have not forgiven them yet for it, but people rave about Claude for coding, so I may try the model again through Vertex AI...
The person who made the speculation, I believe, was talking more about blog posts and media statements than model cards. Most AI announcements come with benchmark touting; Anthropic supposedly does less / little of this in their announcements. I haven't seen or gathered the data to know what is true.
If you think about GANs, it's all the same concept:
1. train model (agent)
2. train another model (agent) to do something interesting with/to the main model
3. gain new capabilities
4. iterate
You can use a mix of both real and synthetic chat sessions, or whatever you want your model to be good at. Mid/late training seems to be where you start crafting personality and expertise.
Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single-model LLM training. I still need to get into what's du jour in RL / late training; that's where a lot of the opportunity lies, from my understanding so far.
How would published numbers be useful without knowing what the underlying data used to test and evaluate them is? It's proprietary for a reason.
To think that Anthropic is not being intentional and quantitative in their model building, because they care less for the saturated benchmaxxing, is to miss the forest for the trees
I'd recommend watching Nathan Lambert's video he dropped yesterday on Olmo 3 Thinking. You'll learn there are a lot of places where even descriptions of proprietary testing regimes would give away some secret sauce.
Nathan is at Ai2 which is all about open sourcing the process, experience, and learnings along the way
Thanks for the reference, I'll check it out. But it doesn't really take away from the point I am making: if a level of description would give away proprietary information, then go one level up to a vaguer description. Describing things at the proper level is more of a social problem than a technical one.
Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.
It is very similar to an IQ test, with all the attendant problems that entails. Looking at the Arc-AGI problems, it seems like visual/spatial reasoning is just about the only thing they are testing.
Completely false. This is like saying being good at chess is equivalent to being smart.
Look no farther than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.
The benchmark isn’t particularly strong against gaming, especially with private data.
ARC-AGI was designed specifically for evaluating deeper reasoning in LLMs, including being resistant to LLMs 'training to the test'. If you read Francois' papers, he's well aware of the challenge and has done valuable work toward this goal.
I agree with you. I agree it's valuable work. I totally disagree with their claim.
A better analogy is: someone who's never taken the AIME might think "there are an infinite number of math problems", but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems. That's not to take away from the AIME, which is quite difficult -- but not infinite.
Similarly, ARC-AGI is much more bounded than they seem to think. It correlates with intelligence, but doesn't imply it.
Maybe I'm misinterpreting your point, but this makes it seem that your standard for "intelligence" is "inventing entirely new techniques"? If so, it's a bit extreme, because to a first approximation, all problem solving is combining and applying existing techniques in novel ways to new situations.
At the point that you are inventing entirely new techniques, you are usually doing groundbreaking work. Even groundbreaking work in one field is often inspired by techniques from other fields. In the limit, discovering truly new techniques often requires discovering new principles of reality to exploit, i.e. research.
As you can imagine, this is very difficult and hence rather uncommon, typically only accomplished by a handful of people in any given discipline, i.e way above the standards of the general population.
I feel like if we are holding AI to those standards, we are talking about not just AGI, but artificial super-intelligence.
> but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems
IMO/AIME problems perhaps, but surely that's too narrow a view for all of mathematics. If solving conjectures were simply a matter of trying a standard range of techniques enough times, then there would be a lot fewer open problems around than what's the case.
Took a couple just now. It seems like a straightforward generalization of the IQ tests I've taken before, reformatted into an explicit grid to be a little friendlier to machines.
Not to humble-brag, but I also outperform on IQ tests well beyond my actual intelligence, because "find the pattern" is fun for me and I'm relatively good at visual-spatial logic. I don't find their ability to measure 'intelligence' very compelling.
Given your intellectual resources -- which you've successfully used to pass a test that is designed to be easy for humans to pass while tripping up AI models -- why not use them to suggest a better test? The people who came up with Arc-AGI were not actually morons, but I'm sure there's room for improvement.
What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.
With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.
Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.
Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.
Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to the progressive matrix test? usable for measuring human IQ perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.
The models never have access to the answers for the private set -- again, at least in principle. Whether that's actually true, I have no idea.
The idea behind Arc-AGI is that you can train all you want on the answers, because knowing the solution to one problem isn't helpful on the others.
In fact, the way the test works is that the model is given several examples of worked solutions for each problem class, and is then required to infer the underlying rule(s) needed to solve a different instance of the same type of problem.
That's why comparing Arc-AGI to chess or other benchmaxxing exercises is completely off base.
(IMO, an even better test for AGI would be "Make up some original Arc-AGI problems.")
Imagine that pattern recognition is 10% of the problem, and we just don't know what the other 90% is yet.
Streetlight effect for "what is intelligence" leads to all the things that LLMs are now demonstrably good at… and yet, the LLMs are somehow missing a lot of stuff and we have to keep inventing new street lights to search underneath: https://en.wikipedia.org/wiki/Streetlight_effect
I don't think many people are saying 100% on ARC-AGI-2 is equivalent to AGI (names are dumb as usual). It's just the best metric I have found, not the final answer. Spatial reasoning is an important part of intelligence even if it doesn't encompass all of it.
It's very much a vision test. The reason the models don't all pass it easily is only the vision component; it doesn't have much to do with reasoning at all.
I want to read a short sci-fi story set in 2150 about how, mysteriously, no one has been able to train a better LLM for 125 years. The binary weights are studied with unbelievably advanced quantum computers, but no one can really train a new AI from scratch. This starts cults, wars, and legends, and ultimately (by the third book) leads to the main protagonist learning to code by hand, something that no human left alive still knows how to do. Could this be the secret to making a new AI from scratch, more than a century later?
There's a sci-fi short story about a janitor who knows how to do basic arithmetic and becomes the most important person in the world when some disaster happens. Of course, after things get set up again thanks to his expertise, he becomes low status again.
Might sell better with the protagonist learning iron age leatherworking, with hides tanned from cows that were grown within earshot, as part of a process of finding the real root of the reason for why any of us ever came to be in the first place. This realization process culminates in the formation of a global, unified steampunk BDSM movement and a wealth of new diseases, and then: Zombies.
> Do you get better results from prompting by being more poetic?
Is that yet-another accusation of having used the bot?
I don't use the bot to write English prose. If something I write seems particularly great or poetic or something, then that's just me: I was in the right mood, at the right time, with the right idea -- and with the right audience.
When it's bad or fucked-up, then that's also just me. I most-assuredly fuck up plenty.
They can't all be zingers. I'm fine with that.
---
I do use the hell out of the bot for translating my ideas (and the words that I use to express them) into languages that I can't speak well, like Python, C, and C++. But that's very different. (And at least so far I haven't shared any of those bot outputs with the world at all, either.)
So to take your question very literally: no, I don't get better results from prompting by being more poetic. The responses to my prompts don't improve by those prompts being articulate or poetic.
Instead, I've found that I get the best results from the bot fastest by carrying a big stick, and using that stick to hammer and welt it into compliance.
Things can get rather irreverent in my interactions with the bot. Poeticism is pretty far removed from any of that business.
Drama if I had to pick the symptom most visible from the outside.
A lot of talent left OpenAI around that time, most notably in this regard would be Ilya in May '24. Remember that time Ilya and the board ousted Sam only to reverse it almost immediately?
Is this JavaScript's take on the GOF book selections plus all the shenans, er "patterns," the ecosystem build tools require?
I was hoping for more high-level architecture based on war stories and experience. This is pretty basic stuff, which has its value for people earlier in their journey, and it does seem to have effort put into it from a quick peruse.
I'm implementing this concept with OCI layers, so in my mind, we already have a protocol and if we add a manifest spec that includes some things beyond what container images normally do, that seems sufficient to me
The proposal uses too many fanciful or made-up words; if you read it, you'll know the vibe I'm talking about, even if I can't find the right words to describe it. There's a fair amount of anthropomorphizing, mostly in trying to impart how human intelligence works onto the machines. OP, this would be the first thing to address in your proposal. We don't need unintuitive words like "artipoint" or "cortex layer"; use industry terms instead.
Thank you for the thoughts. Regarding "made-up words": what if the concepts reference things that didn't exist before their definition in these specifications?
At one point, "hyperlink" and "Uniform Resource Locator" were made-up words, no different than, say, "web log" (blog) and "client-server architecture" being made up as well.
I contend that both "Artipoint" and "Cortex Layer" have similar parallels, in the sense that they refer to concepts newly formulated, or at least captured into a formalism with a label for the first time!
Excuse me? If you mean AI Studio, are you talking about the product where you can’t even switch which logged in account you’re using without agreeing to its terms under whatever random account it selected, where the ability to turn off training on your data does not obviously exist, and where it’s extremely unclear how an organization is supposed to pay for it?
Yes, much like admin.google.com (the GSuite admin interface), which goes ahead and tries to two-factor your personal GMail account every single time you load it instead of asking you which of the actual GSuite accounts you're signed into you'd like to use...
Yeah, with multiple Chrome profiles, you have to be mindful of which one you last had focused before clicking a link from an external application (e.g. Tailscale), so that the new tab opens in the right instance and the account(s) you use in it are available.
Definitely use multiple Chrome profiles if you aren't already. You can color-code them to make visual identification a breeze.
I'm aware of multiple Chrome profiles and I do not want to use them. Google should simply make their account switching consistent across their apps and work sensibly in these corner cases.
"Simply" is doing a lot of work; profiles are the outcome of addressing the problems you are talking about. Many people enjoy them and find them useful. Why are you against using them?
Simply isn't doing much work: account switching works just fine on Gmail, Search, Maps, Calendar, etc. The issue is that some Google apps do not follow the standard of the overall fleet. Google gives us the account-switching feature; it's obviously an intended way to use their products. Otherwise they would not provide it and tell you to use browser profiles instead.
I don't want my history, bookmarks, open tabs, and login sessions at every website divided among my 5 GSuite workspace accounts and my 1 personal Gmail. That adds a bunch of hassle for what? The removal of a minor annoyance when I use these specific Google apps? That is taking a sledgehammer to a slightly bent nail.
If it works for you, great; that's why it's there. But doing this for anything more than the basic happy-path setup of "I have one personal account and one GSuite work account" is nuts in my opinion.
I always have a buggy-as-hell experience with multiple Google accounts across pretty much all their services. I've been wondering whether it's just me or whether this is somehow normal.
Don't get me wrong, AI Studio is pretty bad and full of issues, but getting an API key was not hard or an issue in itself. Using any auth method besides personal-account OAuth with gemini-cli never worked for me, even after hours of trying.
Python is the primary implementation, Java is there, and Go is relatively new and aiming for parity. They could have contributed the TypeScript implementation and built on a common, solid foundation, but alas, the hydra's heads are not communicating well.
These other "frameworks" are (1) built by people who need to sell something, so they are often tied to their current thinking and paid features, and (2) sit at the wrong level. ADK gives me building blocks for generalized agents, whereas most of these frameworks are tied to coding and some peculiarities you see there (like forcing you to deal with Studio; no thanks). They also have too much abstraction, and I want to be able to control the lower-level knobs and levers.
ADK is the closest to what I've been looking for: an analog to Kubernetes in the agentic space. Deal with the BS, give me great abstractions and building blocks, and set me free. So many of the other frameworks want to box you into how they do things today, given current understanding. ADK is minimal and easy to adjust as we learn things.
These projects are all a level above OpenRouter; they call the same standard APIs, or even custom ones, and manage the translation. They do a lot more as well.
ADK has an option to use litellm (openrouter alternative), among many options
I have a Claude Max subscription and a Gemini Pro sub, and I exclusively use them on the CLI. When I run out of Claude Max each week, I switch over to Gemini, and the results have been pretty impressive. I did not want to like it, but credit where credit is due to Google.
Like the OP and others, I didn't use the API for Gemini, and it was not obvious how to do that. That said, it's not cost-effective to develop on pay-as-you-go API pricing versus a sub, so I don't know why you would. Sure, you need the API for any application with built-in LLM features, but not for developing with the LLM-assisted CLI tools.
I think the issue with CLI tools for many is that you need to be competent with the CLI, like an actual *nix user, not a Mac-first user, etc. Personally, I have over 30 years of daily shell use as a sysadmin and developer. I started with ksh and csh and have used every shell you can think of since.
For me any sort of a GUI slows me down so much it's not feasible, to say nothing of the physical ailments associated with excessive mousing.
Having put approaching thousands of hours working with LLM coding tools so far, for me claude-code is the best, gemini is very close and might have a better interface, and codex is unusable and fights me the whole time.
> it's not cost effective to develop without a Sub vs on API pay-as-you-go, so i do no know why you would
My spend is lower, so I conclude otherwise.
> I think the issue with cli tools for many is...
Came from that world, vim, nvim, my dev box is remote, homelab
The issue is not that it is a CLI; it's that you are trying to develop software through the limited portal of a CLI. How do you look at multiple files at the same time? How do you scroll through a file? Either:
1. You cannot, through a tool like gemini-cli.
2. You are using another tool to look at files/diffs.
3. You aren't looking at the code and are vibe-coding your way to future regret.
> For me any sort of a GUI slows me down so much it's not feasible.
vim is a "GUI" (a TUI), and VS Code has keyboard shortcuts; you're conflating GUI with mouse work.
> Having put approaching thousands of hours working with LLM coding tools so far, for me claude-code is the best, gemini is very close and might have a better interface, and codex is unusable and fights me the whole time.
Anecdotal "vibe" opinions are not useful. We need to do some real evals, because people are telling stories like they do about their stock wins, i.e. they don't tell you about the losses.
Thousands of hours sounds like you're into the vibe-coding / churning / outsourcing paradigm. There are better ways to leverage these tools. Also, if you have 1000+ hours of LLM time, how have you not gone below the prepackaged experience Big AI is selling you?
I'm using it fine through both AI Studio and Vertex AI, via direct API calls.
It's not at all hard generally; the core of this issue is centered on gemini-cli, which is a hot pile of trash, and its inability to get keys or account credentials working (why even use an API key at all? Google is top-notch at auto-auth/WIF).
It's insane to me how gemini-cli is so bad at the basics when so many great open-source Google packages handle all of this transparently. All I need is my gcloud auth'd with the right account/project. I sarcastically assume this is because they vibe-coded gemini-cli and it implemented everything from scratch, missing out on reusing those great packages.
If you mean Antigravity, then... how? Their docs say you can't do this.
If you mean Gemini, then I personally haven't had issues, but I haven't tried to productionize a Gemini app. The OP's account seems to reflect other comments here.
I suspect Antigravity will be a big flop like gemini-cli. They are so bad in this area they couldn't even write an extension or fork oss-code themselves, instead spending $2B to pork an open-source project with someone else's branding.
I don't know; he announced on Bluesky that they are dropping a big vibe-coding update to AI Studio next year.
1. Cart out in front of the horse a bit on this one; lame hype-building at best.
2. Not at all what I want the team focusing on; they don't seem to have a clear mission.
Generally, Google PMs and leaders have not been impressive or in touch for many years, since about the time all the good ones cashed out and started their own companies.
Tariffs are applied to countries that are "ripping us off," if King Donald's definition is used consistently for every country. Even if we had a surplus with a country, it still gets a 10% tariff that Americans have to pay...
One thing to note is that an EO doesn't fall under this, because the President doesn't write laws; he only signs them.