One thing we've been talking about is creating tasks that are a bit more 'tower ...

aftbit · 2025-03-11T22:41:18 1741732878

>One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)

This sounds like a great idea for a short story in the style of Malak by Peter Watts. Imagine a future warfighter AI that has been fitted with a set of filters to make it think it's really having a pillow fight or building a factory to make screws, while it's actually tearing people apart or optimizing a military production line.

noddybear · 2025-03-12T13:29:46 1741786186

There was a Black Mirror episode about this too, I seem to remember! Soldiers imagining they were fighting monsters - while actually committing war crimes.

aftbit · 2025-03-15T19:13:21 1742066001

This was the central plot twist of "Spec Ops: The Line", a video game from 2012 that started out like your typical Call of Duty clone shooter and escalated to an interesting, if a bit twisted, look at how PTSD affects soldiers.

spieswl · 2025-03-11T16:52:43 1741711963

Love the suggestion, I'll clone it down and start poking around.

I believe your intuition about layout experiments needing to be of different genres is correct. I think you could have a pretty wide range of debugging opportunities (imbalanced belts, inserters fighting for items, insufficient power at full load leading to throughput loss, etc.) for the first. The second feels like it would be nicely encapsulated by focusing on optimizing for ratios, although seeing an agent realize that it can get more throughput by simply copy/pasting a block and upgrading a belt would be pretty wild (depending on the recipe, of course). Maybe nuclear power / heat exchanger ratios are a bit too far down the path, but optimizing for copper cable use in green circuits is pretty important and fairly early in the tech tree?

robotresearcher · 2025-03-11T15:34:13 1741707253

If (1) is a special case of (2), maybe you’d only need (2)?

noddybear · 2025-03-11T16:25:41 1741710341

True - although it might be interesting to benchmark them both, as (1) is more about debugging (something that these agents spend a lot of time doing).

tomrod · 2025-03-11T19:44:15 1741722255

So something like PvZ might work, right?