Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

One thing we've been talking about is creating tasks that are a bit more 'tower defence', where biters are released every X steps / seconds. The idea would be to test agents in building a military-industrial complex. One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)

Regarding layout optimisation benchmarks, we actually discussed this yesterday. I think we need 2 types of layout task: 1) fix this subtly broken factory, and 2) improve the throughput of this factory. These should be straightforward to implement, if you'd like to have a look.



>One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)

This sounds like a great idea for a short story in the style of Malak by Peter Watts. Imagine a future warfighter AI that has been fitted with a set of filters to make it think it's really having a pillowfight or building a factory to make screws while it's actually tearing people apart or optimizing a military production line.


There was a black mirror episode about this too, I seem to remember! Soldiers imagining they were fighting monsters - while actually committing war crimes.


This was the central plot twist of "Spec Ops: The Line", a video game from 2012 that started out like your typical Call of Duty clone shooter and escalated to an interesting if a bit twisted look at how PTSD affects soldiers.


Love the suggestion, I'll clone it down and start poking around.

I believe your intuition about layout experiments needing to be of different genres is correct. I think you could have a pretty wide range of debugging opportunities (imbalanced belts, inserters fighting for items, insufficient power at full load leading to throughput loss, etc) for the first. The second feels like it would be nicely encapsulated by focusing on optimizing for ratios, although seeing an agent realize that they can get more throughput by simply copy/pasting a block and upgrading a belt would be pretty wild (depending on the recipe, of course). Maybe nuclear power / heat exchanger ratios are a bit too far down the path, but optimizing for copper cable use in green circuits is pretty important and fairly early in the tech tree?


If (1) is a special case of (2), maybe you’d only need (2)?


True - although it might be interesting to benchmark them both, as (1) is more about debugging (something that these agents spend a lot of time doing).


So something like PvZ might work, right?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: