Why would screenshots be necessary if a textual description of the factory state is both easier to interpret and less prone to confusion? The game is played on a grid, so converting the game state to ascii ought to be trivial.
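To make the "trivial" claim concrete, here's a minimal sketch of grid-to-ASCII conversion. The entity names and the glyph mapping are invented placeholders, not actual Factorio internals:

```python
# Hypothetical sketch: rendering a Factorio-like grid state as ASCII.
# Entity names and the GLYPHS mapping are invented for illustration.
GLYPHS = {
    "transport-belt": "=",
    "burner-inserter": ">",
    "stone-furnace": "F",
}

def render_ascii(entities, width, height):
    """entities: list of (name, x, y) tuples with 0-based grid coords."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for name, x, y in entities:
        grid[y][x] = GLYPHS.get(name, "?")  # "?" for unknown entity types
    return "\n".join("".join(row) for row in grid)

state = [("stone-furnace", 1, 0), ("burner-inserter", 2, 0), ("transport-belt", 3, 0)]
print(render_ascii(state, 5, 2))
# .F>=.
# .....
```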
Engineering-wise it actually is quite trivial, but the underlying question is which modality best elicits spatial reasoning from the current general-purpose models.
We tried (very anecdotally) a couple of months ago to get an agent to reason over a few ASCII representations of factories, and the results weren't very promising. It seems the models struggle to build an accurate internal spatial representation of the game state from textual tokens alone.
The question is what the most efficient, highest-quality representation is that we could use to improve that.
> It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens
That'd actually be interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which would be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world-model capabilities, which would be even more interesting.)
Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".
Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]
Did you try providing 2D vectors of where each object relates to every other object? Seems like the most obvious way.
In my experience the current generation of models is very poor at spatial reasoning even when given accurate coordinate-based locations for each object. But I suspect that when a model can build the full relationship graph of all objects, by being given those spatial relationships as vectors, it will do much better.
We did discuss this at some point but didn't end up trying it out. I think it's quite an interesting avenue and worth a shot. My intuition also says that spatial capabilities will improve if the model has more access to relative information and doesn't need to infer it from absolute coordinates.
Given that vector space over text is a space of semantic distance rather than geometric distance between objects, the two intuitively feel different in nature: words are not at all likely to be represented in similar ratios of distances.
I think a tokenization of ratios between perceived boundaries might help. But, I’m just shooting in the dark.
You're conflating vectors with only their use for semantic meaning. Vectors are just spatial relationships; in the case of objects in Factorio, we could provide, for every single object, its vector to every other object in literal 2D space. This would essentially give the LLM a complete relationship mapping, since it can't build one by "seeing" a picture or from absolute coordinates alone.
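The pairwise-vector idea above could be sketched as follows. The entity names and positions are invented for illustration:

```python
# Sketch: give the model explicit pairwise offset vectors in grid space
# instead of absolute coordinates. Names/positions are placeholders.
from itertools import combinations

entities = {
    "furnace": (4, 2),
    "inserter": (5, 2),
    "belt": (6, 2),
}

def pairwise_offsets(entities):
    """Return the (dx, dy) offset from each entity to every other entity."""
    offsets = {}
    for a, b in combinations(entities, 2):
        (ax, ay), (bx, by) = entities[a], entities[b]
        offsets[(a, b)] = (bx - ax, by - ay)
    return offsets

for (a, b), (dx, dy) in pairwise_offsets(entities).items():
    print(f"{a} -> {b}: ({dx:+d}, {dy:+d})")
```

Note this grows quadratically: n objects produce n(n-1)/2 pairs, which feeds into the token-cost concern raised elsewhere in the thread.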
Yeah, but that's a biased approximation, at the cost of an assumed equivalence rather than one that is true in ratio. You'd have to treat tokens as some universal unit of distance to approximate a unit of measurement along heading/magnitude.
Overall, visual perception is about noticing comparative differences, not measuring absolute quantities.
Trivial in the sense of being only engineering work, sure. But it's a lot of tokens. Long-context models do a number of things to get all that working context in; some of those things elide details, compress, or produce token segments that are harder to reason about. When a burner inserter at a location takes up 50-100 tokens, and you want the model to reason about 100 of them, this is still a pretty challenging task for any LLM.
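Back-of-envelope, using the 50-100 tokens-per-entity figure from the comment above:

```python
# Rough token budget for a verbose textual encoding of 100 entities,
# using the 50-100 tokens/entity estimate from the thread.
tokens_per_entity = (50, 100)
n_entities = 100

low = tokens_per_entity[0] * n_entities    # 5,000 tokens
high = tokens_per_entity[1] * n_entities   # 10,000 tokens
print(f"{low}-{high} tokens just to describe the entities")
```

So even a modest factory eats a meaningful chunk of context before any reasoning happens.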
Ah, I don't know much about multimodal models, but I wonder what they'd make of a pixel-art representation of the factory, where each pixel is a point on the grid and each colour is a specific entity, perhaps ignoring things such as bots flying about. Probably easier to comprehend than an actual screenshot?
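A minimal sketch of that pixel-art idea: one pixel per grid cell, one colour per entity type, written out as a tiny plain-text PPM image. The entity names and colour choices are invented:

```python
# Sketch: render a grid of entity types as a one-pixel-per-cell image
# in the plain-text PPM (P3) format. Names and colours are placeholders.
COLORS = {
    None: (0, 0, 0),             # empty ground
    "belt": (255, 255, 0),
    "inserter": (255, 0, 0),
    "furnace": (128, 128, 128),
}

def to_ppm(grid):
    """grid: 2D list of entity names (or None). Returns a P3 PPM string."""
    h, w = len(grid), len(grid[0])
    lines = ["P3", f"{w} {h}", "255"]
    for row in grid:
        lines.append(" ".join("{} {} {}".format(*COLORS[c]) for c in row))
    return "\n".join(lines)

grid = [[None, "furnace", "inserter"],
        ["belt", "belt", "belt"]]
print(to_ppm(grid))
```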
The thing is that when you create a dense ASCII representation, any gain you might make from the spatial relationships is lost to: a) the tokeniser not operating on single characters (remember strawberry), and b) the increased number of 'dead' tokens that don't encode very much.
Our sparse encoding seems to confuse the models less - even though it certainly isn't perfect.
I mean, at some point you compress the board state down to Dwarf Fortress, with an extended-ASCII representation for each grid cell (maybe 2 bytes each?).
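The 2-bytes-per-cell idea might look like this: one byte for the entity type, one for orientation or state. The type codes are invented for illustration:

```python
# Sketch: pack each grid cell into 2 bytes (entity type + direction).
# The entity codes below are invented placeholders.
import struct

ENTITY_CODES = {"empty": 0, "belt": 1, "inserter": 2, "furnace": 3}

def pack_cell(entity, direction):
    """direction: 0..3 for N/E/S/W. Returns 2 bytes."""
    return struct.pack("BB", ENTITY_CODES[entity], direction)

def pack_grid(cells):
    """cells: flat list of (entity, direction) pairs, row-major order."""
    return b"".join(pack_cell(e, d) for e, d in cells)

row = [("furnace", 0), ("inserter", 1), ("belt", 1), ("empty", 0)]
encoded = pack_grid(row)
print(len(encoded), "bytes for", len(row), "cells")  # 8 bytes for 4 cells
```

Of course, raw bytes like these would still need to be surfaced to the model as tokens somehow, so this only bounds the information content, not the token count.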