
We built agents to test GitHub repo quickstarts associated with arXiv papers a couple of months before this paper was published, and wrote about it publicly here: https://remyxai.substack.com/p/self-healing-repos

We've been pushing it further to implement the papers' methods as draft PRs in your target repo, published a month before this preprint: https://remyxai.substack.com/p/paperswithprs

To limit the attack surface, we added PR #1929 to AG2 so we could pass API keys to the DockerCommandLineCodeExecutor while also using egress whitelisting to block the agent from reaching a compromised server: https://github.com/ag2ai/ag2/pull/1929
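
A minimal sketch of that setup, assuming the PR's environment-passing option (the commented-out kwarg name below is illustrative, not the exact AG2 API) and an internal Docker network fronted by an allow-listing proxy for egress:

    # Sketch only: env-passing kwarg and image choice are assumptions.
    import docker
    from autogen.coding import DockerCommandLineCodeExecutor

    client = docker.from_env()

    # An "internal" network has no outbound route; pair it with a proxy sidecar
    # that only forwards traffic to an allow-list (pypi.org, huggingface.co, ...).
    client.networks.create("agent-egress-allowlist", driver="bridge", internal=True)

    executor = DockerCommandLineCodeExecutor(
        image="python:3.11-slim",   # or a pre-built paper image
        work_dir="workspace",
        timeout=600,
        # container_env={"HF_TOKEN": "..."},  # hypothetical name for the PR's key passing
    )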

Since then, we've been scaling this with k8s Ray workers so we can run it in the cloud and build images for the hundreds of papers published daily.
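
Roughly, the fan-out looks like this (a sketch; the CLI entry point invoked inside the task is hypothetical):

    # Sketch: fan a day's batch of arXiv IDs out to Ray workers on a k8s cluster.
    import subprocess
    import ray

    ray.init(address="auto")  # connect to the running cluster (e.g. a KubeRay head)

    @ray.remote(num_cpus=2)
    def test_quickstart(arxiv_id: str) -> dict:
        # Clone the linked repo and run its quickstart inside the sandboxed executor.
        proc = subprocess.run(
            ["python", "-m", "quickstart_runner", arxiv_id],  # hypothetical entry point
            capture_output=True, text=True, timeout=3600,
        )
        return {"arxiv_id": arxiv_id, "ok": proc.returncode == 0, "log": proc.stdout[-2000:]}

    paper_ids = ["2501.01234", "2501.05678"]  # that day's batch
    results = ray.get([test_quickstart.remote(pid) for pid in paper_ids])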

With the code running in Docker, the network interface constrained, deployment in the cloud, and ultimately a human in the loop through PR review, it's hard to see where a prompt-injection attack comes into play from testing the code.

Would love to get feedback from an expert on this: can you imagine an attack scenario, Simon?

I'll need to work out a check for the case where someone creates a paper with code instructing my agent to publish keys to a public HF repo for others to exfiltrate.
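
One cheap guardrail (a sketch, not what we ship): scan anything the agent is about to commit or upload for key-shaped strings before it leaves the sandbox.

    # Sketch: block a push/upload if the outbound text contains anything key-shaped.
    import re
    import sys

    SECRET_PATTERNS = [
        re.compile(r"hf_[A-Za-z0-9]{30,}"),   # Hugging Face tokens
        re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style keys
        re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key IDs
        re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access tokens
    ]

    def contains_secret(text: str) -> bool:
        return any(p.search(text) for p in SECRET_PATTERNS)

    def gate_outbound(diff_text: str) -> None:
        if contains_secret(diff_text):
            sys.exit("Refusing to publish: possible credential found in agent output.")

    # Call gate_outbound(staged_diff) right before any `git push` or HF upload.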


AI & ML engineering in particular is very research-adjacent.

That's why we began building agents to source ideas from the arXiv and implement the core methods from the papers in YOUR target repo months before this publication.

We shared the demo video of it in our production system a while back: https://news.ycombinator.com/item?id=45132898

And we're offering a technical deep-dive into how we built it tomorrow at 9am PST with the AG2 team: https://calendar.app.google/3soCpuHupRr96UaF8

We've built close to 1K Docker images over the past couple of months, which we make public on Docker Hub: https://hub.docker.com/u/remyxai

And we're close to an integration with arXiv that will have these pre-built images linked to the papers: https://github.com/arXiv/arxiv-browse/pull/908


Yeah, probably pretty simple compared to the methods we've publicly discussed for months before this publication.

Here's the last time we showed our demo on HN: https://news.ycombinator.com/item?id=45132898

We'll actually be presenting on this tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8

Besides ReAct, we use AG2's two-agent pattern with a Code Writer and a Code Executor backed by the DockerCommandLineCodeExecutor.

We also use hardware monitors and LLM-as-a-Judge to assess task completion.
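
For context, the two-agent pattern boils down to something like this (a sketch in the AG2/autogen 0.2-style API; the judge pass sits outside this loop and just scores the final transcript):

    from autogen import AssistantAgent, ConversableAgent
    from autogen.coding import DockerCommandLineCodeExecutor

    executor = DockerCommandLineCodeExecutor(image="python:3.11-slim", work_dir="workspace")

    code_writer = AssistantAgent(
        name="code_writer",
        llm_config={"config_list": [{"model": "gpt-4o"}]},  # illustrative model choice
    )
    code_executor = ConversableAgent(
        name="code_executor",
        llm_config=False,                              # this agent only runs code
        code_execution_config={"executor": executor},
        human_input_mode="NEVER",
    )

    result = code_executor.initiate_chat(
        code_writer,
        message="Clone the repo for arXiv:XXXX.XXXXX and run its quickstart.",  # placeholder ID
        max_turns=10,
    )
    # An LLM-as-a-Judge step can then read result.chat_history and score task completion.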

It's how we've built nearly 1K Docker images for arXiv papers over the last couple months: https://hub.docker.com/u/remyxai

And how we'll support computational reproducibility by linking Docker images to the arXiv paper publications: https://github.com/arXiv/arxiv-browse/pull/908


No doubt, this toy demo will break your system if it runs the research repo's code without sandboxing.

We thought this through as we built a system that goes beyond running the quickstart to implement the core methods of arXiv papers as draft PRs for YOUR target repo.

Running the quickstart in a sandbox is, on its own, practically useless.

To limit the attack surface, we added PR #1929 to AG2 so we could pass API keys to the DockerCommandLineCodeExecutor and use egress whitelisting to limit the agent's ability to reach out to a compromised server: https://github.com/ag2ai/ag2/pull/1929

We'd been talking publicly about this for at least a month before this publication, and along the way we've built up nearly 1K Docker images for arXiv paper code: https://hub.docker.com/u/remyxai

We're close to seeing these images linked to the arXiv papers once PR #908 is merged: https://github.com/arXiv/arxiv-browse/pull/908

And we're actually doing a technical deep-dive with the AG2 team on our work tomorrow at 9am PST: https://calendar.app.google/3soCpuHupRr96UaF8


It's a toy version of a product we've been building.

We go beyond testing the quickstart to implement the core methods from arXiv papers as draft PRs for your target repo.

Posted a while ago: https://news.ycombinator.com/item?id=45132898

Feel free to join us for the technical deep-dive tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8


Yeah, you might be interested in our post & demo video: https://news.ycombinator.com/item?id=45132898

Which we're presenting on tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8


Looks just like our post: https://news.ycombinator.com/item?id=45132898

Which we're presenting on tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8

The difference is that we put this out months ago: https://x.com/smellslikeml/status/1958495101153357835

We're about to get PR #908 merged so anyone can use one of the nearly 1K Docker images we've already built: https://github.com/arXiv/arxiv-browse/pull/908

We've been publishing about this all summer on Substack and Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1loj134/arxiv2d...


The ability to accurately estimate distances from RGB image input is just at the frontier of current AI model capabilities.

Nonetheless, distance estimation is critical for perception and planning in embodied AI applications like robotics, which must navigate our 3D world.

We just released SpaceThinker, a 3B open-weight VLM designed specifically for spatial reasoning tasks like distance and size estimation from RGB images. It’s small and fast enough for on-device use, trained entirely on open-source data/code.

* Model: https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B

* Data: https://huggingface.co/datasets/remyxai/SpaceThinker

On the QSpatial++ benchmark, SpaceThinker sits between GPT-4o and Gemini 2.5 Pro in performance; see this comparison table: https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B#qsp...

Interesting finding: by switching the model name in this colab to the non-reasoning variant SpaceQwen (https://huggingface.co/remyxai/SpaceQwen2.5-VL-3B-Instruct), you'll find the step-by-step reasoning prompt actually hurts performance, highlighting that reasoning-tuned and non-reasoning models respond very differently to complex instructions.
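
If you'd rather poke at it outside the colab, here's a minimal transformers sketch (prompt wording and generation settings are illustrative; the model card has the exact template):

    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from PIL import Image

    model_id = "remyxai/SpaceThinker-Qwen2.5VL-3B"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open("kitchen.jpg")  # any RGB scene
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How far apart are the fridge and the stove, in meters? Think step by step."},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=512)
    print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))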

Additional resources:

* https://github.com/andrewliao11/Q-Spatial-Bench-code/blob/ma...

* https://huggingface.co/blog/NormalUhr/deepseek-r1-explained#...

Feedback, suggestions, and collaborators are welcome!

If you're interested in contributing, we open-sourced VQASynth, our implementation of the SpatialVLM approach for generating VQA-style datasets from images. VQASynth was used to create the SpaceThinker dataset, which powered the fine-tuning of the SpaceThinker model showcased here.
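
For a feel of the approach, the last step reduces to templating spatial QA pairs from lifted 3D object positions (a simplified sketch; the real pipeline also does depth estimation, segmentation, and point-cloud lifting):

    # Sketch: turn 3D object centroids (meters, camera frame) into distance QA pairs.
    import itertools
    import math
    import random

    def make_distance_qa(objects: dict[str, tuple[float, float, float]]) -> list[dict]:
        templates = [
            "How far is the {a} from the {b}?",
            "What is the distance between the {a} and the {b}?",
        ]
        qa_pairs = []
        for (name_a, p_a), (name_b, p_b) in itertools.combinations(objects.items(), 2):
            dist = math.dist(p_a, p_b)
            qa_pairs.append({
                "question": random.choice(templates).format(a=name_a, b=name_b),
                "answer": f"About {dist:.2f} meters.",
            })
        return qa_pairs

    print(make_distance_qa({"chair": (0.4, 0.0, 2.1), "table": (1.3, 0.1, 2.8)}))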


Thanks, ollama


Exciting news, thanks for sharing! We've been applying this technique to create custom models on the fly with our no-code platform at https://remyx.ai. We're trying to build up our user base and get more feedback; try it out or check out our walkthrough: https://youtu.be/7SMySnRRTew?t=39

