Hacker News | irskep's comments

Working on mrjob was a big part of my first job out of college. Fun to see it get mentioned more than ten years later.

What some commenters don't realize about these bureaucratic, IO-heavy, expensive tools is that they are sometimes used to apply a familiar way of thinking, which has Business Benefits. Sometimes you don't know whether your task will take seconds, minutes, hours, days, or weeks on one fast machine with a well-thought-out program, but you really need it to take at most hours, and writing well-thought-out programs takes time you could spend on other stuff. If you know in advance that you can scale the program, it's lower risk to just write it as a Hadoop job and be done with it. It also helps to have an "easy" pattern for processing Data That Feels Big Even If It Isn't That Big (Although Yelp's Data Actually Was Big). Such was the case with mrjob at Yelp in 2012. They got a lot of mileage out of it!

The other funny thing about mrjob is that it's a layer on Hadoop Streaming, the mode where the Java process actually running the Hadoop worker opens a subprocess to your Python script, which accepts input on stdin and writes output on stdout rather than working on values in memory. A high I/O price to pay for the convenience of writing Python!
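The streaming contract is simple to sketch: input records arrive one per line on stdin, and output is tab-separated key/value lines on stdout. Here's a minimal word-count pair in Python; the function names are illustrative, not mrjob's API.

```python
# Sketch of the Hadoop Streaming contract: the framework pipes input lines
# to the script's stdin and reads tab-separated key/value pairs from stdout.
# These function names are illustrative, not part of mrjob's API.

def map_line(line):
    """Mapper: emit a (word, 1) pair for each word in an input line."""
    return [(word, 1) for word in line.split()]

def reduce_pairs(pairs):
    """Reducer: sum counts per key. In a real job, Hadoop shuffles and
    sorts the mapper output so each reducer sees its keys grouped."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

# In an actual streaming job, the glue is just:
#   for line in sys.stdin:
#       for key, value in map_line(line):
#           print(f"{key}\t{value}")
```

Every record crosses a process boundary as text, which is exactly the I/O tax mentioned above.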


That's a good point. Hadoop may not be the most efficient way, but when a deliverable is required, Hadoop is a known quantity and really works.

I did some interesting work ten years ago, building pipelines to create global raster images of the entire Open Street Map road network [1]. I was able to process the planet in 25 minutes on a $50k cluster.

I think I had the opposite problem: Hadoop wasn't shiny enough and Java had a terrible reputation in academic tech circles. I wish I'd known about mrjob because that would have kept the Python maximalists happy.

I had lengthy arguments with people who wanted to use Spark, which simply did not have the chops for this: with Spark, attempting to process OSM for even a small country failed.

Another interesting side effect of using the map-reduce paradigm came when processing vector datasets. PostGIS took multiple days to process the million-vertex Norwegian national parks; by splitting the planet into data-density-sensitive tiles (~2,000 vertices each), however, I could process the planet in less than an hour.
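The tiling step can be pictured as a quadtree split. Here's a hypothetical Python sketch: the ~2,000-vertex cap comes from the comment above, but everything else, including the names, is assumed and not code from osm-hadoop.

```python
# Hypothetical sketch of density-sensitive tiling: quarter each tile's
# bounding box (quadtree-style) until no tile exceeds the vertex cap.
# The ~2000-vertex figure is from the comment above; the rest is assumed.

def split_tiles(points, bbox, max_vertices=2000):
    """points: (x, y) pairs; bbox: (minx, miny, maxx, maxy), half-open on
    the max edges so quartering partitions points without duplication."""
    minx, miny, maxx, maxy = bbox
    inside = [(x, y) for x, y in points
              if minx <= x < maxx and miny <= y < maxy]
    if not inside:
        return []
    if len(inside) <= max_vertices:
        return [(bbox, inside)]
    midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
    tiles = []
    for quad in [(minx, miny, midx, midy), (midx, miny, maxx, midy),
                 (minx, midy, midx, maxy), (midx, midy, maxx, maxy)]:
        tiles.extend(split_tiles(inside, quad, max_vertices))
    return tiles
```

Each leaf tile can then become an independent map task, which is what makes the paradigm pay off for dense geometries.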

Then Google Earth Engine came along and I had to either use that, or change career. Somewhat ironically GEE was built in Java.

[1] https://github.com/willtemperley/osm-hadoop


I’ve also seen some Really Really Bad software due to engineers having “Not Invented Here” syndrome. If it takes using big, well-known frameworks to avoid some of that, it’s worth the cost.


Today I'm hacking on automate-terminal, a command line program and Python library that abstracts the various terminal emulator automations (iTerm2, WezTerm, Kitty, tmux) into a single API. Mostly made for use by other tools. https://github.com/irskep/automate-terminal
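For a sense of what "a single API" over several emulators can look like, here's a hypothetical adapter-pattern sketch; none of these class or method names come from automate-terminal itself.

```python
# Hypothetical sketch of one API over many terminal emulators (the adapter
# pattern). These names are illustrative only, not automate-terminal's API.
from abc import ABC, abstractmethod

class Terminal(ABC):
    """One interface; each emulator gets its own backend."""

    @abstractmethod
    def run_in_new_tab(self, command: str) -> None: ...

class TmuxBackend(Terminal):
    def __init__(self):
        # Commands are recorded instead of executed so the sketch stays
        # runnable without tmux installed.
        self.issued = []

    def run_in_new_tab(self, command: str) -> None:
        # A real backend would shell out, e.g. `tmux new-window <command>`.
        self.issued.append(["tmux", "new-window", command])

def open_tab(terminal: Terminal, command: str) -> None:
    """Callers program against the interface, not a specific emulator."""
    terminal.run_in_new_tab(command)
```

Other tools then depend only on the `Terminal` interface, and an iTerm2, WezTerm, or Kitty backend can slot in behind it.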


Sounds interesting. What was the first scenario where you thought, “Bam! I need to automate this, and I'm going to do it right now!”?


I'm really excited by this development! Material for MkDocs has raised the quality level of so many projects' docs, my own included, by making good navigation the default. It's by far my favorite system to browse as a reader, and use as a project maintainer.

I hope the new theme allows for more customization than the old Material theme. It was really hard to create a unique brand identity within the constraints of Material; it just wasn't built with customization in mind beyond a color. The "modern" theme looks minimal in a way that gives me some hope for this.

Looking forward to kicking the tires on Zensical!


I'm working on autowt, a git worktree manager that happens to make LLM coding workflows easier. https://steveasleep.com/autowt/

It has some rough edges, but I use it a ton and get a lot of value out of it.


> AI editors should look into letting you operate on multiple git branches simultaneously

Git worktrees are great for this. I built a little tool to make them more ergonomic: https://steveasleep.com/autowt/

You really don't need every LLM vendor to build their own version of worktrees.


Letting Cursor pick the model for you is inviting them to pick whichever model is cheapest for them, at the cost of your experience. It's better to develop your own sense of which model works in a given situation. Personally, I've had the most success with Claude, Gemini Pro, and o3 in Cursor.


I was once a heavy user of Cursor with Gemini 2.5 Pro as the model, then a Claude Code convert. Occasionally I try out Gemini CLI and somehow it fails to impress, even though Cursor + Gemini still works well. I think it's something about the limited feature set and system prompt.


I'm working on a little wrapper that solves this problem. I have similar needs with .env files, and in my case running 'uv sync' to install dependencies. I linked it elsewhere in this thread so I won't repeat the URL (autowt). It's definitely possible to make this workflow effective with some scripting.


I've been working on a tool for exactly this purpose: https://steveasleep.com/autowt/

I'm the only user at the moment, and I really enjoy the workflow. I run about four claude-codes at once this way. It's a little underbaked but I think this is the way a lot of people are going to go. Seems like the 'par' tool in a sibling comment is a similar approach.

Containers do make things easier, especially since agents can see the log output of any service super easily. To do the same thing outside a container you need to send logs somewhere the agent can see.


There's also container-use by dagger (docker creators): https://github.com/dagger/container-use


Would be nice to have it listed as a brew formula, instead of pip.


uv tool beats brew for me. Certainly by a factor of 100 for install time :)


"Compactions" are just reducing the transcript to a summary of the transcript, right? So it makes sense that it would get worse because the agent is literally losing information, but it wouldn't be due to context rot.

The thing that would signal context rot is when you approach the auto-compact threshold. Am I thinking about this right?
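For concreteness, the mechanic can be sketched in a few lines of toy Python; everything here is hypothetical, with `summarize` standing in for an LLM call and token counting reduced to a crude word count.

```python
# Rough sketch of the compaction mechanic: above a threshold, older
# messages are replaced by a single summary. All names are hypothetical;
# `summarize` stands in for an LLM summarization call.

def count_tokens(messages):
    # Crude word-count proxy for a real tokenizer.
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    # Placeholder for the LLM call that writes the summary.
    return f"Summary of {len(messages)} earlier messages."

def compact(messages, limit, keep_recent=2):
    """Above the limit, collapse all but the newest messages into one
    summary message. Detail in the summarized span is gone for good,
    which is why quality can drop after compaction without any rot."""
    if count_tokens(messages) <= limit or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [{"role": "system", "content": summarize(older)}] + recent
```

The information loss happens at the `summarize` step, which is distinct from rot accumulating as the live transcript grows toward the threshold.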


Yes, but on agentic workflows it's possible to do more intelligent compaction.

