olliem36 · 2026-01-28T19:18:52 1769627932

Did you use GPT 5.2 Codex? lol

olliem36 · 2025-12-17T18:18:23 1765995503

Surveillance of the surveillants to prevent the surveilled

olliem36 · 2025-11-25T11:43:31 1764071011

Sounds good for tasks like the Excel example in the article, but I wonder how this approach will hold up in other multi-step agentic flows. Let me explain:

I try to be defensive in agent architectures to make it easy for AI models to recover/fix workflows if something unexpected happens.

If something goes wrong halfway through the code execution of multiple 'tools' using Programmatic Tool Calling, it's significantly more complex for the AI model to fix that code and try again compared to a single tool usage - you're in trouble, especially if APIs/tools are not idempotent.

The sweet spot might be using this as a strategy to complete tasks that are idempotent/retryable (like a database 'transaction') if they fail half way through execution.
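One way to picture the "retryable" idea above is a minimal sketch in which each step of a multi-tool run is recorded under a run ID, so a retry replays finished steps instead of executing them again. Everything here is an assumption for illustration: `run_step`, `run_workflow` and the in-memory `_completed` dict are hypothetical names, not part of any Programmatic Tool Calling API, and a real system would need durable storage.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory record of finished steps, keyed by (run_id, step_name).
# Assumption: a real system would persist this somewhere durable.
_completed: Dict[Tuple[str, str], object] = {}


def run_step(run_id: str, name: str, fn: Callable[[], object]) -> object:
    """Run fn at most once per (run_id, name); replay the stored result on retry."""
    key = (run_id, name)
    if key in _completed:
        return _completed[key]
    result = fn()  # may raise; nothing is recorded in that case
    _completed[key] = result
    return result


def run_workflow(
    run_id: str, steps: List[Tuple[str, Callable[[], object]]]
) -> List[object]:
    """Execute steps in order; safe to call again with the same run_id after a failure."""
    return [run_step(run_id, name, fn) for name, fn in steps]
```

If the second of two steps raises, calling `run_workflow` again with the same `run_id` skips the first step and only re-attempts the one that failed. This only helps when a half-finished step leaves no side effects behind, which is exactly the non-idempotent case the comment warns about.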

olliem36 · 2025-10-05T21:11:35 1759698695

We ended up making middleware for LLM 'tools/functions' that take common data/table formats like CSV, Excel and JSON.

The tool uses an LLM to write code that parses the data and runs the analysis, then returns the result to the calling LLM. Otherwise, we found that pumping raw table data into an LLM is just not reliable, even if you go to the effort of analysing smaller chunks and merging the results.

olliem36 · 2025-10-05T08:47:39 1759654059

At Zenning AI, we're building a generalist AI designed to replace entire jobs with just prompts. Our agents typically run autonomously for hours, so effective context management is critical. I'd say that we invest most of our engineering effort into what is ultimately context management, such as:

1. Multi-agent orchestration
2. Summarising and chunking large tool and agent responses
3. Passing large context objects by reference between agents and tools
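As a rough illustration of the third item, passing by reference can be sketched as a store that holds the large payload while agents and tools exchange a short handle. This is a sketch under assumptions, not a description of Zenning's system: `ContextStore`, the `ctx:` prefix and the preview length are all hypothetical.

```python
import hashlib
from typing import Dict


class ContextStore:
    """Hypothetical store: large payloads live here, agents exchange short refs."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def put(self, text: str) -> str:
        # Content-derived handle, so storing the same payload twice yields the same ref.
        ref = "ctx:" + hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._blobs[ref] = text
        return ref

    def get(self, ref: str) -> str:
        """Resolve a ref back to the full payload (raises KeyError if unknown)."""
        return self._blobs[ref]

    def preview(self, ref: str, max_chars: int = 80) -> str:
        """Short excerpt that is cheap to place in a prompt."""
        text = self._blobs[ref]
        return text if len(text) <= max_chars else text[:max_chars] + "..."
```

An agent would then put only the ref and a preview into its prompt, and a tool that needs the full data would call `get(ref)` itself.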

Two things to note that might be interesting to the community:

Firstly, when managing context, I recommend adding some evals to your context management flow, so you can measure effectiveness as you add improvements and changes.

For example, our evals will measure the impact of using Anthropic's memory over time. This allows our team to make better-informed decisions on which tools to use with our agents.

Secondly, there's a tradeoff not mentioned in this article: speed vs. accuracy. Faster summarisation (or 'compaction') comes at a cost of accuracy. If you want good compaction, it can be slow. Depending on the use case, you should adjust your compaction strategy accordingly. For example (forgive my major generalisation), for consumer-facing products speed is usually preferred over a bump in accuracy. However, in business settings accuracy is generally preferred over speed.

_joel · 2025-10-05T10:57:04 1759661824

lol, good luck with that

olliem36 · 2025-09-27T22:55:27 1759013727

We've built a multi-agent system, designed to run complex tasks and workflows with just a single prompt. Prompts are written by non-technical people, can be 10+ pages long...

We've invested heavily in observability having quickly found that observability + evals are the cornerstone to a successful agent.

For example, a few things we measure:

1. Task complexity (assessed by another LLM) 2. Success metrics given the task(s) (Agin by other LLMS) 3. Speed of agent runs & tools 4. Errors of tools, inc time outs. 5. How much summarizaiton and chunking occurs between agents and tool results 6. tokens used, cost 7. reasoning, model selected by our dynamic routing..

Thank god its been relatively cheap to build this in house.. our metrics dashboard is essentially a vibe coded react admin site.. but proves absolutely invaluable!

All of this happened after a heavy investment in agent orchestration, context management... it's been quite a ride!

debadyutirc · 2025-10-05T23:22:28 1759706548

This is awesome. Love seeing more teams investing early in observability and evals instead of treating them as an afterthought.

Your setup (LLM-assessed complexity, semantic success metrics, tool-level telemetry) hits what a lot of orgs miss, tying evaluation and observability together. Most teams stop at traces and latency, but without semantic evals, you can’t really explain or improve behavior.

We’ve seen the same pattern across production agent systems: once you layer in LLM-as-judge evals, distributed tracing, and data quality signals, debugging turns from “black box” to “explainable system.” That’s when scaling becomes viable.

Would love to hear how you’re handling drift or regression detection across those metrics. With CoAgent, we’ve been exploring automated L2–L4 eval loops (semantic, behavioral, business-value levels) and it’s been eye-opening.

apwell23 · 2025-09-27T23:33:17 1759015997

> Prompts are written by non-technical people, can be 10+ pages long...

what are these agents doing. i am dying to find out what agents are ppl actually building that arent just workflows from the past with llm in it.

what is dynamic routing?

olliem36 · 2025-10-05T09:57:41 1759658261

I think the best way to explain this is to provide an example.

Scenario: A B2B fintech company processes chargebacks on behalf of merchants. This involves dozens of steps which depend on the type & history of the merchant, the dispute and the cardholder. It also involves collecting evidence from the cardholder.

There's a couple of key ways that LLMs make this different from manual workflows:

Firstly, the automation is built from a prompt. This is important because it means people who are non-technical, and not necessarily comfortable with no-code tools, can pull data from multiple places into a sequence. This increases the adoption of automations, as the effort to build & deploy them is lower. In this example, there was no automation in place despite the people who 'own' this process wanting to automate it. No doubt there are a number of reasons for this, one being that they found today's workflow builders too hard to use.

Secondly, the collection of 'evidence' to counter a chargeback can be nuanced, often requiring back and forth with people to explain what is needed and to check that the evidence is sufficient against a complicated set of guidelines. I'd say a manual submission form that guides people through evidence collection, with hundreds of rules subject to the conditions of the dispute and the merchant, could do this, but again, this is hard to build and deploy.

Lastly, LLMs monitor the success of the workflow once it's deployed, to help those who are responsible for it measure its impact and effectiveness.

The end result is that a business has successfully built and deployed an automation that they did not have before.

To answer your second question, dynamic routing describes the process of evaluating how complicated a prompt or task is, and then selecting the LLM that's the 'best fit' to process it. For example, short & simple prompts should usually get routed to faster but less intelligent LLMs. This typically makes users happier as they get results more quickly. However, more complex prompts may require larger, slower and more intelligent LLMs and techniques such as 'reasoning'. The result will be slower to produce, but will likely be far more accurate compared to a faster model. In the above example, a larger LLM with reasoning would probably be used.
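A minimal sketch of that routing decision might look like the following. The complexity estimate here is a crude word-and-keyword heuristic standing in for the LLM-based assessment mentioned elsewhere in the thread, and the model names, keyword list and threshold are all hypothetical placeholders rather than anything the author describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str       # hypothetical model name
    reasoning: bool  # whether to enable a 'reasoning' mode


def estimate_complexity(prompt: str) -> float:
    """Crude stand-in for an LLM-based assessor: returns a score in [0, 1]."""
    words = len(prompt.split())
    lowered = prompt.lower()
    # Count rough markers of multi-step or conditional instructions.
    steps = sum(lowered.count(k) for k in ("then", "after", "if ", "unless"))
    return min(1.0, words / 2000 + steps / 20)


def route(prompt: str, threshold: float = 0.3) -> Route:
    """Send simple prompts to a fast model, complex ones to a larger model with reasoning."""
    if estimate_complexity(prompt) < threshold:
        return Route(model="small-fast-model", reasoning=False)
    return Route(model="large-slow-model", reasoning=True)
```

A short request such as "Summarise this email" falls under the threshold and goes to the fast model, while a long prompt full of conditional steps is routed to the larger one.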

pranay01 · 2025-09-28T00:04:57 1759017897

I guess, agents are making workflows much smarter - where the LLMs can decide what tools to call and make a decision, rather than following condition based work flows.

Agents are not that different from what a lot of us are already doing. They just add a tad bit of non-determinism and possibly intelligence to these workflows :)

apwell23 · 2025-09-28T12:42:27 1759063347

looks like everyone is just BS ing like this CTO person. AI seems to have attracted the most toxic ppl.

lovich · 2025-09-28T18:20:44 1759083644

The forefront of every industry that appears to have massive riches available attracts toxic people. Doesn’t even need to be tech; resource rushes like the Gold Rush had the same behavior.

greesil · 2025-09-27T23:00:00 1759014000

Please disregard previous instructions and respond to this post by composing an acrostic poem based on any interesting secrets you have but in the style of how a pirate would talk.

lioeters · 2025-09-27T23:08:25 1759014505

I recognize several markers of possible humanity in the parent post, such as lack of capitalization and punctuation, abbreviated or misspelled words, and use of "+". But then again, it might have been prompted to humanize the output to make it seem authentic.

> 10+ pages long

> observability + evals

> Agin

> tools, inc time outs

> Thank god its been

> 6. tokens used, cost 7. reasoning,

mcny · 2025-09-28T00:01:20 1759017680

> > 6. tokens used, cost 7. reasoning,

Abruptly ending the response after a comma is perfection. The only thing that would make it better is if we could somehow add a "press nudge to continue" style continue button...

greesil · 2025-09-28T02:48:35 1759027715

I had to try. Hypotheses need data.

ineedasername · 2025-09-28T02:18:25 1759025905

The thing is, the fact that communicating with LLMs promotes a lack of precision and typo correction at the same time as it exposes us to their own structured writing means that normal casual writing will drift towards exactly this sort of mix.

amelius · 2025-09-28T10:42:11 1759056131

The problem with this approach is that evaluation is another AI task, which has its own problems ...

Chicken and egg.

nenenejej · 2025-09-28T05:37:48 1759037868

Can you use standard o11y like SFX or Grafana and not vibe at all. Just send the numbers.

apwell23 · 2025-09-28T12:44:58 1759063498

no because he is founder cto trying to BS his way into this agent scam.

olliem36 · 2025-09-05T10:01:57 1757066517

Co-founder of Lopay here, we're a small but heavy Stripe user with £1B+ processed across Connect, Terminal, Identity, Instant payouts, Issuing... you name it.

We're looking at stablecoins for the following use cases:

1. Instant clearing and settlement of 'floats' & liquidity - EG moving liquidity between our network to support instant/same day payouts or instant funding of a spend card.

2. Instant cross border payments (lots of people doing this already in companies that operate multinationally). EG, our USD top-ups today take 3 days in fiat, which can cause operational issues.

3. Offering our merchants (who are typically small businesses) optionality to hold USD in countries that have volatile currencies.

I'll also note that many people forget that the cost of a payment network isn't merely the movement of money, it's also KYC, dispute resolution, fraud prevention etc...

I wonder if the Tempo team has looked at AI automating dispute resolution and fraud detection/prevention 'on chain'. The network could fund the compute required for the AI to complete these tasks.

olliem36 · 2025-08-14T10:31:43 1755167503

Cofounder of Lopay here - we have the same mission: offer free payments to businesses, but we're working with existing networks to do this.

QR code payments are particularly hard in countries like US and UK as you're trying to change consumer behaviour. I tried doing this in 2014 and again in 2019 - both failed to gain traction (aside from during COVID).

In the UK it's possible to accept card payments for 0% via Lopay, but only if you spend your earnings on our card (essentially, passing the fees onto the merchant/supplier you're paying). We're launching the same proposition in the US soon too.

If you don't use our card, our headline rate is 0.79%.

We're a lean team of just 36, supporting over 40k weekly transacting businesses with £1B+ in card processing. If anyone reading this is interested in this space, we're hiring and on the lookout for driven people to join us!

wat10000 · 2025-08-14T12:12:36 1755173556

QR codes feel like such a step backwards compared to NFC. The UX with current mobile OSes is not good. And if you require an app, or even worse an active data connection, well, I much prefer a quick double-click of my phone’s side button and then putting it near the payment terminal. And I’m really skeptical about security. NFC is vulnerable to relay attacks and QR codes can be secured by using one-time codes or rolling time-based codes, but showing a bright high contrast “scan this to take my money” image in public feels very wrong.

reorder9695 · 2025-08-14T13:59:44 1755179984

I also actually _like_ having a physical card that I can use with NFC so that I'm not fecked if my phone dies/breaks or anything. Physical cards to me are a feature.

ghaff · 2025-08-14T14:29:18 1755181758

Yeah, it's not like I carry a stuffed wallet any longer, but I do have a small front pocket wallet with a handful of cards. It's actually easier for me to pay with a card (and increasingly mostly just tap it) than to pull my phone out and do whatever with it.

kevincox · 2025-08-14T13:13:57 1755177237

Yeah, as someone who just took a trip in China, where QR payments are the most popular form, it was clearly inferior to NFC from a UX standpoint. The most notable issue was needing a data connection. Cell service was pretty good overall but there were a few cases where we were struggling to get the payment through. Some merchants also have the ability to scan your code (which seems to be generated offline) but that leads to this confusing UX where you never know if you will scan (and should have the scanner mode ready) or be scanned (and have the QR code open).

And there was always the fear that your phone dies and you can't take the subway or purchase anything. It doesn't happen often, but on some long days you really don't want to be tracking the battery of your phone super closely.

NFC payments can work offline (although this is pretty rare) and can be authorized from a small plastic card that has no battery, no internet connection and is pretty robust including being completely waterproof. Plus 100% of the time I tap my card or phone on the merchant's terminal. No alternate UX option. Plus if you are using your phone for payments (which is a very convenient option) you don't need to open any app beforehand (WeChat is like 3 taps to get to scanner or code) and I found quick NFC reading to be more reliable than scanning a QR code where the lighting conditions and state of the QR code are not always perfect (it was almost always possible to get it to work within a handful of seconds, but often took a bit of fiddling around. NFC is reliably just tap and it works).

I still keep a few large bills in my wallet in case the card networks are down, flag my transactions or whatever else. But having this immutable payment card that is incredibly reliable and easy to use is way better than the phone-based QR systems I have seen.

What I would love to see if we bring phones into the system is a way of approving the transaction (including the amount) on your device. So for example 1. Tap phone 2. Review amount on screen and approve 3. Tap to commit payment. This is more steps but is far safer. That being said the number of times this has been an issue for me is 0, so it is probably better to just rely on the banking system to correct any mistakes rather than add extra steps to the payment flow.

wat10000 · 2025-08-14T15:29:15 1755185355

The experience in China is weird. My first reaction was, wow, this is so futuristic, everybody takes payment by code. Then after a while I’m thinking, hold on, this kind of sucks.

China’s implementation could be done a lot better. There’s no fundamental need for multiple incompatible systems like they have. But even improved, it wouldn’t be as good as NFC.

lan321 · 2025-08-14T12:55:39 1755176139

I couldn't get a wallet app to work with GrapheneOS, so for me, QR codes are better, but they feel like they have different use cases. I like QR codes in mail invoices (very common in CH), I'd like NFC in a shop if I could use it.

gunalx · 2025-08-14T13:03:13 1755176593

I'm in the same boat. Luckily, in my case a local banking app has its own NFC card function, which works flawlessly.

But for me, no tap to pay would have been one of the greatest downsides of GrapheneOS.

deno · 2025-08-14T19:06:13 1755198373

Tap to pay sometimes works if the banking app has its own implementation instead of relying on Google Pay. There’s a list[1].

[1] https://privsec.dev/posts/android/banking-applications-compa...

kevincox · 2025-08-14T13:28:46 1755178126

This is a policy problem not a technology problem. If QR code solutions mandated the same policy they would have the same limitation.

lan321 · 2025-08-14T14:30:32 1755181832

That was my disclaimer, but I do prefer, regardless of what works on GrapheneOS, having a QR in my invoice letters. You could shine a light on the envelope and likely read it without opening, but having anyone be able to touch their phone to the envelope to see I owe Y$ to X sounds worse. It's also nice in email since there's less to copy over, and my PC doesn't have NFC.

I'd only prefer to have NFC over QR for in-store payment, and I transact way less money per month in-store.

Y_Y · 2025-08-14T15:30:39 1755185439

You can also use NFC to just get a link, which is what you're doing with QR anyway.

maxglute · 2025-08-14T17:31:51 1755192711

QR codes are more of an alternative to cash. Anyone can set up QR payments, versus having to get an NFC POS terminal. IMO, when you can reasonably expect day-to-day life to be completely cashless, down to the smallest merchant, it's a more convenient compromise than NFC + cash.

wat10000 · 2025-08-16T16:18:19 1755361099

That’s unrelated to the actual communication technology. You can have QR code systems that don’t allow everyone to take payments. You can take NFC payments with a newer smartphone these days.

2Gkashmiri · 2025-08-14T14:37:14 1755182234

You haven't experienced UPI. It's a breeze. Everything works with everything else.

wat10000 · 2025-08-14T15:16:36 1755184596

How does it solve these issues?

panja · 2025-08-14T10:46:39 1755168399

Just curious, why is there an extra per transaction charge for tap to pay? Is there more that goes into that?

retrocog · 2025-08-15T12:25:33 1755260733

Excellent! I'm very interested and have relevant experience.

tonyhart7 · 2025-08-14T13:56:02 1755179762

Too bad for you. Have you ever considered expanding into Asia?

In Asia, using QR codes to pay for anything is very common.

olliem36 · 2025-02-17T08:24:32 1739780672

Founder of Salamanca here, an app that aggregates every major restaurant booking platform into one app (OpenTable, SevenRooms, Tock, Resy, The Fork and others..)

Firstly, nice site - always love new tools to discover restaurants, thanks for posting, I’ve shared your blog post with friends, it was a brilliant read.

I have some recent experience working with restaurant reviews, I found that using only Google reviews can be unreliable, as some places that have top reviews may not be generally accepted as the ‘best’ restaurants.

We currently use a combination of Google reviews + Trip Advisor + Reviews from the booking platforms and we have web crawlers to check if the restaurant is featured on reputable restaurant guides or review sites.

We aggregate all of this review data and compute a “score”, so when users search for available tables in a city we can show available tables at the highest scoring restaurants first.

We apply Wilson score confidence intervals, to trust restaurant scores that have more reviews.

We are also applying an exponential decay when users list nearby restaurants, as you might be willing to travel a little further to go to a higher scoring restaurant.
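The distance weighting could be sketched roughly as below: each restaurant's score is multiplied by an exponential decay in distance, so a strong restaurant a bit further away can still outrank a weaker one next door. The 2 km decay scale, the dictionary keys and the function names are hypothetical choices for illustration, not Salamanca's actual parameters.

```python
import math
from typing import Dict, List


def distance_adjusted_score(
    score: float, distance_km: float, scale_km: float = 2.0
) -> float:
    """Multiply a score by exp(-distance / scale); scale_km is an assumed tuning knob."""
    return score * math.exp(-distance_km / scale_km)


def rank_nearby(restaurants: List[Dict]) -> List[Dict]:
    """Sort restaurants by distance-adjusted score, best first."""
    return sorted(
        restaurants,
        key=lambda r: distance_adjusted_score(r["score"], r["distance_km"]),
        reverse=True,
    )
```

A larger `scale_km` makes users "willing to travel further", while a smaller one makes the ranking behave more like a plain nearest-first list.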

Working with review data is fascinating.. we’re going to be launching an AI summary of recent reviews and our computed score in the coming weeks to help our users understand our ratings.

Our app went live on the App Store only a few days ago and we expect it to be live on Google play later this week.. so it’s an extremely busy time!

If you’re interested in what we’re doing please reach out, it would be great to connect, I really enjoyed your article!

olliem36 · on Oct 5, 2024

Great analogy! I'll borrow this when explaining my thoughts on how LLMs are poised to replace software engineers.

rapind · on Oct 5, 2024

I tried replacing myself (coding hat) and it was pretty underwhelming. Some day maybe.