What's interesting to me is that while it was obvious to all of us who came up thinking in the Unix Way that the CLI is a great fit for LLM tool use (its composability, usage discoverability, and the gobs of documentation in posts and man pages that are heavily represented in LLM training corpora), it seems to be only a recent trend to acknowledge this (and also the next hype wave, perhaps).
Also interesting that while the big vendors are following this trend and are now trying to take the lead in it, they still suggest things like "but use a JSON schema." The linked article does a bit of the same: it acknowledges that incremental learning via `--help` is useful AND can be token-conserving (the exception being that if the model already "knows" the correct pattern, it wouldn't need to spend tokens learning it, so there is a potential trade-off), yet it also suggests that LLMs would prefer to receive argument knowledge in JSON rather than in plain language, even though the entire point of an LLM is to understand and create plain language. That seemed dubious to me, and a part of me wondered if that advice may be nonsense motivated by a desire to sell more token use. I'm only partially kidding, and I'm still dubious of the efficacy.
* Here's a TL;DR for anyone who wants to skip the rest of this long message: I ran an LLM CLI eval in the form of a constructed CTF. Results and methodology are in the two links in the linked section:
https://github.com/scottvr/jelp?tab=readme-ov-file#what-else
Anyhow... I had been experimenting with the idea of having `--help` output JSON when invoked by a machine, and came up with a simple module that exposes `--help` content as JSON, simply by adding a `--jelp` argument to any tool that already uses argparse.
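For anyone curious what that looks like, here's a minimal sketch of the idea, not the actual jelp implementation; the `JelpAction` name and the field layout are my own invention, just enough to show how an argparse custom action can dump a parser's arguments as JSON:

```python
import argparse
import json

class JelpAction(argparse.Action):
    """Hypothetical sketch (not the real jelp module): print the parser's
    argument definitions as JSON instead of the usual --help text."""

    def __init__(self, option_strings, dest, **kwargs):
        # nargs=0: the flag consumes no value, like --help
        super().__init__(option_strings, dest, nargs=0, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        spec = {
            "prog": parser.prog,
            "description": parser.description,
            "arguments": [
                {
                    "flags": a.option_strings,
                    "dest": a.dest,
                    "required": a.required,
                    "default": a.default,
                    "help": a.help,
                }
                # skip the built-in -h action and --jelp itself
                for a in parser._actions
                if not isinstance(a, (argparse._HelpAction, JelpAction))
            ],
        }
        print(json.dumps(spec, indent=2, default=str))
        parser.exit()  # exit after printing, mirroring --help behavior

parser = argparse.ArgumentParser(prog="mytool", description="demo tool")
parser.add_argument("--count", type=int, default=1, help="how many times")
parser.add_argument("--jelp", action=JelpAction, help="machine-readable help")
```

Running `mytool --jelp` then emits the argument spec as JSON and exits, leaving plain `--help` untouched for humans.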
In the process, I started testing to see if all this extra machine-readable content actually improved performance, what it did to token use, etc. While I was building out tests, trying to settle on legitimate and fair ways to reach valid conclusions, I learned of the OpenCLI schema draft, so I altered my `jelp` output to fit that schema and set about documenting the things I found lacking in the schema draft, meanwhile settling on including these arg-related items as metadata in the output.
I'll get to the point. I just finished cleaning the output up enough to put it in a public repo, because my intent is to share my findings with the OpenCLI folks in hopes that they'll consider the gaps in their schema compared to what's commonly in use. At the same time, what came as a secondary thought in service of this little tool I called "jelp" is a benchmarking harness (and the first publishable results from it) that, to me, are quite interesting. I would be happy if others found them interesting too and added to the existing test results with additional runs, models, ideas for the harness, criticism of the validity of the method, etc.
The evaluation harness uses constructed CLI fixtures arranged as little CLI CTFs, where the LLMs demonstrate their ability to use an unknown CLI by capturing a "flag" that they'll need to discover using the usage help and a trail of learned arguments.
My findings at first confirmed my intuitions, which was disappointing but unsurprising. When testing with GPT-4.1-mini, no manner of forcing the model to receive info about the CLI via JSON was more effective than just letting it use the human-friendly plain-English output of `--help`, and in all cases the JSON versions burned more tokens. I was able to elicit better performance by some measurements from 5.1-mini, but again the trade-off was higher token burn.
I'll link straight to the part of the README that shows one table of results and contains links to the LLM CLI CTF part of the repo, as well as the generated report after the phase-1 runs. All the code to reproduce or run your own variation is there (as is the code for the jelp module, if there is any interest, but it's the CLI CTF eval that I expect is more interesting to most).
This just gave me a flashback to something I made a long time ago: a tool to create a file that was a named pipe, the contents of which were determined by the command in its filename. If I remember correctly (and its embedded man page would seem to validate this memory), the primary impetus for making this tool was to have dynamically generated file content, for the purpose of enabling remote process execution via server daemons that did not explicitly allow for it (such as fingerd) but were intended only to read a specific static file.
Using named pipes in this manner also enabled a hackish method of creating server-side dynamic web content by symlinking index.html to a file created this way, which was a secondary motivator. It seems kinda quaint and funny now, but at the time it wasn't very long after we had finally decommed our gopher server, so fingerd was still a thing, Apache was fairly new, and I may still have been trying to convince management that the right move was not from NCSA httpd to Netscape Enterprise Server, but to Apache+mod_ssl. RSA patent licensing may still have been a thing too. Stronghold vaguely comes to mind, but I digress.
Yeah, programs that do stuff based on filename, like busybox. Oh, and this long-forgotten artifact this article just reminded me of, which I managed to find in the Wayback Machine: a tool to mknod a named pipe on a SunOS 4.1.4 machine to get server-side dynamic content when remotely accessing a daemon that was supposed to return content from a single static file. Ah, memories.
As a former Sun sysadmin/netadmin (from SunOS 4.1.4 days), I vaguely remember the Solaris releases after 2.5.1, up to another re-versioning/rebranding called Solaris 7, maybe? And then I stopped paying any attention after Oracle absorbed it. I was honestly surprised enough by this headline to click TFA, simply because I did not think Solaris even existed anymore.
>I was honestly surprised enough by this headline to click TFA, simply because I did not think Solaris even existed anymore.
Oracle would never give up the opportunity to continue milking customers until they're completely dry.
They did kill all future releases and blew up the SPARC roadmap. They also fired everyone working on new features and releases but kept enough of a skeleton crew to charge legacy customers outrageous support fees.
But for all practical intents and purposes, it's dead. One guy releasing things like "ls -sh now actually shows human-readable output" being highlighted as a new feature kind of tells you everything you need to know.
I apologize in advance, but this is my one pre-existing contribution to the world that mentions mapping of Drosophila brains, and having just used it in its intended "copypasta" role earlier today, I was excited to coincidentally see this Emulation link on HN this morning.
If this sounds interesting, and you have a moment and a favorite API, I'd appreciate your experience testing it out, or if the README needs more detail, etc.
I have been solely using an old x86 Darwin MacBook Air for this, so that's the extent of the platform(s) tested. (Writing this has made me realize I might want to document the process of installing the FUSE driver on a Mac, but I do link to the macFUSE webpage, which is probably more broadly useful than my experience on this dated laptop.)
Anyway, I actually went searching before posting, because it hadn't occurred to me that this might have already been done somewhere else (yeah, one might think searching for an existing solution would be a first step...), and I was happy to see that it doesn't seem to have been done in exactly this form before, while also being a bit surprised at how many spiritually-related FUSE implementations have been created since I last had occasion to do anything with FUSE. If I have done this correctly, many very-specific niche FUSE implementations may become unneeded, as my hope is that apifusefs is capable of handling any Swagger-type, OpenAPI-spec API.
(Eventually, anyway. For example, there exists a massive openapi.json for the GitHub API, but due to the way the root endpoint refers to other endpoints, each with its own context and auth requirements, this initial release of apifusefs isn't magic for api.github.com/ even with the spec file. I had to mount specific endpoints by the URL given in the response to "GET /", with varying degrees of success or failure, which is what caused me to add the --json-file mode, so I could just redirect the output of a curl request to a file and test with that.)
That said, it does now support a variety of ways to pass authentication tokens, so my hope is that if anyone here has an API to try it out against, that it will work for you without hassle.
Hmm... My first thought is: great, now not only will e.g. HR/screening/hiring hand off the reading/discerning tasks to an ML model, they'll also outsource the things that require any sort of emotional understanding (compassion, stress, anxiety, social awkwardness, etc.) to a model too.
One part of me has a tendency to think "good, take some subjectivity away from a human with poor social skills", but another part of me is repulsed by the concept, because we see how otherwise capable humans will defer to an LLM due to a notion of perceived "expertise" in the machine, or out of laziness (see the recent kerfuffles in the legal field over hallucinated citations, etc.).
Objective classification in CV is one thing, but subjective identification (psychology, pseudoscientific forensic sociology, etc) via a multi-modal model triggers a sort of danger warning in me as initial reaction.
Appreciate the feedback, truly. It's an interesting concept to explore. Deferring human "expertise" to technology has been happening throughout the years (most definitely accelerated in recent times), and we have found ways to adapt to and abstract over the work being deferred, but the growing pains are probably most acute when such deferment happens rapidly, as in the case of AI.
Don't want this to turn into a Matt Damon in Elysium type of situation, for sure, with that scene with the parole officer, hahah (which would stem from a poor integration of such subjective signals into existing workflows more so than from the availability of those signals).
For emotional intelligence, I personally see this as a prerequisite for any voice/language model that's interacting with humans: just as an autonomous car has to be able to identify a pothole, a voice/video agent has to navigate a pothole in a conversation.
You've caused me to have an additional thought on the topic. As much as I expressed a sense of dread at the inevitable use of this sort of tech in hiring pipelines (not by agents, necessarily; a sort of HUD overlay on a video call between humans was my initially envisioned use case), I suppose that just as the AI interviewer bots I have thus far refused to engage with will inevitably be unavoidable for anyone on the job hunt, so will the use of this sort of multi-modal sentiment analysis be inevitable. (Same with the justice system use case you referenced in your metaphor, and probably therapists and such will follow as well.)
As such, I wish you the best of luck with this project - earnestly so - because if, as I suggest, it is inevitable... we want such a system to be as good as possible.
An aside: another inevitable use case just came to mind: the cheap, shoddily implemented, and poorly tested kids' toys with embedded AI (along with the insecure, surveillance-adjacent products that will proliferate), and the sardonically humorous privacy mishaps and unintended actions from such low-quality toys being sold (see the LLM-enabled kids' toys currently popping up routinely at retailers). Ha! Sorry I keep taking your cool demo to dystopian extremes. :)
Oh, one more thing... Upon re-reading my previous comment, I recognize that describing my visceral reaction as one of being "repulsed by the thought" could literally be read as me calling your system "repulsive", which was not my intent. I think your tech is cool, and I was just trying to convey two conflicting feelings that occurred within me when thinking about the future commercial use cases. As the last few years of "LLMs everywhere!" have forced us all to adapt (accept it or reject it, it still requires new effort), if this is inevitable, we should hope for a good and working system, so I hope you succeed in making one.
Lastly, to your self-driving/potholes analogy... I do think that fits more in line with my "objective CV classification" category; a closer fit to what you're building would be "a self-driving car having to handle the trolley problem", with all the nuances of human value judgements. Does the car swerve into two adults vs. one child? And so on. Pothole classification is objective, while driving into it, swerving to avoid it, classifying pedestrians and choosing one to possibly collide with, etc., are subjective and more complicated (as is your system and the functions it can perform).
"It's a test - designed to provoke an emotional response. "
I was going to follow this with something like "except the role of analyzing the emotional response is reversed", and then I wanted to expound with an "ooh, but... wait, there's another metaphor here, since...", but I thought I'd already potentially approached spoiler-alert territory, so I'll just stop there. Those who know the reference I'm replying to will know; those who don't, well, don't google any of this or its parent, cuz spoiler alert.
It might be worth mentioning the concept of a "stub resolver" and clarifying a bit that a nameserver is a resolver. That might be pedantic, but I thought it worth noting that the difference, conceptually, may just be what the particular DNS server answering the query is authoritative for, if anything.
One other thing that might be worth a mention is the concept of the OS's resolver and "suffix search order", with an example of connecting (https, ping, ssh, whatever protocol) to a host using just the hostname, and the aforementioned mechanism that (probably) allows this to connect to the FQDN you want. (Also, now that I type that, do you mention "FQDN" at all? If not, maybe you should.)
On that note, one final thought occurs to me: the error/confound that may arise when a hostname is entered and does not resolve, but does resolve with one of the domain suffixes attached on a retry (this can be particularly confusing with a typo coupled with a wildcard A record in a domain, for example). I recognize that the lines that look like DNS records are not explicitly stated to be in a format for any particular DNS server software, and even if they were, they're snippets without larger context, so we don't know what the $ORIGIN for the zone might be. An adjacent concept you might want to explore, even if just for your own edification, is the effect of a terminating "." at the end of a hostname, either at resolution or configuration time.
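To make the suffix-search and trailing-dot points concrete, here's a hypothetical resolver config (the domain names and address are invented for illustration):

```
# /etc/resolv.conf (hypothetical)
search corp.example.com example.com
nameserver 192.0.2.53
```

With that in place, `ssh db01` would typically be retried as `db01.corp.example.com.` and then `db01.example.com.`, while `ssh db01.` (note the trailing dot) is treated as already fully qualified and skips the search list entirely. Combine that retry behavior with a typo and a wildcard A record in one of the search domains, and you can silently land somewhere you didn't intend.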
Just offering feedback that might help you add to the article.
I don't care if this is an advertisement for Buildkite masquerading as a blog post or just an honest rant. Either way, I gotta say it speaks a lot of truth.