
I worked for an AI startup that got bought by a big tech company and I've seen the hype up close. In the inner tech circles it's not exactly a big lie. The tech is good enough to make incredible demos but not good enough to generalize into reliable tools. The gulf between demo and useful tool is much wider than we thought.


I work at Microsoft, though not in AI. This describes Copilot to a T. The demos are spectacular and get you so excited to go use it, but the reality is so underwhelming.


Copilot isn't underwhelming, it's shit. What's impressive is how Microsoft managed to gut GPT-4 to the point of near-uselessness. It refuses to do work even more than OpenAI models refuse to advise on criminal behavior. In my experience, the only thing it does well is scan documents on corporate SharePoint. For anything else, it's better to copy-paste to a proper GPT-4 yourself.

(Ask Office Copilot in PowerPoint to create you a slide. I dare you! I double dare you!!)

The problem with demos is that they're staged, they showcase integrations that are never delivered, and probably never existed. But you know what's not hype and fluff? The models themselves. You could hack a more useful Copilot with AutoHotkey, today.

I have GPT-4o hooked up as a voice assistant via Home Assistant, and what a breeze that is. Sure, every interaction costs me some $0.03 due to inefficient use of context (HA generates too much noise by default in its map of available devices and their state), but I can walk around the house and turn devices on and off by casually chatting with my watch, and it works, works well, and works faster than it takes to turn on Google Assistant.

So no, I honestly don't think AI advances are oversold. It's just that companies large and small race to deploy "AI-enabled" features, no matter how badly made they are.


Basically, functional AI interactions are prohibitively resource intensive and expensive. Microsoft's non-coding Copilots are shit due to resource constraints.


Basically, yes. My last 4 days of playing with this voice assistant cost me some $3.60 for 215 requests to GPT-4o, amounting to a little under 700 000 tokens. It's something I can afford[0], but with costs like this, you can't exactly hand out GPT-4 access to people for free. This cost structure doesn't work. It doesn't work with GPT-4o, and it worked even less with earlier model iterations that cost more than twice as much. And yet, that is what you need if you want a general-purpose Copilot or Assistant-like system. GPT-3.5-Turbo ain't gonna cut it. Llamas ain't gonna cut it either[1].

In a large sense, Microsoft lied. But they didn't lie about the capability of the technology itself - they just lied about being able to afford to deliver it for free.

--

[0] - Extrapolated to a hypothetical subscription, this would be ~$27 per month. I've seen more expensive and worse subscriptions. Still, it's a big motivator to go dig into the code of that integration and make it use ~2-4x fewer tokens by encoding "exposed entities" differently, and much more concisely.
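For the curious, the arithmetic behind that extrapolation is trivial - no assumptions about OpenAI's pricing needed, just the observed spend:

  spend_usd, days, requests = 3.60, 4, 215
  print(f"${spend_usd / requests:.3f} per request")                        # ~$0.017
  print(f"${spend_usd / days * 30:.0f} per month at the same usage rate")  # ~$27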

[1] - Maybe Llama 3 could, but IIRC license prevents it, plus it's how many days old now?


> they just lied about being able to afford to deliver it for free.

But they never said it'd be free - I'm pretty sure it was always advertised as a paid add-on subscription. That being the case, why not just offer multiple tiers of Copilot, using different models or credit limits?


Contrary to what the corporations want you to believe -- no, you can't buy your way out of every problem. Most modern AI tools are oversold and underwhelming, sadly.


Whoa, that's very cool. Can you share some info about how you set up the integration in HA? Would love to explore doing something like this for myself.


With the most recent update, it's actually very simple. You need three things:

1) Add the OpenAI Conversation integration - https://www.home-assistant.io/integrations/openai_conversati... - and configure it with your OpenAI API key. In there, you can control part of the system prompt (HA will add some stuff around it) and configure the model to use. With the newest HA, there's now an option to enable "Assist" mode (under the "Control Home Assistant" header). Enable this.

2) Go to "Settings/Voice assistants". Under "Assist", you can add a new assistant. You'll be asked to pick a name, a language to use, then choose a conversation model - here you pick the one you configured in step 1) - and Speech-to-Text and Text-to-Speech models. I have a subscription to Home Assistant Cloud, so I can choose "Home Assistant Cloud" models for STT and TTS; it would be great to integrate third-party ones here, but I'm not sure if and how that's possible.

3) Still in "Settings/Voice assistants", look for a line saying "${some number} entities exposed", under the "Add assistant" button. Click that, and curate the list of devices and sensors you want "exposed" to the assistant - "exposed" here means that HA will make a large YAML dump out of the selected entities and paste it into the conversation for you[0]. There's also other stuff (I heard the docs mention "intents") that you can expose, but I haven't looked into it yet[1].

That's it. You can press the Assist button and start typing. Or, for a much better experience, install HA's mobile app (and if you have a smartwatch, the watch companion app), and configure Home Assistant as your voice assistant on the device(s). That's how you get the full experience of randomly talking to your watch, "oh hey, make the home feel more like a Borg cube", and witnessing lights turning green and climate control pumping heat.

I really recommend trying this if you can. It's a night-and-day difference compared to Siri, Alexa or Google Now. It finally fulfills those promises of voice-activated interfaces.

(I'm seriously considering making a Home Assistant to Tasker bridge via HA app notifications, just to let the assistant do things on my phone - the experience is just that good; I bet it'll work better than the Google stuff out of the box.)

--

[0] - That's the inefficient token waster I mentioned in the previous comment. I have some 60 entities exposed, and best I can tell, it generates a couple thousand tokens' worth of YAML, most of which is noise like entity IDs and YAML structure. This could be cut down significantly if you named your devices and entities cleverly (and concisely), but I think my best bet is to dig into the code and trim it down. And/or create synthetic entities that stand for multiple entities representing a single device or device group, e.g. one "A/C" entity that combines multiple sensor entities from all A/C units.
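To illustrate what "encoding exposed entities more concisely" could mean, here's a hypothetical sketch: one terse line per entity instead of a YAML block. The field names are made up for illustration, not HA's actual dump format.

  # Hypothetical compaction: one terse line per entity instead of a YAML block.
  # Field names are illustrative, not Home Assistant's real dump format.
  entities = [
      {"entity_id": "light.living_room_ceiling", "state": "on", "brightness": 180},
      {"entity_id": "climate.bedroom_ac", "state": "cool", "temperature": 21},
  ]
  def compact(e):
      extras = ",".join(f"{k}={v}" for k, v in e.items() if k not in ("entity_id", "state"))
      return f"{e['entity_id']}={e['state']}" + (f" ({extras})" if extras else "")
  print("\n".join(compact(e) for e in entities))
  # -> light.living_room_ceiling=on (brightness=180)
  # -> climate.bedroom_ac=cool (temperature=21)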

[1] - Outside the YAML dump that goes with each message (and a preamble with the current date/time), which is how the Assistant knows the current state of every exposed entity, there's also an extra schema exposing controls via the "function calling" mechanism of the OpenAI API, which is how the assistant is able to control devices at home. I assume those "intents" go there. I'll be looking into it today, because there's a bunch of interactions I could simplify if I could expose automation scripts to the assistant.
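In case anyone wants to see the shape of that mechanism outside of HA, here's a stripped-down sketch of OpenAI-style tool calling driving a single Home Assistant service over its REST API. The tool schema, entity names and model choice are mine for illustration - it's not what the integration actually sends.

  import json
  import requests
  from openai import OpenAI  # openai>=1.x client assumed

  HA_URL = "http://homeassistant.local:8123"
  HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"
  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  # One illustrative tool; the real integration generates its own schema.
  tools = [{
      "type": "function",
      "function": {
          "name": "set_light",
          "description": "Turn a light entity on or off",
          "parameters": {
              "type": "object",
              "properties": {
                  "entity_id": {"type": "string", "description": "e.g. light.living_room"},
                  "state": {"type": "string", "enum": ["on", "off"]},
              },
              "required": ["entity_id", "state"],
          },
      },
  }]

  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user", "content": "Turn off the living room light"}],
      tools=tools,
  )

  # Execute whatever the model decided to call via HA's REST API.
  for call in resp.choices[0].message.tool_calls or []:
      args = json.loads(call.function.arguments)
      requests.post(
          f"{HA_URL}/api/services/light/turn_{args['state']}",
          headers={"Authorization": f"Bearer {HA_TOKEN}"},
          json={"entity_id": args["entity_id"]},
          timeout=10,
      )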


lol I can’t help but assume that people who think copilot is shit have no idea what they are doing.


I have it enabled company-wide at enterprise level, so I know what it can and can't do in day-to-day practice.

Here's an example: I mentioned PowerPoint in my earlier comment. You know what the correct way to use AI to make PowerPoint slides is? A way that works? It's not to use the O365 Copilot inside PowerPoint, but rather to ask GPT-4o in the ChatGPT app to use Python and pandoc to make you a PowerPoint.
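Concretely, the kind of script it writes for you is nothing fancy - roughly the sketch below (pandoc must be installed; the titles and bullets are obviously placeholders):

  import subprocess
  from pathlib import Path

  # Markdown headings become individual PowerPoint slides in pandoc's pptx writer.
  slides_md = "\n".join([
      "% Quarterly Review",
      "",
      "## Where we are",
      "",
      "- Shipped the thing",
      "- People are actually using it",
      "",
      "## Where we're going",
      "",
      "- Ship the next thing",
  ])
  Path("slides.md").write_text(slides_md)
  subprocess.run(["pandoc", "slides.md", "-o", "slides.pptx"], check=True)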

I literally demoed that to a colleague the other day. The difference is like night and day.


I've gone back to using GitHub Copilot with reveal.js [0]. It's much nicer to work with, and I'd recommend it unless you specifically need something from PowerPoint's advanced features.

[0] https://revealjs.com/


GitHub (which is owned by Microsoft) Copilot or Microsoft Copilot?


It's a lot like AR before the Vision Pro. The demos and reality just didn't meet. I'm not trying to claim the Vision Pro is perfect, but it seems to do AR in the real world without the circumstances needing to be absolutely ideal.


The Vision Pro is not doing well. Apple has cancelled the next version.[1] As Carmack says, AR/VR will be a small niche until the headgear gets down to swim goggle size, and will not go mainstream until it gets down to eyeglass size.

[1] https://www.msn.com/en-us/lifestyle/shopping/apple-shelves-n...


It was always the plan for Apple to release a cheaper version of the Vision Pro next. That the next version of the Pro has been postponed isn't a huge sign. It just seems that the technology isn't evolving quickly enough to warrant a new version any time soon.


> swim goggle size

The "Bigscreen Beyond" [0] is quite close, but doesn't have cameras - so at this stage it's only really good for watching movies and the like.

[0] https://store.bigscreenvr.com/products/bigscreen-beyond


That one does have 6DoF tracking, it's just based on the Valve Lighthouse system. The upside of that system is that it's more privacy-respecting.


Which it probably won't, because real-life physics isn't aware of roadmaps and corporate ads.


What physics are you talking about? Limits on power? Display? Sensor size? I ask because I’ve had similar feelings about things like high speed mobile Internet or mobile device screen size (over a couple of decades) and lived to see all my intuition blown away, so I really don’t believe in limits that don’t have explicit physical constraints behind them.


Lens diffraction limits. VR needs lenses that are small and thin enough while still being powerful enough to bend the needed light towards the eyes. Modern lenses need more distance between the screen and the eyes and they’re quite thick.

Theoretically, future lenses may make it possible, but the visible-light metamaterials needed are still at a very early research stage.


Apple approved ALVR a few days ago too; clearly they're having issues, at least with getting developer attention.

1: https://apps.apple.com/us/app/alvr/id6479728026


Your article states this differently. The development has not been cancelled fully, but refocused.

“and now hopes to release a more standard headset with fewer abilities by the end of next year.”


That's marketing-speak for "cancelled".


I think both hardware and software in AR have to become unobtrusive for people to adopt it. And then it will be a specialized tool for stuff like maintenance. Keeping large amounts of information in context without requiring frequent changes in context. But I also think that the information overload will put a premium on non-AR time. Once it becomes a common work tool, people using it will be very keen to touch grass and watch clouds afterwards.

I don't think it will ever become the mainstream everyday carry proponents want it to be. But only time will tell...


Until there is an interface for it that allows you to effectively touch type (or the equivalent), 99% of jobs won't be able to use it away from a desk anyway. Speech-to-text would be good enough for writing (non-technical) documentation, but probably not for things like filling spreadsheets or programming.


But does what Apple has shown in its demos of the Vision Pro actually meet reality? Does it provide any value at all?

In my eyes, it's exactly the same as AI. The demos work. You can play around with it, and it's impressive for an hour. But there's just very little value.


The value would come if it were something you'd feel comfortable wearing all day. So it would need perfect pass-through and to be much, much lighter and more comfortable. If they achieved that and it can do multiple high resolution virtual displays, then people would use it.

The R&D required to get to that point is vast though.


> can do multiple high resolution virtual displays

In most applications, it would then need to compete on price with multiple high resolution displays, and undercut them quite significantly to break the inertia of the old tech (and offset its various other advantages - like not having to wear something all day, and being able to let other people look at what you have on your screen).


I take your point, but living in a London flat I don't have the room for multiple high resolution displays. Nor are they very portable; I have an MBP rather than an iMac because mobility is important.

I do think we're 4+ years away from it getting to the 'iPhone 1' level of utility though, so we'll see how committed Apple are to it.


That's what all these companies are peddling though. The question is - do humans actually NEED a display before their eyes for all awake time? Or even most of it? Maybe, but today I have some doubts.


Given how we as a society are now having significant second thoughts as to the net utility for everybody having a display in their pocket for all awake time, I also have some doubts.


It's very sad because it's a sort of so-near-but-so-far kind of situation.

It would be valuable if it could do multimonitor, but it can't. It would be valuable if it could run real apps but it only runs iPad apps. It would be valuable if Apple opened up the ecosystem, and let it easily and openly run existing VR apps, including controllers - but they won't.

In fact the hardware itself crosses the threshold to where the value could be had, which is something that couldn't be said before. But Apple deliberately crimped it based on their ideology, so we are still waiting. There is light at the end of the tunnel though.


> But Apple deliberately crimped it based on their ideology

It's in a strange place, because Apple definitely also crimped it by not even writing enough software for it in-house.

Why can't it run Mac apps? Why can't you share your "screen configuration" and its contents with other people wearing a Vision Pro in the same room as you?


It is not really AR. Reality is not just augmented but captured first with a camera. It can make someone dizzy.


It's the opposite of AR, it's VR augmented with real imagery.


I never considered this angle. (Yeah, I am a sucker -- I know.) Are you saying that they cherry pick the best samples for the demo? Damn. I _still_ have high hopes for something like Copilot. I work on CRUD apps. There are so many cases where I want Copilot to provide some sample code to do X.


Sorry, I didn’t mean GitHub Copilot. Code generation is definitely one of the better use cases for AI. I meant the “Copilot” brand that Microsoft has trotted out into pretty much every one of its products and rolled together in this generic “Copilot” app on Windows.


They absolutely do. Check out this video https://youtu.be/tNmgmwEtoWE


I just used Groq / llama-7b to classify 20,000 rows of Google Sheets data (Sidebar archive links), a task that would have taken me way longer by hand... Every one I've spot-checked so far has been correct, and I might write another checker to scan the results just in case.
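If anyone wants to replicate it: the loop is just one chat completion per row against Groq's OpenAI-compatible endpoint, roughly like the sketch below (the model name and categories here are placeholders - check Groq's current model list):

  from openai import OpenAI  # Groq exposes an OpenAI-compatible API

  client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_KEY")
  CATEGORIES = ["programming", "science", "business", "other"]  # made-up taxonomy

  def classify(row_text):
      resp = client.chat.completions.create(
          model="llama3-8b-8192",  # placeholder; use whatever Groq currently serves
          messages=[{
              "role": "user",
              "content": f"Classify this link into exactly one of {CATEGORIES}. "
                         f"Answer with the category name only.\n\n{row_text}",
          }],
          temperature=0,
      )
      return resp.choices[0].message.content.strip()

  print(classify("Show HN: I built a tiny HTTP server in Rust"))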

Even with a 20% failure rate it's better than not having the classifications.


The problem isn't that it's not useful for self-driven tasks like that; it's that you can't really integrate it into a product that does task X, because when someone buys a system to do task X, they want it to be more reliable than 80%.


Stick a slick UI on it that lets the end user quickly fix up just the bits it got wrong and flip through documents quickly, and 80% correct can still be a massive time saver.


I think that can kind of work for B2C things, but is much less likely to do so for B2B. Just as an example, I work on industrial maintenance software, and customers expect us to catch problems with their machinery 100% of the time, and in time to prevent it. Sometimes faults start and progress to failure faster than they're able to send data to us, but they still are upset that we didn't catch it.

It doesn't matter whether that's reasonable or not, there are a lot of people who expect software systems to be totally reliable at what they do, and don't want to accept less.


We're thinking about adding AI to the product, and that's the path I'd like to take: view the AI as an intern who can make mistakes, and provide a UI where the user can review what the AI is planning to do.


Except that people hate monitoring an intern all day, regardless of whether it is a human or a machine.


I think this is going to be a heavy lift, and it's one of the reasons I think a chat bot is not the right UX. Every time someone says “all you need to do to get ChatGPT working is provide it explicit requirements and iterate”, I think: for a lot of coding tasks it's much easier to just hack on code for a while than to be a manager to an 80%-right intern.


I classified ~1000 GBA game ROM files by using their file names to put each in a category folder. It worked about 90% correctly. I used GPT-3.5, so it didn't adhere to my provided list of categories, but the results were mostly not wrong otherwise.

https://gist.github.com/SMUsamaShah/20f24e80cfe962d26af5315e...


Sorry, this actually sounds like a real use case. What was the classification? (I tried googling “sidebar archive”.) I assume you somehow visited 20,000 web pages and it classified the text on each page? How was that achieved? Did you run a local Llama?


We had ChatGPT look at 200,000 products and make a navigation structure in 3 tiers based on the attributes of each product. The validation took 2% of the time it would have taken to manually create the hierarchy ourselves.

I think that even the simpler LLMs are very well suited for classification tasks, where very little prompting is needed.


Sorry to harp on.

So you had a list of products (what sort? I am thinking widgets from a wholesaler, and you want a three-tier menu for an e-commerce site?)

I am guessing each product has a description - like from Amazon - and ChatGPT read the description and said “aha, this is a Television/LCD/50inch or Underwear/flimsy/bra”.

I assume you sent in 200,000 different queries - but how did you get it to return three tiers? (Maybe I need to read one of those “become a ChatGPT expert” blogs.)


I'm not this person, but I've been working on LLMs pretty aggressively for the last 6ish months and I have some ideas of how this __could__ be done.

You could plainly ask the LLM something like this as part of the query:

"Please provide 3 categories that this product could exist under, with increasing specificity in the following format:

  {
     "broad category": "a broad category that would encompass this product, as well as others, for example 'televisions' for a 50" OLED LG with Roku integration",
     "category": "a narrower category that describes this product more aggressively, for example 'Smart Televisions'",
     "narrow category": "an even narrower category that describes this product and its direct competitors, for example OLED Smart televisions"
  }
A next question you'll have pretty quickly is, "Well, what if sometimes it returns 'Smart televisions' and other times it returns 'Smart TVs', won't that result in multiple of the same category?" And that's a good and valid question, so you then have another query that takes the categories that have been provided to you and asks for synonyms, alternative spellings, etc., such as:

"Given a product categorization of a specific level of specificity, please provide a list of words and phrases that mean the same thing".

In OpenAI's backend - and many of the others, I think - you can have the API run the query multiple times and get back multiple answers. Enumerate over those answers, build the graph, and you have all that data in an easy-to-read-and-follow format!

It might not be perfect, but it should be pretty good!
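To make the "run the query multiple times" part concrete, here's a minimal sketch using the chat completions API's n parameter (the model is a placeholder, and the JSON keys assume the format above, so treat it as an illustration rather than a recipe):

  import json
  from openai import OpenAI  # openai>=1.x client assumed

  client = OpenAI()
  prompt = (
      "Provide 3 categories for this product, broad to narrow, as JSON with keys "
      '"broad category", "category" and "narrow category".\n\n'
      "Product: 50-inch OLED LG TV with Roku built in"
  )

  resp = client.chat.completions.create(
      model="gpt-4o",                           # placeholder model choice
      messages=[{"role": "user", "content": prompt}],
      response_format={"type": "json_object"},  # ask for parseable JSON back
      n=3,                                      # three independent answers in one call
  )

  for choice in resp.choices:
      tiers = json.loads(choice.message.content)
      keys = ("broad category", "category", "narrow category")
      print(" > ".join(str(tiers.get(k, "?")) for k in keys))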


> Well, what if sometimes it returns 'Smart televisions' and other times it returns 'Smart TVs', won't that result in multiple of the same category

Text similarity works well in this case. You can just use cosine similarity and merge ones that are very close, or ask GPT to compare those that are on the edge.
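A sketch of that, using OpenAI embeddings and plain numpy (the 0.85 threshold is a guess you'd tune by eye, not a magic number):

  import numpy as np
  from openai import OpenAI  # openai>=1.x client assumed

  client = OpenAI()
  labels = ["Smart televisions", "Smart TVs", "OLED Smart televisions"]

  resp = client.embeddings.create(model="text-embedding-3-small", input=labels)
  vecs = np.array([d.embedding for d in resp.data])
  vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

  sim = vecs @ vecs.T  # cosine similarity, since rows are unit-normalized
  for i in range(len(labels)):
      for j in range(i + 1, len(labels)):
          if sim[i, j] > 0.85:
              print(f"merge candidates: {labels[i]!r} ~ {labels[j]!r} ({sim[i, j]:.2f})")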


It sounds like a real use case, but possibly quite overkill to use an LLM.

Unless you need to have some "reasoning" to classify the documents correctly, a much more lightweight BERT-like model (RoBERTa or DistilBERT) will perform on par in accuracy while being a lot faster.
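If you want to try that route without labelling any training data first, a zero-shot pipeline over an MNLI-tuned model is the quickest stand-in - not the fine-tuned RoBERTa/DistilBERT setup described above, but the same family of idea (the labels below are placeholders):

  from transformers import pipeline  # Hugging Face transformers

  # Zero-shot classification via an NLI model; no fine-tuning needed to try it.
  clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
  result = clf(
      "50-inch OLED LG TV with Roku built in",
      candidate_labels=["televisions", "laptops", "underwear"],  # placeholder taxonomy
  )
  print(result["labels"][0], round(result["scores"][0], 3))  # top label and its score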


"while being a lot faster", yes; but something that LLMs do that those other tools don't is being hilariously approachable.

LLMs can operate as a very, very *very* approachable natural language processing model without needing to know all the gritty details of NLP.


> Every one I've spot checked right now has been correct, and I might write another checker to scan the results just in case.

If you already have the answers to verify the LLM output against, why not just use those to begin with?


Not GP, but I would imagine "another checker to scan the results" would be another NN classifier.

The thinking being that you'd compare the outputs of the two and, assuming the results are statistically independent from each other and of similar quality, a 1% difference between them in that comparison would suggest a ~0.5% error rate from "ground truth" for each.
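The arithmetic, for anyone checking: two independent classifiers that each err at rate e disagree at roughly 2e(1 - e), so you halve the observed disagreement to estimate per-classifier error.

  disagreement = 0.01                # 1% of rows where the two classifiers differ
  error_estimate = disagreement / 2  # 2e(1 - e) ~= 2e for small e, so e ~= disagreement / 2
  print(f"~{error_estimate:.2%} estimated error per classifier vs ground truth")  # ~0.50%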


Maybe their problem is using LLM to solve f:X→Y, where the validation, V:{X,Y}→{0,1}, is trivial to compute?


Looks like this is another application of the ninety-ninety rule: getting to the stage where you can make incredible demos has required 90% of the effort, and the last 10% to make it actually reliable will require the other 90% (https://en.wikipedia.org/wiki/Ninety%E2%80%93ninety_rule).


80/20, 90/10 etc. are all just https://en.wikipedia.org/wiki/Zipf%27s_law


Excellent find, I’d never heard of Zipf’s law.

GP was talking about something else though: the 90:90 rule relates to an extremely common planning-optimism fallacy about the work required to demo versus the work required to productise.


It's not just demos though. It's that the final 10% of any project - which largely consists of polishing, implementing feedback, ironing out bugs or edge cases, and getting to a point where it's "done" - can end up taking as much effort as the first 90% of the project.


Can you elaborate? I am curious. In my line of work, the 80/20 rule is often thrown about, that being "to do 80% of the work, you only need 20% of the knowledge." I thought the other reply was talking about the same diminutive axiom, but now I am not sure.


The sibling post gives a good account of the 90:90 challenge.

The last part of any semi-difficult project nearly always takes much longer than the officially difficult “main problem” to solve.

It leads to the last 10% of the deliverables costing at least 90% of the total effort for the project (not the planned amount; the total as calculated after completion, if that ever occurs).

This seems to endlessly surprise people in tech, but also many other semi-professional project domains (home renovations are a classic)


Isn't it a similar situation with self-driving cars?


I'm going to copy my answer from zellyn in a thread some time ago:

  "It’s been obvious even to casual observers like myself for years that Waymo/Google was one of the only groups taking the problem seriously and trying to actually solve it, as opposed to pretending you could add self-driving with just cameras in an over-the-air update (Tesla), or trying to move fast and break things (Uber), or pretending you could gradually improve lane-keeping all the way into autonomous driving (car manufacturers). That’s why it’s working for them. (IIUC, Cruise has pretty much also always been legit?)"
https://news.ycombinator.com/item?id=40516532


> The tech is good enough to make incredible demos but not good enough to generalize into reliable tools. The gulf between demo and useful tool is much wider than we thought.

One thing it is good at is scaring people into paying to feed it all the data they have for a promise of an unquantifiable improvement.


> The gulf between demo and useful tool is much wider than we thought.

This is _always_ the problem with these things. Voice transcription was a great tech demo in the 1990s (remember DragonDictate?), and there was much hype for a couple of years that, by the early noughties, speech would be the main way that people use computers. In the real world, 30 years on, it has finally reached the point where you might be able to use it for things provided that accuracy doesn't matter at all.


Assuming it works perfectly, speech still couldn't possibly be the main way to use a computer:

- hearing people next to you speaking to the computer would be tiring and annoying. Though remote work might be a partial solution to this

- hello voice extinction after days of using a computer :-)


Same here, but I'm hoping it takes off for other people.

I get requests all the time from colleagues to have discussions via telephone instead of chat because they are bad at typing.


Oh, yeah, I mean, it would've been awful had it actually happened, even if it worked properly. But try telling that to Microsoft's marketing department circa 1998.

(MS had a bit of a fetish for alternate interfaces at the time; before voice they spent a few years desperately trying to make Windows for Pen Computing a thing).


So a large cluster of nvidia cards cannot predict the future, generate correct http links, rotate around an object given only a single source picture with the right lighting, or program a million lines of code from 3 lines of prompt?

Color me surprised. Maybe we should ask Mira Murati to step aside from her inspiring essays about the future of poetry and help us figure out why the world spent trillions on nvidia equity and how to unwind this pending disaster...


It also can't reliably add two numbers together.

> help us figure out why the world spent trillions on nvidia equity and how to unwind this pending disaster..

There are many documented examples of the market being irrational.



