
I worked for an AI startup that got bought by a big tech company and I've seen the hype up close. In the inner tech circles it's not exactly a big lie. The tech is good enough to make incredible demos but not good enough to generalize into reliable tools. The gulf between demo and useful tool is much wider than we thought.


I work at Microsoft, though not in AI. This describes Copilot to a T. The demos are spectacular and get you so excited to go use it, but the reality is so underwhelming.


Copilot isn't underwhelming, it's shit. What's impressive is how Microsoft managed to gut GPT-4 to the point of near-uselessness. It refuses to do work even more than OpenAI models refuse to advise on criminal behavior. In my experience, the only thing it does well is scan documents on corporate SharePoint. For anything else, it's better to copy-paste to a proper GPT-4 yourself.

(Ask Office Copilot in PowerPoint to create you a slide. I dare you! I double dare you!!)

The problem with demos is that they're staged, they showcase integrations that are never delivered, and probably never existed. But you know what's not hype and fluff? The models themselves. You could hack a more useful Copilot with AutoHotkey, today.

I have GPT-4o hooked up as a voice assistant via Home Assistant, and what a breeze that is. Sure, every interaction costs me some $0.03 due to inefficient use of context (HA generates too much noise by default in its map of available devices and their state), but I can walk around the house and turn devices on and off by casually chatting with my watch, and it works, works well, and works faster than it takes to turn on Google Assistant.

So no, I honestly don't think AI advances are oversold. It's just that companies large and small race to deploy "AI-enabled" features, no matter how badly made they are.


Basically, functional AI interactions are prohibitively resource intensive and expensive. Microsoft's non-coding Copilots are shit due to resource constraints.


Basically, yes. My last 4 days of playing with this voice assistant cost me some $3.60 for 215 requests to GPT-4o, amounting to a little under 700 000 tokens. It's something I can afford[0], but with costs like this, you can't exactly hand out GPT-4 access to people for free. This cost structure doesn't work. It doesn't work with GPT-4o, and it worked even less with earlier model iterations that cost more than twice as much. And yet, that is what you need if you want a general-purpose Copilot or Assistant-like system. GPT-3.5-Turbo ain't gonna cut it. Llamas ain't gonna cut it either[1].

In a large sense, Microsoft lied. But they didn't lie about the capability of the technology itself - they just lied about being able to afford to deliver it for free.

--

[0] - Extrapolated to a hypothetical subscription, this would be ~$27 per month. I've seen more expensive and worse subscriptions. Still, it's a big motivator to go dig into the code of that integration and make it use ~2-4x fewer tokens by encoding "exposed entities" differently, and much more concisely.
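For the curious, the arithmetic behind that extrapolation is trivial - no assumptions about OpenAI's pricing needed, just the observed spend:

  spend_usd, days, requests = 3.60, 4, 215
  print(f"${spend_usd / requests:.3f} per request")                        # ~$0.017
  print(f"${spend_usd / days * 30:.0f} per month at the same usage rate")  # ~$27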

[1] - Maybe Llama 3 could, but IIRC license prevents it, plus it's how many days old now?


> they just lied about being able to afford to deliver it for free.

But they never said it'd be free - I'm pretty sure it was always advertised as a paid add-on subscription. That being the case, why not just offer multiple tiers of Copilot, using different models or credit limits?


Contrary to what the corporations want you to believe -- no, you can't buy your way out of every problem. Most modern AI tools are oversold and underwhelming, sadly.


Whoa, that's very cool. Can you share some info about how you set up the integration in HA? Would love to explore doing something like this for myself.


With the most recent update, it's actually very simple. You need three things:

1) Add the OpenAI Conversation integration - https://www.home-assistant.io/integrations/openai_conversati... - and configure it with your OpenAI API key. In there, you can control part of the system prompt (HA will add some stuff around it) and configure the model to use. With the newest HA, there's now an option to enable "Assist" mode (under the "Control Home Assistant" header). Enable this.

2) Go to "Settings/Voice assistants". Under "Assist", you can add a new assistant. You'll be asked to pick a name, a language to use, then choose a conversation model - here you pick the one you configured in step 1) - and Speech-to-Text and Text-to-Speech models. I have a subscription to Home Assistant Cloud, so I can choose "Home Assistant Cloud" models for STT and TTS; it would be great to integrate third-party ones here, but I'm not sure if and how that's possible.

3) Still in "Settings/Voice assistants", look for a line saying "${some number} entities exposed", under the "Add assistant" button. Click that, and curate the list of devices and sensors you want "exposed" to the assistant - "exposed" here means that HA will make a large YAML dump out of the selected entities and paste it into the conversation for you[0]. There's also other stuff (I heard the docs mention "intents") that you can expose, but I haven't looked into it yet[1].

That's it. You can press the Assist button and start typing. Or, for a much better experience, install HA's mobile app (and if you have a smartwatch, the watch companion app), and configure Home Assistant as your voice assistant on the device(s). That's how you get the full experience of randomly talking to your watch, "oh hey, make the home feel more like a Borg cube", and witnessing lights turning green and climate control pumping heat.

I really recommend trying this if you can. It's a night-and-day difference compared to Siri, Alexa or Google Now. It finally fulfills those promises of voice-activated interfaces.

(I'm seriously considering making a Home Assistant to Tasker bridge via HA app notifications, just to let the assistant do things on my phone - the experience is just that good; I bet it'll work better than the Google stuff out of the box.)

--

[0] - That's the inefficient token waster I mentioned in the previous comment. I have some 60 entities exposed, and best I can tell, it generates a couple thousand tokens' worth of YAML, most of which is noise like entity IDs and YAML structure. This could be cut down significantly if you named your devices and entities cleverly (and concisely), but I think my best bet is to dig into the code and trim it down. And/or create synthetic entities that stand for multiple entities representing a single device or device group, e.g. one "A/C" entity that combines multiple sensor entities from all A/C units.
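To illustrate what "encoding exposed entities more concisely" could mean, here's a hypothetical sketch: one terse line per entity instead of a YAML block. The field names are made up for illustration, not HA's actual dump format.

  # Hypothetical compaction: one terse line per entity instead of a YAML block.
  # Field names are illustrative, not Home Assistant's real dump format.
  entities = [
      {"entity_id": "light.living_room_ceiling", "state": "on", "brightness": 180},
      {"entity_id": "climate.bedroom_ac", "state": "cool", "temperature": 21},
  ]
  def compact(e):
      extras = ",".join(f"{k}={v}" for k, v in e.items() if k not in ("entity_id", "state"))
      return f"{e['entity_id']}={e['state']}" + (f" ({extras})" if extras else "")
  print("\n".join(compact(e) for e in entities))
  # -> light.living_room_ceiling=on (brightness=180)
  # -> climate.bedroom_ac=cool (temperature=21)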

[1] - Outside the YAML dump that goes with each message (and a preamble with the current date/time), which is how the Assistant knows the current state of every exposed entity, there's also an extra schema exposing controls via the "function calling" mechanism of the OpenAI API, which is how the assistant is able to control devices at home. I assume those "intents" go there. I'll be looking into it today, because there's a bunch of interactions I could simplify if I could expose automation scripts to the assistant.
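In case anyone wants to see the shape of that mechanism outside of HA, here's a stripped-down sketch of OpenAI-style tool calling driving a single Home Assistant service over its REST API. The tool schema, entity names and model choice are mine for illustration - it's not what the integration actually sends.

  import json
  import requests
  from openai import OpenAI  # openai>=1.x client assumed

  HA_URL = "http://homeassistant.local:8123"
  HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"
  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  # One illustrative tool; the real integration generates its own schema.
  tools = [{
      "type": "function",
      "function": {
          "name": "set_light",
          "description": "Turn a light entity on or off",
          "parameters": {
              "type": "object",
              "properties": {
                  "entity_id": {"type": "string", "description": "e.g. light.living_room"},
                  "state": {"type": "string", "enum": ["on", "off"]},
              },
              "required": ["entity_id", "state"],
          },
      },
  }]

  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user", "content": "Turn off the living room light"}],
      tools=tools,
  )

  # Execute whatever the model decided to call via HA's REST API.
  for call in resp.choices[0].message.tool_calls or []:
      args = json.loads(call.function.arguments)
      requests.post(
          f"{HA_URL}/api/services/light/turn_{args['state']}",
          headers={"Authorization": f"Bearer {HA_TOKEN}"},
          json={"entity_id": args["entity_id"]},
          timeout=10,
      )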


lol I can’t help but assume that people who think copilot is shit have no idea what they are doing.


I have it enabled company-wide at enterprise level, so I know what it can and can't do in day-to-day practice.

Here's an example: I mentioned PowerPoint in my earlier comment. You know what the correct way to use AI to make PowerPoint slides is? A way that works? It's not to use the O365 Copilot inside PowerPoint, but rather to ask GPT-4o in the ChatGPT app to use Python and pandoc to make you a PowerPoint.
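Concretely, the kind of script it writes for you is nothing fancy - roughly the sketch below (pandoc must be installed; the titles and bullets are obviously placeholders):

  import subprocess
  from pathlib import Path

  # Markdown headings become individual PowerPoint slides in pandoc's pptx writer.
  slides_md = "\n".join([
      "% Quarterly Review",
      "",
      "## Where we are",
      "",
      "- Shipped the thing",
      "- People are actually using it",
      "",
      "## Where we're going",
      "",
      "- Ship the next thing",
  ])
  Path("slides.md").write_text(slides_md)
  subprocess.run(["pandoc", "slides.md", "-o", "slides.pptx"], check=True)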

I literally demoed that to a colleague the other day. The difference is like night and day.


I've gone back to using GitHub Copilot with reveal.js [0]. It's much nicer to work with, and I'd recommend it unless you specifically need something from PowerPoint's advanced features.

[0] https://revealjs.com/


GitHub (which is owned by Microsoft) Copilot or Microsoft Copilot?


It's a lot like AR before the Vision Pro. The demos and reality just didn't meet. I'm not trying to claim the Vision Pro is perfect, but it seems to do AR in the real world without the circumstances needing to be absolutely ideal.


The Vision Pro is not doing well. Apple has cancelled the next version.[1] As Carmack says, AR/VR will be a small niche until the headgear gets down to swim goggle size, and will not go mainstream until it gets down to eyeglass size.

[1] https://www.msn.com/en-us/lifestyle/shopping/apple-shelves-n...


It was always the plan for Apple to release a cheaper version of the Vision Pro next. That the next version of the Pro has been postponed isn't a huge sign. It just seems that the technology isn't evolving quickly enough to warrant a new version any time soon.


> swim goggle size

The "Bigscreen Beyond" [0] is quite close, but doesn't have cameras - so at this stage it's only really good for watching movies and the like.

[0] https://store.bigscreenvr.com/products/bigscreen-beyond


That one does have 6DoF tracking, it's just based on the Valve Lighthouse system. The upside of that system is that it's more privacy-respecting.


Which it probably won't, because real-life physics isn't aware of roadmaps and corporate ads.


What physics are you talking about? Limits on power? Display? Sensor size? I ask because I’ve had similar feelings about things like high speed mobile Internet or mobile device screen size (over a couple of decades) and lived to see all my intuition blown away, so I really don’t believe in limits that don’t have explicit physical constraints behind them.


Lens diffraction limits. VR needs lenses that are small and thin enough while still being powerful enough to bend the needed light towards the eyes. Modern lenses need more distance between the screen and the eyes and they’re quite thick.

Theoretically, future lenses may make it possible, but the visible-light metamaterials needed are still at a very early research stage.


Apple approved ALVR a few days ago too; clearly they're having issues, at least with getting developer attention.

1: https://apps.apple.com/us/app/alvr/id6479728026


Your article states this differently. The development has not been cancelled fully, but refocused.

“and now hopes to release a more standard headset with fewer abilities by the end of next year.”


That's marketing-speak for "cancelled".


I think both hardware and software in AR have to become unobtrusive for people to adopt it. And then it will be a specialized tool for stuff like maintenance. Keeping large amounts of information in context without requiring frequent changes in context. But I also think that the information overload will put a premium on non-AR time. Once it becomes a common work tool, people using it will be very keen to touch grass and watch clouds afterwards.

I don't think it will ever become the mainstream everyday carry proponents want it to be. But only time will tell...


Until there is an interface for it that allows you to effectively touch type (or the equivalent), 99% of jobs won't be able to use it away from a desk anyway. Speech-to-text would be good enough for writing (non-technical) documentation, but probably not for things like filling spreadsheets or programming.


But does what Apple has shown in its demos of the Vision Pro actually meet reality? Does it provide any value at all?

In my eyes, it's exactly the same as AI. The demos work. You can play around with it, and it's impressive for an hour. But there's just very little value.


The value would come if it were something you'd feel comfortable wearing all day. So it would need perfect pass-through and to be much, much lighter and more comfortable. If they achieved that and it can do multiple high resolution virtual displays, then people would use it.

The R&D required to get to that point is vast though.


> can do multiple high resolution virtual displays

In most applications, it would then need to compete on price with multiple high resolution displays, and undercut them quite significantly to break the inertia of the old tech (and offset its various other advantages - like not having to wear something all day, and being able to let other people look at what you have on your screen).


I take your point, but living in a London flat I don't have the room for multiple high resolution displays. Nor are they very portable; I have an MBP rather than an iMac because mobility is important.

I do think we're 4+ years away from it getting to the 'iPhone 1' level of utility though, so we'll see how committed Apple are to it.


That's what all these companies are peddling though. The question is - do humans actually NEED a display before their eyes for all awake time? Or even most of it? Maybe, but today I have some doubts.


Given how we as a society are now having significant second thoughts as to the net utility for everybody having a display in their pocket for all awake time, I also have some doubts.


It's very sad because it's a sort of so-near-but-so-far kind of situation.

It would be valuable if it could do multimonitor, but it can't. It would be valuable if it could run real apps but it only runs iPad apps. It would be valuable if Apple opened up the ecosystem, and let it easily and openly run existing VR apps, including controllers - but they won't.

In fact the hardware itself crosses the threshold to where the value could be had, which is something that couldn't be said before. But Apple deliberately crimped it based on their ideology, so we are still waiting. There is light at the end of the tunnel though.


> But Apple deliberately crimped it based on their ideology

It's in a strange place, because Apple definitely also crimped it by not even writing enough software for it in-house.

Why can't it run Mac apps? Why can't you share your "screen configuration" and its contents with other people wearing a Vision Pro in the same room as you?


It is not really AR. Reality is not just augmented but captured first with a camera. It can make someone dizzy.


It's the opposite of AR, it's VR augmented with real imagery.


I never considered this angle. (Yeah, I am a sucker -- I know.) Are you saying that they cherry pick the best samples for the demo? Damn. I _still_ have high hopes for something like Copilot. I work on CRUD apps. There are so many cases where I want Copilot to provide some sample code to do X.


Sorry, I didn’t mean GitHub Copilot. Code generation is definitely one of the better use cases for AI. I meant the “Copilot” brand that Microsoft has trotted out into pretty much every one of its products and rolled together in this generic “Copilot” app on Windows.


They absolutely do. Check out this video https://youtu.be/tNmgmwEtoWE


I just used Groq / llama-7b to classify 20,000 rows of Google Sheets data (Sidebar archive links), a task that would have taken me way longer by hand... Every one I've spot-checked so far has been correct, and I might write another checker to scan the results just in case.
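If anyone wants to replicate it: the loop is just one chat completion per row against Groq's OpenAI-compatible endpoint, roughly like the sketch below (the model name and categories here are placeholders - check Groq's current model list):

  from openai import OpenAI  # Groq exposes an OpenAI-compatible API

  client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_KEY")
  CATEGORIES = ["programming", "science", "business", "other"]  # made-up taxonomy

  def classify(row_text):
      resp = client.chat.completions.create(
          model="llama3-8b-8192",  # placeholder; use whatever Groq currently serves
          messages=[{
              "role": "user",
              "content": f"Classify this link into exactly one of {CATEGORIES}. "
                         f"Answer with the category name only.\n\n{row_text}",
          }],
          temperature=0,
      )
      return resp.choices[0].message.content.strip()

  print(classify("Show HN: I built a tiny HTTP server in Rust"))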

Even with a 20% failure rate it's better than not having the classifications.


The problem isn't that it's not useful for self-driven tasks like that; it's that you can't really integrate it into a product that does task X, because when someone buys a system to do task X, they want it to be more reliable than 80%.


Stick a slick UI on it that lets the end user quickly fix up just the bits it got wrong and flip through documents quickly, and 80% correct can still be a massive time saver.


I think that can kind of work for B2C things, but is much less likely to do so for B2B. Just as an example, I work on industrial maintenance software, and customers expect us to catch problems with their machinery 100% of the time, and in time to prevent it. Sometimes faults start and progress to failure faster than they're able to send data to us, but they still are upset that we didn't catch it.

It doesn't matter whether that's reasonable or not, there are a lot of people who expect software systems to be totally reliable at what they do, and don't want to accept less.


We're thinking about adding AI to the product, and that's the path I'd like to take: view the AI as an intern who can make mistakes, and provide a UI where the user can review what the AI is planning to do.


Except that people hate monitoring an intern all day, regardless of whether it is a human or a machine.


I think this is going to be a heavy lift, and it's one of the reasons I think a chat bot is not the right UX. Every time someone says “all you need to do to get ChatGPT working is provide it explicit requirements and iterate”, I think: for a lot of coding tasks it's much easier to just hack on code for a while than to be a manager to an 80%-right intern.


I classified ~1000 GBA game ROM files by using their file names to put each in a category folder. It worked about 90% correctly. I used GPT-3.5, so it didn't adhere to my provided list of categories, but the results were mostly not wrong otherwise.

https://gist.github.com/SMUsamaShah/20f24e80cfe962d26af5315e...


Sorry, this actually sounds like a real use case. What was the classification? (I tried googling “sidebar archive”.) I assume you somehow visited 20,000 web pages and it classified the text on each page? How was that achieved? Did you run a local Llama?


We had ChatGPT look at 200,000 products and make a navigation structure in 3 tiers based on the attributes of each product. The validation took 2% of the time it would have taken to manually create the hierarchy ourselves.

I think that even the simpler LLMs are very well suited for classification tasks, where very little prompting is needed.


Sorry to harp on.

So you had a list of products (what sort? I am thinking widgets from a wholesaler, and you want a three-tier menu for an e-commerce site?)

I am guessing each product has a description - like from Amazon - and ChatGPT read the description and said “aha, this is a Television/LCD/50inch or Underwear/flimsy/bra”.

I assume you sent in 200,000 different queries - but how did you get it to return three tiers? (Maybe I need to read one of those “become a ChatGPT expert” blogs.)


I'm not this person, but I've been working on LLMs pretty aggressively for the last 6ish months and I have some ideas of how this __could__ be done.

You could plainly ask the LLM something like this as part of the query:

"Please provide 3 categories that this product could exist under, with increasing specificity in the following format:

  {
     "broad category": "a broad category that would encompass this product, as well as others, for example 'televisions' for a 50" OLED LG with Roku integration",
     "category": "a narrower category that describes this product more aggressively, for example 'Smart Televisions'",
     "narrow category": "an even narrower category that describes this product and its direct competitors, for example OLED Smart televisions"
  }
A next question you'll have pretty quickly is, "Well, what if sometimes it returns 'Smart televisions' and other times it returns 'Smart TVs', won't that result in multiple of the same category?" And that's a good and valid question, so you then have another query that takes the categories that have been provided to you and asks for synonyms, alternative spellings, etc., such as:

"Given a product categorization of a specific level of specificity, please provide a list of words and phrases that mean the same thing".

In OpenAI's backend - and many of the others, I think - you can have the API run the query multiple times and get back multiple answers. Enumerate over those answers, build the graph, and you have all that data in an easy-to-read-and-follow format!

It might not be perfect, but it should be pretty good!
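To make the "run the query multiple times" part concrete, here's a minimal sketch using the chat completions API's n parameter (the model is a placeholder, and the JSON keys assume the format above, so treat it as an illustration rather than a recipe):

  import json
  from openai import OpenAI  # openai>=1.x client assumed

  client = OpenAI()
  prompt = (
      "Provide 3 categories for this product, broad to narrow, as JSON with keys "
      '"broad category", "category" and "narrow category".\n\n'
      "Product: 50-inch OLED LG TV with Roku built in"
  )

  resp = client.chat.completions.create(
      model="gpt-4o",                           # placeholder model choice
      messages=[{"role": "user", "content": prompt}],
      response_format={"type": "json_object"},  # ask for parseable JSON back
      n=3,                                      # three independent answers in one call
  )

  for choice in resp.choices:
      tiers = json.loads(choice.message.content)
      keys = ("broad category", "category", "narrow category")
      print(" > ".join(str(tiers.get(k, "?")) for k in keys))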


> Well, what if sometimes it returns 'Smart televisions' and other times it returns 'Smart TVs', won't that result in multiple of the same category

Text similarity works well in this case. You can just use cosine similarity and merge ones that are very close, or ask GPT to compare those that are on the edge.
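A sketch of that, using OpenAI embeddings and plain numpy (the 0.85 threshold is a guess you'd tune by eye, not a magic number):

  import numpy as np
  from openai import OpenAI  # openai>=1.x client assumed

  client = OpenAI()
  labels = ["Smart televisions", "Smart TVs", "OLED Smart televisions"]

  resp = client.embeddings.create(model="text-embedding-3-small", input=labels)
  vecs = np.array([d.embedding for d in resp.data])
  vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

  sim = vecs @ vecs.T  # cosine similarity, since rows are unit-normalized
  for i in range(len(labels)):
      for j in range(i + 1, len(labels)):
          if sim[i, j] > 0.85:
              print(f"merge candidates: {labels[i]!r} ~ {labels[j]!r} ({sim[i, j]:.2f})")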


It sounds like a real use case, but possibly quite overkill to use an LLM.

Unless you need to have some "reasoning" to classify the documents correctly, a much more lightweight BERT-like model (RoBERTa or DistilBERT) will perform on par in accuracy while being a lot faster.
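If you want to try that route without labelling any training data first, a zero-shot pipeline over an MNLI-tuned model is the quickest stand-in - not the fine-tuned RoBERTa/DistilBERT setup described above, but the same family of idea (the labels below are placeholders):

  from transformers import pipeline  # Hugging Face transformers

  # Zero-shot classification via an NLI model; no fine-tuning needed to try it.
  clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
  result = clf(
      "50-inch OLED LG TV with Roku built in",
      candidate_labels=["televisions", "laptops", "underwear"],  # placeholder taxonomy
  )
  print(result["labels"][0], round(result["scores"][0], 3))  # top label and its score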


"while being a lot faster", yes; but something that LLMs do that those other tools don't is being hilariously approachable.

LLMs can operate as a very, very *very* approachable natural language processing model without needing to know all the gritty details of NLP.


> Every one I've spot checked right now has been correct, and I might write another checker to scan the results just in case.

If you already have the answers to verify the LLM output against, why not just use those to begin with?


Not GP, but I would imagine "another checker to scan the results" would be another NN classifier.

The thinking being that you'd compare the outputs of the two and, assuming the results are statistically independent from each other and of similar quality, a 1% difference between them in that comparison would suggest a ~0.5% error rate from "ground truth" for each.
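The arithmetic, for anyone checking: two independent classifiers that each err at rate e disagree at roughly 2e(1 - e), so you halve the observed disagreement to estimate per-classifier error.

  disagreement = 0.01                # 1% of rows where the two classifiers differ
  error_estimate = disagreement / 2  # 2e(1 - e) ~= 2e for small e, so e ~= disagreement / 2
  print(f"~{error_estimate:.2%} estimated error per classifier vs ground truth")  # ~0.50%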


Maybe their problem is using LLM to solve f:X→Y, where the validation, V:{X,Y}→{0,1}, is trivial to compute?


Looks like this is another application of the ninety-ninety rule: getting to the stage where you can make incredible demos has required 90% of the effort, and the last 10% to make it actually reliable will require the other 90% (https://en.wikipedia.org/wiki/Ninety%E2%80%93ninety_rule).


80/20, 90/10 etc. are all just https://en.wikipedia.org/wiki/Zipf%27s_law


Excellent find, I’d never heard of Zipf’s law.

GP was talking about something else though: the 90:90 rule relates to an extremely common planning-optimism fallacy about the work required to demo versus the work required to productise.


It's not just demos though. It's that the final 10% of any project - which largely consists of polishing, implementing feedback, ironing out bugs or edge cases, and getting to a point where it's "done" - can end up taking as much effort as the first 90% of the project.


Can you elaborate? I am curious. In my line of work, the 80/20 rule is often thrown about, that being "to do 80% of the work, you only need 20% of the knowledge." I thought the other reply was talking about the same diminutive axiom, but now I am not sure.


The sibling post gives a good account of the 90:90 challenge.

The last part of any semi-difficult project nearly always takes much longer than the officially difficult “main problem” to solve.

It leads to the last 10% of the deliverables costing at least 90% of the total effort for the project (not the planned amount; the total as calculated after completion, if that ever occurs).

This seems to endlessly surprise people in tech, but also many other semi-professional project domains (home renovations are a classic)


Isn't it a similar situation with self-driving cars?


I'm going to copy my answer from zellyn in a thread some time ago:

  "It’s been obvious even to casual observers like myself for years that Waymo/Google was one of the only groups taking the problem seriously and trying to actually solve it, as opposed to pretending you could add self-driving with just cameras in an over-the-air update (Tesla), or trying to move fast and break things (Uber), or pretending you could gradually improve lane-keeping all the way into autonomous driving (car manufacturers). That’s why it’s working for them. (IIUC, Cruise has pretty much also always been legit?)"
https://news.ycombinator.com/item?id=40516532


> The tech is good enough to make incredible demos but not good enough to generalize into reliable tools. The gulf between demo and useful tool is much wider than we thought.

One thing it is good at is scaring people into paying to feed it all the data they have for a promise of an unquantifiable improvement.


> The gulf between demo and useful tool is much wider than we thought.

This is _always_ the problem with these things. Voice transcription was a great tech demo in the 1990s (remember DragonDictate?), and there was much hype for a couple of years that, by the early noughties, speech would be the main way that people use computers. In the real world, 30 years on, it has finally reached the point where you might be able to use it for things provided that accuracy doesn't matter at all.


Assuming it works perfectly, speech still couldn't possibly be the main way to use a computer:

- hearing people next to you speaking to the computer would be tiring and annoying. Though remote work might be a partial solution to this

- hello voice extinction after days of using a computer :-)


Same here, but I'm hoping it takes off for other people.

I get requests all the time from colleagues to have discussions via telephone instead of chat because they are bad at typing.


Oh, yeah, I mean, it would've been awful had it actually happened, even if it worked properly. But try telling that to Microsoft's marketing department circa 1998.

(MS had a bit of a fetish for alternate interfaces at the time; before voice they spent a few years desperately trying to make Windows for Pen Computing a thing).


So a large cluster of nvidia cards cannot predict the future, generate correct http links, rotate around an object given only a single source picture with the right lighting, or program a million lines of code from 3 lines of prompt?

Color me surprised. Maybe we should ask Mira Murati to step aside from her inspiring essays about the future of poetry and help us figure out why the world spent trillions on nvidia equity and how to unwind this pending disaster...


It also can't reliably add two numbers together.

> help us figure out why the world spent trillions on nvidia equity and how to unwind this pending disaster..

There are many documented examples of the market being irrational.



