
Funny, I just picked up a device for use with https://heywillow.io for similar reasons


Me too, but I bricked mine when flashing the firmware. Just a fluke; nothing to be done about it.


I watched the demo; to be honest, if I had seen it sooner I probably would have tried to start this as a fork from there. Any idea what the issue was?


No, but the core concept had changed so much between the firmware version from when I bought it and what it is now that I'm not surprised the upgrade is a buggy process.

It's a shame, because I only wanted a tenth of what it could do: I just wanted it to send text to a REST server. That's all! And I had it working! But I saw there was a major firmware update, and to make a long story short, KABOOM.
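
For reference, the whole use case was on the order of this (a minimal sketch in Python; the endpoint URL and payload shape are hypothetical, not what the firmware actually sent):

    import requests

    WEBHOOK_URL = "http://192.168.1.50:8000/speech"  # hypothetical local REST server

    def send_transcript(text: str) -> None:
        # POST the recognized text as JSON; raise on any HTTP error.
        resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=5)
        resp.raise_for_status()

    send_transcript("turn on the kitchen lights")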


I'm the founder of Willow.

Early versions sent requests directly from devices and we found that to be problematic/inflexible for a variety of reasons.

Our new architecture uses the Willow Application Server (WAS), which speaks a standard, generic protocol to devices and handles the subtleties of the various command endpoints (Home Assistant, REST, openHAB, etc.). This is activated in WAS with the "WAS Command Endpoint" configuration option.
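
To illustrate the dispatch pattern (a rough sketch, not Willow's actual protocol or API; the handler names, hosts, and message shape are invented for illustration):

    import requests

    def to_home_assistant(text: str) -> None:
        # Home Assistant exposes a conversation API that accepts free text
        # (host and token here are assumed placeholders).
        requests.post(
            "http://homeassistant.local:8123/api/conversation/process",
            headers={"Authorization": "Bearer <long-lived-token>"},
            json={"text": text},
            timeout=5,
        )

    def to_generic_rest(text: str) -> None:
        requests.post("http://example.local/command", json={"text": text}, timeout=5)

    # One server-side setting (think "WAS Command Endpoint") picks the route;
    # devices only ever speak the one generic protocol.
    ENDPOINTS = {"home_assistant": to_home_assistant, "rest": to_generic_rest}

    def handle_device_message(endpoint: str, transcript: str) -> None:
        ENDPOINTS[endpoint](transcript)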

This approach also enables a feature we call Willow Auto Correct (WAC) that our users have found to be a game-changer in the open source voice assistant space.

I don't want to seem like I'm selling or shilling Willow, I just have a lot of experience with this application and many hard-learned lessons over the past two decades.


I'm the founder of Willow.

Generally speaking, the approach of "run this on your Pi with distro X" has a surprising number of issues that make it, umm, "rough" for this application. As anyone who has tried to get reliable low-latency audio on Linux can attest, the hoops you have to jump through to get a standard distro (audio drivers, ALSA, PulseAudio/PipeWire, etc.) working well are many. This is why all commercial solutions using similar architectures/devices build the "firmware" from the kernel up with tools like Yocto.

More broadly, many people have tried this approach (run everything on the Pi) and frankly speaking it just doesn't have a chance to be even remotely competitive with commercial solutions in terms of quality and responsiveness. Decent quality speech recognition with Whisper (as one example) itself introduces significant latency.

We (and our users) have determined over the past year that, in practice, Whisper medium is pretty much the minimum model for reliable transcription of typical commands under real-world conditions. If you watch a lot of these demo videos you'll notice the human is very close to the device and speaking very carefully and deliberately. That's just not the real world.
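
For anyone who wants to check this themselves, a minimal sketch with the open-source openai-whisper package (the audio filename is hypothetical; expect a multi-GB download for the medium model on first use):

    import time
    import whisper  # pip install openai-whisper

    model = whisper.load_model("medium")

    start = time.time()
    result = model.transcribe("command.wav")  # 16 kHz mono works best
    print(f"transcript: {result['text']!r}")
    print(f"latency: {time.time() - start:.1f}s")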

We haven't benchmarked Whisper on the Pi 5, but the HA community has, and they find it delivers roughly 2x the Pi 4's performance for Whisper. If you look at our benchmarks[0], that means you'll still be waiting at least 25 seconds just for transcription of a typical voice command with Whisper medium.
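
Back-of-the-envelope, treating the numbers above as rough assumptions rather than measurements:

    # Assumed: ~50 s for Whisper medium on a Pi 4 per the benchmark discussion,
    # and the HA community's reported ~2x Pi 4 -> Pi 5 speedup.
    pi4_latency_s = 50.0
    pi5_speedup = 2.0

    pi5_latency_s = pi4_latency_s / pi5_speedup
    print(f"Pi 5 transcription estimate: {pi5_latency_s:.0f}s")  # ~25 s
    print(f"with one repeat: {2 * pi5_latency_s:.0f}s")          # ~50 s, i.e. about a minute end to end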

At that point you can find your phone, unlock it, open an app, and just do it yourself faster, with a fraction of the frustration. If you have to repeat yourself even once, you're looking at a minute for an action to complete.

The hardware just isn't there. Maybe on the Raspberry Pi 10 ;).

We have a generic software-only Willow client for "bring your own distro and device", and it does not work anywhere near as well as our main target of ESP32-S3 BOX based devices (acoustic engineering, targeted audio drivers, a real-time operating system, known hardware). That's why we haven't released it.

Many people also want multiple devices, and this approach becomes even more cumbersome when you're trying to coordinate wake activation and functionality between all of them. You then also end up with a fleet of devices running full-blown Linux distros, which quickly becomes difficult to manage.
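
One common way commercial systems handle the coordination problem (a hedged sketch, not Willow's implementation; the field names are illustrative): every device that hears the wake word reports a score, and a coordinator lets only the best-scoring device answer.

    import time
    from dataclasses import dataclass

    @dataclass
    class WakeEvent:
        device_id: str
        score: float      # e.g. wake-word confidence or capture loudness
        timestamp: float

    def pick_winner(events: list[WakeEvent], window_s: float = 0.5) -> str:
        # Consider only events that fired within a short window of the first
        # one, then pick the device that heard the wake word best.
        first = min(e.timestamp for e in events)
        contenders = [e for e in events if e.timestamp - first <= window_s]
        return max(contenders, key=lambda e: e.score).device_id

    now = time.time()
    print(pick_winner([
        WakeEvent("kitchen", 0.91, now),
        WakeEvent("living_room", 0.62, now + 0.1),
    ]))  # -> kitchen; the living room device stands down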

Using random/generic audio hardware also has significant issues of its own. Many people in the Home Assistant community who have tried the "just bring a mic and speaker!" approach end up buying >$50 USB speakerphone devices with the acoustic engineering necessary for reliable far-field audio. These are often powered by XMOS chips that by themselves cost a multiple of the ESP32-S3 (as one example). So at that point you're in the range of >$150 per location for a solution that is still nowhere near the experience of Echo, Google Home, or Willow :).

I don't want to seem critical of this project, but I think it's important to set very clear expectations before people go out and spend money thinking it's going to resemble anything close to the commercial experience.

[0] - https://heywillow.io/components/willow-inference-server/#com...



