How Willow Works¶
Software Architecture¶
The root of the project is the Espressif Audio Development Framework (ESP-ADF). We use the ESP-ADF because it enables fairly straightforward management of the "audio pipelines" we use for I2S access to the audio hardware, connection to the recorder, streaming to the inference server, and optional AMR-WB audio compression. In the future we will add further playback pipelines for things like music streaming. The ESP-ADF then uses the ESP-IDF as a component. This is somewhat counterintuitive: the project is rooted at the ESP-ADF, which carries the ESP-IDF inside it, when one would expect the dependency to run the other way around.
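To make the pipeline concept concrete, here is a minimal sketch of an ESP-ADF capture pipeline that reads from I2S and streams to an HTTP endpoint. This is not Willow's actual code; the endpoint URL is a placeholder and error handling is omitted.

```c
#include "audio_pipeline.h"
#include "audio_element.h"
#include "i2s_stream.h"
#include "http_stream.h"

void start_capture_pipeline(void)
{
    // Create an empty pipeline to hold the linked audio elements
    audio_pipeline_cfg_t pipeline_cfg = DEFAULT_AUDIO_PIPELINE_CONFIG();
    audio_pipeline_handle_t pipeline = audio_pipeline_init(&pipeline_cfg);

    // I2S reader: pulls PCM samples from the codec/microphones
    i2s_stream_cfg_t i2s_cfg = I2S_STREAM_CFG_DEFAULT();
    i2s_cfg.type = AUDIO_STREAM_READER;
    audio_element_handle_t i2s_reader = i2s_stream_init(&i2s_cfg);

    // HTTP writer: pushes the captured audio to the inference server
    http_stream_cfg_t http_cfg = HTTP_STREAM_CFG_DEFAULT();
    http_cfg.type = AUDIO_STREAM_WRITER;
    audio_element_handle_t http_writer = http_stream_init(&http_cfg);
    audio_element_set_uri(http_writer, "https://wis.example/api/willow"); // placeholder URL

    // Register the elements and link them: i2s -> http
    audio_pipeline_register(pipeline, i2s_reader, "i2s");
    audio_pipeline_register(pipeline, http_writer, "http");
    const char *link_tag[] = {"i2s", "http"};
    audio_pipeline_link(pipeline, link_tag, 2);

    audio_pipeline_run(pipeline);
}
```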
We then (manually) pull in a more recent version of the ESP-SR speech recognition framework than the ESP-ADF ships with. ESP-SR provides the AFE (Audio Front End), which performs processing such as AGC (Automatic Gain Control), noise suppression, and acoustic echo cancellation. Additionally, it provides wake word detection (WakeNet) and local command recognition (MultiNet).
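As an illustration, a wake word loop built on the ESP-SR AFE looks roughly like the following. This is a simplified sketch rather than Willow's actual task structure: in practice feeding and fetching run in separate tasks, and the I2S read is elided here.

```c
#include <stdbool.h>
#include <stdlib.h>
#include "esp_afe_sr_iface.h"
#include "esp_afe_sr_models.h"
#include "model_path.h"

void wake_word_loop(void)
{
    // Load models from the "model" flash partition and pick a WakeNet model
    srmodel_list_t *models = esp_srmodel_init("model");
    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    afe_config.wakenet_model_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL);

    esp_afe_sr_iface_t *afe_handle = (esp_afe_sr_iface_t *)&ESP_AFE_SR_HANDLE;
    esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);

    int chunksize = afe_handle->get_feed_chunksize(afe_data);
    int16_t *buff = malloc(chunksize * afe_config.pcm_config.total_ch_num * sizeof(int16_t));

    while (true) {
        // ... fill buff with one chunk of raw PCM from the I2S pipeline ...
        afe_handle->feed(afe_data, buff);

        afe_fetch_result_t *res = afe_handle->fetch(afe_data);
        if (res && res->wakeup_state == WAKENET_DETECTED) {
            // Wake word heard: start recording/streaming
            break;
        }
    }
}
```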
We also use various "managed components" from the IDF Component Manager to support peripherals such as the LCD.
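Managed components are declared in the project's idf_component.yml manifest and fetched by the IDF Component Manager at build time. A hypothetical entry looks like this; the component names and versions below are illustrative, not Willow's actual dependency list:

```yaml
# idf_component.yml -- illustrative only, not Willow's actual manifest
dependencies:
  idf: ">=4.4"
  # Example graphics component for driving the LCD
  lvgl/lvgl: "^8.3.0"
```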
What’s With the Goofy On-the-Wire Format?¶
Many developers are familiar with things like WebRTC. Again, due to my years in VoIP, I certainly am. At first glance, Willow may seem like a good application for WebRTC.
Here's why it isn't:
- WebRTC is very heavy. It mandates support for things such as ICE, DTLS, etc.
- ICE often leads to long session-establishment times because ICE candidates need to be collected and evaluated. Yes, there is "Trickle ICE," but that's another rant for another day.
- DTLS is fairly heavy too.
WebRTC is very cool and offers many advantages. However, most of those advantages relate to WebRTC's ability to establish bidirectional media streams between two previously unknown peer devices that can both be behind NAT. We don't have that issue: the streaming server is expected to be either local or at an established address that isn't behind NAT.
However, because we (currently) use HTTP, you can still have both devices behind NAT as long as you properly forward HTTP/HTTPS through your NAT device. None of this applies if you are self-hosting your inference server locally. In short, we feel that WebRTC isn't appropriate or necessary for Willow.
Willow Inference Server Mode Flow¶
graph TB
    A[Wake word] --> B[Start recording]
    B --> C[Stream in real time to inference server]
    C --> D[VAD detects end of speech]
    D --> E[End stream]
    E --> F[Willow Inference Server performs speech to text]
    F --> G[JSON response with text and language]
    G --> H[Send to configured command endpoint]
    H --> I[Depending on configuration, play success/failure tone or text to speech result]
    I --> J[Display speech to text results and command endpoint output]
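On the wire, the flow above boils down to a chunked HTTP POST of audio followed by a small JSON body coming back. A hedged sketch of the client side using ESP-IDF's esp_http_client follows; the URL and the exact JSON keys ("text", "language") are assumptions based on the flow description, not the actual Willow Inference Server API.

```c
#include <stdio.h>
#include "esp_http_client.h"
#include "cJSON.h"

// Placeholder endpoint -- the real server URL is part of Willow's configuration
#define WIS_URL "https://wis.example/api/willow"

void stream_utterance(const char *pcm, size_t pcm_len)
{
    esp_http_client_config_t config = {
        .url = WIS_URL,
        .method = HTTP_METHOD_POST,
    };
    esp_http_client_handle_t client = esp_http_client_init(&config);

    // write_len = -1 selects chunked transfer encoding, so audio can be
    // written as it is captured instead of buffering the whole utterance
    esp_http_client_open(client, -1);
    esp_http_client_write(client, pcm, pcm_len);
    // ... keep writing chunks until VAD signals end of speech ...

    esp_http_client_fetch_headers(client);
    char body[512] = {0};
    int len = esp_http_client_read_response(client, body, sizeof(body) - 1);

    if (len > 0) {
        // Key names here are assumptions based on "text and language" above
        cJSON *root = cJSON_Parse(body);
        cJSON *text = cJSON_GetObjectItem(root, "text");
        if (cJSON_IsString(text)) {
            printf("speech to text result: %s\n", text->valuestring);
        }
        cJSON_Delete(root);
    }

    esp_http_client_close(client);
    esp_http_client_cleanup(client);
}
```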
Local Mode Flow¶
graph TB
    A[Wake word] --> B[MultiNet]
    B --> C[MultiNet returns detected command ID]
    C --> D[Look up corresponding text for command]
    D --> E[Send text to configured command endpoint]
    E --> F[Play success/failure tone]
    F --> G[Display speech to text results and command endpoint output]
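For reference, driving MultiNet from ESP-SR looks roughly like the following. The command table here is hypothetical; Willow builds its own table from configuration, and the surrounding task and AFE plumbing is omitted.

```c
#include <stdint.h>
#include "esp_mn_iface.h"
#include "esp_mn_models.h"
#include "esp_mn_speech_commands.h"
#include "model_path.h"

void setup_multinet(void)
{
    // Pick an English MultiNet model from the "model" partition
    srmodel_list_t *models = esp_srmodel_init("model");
    char *mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_ENGLISH);
    esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);
    model_iface_data_t *mn_data = multinet->create(mn_name, 6000); // ~6 s listen window

    // Hypothetical command table -- each ID maps to the text sent to the endpoint
    esp_mn_commands_clear();
    esp_mn_commands_add(1, "turn on the lights");
    esp_mn_commands_add(2, "turn off the lights");
    esp_mn_commands_update();
    (void)mn_data;
}

void local_command_step(esp_mn_iface_t *multinet, model_iface_data_t *mn_data,
                        int16_t *afe_frame)
{
    // Feed one AFE-processed frame; MultiNet accumulates audio until it decides
    esp_mn_state_t state = multinet->detect(mn_data, afe_frame);
    if (state == ESP_MN_STATE_DETECTED) {
        esp_mn_results_t *results = multinet->get_results(mn_data);
        int command_id = results->command_id[0]; // best-scoring match
        // Look up the configured text for command_id, send it to the command
        // endpoint, then play the success/failure tone
        (void)command_id;
    }
}
```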