Project Jarvis
Python, Rhasspy
A local voice assistant that doesn't suck
TLDR: I'm building a locally hosted voice assistant. Yay!
Google Assistant is not new... and honestly it kinda sucks. With every iteration it requires more and more permissions to access all levels of your account and with every iteration the basic functionality gets more and more broken.
For about a year now everything has deteriorated. Intent analysis is bad and works maybe 50% of the time. ASR works for me but not for others. The wake word gets triggered by household appliances. Timer management is honestly embarrassingly complicated.
Let's face it, 90% of the time Google Assistant is used for setting up timers when you're cooking and playing music from your favourite streaming service. These core components need to work without fail, every time. Can we do better?
MAYBE..?
Requirements
First, let's gather the requirements. Whatever we build should be able to:
- support timers (set/modify/remove)
- play music from a streaming service
- answer questions using the ChatGPT API
- be customizable
- be as private as possible
Tooling
Rhasspy is an open source, fully offline framework of voice assistant services which integrates easily with Home Assistant (which happens to be my main home automation platform). The project has been going for a good while now and has loads of contributors.
It's composed of independent services, such as Text To Speech, Wake Word Detection and Intent Analysis, that coordinate over MQTT using a superset of the Hermes protocol. This gives us an opportunity to use the most efficient component for each job. We'll use it as the backbone of our project.
The goal is to achieve a workflow made up of the stages covered below: ASR, Intent and Handle.
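To give a feel for what "coordinating over MQTT" means in practice, here is a minimal sketch of decoding one intent message. It assumes the message arrives on a `hermes/intent/<intentName>` topic with `input`, `siteId` and `slots` fields, which is my reading of the Hermes protocol; `parse_hermes_intent` and the example payload are made up for illustration, and a real setup would receive the bytes from an MQTT client rather than build them by hand.

```python
import json


def parse_hermes_intent(topic: str, payload: bytes) -> dict:
    """Flatten a Hermes-style intent message into a small dict (hypothetical helper)."""
    prefix = "hermes/intent/"
    if not topic.startswith(prefix):
        raise ValueError(f"not an intent topic: {topic}")
    message = json.loads(payload)
    # Each slot is assumed to look like {"slotName": ..., "value": {"value": ...}}.
    slots = {
        slot["slotName"]: slot["value"]["value"]
        for slot in message.get("slots", [])
    }
    return {
        "intent": topic[len(prefix):],
        "text": message.get("input", ""),
        "site_id": message.get("siteId", "default"),
        "slots": slots,
    }


# Hand-written example payload, not captured from a real broker.
example_payload = json.dumps({
    "input": "set a timer for 5 minutes",
    "siteId": "kitchen",
    "intent": {"intentName": "SetTimer", "confidenceScore": 1.0},
    "slots": [
        {"slotName": "minutes", "value": {"kind": "Number", "value": 5}},
    ],
}).encode("utf-8")

parsed = parse_hermes_intent("hermes/intent/SetTimer", example_payload)
print(parsed)
```

Because every service only talks to the broker, swapping one component (say, the ASR) should not require touching the code that consumes intents.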
ASR, or Automatic Speech Recognition, is a crucial component of the workflow. In September 2022, OpenAI released Whisper, one of the most accurate open source speech recognition models available, which can transcribe speech audio in 97 languages and translate it into English.
Since its introduction, a number of community-led projects have been optimizing its performance, most notably:
- faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models. This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.
- whisper.cpp - high-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model in plain C/C++, running on CPU with low memory usage
Depending on how performant the hardware is, we'll select one of them to integrate into the workflow.
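As a rough idea of what the faster-whisper route could look like, here is a sketch that picks a model size and precision for the host and then transcribes a WAV file. The `WhisperModel(...)` constructor and `transcribe(...)` call follow the faster-whisper README as I understand it; `pick_asr_settings`, its thresholds and the `transcribe` wrapper are my own guesses, not benchmarked values.

```python
def pick_asr_settings(has_gpu: bool, ram_gb: float) -> dict:
    """Hypothetical heuristic: choose model size and precision for the host."""
    if has_gpu:
        return {"model_size": "medium", "device": "cuda", "compute_type": "float16"}
    if ram_gb >= 4:
        return {"model_size": "small", "device": "cpu", "compute_type": "int8"}
    # Low-memory boards get the smallest model with 8-bit quantization.
    return {"model_size": "tiny", "device": "cpu", "compute_type": "int8"}


def transcribe(wav_path: str, settings: dict) -> str:
    """Transcribe a WAV file with faster-whisper and return the joined text."""
    # Imported lazily so the heuristic above works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(
        settings["model_size"],
        device=settings["device"],
        compute_type=settings["compute_type"],
    )
    # transcribe() yields segments lazily; joining them forces the decoding.
    segments, _info = model.transcribe(wav_path, beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)


# A board with 1 GB of RAM and no GPU, e.g. a Raspberry Pi 3.
pi_settings = pick_asr_settings(has_gpu=False, ram_gb=1)
print(pi_settings)
```

Calling `transcribe("command.wav", pi_settings)` would then return the recognized text, assuming `faster-whisper` is installed and the file exists. Whether the smallest model is accurate enough for voice commands is something I'd have to measure.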
Intent - Probably the most crucial part of the system. Given the recent announcement that the creator of Rhasspy, Mike Hansen, is joining Nabu Casa, and the fact that Home Assistant is already deployed at my home, the choice is obvious... We're going with Home Assistant for this.
Every time an intent is recognized, Rhasspy will send an event to Home Assistant through Home Assistant's REST API. The type of event is determined by the name of the intent, and the event data comes from the tagged words in your sentences.
Home Assistant can also accept intents directly over the HTTP API, so we can use intent scripts to trigger actions depending on the voice command from Rhasspy.
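To make the event flow a bit more concrete, here is a rough sketch of the kind of request that ends up at Home Assistant. Rhasspy does this for you, so the code only builds the request and never sends it. Assumptions on my side: the event is named `rhasspy_<IntentName>`, the slots are sent as the event data, and `HA_URL`, `HA_TOKEN` and `build_event_request` are placeholders invented for this example.

```python
import json
import urllib.request

HA_URL = "http://homeassistant.local:8123"  # assumption: default Home Assistant address
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"   # placeholder, never commit a real token


def build_event_request(intent: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to Home Assistant's /api/events endpoint."""
    # Assumed naming scheme: the intent name prefixed with "rhasspy_".
    event_type = f"rhasspy_{intent['intent']['name']}"
    body = json.dumps(intent.get("slots", {})).encode("utf-8")
    return urllib.request.Request(
        url=f"{HA_URL}/api/events/{event_type}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json",
        },
    )


# Hand-written example of a recognized intent.
recognized = {
    "intent": {"name": "SetTimer", "confidence": 1.0},
    "slots": {"minutes": 5},
    "text": "set a timer for 5 minutes",
}
request = build_event_request(recognized)
print(request.get_method(), request.full_url)
# To actually fire the event: urllib.request.urlopen(request)
```

On the Home Assistant side, an automation listening for that event type could then read the slot values from the event data.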
We are getting closer!
Handle - Our 3 scenarios here are timers, Spotify music and ChatGPT for anything else. A few initial ideas about this:
- Timer
  - parse the intent JSON from Rhasspy for minutes/seconds
  - use these values to set a timer controlled in Home Assistant
  - handle multiple timers with names
  - handle extending / shortening timer length
  - handle listing running timers
- Music
  - parse the intent JSON from Rhasspy for the name of the band/album/song
  - use the Home Assistant Spotify integration to play music
  - handle skipping forward/back between tracks
  - pause/unpause the music
- Anything else
  - create an integration with ChatGPT
  - send the intent payload to the API
  - listen for the response
  - TTS the response to the speakers
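The timer ideas above can be sketched in plain Python to pin down the behaviour before wiring anything into Home Assistant. This is only an in-memory stand-in, not the Home Assistant implementation: `Timer`, `TimerRegistry` and the `minutes`/`seconds` slot names are all hypothetical, and the clock is injectable so the example can run without waiting.

```python
import time
from dataclasses import dataclass


@dataclass
class Timer:
    name: str
    ends_at: float  # clock value (in seconds) at which the timer fires


class TimerRegistry:
    """Keeps named timers in memory; a stand-in for timers held in Home Assistant."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._timers = {}

    def set(self, name: str, slots: dict) -> Timer:
        # Slots are assumed to carry optional "minutes" and "seconds" values.
        duration = 60 * int(slots.get("minutes", 0)) + int(slots.get("seconds", 0))
        if duration <= 0:
            raise ValueError("timer needs a positive duration")
        timer = Timer(name, self._clock() + duration)
        self._timers[name] = timer
        return timer

    def adjust(self, name: str, delta_seconds: int) -> Timer:
        # Positive delta extends the timer, negative shortens it (never into the past).
        timer = self._timers[name]
        timer.ends_at = max(self._clock(), timer.ends_at + delta_seconds)
        return timer

    def remove(self, name: str) -> None:
        self._timers.pop(name, None)

    def remaining(self) -> dict:
        # Seconds left per timer, for a "list my timers" intent.
        now = self._clock()
        return {
            name: max(0, round(timer.ends_at - now))
            for name, timer in self._timers.items()
        }


fake_now = [0.0]  # fake clock so the example is deterministic
timers = TimerRegistry(clock=lambda: fake_now[0])
timers.set("pasta", {"minutes": 10})
timers.set("eggs", {"minutes": 6, "seconds": 30})
timers.adjust("pasta", -120)  # shorten pasta by two minutes
fake_now[0] = 60.0            # one minute later
print(timers.remaining())     # {'pasta': 420, 'eggs': 330}
```

Naming each timer keeps "extend the pasta timer" unambiguous when several are running at once, which is exactly where Google Assistant tends to trip up for me.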
Hardware
- "Old faithful" Raspberry Pi 3b
- PS3 Eye Camera (good for microphone array)
- Home Assistant server
TO BE CONTINUED...