Building A ChatGPT-4 Voice Assistant With Vivid U
By Gavin85 in CircuitsComputers
Published Feb 26th, 2024
1 x Vivid Unit
1 x DC 5V USB type-C adapter (or 1 x Ethernet cable if you have a PoE port available)
Vivid Unit has everything needed on the hardware side; we just need to power it and connect it to the Internet.
Vivid Unit is not that power hungry: a 5V/2A power adapter should be good enough. If you still worry, give it 2.5A.
You also need an Internet connection, either wired or wireless. Internet access is required to install the software, and Lucy also needs it while running.
Step 1: The Idea
ChatGPT is an advanced artificial intelligence developed by OpenAI, designed to engage in natural language conversations. It can assist users with tasks, answer questions, brainstorm ideas, and even generate text in different styles, making it a versatile tool for communication, learning, and problem-solving.
Text is the bridge between ChatGPT and humans. If we use speech recognition to convert what we say into text, we can talk to ChatGPT. If we use text-to-speech (TTS) to read the text ChatGPT generates, we can hear ChatGPT too.
The idea is straightforward and really nothing unique, but I like it and did it for fun. I will use free APIs/services only, so the only investment is time. I learned a lot from it and enjoyed the process.
Step 2: Preparation
The program will be written in Python, and I need to find packages that provide those functionalities.
To convert our speech to text, we need a speech recognition package. I stumbled across the SpeechRecognition project and was very impressed: it offers APIs to access different speech-to-text (STT) engines, and some of them can even work offline. I decided to use it because it makes switching between different STT engines very easy, which maximizes the fun.
There are many projects that provide APIs to access ChatGPT. My favorite is the gpt4free project. It provides APIs to access different AI engines from various providers. Again, I chose it because that maximizes the fun.
After ChatGPT responds with text, we need to convert it to speech (usually in MP3 format). I was hoping to find a project that allows me to easily switch between different TTS engines, but I could not find one. I tried pyttsx3 but felt its voice quality (using espeak) is terrible. I eventually chose gTTS, which offers much better voice quality. The downside, however, is that it needs a network connection during usage. Considering we need a network connection for the ChatGPT service anyway, this is acceptable.
Playback
We still need to play back the MP3 generated by the TTS engine. The simplest way is to save the MP3 as a file and use the os.system() function to call any player that can play MP3 files. However, generating an MP3 file feels less elegant. I finally used the mixer in pygame, which can play back MP3 data without actually generating a file.
Package Installation
Vivid Unit comes with Python 3 installed, but it doesn't have pip (the Python package manager) yet. It will be convenient to have pip to help install packages, so we install pip first:
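Assuming the stock Debian-based system on Vivid Unit, something like this should do it:

sudo apt-get update
sudo apt-get install python3-pip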
We install the "SpeechRecognition" package, and we will use it to convert our speech to text:
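pip3 install SpeechRecognition

Note that using sr.Microphone also requires PyAudio; on a Debian-based system that usually means installing the portaudio headers first (sudo apt-get install portaudio19-dev) and then running pip3 install pyaudio.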
Install the "pygame" package, which is used for playback without actually generating the MP3 file.
Install the "gTTS" package, so it can actually read the text loudly.
import re
from io import BytesIO

import speech_recognition as sr
import g4f
import pygame
from gtts import gTTS

pygame.init()
pygame.mixer.init()
r = sr.Recognizer()

def speak(txt):
    # generate the MP3 in memory and hand it to pygame's mixer
    mp3_file_object = BytesIO()
    speech = gTTS(text=txt, slow=False, lang='en', tld='us')
    speech.write_to_fp(mp3_file_object)
    mp3_file_object.seek(0)
    pygame.mixer.music.load(mp3_file_object, 'mp3')
    pygame.mixer.music.play()  # note: play() does not block; see the discussion later

if __name__ == '__main__':
    while True:
        try:
            with sr.Microphone() as mic:
                voice = r.listen(mic)            # record from the microphone
                txt = r.recognize_google(voice)  # speech-to-text via Google
                print('\n\nQ: ' + txt + '\n')
                resp = g4f.ChatCompletion.create(
                    model=g4f.models.gpt_4,
                    provider=g4f.Provider.Bing,
                    messages=[{"role": "user", "content": txt}],
                    stream=False,
                )
                # strip characters that gTTS would read out awkwardly
                resp = re.sub('[^A-Za-z0-9 ,.:_\'\"]+', '', resp)
                speak(resp)
        except sr.UnknownValueError:
            print("Something went wrong.\n")
The speak() function accepts a text parameter and calls gTTS to generate the corresponding MP3, then uses pygame.mixer to play the MP3 without saving it to a file.
This program demonstrates how to convert human speech (from the microphone) to text, and then forward that text to ChatGPT-4. As a very simple example, it works and you can chat with it already. Try asking some simple questions like "what day is today?" or "which country has the tallest people?", and you will find it actually answers.
Lack of context
This is the biggest problem. The AI starts a new conversation whenever you ask a new question, and you will feel frustrated because it doesn't remember anything you said previously. This prevents you from chatting deeply about a topic with the AI.
Too much output
This problem is also very obvious. ChatGPT likes to talk, a lot. It sometimes generates thousands of words to answer your short and simple question, and you may have to wait a long time while all that output is processed.
Lack of GUI
It would be much nicer if the voice assistant had its own GUI, instead of printing the output on the console.
What to do?
If you look at the function call that gets the response from ChatGPT:
resp = g4f.ChatCompletion.create(
    model=g4f.models.gpt_4,
    provider=g4f.Provider.Bing,
    messages=[{"role": "user", "content": txt}],
    stream=False,
)
You can see the "messages" parameter is actually an array, and each element is an object. In the simple example above, we always pass a new array that contains only one element (the new question) as the messages parameter.
If we always use the same array as the messages parameter, and append the answer from ChatGPT to that same array, then the AI will know what we have discussed before. Of course, each newly asked question should also be appended to the same array.
You can imagine this brings some pressure to the device, to the network, and also to ChatGPT, because you are sending more and more data as the chat goes on. Although the data is just plain text and not that big, we should still control how much context to keep -- and that is practical too: the AI is not likely to need information you mentioned 58 questions ago. So we can define a constant, say MAX_CONTEXT, and set its value to 32. Every time we append something to the array, we check the array size; if it is bigger than MAX_CONTEXT, we delete its first two elements (the oldest question and answer).
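In code, that bookkeeping looks like this (a minimal sketch using the chat_data array and MAX_CONTEXT constant that appear in the full listing later):

chat_data.append({'role': 'user', 'content': txt})        # the new question
# ... get resp from ChatGPT using messages=chat_data ...
chat_data.append({'role': 'assistant', 'content': resp})  # its answer
if len(chat_data) > MAX_CONTEXT:
    del chat_data[0:2]  # drop the oldest question/answer pair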
resp = g4f.ChatCompletion.create(
    model=g4f.models.gpt_4,
    provider=g4f.Provider.Bing,
    messages=chat_data,
    stream=False,
)
resp = re.sub('[^A-Za-z0-9 ,.:_\'\"]+', '', resp)
speak(resp)
The good news is that ChatGPT supports output as a stream: instead of outputting everything from the buffer at once, it outputs text piece by piece. You eventually get the same output text, but this way you have a chance to access the early text while the rest is still being output.
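With stream=True, the return value of g4f.ChatCompletion.create() becomes iterable and you receive the text chunk by chunk (this is also how the full program below consumes it):

resp = g4f.ChatCompletion.create(
    model=g4f.models.gpt_4,
    provider=g4f.Provider.Bing,
    messages=chat_data,
    stream=True,      # ask for streamed output
)
for chunk in resp:    # chunks arrive while the rest is still being generated
    print(chunk, end='', flush=True)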
The bad news, however, is that I could not find a TTS engine that can generate streamed audio from streamed text.
The solution
I start two threads: one as text generator and the other as text consumer.
The text generator thread runs a loop that keeps getting output text from ChatGPT and puts the text into a queue.
The text consumer thread runs a loop that keeps taking text from the queue and assembling it into a sentence. Once a sentence is complete, it calls the speak() function to read it out.
This way a sentence is read out as soon as it is ready, with no need to wait for the other sentences to be processed.
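Stripped of all the GUI and state handling, the pattern looks roughly like this (the chunk list is a stand-in for the streamed ChatGPT output, and print() stands in for speak()):

import queue
import threading
import time

q = queue.Queue()
chunks = ['The first ', 'sentence.', ' And then ', 'a second one.']  # fake streamed output

def generator():
    # producer: push each streamed chunk into the queue as it arrives
    for chunk in chunks:
        q.put(chunk)
        time.sleep(0.2)  # simulate network latency

def consumer():
    # consumer: glue chunks together and "speak" each completed sentence
    sentence = ''
    while True:
        sentence += q.get()
        if sentence.endswith(('.', '!', '?')):
            print('speak:', sentence)  # the real program calls speak() here
            sentence = ''

threading.Thread(target=generator, daemon=True).start()
threading.Thread(target=consumer, daemon=True).start()
time.sleep(2)  # give the demo time to run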
The speak() function becomes a member function of the QueueProcessingThread class (the text consumer thread). Because it calls the mixer.music.play() function, which does not block during playback, I had to add a while loop to make it blocking, otherwise it would try to play the next sentence before the current playback is done.
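The waiting loop itself is short; this fragment shows the end of speak() (the same approach appears in the full listing below):

pygame.mixer.music.play()
# poll the mixer so speak() only returns after the sentence has been heard
while pygame.mixer.music.get_busy():
    time.sleep(0.1)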
I create a fullscreen window as the GUI for this voice assistant. The conversation is displayed on the screen as the chat goes on. I also define three states for the program: inactive, active and listening.
When the program has just launched, it is in the "inactive" state: the screen is black and it will not react to what you say.
If you say something that contains "Lucy", that triggers it: its state becomes "active" and immediately goes to "listening": the screen is green and it listens to your question.
After you ask a question, its state goes back to "active" while ChatGPT is outputting the answer: the screen is purple and your speech is ignored. After all the output has been read out, the state goes to "listening" again.
If you haven't said anything for a while, the state goes back to "inactive".
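The trigger itself is just a substring check on the recognized text; a simplified sketch (NAME holds "Lucy", RESP holds "Yes?", and set_state() switches the CSS classes, all per the full listing below):

if not window.active:
    if NAME in txt:                    # wake word heard anywhere in the sentence
        window.set_state(True, False)  # go "active" while greeting
        q.put(RESP)                    # the consumer thread reads "Yes?" aloud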
There are several providers for the ChatGPT-4 service, and the gpt4free project gives a very detailed list of them. In that list, two of them (openai and raycast) need authentication, which makes them harder to use and (most probably) not free. Also, GeekGpt is no longer available, so there are currently only three remaining:
Bing (bing.com)
Liaobots (liaobots.site)
You (you.com)
When I did my testing, I could not make Liaobots work. I am not sure if it was a temporary issue.
Bing and You both work quite well. I personally like You better, because I like the way it speaks: it tends to speak less and keep it simple. Bing, on the other hand, likes to talk more, sometimes a little bit too much.
Switching provider
Switching providers is very easy: you just replace the "provider" parameter when calling the g4f.ChatCompletion.create() function. If you want to use Bing, set the provider to "g4f.Provider.Bing"; to use You, set it to "g4f.Provider.You".
resp = g4f.ChatCompletion.create(
    model=g4f.models.gpt_4,
    provider=g4f.Provider.Bing,  # set provider here
    messages=chat_data,
    stream=True,
)
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk, Gdk, Pango, GLib
import threading
import time
import re
import queue
from io import BytesIO
import speech_recognition as sr
import g4f
import pygame
from gtts import gTTS
import sounddevice
NAME = 'Lucy'
RESP = 'Yes?'
BYE = 'Talk to you later.'
MAX_CONTEXT = 32
MAX_INACTIVE = 60
pygame.init()
q = queue.Queue()
r = sr.Recognizer()
get_sentence = False
sentence = ''
output_done = False
speech_done = True
chat_data = []
active_ts = 0
class ChatView(Gtk.TextView):
    def __init__(self):
        Gtk.TextView.__init__(self)
        self.set_wrap_mode(Gtk.WrapMode.WORD)
        self.set_editable(False)
        self.set_cursor_visible(False)
        text_buffer = self.get_buffer()
        text_iter_end = text_buffer.get_end_iter()
        self.text_mark_end = text_buffer.create_mark("", text_iter_end, False)

    def append_text(self, text):
        # append at the end and keep the view scrolled to the newest text
        text_buffer = self.get_buffer()
        text_buffer.insert(text_buffer.get_end_iter(), text)
        self.scroll_to_mark(self.text_mark_end, 0, False, 0, 0)

    def clear_text(self):
        text_buffer = self.get_buffer()
        text_iter_start = text_buffer.get_start_iter()
        text_iter_end = text_buffer.get_end_iter()
        text_buffer.delete(text_iter_start, text_iter_end)
class LucyWindow(Gtk.Window):
    active = False
    listening = False
    chat_view = ChatView()

    def __init__(self):
        Gtk.Window.__init__(self)
        self.set_title('Lucy')
        self.fullscreen()
        self.set_default_size(640, 360)
        self.grid = Gtk.Grid()
        self.scrolled_win = Gtk.ScrolledWindow()
        self.scrolled_win.set_hexpand(True)
        self.scrolled_win.set_vexpand(True)
        self.scrolled_win.add(self.chat_view)
        self.scrolled_win.set_policy(Gtk.PolicyType.NEVER, Gtk.PolicyType.AUTOMATIC)
        text_box = Gtk.Box(orientation=Gtk.Orientation.VERTICAL, spacing=20)
        text_box.set_margin_top(20)
        text_box.set_margin_bottom(20)
        text_box.set_margin_start(20)
        text_box.set_margin_end(20)
        text_box.add(self.scrolled_win)
        self.grid.add(text_box)
        self.add(self.grid)
        self.connect('destroy', Gtk.main_quit)
        self.show_all()

    def set_state(self, active, listening):
        # switch the CSS classes according to the inactive/active/listening states
        self.active = active
        self.listening = listening
        window_context = self.get_style_context()
        window_context.remove_class('inactive')
        window_context.remove_class('active')
        window_context.remove_class('listening')
        view_context = self.chat_view.get_style_context()
        view_context.remove_class('inactive')
        view_context.remove_class('active')
        view_context.remove_class('listening')
        if active:
            if listening:
                window_context.add_class('listening')
                view_context.add_class('listening')
            else:
                window_context.add_class('active')
                view_context.add_class('active')
        else:
            window_context.add_class('inactive')
            view_context.add_class('inactive')
class QueueProcessingThread(threading.Thread):
    window = None

    def speak(self, txt):
        # generate the MP3 in memory, then block until playback is finished
        mp3_file_object = BytesIO()
        speech = gTTS(text=txt, slow=False, lang='en', tld='us')
        speech.write_to_fp(mp3_file_object)
        mp3_file_object.seek(0)
        pygame.mixer.music.load(mp3_file_object, 'mp3')
        pygame.mixer.music.play()
        while pygame.mixer.music.get_busy():
            time.sleep(0.1)

    def run(self):
        global get_sentence
        global sentence
        global output_done
        global speech_done
        while True:
            if get_sentence:
                item = q.get()
                sentence += item
                q.task_done()
                if item.endswith(".") or item.endswith("!") or item.endswith("?") or (output_done and q.empty()):
                    self.speak(sentence)
                    sentence = ''
                    get_sentence = False
            else:
                if q.empty():
                    if output_done:
                        if not speech_done:
                            speech_done = True
                            if self.window.active:
                                self.window.set_state(True, True)
                else:
                    if output_done:
                        get_sentence = True
class VoiceRecognizingThread(threading.Thread):
    window = None

    def run(self):
        global get_sentence
        global output_done
        global speech_done
        global chat_data
        global active_ts
        while True:
            Gtk.main_iteration_do(False)
            try:
                with sr.Microphone(sample_rate=44100) as mic:
                    if not speech_done:
                        continue
                    ts = time.time()
                    if self.window.active and active_ts and (ts - active_ts) > MAX_INACTIVE:
                        # idle for too long: say goodbye and go back to "inactive"
                        active_ts = 0
                        self.window.set_state(False, False)
                        speech_done = False
                        q.put(BYE)
                        get_sentence = True
                        output_done = True
                        GLib.idle_add(self.window.chat_view.clear_text)
                    voice = r.listen(mic)
                    txt = r.recognize_google(voice)
                    active_ts = ts
                    if not self.window.active:
                        if NAME in txt:
                            # wake word heard: acknowledge and start listening
                            self.window.set_state(True, False)
                            speech_done = False
                            q.put(RESP)
                            get_sentence = True
                            output_done = True
                    else:
                        active_ts = ts
                        output_done = False
                        speech_done = False
                        GLib.idle_add(self.window.chat_view.append_text, '\n\nQ: ' + txt + '\nA: ')
                        self.window.set_state(True, False)
                        # append the new question to the shared context array
                        chat_data.append({'role': 'user', 'content': txt})
                        resp = g4f.ChatCompletion.create(
                            model=g4f.models.gpt_4,
                            provider=g4f.Provider.You,
                            messages=chat_data,
                            stream=True,
                        )
                        answer = ''
                        for message in resp:
                            msg = re.sub('[^A-Za-z0-9 ,.:_\'\"\+\-\*\/=]+', '', message.replace('**', ''))
                            GLib.idle_add(self.window.chat_view.append_text, msg)
                            answer += msg
                            q.put(msg)
                            if msg.endswith("."):
                                get_sentence = True
                        output_done = True
                        # remember the answer, then trim the oldest Q/A pair if needed
                        chat_data.append({'role': 'assistant', 'content': answer})
                        if len(chat_data) > MAX_CONTEXT:
                            del chat_data[0:2]
                        active_ts = time.time()
            except sr.UnknownValueError:
                # nothing intelligible was heard; listen again
                continue
if __name__ == '__main__':
    # load CSS
    screen = Gdk.Screen.get_default()
    provider = Gtk.CssProvider()
    style_context = Gtk.StyleContext()
    style_context.add_provider_for_screen(
        screen, provider, Gtk.STYLE_PROVIDER_PRIORITY_APPLICATION
    )
    css = b"""
    textview {
        font: 25px Arial;
        background: transparent;
    }
    textview text {
        color: white;
        background: transparent;
    }
    textview.inactive text {
        color: black;
    }
    window.inactive {
        background: black;
    }
    window.active {
        background: #7700df;
    }
    window.listening {
        background: #008c8c;
    }
    """
    provider.load_from_data(css)
    # Lucy window
    win = LucyWindow()
    win.set_state(False, False)  # start in the "inactive" state
    # start the text consumer and the voice recognizer threads
    consumer_thread = QueueProcessingThread(daemon=True)
    consumer_thread.window = win
    consumer_thread.start()
    recognizer_thread = VoiceRecognizingThread(daemon=True)
    recognizer_thread.window = win
    recognizer_thread.start()
    Gtk.main()
Step 9: The Result
Below is a video that shows how Lucy works. As you can see, it does remember the context during the conversation.
Sometimes sentences are incorrectly joined without a period or comma, and the gTTS engine just reads them out that way. I think this can be improved by tuning the text consumer thread (QueueProcessingThread).
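For example, one possible (untested, purely illustrative) tweak is to insert a comma when two fragments join without any punctuation, so gTTS takes a breath between them:

# hypothetical tweak inside QueueProcessingThread.run(), where the queue
# items are glued together: add a comma when fragments join with no punctuation
if sentence and sentence[-1].isalnum() and item and item[0].isalnum():
    sentence += ', '
sentence += item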
Lucy's performance can be significantly affected by the network situation. Lucy uses several APIs that require an Internet connection. If the network is slow, or the service's server responds late, Lucy may answer much later than you expected.
Offline version?
The SpeechRecognition library does provide some APIs that can work offline (e.g. the Vosk API). I tried them and can confirm they indeed work locally on Vivid Unit. However, the recognition accuracy is not as good as Google Speech Recognition.
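If you want to try it, the change is small: install vosk (pip3 install vosk), download a Vosk model and unpack it into a folder named "model" in the working directory (that is where the library looks for it), then swap the recognizer call:

txt = r.recognize_vosk(voice)  # offline, instead of r.recognize_google(voice)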
The text-to-speech engine can also be switched to an offline one: pyttsx3. But the voice quality is really bad and you will not like it.
As for the ChatGPT-4 service, it definitely needs an Internet connection. It may be possible to run a simplified LLM locally on Vivid Unit, but that would be very slow and not practical.
With that said, if we really made Lucy fully offline, she would unfortunately be quite unusable.
Can Lucy do more?
Definitely! Vivid Unit comes with GPIOs and ADC channels, so it is possible to let Lucy control some external circuits, read data from sensors, etc. She could actually become the central unit of a home automation system.
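As a rough illustration only (the chip name, line number and trigger phrase are all made up here, and I am assuming the libgpiod v1 Python bindings are available), the recognized text could be checked for commands before it is sent to ChatGPT:

import gpiod

# hypothetical: drive an LED or relay on line 17 of gpiochip0
chip = gpiod.Chip('gpiochip0')
line = chip.get_line(17)
line.request(consumer='lucy', type=gpiod.LINE_REQ_DIR_OUT)

def maybe_handle_command(txt):
    # return True if txt was a local command, so it is not forwarded to ChatGPT
    if 'turn on the light' in txt.lower():
        line.set_value(1)
        return True
    if 'turn off the light' in txt.lower():
        line.set_value(0)
        return True
    return False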