LocalVocal: Seamless Live Transcriptions On-the-Go

LocalVocal: Seamless Live Transcriptions On-the-Go v0.1.0

royshilkrot

Member
royshilkrot submitted a new resource:

LocalVocal - Live stream AI assistant - Real-time, local transcribe speech to captions - no GPU, no cloud costs, no network, no downtime!

LocalVocal live-streaming AI assistant plugin allows you to transcribe, locally on your machine, audio speech into text and perform various language processing functions on the text using AI / LLMs (Large Language Models). ✅ No GPU required, ✅ no cloud costs, ✅ no network and ✅ no downtime! Privacy first - all data stays on your machine.

Current Features:
  • Transcribe audio to text in real time in 100 languages
  • Display captions on screen using text sources
Roadmap...

Read more about this resource...
 

LaughterOnWater

New Member
Tried it. It crashed OBS. Also, Norton wouldn't allow part of the install:

Norton:
Category: Data Protector
Date & Time,Risk,Activity,Status,Recommended Action,Status,Program Path,Program Name,Date & Time,Action Observed,Target
8/18/2023 8:40:21 AM,High,Data Protector blocked a suspicious action by obs-localvocal-0.0.1-windows-x64-Installer.tmp,Action Blocked,No Action Required,Action Blocked,C:\Users\Chris\AppData\Local\Temp\is-F3AET.tmp\obs-localvocal-0.0.1-windows-x64-Installer.tmp,obs-localvocal-0.0.1-windows-x64-Installer.tmp,8/18/2023 8:40:21 AM,Suspicious process attempted to open a file protected by Data Protector,C:\ProgramData\Microsoft\Windows\Start Menu\Programs\obs-localvocal\obs-localvocal on the Web.url

I've added the crash report to github. https://github.com/royshil/obs-localvocal/issues/2#issuecomment-1683881228
 

Alisizz

New Member
Hi @royshilkrot

I am a newbie OBS developer, and I noticed that in another post you mentioned:
i'm also thinking about real-time auto translation to other languages utilizing Speech-to-text -> Translation -> Text-to-speech

How can I achieve the following process? (It's quite similar to what you are planning.)

I am livestreaming and I need an automatic audio reply to the text my audience types in the chat box. I already collect the real-time text, but how can I get the rest of the process done? Thanks very much for your help!

Real-time text (collected from the livestream page and stored on my server) -> AI model with a prompt (e.g. GPT, via an API I already built on my server) to handle the real-time text and generate answers -> generated answers converted to speech -> speech broadcast via OBS to the audience
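The four-stage pipeline above can be sketched in Python with each stage stubbed out. This is only an illustrative skeleton: `fetch_chat_text`, `generate_answer`, `synthesize_speech`, and `broadcast` are hypothetical names standing in for the poster's own chat server, GPT API, a TTS model, and an OBS hand-off (e.g. a media-source file or an RTMP push), none of which are specified in the thread.

```python
# Illustrative skeleton of the chat-reply pipeline; every stage is a stub.
def fetch_chat_text():
    """Stage 1: real-time text pulled from the server storing chat messages."""
    return "What game is this?"  # placeholder message

def generate_answer(text):
    """Stage 2: prompt an LLM (e.g. the GPT API the poster already runs)."""
    return f"Answer to: {text}"  # placeholder response

def synthesize_speech(answer):
    """Stage 3: TTS; a real implementation would return audio samples."""
    return answer.encode("utf-8")  # placeholder "audio" bytes

def broadcast(audio):
    """Stage 4: hand the audio to OBS, e.g. write a file a media source
    plays, or push an RTMP/SRT stream."""
    return len(audio)  # report how many bytes were sent

def run_pipeline():
    text = fetch_chat_text()
    audio = synthesize_speech(generate_answer(text))
    return broadcast(audio)
```

Each stub would be replaced by a real call; the value of the sketch is just making the data flow between the four stages explicit.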
 

royshilkrot

Member
Hi @royshilkrot

I am a newbie OBS developer, and I noticed that in another post you mentioned:
i'm also thinking about real-time auto translation to other languages utilizing Speech-to-text -> Translation -> Text-to-speech

How can I achieve the following process? (It's quite similar to what you are planning.)

I am livestreaming and I need an automatic audio reply to the text my audience types in the chat box. I already collect the real-time text, but how can I get the rest of the process done? Thanks very much for your help!

Real-time text (collected from the livestream page and stored on my server) -> AI model with a prompt (e.g. GPT, via an API I already built on my server) to handle the real-time text and generate answers -> generated answers converted to speech -> speech broadcast via OBS to the audience
Hi
This may be difficult to achieve inside OBS itself. But text-to-speech models exist, like Bark; right now they require a strong GPU. I've been researching a small model that could run inside OBS, but haven't found one yet. It's only a matter of time, though; we need to be patient.
However, if you're comfortable with coding, you can build a local Python server that runs, e.g., the Bark model. It would need to generate an audio stream that you can pick up in OBS, e.g. over RTMP... It's not an easy task, but it's possible. In Python you could use the GStreamer API to build the stream and push data into it from the text-to-speech engine.
That's what I'm currently thinking.
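A minimal sketch of the "local TTS server" idea, under stated assumptions: Bark and GStreamer are heavy dependencies, so `synthesize()` below is a placeholder tone generator standing in for a real TTS model, and the output is written as a WAV file that an OBS media source could play rather than a live RTMP stream. The 24 kHz sample rate is an assumption, not something specified in the thread.

```python
# Text in, audio out: a stand-in for the local TTS server idea.
# synthesize() is a placeholder (a sine tone); swap it for a real model call.
import io
import math
import struct
import wave

SAMPLE_RATE = 24000  # assumed sample rate for the sketch

def synthesize(text: str) -> bytes:
    """Placeholder for a TTS model: returns 16-bit mono PCM.

    Duration scales with text length so downstream timing is realistic."""
    duration_s = max(0.5, 0.06 * len(text))  # rough speaking pace, hypothetical
    n = int(SAMPLE_RATE * duration_s)
    samples = (int(8000 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE))
               for i in range(n))
    return b"".join(struct.pack("<h", s) for s in samples)

def text_to_wav(text: str) -> bytes:
    """Wrap the PCM in a WAV container an OBS media source can play."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(synthesize(text))
    return buf.getvalue()

if __name__ == "__main__":
    with open("reply.wav", "wb") as f:
        f.write(text_to_wav("Hello chat, thanks for the question!"))
```

For true streaming rather than file drops, the WAV-writing step would be replaced by pushing the PCM into a GStreamer pipeline (e.g. via an `appsrc` element), as the reply above suggests.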
 

Alisizz

New Member
Hi
This may be difficult to achieve inside OBS itself. But text-to-speech models exist, like Bark; right now they require a strong GPU. I've been researching a small model that could run inside OBS, but haven't found one yet. It's only a matter of time, though; we need to be patient.
However, if you're comfortable with coding, you can build a local Python server that runs, e.g., the Bark model. It would need to generate an audio stream that you can pick up in OBS, e.g. over RTMP... It's not an easy task, but it's possible. In Python you could use the GStreamer API to build the stream and push data into it from the text-to-speech engine.
That's what I'm currently thinking.
Many thanks for your reply. I'm going to try building this. Thanks again!
 

appa561

New Member
Installation was pretty straightforward. Tiny seems to miss or incorrectly identify words easily. Base does much better. I'm curious how much impact each jump in Whisper model has on the system. The real reason I am hoping to use this plugin is for translation. When selecting models other than (Eng), the download fails.

Mostly, my use case would be my English speech to another language CC... I have an international audience in my Twitch stream. The Spanish speakers are the ones that struggle the most with the spoken word, so most of the time the CC would be in Spanish. I can foresee a need to do other languages, depending on who is in the majority.
 

BenAndo

Member
Installation was pretty straightforward. Tiny seems to miss or incorrectly identify words easily. Base does much better. I'm curious how much impact each jump in Whisper model has on the system. The real reason I am hoping to use this plugin is for translation. When selecting models other than (Eng), the download fails.

Mostly, my use case would be my English speech to another language CC... I have an international audience in my Twitch stream. The Spanish speakers are the ones that struggle the most with the spoken word, so most of the time the CC would be in Spanish. I can foresee a need to do other languages, depending on who is in the majority.
I had the same issue adding in other models. See this Github issue with instructions on how to manually add them in: https://github.com/royshil/obs-localvocal/issues/5
 

BenAndo

Member
Is it possible for this to display words in real-time? It seems to wait for a sentence's worth of words before displaying them. I've tried all the settings I can think of but nothing seems to speed it up. As such, it's always a good 3-8 seconds behind what was said.
Perhaps when GPU support is added it'll be closer to real-time?
 

royshilkrot

Member
Is it possible for this to display words in real-time? It seems to wait for a sentence's worth of words before displaying them. I've tried all the settings I can think of but nothing seems to speed it up. As such, it's always a good 3-8 seconds behind what was said.
Perhaps when GPU support is added it'll be closer to real-time?
Thanks for using the plugin!
I'll look at shorter time buffers and perhaps make the buffer length parametric so you have control. The minimum is 1 second, though; that's a whisper.cpp constraint. Please open an issue for it so we can keep track.
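Conceptually, a parametric buffer like the one described would accumulate audio into chunks of a configurable length, clamped to the 1-second whisper.cpp floor. This sketch is not the plugin's actual code (the plugin is a C++ OBS module); the function and parameter names are hypothetical, and the 16 kHz rate reflects what Whisper models expect as input.

```python
# Conceptual sketch of a parametric transcription buffer, not the plugin's code.
SAMPLE_RATE = 16000  # Whisper models take 16 kHz mono audio

def chunk_audio(samples, buffer_seconds=3.0):
    """Yield fixed-length chunks of samples.

    buffer_seconds is clamped to >= 1.0, matching the whisper.cpp minimum
    mentioned above; smaller requests silently round up."""
    buffer_seconds = max(1.0, buffer_seconds)
    chunk_len = int(SAMPLE_RATE * buffer_seconds)
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        if chunk:
            yield chunk
```

Shorter buffers would reduce the 3-8 second display lag at the cost of giving the model less context per inference, which is the trade-off a user-facing parameter would expose.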
 

Destroy666

Member
I'm curious how much impact each jump in Whisper model has on the system.

While tiny -> base is still OK, base -> small can be tough: small takes around 10 seconds of heavier CPU usage even on fairly modern CPUs. There's also a GPU option now, which worked better (though not perfectly) for me when testing Whisper directly; I'll have to test it with the plugin.
 