Looking for audio plugin to remove voices

BarrySDCA · Aug 21, 2023

Hello,

I am using OBS for a live webcam. I have a mic setup outside to capture natural sounds but I need a way to filter out people talking. I don't want to inadvertently capture a conversation, etc.. Does anyone know of such a plugin?

thank you much

AaronD · Aug 21, 2023

Hmm... This is the exact opposite problem from a Noise Suppressor, so my first thought was to have two parallel paths - one noise-suppressed to end up with voices only, and the other delayed to line up with it - then subtract the voices from everything else. (invert one, then mix)

Two problems with that:

OBS doesn't have a delay filter, so you can't line them up. Not lining them up completely kills the cancelling effect. So you'd have to use a DAW for all that, which is certainly possible, but:
The Noise Suppressor can't be perfect. The best you can do will probably still have some residual voices in it. It might be worth trying though - in a DAW, not OBS - to see if the voices are reduced enough.

DAW = Digital Audio Workstation. Essentially a complete sound studio in one app. It only does sound, and it does it REALLY WELL!!! There are some free, some paid, some are better featured than others, and there doesn't seem to be much correlation between price and functionality anymore.

I like this one myself:

https://ardour.org/

On Debian and its derivatives, including *buntu and similar: sudo apt install ardour
Ubuntu Studio has it preinstalled, along with an old version of OBS. Install the PPA to update OBS to the current version, and you have a quite capable rig!

And a noise suppressor plugin for it:

GitHub - lucianodato/noise-repellent: Lv2 suite of plugins for broadband noise reduction

Lv2 suite of plugins for broadband noise reduction - lucianodato/noise-repellent

github.com

BarrySDCA · Aug 22, 2023

Thanks. I don't have control over the voices or background sound. it's just a mic setup outside on the bay to catch the boat and bird sounds. but sometimes people stand near by and I catch them. I want to remove the people. do you think this has a fighting chance? I don't mind trying if you do. thank you

AaronD · Aug 22, 2023

All of the above is free, so the only cost is your time to play with it. Seems like a cheap experiment to me.

---

Another thing you might look at, instead of just a simple subtraction, is echo cancellation. This does not replace the noise suppressor; only the subtraction.

Echo cancellation is designed for things like speakerphones, where the mic and speaker are right next to each other, and you still want to talk over the other party.

In its original application, the echo canceller takes the raw mic and the final speaker signal, and figures out what to take out of the mic before passing it on to the rest of the system. It involves quite a lot of processing, not a simple subtraction.
In your case, it would take the noise-suppressed mic (voices only) as the "speaker" to be removed, and the raw mic as the mic.

Unfortunately, I don't know of a good one of those. I'm sure there *are* some, now that you know the term to look for.

---

Another thing that comes to mind, though I doubt it'd work very well for you, is to use multiple mics like the Grateful Dead did for vocals in front of their "Wall of Sound" experiment. They had two identical mics for each vocal, one subtracted from the other, with the idea that the close-up voice would not be equal in both and thus *not* cancel completely, while the PA *would* be equal in both and thus cancel. It...worked for that specific purpose, but it didn't sound very good for the vocals themselves. And you want to do the opposite anyway.

---

Once we have multiple mics though, another idea is to spread them out. The ambient sounds that you want, won't be much different at all no matter which mic you're listening to, but when someone talks close to one, that's definitely different! So you could split off each one into its own noise suppressor, so now you have a voice detector for each one, and then you can use that to kill the raw signal that it's associated with.

As a first attempt on *that*, I'd feed the noise-suppressed version into the side-chain of an aggressive compressor, then mix them all together into a more gentle bus compressor, just to account for the overall level changing if one of them drops out.

This is also free, except for mics, cabling, and preamps.

jebba · Sep 11, 2023

BarrySDCA said:
I am using OBS for a live webcam. I have a mic setup outside to capture natural sounds but I need a way to filter out people talking. I don't want to inadvertently capture a conversation, etc.. Does anyone know of such a plugin?

I would very much like the same thing, for the same reason. I haven't found anything yet.

There have been large strides made in the last year or so doing this with music, where each instrument and voice can be separated into a different track (e.g. drums.wav, voice.wav, guitar.wav). Here is an example:

GitHub - Anjok07/ultimatevocalremovergui: GUI for a Vocal Remover that uses Deep Neural Networks.

GUI for a Vocal Remover that uses Deep Neural Networks. - Anjok07/ultimatevocalremovergui

github.com

That is a GUI for a couple of the more widely used models. Those models are based on music repositories though, so aren't very well suited for our purpose.

There is another ML task to do similar with movies. In general this is called "speech separation". For example, they want to split out dialog (voice), music, and sound effects. One recent project in this area, with accompanying paper is here:

GitHub - JusperLee/TDANet: An efficient speech separation method

An efficient speech separation method. Contribute to JusperLee/TDANet development by creating an account on GitHub.

github.com

There was a contest for the best way to do this, if you want to explore some others:

AIcrowd | Sound Demixing Challenge 2023 | Leaderboards

Audio Source Separation using AI

www.aicrowd.com

But none of these that I have seen are suited for our particular goal. For it to work well, it would likely be best to train on a similar data set.

The other main issue is that this is to be done in "realtime" in OBS Studio. I didn't see any of the above that are set up to do this as a stream--they all presume pre-recorded audio. The models are computationally expensive...usually requiring a GPU. Nowadays, de facto, that means having an nvidia GPU that can run PyTorch.

I would also like to incorporate a bird identification stream. :)

BarrySDCA · Sep 15, 2023

thank you for all this info. I'm starting from the basics, better directional microphones. But I'll need to polish it digitally be removing any human voices. I also have a digital noise sound that comes in the vhf radio I would like to remove. It's not 24/7, just once in a while. annoying. I'll update the thread as I get further along. Please keep posting helpful information - thank you very much

jebba · Sep 17, 2023

A couple options for better directional microphones are "shotgun" mics or using a parabolic dish.

There's a lot of options out there for shotgun mics; one example would be the Sennheiser MKH 416-P48U3, which can also tolerate bad weather quite well. To protect it, you can put it inside a Rycote Windshield.

For a parabolic dish, which is more directional, there are options from Telinga and others. These setups are a dish that place a microphone in the center. Some can use lavalier mics, others larger mics.

For the hoomans' voices problem, I was also thinking a stop gap solution may be easier short term. Just detecting human voices, then just muting the audio until they are gone may be simpler to implement than trying to filter them out.

Not sure exactly what you mean about vhf there, but using cables with XLR connectors can help with noise issues.

I also see now there are other plugins, which may be helpful:

GitHub - locaal-ai/obs-localvocal: OBS plugin for local speech recognition and captioning using AI

OBS plugin for local speech recognition and captioning using AI - locaal-ai/obs-localvocal

github.com

LocalVocal: Local Live Captions & Translation On-the-Go

LocalVocal plugin allows you to transcribe & translate speech into text locally on your machine in real time. ✅ No GPU required*, ✅ no cloud costs, ✅ no network and ✅ minimal lag! Privacy first - all data stays on your machine. (* GPU...

obsproject.com

jebba · Sep 18, 2023

"RNNoise is a noise suppression library based on a recurrent neural network."

GitHub - xiph/rnnoise: Recurrent neural network for audio noise reduction

Recurrent neural network for audio noise reduction - xiph/rnnoise

github.com

More recent:

"Real-time microphone noise suppression on Linux"

GitHub - noisetorch/NoiseTorch: Real-time microphone noise suppression on Linux.

Real-time microphone noise suppression on Linux. Contribute to noisetorch/NoiseTorch development by creating an account on GitHub.

github.com

AaronD · Sep 18, 2023

Keep in mind that pretty much all of the existing tools do exactly the *opposite* of what the OP wants, and it's not a directly reversible problem.

jebba · Sep 18, 2023

AaronD said:
Keep in mind that pretty much all of the existing tools do exactly the *opposite* of what the OP wants, and it's not a directly reversible problem.

One of the issues that they mentioned was noise reduction though too, as they said they were getting interference from VHF.

Back to the voice removal, another approach is to use Voice Activity Detection, then go mute.

Voice activity detection - Wikipedia

en.wikipedia.org

AaronD · Sep 18, 2023

jebba said:
One of the issues that they mentioned was noise reduction though too, as they said they were getting interference from VHF.

"Noise Reduction" often focuses on spoken voice, and considers everything else as noise to be removed. A bit of a misnomer, until you see their primary use: business conferences.

So you can't assume that "noise reduction" uses the same definition of "noise" as you do.

jebba said:
Back to the voice removal, another approach is to use Voice Activity Detection, then go mute.

That's pretty much what I said about a month ago in this thread. :-)

jebba · Sep 18, 2023

AaronD said:
That's pretty much what I said about a month ago in this thread. :-)

Do you know of a good VAD for this?

Edit:
This looks good, is recommended by OVOS (successor to MyCroft):

GitHub - snakers4/silero-vad: Silero VAD: pre-trained enterprise-grade Voice Activity Detector

Silero VAD: pre-trained enterprise-grade Voice Activity Detector - snakers4/silero-vad

github.com

AaronD · Sep 18, 2023

jebba said:
Do you know of a good VAD for this?

Edit:
This looks good, is recommended by OVOS (successor to MyCroft):

GitHub - snakers4/silero-vad: Silero VAD: pre-trained enterprise-grade Voice Activity Detector

Silero VAD: pre-trained enterprise-grade Voice Activity Detector - snakers4/silero-vad

github.com

Does it have a binary output? Or an audio one that contains only voice? (or is supposed to at least)

If it's audio that contains only voice, then it's functionally no different from any other Noise Suppressor, whether it's a plugin to a DAW, or the filter that OBS has, or any other variation. And that's enough to trigger a side-chained compressor to make it mute. You don't need a binary output for that.

jebba · Sep 18, 2023

VAD, like silero-vad, isn't a noise suppressor.

The VAD just detects voice. It doesn't do any audio changes. The idea is once the VAD detects speech, a macro mutes OBS audio, then when the VAD detects the speech ends, the macro unmutes OBS.

jebba · Sep 18, 2023

This looks pretty cool, multi-species:

"Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection"

GitHub - nianlonggu/WhisperSeg: Code for ICASSP 2024 paper WhisperSeg: Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection

Code for ICASSP 2024 paper WhisperSeg: Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection - nianlonggu/WhisperSeg

github.com

But it is trained on very few species.

AaronD · Sep 18, 2023

jebba said:
VAD, like silero-vad, isn't a noise suppressor.

The VAD just detects voice. It doesn't do any audio changes. The idea is once the VAD detects speech, a macro mutes OBS audio, then when the VAD detects the speech ends, the macro unmutes OBS.

How is that better than an aggressive side-chained compressor?

It probably works too. Just seems like it introduces more problems than it solves in this case. How to get a binary output into something that OBS or something else can use? Do you need a Python script as a supervisor? Can Adv. SS use it? Etc.

Advanced Scene Switcher

This plugin will allow you to automate various tasks using "Macros". Macros consist of a list of conditions under which a list of actions will be performed. Examples and guides can be found in the wiki. Feel free to contribute! If you run...

obsproject.com

Compared to using a mechanism that already exists: the Compressor side-chain, which takes audio, and a Noise Suppressor that gives audio in (probably) the desired pattern.

jebba · Sep 18, 2023

The VAD is for the purpose of making sure one doesn't stream human conversation with OBS.

Noise reduction is another goal, but that is separate from the VAD.

When you talk about a "aggressive side-chained compressor" I'm not sure which you are referring to. It seems you could use those audio filters for cleaning up noise (e.g. buzz, hums, etc). I'm not sure how it would work with removing human voices though. I think I misunderstand what you mean.

Edit: Something like the Scene Switcher or similar plugins could be used hand-in-hand with the detection, perhaps.

AaronD · Sep 18, 2023

jebba said:
When you talk about a "aggressive side-chained compressor" I'm not sure which you are referring to. It seems you could use those audio filters for cleaning up noise (e.g. buzz, hums, etc). I'm not sure how it would work with removing human voices though. I think I misunderstand what you mean.

Inside the Compressor:

Normally, the Side-chain is a copy of the Input, but it doesn't have to be...

For some analog circuitry:

Design Notes – THAT Corporation

thatcorp.com

https://thatcorp.com/datashts/dn00A.pdf

https://thatcorp.com/datashts/dn102.pdf

Notice that even in the first app note (dn00A), the detector taps off right at the input and runs independently from there. So even that design can be disconnected at that point and fed a different signal, to make it work per the diagrams here.

I've noticed that software people tend towards more Rube Goldberg solutions than most do, maybe because the software world is much like this anyway:

Software Development

xkcd.com

2021: Software Development - explain xkcd

Explain xkcd is a wiki dedicated to explaining the webcomic xkcd. Go figure.

www.explainxkcd.com

Is that where you're coming from?

Here's where I'm coming from:

- YouTube

Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.

www.youtube.com

- YouTube

Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.

www.youtube.com

I built that rig. :-) And here's that console:

Resources - Allen & Heath

User Guides, Software, Drivers and Documentation

www.allen-heath.com

https://www.allen-heath.com/media/GL3300-SCHEMATICS.pdf

Not the cleanest schematic that I've seen by any means, but everything is there.

For more analog audio projects, knowledge, and schematics:

Elliott Sound Products - The Audio Pages (Main Index)

DIY Audio from ESP - Audio articles, projects to build and general information about hi-fi and audio in general.

sound-au.com

And his take on a mixing console, to compare to the commercial one:

High Quality Sound Mixer

Audio mixing console. High quality mixer suitable for stage or recording.

sound-au.com

jebba · Sep 19, 2023

I don't get how that setup would detect the difference of the sounds between humans, birds, bees, deer, cows, elk, cars, airplanes, etc. For instance, in my stream, it is 99.99% ambient noise. The idea would be to auto filter out the few minutes a day (if anything) of humans, typically me. :)

Edit: how would that just filter voice?

AaronD · Sep 19, 2023

jebba said:
Edit: how would that just filter voice?

The Noise Suppressor calls anything not-voice, "noise". QED.
Lots of streamers have been bitten by misunderstanding that. It also kills their sound effects, music, etc., leaving *only* voices.

So this "on/off-keyed signal with inverted logic" (don't care about the content, only the amplitude) is then fed into something that inverts it back as it processes the actual intended signal, which is what the side-chained Compressor does.

Looking for audio plugin to remove voices

New Member

Active Member

New Member

Active Member

Member

New Member

Member

Member

Active Member

Member

Active Member

Member

Active Member

Member

Member

Active Member

Member

Active Member

Member

Active Member