Looking for audio plugin to remove voices

BarrySDCA

New Member
Hello,

I am using OBS for a live webcam. I have a mic setup outside to capture natural sounds but I need a way to filter out people talking. I don't want to inadvertently capture a conversation, etc.. Does anyone know of such a plugin?

thank you much
 

AaronD

Active Member
Hmm... This is the exact opposite problem from a Noise Suppressor, so my first thought was to have two parallel paths - one noise-suppressed to end up with voices only, and the other delayed to line up with it - then subtract the voices from everything else. (invert one, then mix)

Two problems with that:
  • OBS doesn't have a delay filter, so you can't line them up. Not lining them up completely kills the cancelling effect. So you'd have to use a DAW for all that, which is certainly possible, but:
  • The Noise Suppressor can't be perfect. The best you can do will probably still have some residual voices in it. It might be worth trying though - in a DAW, not OBS - to see if the voices are reduced enough.

DAW = Digital Audio Workstation. Essentially a complete sound studio in one app. It only does sound, and it does it REALLY WELL!!! There are some free, some paid, some are better featured than others, and there doesn't seem to be much correlation between price and functionality anymore.

I like this one myself:
On Debian and its derivatives, including *buntu and similar: sudo apt install ardour
Ubuntu Studio has it preinstalled, along with an old version of OBS. Install the PPA to update OBS to the current version, and you have a quite capable rig!

And a noise suppressor plugin for it:
 

BarrySDCA

New Member
Thanks. I don't have control over the voices or background sound. it's just a mic setup outside on the bay to catch the boat and bird sounds. but sometimes people stand near by and I catch them. I want to remove the people. do you think this has a fighting chance? I don't mind trying if you do. thank you
 

AaronD

Active Member
All of the above is free, so the only cost is your time to play with it. Seems like a cheap experiment to me.

---

Another thing you might look at, instead of just a simple subtraction, is echo cancellation. This does not replace the noise suppressor; only the subtraction.

Echo cancellation is designed for things like speakerphones, where the mic and speaker are right next to each other, and you still want to talk over the other party.
  • In its original application, the echo canceller takes the raw mic and the final speaker signal, and figures out what to take out of the mic before passing it on to the rest of the system. It involves quite a lot of processing, not a simple subtraction.
  • In your case, it would take the noise-suppressed mic (voices only) as the "speaker" to be removed, and the raw mic as the mic.
Unfortunately, I don't know of a good one of those. I'm sure there *are* some, now that you know the term to look for.

---

Another thing that comes to mind, though I doubt it'd work very well for you, is to use multiple mics like the Grateful Dead did for vocals in front of their "Wall of Sound" experiment. They had two identical mics for each vocal, one subtracted from the other, with the idea that the close-up voice would not be equal in both and thus *not* cancel completely, while the PA *would* be equal in both and thus cancel. It...worked for that specific purpose, but it didn't sound very good for the vocals themselves. And you want to do the opposite anyway.

---

Once we have multiple mics though, another idea is to spread them out. The ambient sounds that you want, won't be much different at all no matter which mic you're listening to, but when someone talks close to one, that's definitely different! So you could split off each one into its own noise suppressor, so now you have a voice detector for each one, and then you can use that to kill the raw signal that it's associated with.

As a first attempt on *that*, I'd feed the noise-suppressed version into the side-chain of an aggressive compressor, then mix them all together into a more gentle bus compressor, just to account for the overall level changing if one of them drops out.

This is also free, except for mics, cabling, and preamps.
 
Last edited:

jebba

Member
I am using OBS for a live webcam. I have a mic setup outside to capture natural sounds but I need a way to filter out people talking. I don't want to inadvertently capture a conversation, etc.. Does anyone know of such a plugin?

I would very much like the same thing, for the same reason. I haven't found anything yet.

There have been large strides made in the last year or so doing this with music, where each instrument and voice can be separated into a different track (e.g. drums.wav, voice.wav, guitar.wav). Here is an example:


That is a GUI for a couple of the more widely used models. Those models are based on music repositories though, so aren't very well suited for our purpose.

There is another ML task to do similar with movies. In general this is called "speech separation". For example, they want to split out dialog (voice), music, and sound effects. One recent project in this area, with accompanying paper is here:


There was a contest for the best way to do this, if you want to explore some others:


But none of these that I have seen are suited for our particular goal. For it to work well, it would likely be best to train on a similar data set.

The other main issue is that this is to be done in "realtime" in OBS Studio. I didn't see any of the above that are set up to do this as a stream--they all presume pre-recorded audio. The models are computationally expensive...usually requiring a GPU. Nowadays, de facto, that means having an nvidia GPU that can run PyTorch.

I would also like to incorporate a bird identification stream. :)
 

BarrySDCA

New Member
thank you for all this info. I'm starting from the basics, better directional microphones. But I'll need to polish it digitally be removing any human voices. I also have a digital noise sound that comes in the vhf radio I would like to remove. It's not 24/7, just once in a while. annoying. I'll update the thread as I get further along. Please keep posting helpful information - thank you very much
 

jebba

Member
A couple options for better directional microphones are "shotgun" mics or using a parabolic dish.

There's a lot of options out there for shotgun mics; one example would be the Sennheiser MKH 416-P48U3, which can also tolerate bad weather quite well. To protect it, you can put it inside a Rycote Windshield.

For a parabolic dish, which is more directional, there are options from Telinga and others. These setups are a dish that place a microphone in the center. Some can use lavalier mics, others larger mics.

For the hoomans' voices problem, I was also thinking a stop gap solution may be easier short term. Just detecting human voices, then just muting the audio until they are gone may be simpler to implement than trying to filter them out.

Not sure exactly what you mean about vhf there, but using cables with XLR connectors can help with noise issues.

I also see now there are other plugins, which may be helpful:


 

jebba

Member
"RNNoise is a noise suppression library based on a recurrent neural network."


More recent:

"Real-time microphone noise suppression on Linux"

 

AaronD

Active Member
Keep in mind that pretty much all of the existing tools do exactly the *opposite* of what the OP wants, and it's not a directly reversible problem.
 

jebba

Member
Keep in mind that pretty much all of the existing tools do exactly the *opposite* of what the OP wants, and it's not a directly reversible problem.

One of the issues that they mentioned was noise reduction though too, as they said they were getting interference from VHF.

Back to the voice removal, another approach is to use Voice Activity Detection, then go mute.

 

AaronD

Active Member
One of the issues that they mentioned was noise reduction though too, as they said they were getting interference from VHF.
"Noise Reduction" often focuses on spoken voice, and considers everything else as noise to be removed. A bit of a misnomer, until you see their primary use: business conferences.

So you can't assume that "noise reduction" uses the same definition of "noise" as you do.

Back to the voice removal, another approach is to use Voice Activity Detection, then go mute.
That's pretty much what I said about a month ago in this thread. :-)
 

AaronD

Active Member
Do you know of a good VAD for this?

Edit:
This looks good, is recommended by OVOS (successor to MyCroft):

Does it have a binary output? Or an audio one that contains only voice? (or is supposed to at least)

If it's audio that contains only voice, then it's functionally no different from any other Noise Suppressor, whether it's a plugin to a DAW, or the filter that OBS has, or any other variation. And that's enough to trigger a side-chained compressor to make it mute. You don't need a binary output for that.
 

jebba

Member
VAD, like silero-vad, isn't a noise suppressor.

The VAD just detects voice. It doesn't do any audio changes. The idea is once the VAD detects speech, a macro mutes OBS audio, then when the VAD detects the speech ends, the macro unmutes OBS.
 

jebba

Member
This looks pretty cool, multi-species:

"Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection"

But it is trained on very few species.
 

AaronD

Active Member
VAD, like silero-vad, isn't a noise suppressor.

The VAD just detects voice. It doesn't do any audio changes. The idea is once the VAD detects speech, a macro mutes OBS audio, then when the VAD detects the speech ends, the macro unmutes OBS.
How is that better than an aggressive side-chained compressor?

It probably works too. Just seems like it introduces more problems than it solves in this case. How to get a binary output into something that OBS or something else can use? Do you need a Python script as a supervisor? Can Adv. SS use it? Etc.

Compared to using a mechanism that already exists: the Compressor side-chain, which takes audio, and a Noise Suppressor that gives audio in (probably) the desired pattern.
 

jebba

Member
The VAD is for the purpose of making sure one doesn't stream human conversation with OBS.

Noise reduction is another goal, but that is separate from the VAD.

When you talk about a "aggressive side-chained compressor" I'm not sure which you are referring to. It seems you could use those audio filters for cleaning up noise (e.g. buzz, hums, etc). I'm not sure how it would work with removing human voices though. I think I misunderstand what you mean.

Edit: Something like the Scene Switcher or similar plugins could be used hand-in-hand with the detection, perhaps.
 

AaronD

Active Member
When you talk about a "aggressive side-chained compressor" I'm not sure which you are referring to. It seems you could use those audio filters for cleaning up noise (e.g. buzz, hums, etc). I'm not sure how it would work with removing human voices though. I think I misunderstand what you mean.
1695075259799.png


Inside the Compressor:

1695075479881.png


Normally, the Side-chain is a copy of the Input, but it doesn't have to be...

For some analog circuitry:
Notice that even in the first app note (dn00A), the detector taps off right at the input and runs independently from there. So even that design can be disconnected at that point and fed a different signal, to make it work per the diagrams here.

I've noticed that software people tend towards more Rube Goldberg solutions than most do, maybe because the software world is much like this anyway:
Is that where you're coming from?

Here's where I'm coming from:
I built that rig. :-) And here's that console:
Not the cleanest schematic that I've seen by any means, but everything is there.

For more analog audio projects, knowledge, and schematics:
And his take on a mixing console, to compare to the commercial one:
 

jebba

Member
I don't get how that setup would detect the difference of the sounds between humans, birds, bees, deer, cows, elk, cars, airplanes, etc. For instance, in my stream, it is 99.99% ambient noise. The idea would be to auto filter out the few minutes a day (if anything) of humans, typically me. :)

Edit: how would that just filter voice?
 

AaronD

Active Member
Edit: how would that just filter voice?
The Noise Suppressor calls anything not-voice, "noise". QED.
Lots of streamers have been bitten by misunderstanding that. It also kills their sound effects, music, etc., leaving *only* voices.

So this "on/off-keyed signal with inverted logic" (don't care about the content, only the amplitude) is then fed into something that inverts it back as it processes the actual intended signal, which is what the side-chained Compressor does.
 
Top