Just add the audio-providing source to every scene first, then add the relevant video source on top of it. Word of warning: Be exceptionally careful when adding new scenes. You will always forget to add the audio source, and then you will have no audio. And because OBS bafflingly gives you no way to monitor the mixed audio (it only lets you monitor specific sources continuously), you will never know that you are ruining your broadcast until it is too late.
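Since forgetting the audio source is so easy, a quick sanity check might help. Here is a minimal sketch, assuming you can get your scene list into a plain mapping of scene name to source names. That mapping, the scene names, and the `scenes_missing_source` helper are all made up for illustration; nothing here talks to OBS itself.

```python
from typing import Dict, List


def scenes_missing_source(scenes: Dict[str, List[str]], audio_source: str) -> List[str]:
    """Return the names of scenes that do not contain audio_source, sorted."""
    return sorted(
        name for name, sources in scenes.items() if audio_source not in sources
    )


# Hypothetical scene collection: scene name -> names of the sources in it.
scenes = {
    "Wide shot": ["Studio Mics", "Camera 1"],
    "Close-up": ["Studio Mics", "Camera 2"],
    "Slides": ["Screen Capture"],  # newly added scene, audio source forgotten
}

missing = scenes_missing_source(scenes, "Studio Mics")
print(missing)
```

With the example data above, the only scene flagged is "Slides", the one where the audio source was never added.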
Unfortunately, what you're finding is that OBS's audio architecture is entirely designed around audio-follows-video, which is kind of bizarre, because I can count the number of times I've wanted AFV in my entire life on one hand, if that. You build your packages so that the audio is ducked for the opening VO, and it goes silent at the end, and you're done. The content itself should handle the AFV for you, barring mistakes by your VTR operator. Heck, I can count the number of times I even saw support in a studio for audio-follows-video on one finger.
The reason studios don't usually work that way is that AFV makes the most common use case, keeping your studio mics live continuously, more challenging, because you're fighting against the architecture. The number of times when you want to cut off one person's mic because the other person is on camera (presidential debates notwithstanding) is almost exactly zero. :-D
It is possible to work around OBS's design, though, as noted above.
Ideally, I'd like to see the audio mixing UI be significantly rewritten to behave more like this:
- Have a checkbox for each source to make it either A. audio-follows-video or B. continuous audio:
  - Existing sources should be AFV to avoid breaking anybody.
  - New sources should default to continuous except for media sources and browser sources, which should default to AFV.
- All audio sources should be visible in the mixer, not just sources that are active in the scene.
- Inactive AFV sources should be greyed out.
- Source order in the mixer should be static, and possibly configurable.
- Monitor the final mixed output, not individual sources (by default).
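
To make the proposal above a bit more concrete, here is a rough sketch of the per-source flag, the defaults, and the mixer listing. It is only a toy model of the idea, not how OBS works today; the `Source` class, the `kind` labels, and every function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

# Source kinds that would default to audio-follows-video under the proposal.
AFV_BY_DEFAULT = {"media", "browser"}


@dataclass
class Source:
    name: str
    kind: str  # hypothetical labels, e.g. "mic", "media", "browser"
    afv: bool  # True: audio follows video; False: continuous audio


def existing_source(name: str, kind: str) -> Source:
    """Sources that predate the change stay AFV so nobody's setup breaks."""
    return Source(name, kind, afv=True)


def new_source(name: str, kind: str) -> Source:
    """Newly added sources are continuous unless they are media or browser."""
    return Source(name, kind, afv=kind in AFV_BY_DEFAULT)


def mixer_rows(sources: List[Source], active: Set[str]) -> List[Tuple[str, bool]]:
    """List every source in its fixed order as (name, audible).

    A continuous source is always audible. An AFV source is audible only
    while it is in the active scene; otherwise its row would be greyed out.
    """
    return [(s.name, (not s.afv) or s.name in active) for s in sources]


sources = [
    new_source("Studio Mics", "mic"),
    new_source("Intro Package", "media"),
    new_source("Lower Third", "browser"),
]
rows = mixer_rows(sources, active={"Lower Third"})
for name, audible in rows:
    print(name, "audible" if audible else "greyed out")
```

Note that every source gets a row whether or not it is in the current scene, and the row order never changes when you switch scenes. Only the greyed-out state does.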
Thoughts?