Video encoding takes a lot of processing power. The CPU can do it, at the cost of high usage, but if the GPU is designed to do it, then offloading the encode there leaves the CPU free to do other things.
As for an engineering spec, there really isn't one. Your best bet for a new system is to go overkill and let it run easy. That also leaves room for you to grow into it, instead of being overloaded from the get-go and not being able to do what you really want.
As for the confusion and incomprehensibility: this is a technical pursuit. If you thought it was "push button, make stream," it's absolutely not that! The Auto-Config Wizard tries to approximate it, but there's so much variation in what people want to do, what gear they have to do it with, what internet connection they have, etc., that nothing can make it *that* turn-key.
And judging from the questions that I often answer, along the lines of "Why is <something> misbehaving?", it's a massive hindrance not to understand how a serious live media rig actually works, because you're literally building one.
Audio:
www.youtube.com
As a live Front-of-House and Broadcast Engineer (those are two different things with different goals and different techniques), I don't agree with everything in that video for either use, but it's a good start. You can follow links from there.
Video:
www.youtube.com
In both cases, OBS is both simpler than the examples and weird by comparison, as if it was designed in isolation from the pro world and hasn't aligned with it yet. (There's a good reason for pro gear to work the way it does, and OBS...doesn't yet.) But the principles still apply.
One principle to note right away is the clear separation between picture and sound. Different gear for each purpose, and they don't cross over except in very low-end cheap stuff. Everything "serious" handles the two streams completely separately, and combines them at the last moment to send out.
That's a good principle to keep in mind, because OBS's token audio processing is just that: token, and not much more. It's okay for a game + mic, but it doesn't take much complexity at all before it makes sense to break out of OBS's audio and do it all in some external thing instead, so that OBS only sees the finished soundtrack to pass through unchanged.