From some random googling. TL;DR: performance is best when the CPU and GPU work asynchronously. Asking for the front buffer data in order to capture it makes the CPU wait for the GPU to finish rendering, so the CPU can't schedule new commands for the GPU in the meantime, and then the GPU probably ends up waiting for the CPU in turn, etc.
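To put rough numbers on that stall, here is a toy timing model. It is only a sketch: `total_time`, its parameters, and the 4 ms / 6 ms figures are made up for illustration, and real drivers queue work in more complicated ways.

```python
def total_time(n_frames, cpu_ms, gpu_ms, wait_for_gpu):
    """Toy model: time until the GPU finishes the last of n_frames.

    cpu_ms: time the CPU needs to record one frame's commands (assumed constant).
    gpu_ms: time the GPU needs to render one frame (assumed constant).
    wait_for_gpu: if True, the CPU blocks until the GPU has finished the
    frame (as with a synchronous readback) before recording the next one.
    """
    cpu_free = 0.0  # time at which the CPU can start recording the next frame
    gpu_free = 0.0  # time at which the GPU can start rendering the next frame
    for _ in range(n_frames):
        submitted = cpu_free + cpu_ms
        # The GPU starts once the commands are submitted and it is idle.
        gpu_done = max(submitted, gpu_free) + gpu_ms
        gpu_free = gpu_done
        # A synchronous readback stalls the CPU until the GPU is done.
        cpu_free = gpu_done if wait_for_gpu else submitted
    return gpu_free


print(total_time(100, 4, 6, wait_for_gpu=True))   # CPU and GPU take turns
print(total_time(100, 4, 6, wait_for_gpu=False))  # CPU and GPU overlap
```

In this model the blocking case costs about `cpu_ms + gpu_ms` per frame, while the overlapped case costs about `max(cpu_ms, gpu_ms)` per frame.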
Roughly what happens behind the scenes:
1. game renders to the front buffer
2. the front buffer is copied to a separate texture that the CPU has access to
3. pixels are copied from that texture to RAM
4. BGRA is converted to NV12 on the CPU (OBS does it with 2 threads, and from 32-bit YUV444)
5. the NV12 buffer is copied back to the GPU
6. VCE (AMD's hardware encoder) is told to get its next frame from the NV12 buffer
With OpenCL interop, OBS removes steps 3 and 5 and does step 4 on the GPU too, but then there's the rendering/compute scheduling issue.
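For reference, the BGRA to NV12 conversion step looks roughly like this on the CPU. This is a minimal single-threaded sketch, not OBS's code: the function name `bgra_to_nv12` is made up, it goes straight from BGRA rather than through YUV444, and it assumes BT.601 limited-range coefficients and a tightly packed frame with even width and height.

```python
def bgra_to_nv12(bgra: bytes, width: int, height: int) -> bytes:
    """Convert a tightly packed BGRA frame to NV12.

    Assumptions (not from OBS): BT.601 limited-range integer coefficients,
    chroma taken from the average of each 2x2 pixel block.
    """
    if width % 2 or height % 2:
        raise ValueError("NV12 needs even width and height")
    if len(bgra) != width * height * 4:
        raise ValueError("unexpected buffer size")

    def clamp(v: int) -> int:
        return max(0, min(255, v))

    y_plane = bytearray(width * height)        # one luma byte per pixel
    uv_plane = bytearray(width * height // 2)  # interleaved U,V per 2x2 block

    # Luma: full resolution.
    for row in range(height):
        for col in range(width):
            i = (row * width + col) * 4
            b, g, r = bgra[i], bgra[i + 1], bgra[i + 2]
            y_plane[row * width + col] = clamp(
                ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16)

    # Chroma: one U,V pair per 2x2 block of pixels.
    for row in range(0, height, 2):
        for col in range(0, width, 2):
            r = g = b = 0
            for dy in (0, 1):
                for dx in (0, 1):
                    i = ((row + dy) * width + col + dx) * 4
                    b += bgra[i]
                    g += bgra[i + 1]
                    r += bgra[i + 2]
            r, g, b = r // 4, g // 4, b // 4
            j = (row // 2) * width + col
            uv_plane[j] = clamp(((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128)
            uv_plane[j + 1] = clamp(((112 * r - 94 * g - 18 * b + 128) >> 8) + 128)

    return bytes(y_plane + uv_plane)


# A 2x2 all-white frame: 4 luma bytes followed by one U,V pair.
white = bytes([255, 255, 255, 255] * 4)
print(list(bgra_to_nv12(white, 2, 2)))
```

The output is 1.5 bytes per pixel instead of 4, which is why the buffer copied back to the GPU is smaller than the one read out of it.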