My understanding is that it takes whatever each source gives it and converts that to 32-bit or 64-bit floating point internally*, then hands the final result to whatever format the encoder expects — and a lossy encoder's output doesn't really have any particular bit depth.
The encoder's job is to minimize the total number of bits while keeping enough quality to be "acceptable", where "acceptable" is defined by the encoder's settings. If you're not going to notice a difference between "perfect" and 6 bits for a given passage, then the effective bit depth may well be 6 there, maybe 11 for a different part of the same stream/recording, and so on.
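To get a feel for what "effective bit depth" means, here's a minimal sketch (not any encoder's actual internals — `quantize` and `snr_db` are just illustrative helpers) that quantizes a full-scale sine to a few bit depths and measures the resulting signal-to-noise ratio. It follows the usual rule of thumb of roughly 6 dB of SNR per bit:

```python
import math

def quantize(samples, bits):
    """Uniformly quantize samples in [-1, 1) to the given bit depth."""
    levels = 2 ** (bits - 1)  # signed range: -levels .. levels - 1
    return [max(-levels, min(levels - 1, round(s * levels))) / levels
            for s in samples]

def snr_db(original, quantized):
    """Signal-to-noise ratio of the quantized signal, in dB."""
    signal = sum(s * s for s in original)
    noise = sum((s - q) ** 2 for s, q in zip(original, quantized))
    return 10 * math.log10(signal / noise)

# A full-scale 1 kHz sine sampled at 48 kHz (100 ms worth)
sine = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(4800)]

for bits in (6, 11, 16):
    print(f"{bits:2d} bits -> SNR ~ {snr_db(sine, quantize(sine, bits)):.1f} dB")
```

At 6 bits the SNR is only around 38 dB, which sounds terrible in isolation but can be inaudible when the quantization noise sits below the masking threshold of louder content in the same passage — which is exactly the budget a lossy encoder spends or saves per passage depending on its settings.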
The number of bits needed before you stop noticing is surprisingly low: