Description
Use case(s) - what problem will this feature solve?
When health checking is enabled (via "healthCheckConfig": {"serviceName": ""} in the service config) and the client fails to create the health check Watch stream, the channel remains in the CONNECTING state and no error message is output. This makes problems with creating the health check Watch stream hard to troubleshoot: there are no logs and no RPC errors describing the cause, and RPCs typically fail with deadline exceeded (if a deadline is set).
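For context, here is a minimal sketch of a service config that turns health checking on. The round_robin entry is an assumption added for illustration, and healthCheckEnabled is a hypothetical helper that exists only in this sketch. In a real client the JSON would typically be passed to grpc.WithDefaultServiceConfig, with the google.golang.org/grpc/health package imported so that the client-side health check function is registered.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serviceConfig enables client-side health checking. The empty serviceName
// asks about the overall health of the server. The round_robin policy is an
// assumption for this sketch.
const serviceConfig = `{
  "loadBalancingConfig": [{"round_robin": {}}],
  "healthCheckConfig": {"serviceName": ""}
}`

// healthCheckEnabled reports whether a JSON service config carries a
// healthCheckConfig entry. Hypothetical helper, not part of grpc-go.
func healthCheckEnabled(cfg string) (bool, error) {
	var parsed map[string]json.RawMessage
	if err := json.Unmarshal([]byte(cfg), &parsed); err != nil {
		return false, err
	}
	_, ok := parsed["healthCheckConfig"]
	return ok, nil
}

func main() {
	ok, err := healthCheckEnabled(serviceConfig)
	fmt.Println("health check enabled:", ok, "error:", err)
}
```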
The code that swallows errors when creating the health check Watch
stream is here:
Lines 74 to 78 in fbff2ab
Creating the stream can fail for several reasons:
- The transport is not ready or nil (Lines 362 to 365 in 38a8b9a, Lines 1233 to 1236 in e912015).
- A CallOption returns an error (Lines 1253 to 1255 in e912015). I don't think this can happen here, since we don't provide the ability to customize call options for health checks.
- We fail to get a codec or compressor for the call (again not applicable to health checks, since gRPC controls that part).
- We fail to get a transport or create a stream on the transport (Lines 353 to 358 in e912015), for example when GetRequestMetadata, called to get per-RPC credentials, fails (that is the case I ran into).
Proposed Solution
Transition the subchannel to TRANSIENT_FAILURE if creating the Watch stream fails, and continue the retry loop.
Alternatives Considered
If changing the behavior of health checks when per-RPC credentials fail is not desirable, at least logging the error through channelz would be nice.