
OTEL_SEMCONV_STABILITY_OPT_IN latency buckets too big #3011

Open
@bergur88

Description

Describe your environment

Services are built with the Docker image python:3.10.15-slim and run on Kubernetes.
Services use:
opentelemetry-api==1.27.0
opentelemetry-sdk==1.27.0
opentelemetry-propagator-b3==1.27.0
opentelemetry-exporter-otlp-proto-grpc==1.27.0
opentelemetry-instrumentation-fastapi==0.48b0
opentelemetry-instrumentation-aiohttp-client==0.48b0
opentelemetry-instrumentation-asyncpg==0.48b0
opentelemetry-instrumentation-psycopg==0.48b0
opentelemetry-instrumentation-psycopg2==0.48b0
opentelemetry-instrumentation-requests==0.48b0
opentelemetry-instrumentation-logging==0.48b0
opentelemetry-instrumentation-system-metrics==0.48b0
opentelemetry-instrumentation-grpc==0.48b0

What happened?

I'm using the OTEL_SEMCONV_STABILITY_OPT_IN feature (currently running with http/dup) and am seeing some odd results for HTTP latencies. The new metric appears to use the same bucket boundaries as the old one, but shouldn't the buckets be smaller now that the unit has changed from milliseconds to seconds? With the lowest non-zero bucket at 5 seconds the histogram is not particularly useful: percentiles calculated from these metrics report a p99 of roughly 5 seconds for most of my services/paths, which is not accurate.

The Node.js and .NET SDKs override the default buckets with more sensible values.
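
As a workaround on the Python side, the SDK's View API can be used to override the bucket boundaries of the new histogram. The sketch below is just that, a sketch: it assumes the OTLP gRPC exporter from the versions listed above, and the boundary list is the one recommended for http.server.request.duration by the HTTP semantic conventions (values in seconds), not something the instrumentation sets today.

```python
# Workaround sketch (not current instrumentation behaviour): override the
# default bucket boundaries for the seconds-based histogram via an SDK View,
# configured before any meters are created.
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import ExplicitBucketHistogramAggregation, View
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Boundaries (in seconds) recommended by the HTTP semantic conventions for
# http.server.request.duration; much finer than the default 0/5/10/... list.
SECONDS_BOUNDARIES = (
    0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0,
)

duration_view = View(
    instrument_name="http.server.request.duration",
    aggregation=ExplicitBucketHistogramAggregation(boundaries=SECONDS_BOUNDARIES),
)

provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())],
    views=[duration_view],
)
set_meter_provider(provider)
```

Ideally the instrumentation would apply suitable seconds-based boundaries itself (as Node.js and .NET reportedly do), so users wouldn't need a View at all.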

[Screenshots: the two histograms rendered in Grafana]

The screenshots show the same measurement over the same time range and label set, recorded by both histograms; the older milliseconds metric is noticeably more granular and useful:
sum(rate(http_server_duration_milliseconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)
sum(rate(http_server_request_duration_seconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

Steps to Reproduce

set OTEL_SEMCONV_STABILITY_OPT_IN="http/dup"
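A minimal reproduction could look like the hypothetical FastAPI service below (the /ping route and wiring are illustrative; the OTLP exporter is assumed to point at a local collector):

```python
# Minimal reproduction sketch; route names and service wiring are illustrative.
import os
os.environ.setdefault("OTEL_SEMCONV_STABILITY_OPT_IN", "http/dup")

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())])
)

app = FastAPI()

@app.get("/ping")
async def ping():
    return {"ok": True}

# With http/dup, requests are recorded in both http.server.duration (ms) and
# http.server.request.duration (s); both end up on the default SDK buckets.
FastAPIInstrumentor.instrument_app(app)
```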

The two histograms can then be visualized in Grafana with queries like:
sum(rate(http_server_duration_milliseconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)
sum(rate(http_server_request_duration_seconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

Expected Result

I expected to see approximately the same percentiles for my services/paths when using the new semantic-convention metrics.

Actual Result

The new metrics are skewed towards 5 seconds because of the bucket sizes.

Additional context

No response

Would you like to implement a fix?

None

