
Building an OTel sink that follows GenAI semantic conventions

I've been working on an OTel sink recently. Specifically, one that receives telemetry data from LLM interactions. Fortunately, there is a spec for all of this: the GenAI Semantic Conventions. That is what I based my work on, so I can make assumptions about which key attributes to highlight in the sink's UI (which in my case is actually a CLI).

I dogfooded my hand-rolled sink using Pydantic AI (my favorite LLM framework), and all good: it follows the spec well. Key attributes flow in as specified. Pydantic AI also includes some attributes of its own, but that is expected.
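To make that concrete, here is roughly the kind of attribute dict the sink sees from a spec-following framework. The attribute names are from the GenAI Semantic Conventions; the model name is a placeholder and the `extract_summary` helper is my own illustration, not part of any library:

```python
# Attributes a spec-following framework puts on an LLM span.
# Names are from the GenAI semconv; values here are placeholders.
SPEC_SPAN = {
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4",
    "gen_ai.response.model": "claude-sonnet-4",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 98,
}

def extract_summary(attrs: dict) -> str:
    """One-line summary for the CLI, built from spec attributes only."""
    return (
        f"{attrs.get('gen_ai.operation.name', '?')} "
        f"{attrs.get('gen_ai.request.model', '?')} "
        f"({attrs.get('gen_ai.usage.input_tokens', 0)} in / "
        f"{attrs.get('gen_ai.usage.output_tokens', 0)} out)"
    )

print(extract_summary(SPEC_SPAN))  # chat claude-sonnet-4 (412 in / 98 out)
```

When the spec attributes are actually there, the sink-side code stays this boring, which is the whole appeal.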

However, pick something like Claude Code (via the Anthropic Agent SDK), and the reality is a bit different. It does not follow the spec at all; it has its own custom attribute layout.

Same direction, different flavor with LangChain. The Python source has zero OpenTelemetry imports anywhere. None. The ls_* namespace (ls_provider, ls_model_name, ls_temperature) is its own thing, with LangSmith as the only tracing target. So Claude Code has its own data model that never reaches OTel; LangChain doesn't even pretend.

Part of why emissions look like this is that the GenAI Semantic Conventions spec is not the only one in town. There is a competing one from Arize called OpenInference, and it actually predates the OTel version in production use by roughly 3 months (Phoenix adopted it 2024-01-09; first gen_ai.* attributes merged upstream 2024-04-16). Arize ships a ton of instrumentation libraries for it, and OpenInference is also what Agno, BeeAI, and SmolAgents emit. So when frameworks "have their own thing," sometimes it is actually OpenInference under the hood.

OK so that is the emit side. What does the consume side look like?

I did a bit of a deep dive into a few popular open source observability tools, and dug out some dirty laundry on how they handle this.

The answer is that they DO support most of it, but three of them have built explicit translation layers from gen_ai.* into their own proprietary data models. None of them use gen_ai.* as their internal data model. Phoenix is the clear outlier: it skipped the spec entirely and built the whole product on OpenInference.

Here is a table, à la Claude.

Cells are I·E·M (Ingest, Emit, Internal model). N=native, M=mapped, D=different name, A=absent, ?=unclear.

| Attribute | MLflow | Phoenix | Langfuse | Opik |
|---|---|---|---|---|
| gen_ai.system | M·A·D | D·D·D | M·D·D | M·A·D |
| gen_ai.operation.name | N·M·D | A·A·A | M·D·A | M·A·D |
| gen_ai.request.model | N·M·D | D·D·D | M·D·D | M·A·D |
| gen_ai.response.model | N·M·D | A·A·A | M·D·A | M·A·D |
| gen_ai.usage.input_tokens | N·M·D | D·D·D | M·D·D | M·A·D |
| gen_ai.usage.output_tokens | N·M·D | D·D·D | M·D·D | M·A·D |
| gen_ai.request.temperature | A·M·A | A·D·A | M·D·A | M·A·D |
| gen_ai.prompt | M·A·A | D·D·D | M·D·A | M·A·D |
| gen_ai.completion | M·A·A | D·D·D | M·D·A | M·A·D |
| gen_ai.tool.name | A·A·A | ?·D·D | M·D·A | M·A·D |

Three of the four tools (Langfuse, MLflow, Opik) translate gen_ai.* on ingest into their own taxonomy. None use it as the internal data model. None emit it by default. Phoenix stores it verbatim and ignores it. The four archetypes:

| Tool | Archetype | Key signal |
|---|---|---|
| Phoenix | Explicit competitor | MIGRATION.md says it twice: "Phoenix now exclusively uses OpenInference for instrumentation." Frontend folder is literally app/src/openInference/. |
| MLflow | Opt-in dual-namespace | MLFLOW_ENABLE_OTEL_GENAI_SEMCONV=False by default. |
| Langfuse | Bridge layer | 3072-line OtelIngestionProcessor.ts, five separate code paths for prompt/completion to handle spec churn. |
| Opik | Ingest-only | Real OTLP receiver, but silently rewrites input_tokens to prompt_tokens internally. |

An interesting finding: Langfuse, for example, has more than ten custom mappers (framework-specific attributes like ai.* mapped to the Langfuse domain) for frameworks such as Vercel AI SDK, Genkit, Traceloop, Pydantic AI, Logfire, LiveKit, MLflow, SmolAgents, and OpenInference, plus partial branches for Pipecat, LlamaIndex, and Google ADK. All in one ~3000-line ingest file.

One notable thing is that newer fields, like evaluation results, do not appear to have any special handling, as far as I (and Claude) can tell.

By browsing the changelog, you can see that the semantic conventions have been evolving for a while. Just in the last 12 months, gen_ai.system got renamed to gen_ai.provider.name, prompt_tokens to input_tokens, and gen_ai.prompt was removed entirely (content moved to events, then the event shape got refactored too). Small wonder: everyone is learning this thing as we go and the specs play catch-up. Bundle that with the fact that every vendor already has a working internal data model from before the spec landed, and this might be a long road to reach stability.
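One way a sink can survive this churn is to read the newest attribute name first and fall back to older spellings. The rename chains below are the ones from the changelog just mentioned; the `read_attr` helper is my own sketch, not an OTel API:

```python
# Newest spec name -> older spellings it replaced, per the semconv changelog.
RENAMES = {
    "gen_ai.provider.name": ["gen_ai.system"],
    "gen_ai.usage.input_tokens": ["gen_ai.usage.prompt_tokens"],
    "gen_ai.usage.output_tokens": ["gen_ai.usage.completion_tokens"],
}

def read_attr(attrs: dict, name: str, default=None):
    """Look up `name`, falling back to its older spellings if absent."""
    if name in attrs:
        return attrs[name]
    for old in RENAMES.get(name, []):
        if old in attrs:
            return attrs[old]
    return default

# A span emitted against the older revision of the spec still resolves:
old_style = {"gen_ai.system": "openai", "gen_ai.usage.prompt_tokens": 120}
print(read_attr(old_style, "gen_ai.provider.name"))       # openai
print(read_attr(old_style, "gen_ai.usage.input_tokens"))  # 120
```

This is essentially a tiny version of what Langfuse's five prompt/completion code paths are doing at much larger scale.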

So in practice it is not like I can just drop the custom attributes. I need to keep them and map them to the spec attributes when they match. Sure, we have our own domain model too, but I am planning to follow the spec as much as possible, volatile as it is. I would rather chase spec churn than ship a fourth taxonomy nobody asked for.
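The "keep both" approach can be sketched as a one-way mapping table: copy known vendor attributes onto their gen_ai.* equivalents while preserving the originals. The ls_* names are the LangChain ones from earlier; the OpenInference spellings are as I understand that convention, and the table is illustrative, not exhaustive:

```python
# Vendor attribute -> gen_ai.* equivalent. Illustrative, not exhaustive.
TO_SPEC = {
    # OpenInference (spellings as I understand the convention)
    "llm.model_name": "gen_ai.request.model",
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.token_count.completion": "gen_ai.usage.output_tokens",
    # LangChain / LangSmith
    "ls_provider": "gen_ai.system",
    "ls_model_name": "gen_ai.request.model",
    "ls_temperature": "gen_ai.request.temperature",
}

def normalize(attrs: dict) -> dict:
    """Return attrs plus spec-named copies; spec names already present win."""
    out = dict(attrs)
    for src, dst in TO_SPEC.items():
        if src in attrs and dst not in attrs:
            out[dst] = attrs[src]
    return out

span = {"ls_provider": "openai", "ls_model_name": "gpt-4o", "ls_temperature": 0.2}
normalized = normalize(span)
print(normalized["gen_ai.request.model"])  # gpt-4o
print("ls_model_name" in normalized)       # True (original kept)
```

Keeping the originals matters: the spec will rename things again, and the raw attributes are the only ground truth you can re-map later.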

So why is it like that?

A few plausible reasons.

Every tool had a working LLM data model before the spec landed in April 2024. So for them, the spec arrives as a migration ask, not a green-field choice. Phoenix's whole product is built on OpenInference. Langfuse has its Generation/Observation/Score model in ClickHouse. MLflow has its own trace schema. Opik has its Java DTOs.

OpenInference got there about three months earlier and is frankly better suited to the job. It puts content directly on the span, which is what most span-first backends like Tempo or Jaeger can actually render.
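To illustrate the difference: OpenInference flattens message lists into plain span attributes, so any trace viewer can show them, while the current gen_ai.* direction carries content on events that a span-only viewer will simply not render. The attribute spellings below follow OpenInference's flattening scheme as I understand it; treat them as illustrative:

```python
# OpenInference-style span: message content lives in flattened attributes.
openinference_style = {
    "llm.input_messages.0.message.role": "user",
    "llm.input_messages.0.message.content": "Hello!",
    "llm.output_messages.0.message.role": "assistant",
    "llm.output_messages.0.message.content": "Hi there.",
}

def messages(attrs: dict, direction: str) -> list[tuple[str, str]]:
    """Rebuild a message list from flattened OpenInference attributes."""
    out, i = [], 0
    while f"llm.{direction}_messages.{i}.message.role" in attrs:
        out.append((
            attrs[f"llm.{direction}_messages.{i}.message.role"],
            attrs[f"llm.{direction}_messages.{i}.message.content"],
        ))
        i += 1
    return out

print(messages(openinference_style, "input"))   # [('user', 'Hello!')]
print(messages(openinference_style, "output"))  # [('assistant', 'Hi there.')]
```

A backend that only stores spans gets the full conversation for free this way, with no event pipeline required.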

The spec is still marked as in development, and none of the relevant attributes are marked stable. As I've found out, it is hard to commit to baking something that keeps moving into your internal schema.

And the spec has gaps for things production cares about. Cost is the obvious one. There is no canonical attribute for token cost. So three of the four tools just invented their own (gen_ai.usage.cost in Langfuse, mlflow.llm.cost in MLflow, custom routes in Opik). Same story for evaluation. The gen_ai.evaluation.* namespace exists in the registry, but zero of the four tools touch it. They all ship their own eval/scoring systems instead.
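Since there is no canonical cost attribute, a sink has to compute cost itself from the token counts and a price table. A minimal sketch, where the per-million-token prices and the model name are made-up placeholders:

```python
# (input, output) USD per 1M tokens. Placeholder prices for illustration only.
PRICES_PER_MTOK = {
    "example-model": (3.00, 15.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate from token counts; no spec attribute covers this."""
    p_in, p_out = PRICES_PER_MTOK[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

print(estimate_cost("example-model", 1_000_000, 0))  # 3.0
print(estimate_cost("example-model", 412, 98))
```

Every one of the four tools is doing some version of this internally, each under a different attribute name, which is exactly the fragmentation the spec was supposed to prevent.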

Anyway, this is business as usual in the LLM space: everything moves fast and the specs play catch-up.
