
Building an OTel sink that follows GenAI semantic conventions

I've been working on an OTel sink recently. Specifically, one that receives telemetry data from LLM interactions. Fortunately, there is a spec for all of this: the GenAI Semantic Conventions. That is what I based my work on, so I can make assumptions about which key attributes to highlight in the sink's UI (which in my case is actually a CLI).

I dogfooded my hand-rolled sink using Pydantic AI (my favorite LLM framework), and all good: it follows the spec well. Key attributes flow in as specified. Pydantic AI also includes some attributes of its own, but that is expected.
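To make that concrete, here is roughly the kind of attribute dict the sink sees from a spec-following framework. The attribute names are from the GenAI Semantic Conventions; the model name is a placeholder and the `extract_summary` helper is my own illustration, not part of any library:

```python
# Attributes a spec-following framework puts on an LLM span.
# Names are from the GenAI semconv; values here are placeholders.
SPEC_SPAN = {
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4",
    "gen_ai.response.model": "claude-sonnet-4",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 98,
}

def extract_summary(attrs: dict) -> str:
    """One-line summary for the CLI, built from spec attributes only."""
    return (
        f"{attrs.get('gen_ai.operation.name', '?')} "
        f"{attrs.get('gen_ai.request.model', '?')} "
        f"({attrs.get('gen_ai.usage.input_tokens', 0)} in / "
        f"{attrs.get('gen_ai.usage.output_tokens', 0)} out)"
    )

print(extract_summary(SPEC_SPAN))  # chat claude-sonnet-4 (412 in / 98 out)
```

When the spec attributes are actually there, the sink-side code stays this boring, which is the whole appeal.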

However, pick something like Claude Code (via the Anthropic Agent SDK), and the reality is a bit different. It does not follow the spec at all; it has its own custom attribute layout.

Same direction, different flavor with LangChain. The Python source has zero OpenTelemetry imports anywhere. None. The ls_* namespace (ls_provider, ls_model_name, ls_temperature) is its own thing, with LangSmith as the only tracing target. So Claude Code has its own data model that never reaches OTel; LangChain doesn't even pretend.

Part of why emissions look like this is that the GenAI Semantic Conventions spec is not the only one in town. There is a competing one from Arize called OpenInference, and it actually predates the OTel version in production use by roughly 3 months (Phoenix adopted it 2024-01-09; first gen_ai.* attributes merged upstream 2024-04-16). Arize ships a ton of instrumentation libraries for it, and OpenInference is also what Agno, BeeAI, and SmolAgents emit. So when frameworks "have their own thing," sometimes it is actually OpenInference under the hood.

OK so that is the emit side. What does the consume side look like?

I did a bit of a deep dive into a few popular open source observability tools, and dug out some dirty laundry on how they handle this.

The answer is that they DO support most of it, but three of them have built explicit translation layers from gen_ai.* into their own proprietary data models. None of them use gen_ai.* as their internal data model. Phoenix is the clear outlier: it skipped the spec entirely and built the whole product on OpenInference.

Here is a table, à la Claude.

Cells are I·E·M (Ingest, Emit, Internal model). N=native, M=mapped, D=different name, A=absent, ?=unclear.

| Attribute | MLflow | Phoenix | Langfuse | Opik |
|---|---|---|---|---|
| gen_ai.system | M·A·D | D·D·D | M·D·D | M·A·D |
| gen_ai.operation.name | N·M·D | A·A·A | M·D·A | M·A·D |
| gen_ai.request.model | N·M·D | D·D·D | M·D·D | M·A·D |
| gen_ai.response.model | N·M·D | A·A·A | M·D·A | M·A·D |
| gen_ai.usage.input_tokens | N·M·D | D·D·D | M·D·D | M·A·D |
| gen_ai.usage.output_tokens | N·M·D | D·D·D | M·D·D | M·A·D |
| gen_ai.request.temperature | A·M·A | A·D·A | M·D·A | M·A·D |
| gen_ai.prompt | M·A·A | D·D·D | M·D·A | M·A·D |
| gen_ai.completion | M·A·A | D·D·D | M·D·A | M·A·D |
| gen_ai.tool.name | A·A·A | ?·D·D | M·D·A | M·A·D |

Three of the four tools (Langfuse, MLflow, Opik) translate gen_ai.* on ingest into their own taxonomy. None use it as the internal data model. None emit it by default. Phoenix stores it verbatim and ignores it. The four archetypes:

| Tool | Archetype | Key signal |
|---|---|---|
| Phoenix | Explicit competitor | MIGRATION.md says it twice: "Phoenix now exclusively uses OpenInference for instrumentation." Frontend folder is literally app/src/openInference/. |
| MLflow | Opt-in dual-namespace | MLFLOW_ENABLE_OTEL_GENAI_SEMCONV=False by default. |
| Langfuse | Bridge layer | 3072-line OtelIngestionProcessor.ts, five separate code paths for prompt/completion to handle spec churn. |
| Opik | Ingest-only | Real OTLP receiver, but silently rewrites input_tokens to prompt_tokens internally. |

An interesting finding: Langfuse, for example, has more than ten custom mappers (framework-specific attributes like ai.* mapped to the Langfuse domain) for frameworks such as Vercel AI SDK, Genkit, Traceloop, Pydantic AI, Logfire, LiveKit, MLflow, SmolAgents, and OpenInference, plus partial branches for Pipecat, LlamaIndex, and Google ADK. All in one ~3000-line ingest file.

One notable thing is that newer fields, like evaluation results, do not appear to have any special handling, as far as I (and Claude) can tell.

By browsing the changelog, you can see that the semantic conventions have been evolving for a while. Just in the last 12 months, gen_ai.system got renamed to gen_ai.provider.name, prompt_tokens to input_tokens, and gen_ai.prompt was removed entirely (content moved to events, then the event shape got refactored too). Small wonder: everyone is learning this thing as we go and the specs play catch-up. Bundle that with the fact that every vendor already has a working internal data model from before the spec landed, and this might be a long road to reach stability.
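One way a sink can survive this churn is to read the newest attribute name first and fall back to older spellings. The rename chains below are the ones from the changelog just mentioned; the `read_attr` helper is my own sketch, not an OTel API:

```python
# Newest spec name -> older spellings it replaced, per the semconv changelog.
RENAMES = {
    "gen_ai.provider.name": ["gen_ai.system"],
    "gen_ai.usage.input_tokens": ["gen_ai.usage.prompt_tokens"],
    "gen_ai.usage.output_tokens": ["gen_ai.usage.completion_tokens"],
}

def read_attr(attrs: dict, name: str, default=None):
    """Look up `name`, falling back to its older spellings if absent."""
    if name in attrs:
        return attrs[name]
    for old in RENAMES.get(name, []):
        if old in attrs:
            return attrs[old]
    return default

# A span emitted against the older revision of the spec still resolves:
old_style = {"gen_ai.system": "openai", "gen_ai.usage.prompt_tokens": 120}
print(read_attr(old_style, "gen_ai.provider.name"))       # openai
print(read_attr(old_style, "gen_ai.usage.input_tokens"))  # 120
```

This is essentially a tiny version of what Langfuse's five prompt/completion code paths are doing at much larger scale.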

So in practice it is not like I can just drop the custom attributes. I need to keep them and map them to the spec attributes when they match. Sure, we have our own domain model too, but I am planning to follow the spec as much as possible, volatile as it is. I would rather chase spec churn than ship a fourth taxonomy nobody asked for.
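The "keep both" approach can be sketched as a one-way mapping table: copy known vendor attributes onto their gen_ai.* equivalents while preserving the originals. The ls_* names are the LangChain ones from earlier; the OpenInference spellings are as I understand that convention, and the table is illustrative, not exhaustive:

```python
# Vendor attribute -> gen_ai.* equivalent. Illustrative, not exhaustive.
TO_SPEC = {
    # OpenInference (spellings as I understand the convention)
    "llm.model_name": "gen_ai.request.model",
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.token_count.completion": "gen_ai.usage.output_tokens",
    # LangChain / LangSmith
    "ls_provider": "gen_ai.system",
    "ls_model_name": "gen_ai.request.model",
    "ls_temperature": "gen_ai.request.temperature",
}

def normalize(attrs: dict) -> dict:
    """Return attrs plus spec-named copies; spec names already present win."""
    out = dict(attrs)
    for src, dst in TO_SPEC.items():
        if src in attrs and dst not in attrs:
            out[dst] = attrs[src]
    return out

span = {"ls_provider": "openai", "ls_model_name": "gpt-4o", "ls_temperature": 0.2}
normalized = normalize(span)
print(normalized["gen_ai.request.model"])  # gpt-4o
print("ls_model_name" in normalized)       # True (original kept)
```

Keeping the originals matters: the spec will rename things again, and the raw attributes are the only ground truth you can re-map later.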

So why is it like that?

A few plausible reasons.

Every tool had a working LLM data model before the spec landed in April 2024. So for them, the spec arrives as a migration ask, not a green-field choice. Phoenix's whole product is built on OpenInference. Langfuse has its Generation/Observation/Score model in ClickHouse. MLflow has its own trace schema. Opik has its Java DTOs.

OpenInference got there about three months earlier and is frankly better suited to the job. It puts content directly on the span, which is what most span-first backends like Tempo or Jaeger can actually render.
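To illustrate the difference: OpenInference flattens message lists into plain span attributes, so any trace viewer can show them, while the current gen_ai.* direction carries content on events that a span-only viewer will simply not render. The attribute spellings below follow OpenInference's flattening scheme as I understand it; treat them as illustrative:

```python
# OpenInference-style span: message content lives in flattened attributes.
openinference_style = {
    "llm.input_messages.0.message.role": "user",
    "llm.input_messages.0.message.content": "Hello!",
    "llm.output_messages.0.message.role": "assistant",
    "llm.output_messages.0.message.content": "Hi there.",
}

def messages(attrs: dict, direction: str) -> list[tuple[str, str]]:
    """Rebuild a message list from flattened OpenInference attributes."""
    out, i = [], 0
    while f"llm.{direction}_messages.{i}.message.role" in attrs:
        out.append((
            attrs[f"llm.{direction}_messages.{i}.message.role"],
            attrs[f"llm.{direction}_messages.{i}.message.content"],
        ))
        i += 1
    return out

print(messages(openinference_style, "input"))   # [('user', 'Hello!')]
print(messages(openinference_style, "output"))  # [('assistant', 'Hi there.')]
```

A backend that only stores spans gets the full conversation for free this way, with no event pipeline required.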

The spec is still marked as in development, and none of the relevant attributes are marked stable. As I've found out, it is hard to commit to baking something that keeps moving into your internal schema.

And the spec has gaps for things production cares about. Cost is the obvious one. There is no canonical attribute for token cost. So three of the four tools just invented their own (gen_ai.usage.cost in Langfuse, mlflow.llm.cost in MLflow, custom routes in Opik). Same story for evaluation. The gen_ai.evaluation.* namespace exists in the registry, but zero of the four tools touch it. They all ship their own eval/scoring systems instead.
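Since there is no canonical cost attribute, a sink has to compute cost itself from the token counts and a price table. A minimal sketch, where the per-million-token prices and the model name are made-up placeholders:

```python
# (input, output) USD per 1M tokens. Placeholder prices for illustration only.
PRICES_PER_MTOK = {
    "example-model": (3.00, 15.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate from token counts; no spec attribute covers this."""
    p_in, p_out = PRICES_PER_MTOK[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

print(estimate_cost("example-model", 1_000_000, 0))  # 3.0
print(estimate_cost("example-model", 412, 98))
```

Every one of the four tools is doing some version of this internally, each under a different attribute name, which is exactly the fragmentation the spec was supposed to prevent.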

Anyway, this is business as usual in the LLM space: everything moves fast and the specs play catch-up.
