Observability setup

When to use this

You run Graftcode in production (or staging) and need correlated logs and traces across Graft calls — caller, Gateway, and receiver — using tools you already operate (OpenTelemetry, Application Insights, Grafana, Datadog, ELK, and so on).

Conceptual background: Observability, tracing, and context propagation.

Prerequisites

Services that emit logs or traces today (or a plan to add them)
Caller and receiver both running Grafts (or one side in-memory in the same process)
For Gateway diagnostics: access to GG console output and Graftcode Vision

What Graftcode provides (and what it does not)

Graftcode provides	Graftcode does not provide
Trace and span context propagated inside IIP invocation intent across Hypertube boundaries	A proprietary APM backend or log aggregator
Continuous trace chains across in-memory, polyglot, and remote execution	Payload logging or business-data export to the cloud
Exception propagation with context intact	A replacement for your logging framework

Integration model: keep your existing observability stack. Graftcode ensures context crosses runtime boundaries so your exporters and log pipelines can correlate caller and callee the same way they do for HTTP or gRPC.

You do not install a separate Graftcode observability SDK. Generated Grafts carry the runtime bridge; wire correlation into the logging and tracing you already use.

Distributed trace propagation across caller, Graft, Gateway, and your OpenTelemetry backend

Step 1: Use structured logging with correlation fields

Ensure every service logs structured fields, not plain text only. At minimum include:

trace_id / traceId (W3C trace context when using OpenTelemetry)
span_id / spanId
A business correlation ID if your org uses one (correlationId, requestId, …)
Service name and environment

TypeScript (example with OpenTelemetry context):

import { trace } from "@opentelemetry/api";

const span = trace.getActiveSpan();
const spanContext = span?.spanContext();

logger.info("Processing order", {
  orderId: "123",
  traceId: spanContext?.traceId,
  spanId: spanContext?.spanId,
});

.NET (example with ILogger and OpenTelemetry):

using Microsoft.Extensions.Logging;

_logger.LogInformation(
    "Processing order {OrderId}",
    orderId);
// With ASP.NET Core + OpenTelemetry instrumentation, trace/span IDs
// are attached automatically when logging scopes are enabled.

Python (example):

import logging

logger = logging.getLogger(__name__)
logger.info("Processing order", extra={"order_id": "123"})
# Attach OpenTelemetry LoggingInstrumentor or read active span context
# and add trace_id / span_id to `extra` if your backend expects them.

Apply the same pattern on both the caller and the service behind the Gateway. If only one side emits trace IDs, distributed traces will fragment.

Step 2: Enable distributed tracing with OpenTelemetry

OpenTelemetry is the recommended integration path. Graftcode does not require a custom exporter — use the SDK and backends you already run.

Typical setup per service:

Install the OpenTelemetry SDK for your runtime (.NET, Node.js, Python, Java, …).
Register a TracerProvider with your exporter (OTLP, Jaeger, Azure Monitor / Application Insights, etc.).
Enable auto-instrumentation where available (HTTP, ASP.NET Core, FastAPI, …).
Deploy the same trace context format (W3C traceparent) across all services in the call chain.

Official starting points:

Graft calls propagate trace context inside IIP, not as HTTP headers. Your application code usually stays unchanged; ensure the active span context at the call site is picked up by Graftcode's runtime bridge (same as any in-process call in that runtime).

Step 3: Understand trace propagation across Graft boundaries

When a Graft call crosses processes:

The caller creates or continues a distributed trace (via your OTel setup).
Trace context is embedded in the IIP invocation intent sent over Hypertube.
The receiver (Gateway-hosted runtime) continues the same trace for its spans and logs.
Results and exceptions return with context preserved.

This works for:

Remote calls (GraftConfig.host set to tcp:// or ws://)
In-memory and polyglot execution (same process, different runtimes)
Stateful sessions (callbacks and events stay within the same trace when your stack supports it)

For load-balanced Gateways, combine with Stateless vs stateful execution and Scaling, load balancers, and proxies.

Step 4: Gateway and Portal visibility

Use the right tool for each observability layer:

Layer	Tool	Use for
Functional testing	Graftcode Vision	Inspect exposed methods, try calls, copy install commands
Protocol debugging	`GG_DEBUG=1`	Byte-level IIP traffic in GG console — dev/troubleshooting only (Debugging runbook)
Project analytics	Graftcode Portal / Insights	Aggregated call counts, latency, version usage when bound to a Project Key — not a substitute for full APM
Production APM	Your OpenTelemetry / cloud monitor stack	Traces, logs, alerts, SLOs

GG_DEBUG is verbose and not suitable for production. Use it only when diagnosing wire-level issues.

Step 5: Correlate errors across services

When a Graft call fails:

The exception type and message propagate to the caller (native to your language).
Trace context should remain attached so your APM shows one connected trace, not isolated spans.

In your runbooks, search by trace ID first, then by business correlation ID. See Troubleshooting runbook for symptom-based fixes.

Verify it works

Trigger a Graft call that crosses at least two services (or caller + Gateway receiver).
Confirm the same trace ID appears in caller and receiver logs (or in your tracing UI).
Confirm spans link parent → child in Jaeger, Application Insights, Grafana Tempo, or your backend.
Induce a controlled failure and verify the trace shows the error on both sides.

If traces break at the Graft boundary, check that both services export OTel and that structured logs include trace identifiers — not that a separate Graftcode logging package is missing.

Common mistakes

Mistake	Fix
Traces stop at the Gateway	Enable OTel and structured logging on receiver services, not just callers
Only plain-text logs	Add structured `traceId` / `spanId` (or use OTel log correlation)
Using `GG_DEBUG` in production	Use OTel + your APM; reserve `GG_DEBUG` for local protocol diagnosis
Expecting Portal Insights to replace APM	Portal metrics are aggregated and non-payload; keep fine-grained telemetry local
Stateful calls behind a load balancer	Session stickiness or stateless interfaces — see Stateless vs stateful