OpenTelemetry Implementation Guide: Cloud-Native Observability for Azure Functions and Web APIs

Comprehensive technical guide to implementing OpenTelemetry in cloud environments, covering distributed tracing, metrics, and logging patterns for Azure Functions and ASP.NET Core Web APIs.

Fernando · 17 min read

What is OpenTelemetry? #

OpenTelemetry is a vendor-neutral, open-source standard for collecting telemetry data from cloud-native applications. It provides a unified approach to instrumentation that eliminates vendor lock-in and enables teams to collect traces, metrics, and logs through a single SDK.

The core problem OpenTelemetry solves: Traditional observability tooling requires vendor-specific instrumentation. Switching from one observability platform to another requires rewriting all instrumentation code. OpenTelemetry provides a single implementation that exports to any backend—Azure Monitor, Datadog, Grafana, Jaeger, Prometheus, or multiple platforms simultaneously.

Key characteristics:

  • Vendor-neutral - Instrument once, export to any observability backend
  • Standard protocol - OTLP (OpenTelemetry Protocol) ensures cross-platform compatibility
  • Three signal types - Unified collection of traces, metrics, and logs
  • Automatic instrumentation - Built-in support for common libraries (HTTP, databases, messaging)
  • Custom instrumentation - Extend with business-specific telemetry

Understanding Observability vs Monitoring #

Monitoring answers the question: “Is the system working?” It provides predefined dashboards, alerts, and health checks based on known failure modes.

Observability answers the question: “Why is the system behaving this way?” It enables arbitrary queries against telemetry data to debug unknown problems without predicting failure modes in advance.

Distributed systems require observability because failures emerge from complex interactions between components. Logs scattered across multiple services cannot reconstruct request flows. Observability provides the context to understand system behavior under all conditions, not just anticipated failure scenarios.

OpenTelemetry Signal Types: Traces, Metrics, and Logs #

OpenTelemetry collects three types of telemetry signals that work together to provide comprehensive observability. Each signal type serves a specific purpose and answers different questions about system behavior.

Distributed Tracing #

What it is: A trace represents a single request’s journey through your distributed system. Each step in that journey is a span. Traces connect spans using a unique trace ID, creating a complete picture of request flow across service boundaries.

Request flow example:

  1. HTTP request arrives at API Gateway (span 1)
  2. Gateway invokes Azure Function for inventory validation (span 2)
  3. Function queries database (span 3)
  4. Function invokes payment processing service (span 4)
  5. Response returns to client (back through span 1)

What traces capture:

  • Timing information for each operation
  • Parent-child relationships between spans
  • Trace ID that connects distributed operations
  • Span attributes (tags) for filtering and search
  • Error states and exception details

Use case: A performance investigation shows 95% of requests complete in under 200ms, but 5% take 8+ seconds. Filtering traces by latency reveals slow requests all occur during a specific time window. Further investigation shows database lock contention from a background job. This correlation between timing patterns and system behavior cannot be discovered through logs alone.
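
To make these relationships concrete, here is a self-contained sketch using .NET's System.Diagnostics Activity API (the same primitives OpenTelemetry builds on). The source name is hypothetical, and the ActivityListener exists only because, without it or the OpenTelemetry SDK, StartActivity returns null:

using System;
using System.Diagnostics;

class TraceDemo
{
    // Hypothetical source name for this illustration
    private static readonly ActivitySource Source = new("demo.checkout");

    static void Main()
    {
        // Without a listener (or the OpenTelemetry SDK), StartActivity returns null
        ActivitySource.AddActivityListener(new ActivityListener
        {
            ShouldListenTo = _ => true,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
                ActivitySamplingResult.AllDataAndRecorded
        });

        using var parent = Source.StartActivity("HandleRequest");      // span 1
        using var child = Source.StartActivity("ValidateInventory");   // span 2, child of span 1

        // Both spans carry the same TraceId; the child records its parent's SpanId
        Console.WriteLine($"TraceId:        {parent?.TraceId}");
        Console.WriteLine($"Child TraceId:  {child?.TraceId}");
        Console.WriteLine($"Parent SpanId:  {parent?.SpanId}");
        Console.WriteLine($"Child's parent: {child?.ParentSpanId}");
    }
}

Running it prints the same trace ID for both spans, with the child's ParentSpanId pointing at the parent - exactly the structure a tracing backend renders as a waterfall.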

Metrics #

What it is: Numerical measurements aggregated over time intervals. Metrics provide quantitative data about system health and performance without the detailed context of individual requests.

Common metric types:

  • Request throughput (requests/second)
  • Error rate percentage
  • Resource utilization (CPU, memory, disk)
  • Queue depth and message backlog
  • Custom business metrics (orders processed, cache hit rate)

Use case: Metrics serve as the early warning system. A spike in error rate indicates a problem. Metrics identify when and how often issues occur. Traces provide the why by showing specific failing request paths.
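
Custom business metrics are created with .NET's System.Diagnostics.Metrics API and exported once the meter name is registered via AddMeter in the metrics pipeline. A minimal sketch, with a hypothetical meter name and instrument names:

using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class OrderMetrics
{
    // Hypothetical meter name; register it with .WithMetrics(metrics => metrics.AddMeter("ordering.metrics"))
    private static readonly Meter Meter = new("ordering.metrics");

    private static readonly Counter<long> OrdersProcessed =
        Meter.CreateCounter<long>("orders.processed", unit: "{orders}",
            description: "Number of orders processed");

    private static readonly Histogram<double> OrderTotal =
        Meter.CreateHistogram<double>("order.total", unit: "USD",
            description: "Order total amount");

    public static void RecordOrder(string paymentMethod, double total)
    {
        // Dimensions (tags) let you break the metric down in the backend
        OrdersProcessed.Add(1, new KeyValuePair<string, object?>("payment.method", paymentMethod));
        OrderTotal.Record(total);
    }
}

Call OrderMetrics.RecordOrder(...) wherever an order completes; the only other change is registering the meter name in the metrics pipeline.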

Logs #

What it is: Event records with timestamps and contextual information. Logs provide narrative details about system operations and state changes.

Structured logging principles:

  • Use consistent JSON schema across services
  • Include trace ID and span ID for correlation
  • Add contextual fields (user ID, tenant ID, operation)
  • Avoid excessive logging that creates noise
  • Log decisions and state changes, not every step

Integration with traces: When you identify a failing trace, query logs filtered by that trace ID to see detailed event sequences. This correlation transforms logs from scattered events into a narrative explaining what the system did during that specific request.

Implementation Guide: Azure Functions vs Web APIs #

OpenTelemetry implementation differs significantly between serverless Azure Functions and traditional ASP.NET Core Web APIs. Each platform has distinct architectural characteristics that affect instrumentation strategy.

This section provides step-by-step configuration instructions for both platforms.

Azure Functions Implementation #

Azure Functions present unique observability challenges due to their ephemeral, event-driven nature. Unlike long-running processes, Functions consist of hundreds of short-lived executions that must maintain trace context across invocations.

Architectural consideration: Azure Functions use a dual-process model:

  1. Functions Host - Runtime that receives triggers and manages lifecycle
  2. Worker Process - Application code (Program.cs and function implementations)

Complete observability requires instrumenting both processes. Failure to instrument either process results in incomplete traces and broken context propagation.

Step 1: Configure the Functions Host #

Enable OpenTelemetry output in host.json:

{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "telemetryMode": "openTelemetry"
    }
  }
}

This configuration instructs the Functions Host to emit telemetry as OpenTelemetry signals instead of using the legacy Application Insights SDK. Without this setting, duplicate telemetry and inconsistent trace correlation will occur.

Step 2: Configure the Worker Process #

For .NET isolated Functions, add OpenTelemetry to dependency injection in Program.cs:

using Azure.Monitor.OpenTelemetry.Exporter;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var host = new HostBuilder()
    .ConfigureFunctionsWebApplication()
    .ConfigureServices(services =>
    {
        services.AddApplicationInsightsTelemetryWorkerService();
        services.ConfigureFunctionsApplicationInsights();

        services.AddOpenTelemetry()
            .ConfigureResource(resource => resource
                .AddService(serviceName: "my-function-app"))
            .UseFunctionsWorkerDefaults()
            .WithTracing(tracing => tracing
                .AddAspNetCoreInstrumentation()
                .AddHttpClientInstrumentation()
                .AddAzureMonitorTraceExporter())
            .WithMetrics(metrics => metrics
                .AddAspNetCoreInstrumentation()
                .AddHttpClientInstrumentation()
                .AddAzureMonitorMetricExporter());
    })
    .Build();

host.Run();

Configuration requirements:

  1. UseFunctionsWorkerDefaults() - Critical method that sets up special instrumentation for correlating host and worker telemetry in Azure Functions.

  2. Service name - Becomes the identifier across all telemetry. Must be consistent across all configuration.

  3. Selective instrumentation - Include instrumentation only for libraries actually used in your application (e.g., ASP.NET Core for HTTP triggers, HttpClient for outbound calls). Instrumenting unused libraries adds overhead without benefit.
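
The Azure Monitor exporters also need an Application Insights connection string. A sketch of supplying it explicitly from the APPLICATIONINSIGHTS_CONNECTION_STRING app setting (the same setting name the Collector configuration later in this guide relies on); adjust to however you manage configuration:

services.AddOpenTelemetry()
    // ...
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddAzureMonitorTraceExporter(options =>
            // Read the connection string from the app setting / environment variable
            options.ConnectionString =
                Environment.GetEnvironmentVariable("APPLICATIONINSIGHTS_CONNECTION_STRING")));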

Required NuGet packages:

<PackageReference Include="Microsoft.Azure.Functions.Worker.OpenTelemetry" Version="1.1.0-preview2" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.9.0" />
<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.9.0" />
<PackageReference Include="Azure.Monitor.OpenTelemetry.Exporter" Version="1.4.0" />

The AddAspNetCoreInstrumentation() and AddHttpClientInstrumentation() calls in the code above additionally require the OpenTelemetry.Instrumentation.AspNetCore and OpenTelemetry.Instrumentation.Http packages.

Tracking Cold Starts #

Serverless functions experience cold starts where the first request to a function instance takes longer due to runtime initialization. These appear in traces as unusually long spans for initial requests.

Track cold starts explicitly by adding custom attributes to spans:

// Static flag lives for the lifetime of this worker instance
private static bool _isWarmedUp;

[Function("ProcessOrder")]
public async Task<HttpResponseData> Run(
    [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
    FunctionContext context)
{
    var activity = Activity.Current;

    // Check if this is a cold start by looking for a static flag
    if (!_isWarmedUp)
    {
        activity?.SetTag("cold.start", true);
        _isWarmedUp = true;
    }

    // Your function logic here

    return req.CreateResponse(HttpStatusCode.OK);
}

This enables filtering traces by cold starts to analyze their performance characteristics separately from warm executions.

Web API Implementation #

Traditional Web APIs (ASP.NET Core) are simpler to instrument due to their long-running process model with full control over the startup pipeline.

Configure OpenTelemetry in Program.cs:

using OpenTelemetry.Exporter;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Add services to the container
builder.Services.AddControllers();

// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "order-api",
            serviceVersion: "1.0.0"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation(options =>
        {
            // Record exceptions and enrich spans with request/response details
            options.RecordException = true;
            options.EnrichWithHttpRequest = (activity, request) =>
            {
                activity.SetTag("http.route", request.Path);
            };
            options.EnrichWithHttpResponse = (activity, response) =>
            {
                activity.SetTag("http.response.status_code", response.StatusCode);
            };
        })
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation(options =>
        {
            options.SetDbStatementForText = true; // Capture SQL queries
        })
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://localhost:4317");
            options.Protocol = OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://localhost:4317");
            options.Protocol = OtlpExportProtocol.Grpc;
        }));

var app = builder.Build();

app.MapControllers();
app.Run();

Key differences from Azure Functions:

  1. Single-process model - All code runs in one process, requiring instrumentation configuration only once.

  2. Enhanced instrumentation options - Support for EF Core query capture, runtime metrics, and HTTP instrumentation customization via enrichment callbacks.

  3. Simplified local development - OpenTelemetry Collector can run locally for immediate trace visibility without cloud resource dependencies.

  4. State monitoring - Web APIs typically maintain more state (caches, connection pools) requiring custom metric instrumentation.
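
For that last point, an ObservableGauge is the natural instrument: the SDK reads the current value on every collection cycle, so long-lived state never needs explicit Record() calls. A minimal sketch with a hypothetical meter name, observing an in-memory cache:

using System.Diagnostics.Metrics;
using Microsoft.Extensions.Caching.Memory;

public static class StateMetrics
{
    // Hypothetical meter name; expose it to OpenTelemetry with .AddMeter("order-api.state")
    private static readonly Meter Meter = new("order-api.state");

    public static void ObserveCache(MemoryCache cache)
    {
        // Observable instruments are read by the SDK on each collection interval,
        // so long-lived state (caches, pools, queues) reports itself automatically.
        Meter.CreateObservableGauge(
            "cache.entries",
            () => (long)cache.Count,
            description: "Entries currently held in the in-memory cache");
    }
}

Register it once at startup and the gauge reports the live value from then on.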

Required NuGet packages:

<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.9.0" />
<PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.9.0" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.9.0" />
<PackageReference Include="OpenTelemetry.Instrumentation.EntityFrameworkCore" Version="1.9.0-beta.1" />
<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.9.0" />

The AddRuntimeInstrumentation() call above also requires the OpenTelemetry.Instrumentation.Runtime package.

OpenTelemetry Collector: Centralized Telemetry Hub #

The OpenTelemetry Collector pattern provides a powerful architecture for managing telemetry data.

Direct service-to-backend export (e.g., each service exporting directly to Azure Monitor) works but creates inflexibility. Adding additional backends (such as Jaeger for local development traces or Prometheus for metrics) requires updating configuration across all services.

The Collector pattern solves this by acting as a centralized proxy:

Services → Collector → Multiple Backends

Your services send telemetry to the Collector using the OTLP protocol (OpenTelemetry Protocol). The Collector then routes that data to whatever backends you’ve configured—Azure Monitor, Datadog, Grafana, Jaeger, Prometheus, or all of the above.

Here’s a simplified Collector configuration (otel-collector-config.yaml):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add resource attributes to all telemetry
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  # Azure Monitor for production observability
  azuremonitor:
    connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}"

  # Jaeger for local development
  jaeger:
    endpoint: "jaeger:14250"
    tls:
      insecure: true

  # Prometheus for metrics
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [azuremonitor, jaeger]

    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [azuremonitor, prometheus]

Benefits of the Collector pattern:

  • Single configuration point - Change backends without modifying service code
  • Data transformation - Add attributes, filter spans, and apply sampling
  • Cost control - Implement sampling strategies to reduce telemetry volume
  • Multi-backend support - Route data to multiple observability platforms simultaneously

Deployment options:

  • Sidecar container in Kubernetes clusters
  • Standalone service in Azure Container Instances for serverless functions
  • Docker container for local development
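
For the Docker option, a sketch of a Collector service to add under services: in a Compose file, assuming the otel-collector-config.yaml from above sits alongside it (the image tag and mount path are assumptions; confirm them against the Collector release you deploy, since available exporters change between versions):

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from instrumented services
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus exporter endpoint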

Sampling Strategies for Production #

Capturing every trace in production is cost-prohibitive due to data volume. Production systems require sampling strategies.

Sampling intelligently selects which traces to retain while maintaining statistical validity for system analysis. Two primary strategies exist:

Head Sampling (Simple) #

Decide at the start of a trace whether to record it, typically based on a percentage:

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetSampler(new TraceIdRatioBasedSampler(0.1))); // Sample 10% of traces

Head sampling is simple to implement but has limitations. It discards 90% of traces regardless of content, including potentially critical traces (errors, slow requests).
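
In practice, head sampling is usually wrapped in a parent-based sampler so that downstream services honor the sampling decision already recorded in the incoming traceparent header, keeping traces complete rather than re-sampling at every hop:

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        // Root spans: sample 10% of traces.
        // Child spans: follow the decision carried in the incoming traceparent header.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.1))));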

Tail Sampling (Smart) #

Make sampling decisions at the end of a trace based on what actually happened. Keep all errors, keep slow requests, sample everything else at a lower rate.

Tail sampling requires the Collector because you need to wait for the entire trace to complete:

processors:
  tail_sampling:
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests (>1s)
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of successful, fast requests
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [azuremonitor]

This configuration provides complete error traces while controlling costs on successful requests.

Recommended sampling rates:

  • 100% of error traces
  • 100% of requests exceeding 500ms latency
  • 5% of all other requests

This combination provides comprehensive observability for problematic requests while managing costs for normal operations.

Custom Instrumentation: When Auto-Instrumentation Isn’t Enough #

Auto-instrumentation covers common libraries (HTTP, database, etc.), but your business logic is invisible by default. This is where custom spans come in.

Here’s an example from an order processing function:

[Function("ProcessOrder")]
public async Task<HttpResponseData> Run(
    [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
    FunctionContext context)
{
    var logger = context.GetLogger<ProcessOrder>();

    // The current Activity represents this function's span
    var activity = Activity.Current;
    activity?.SetTag("order.source", "web");

    var order = await JsonSerializer.DeserializeAsync<Order>(req.Body);
    activity?.SetTag("order.id", order.Id);
    activity?.SetTag("order.total", order.Total);

    // Create a custom span for inventory validation
    using (var inventoryActivity = _activitySource.StartActivity("ValidateInventory"))
    {
        inventoryActivity?.SetTag("order.items.count", order.Items.Count);

        try
        {
            var isValid = await _inventoryService.ValidateAsync(order.Items);
            inventoryActivity?.SetTag("inventory.valid", isValid);

            if (!isValid)
            {
                inventoryActivity?.SetStatus(ActivityStatusCode.Error, "Insufficient inventory");
                return req.CreateResponse(HttpStatusCode.BadRequest);
            }
        }
        catch (Exception ex)
        {
            inventoryActivity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            inventoryActivity?.RecordException(ex);
            throw;
        }
    }

    // Create another custom span for payment processing
    using (var paymentActivity = _activitySource.StartActivity("ProcessPayment"))
    {
        paymentActivity?.SetTag("payment.method", order.PaymentMethod);
        paymentActivity?.SetTag("payment.amount", order.Total);

        var result = await _paymentService.ChargeAsync(order);
        paymentActivity?.SetTag("payment.transaction_id", result.TransactionId);
    }

    logger.LogInformation("Order {OrderId} processed successfully", order.Id);

    return req.CreateResponse(HttpStatusCode.OK);
}

Key patterns in custom instrumentation:

  1. Use an ActivitySource - create one per logical component:

     private static readonly ActivitySource _activitySource = new("OrderProcessing");

  2. Always use using statements - ensures spans close even if exceptions occur.

  3. Add business-relevant tags - order ID, customer ID, payment method. These make traces searchable.

  4. Record exceptions explicitly - RecordException() captures full stack traces.

  5. Set span status - SetStatus() marks spans as errors, which affects sampling and alerting.
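
One step that is easy to miss: spans from a custom ActivitySource are only exported if that source name is registered with the tracer provider. Add it alongside the other instrumentation in whichever pipeline hosts the code (Functions worker or Web API), for example:

services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        // Without AddSource, spans created through _activitySource are silently dropped
        .AddSource("OrderProcessing")
        .AddAzureMonitorTraceExporter());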

Context Propagation: The Hidden Magic #

The most subtle and critical part of distributed tracing is context propagation—ensuring trace IDs flow correctly across service boundaries.

OpenTelemetry uses W3C Trace Context headers to propagate this information:

  • traceparent - Contains trace ID, span ID, and sampling decision
  • tracestate - Vendor-specific data

When service A calls service B:

// Service A
using var httpClient = new HttpClient();
// OpenTelemetry automatically injects the traceparent header
var response = await httpClient.GetAsync("https://service-b/api/orders");

// Service B
// OpenTelemetry automatically extracts the traceparent header and continues the trace
[HttpGet]
public async Task<IActionResult> GetOrders()
{
    // This span automatically becomes a child of the span from Service A
    return Ok(await _orderRepository.GetAllAsync());
}

This automatic propagation works for HTTP calls. Other scenarios require manual handling.

Message queues (Azure Service Bus, RabbitMQ, etc.) - propagate the trace context through message metadata:

// Publishing side
var message = new ServiceBusMessage(JsonSerializer.Serialize(order))
{
    ApplicationProperties =
    {
        ["traceparent"] = Activity.Current?.Id
    }
};
await sender.SendMessageAsync(message);

// Consuming side
var traceparent = message.ApplicationProperties["traceparent"] as string;
if (traceparent != null)
{
    var activity = new Activity("ProcessMessage");
    activity.SetParentId(traceparent); // Continue the trace started by the publisher
    activity.Start();                  // Start() also sets Activity.Current
}

Background jobs - Requests that trigger background jobs lose trace context by default. Store the trace ID with the job for continuity:

// When queuing the background job
await _jobQueue.EnqueueAsync(new BackgroundJob
{
    OrderId = order.Id,
    TraceParent = Activity.Current?.Id // Capture current trace context
});

// When processing the background job
var job = await _jobQueue.DequeueAsync();

// Start a new span that links to the original trace
using var activity = _activitySource.StartActivity(
    "ProcessBackgroundJob",
    ActivityKind.Consumer,
    job.TraceParent); // Resume the original trace

// Process the job

Local Development Setup #

OpenTelemetry enables local observability during development. Set up a local observability stack using Docker Compose:

docker-compose.yml:

version: '3.8'

services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC receiver
      - "4318:4318"    # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
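
The Compose file mounts a prometheus.yml that is not shown above. A minimal sketch of what it might contain, assuming you also run the Collector (for example the Compose service sketched in the Collector section) and scrape its Prometheus exporter on port 8889; the target is a placeholder to point at whatever actually serves your metrics:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      # Placeholder target: the Collector's Prometheus exporter endpoint (0.0.0.0:8889)
      - targets: ["otel-collector:8889"]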

Run docker-compose up to access:

  • Jaeger UI at http://localhost:16686 for viewing traces
  • Prometheus at http://localhost:9090 for metrics
  • Grafana at http://localhost:3000 for dashboards

Local services export to this stack using the same OTLP exporters as production—only the endpoint URL differs. This enables full distributed tracing debugging during local development.
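
One way to keep that switch friction-free is to drive the exporter endpoint from configuration. A sketch for the Web API, assuming a hypothetical Otlp:Endpoint configuration key:

using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// "Otlp:Endpoint" is an assumed configuration key for this sketch.
// appsettings.Development.json points it at http://localhost:4317;
// production configuration points it at the Collector's address.
var otlpEndpoint = builder.Configuration["Otlp:Endpoint"] ?? "http://localhost:4317";

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter(options => options.Endpoint = new Uri(otlpEndpoint)));

var app = builder.Build();
app.Run();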

Benefits of local observability:

  • Identify race conditions before production deployment
  • Debug deadlocks and performance issues locally
  • Validate instrumentation configuration
  • Test sampling strategies
  • Develop with production-like visibility

Best Practices and Common Pitfalls #

Critical lessons for successful OpenTelemetry implementation in production environments:

Start with the Collector #

Don’t export directly from services to your observability backend. Start with the Collector from day one. It gives you flexibility, testability, and control.

Implement Instrumentation Incrementally #

Attempting comprehensive instrumentation simultaneously leads to complexity and delays. Use a phased approach:

  1. Start with auto-instrumentation for HTTP and database calls
  2. Add custom spans for critical business operations
  3. Add detailed instrumentation to problem areas as they are identified

Incremental instrumentation is more effective than attempting perfect coverage initially.

Strategic Tag Implementation #

Tags added to spans determine available query capabilities. Essential tags include:

  • User IDs (with appropriate privacy considerations)
  • Tenant IDs in multi-tenant systems
  • Active feature flags
  • Deployment version
  • Environment (staging, production)

Comprehensive tagging enables precise filtering: from “what happened” to “what happened for this specific user in this specific situation.”
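
A sketch of applying some of these tags consistently via ASP.NET Core middleware; the claim and header names are hypothetical, and static values such as version and environment are often better placed on the resource (as in the ConfigureResource calls earlier):

// In Program.cs (requires using System.Diagnostics;), after var app = builder.Build():
app.Use(async (context, next) =>
{
    var activity = Activity.Current;

    // Hypothetical claim and header names; adjust to your auth and tenancy model
    var userId = context.User.FindFirst("sub")?.Value;
    if (userId is not null)
    {
        activity?.SetTag("enduser.id", userId);
    }

    if (context.Request.Headers.TryGetValue("X-Tenant-Id", out var tenantId))
    {
        activity?.SetTag("tenant.id", tenantId.ToString());
    }

    activity?.SetTag("deployment.environment", app.Environment.EnvironmentName);

    await next();
});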

Logs Still Matter #

OpenTelemetry focuses on traces and metrics, but don’t neglect structured logging. The combination of traces and correlated logs is more powerful than either alone.

Use the trace ID in your log messages:

logger.LogInformation(
    "Processing order {OrderId} [TraceId: {TraceId}]",
    order.Id,
    Activity.Current?.TraceId);

This correlation enables finding all logs for a given trace, or all traces matching log search criteria.

Cost Management #

Observability costs can escalate rapidly without proper controls. Example: Unmanaged Azure Monitor telemetry can reach $5,000/month or more. Implementing tail sampling and data retention policies can reduce costs to $800/month while maintaining signal quality.

Cost control measures:

  • Implement sampling strategies (see Sampling section)
  • Define data retention policies
  • Monitor observability costs as a metric
  • Regularly review telemetry volume by service
  • Remove unnecessary instrumentation

Azure Functions Performance Considerations #

Azure Functions with OpenTelemetry experience higher cold-start latency (~500ms additional on first request) compared to the legacy Application Insights SDK. This represents the trade-off for vendor neutrality and enhanced telemetry capabilities.

Mitigation strategies for latency-sensitive scenarios:

  • Implement warm-up functions
  • Use provisioned instances
  • Evaluate if vendor neutrality justifies the latency impact

Documentation and Maturity #

The OpenTelemetry ecosystem evolves rapidly. Official documentation often lags 6-12 months behind current implementation state.

Recommended practices:

  • Check GitHub issues for current feature status
  • Review example repositories from project maintainers
  • Thoroughly test implementations before production deployment
  • Be aware that Azure Functions OpenTelemetry support remains in preview (as of February 2026)
  • Some features (such as full log correlation) continue to mature

Observability Impact on Development Practices #

OpenTelemetry is a tool that enables observability as an engineering practice. Comprehensive observability changes how teams approach development and operations:

Architecture decisions: Complete request tracing reduces risk in distributed architectures. Teams can confidently build distributed systems knowing they maintain full visibility across service boundaries.

Debugging methodology: Query existing traces instead of adding logs and redeploying. Example: “What commonalities exist among slow requests between 2 PM and 3 PM?” becomes a 30-second query rather than a 2-hour investigation.

Incident response: Reconstruct exact failure sequences instead of speculation. Post-mortems shift from “probable causes” to “definitive trace evidence showing failure sequence.”

Performance optimization: Make data-driven decisions about actual bottlenecks rather than assumed bottlenecks. Reduce premature optimization by measuring performance before optimizing.

What This Guide Doesn’t Cover #

This article focused on my experience with .NET, Azure Functions, and Web APIs. OpenTelemetry supports many more languages and platforms, each with their own nuances:

  • Other languages - Node.js, Python, Go, Java all have OpenTelemetry SDKs with different maturity levels
  • Other cloud providers - AWS Lambda and GCP Cloud Functions have their own considerations
  • Kubernetes observability - Service mesh integration, pod-level metrics
  • Mobile and frontend - Browser tracing, mobile SDK instrumentation
  • Security and privacy - PII handling, data sanitization, compliance

Key Takeaways #

OpenTelemetry implementation fundamentals for cloud services:

  • Distributed tracing reveals system behavior that logs and metrics alone cannot capture
  • Azure Functions require dual instrumentation of both host and worker processes
  • Web APIs offer richer instrumentation options but require careful configuration for production scale
  • The Collector pattern provides flexibility to route telemetry to multiple backends without code changes
  • Sampling strategies control costs while preserving visibility into errors and performance issues
  • Custom instrumentation captures business context that makes traces actionable
  • Local observability stack enables debugging during development rather than only in production
  • Observability changes development practices including architecture, debugging, and incident response

Comprehensive observability reduces mean time to resolution (MTTR) for production incidents and enables data-driven architecture decisions.
