
AWS Lambda, OpenTelemetry, and Grafana Cloud: a guide to serverless observability considerations


In our increasingly serverless world, observability isn’t just a “nice to have”—it’s essential. Serverless functions such as AWS Lambda bring incredible benefits, but they also introduce complexities, especially around monitoring and debugging. In a previous article, I provided a quick, practical guide for sending AWS Lambda traces to Grafana Cloud using OpenTelemetry.

Building on that, this post explores key considerations for enabling observability in AWS Lambda using Grafana Cloud Application Observability, our solution to monitor applications and minimize mean time to resolution. We’ll address serverless observability’s unique challenges, explore the advantages of using OpenTelemetry, review OpenTelemetry Collector deployment patterns, and look at how Grafana Cloud facilitates effective monitoring.

And with the Formula One season kicking off (I feel your pain if you’re a Ferrari fan like me 😅), I’ll bring it all together with a fun project that tracks the Formula One “Driver of the Day” using Lambda, OpenTelemetry, and Grafana Cloud.

Wait, why can’t I just install the OpenTelemetry SDK and be done?

So you’ve gone serverless. You’ve got your AWS Lambda functions doing their thing, and now you want observability. Naturally, you reach for OpenTelemetry—the standard toolkit for tracing, metrics, and logs—add the SDK, send the data to your backend, and you’re done, right?

Not quite.

If you’ve tried this with Lambda, chances are you ran into issues—missing traces, broken spans, weird gaps in your dashboards. That’s not because OpenTelemetry is broken. It’s because Lambda is different, and here’s why: 

  1. No traditional servers: Without direct server access, you can’t install agents or custom telemetry tools the same way you would in traditional environments.
  2. Highly distributed components: Serverless architectures are distributed by nature, making it harder to trace requests across components like Lambda, API Gateway, and DynamoDB.
  3. Ephemeral nature: Functions may exist only for a single request, so telemetry must be exported quickly. Many libraries don’t support forced flushing, which complicates data capture.
  4. Cold starts and stateless operations: Each invocation is isolated, which complicates trace and context propagation. Cold starts introduce latency, and maintaining trace continuity across requests requires specialized tools like OpenTelemetry.

These challenges require robust instrumentation, efficient data collection, and advanced correlation capabilities.

ADOT vs. custom SDKs: two ways to instrument Lambda with OpenTelemetry

Understanding the available methods for instrumenting AWS Lambda functions with OpenTelemetry is essential for enhancing observability. Developers typically take one of two approaches:

1. AWS Distro for OpenTelemetry (ADOT) Lambda layer

AWS Distro for OpenTelemetry (ADOT) is a fully managed, AWS-supported distribution of OpenTelemetry. It simplifies the observability setup by offering pre-configured AWS Lambda layers that automatically instrument your functions with minimal setup. You can find more details at aws-otel.github.io.

Here are some additional considerations when using ADOT:

  • Automatic instrumentation: ADOT includes an embedded, lightweight collector that facilitates plug-and-play observability. With this setup, you don’t need to modify your function’s code; telemetry data is automatically captured and sent to the configured backend.
  • Configuration: To use ADOT with your Lambda functions, add the ADOT Lambda layer to your function and configure environment variables to customize its behavior, such as defining the sampling rate, exporter endpoints, or specific tracing attributes.
  • Use case: This approach is ideal for quick setups and for teams looking to enable standard observability features (like tracing and metrics) without the need for custom instrumentation. It’s perfect for simpler serverless applications or for getting started with observability in AWS Lambda.

2. Custom OpenTelemetry Instrumentation

With custom OpenTelemetry instrumentation, you have the flexibility to fine-tune your telemetry data collection. Using the OpenTelemetry SDKs, you can implement both automatic and manual instrumentation, providing granular control over the data you collect and how it is processed.

Here are some additional considerations when using custom instrumentation:

  • Custom processing: If your observability needs go beyond basic setups, you can integrate custom collectors or use external systems for processing and exporting telemetry data. This approach enables more complex features like dynamic sampling, data transformation, or routing telemetry to multiple backends.
  • Use case: Custom instrumentation is suited for complex, large-scale applications that require advanced telemetry processing, or for teams that need to tailor observability solutions to meet specific business needs, such as integrating with custom backends or managing high-volume telemetry data.
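To make the second approach concrete, here's a rough sketch of a hand-instrumented Python handler using the OpenTelemetry SDK; the span and attribute names are illustrative rather than taken from a specific application:

# Rough sketch: manual instrumentation with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Runs once per execution environment, during function initialization
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def lambda_handler(event, context):
    # One span per invocation, enriched with whatever attributes you care about
    with tracer.start_as_current_span("handle-request") as span:
        span.set_attribute("faas.invocation_id", context.aws_request_id)
        # ... your business logic ...
        return {"statusCode": 200}

Note that with a BatchSpanProcessor you are also responsible for flushing buffered spans before the environment is frozen; we'll come back to that in the lifecycle discussion below.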

Choosing between these approaches depends on your application’s complexity and observability needs. For straightforward setups, ADOT offers a hassle-free solution, while custom instrumentation provides greater flexibility for more complex and advanced deployments.

Why the Lambda execution lifecycle affects your telemetry

Both approaches rely on understanding the specifics of Lambda’s execution model. If we want consistent, reliable observability, we need to know how and when telemetry data is collected—and more importantly, when it might be lost.

Initialization (Init)

  • Extension initialization: Lambda initializes extensions (e.g., monitoring tools) before the runtime and function code.
    • Telemetry collection can begin here, but any misconfigured extensions may result in missed early-stage data.
  • Runtime initialization: The function’s runtime environment is prepared.
    • This phase doesn’t emit telemetry directly, but setup time impacts cold start metrics.
  • Function initialization: Your code is loaded and ready to execute.
    • Custom instrumentation (e.g., manual SDK setup) often begins here.

Invocation (Invoke)

  • Cold start: The first invocation in a new execution environment introduces latency.
    • Telemetry tools capture this, but it’s important to configure them to record cold start time separately.
  • Warm start: Subsequent invocations reuse the same environment.
    • Telemetry is typically more consistent, but initialization events are skipped.

Shutdown

  • Resource cleanup: The environment is terminated, and resources are freed.
    • Telemetry buffers may be lost if they’re not flushed before shutdown.
AWS Lambda lifecycle phases

The diagram above illustrates the different lifecycle phases. It comes from the AWS Lambda lifecycle documentation, where you can find more details on this topic.

Why this matters

If your OpenTelemetry SDK is buffering spans or metrics, you must flush them before shutdown. Otherwise, data is lost. Many SDKs don’t flush automatically, especially in short-lived environments like Lambda. This is why tools like ADOT handle telemetry flushing internally—and why lifecycle awareness is critical when doing it yourself.
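In Python, for example, that can be as simple as flushing the tracer provider before the handler returns (this assumes the SDK TracerProvider is already configured, as in the earlier sketch):

from opentelemetry import trace

def lambda_handler(event, context):
    # ... instrumented work happens here ...
    response = {"statusCode": 200}

    # Export anything still buffered in the BatchSpanProcessor before returning;
    # otherwise those spans may never leave a frozen or recycled sandbox.
    # Assumes the global provider is the SDK TracerProvider, which exposes force_flush().
    trace.get_tracer_provider().force_flush()
    return response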

Collector deployment options: Direct, agent, or gateway?

Now that we’ve covered the various ways to instrument our applications and generate telemetry data, the next question is how to send the data to the observability platform. Let’s explore the different options available.

1. Direct integration—no collector 

With direct integration, telemetry data is sent straight to the observability vendor without using an OpenTelemetry Collector. This approach offers a simple setup, requiring only the configuration of applications to transmit data directly to the vendor’s OpenTelemetry endpoint. However, it comes with notable drawbacks:

  • Vendor lock-in: Strong dependency on a single vendor
  • Limited flexibility: Any major change in telemetry collection requires redeploying all applications
  • Processing limitations: Difficult to modify data (e.g., changing attributes), inefficient batching, and only head-based sampling is available
  • Increased maintenance complexity: Minimal room for customization or optimization in data handling
A diagram showing data being sent from Lambda to an OpenTelemetry-compatible observability platform

2. Agent deployment—layer with ADOT or OpenTelemetry

With agent-based deployment, a collector is used to send telemetry data from the application, such as a Lambda layer with an OpenTelemetry Collector. This approach enhances observability by providing additional processing capabilities and flexibility.

Benefits of using “agent” mode with a collector:

  • Improved processing capabilities for handling telemetry data more efficiently
  • Configurable options such as batching, retries, and data modification
  • Access to more features of the OpenTelemetry Collector for better customization and optimization

Drawbacks of this setup:

  • Requires a collector to run alongside each application, adding complexity
  • Increased maintenance effort when managing multiple services
  • Regular upgrades and security patching are necessary for each deployed collector
A diagram showing data going from Lambda to an OTel Collector layer extension and then to an observability platform

3. Gateway deployment

With a gateway deployment, a centralized OpenTelemetry Collector operates behind a load balancer, providing a scalable, highly available, and easily upgradable solution for telemetry data management.

Benefits of gateway mode:

  • Single endpoint for all services, allowing backend components to be replaced or evolved without impacting applications
  • Centralized configuration, simplifying management and reducing operational overhead
  • Batching, compression, and attribute modification (e.g., reducing cardinality) for optimized performance
  • Better solution for tail-based sampling: the first layer of collectors routes all spans of a trace to the same collector in the downstream deployment using a load-balancing exporter, while the second layer performs tail sampling (see the sketch after this list)
  • Multi-vendor support, enabling data to be sent to multiple vendors simultaneously for comparison.
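To make the tail-sampling point concrete, here is a rough sketch of the relevant pieces of the two collector tiers; hostnames, timings, and policies are illustrative, not taken from a production setup:

# Tier 1: route all spans of a trace to the same downstream collector
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames: [collector-a:4317, collector-b:4317]

# Tier 2: make the sampling decision once the whole trace has arrived
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]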

Drawbacks of this setup:

  • Introduces additional infrastructure within your environment that requires maintenance and potentially increases the operational burden 
A variation of the previous diagram, but with a cluster of collectors sending the data from multiple Lambda functions

Building a ‘Driver of the Day’ app with Lambda, OpenTelemetry, and Grafana Cloud

If you’re a Formula One fan like me, you may have noticed that during each race, there’s a vote for the “Driver of the Day.” To illustrate how you can put the deployment considerations we just discussed to use, I decided to build a simple application to simulate this voting process and leverage AWS Lambda for the backend. Below is an overview of the technical architecture I’ll be implementing:

Lambda functions:

I’ll create three Lambda functions: Posting-Votes, Processing-Votes, and Getting-Votes.

Observability setup with Grafana Cloud:

To enable observability, each Lambda function will be instrumented with the ADOT layer, which automatically captures telemetry data. The environment variables for each Lambda function will be configured to send the data to Grafana Alloy, which will then forward it to the Grafana Cloud OTLP endpoint. As you may have noticed, I followed the gateway deployment pattern.

The diagram below illustrates the overall architecture.

The same diagram as before, except the data is sent to a Grafana Alloy gateway before being sent to Grafana Cloud

Additionally, I’ve deployed a React web application that allows users to vote and view the results, as shown below.

User request workflow

Step 1: Set up your AWS Lambda functions

Create three Lambda functions:

  • Posting-Votes: This function will capture votes submitted from the React web application and pass them to the Processing-Votes function.
  • Processing-Votes: This function will process the incoming votes and store them in DynamoDB.
  • Getting-Votes: This function will retrieve the processed votes and display the results.
Creating a function from scratch in the Lambda UI

You can find the source code for these Lambda functions in this GitHub repo.
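To give a feel for the shape of these functions, here is a simplified sketch of what a Posting-Votes-style handler might look like; the real implementation is in the repo, and the payload fields here are illustrative:

# Simplified sketch of a Posting-Votes-style handler; see the repo for the actual code.
import json
import boto3

lambda_client = boto3.client("lambda")

def lambda_handler(event, context):
    # API Gateway proxy integration delivers the vote as a JSON string in "body"
    vote = json.loads(event.get("body") or "{}")

    # Hand the vote off to the processing function asynchronously
    lambda_client.invoke(
        FunctionName="Processing-Votes",
        InvocationType="Event",  # fire-and-forget
        Payload=json.dumps(vote),
    )

    return {"statusCode": 200, "body": json.dumps({"message": "vote received"})}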

Step 2: Add the OpenTelemetry ADOT layer for each function

For each Lambda function (Posting-Votes, Processing-Votes, and Getting-Votes), add the ADOT Lambda layer. Be sure to review the AWS ADOT documentation to select the appropriate version of the ADOT layer based on the programming language you’re using. (You can also use the OpenTelemetry Lambda layer.) Find the most recent instrumentation layer release for your language and use its ARN, updating the region in the ARN to match the region your Lambda runs in.

In my example, I’m using Python with ADOT.
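For reference, ADOT layer ARNs follow a pattern along these lines; take the exact account ID, architecture, and version from the ADOT documentation for your runtime:

arn:aws:lambda:<region>:<adot-account-id>:layer:aws-otel-python-<architecture>-ver-<version>:<layer-version>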

Choosing a layer with a specific ARN in Lambda

Step 3: Configure environment variables

Configure the environment variables for each Lambda function. The “gateway” deployment pattern introduces an additional layer of the OpenTelemetry Collector for more advanced use cases. You can choose to deploy either the OpenTelemetry Collector or Grafana Alloy. In my case, I’ve opted to deploy Grafana Alloy.

I set it up on an EC2 instance by following the installation instructions provided in the documentation.
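For context, the Alloy configuration for this role can be quite small: an OTLP receiver that listens for the Lambda collectors, plus an exporter that forwards everything to the Grafana Cloud OTLP endpoint. Here's a rough sketch; the endpoint, instance ID, and token are placeholders, and component details may vary between Alloy versions:

// Minimal sketch: receive OTLP from the Lambda collectors and forward to Grafana Cloud.
otelcol.receiver.otlp "lambda" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    logs    = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

otelcol.auth.basic "grafana_cloud" {
  username = "<instance-id>"
  password = "<grafana-cloud-token>"
}

otelcol.exporter.otlphttp "grafana_cloud" {
  client {
    endpoint = "https://<otlp-gateway-for-your-zone>.grafana.net/otlp"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}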

Next, we need to update the environment variable to include the Grafana Alloy endpoint:

ALLOY_OTLP_ENDPOINT

Here are the environment variables I’ve configured. 

A list of configured environmental variables
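In text form, that configuration looks roughly like this; the values are illustrative, and depending on your layer version the collector config variable may be named OPENTELEMETRY_COLLECTOR_CONFIG_FILE or OPENTELEMETRY_COLLECTOR_CONFIG_URI:

AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-instrument
OPENTELEMETRY_COLLECTOR_CONFIG_FILE=/var/task/collector.yaml
ALLOY_OTLP_ENDPOINT=<alloy-host-or-ip>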

After that, configure your Lambda functions to export telemetry data to Grafana Alloy, which will then forward it to Grafana Cloud. 

Optional: Agent deployment

If you want to go with the “agent” deployment, you simply need to add ADOT or a custom OpenTelemetry Lambda layer (refer to Step 2).

To send telemetry data to Grafana Cloud, the following two values are required:

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_HEADERS
Adding environmental variables

You can find these values in the OpenTelemetry tile from your Grafana account.
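These values typically look something like the following; the zone, instance ID, and token below are placeholders, so copy the exact strings from your own tile (some SDKs also require the space after “Basic” to be percent-encoded as %20):

OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-<zone>.grafana.net/otlp
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic <base64 of instance-id:token>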

Manage your stack tiles in Grafana Cloud

Step 4: Add an OpenTelemetry Collector configuration file to your Lambda functions

Create an OpenTelemetry Collector configuration file for your Lambda function. This file will define how telemetry data is collected and exported, specifying the Grafana Alloy endpoint.

Once the configuration file is ready, include it in your Lambda deployment package to ensure proper integration with Grafana Alloy. Make sure you are using the otlp exporter.

Below is an example of how to configure the OpenTelemetry Collector to send telemetry data to Grafana Alloy:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4317"
      http:
        endpoint: "localhost:4318"

exporters:
  debug:
    verbosity: detailed
  otlp:
    endpoint: ${ALLOY_OTLP_ENDPOINT}:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, debug]
    metrics:
      receivers: [otlp]
      exporters: [otlp, debug]
    logs:
      receivers: [otlp]
      exporters: [otlp, debug]

Optional: Agent deployment 

Below is an example of how to configure the OpenTelemetry Collector to send telemetry data to Grafana Cloud directly.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4317"
      http:
        endpoint: "localhost:4318"

exporters:
  debug:
    verbosity: detailed
  otlphttp/grafana:
    endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT}
    headers:
      Authorization: ${OTEL_EXPORTER_OTLP_HEADERS}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/grafana, debug]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/grafana, debug]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/grafana, debug]

Step 5: Configure DynamoDB

To store and manage votes, create a DynamoDB table named “f1-votes-table”. This table will be used by the Processing-Votes Lambda function to record voting data and by the Getting-Votes function to retrieve it.

  1. Navigate to the AWS Management Console and open Amazon DynamoDB.
  2. Click Create table and enter “f1-votes-table” as the table name.
  3. Set the partition key as UserDisplayName (String) to identify users submitting votes.
  4. Set the sort key as VoteId (String) to uniquely track each vote.
  5. Click Create table to complete the setup.

This table will efficiently store user votes while ensuring each vote is uniquely identifiable.
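If you prefer to script the table instead of clicking through the console, the same definition might look like this with boto3 (on-demand billing here is my own assumption):

import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="f1-votes-table",
    AttributeDefinitions=[
        {"AttributeName": "UserDisplayName", "AttributeType": "S"},
        {"AttributeName": "VoteId", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "UserDisplayName", "KeyType": "HASH"},  # partition key
        {"AttributeName": "VoteId", "KeyType": "RANGE"},          # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # assumption: on-demand capacity
)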

DynamoDB table details menu in AWS

Step 6: Configure API Gateway to trigger Lambda functions

To enable the frontend to interact with the backend, create an API Gateway that triggers the Posting-Votes and Getting-Votes Lambda functions.

  • The POST /vote endpoint will send votes to the Posting-Votes Lambda function.
  • The GET /votes endpoint will retrieve vote results from the Getting-Votes Lambda function.
Vote post endpoint in API Gateway
Vote get endpoint in API Gateway

This setup ensures a seamless connection between the frontend and backend services.
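Once the routes are deployed, a quick way to sanity-check them is with a couple of requests like these (the API ID, stage, and payload fields are illustrative):

curl -X POST "https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/vote" \
  -H "Content-Type: application/json" \
  -d '{"UserDisplayName": "tifosi-01", "VoteId": "race01-0001", "driver": "Leclerc"}'

curl "https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/votes"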

Step 7: Test the application

For testing, we have two options:

Option 1: Manual test

Create test events in the AWS Lambda UI for Posting-Votes and Getting-Votes.

  • Verify that votes are being processed and stored in DynamoDB.

Posting-Votes

From Lambda Test UI, create the following test:

Edit saved event in test event menu

Results from Posting-Votes

Execute the previous test. You should see the following result.

Posting-vote results

Getting-Votes 

From Lambda Test UI, create the following test:

Editing a saved event

Results for Getting-Votes 

Execute the previous test. You should see the following result.

Results from Getting-Votes

Option 2: Deploy the web React frontend app

Unlike Option 1, which tests individual Lambda functions in isolation, Option 2 tests the entire flow using the deployed web-based React frontend. This approach provides a more realistic end-to-end scenario, covering the interaction between the frontend, backend, and database.

Step 1: Deploy the frontend

Deploy the React app by following the deployment instructions provided here. Ensure it’s correctly connected to your backend Lambda endpoints and DynamoDB.

Step 2: Submit votes via the Web UI

Open the deployed web app in your browser.

  • Select a candidate or option.
  • Click the Vote button to submit your vote.

Step 3: Observe the behavior

After submitting a vote, the UI should display a confirmation message.

F1 Driver of the Day dashboard

Step 8: Verify that telemetry data is sent to Grafana Cloud

We can confirm that telemetry data from our Lambda setup is successfully reaching Grafana Cloud, whether it’s through Explore mode or via Grafana Cloud Application Observability.

1. Explore the traces in Grafana Cloud Explore mode 

  • Navigate to the Explore menu in Grafana Cloud.
  • Check for incoming traces from the Lambda functions.
  • Verify that spans and metrics are properly recorded.
Grafana Cloud Traces UI
Service graph

2. Use Grafana Cloud Application Observability

Now that everything is working as expected, let’s open Application Observability in Grafana Cloud:

  • Navigate to a stack: https://<your-stack-name>.grafana.net.
  • Expand the top left menu below the Grafana logo.
  • Click on Application.

Application Observability relies on metrics generated from traces being ingested into Grafana Cloud. Those metrics are used to display the Rate, Error, Duration (RED) method information in Application Observability.

Application Observability get started page

To complete the setup, click Enable Application Observability.

Note: It might take up to five minutes to generate metrics and show graphs.

The “Service Inventory” page lists all the services that are sending distributed traces to Grafana Cloud. You can see that we are able to capture Posting-Votes, Processing-Votes, and Getting-Votes, as well as the calls to the DynamoDB table.

Listed services

We can also check the “Service Map” page to get a graph of related services and health metrics at a high level. 

Service map

Let’s navigate to the Service Overview page for one of our services.

This page provides a health overview with RED metrics for: 

  • The service itself
  • Related upstream and downstream services
  • Operations performed on the service

Here’s the Service Overview page for Posting-Votes:

posting-votes overview

The Traces view lists all the traces related to our services. By default, Application Observability filters traces by service and service namespace. There are two ways to customize search queries:

  • Use TraceQL, the trace query language (see the example after this list).
  • Click Search to use the visual TraceQL query builder.
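For example, a TraceQL query that narrows the view to failed requests from one of the services could look like this (the service name depends on how your resource attributes are set):

{ resource.service.name = "posting-votes" && status = error }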

Here’s the traces view for Posting-Votes:

posting-votes traces

The logs view provides the logs for the service. 

Posting-votes logs

Let us know what you think!

In summary, OpenTelemetry makes it straightforward to instrument AWS Lambda functions for observability. By using the ADOT Lambda layer or custom instrumentation, and configuring the OpenTelemetry Collector to gather telemetry data, you can seamlessly send it to Grafana Cloud.

As serverless architectures continue to evolve, the importance of effective observability will only increase. OpenTelemetry and Grafana Cloud offer a future-proof approach to monitoring AWS Lambda functions, providing the flexibility and scalability needed to adapt to changing requirements. 

We’re committed to continuous improvement, and your feedback is invaluable in this journey. If you’ve used OpenTelemetry and Grafana Cloud for serverless observability, please share your thoughts and experiences with us. You can find our team in the Grafana Labs Community Slack in the #opentelemetry channel.

Grafana Cloud is the easiest way to get started with metrics, logs, traces, and dashboards. We recently added new features to our generous forever-free tier, including access to all Enterprise plugins for three users. Plus there are plans for every use case. Sign up for a free account today!