Feb 6, 2025 - 15:41
Self-Hosting Langfuse v3 on AWS Using CDK

I created a CDK project to self-host the observability tool (OSS) Langfuse v3 on AWS.

https://github.com/mazyu36/langfuse-with-aws-cdk

In this article, I'll share how to use it, the architecture, and some troubleshooting know-how.

The following official documents are helpful for self-hosting:

  • Self-host Langfuse
  • Infrastructure

Architecture

I've built Langfuse v3 on AWS. Essentially, it's a simple configuration that replaces Langfuse's architecture with managed services.

[Architecture diagram]

Langfuse Application (Web, Worker)

I've deployed the Langfuse container image to ECS on Fargate.

Service-to-service communication with ClickHouse, mentioned later, uses ECS Service Connect. Since Web and Worker act as clients (the requesting side) in service-to-service communication, Service Connect only needs to be defined with client settings.

If you want to reduce costs and just need basic communication, Service Discovery should also work fine.
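As a rough sketch (the construct names and namespace are illustrative, not the repository's actual code), the server side publishes a named port while the clients only join the namespace:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

declare const clickhouseService: ecs.FargateService; // server side
declare const webService: ecs.FargateService;        // client side (Worker is analogous)

// ClickHouse publishes its HTTP port under a discoverable name.
// 'clickhouse' must match a named port mapping on the container.
clickhouseService.enableServiceConnect({
  namespace: 'langfuse.local', // assumed Cloud Map namespace
  services: [{ portMappingName: 'clickhouse', dnsName: 'clickhouse', port: 8123 }],
});

// Web and Worker only act as clients, so joining the namespace is enough.
webService.enableServiceConnect({
  namespace: 'langfuse.local',
});
```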

According to the documentation, for production environments, the following is recommended:

  • All containers should have at least 2 CPUs and 4GB RAM
  • Langfuse Web should have 2 instances for redundancy

For production environments, we recommend to use at least 2 CPUs and 4 GB of RAM for all containers. You should have at least two instances of the Langfuse Web container for high availability. For auto-scaling, we recommend to add instances once the CPU utilization exceeds 50% on either container.
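The 50% CPU recommendation can be sketched in CDK with target-tracking auto scaling (the service name is an assumption):

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

declare const webService: ecs.FargateService; // assumed Langfuse Web service

const scaling = webService.autoScaleTaskCount({
  minCapacity: 2, // at least two Web tasks for high availability
  maxCapacity: 4,
});
scaling.scaleOnCpuUtilization('CpuScaling', {
  targetUtilizationPercent: 50, // scale out once CPU utilization exceeds 50%
});
```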

In the CDK implementation, you can configure whether to use Fargate Spot in lib/stack-config.ts. In development environments, you can use Fargate Spot to reduce costs.

ClickHouse - OLAP

ClickHouse is also running directly as a container image on ECS on Fargate. EFS is mounted for data persistence.
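The EFS mount can be sketched as follows (construct names are assumptions; `/var/lib/clickhouse` is ClickHouse's default data directory):

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as efs from 'aws-cdk-lib/aws-efs';

declare const taskDefinition: ecs.FargateTaskDefinition;
declare const container: ecs.ContainerDefinition; // the ClickHouse container
declare const fileSystem: efs.FileSystem;

// Register the EFS file system as a task volume with encryption in transit.
taskDefinition.addVolume({
  name: 'clickhouse-data',
  efsVolumeConfiguration: {
    fileSystemId: fileSystem.fileSystemId,
    transitEncryption: 'ENABLED',
  },
});

// Mount the volume at ClickHouse's default data directory.
container.addMountPoints({
  containerPath: '/var/lib/clickhouse',
  sourceVolume: 'clickhouse-data',
  readOnly: false,
});
```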

PostgreSQL - OLTP

Using Aurora Serverless v2.

In the CDK implementation, lib/stack-config.ts allows enabling Zero Capacity for cost reduction.

https://aws.amazon.com/blogs/database/introducing-scaling-to-0-capacity-with-amazon-aurora-serverless-v2

When enabled, the database pauses after a certain period with no connections. A connection attempted while paused takes about 15 seconds to resume, so clients need to retry after a short wait. This is intended for development environments.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2-auto-pause.html#auto-pause-whynot
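A minimal sketch of enabling zero capacity in CDK (the engine version and construct names are assumptions; setting serverlessV2MinCapacity to 0 is what allows auto-pause):

```typescript
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

declare const scope: Construct;
declare const vpc: ec2.IVpc;

const cluster = new rds.DatabaseCluster(scope, 'Database', {
  engine: rds.DatabaseClusterEngine.auroraPostgres({
    version: rds.AuroraPostgresEngineVersion.VER_16_4, // assumed version
  }),
  vpc,
  writer: rds.ClusterInstance.serverlessV2('Writer'),
  serverlessV2MinCapacity: 0, // 0 ACU lets the cluster pause when idle
  serverlessV2MaxCapacity: 2,
});
```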

S3 - Blob Storage

Storage for events and traces. S3 is used directly.

Cache/Queue

Redis/Valkey is needed for caching and asynchronous communication between Web and Worker. In the CDK project, ElastiCache is used, with Valkey as the engine for cost benefits.

There are several points to note about ElastiCache, which I'll explain in detail.

Use REDIS_CONNECTION_STRING when using in-transit encryption

ElastiCache allows encryption settings for communication (TLS). In CDK, setting transitEncryptionMode to required allows only TLS communication.

    const cache = new elasticache.CfnReplicationGroup(this, 'Resource', {
      // omit
      transitEncryptionEnabled: true,
      transitEncryptionMode: 'required', // Allow only TLS communication. 'preferred' allows both TLS and non-TLS.
      // omit
      authToken: secret.secretValue.unsafeUnwrap(),
    });

When setting connection information for Langfuse Web/Worker via environment variables, there are two ways:

  1. Use REDIS_CONNECTION_STRING
  2. Set REDIS_HOST, REDIS_PORT, REDIS_AUTH

https://langfuse.com/self-hosting/infrastructure/cache#configuration

When only TLS communication is allowed, method 1 (REDIS_CONNECTION_STRING) currently must be used. Method 2 fails to connect without emitting any error (I got stuck on this).

For TLS-only communication, set REDIS_CONNECTION_STRING with rediss://....
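The expected shape of the connection string can be illustrated with a small helper (hypothetical, not Langfuse's code; the double "s" in rediss:// is what makes ioredis use TLS):

```typescript
// Hypothetical helper: build a TLS connection string for ioredis.
// Format: rediss://:<auth-token>@<host>:<port>
function buildRedisConnectionString(host: string, port: number, authToken: string): string {
  return `rediss://:${encodeURIComponent(authToken)}@${host}:${port}`;
}

// In CDK, host and port can be taken from the replication group, e.g.
// cache.attrPrimaryEndPointAddress and cache.attrPrimaryEndPointPort.
console.log(buildRedisConnectionString("example.cache.amazonaws.com", 6379, "token"));
// → rediss://:token@example.cache.amazonaws.com:6379
```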

This is due to the implementation in Langfuse's packages/shared/src/server/redis/redis.ts.

Langfuse uses ioredis, and the instance creation is defined as follows:

  const instance = env.REDIS_CONNECTION_STRING
    ? new Redis(env.REDIS_CONNECTION_STRING, {
        ...defaultRedisOptions,
        ...additionalOptions,
      })
    : env.REDIS_HOST
      ? new Redis({
          host: String(env.REDIS_HOST),
          port: Number(env.REDIS_PORT),
          password: String(env.REDIS_AUTH),
          // No TLS configuration
          ...defaultRedisOptions,
          ...additionalOptions,
        })
      : null;

When using REDIS_HOST, the connection is configured through individual settings, but the tls property is required for TLS communication (see the ioredis documentation):

const redis = new Redis({
  host: "redis.my-service.com",
  tls: {}, // TLS configuration is necessary
});

However, Langfuse currently does not expose this option. Therefore, TLS communication is not possible when connecting via REDIS_HOST.

On the other hand, ioredis enables TLS when the connection string starts with rediss://.

const redis = new Redis("rediss://redis.my-service.com");  // Specify TLS with rediss

In Langfuse, using REDIS_CONNECTION_STRING enables TLS communication.

Set noeviction

The parameter maxmemory-policy must be set to noeviction (i.e., eviction disabled).
This ensures that queue jobs are not removed from the cache.

You must set the parameter maxmemory-policy to noeviction to ensure that the queue jobs are not evicted from the cache.

https://langfuse.com/self-hosting/infrastructure/cache#deployment-options

To keep jobs from accumulating indefinitely, retries and bounded retention of failed jobs are configured (example from IngestionQueue):

    IngestionQueue.instance = newRedis
      ? new Queue<TQueueJobTypes[QueueName.IngestionQueue]>(
          QueueName.IngestionQueue,
          {
            connection: newRedis,
            defaultJobOptions: {
              removeOnComplete: true,  // Remove job after success
              removeOnFail: 100_000, // Maximum number of failed job retentions
              attempts: 5,  // Number of retries
              backoff: {  // Retry with exponential backoff
                type: "exponential",
                delay: 5000,
              },
            },
          },
        )
      : null;

noeviction can be set in the parameter group in CDK:

    /**
     * We must set the parameter `maxmemory-policy` to `noeviction` to ensure that the queue jobs are not evicted from the cache.
     * @see https://langfuse.com/self-hosting/infrastructure/cache#deployment-options
     */
    const parameterGroup = new elasticache.CfnParameterGroup(this, 'RedisParameterGroup', {
      cacheParameterGroupFamily: 'valkey8',
      description: 'Custom parameter group for Langfuse ElastiCache',
      properties: {
        'maxmemory-policy': 'noeviction',  // here
      },
    });

Cluster mode and ElastiCache Serverless are not supported

At the time of writing, Langfuse Web/Worker does not support Redis/Valkey cluster mode. Therefore, ElastiCache cluster mode cannot be used.

Langfuse handles failovers between read-replicas, but does not support Redis cluster mode for now, i.e. there is no sharding support.

https://langfuse.com/self-hosting/infrastructure/cache#managed-redisvalkey-by-cloud-providers

This means ElastiCache Serverless also cannot be used currently. AWS documentation also states that ElastiCache Serverless runs in cluster mode.

ElastiCache Serverless runs Valkey, Memcached, or Redis OSS in cluster mode and is only compatible with clients that support TLS.

https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/WhatIs.corecomponents.html

If you try to use ElastiCache Serverless, you'll frequently encounter CROSSSLOT errors:

2025-02-04T09:09:03.525Z warn Redis connection error: CROSSSLOT Keys in request don't hash to the same slot

The cause is operations on multiple keys that hash to different slots.

In cluster mode, each key is assigned to a hash slot based on its hash value, and a multi-key operation is only allowed when all keys fall in the same slot. A detailed explanation can be found in this article:

https://dev.to/inspector/resolved-crossslot-keys-error-in-redis-cluster-mode-enabled-3kec

Since the current implementation does not consider cluster mode, these errors occur. This cannot be resolved through infrastructure settings.
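For illustration, the slot assignment can be sketched as follows (a simplified version of the CRC16-based hashing that Redis/Valkey cluster uses; keys sharing a {...} hash tag always land in the same slot):

```typescript
// CRC16 (XModem variant), the checksum Redis/Valkey cluster uses for key hashing.
function crc16(data: string): number {
  let crc = 0;
  for (let i = 0; i < data.length; i++) {
    crc ^= data.charCodeAt(i) << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

// Hash slot: CRC16 of the key (or of the hash tag inside the first {...}) mod 16384.
function hashSlot(key: string): number {
  const tag = key.match(/\{(.+?)\}/); // simplified hash-tag extraction
  return crc16(tag ? tag[1] : key) % 16384;
}

console.log(hashSlot("{user1}:jobs") === hashSlot("{user1}:results")); // true: same tag, same slot
```

A client that is unaware of this (as in the current Langfuse implementation) will issue multi-key commands whose keys end up in different slots, which is exactly what the CROSSSLOT error reports.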

Usage

Please refer to the README for details. Below is a simple deployment procedure and the flow for verifying that it works.

CDK Configuration and Deployment

The CDK implementation allows parameters to be defined for each environment (dev/stg/prod), and the deployment target environment is specified as a context. The general deployment method is as follows:

  • Clone the repository and install the necessary libraries with npm ci.
  • Set the parameters for the corresponding environment in bin/app-config and lib/stack-config.
  • Deploy with npx cdk deploy --context env=ENV_NAME.

The deployment takes about 20 minutes. The URL is output in the Outputs.

 ✅  LangfuseWithAwsCdkStack-dev

✨  Deployment time: 1238.58s

Outputs:

# omit

LangfuseWithAwsCdkStack-dev.LangfuseURL = https://langfuse.example.com

# omit

✨  Total time: 1247.3s

Initial Setup

After opening the URL, first sign up by entering your email address and other details, then click Sign up.

[Screenshot: sign-up screen]

If Aurora Serverless v2 Zero Capacity is enabled and the database is paused, you may encounter a DB error like the following (the error message is hard to read due to the color...). At this point, the DB starts resuming from the pause, so wait a while (about 15-30 seconds) and then click Sign up again.

[Screenshot: DB error message]

First, create an Organization by clicking New Organization.

[Screenshot: New Organization]

Set the Organization name and click Create.

[Screenshot: organization name]

If you want to add members, configure them here. I won't set this up here, so click Next.

[Screenshot: member settings]

Next, set the Project name and click Create.

[Screenshot: project name]

Finally, select API Keys and click Create new API Keys.

[Screenshot: API Keys]

Once the Secret key and Public key are issued, make a note of them. This completes the initial setup.

[Screenshot: issued keys]

Operation Verification

Verify operation using the issued API keys. Here, curl is used.

First, set the API keys and hostname as environment variables in an appropriate environment.

export LANGFUSE_SECRET_KEY="YOUR_SECRET_KEY"
export LANGFUSE_PUBLIC_KEY="YOUR_PUBLIC_KEY"
export LANGFUSE_HOST="YOUR_LANGFUSE_URL"

Next, call the ingestion (trace ingestion) API with curl. The payload is a dummy, so it contains almost no real data.

curl -X POST "$LANGFUSE_HOST/api/public/ingestion" \
  -u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "batch": [
      {
        "type": "trace-create",
        "id": "'$(uuidgen)'",
        "timestamp": "'$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")'",
        "metadata": null,
        "body": {
          "id": "'$(uuidgen)'",
          "name": "test",
          "timestamp": "'$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")'",
          "public": false
        }
      }
    ],
    "metadata": null
  }'

If the response reports status 201, it is working correctly.

{
  "successes": [
    {
      "id": "523EFC5E-8BAC-485D-9CC3-C049B5F64FA4",
      "status": 201
    }
  ],
  "errors": []
}

You can check the traces in the browser under Tracing -> Traces.
If the ingested trace appears there, the operation verification is complete.