Storage

Network File System

First I updated the system.

sudo apt update && sudo apt upgrade -y

Then installed NFS server.

sudo apt install nfs-kernel-server -y

Then enabled the services to start at boot, and started them.

sudo systemctl enable nfs-kernel-server
sudo systemctl start nfs-kernel-server
sudo systemctl enable rpcbind
sudo systemctl start rpcbind

Next, I created the base directory for Loki storage.

sudo mkdir -p /srv/nfs/loki/{chunks,index,wal,boltdb-cache,compactor}

Command Explanation

  • {chunks,index,wal,boltdb-cache,compactor}: Shell brace expansion

    This expands to create multiple directories
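
In other words, the single command above is equivalent to:

sudo mkdir -p /srv/nfs/loki/chunks /srv/nfs/loki/index /srv/nfs/loki/wal /srv/nfs/loki/boltdb-cache /srv/nfs/loki/compactor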

I then set ownership.

sudo chown -R nobody:nogroup /srv/nfs/loki

Then set permissions.

sudo chmod -R 755 /srv/nfs/loki

Then created the directory structure for each Availability Zone.

sudo mkdir -p /srv/nfs/loki/{az1,az2}/data

I verified the structure was correct.

tree /srv/nfs/loki/

I then created the exports configuration file.

sudo nano /etc/exports

I specified which Loki VMs to export the share to.

/srv/nfs/loki 10.33.99.74(rw,sync,no_subtree_check,no_root_squash,no_all_squash)
/srv/nfs/loki 10.33.99.77(rw,sync,no_subtree_check,no_root_squash,no_all_squash)

Then exported the filesystem.

sudo exportfs -arv

I also verified exports and checked the NFS status.

sudo exportfs -v
sudo systemctl status nfs-kernel-server

On both Loki servers I installed the NFS client.

sudo apt install nfs-common -y

Then created the shared mount directory.

sudo mkdir -p /shared/loki
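
The share itself still needs to be mounted on each Loki VM; a minimal example, using the NFS server's address (10.33.99.70, VM-NFS-1 in the inventory later in this guide), would be:

sudo mount -t nfs 10.33.99.70:/srv/nfs/loki /shared/loki
echo '10.33.99.70:/srv/nfs/loki /shared/loki nfs defaults,_netdev 0 0' | sudo tee -a /etc/fstab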

Load Balancer

HAProxy

On both load balancer VMs, I updated the system.

sudo apt update

Then installed HAProxy and Keepalived.

sudo apt install -y haproxy keepalived

Then enabled IP forwarding.

echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf

Command Explanation

  • echo 'net.ipv4.ip_forward=1' – Outputs the configuration setting
  • | – Pipes the output to the next command
  • sudo tee -a /etc/sysctl.conf – Appends the setting to the system configuration file
  • The -a flag means “append” (vs overwriting the file)
Result: Adds the line net.ipv4.ip_forward=1 to /etc/sysctl.conf, making the change persistent across reboots.

Then reloaded and applied the settings.

sudo sysctl -p
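
To confirm the kernel picked up the change:

sysctl net.ipv4.ip_forward

This should print net.ipv4.ip_forward = 1.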

Note

Without IP forwarding (default):
Client → Load Balancer → ❌ Packet dropped
When a packet arrives destined for another IP address, the Linux kernel discards it by default for security reasons.

With IP forwarding enabled:
Client → Load Balancer → Backend Server
← Load Balancer ← Backend Server

The kernel can forward packets between network interfaces, acting as a router.

Next, I created the configuration file for HAProxy.

sudo nano /etc/haproxy/haproxy.cfg

Global configuration section:

global
    daemon
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    master-worker

    # SSL Configuration
    ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384
    ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets

Command Explanation

  • daemon – Runs HAProxy as a background daemon process.
  • chroot /var/lib/haproxy – Changes the root directory.
  • user haproxy group haproxy – Drops root privileges and runs as the ‘haproxy’ user/group.
  • master-worker – Enables the modern master-worker process model (HAProxy 1.8+).

Note

Traditional model:
Single HAProxy Process
├── Handles all connections
└── Config reload = full restart


Master-worker model:
Master Process (manages workers)
├── Worker Process 1 (handles traffic)
├── Worker Process 2 (handles traffic)
└── Seamless reloads without connection drops


Benefits:
  • Zero-downtime reloads: New workers start while old ones finish existing connections
  • Better stability: Master manages workers, restarts failed ones
  • Improved monitoring: Separate process for management tasks

Defaults section:

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    option httplog
    option dontlognull
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

Command Explanation

  • mode http – Sets HAProxy to operate in HTTP mode.
  • timeout connect 5000ms timeout client 50000ms timeout server 50000ms – These timeouts prevent connections from hanging indefinitely and protect against various attack scenarios.
  • option httplog – Enables detailed HTTP logging format.
  • option dontlognull – Prevents logging of connections that don’t transfer data.
  • errorfile – Replaces default HAProxy error pages with custom ones.

Note

Normal Request Flow:
Client → [connect timeout] → HAProxy → [connect timeout] → Backend
← [client timeout] ← ← [server timeout] ←


Scenario 1: Slow backend connection
1. Client connects to HAProxy instantly
2. HAProxy tries to connect to backend
3. Backend takes 6 seconds to accept connection
4. Connect timeout (5s) triggers → 502 error


Scenario 2: Slow backend response
1. Connection established quickly
2. Client sends request
3. Backend processes for 60 seconds
4. Server timeout (50s) triggers → 504 error


Scenario 3: Slow client
1. Client connects and starts sending large POST
2. Client sends data very slowly
3. No data sent for 51 seconds
4. Client timeout (50s) triggers → 408 error

Stats page section:

# Stats page
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE

Command Explanation

  • bind – Binds the stats interface to port 8404 on all network interfaces.
  • stats enable – Activates the statistics interface.
  • stats uri /stats – Sets the URL path for accessing statistics.
  • stats refresh 30s – Auto-refreshes the stats page every 30 seconds.
  • stats admin if TRUE – Enables administrative functions on the stats page unconditionally (the built-in TRUE ACL always matches).
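
Once HAProxy is running with this configuration, the page is reachable on port 8404 of either load balancer, for example from VM-LB1 (address taken from the inventory later in this guide):

curl http://10.33.99.71:8404/stats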

Grafana Frontend section:

# Grafana Frontend
frontend grafana_frontend
    bind *:443 ssl crt /etc/ssl/certs/grafana.pem
    bind *:80
    redirect scheme https if !{ ssl_fc }
    default_backend grafana_backend

Command Explanation

  • bind *:443 – Listen on port 443 on all interfaces.
  • ssl – Enable SSL/TLS termination at the load balancer.
  • crt – SSL certificate file location.
  • bind *:80 – Listen on port 80 on all interfaces.
  • redirect scheme – Forces all HTTP traffic to redirect to HTTPS.

Grafana Backend section:

backend grafana_backend
    balance roundrobin
    option httpchk GET /api/health
    server grafana-az1 10.33.99.73:3000 check inter 5s fall 3 rise 2
    server grafana-az2 10.33.99.76:3000 check inter 5s fall 3 rise 2

Command Explanation

  • backend grafana_backend – Routes all requests to the ‘grafana_backend’ server pool.
  • balance roundrobin – Load balancing algorithm.
  • option httpchk – Configures HTTP health checks for backend servers.
  • server – Defines the backend servers; check inter 5s fall 3 rise 2 health-checks each server every 5 seconds, marking it down after 3 consecutive failures and up again after 2 successes.

Prometheus Frontend section:

# Prometheus Frontend
frontend prometheus_frontend
    bind *:9090
    default_backend prometheus_backend

Prometheus Backend section:

backend prometheus_backend
    balance roundrobin
    option httpchk GET /-/healthy
    server prometheus-az1 10.33.99.73:9090 check inter 5s fall 3 rise 2
    server prometheus-az2 10.33.99.76:9090 check inter 5s fall 3 rise 2

Loki Frontend section:

# Loki Frontend
frontend loki_frontend
    bind *:3100
    default_backend loki_backend

Loki Backend section:

backend loki_backend
    balance roundrobin
    option httpchk GET /ready
    server loki-az1 VM-AZ1-2:3100 check inter 5s fall 3 rise 2
    server loki-az2 VM-AZ2-2:3100 check inter 5s fall 3 rise 2

Tempo Frontend section:

# Tempo Frontend
frontend tempo_frontend
    bind *:3200
    default_backend tempo_backend

Tempo Backend section:

backend tempo_backend
    balance roundrobin
    option httpchk GET /ready
    server tempo-az1 VM-AZ1-3:3200 check inter 5s fall 3 rise 2
    server tempo-az2 VM-AZ2-3:3200 check inter 5s fall 3 rise 2

Mimir Frontend section:

# Mimir Frontend
frontend mimir_frontend
    bind *:9009
    default_backend mimir_backend

Mimir Backend section:

backend mimir_backend
    balance roundrobin
    option httpchk GET /ready
    server mimir-az1 VM-AZ1-3:9009 check inter 5s fall 3 rise 2
    server mimir-az2 VM-AZ2-3:9009 check inter 5s fall 3 rise 2
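
After saving the configuration, it can be validated and applied:

sudo haproxy -c -f /etc/haproxy/haproxy.cfg
sudo systemctl restart haproxy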

Keepalived

I then edited the keepalived configuration file.

sudo nano /etc/keepalived/keepalived.conf

On VM-LB1 (Primary)

vrrp_script chk_haproxy {
    script "/bin/kill -0 `cat /var/run/haproxy.pid`"
    interval 2
    weight 2
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0  # Adjust to your interface
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secure_password
    }
    virtual_ipaddress {
        192.168.1.100  # Your VIP
    }
    track_script {
        chk_haproxy
    }
    notify_master "/etc/keepalived/master.sh"
    notify_backup "/etc/keepalived/backup.sh"
    notify_fault "/etc/keepalived/fault.sh"
}

On VM-LB2 (Backup)

vrrp_script chk_haproxy {
    script "/bin/kill -0 `cat /var/run/haproxy.pid`"
    interval 2
    weight 2
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0  # Adjust to your interface
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secure_password
    }
    virtual_ipaddress {
        192.168.1.100  # Your VIP
    }
    track_script {
        chk_haproxy
    }
    notify_master "/etc/keepalived/master.sh"
    notify_backup "/etc/keepalived/backup.sh"
    notify_fault "/etc/keepalived/fault.sh"
}

I also created a script that logs a message and starts HAProxy when the server transitions to MASTER:

sudo nano /etc/keepalived/master.sh

#!/bin/bash
echo "$(date): Becoming MASTER" >> /var/log/keepalived.log
systemctl start haproxy

And another script that logs a message and stops HAProxy when the server transitions to BACKUP:

sudo nano /etc/keepalived/backup.sh

#!/bin/bash
echo "$(date): FAULT detected" >> /var/log/keepalived.log
systemctl stop haproxy

To make scripts executable:

chmod +x /etc/keepalived/*.sh

Example log entries:

2025-08-10 10:15:30: Becoming MASTER
2025-08-10 10:17:45: Becoming BACKUP
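
To apply the configuration, Keepalived is restarted on both nodes, and the VIP should then appear on the primary (interface name and VIP as configured above):

sudo systemctl restart keepalived
ip addr show eth0 | grep 192.168.1.100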

Database

PostgreSQL

Master Database

I installed PostgreSQL.

sudo apt install -y postgresql postgresql-contrib

Then configured PostgreSQL for replication.

sudo -u postgres psql -c "CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'MyPassword123!';"

Command Explanation

  • sudo -u postgres – Switches to the ‘postgres’ system user.
  • CREATE USER replicator – Creates a new database user named ‘replicator’.
  • WITH REPLICATION – Grants replication privileges.
  • ENCRYPTED PASSWORD – Sets an encrypted password.

I then needed to edit the postgresql configuration file.

sudo nano /etc/postgresql/14/main/postgresql.conf

I added the following to the end:

# Replication settings
wal_level = replica
max_wal_senders = 3
wal_keep_size = 64MB
synchronous_commit = on
synchronous_standby_names = 'standby1'

Command Explanation

  • wal_level – Sets the Write-Ahead Logging level to ‘replica’.
  • max_wal_senders – Sets the maximum number of background processes that stream WAL data to standby servers.
  • wal_keep_size – Sets the minimum amount of WAL data kept for standby servers.
  • synchronous_standby_names – Names the standby that must confirm each commit for synchronous replication.

I then edited the pg_hba configuration file.

sudo nano /etc/postgresql/14/main/pg_hba.conf

and added this to the bottom.

# Replication connections
host replication replicator 10.33.99.80/32 md5    # VM-DB2
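
For the replication settings and the new pg_hba entry to take effect, PostgreSQL needs a restart on the master:

sudo systemctl restart postgresql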

Replica Database

First, I stopped Postgres.

sudo systemctl stop postgresql

I then deleted all database data. That way the replica database gets an exact copy of the master’s data directory.

sudo rm -rf /var/lib/postgresql/14/main/*

I then took a base backup from the master, so the replica starts with an exact copy of its data directory.

sudo -u postgres pg_basebackup -h 10.33.99.79 -D /var/lib/postgresql/14/main -U replicator -W -v -P -R

Command Explanation

  • -h 10.33.99.79 – Connects to the master database server.
  • -D /var/lib... – Specifies where the backup will be written.
  • -U – Uses the replication user I created.
  • -W – Forces password prompt.
  • -v – Shows what pg_basebackup is doing.
  • -P – Shows transfer progress and ETA.
  • -R – Automatically sets up replication settings.
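
After the base backup finishes, starting PostgreSQL on the replica and querying the master's built-in pg_stat_replication view confirms that streaming replication is active:

sudo systemctl start postgresql
sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"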

Docker

Prometheus

On both Prometheus VMs I created the following YAML configuration file.

sudo nano prometheus-ha.yml

Global section:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production-ha'
    replica: 'prometheus-az1'  # Change to az2 for second instance

rule_files:
  - "/etc/prometheus/rules/*.yml"

Command Explanation

  • scrape_interval – Defines how frequently Prometheus collects metrics from all targets.
  • evaluation_interval – Defines how frequently Prometheus evaluates alerting and recording rules.
  • external_labels – Metadata attached to all metrics scraped by this Prometheus instance.
  • cluster – Identifies which cluster this Prometheus monitors.
  • replica – Identifies which Prometheus instance collected this metric.

Alerting section:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - '10.33.99.73:9093'
          - '10.33.99.76:9093'

Scrape Configs section:

scrape_configs:
- job_name: 'prometheus'
  static_configs:
    - targets: ['localhost:9090']

- job_name: 'federate'
  scrape_interval: 15s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"prometheus|node|alertmanager"}'
      - '{__name__=~"job:.*"}'
  static_configs:
    - targets:
      - '10.33.99.73:9090'  # Remove self from targets
      - '10.33.99.76:9090'

- job_name: 'node-exporter'
  static_configs:
    - targets:
      - '10.33.99.73:9100'
      - '10.33.99.74:9100'
      - '10.33.99.75:9100'
      - '10.33.99.76:9100'
      - '10.33.99.77:9100'
      - '10.33.99.78:9100'

Command Explanation

Federation allows one Prometheus server to scrape selected time series from another Prometheus server, creating hierarchical monitoring architectures.
  • metrics_path – /federate is Prometheus’s built-in federation endpoint.
  • 'match[]': {job=~"prometheus|node|alertmanager"} – Selects metrics from specific jobs.
  • {__name__=~"job:.*"} – Selects pre-computed recording rules.

Remote Write section:

remote_write:
  - url: http://10.33.99.75:9009/api/v1/push
    queue_config:
      max_samples_per_send: 10000
  - url: http://10.33.99.78:9009/api/v1/push
    queue_config:
      max_samples_per_send: 10000

Prometheus will send metrics to both Mimir instances simultaneously.

Then created the docker-compose YAML configuration file that will install and run Prometheus, Alert Manager, and Grafana.

sudo nano docker-compose.yml

Prometheus section:

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=7d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
      - '--web.external-url=http://prometheuscluster.com:9090'
    volumes:
      - ./prometheus-ha.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    networks:
      - monitoring

Command Explanation

  • --config.file – Specifies the main Prometheus configuration file.
  • --storage.tsdb.path – Sets the Time Series Database storage directory.
  • --storage.tsdb.retention.time – Keeps metrics for 7 days in local storage.
  • --web.console – Built-in web interface templates for basic dashboards.
  • --web.enable-lifecycle – Enables HTTP endpoints for configuration management.
  • --web.enable-admin-api – Enables powerful management endpoints.
  • --web.external-url – How Prometheus should reference itself in external communications.
  • ./prometheus-ha.yml: – Maps the host configuration file to the container configuration (read-only).
  • ./rules: – Maps host directory to container rules directory.

Alert Manager section:

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://prometheuscluster.com:9093'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=10.33.99.73:9094'  # Remove self from peers
      - '--cluster.peer=10.33.99.76:9094'
    volumes:
      - ./alertmanager:/etc/alertmanager:ro
      - alertmanager-data:/alertmanager
    networks:
      - monitoring

Grafana section:

  grafana:
    image: grafana/grafana-enterprise:10.1.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_DATABASE_TYPE=postgres
      - GF_DATABASE_HOST=10.33.99.79:5432
      - GF_DATABASE_NAME=grafana
      - GF_DATABASE_USER=grafana
      - GF_DATABASE_PASSWORD=${DB_PASSWORD}
      - GF_SESSION_PROVIDER=postgres
      - GF_SESSION_PROVIDER_CONFIG=host=10.33.99.79 port=5432 user=grafana password=${DB_PASSWORD} dbname=grafana sslmode=disable
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - monitoring

volumes:
  prometheus-data:
  alertmanager-data:

networks:
  monitoring:
    driver: bridge

I then created the database for Grafana (on VM-DB1).

CREATE USER grafana WITH PASSWORD 'GrafanaPassword123!';
CREATE DATABASE grafana OWNER grafana;
GRANT ALL PRIVILEGES ON DATABASE grafana TO grafana;
GRANT CREATE ON SCHEMA public TO grafana;
GRANT USAGE ON SCHEMA public TO grafana;
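
These statements can be run from a psql prompt on the database server, for example:

sudo -u postgres psql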

I then created the .env file to store the Grafana database password.

echo "DB_PASSWORD=GrafanaPassword123!" > .env

I then generated a strong password.

GRAFANA_PASSWORD=$(openssl rand -base64 32)

And saved it as my Grafana Admin password.

echo "GRAFANA_PASSWORD=$GRAFANA_PASSWORD" >> .env

Loki

Next, I created the YAML configuration file for Loki.

sudo nano loki-ha.yml

Server section:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600

distributor:
  ring:
    kvstore:
      store: memberlist

ingester:
  max_transfer_retries: 0
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 2
    tokens_file_path: /loki/wal/tokens

memberlist:
  join_members:
    - 10.33.99.74:7946
    - 10.33.99.77:7946
  bind_port: 7946

Storage Config section:

# Local filesystem storage
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: filesystem
    cache_location: /loki/boltdb-cache

  filesystem:
    directory: /loki/chunks

# Compactor for local cleanup
compactor:
  working_directory: /loki/compactor
  shared_store: filesystem

Command Explanation

  • active_index_directory – Where Loki writes currently active index files
  • shared_store – Uses local filesystem for ‘shipped’ indexes
  • cache_location – Where boltdb-shipper keeps its local cache of index files
  • directory: /loki/chunks – This is where compressed log content is stored
  • compactor – Merges small index files into larger ones, combines small chunk files for efficiency, removes old data according to retention policies, and eliminates duplicate log entries, reducing storage overhead and improving query performance.

Querier section:

# Query and retention settings
querier:
  max_concurrent: 2048

query_frontend:
  max_outstanding_per_tenant: 2048
  compress_responses: true

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 32

# Set retention period
table_manager:
  retention_deletes_enabled: true
  retention_period: 30d

Command Explanation

  • max_concurrent – How many queries a single querier instance can process simultaneously
  • max_outstanding_per_tenant – Controls query queuing per tenant at the query frontend level
  • compress_responses – HTTP responses from query frontend to clients (Grafana)
  • ingestion_rate_mb – Maximum continuous data ingestion rate per tenant
  • ingestion_burst_size_mb – Allows short-term spikes above sustained rate

Then I created the docker-compose YAML file for Loki.

sudo nano docker-compose.yml

version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.0
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
      - "7946:7946"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - ./loki-ha.yml:/etc/loki/local-config.yaml:ro
      - /shared/loki:/loki                    # NFS shared storage
    networks:
      - loki
    depends_on:
      - check-nfs

  # Health check service to ensure NFS is mounted
  check-nfs:
    image: alpine
    command: |
      sh -c "
      if [ ! -d /shared/loki ]; then
        echo 'ERROR: NFS mount not available at /shared/loki'
        exit 1
      fi
      echo 'NFS mount verified at /shared/loki'
      touch /shared/loki/.docker-health-check
      "
    volumes:
      - /shared/loki:/shared/loki
    networks:
      - loki

  promtail:
    image: grafana/promtail:2.9.0
    container_name: promtail
    restart: unless-stopped
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/config.yml:ro
    command: -config.file=/etc/promtail/config.yml
    networks:
      - loki
    depends_on:
      - loki

  node-exporter:
    image: prom/node-exporter:v1.6.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    networks:
      - loki

networks:
  loki:
    driver: bridge
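
The same pattern applies on the Loki VMs; once the containers are up, the readiness endpoint that HAProxy also probes can be checked locally:

docker compose up -d
curl http://localhost:3100/ready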

Tempo

For Tempo I just created a tempo-ha.yml configuration file.

server:
  http_listen_port: 3200
  grpc_listen_port: 9095

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268
        grpc:
          endpoint: 0.0.0.0:14250
    zipkin:
      endpoint: 0.0.0.0:9411
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318
        grpc:
          endpoint: 0.0.0.0:4317
  ring:
    kvstore:
      store: memberlist

ingester:
  ring:
    kvstore:
      store: memberlist
    replication_factor: 2
  max_block_duration: 5m

memberlist:
  join_members:
    - 10.33.99.75:7946
    - 10.33.99.78:7946
  bind_port: 7946

compactor:
  ring:
    kvstore:
      store: memberlist

querier:
  frontend_worker:
    frontend_address: tempo-query-frontend:9095

query_frontend:
  search:
    duration_slo: 5s
    throughput_bytes_slo: 1.073741824e+09
  trace_by_id:
    duration_slo: 5s

storage:
  trace:
    backend: s3
    s3:
      bucket: your-tempo-bucket
      region: us-east-1
    wal:
      path: /tmp/tempo/wal
    pool:
      max_workers: 100
      queue_depth: 10000

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: production-ha
  storage:
    path: /tmp/tempo/generator/wal
    remote_write:
      - url: http://VM-AZ1-1:9090/api/v1/write
        send_exemplars: true
      - url: http://VM-AZ2-1:9090/api/v1/write
        send_exemplars: true

Mimir

Same thing with Mimir: I created a mimir-ha.yml configuration file.

multitenancy_enabled: false

server:
  http_listen_port: 9009
  grpc_listen_port: 9095

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      store: memberlist
  ring:
    kvstore:
      store: memberlist

ingester:
  ring:
    kvstore:
      store: memberlist
    replication_factor: 2
    heartbeat_period: 5s
    heartbeat_timeout: 1m
    tokens_file_path: /data/tokens

memberlist:
  join_members:
    - 10.33.99.75:7946
    - 10.33.99.78:7946
  bind_port: 7946

blocks_storage:
  backend: s3
  s3:
    endpoint: s3.amazonaws.com
    bucket_name: your-mimir-bucket
    region: us-east-1
  tsdb:
    dir: /data/tsdb

compactor:
  data_dir: /data/compactor
  ring:
    kvstore:
      store: memberlist

store_gateway:
  sharding_ring:
    replication_factor: 2
    kvstore:
      store: memberlist

ruler:
  rule_path: /data/ruler
  ring:
    kvstore:
      store: memberlist

ruler_storage:
  backend: s3
  s3:
    bucket_name: your-mimir-rules-bucket
    region: us-east-1

alertmanager:
  data_dir: /data/alertmanager
  external_url: /alertmanager
  sharding_ring:
    replication_factor: 2
    kvstore:
      store: memberlist

alertmanager_storage:
  backend: s3
  s3:
    bucket_name: your-alertmanager-bucket
    region: us-east-1

limits:
  compactor_blocks_retention_period: 30d
  ingestion_rate: 20000
  ingestion_burst_size: 40000

Deployment

Ansible

To deploy the stack I used Ansible, which I installed on a separate VM.

sudo apt install ansible -y

Then created the deployment directory.

mkdir -p /opt/lgtm-deployment
cd /opt/lgtm-deployment

Then created the directory structure.

mkdir -p {inventory,scripts,group_vars,host_vars,roles}
mkdir -p roles/{haproxy,keepalived,postgresql,prometheus,grafana,loki,tempo,mimir}

Next, I created the main files.

touch deploy-ha.sh
touch inventory/ha-hosts
touch scripts/health-check.sh
touch {deploy-loadbalancers.yml,deploy-databases.yml,deploy-nfs.yml,deploy-az1.yml,deploy-az2.yml}

I then created the inventory file.

sudo nano inventory/ha-hosts

Here I added my server information.

[loadbalancers]
VM-LB1 ansible_host=10.33.99.71 ansible_user=ubuntu
VM-LB2 ansible_host=10.33.99.72 ansible_user=ubuntu

[databases]
VM-DB1 ansible_host=10.33.99.79 ansible_user=ubuntu
VM-DB2 ansible_host=10.33.99.80 ansible_user=ubuntu

[nfs_servers]
VM-NFS-1 ansible_host=10.33.99.70 ansible_user=ubuntu

[az1_prometheus]
VM-AZ1-1 ansible_host=10.33.99.73 ansible_user=ubuntu

[az1_loki]
VM-AZ1-2 ansible_host=10.33.99.74 ansible_user=ubuntu

[az1_tracing]
VM-AZ1-3 ansible_host=10.33.99.75 ansible_user=ubuntu

[az2_prometheus]
VM-AZ2-1 ansible_host=10.33.99.76 ansible_user=ubuntu

[az2_loki]
VM-AZ2-2 ansible_host=10.33.99.77 ansible_user=ubuntu

[az2_tracing]
VM-AZ2-3 ansible_host=10.33.99.78 ansible_user=ubuntu

# Group definitions
[az1:children]
az1_prometheus
az1_loki
az1_tracing

[az2:children]
az2_prometheus
az2_loki
az2_tracing

[all_monitoring:children]
az1
az2

Then I needed to create an SSH key.

ssh-keygen -t rsa -b 4096 -f ~/.ssh/lgtm_deployment_key

I then copied that key to all servers.

ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]

Then created an SSH configuration file.

nano ~/.ssh/config

Host VM-*
User ubuntu
IdentityFile ~/.ssh/lgtm_deployment_key
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
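
With the key distributed, connectivity can be verified before running any playbook:

ansible all -i inventory/ha-hosts -m ping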

First I created a playbook to deploy the load balancers.

sudo nano deploy-loadbalancers.yml

---
- name: Deploy Load Balancers
  hosts: loadbalancers
  become: yes
  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes

    - name: Install HAProxy and Keepalived
      apt:
        name:
          - haproxy
          - keepalived
        state: present

    - name: Start and enable services
      systemd:
        name: "{{ item }}"
        state: started
        enabled: yes
      loop:
        - haproxy
        - keepalived

Next I created the main deployment script that runs all the playbooks in order.

sudo nano deploy-ha.sh

#!/bin/bash

set -e

# Configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
INVENTORY="$SCRIPT_DIR/inventory/ha-hosts"
LOG_FILE="/tmp/lgtm-deployment.log"

# Logging function
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Change to script directory
cd "$SCRIPT_DIR"

log "Starting LGTM HA deployment from $SCRIPT_DIR..."

# Test connectivity first
log "Testing Ansible connectivity..."
if ! ansible all -i "$INVENTORY" -m ping > /dev/null 2>&1; then
    log "ERROR: Cannot connect to all hosts. Check inventory and SSH keys."
    exit 1
fi

# Deploy in correct order
log "Deploying Load Balancers..."
ansible-playbook -i "$INVENTORY" deploy-loadbalancers.yml

log "Deploying Databases..."
ansible-playbook -i "$INVENTORY" deploy-databases.yml

log "Deploying NFS Server..."
ansible-playbook -i "$INVENTORY" deploy-nfs.yml

log "Deploying AZ1..."
ansible-playbook -i "$INVENTORY" deploy-az1.yml

log "Deploying AZ2..."
ansible-playbook -i "$INVENTORY" deploy-az2.yml

log "Running health checks..."
if [ -f "$SCRIPT_DIR/scripts/health-check.sh" ]; then
    "$SCRIPT_DIR/scripts/health-check.sh"
else
    log "Health check script not found, skipping..."
fi

log "HA deployment completed successfully!"
log "Logs saved to: $LOG_FILE"

Then I finally ran the deployment.

./deploy-ha.sh
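
As a final spot check, the load-balanced endpoints can be queried through the VIP (192.168.1.100 in the Keepalived configuration above):

curl -s http://192.168.1.100:8404/stats > /dev/null && echo "HAProxy stats reachable"
curl http://192.168.1.100:3100/ready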