Storage
Network File System
First I updated the system.
sudo apt update && sudo apt upgrade -y
Then installed NFS server.
sudo apt install nfs-kernel-server -y
Then enabled the services to start on boot and started them.
sudo systemctl enable nfs-kernel-server
sudo systemctl start nfs-kernel-server
sudo systemctl enable rpcbind
sudo systemctl start rpcbind
Next, I created the base directory for Loki storage.
sudo mkdir -p /srv/nfs/loki/{chunks,index,wal,boltdb-cache,compactor}
{chunks,index,wal,boltdb-cache,compactor} – Shell brace expansion; this expands to create all five directories in a single command.
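For reference, the brace expansion is equivalent to creating each directory individually:
sudo mkdir -p /srv/nfs/loki/chunks
sudo mkdir -p /srv/nfs/loki/index
sudo mkdir -p /srv/nfs/loki/wal
sudo mkdir -p /srv/nfs/loki/boltdb-cache
sudo mkdir -p /srv/nfs/loki/compactor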
I then set ownership.
sudo chown -R nobody:nogroup /srv/nfs/loki
Then set permissions.
sudo chmod -R 755 /srv/nfs/loki
Then created the directory structure for each Availability Zone.
sudo mkdir -p /srv/nfs/loki/{az1,az2}/data
I verified the structure was correct.
tree /srv/nfs/loki/
I then created the exports configuration file.
sudo nano /etc/exports
I specified which Loki VMs to export the share to.
/srv/nfs/loki 10.33.99.74(rw,sync,no_subtree_check,no_root_squash,no_all_squash)
/srv/nfs/loki 10.33.99.77(rw,sync,no_subtree_check,no_root_squash,no_all_squash)
Then exported the filesystem.
sudo exportfs -arv
I also verified exports and checked the NFS status.
sudo exportfs -v
sudo systemctl status nfs-kernel-server
On both Loki servers I installed the NFS client.
sudo apt install nfs-common -y
Then created the shared mount directory.
sudo mkdir -p /shared/loki
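The export then needs to be mounted on each Loki VM; a minimal sketch, assuming the NFS server's address is 10.33.99.70 (VM-NFS-1 in the inventory used later):
sudo mount -t nfs 10.33.99.70:/srv/nfs/loki /shared/loki
echo '10.33.99.70:/srv/nfs/loki /shared/loki nfs defaults,_netdev 0 0' | sudo tee -a /etc/fstab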
Load Balancer
HAProxy
On both load balancer VMs, I updated the system.
sudo apt update
Then installed HAProxy and Keepalived.
sudo apt install -y haproxy keepalived
Then enabled IP forwarding.
echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.ip_forward=1' – Outputs the configuration setting.
| – Pipes the output to the next command.
sudo tee -a /etc/sysctl.conf – Appends the setting to the system configuration file; the -a flag means “append” (vs overwriting the file).
This writes net.ipv4.ip_forward=1 to /etc/sysctl.conf, making the change persistent across reboots.
Then reloaded and applied the settings.
sudo sysctl -p
Without IP forwarding (default):
Client → Load Balancer → ❌ Packet dropped
When a packet arrives destined for another IP address, the Linux kernel discards it by default for security reasons.
With IP forwarding enabled:
Client → Load Balancer → Backend Server
← Load Balancer ← Backend Server
The kernel can forward packets between network interfaces, acting as a router.
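The change can be confirmed by reading the parameter back (it should print 1):
sysctl net.ipv4.ip_forward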
Next, I created the configuration file for HAProxy.
sudo nano /etc/haproxy/haproxy.cfg
Global configuration section:
global
daemon
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats timeout 30s
user haproxy
group haproxy
master-worker
# SSL Configuration
ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384
ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets
daemon – Runs HAProxy as a background daemon process.
chroot /var/lib/haproxy – Changes the root directory.
user haproxy / group haproxy – Drops root privileges and runs as the ‘haproxy’ user/group.
master-worker – Enables the modern master-worker process model (HAProxy 1.8+).
Traditional model:
Single HAProxy Process
├── Handles all connections
└── Config reload = full restart
Master-worker model:
Master Process (manages workers)
├── Worker Process 1 (handles traffic)
├── Worker Process 2 (handles traffic)
└── Seamless reloads without connection drops
Benefits:
- Zero-downtime reloads: New workers start while old ones finish existing connections
- Better stability: Master manages workers, restarts failed ones
- Improved monitoring: Separate process for management tasks
Defaults section:
defaults
mode http
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
option httplog
option dontlognull
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http
mode http – Sets HAProxy to operate in HTTP mode.
timeout connect 5000ms / timeout client 50000ms / timeout server 50000ms – These timeouts prevent connections from hanging indefinitely and protect against various attack scenarios.
option httplog – Enables the detailed HTTP logging format.
option dontlognull – Prevents logging of connections that don’t transfer data.
errorfile – Replaces default HAProxy error pages with custom ones.
Normal Request Flow:
Client → [connect timeout] → HAProxy → [connect timeout] → Backend
← [client timeout] ← ← [server timeout] ←
Scenario 1: Slow backend connection
1. Client connects to HAProxy instantly
2. HAProxy tries to connect to backend
3. Backend takes 6 seconds to accept connection
4. Connect timeout (5s) triggers → 502 error
Scenario 2: Slow backend response
1. Connection established quickly
2. Client sends request
3. Backend processes for 60 seconds
4. Server timeout (50s) triggers → 504 error
Scenario 3: Slow client
1. Client connects and starts sending large POST
2. Client sends data very slowly
3. No data sent for 51 seconds
4. Client timeout (50s) triggers → 408 error
Stats page section:
# Stats page
listen stats
bind *:8404
stats enable
stats uri /stats
stats refresh 30s
stats admin if TRUE
bind *:8404 – Binds the stats interface to port 8404 on all network interfaces.
stats enable – Activates the statistics interface.
stats uri /stats – Sets the URL path for accessing statistics.
stats refresh 30s – Auto-refreshes the stats page every 30 seconds.
stats admin if TRUE – Enables administrative functions on the stats page; the TRUE condition makes this apply unconditionally.
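Once HAProxy is running, the stats page is reachable in a browser or with curl, for example from the load balancer itself:
curl -s http://localhost:8404/stats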
Grafana Frontend section:
# Grafana Frontend
frontend grafana_frontend
bind *:443 ssl crt /etc/ssl/certs/grafana.pem
bind *:80
redirect scheme https if !{ ssl_fc }
default_backend grafana_backend
bind *:443 – Listen on port 443 on all interfaces.
ssl – Enable SSL/TLS termination at the load balancer.
crt – SSL certificate file location.
bind *:80 – Listen on port 80 on all interfaces.
redirect scheme – Forces all HTTP traffic to redirect to HTTPS.
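HAProxy expects the certificate and private key concatenated into a single PEM file. A minimal sketch, assuming hypothetical grafana.crt and grafana.key files:
cat grafana.crt grafana.key | sudo tee /etc/ssl/certs/grafana.pem > /dev/null
sudo chmod 600 /etc/ssl/certs/grafana.pem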
Grafana Backend section:
backend grafana_backend
balance roundrobin
option httpchk GET /api/health
server grafana-az1 10.33.99.73:3000 check inter 5s fall 3 rise 2
server grafana-az2 10.33.99.76:3000 check inter 5s fall 3 rise 2
backend grafana_backend – Routes all requests to the ‘grafana_backend’ server pool.
balance roundrobin – Load balancing algorithm.
option httpchk – Configures HTTP health checks for backend servers.
server – Defines the backend servers and their health-check parameters.
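The health endpoint HAProxy polls can also be checked by hand from a load balancer:
curl -s http://10.33.99.73:3000/api/health
curl -s http://10.33.99.76:3000/api/health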
Prometheus Frontend section:
# Prometheus Frontend
frontend prometheus_frontend
bind *:9090
default_backend prometheus_backend
Prometheus Backend section:
backend prometheus_backend
balance roundrobin
option httpchk GET /-/healthy
server prometheus-az1 10.33.99.73:9090 check inter 5s fall 3 rise 2
server prometheus-az2 10.33.99.76:9090 check inter 5s fall 3 rise 2
Loki Frontend section:
# Loki Frontend
frontend loki_frontend
bind *:3100
default_backend loki_backend
Loki Backend section:
backend loki_backend
balance roundrobin
option httpchk GET /ready
server loki-az1 VM-AZ1-2:3100 check inter 5s fall 3 rise 2
server loki-az2 VM-AZ2-2:3100 check inter 5s fall 3 rise 2
Tempo Frontend section:
# Tempo Frontend
frontend tempo_frontend
bind *:3200
default_backend tempo_backend
Tempo Backend section:
backend tempo_backend
balance roundrobin
option httpchk GET /ready
server tempo-az1 VM-AZ1-3:3200 check inter 5s fall 3 rise 2
server tempo-az2 VM-AZ2-3:3200 check inter 5s fall 3 rise 2
Mimir Frontend section:
# Mimir Frontend
frontend mimir_frontend
bind *:9009
default_backend mimir_backend
Mimir Backend section:
backend mimir_backend
balance roundrobin
option httpchk GET /ready
server mimir-az1 VM-AZ1-3:9009 check inter 5s fall 3 rise 2
server mimir-az2 VM-AZ2-3:9009 check inter 5s fall 3 rise 2
Keepalived
I then edited the keepalived configuration file.
sudo nano /etc/keepalived/keepalived.conf
On VM-LB1 (Primary)
vrrp_script chk_haproxy {
script "/bin/kill -0 `cat /var/run/haproxy.pid`"
interval 2
weight 2
fall 3
rise 2
}
vrrp_instance VI_1 {
state MASTER
interface eth0 # Adjust to your interface
virtual_router_id 51
priority 110
advert_int 1
authentication {
auth_type PASS
auth_pass your_secure_password
}
virtual_ipaddress {
192.168.1.100 # Your VIP
}
track_script {
chk_haproxy
}
notify_master "/etc/keepalived/master.sh"
notify_backup "/etc/keepalived/backup.sh"
notify_fault "/etc/keepalived/fault.sh"
}
On VM-LB2 (Backup)
vrrp_script chk_haproxy {
script "/bin/kill -0 `cat /var/run/haproxy.pid`"
interval 2
weight 2
fall 3
rise 2
}
vrrp_instance VI_1 {
state BACKUP
interface eth0 # Adjust to your interface
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass your_secure_password
}
virtual_ipaddress {
192.168.1.100 # Your VIP
}
track_script {
chk_haproxy
}
notify_master "/etc/keepalived/master.sh"
notify_backup "/etc/keepalived/backup.sh"
notify_fault "/etc/keepalived/fault.sh"
}
I also created a script that logs a notification and starts HAProxy when the server switches to MASTER:
sudo nano /etc/keepalived/master.sh
#!/bin/bash
echo "$(date): Becoming MASTER" >> /var/log/keepalived.log
systemctl start haproxy
And another script that logs a notification and stops HAProxy when the server switches to BACKUP:
sudo nano /etc/keepalived/backup.sh
#!/bin/bash
echo "$(date): FAULT detected" >> /var/log/keepalived.log
systemctl stop haproxy
To make scripts executable:
chmod +x /etc/keepalived/*.sh
Example log entries:
2025-08-10 10:15:30: Becoming MASTER
2025-08-10 10:17:45: Becoming BACKUP
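To confirm which node currently holds the VIP, the virtual address can be checked on the interface (assuming the eth0 interface and the VIP from the configuration above):
ip addr show eth0 | grep 192.168.1.100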
Database
PostgreSQL
Master Database
I installed PostgreSQL.
sudo apt install -y postgresql postgresql-contrib
Then configured PostgreSQL for replication.
sudo -u postgres psql -c "CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'MyPassword123!';"
sudo -u postgres – Switches to the ‘postgres’ system user.
CREATE USER replicator – Creates a new database user named ‘replicator’.
WITH REPLICATION – Grants replication privileges.
ENCRYPTED PASSWORD – Sets an encrypted password.
I then needed to edit the postgresql configuration file.
sudo nano /etc/postgresql/14/main/postgresql.conf
I added the following to the end:
# Replication settings
wal_level = replica
max_wal_senders = 3
wal_keep_size = 64MB
synchronous_commit = on
synchronous_standby_names = 'standby1'
wal_level – Sets the Write-Ahead Logging level to ‘replica’.
max_wal_senders – Sets the number of background processes that stream WAL data to standby servers.
wal_keep_size – Sets the minimum amount of WAL data kept for standby servers.
synchronous_standby_names – Identifies the standby name used for synchronous replication.
I then edited the pg_hba configuration file.
sudo nano /etc/postgresql/14/main/pg_hba.conf
and added this to the bottom.
# Replication connections
host replication replicator 10.33.99.80/32 md5
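After editing both files, the changes need to be applied on the master; a restart covers both (pg_hba.conf alone would only require a reload):
sudo systemctl restart postgresql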
Replica Database
First, I stopped Postgres.
sudo systemctl stop postgresql
I then deleted all database data. That way the replica database gets an exact copy of the master’s data directory.
sudo rm -rf /var/lib/postgresql/14/main/*
Then I took a base backup from the master to initialize the replica.
sudo -u postgres pg_basebackup -h 10.33.99.79 -D /var/lib/postgresql/14/main -U replicator -W -v -P -R
-h 10.33.99.79 – Connects to the master database server.
-D /var/lib/postgresql/14/main – Specifies where the backup will be written.
-U replicator – Uses the replication user I created.
-W – Forces a password prompt.
-v – Shows what pg_basebackup is doing.
-P – Shows transfer progress and ETA.
-R – Automatically sets up replication settings.
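With the base backup restored, PostgreSQL can be started again on the replica:
sudo systemctl start postgresql
Replication status can then be verified from the master using the built-in pg_stat_replication view:
sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"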
Docker
Prometheus
On both Prometheus VMs I created the following YAML configuration file.
sudo nano prometheus-ha.yml
Global section:
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production-ha'
replica: 'prometheus-az1' # Change to az2 for second instance
rule_files:
- "/etc/prometheus/rules/*.yml"
scrape_interval – Defines how frequently Prometheus collects metrics from all targets.
evaluation_interval – Defines how frequently Prometheus evaluates alerting and recording rules.
external_labels – Metadata attached to all metrics scraped by this Prometheus instance.
cluster – Identifies which cluster this Prometheus monitors.
replica – Identifies which Prometheus instance collected the metric.
Alerting section:
alerting:
alertmanagers:
- static_configs:
- targets:
- '10.33.99.73:9093'
- '10.33.99.76:9093'
Scrape Configs section:
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"prometheus|node|alertmanager"}'
- '{__name__=~"job:.*"}'
static_configs:
- targets:
- '10.33.99.73:9090' # Remove self from targets
- '10.33.99.76:9090'
- job_name: 'node-exporter'
static_configs:
- targets:
- '10.33.99.73:9100'
- '10.33.99.74:9100'
- '10.33.99.75:9100'
- '10.33.99.76:9100'
- '10.33.99.77:9100'
- '10.33.99.78:9100'
Federation allows one Prometheus server to scrape selected time series from another Prometheus server, creating hierarchical monitoring architectures.
metrics_path – /federate is Prometheus’s built-in federation endpoint.
{job=~...} – Selects metrics from specific jobs.
{__name__=~...} – Selects pre-computed recording rules.
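The federation endpoint can also be queried by hand to see what one instance exposes to the other; for example, against the AZ1 instance:
curl -sG http://10.33.99.73:9090/federate --data-urlencode 'match[]={job="prometheus"}'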
Remote Write section:
remote_write:
- url: http://10.33.99.75:9009/api/v1/push
queue_config:
max_samples_per_send: 10000
- url: http://10.33.99.78:9009/api/v1/push
queue_config:
max_samples_per_send: 10000
Prometheus will send metrics to both Mimir instances simultaneously.
Then created the docker-compose YAML configuration file that will install and run Prometheus, Alert Manager, and Grafana.
sudo nano docker-compose.yml
Prometheus section:
services:
prometheus:
image: prom/prometheus:v2.47.0
container_name: prometheus
restart: unless-stopped
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=7d'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
- '--web.external-url=http://prometheuscluster.com:9090'
volumes:
- ./prometheus-ha.yml:/etc/prometheus/prometheus.yml:ro
- ./rules:/etc/prometheus/rules:ro
- prometheus-data:/prometheus
networks:
- monitoring
--config.file – Specifies the main Prometheus configuration file.
--storage.tsdb.path – Sets the Time Series Database storage directory.
--storage.tsdb.retention.time – Keeps metrics for 7 days in local storage.
--web.console.* – Built-in web interface templates for basic dashboards.
--web.enable-lifecycle – Enables HTTP endpoints for configuration management.
--web.enable-admin-api – Enables powerful management endpoints.
--web.external-url – How Prometheus should reference itself in external communications.
./prometheus-ha.yml – Maps the host config file into the container.
./rules – Maps the host rules directory into the container.
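Because --web.enable-lifecycle is set, the running container can reload its configuration without a restart:
curl -X POST http://localhost:9090/-/reload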
Alert Manager section:
alertmanager:
image: prom/alertmanager:v0.26.0
container_name: alertmanager
restart: unless-stopped
ports:
- "9093:9093"
command:
- '--config.file=/etc/alertmanager/config.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=http://prometheuscluster.com:9093'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=10.33.99.73:9094' # Remove self from peers
- '--cluster.peer=10.33.99.76:9094'
volumes:
- ./alertmanager:/etc/alertmanager:ro
- alertmanager-data:/alertmanager
networks:
- monitoring
Grafana section:
grafana:
image: grafana/grafana-enterprise:10.1.0
container_name: grafana
restart: unless-stopped
ports:
- "3000:3000"
environment:
- GF_DATABASE_TYPE=postgres
- GF_DATABASE_HOST=10.33.99.79:5432
- GF_DATABASE_NAME=grafana
- GF_DATABASE_USER=grafana
- GF_DATABASE_PASSWORD=${DB_PASSWORD}
- GF_SESSION_PROVIDER=postgres
- GF_SESSION_PROVIDER_CONFIG=host=10.33.99.79 port=5432 user=grafana password=${DB_PASSWORD} dbname=grafana sslmode=disable
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning:ro
networks:
- monitoring
volumes:
prometheus-data:
alertmanager-data:
networks:
monitoring:
driver: bridge
I then created the database and user for Grafana on the master database (VM-DB1), running the following statements in psql.
CREATE USER grafana WITH PASSWORD 'GrafanaPassword123!';
CREATE DATABASE grafana OWNER grafana;
GRANT ALL PRIVILEGES ON DATABASE grafana TO grafana;
GRANT CREATE ON SCHEMA public TO grafana;
GRANT USAGE ON SCHEMA public TO grafana;
I then created the .env file to store the Grafana database password.
echo "DB_PASSWORD=GrafanaPassword123!" > .env
I then generated a strong password.
GRAFANA_PASSWORD=$(openssl rand -base64 32)
And saved it as my Grafana Admin password.
echo "GRAFANA_PASSWORD=$GRAFANA_PASSWORD" >> .env
Loki
Next, I created the YAML configuration file for Loki.
sudo nano loki-ha.yml
Server section:
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
grpc_server_max_recv_msg_size: 104857600
grpc_server_max_send_msg_size: 104857600
distributor:
ring:
kvstore:
store: memberlist
ingester:
max_transfer_retries: 0
lifecycler:
ring:
kvstore:
store: memberlist
replication_factor: 2
tokens_file_path: /loki/wal/tokens
memberlist:
join_members:
- 10.33.99.74:7946
- 10.33.99.77:7946
bind_port: 7946
Storage Config section:
# Local filesystem storage
storage_config:
boltdb_shipper:
active_index_directory: /loki/index
shared_store: filesystem
cache_location: /loki/boltdb-cache
filesystem:
directory: /loki/chunks
# Compactor for local cleanup
compactor:
working_directory: /loki/compactor
shared_store: filesystem
active_index_directory – Where Loki writes currently active index files.
shared_store – Uses the local filesystem for ‘shipped’ indexes.
cache_location – Local filesystem location for the boltdb-shipper index cache.
directory: /loki/chunks – This is where compressed log content is stored.
compactor – Merges small index files into larger ones, combines small chunk files for efficiency, removes old data according to retention policies, eliminates duplicate log entries, and reduces storage overhead while improving query performance.
Querier section:
# Query and retention settings
querier:
max_concurrent: 2048
query_frontend:
max_outstanding_per_tenant: 2048
compress_responses: true
limits_config:
enforce_metric_name: false
reject_old_samples: true
reject_old_samples_max_age: 168h
ingestion_rate_mb: 16
ingestion_burst_size_mb: 32
# Set retention period
table_manager:
retention_deletes_enabled: true
retention_period: 30d
max_concurrent – How many queries a single querier instance can process simultaneously.
max_outstanding_per_tenant – Controls query queuing per tenant at the query frontend level.
compress_responses – Compresses HTTP responses from the query frontend to clients (Grafana).
ingestion_rate_mb – Maximum continuous data ingestion rate per tenant.
ingestion_burst_size_mb – Allows short-term spikes above the sustained rate.
Then I created the docker-compose YAML file for Loki.
docker-compose.yml
version: '3.8'
services:
loki:
image: grafana/loki:2.9.0
container_name: loki
restart: unless-stopped
ports:
- "3100:3100"
- "7946:7946"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- ./loki-ha.yml:/etc/loki/local-config.yaml:ro
- /shared/loki:/loki # NFS shared storage
networks:
- loki
depends_on:
- check-nfs
# Health check service to ensure NFS is mounted
check-nfs:
image: alpine
command: |
sh -c "
if [ ! -d /shared/loki ]; then
echo 'ERROR: NFS mount not available at /shared/loki'
exit 1
fi
echo 'NFS mount verified at /shared/loki'
touch /shared/loki/.docker-health-check
"
volumes:
- /shared/loki:/shared/loki
networks:
- loki
promtail:
image: grafana/promtail:2.9.0
container_name: promtail
restart: unless-stopped
volumes:
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- ./promtail-config.yml:/etc/promtail/config.yml:ro
command: -config.file=/etc/promtail/config.yml
networks:
- loki
depends_on:
- loki
node-exporter:
image: prom/node-exporter:v1.6.0
container_name: node-exporter
restart: unless-stopped
ports:
- "9100:9100"
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
networks:
- loki
networks:
loki:
driver: bridge
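Once the containers are running, Loki's readiness endpoint (the same one HAProxy health-checks) can be queried on each node:
curl -s http://localhost:3100/ready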
Tempo
For Tempo I just created a tempo-ha.yml configuration file.
server:
http_listen_port: 3200
grpc_listen_port: 9095
distributor:
receivers:
jaeger:
protocols:
thrift_http:
endpoint: 0.0.0.0:14268
grpc:
endpoint: 0.0.0.0:14250
zipkin:
endpoint: 0.0.0.0:9411
otlp:
protocols:
http:
endpoint: 0.0.0.0:4318
grpc:
endpoint: 0.0.0.0:4317
ring:
kvstore:
store: memberlist
ingester:
ring:
kvstore:
store: memberlist
replication_factor: 2
max_block_duration: 5m
memberlist:
join_members:
- 10.33.99.75:7946
- 10.33.99.78:7946
bind_port: 7946
compactor:
ring:
kvstore:
store: memberlist
querier:
frontend_worker:
frontend_address: tempo-query-frontend:9095
query_frontend:
search:
duration_slo: 5s
throughput_bytes_slo: 1.073741824e+09
trace_by_id:
duration_slo: 5s
storage:
trace:
backend: s3
s3:
bucket: your-tempo-bucket
region: us-east-1
wal:
path: /tmp/tempo/wal
pool:
max_workers: 100
queue_depth: 10000
metrics_generator:
registry:
external_labels:
source: tempo
cluster: production-ha
storage:
path: /tmp/tempo/generator/wal
remote_write:
- url: http://VM-AZ1-1:9090/api/v1/write
send_exemplars: true
- url: http://VM-AZ2-1:9090/api/v1/write
send_exemplars: true
Mimir
Same thing with Mimir: I created a mimir-ha.yml configuration file.
multitenancy_enabled: false
server:
http_listen_port: 9009
grpc_listen_port: 9095
distributor:
shard_by_all_labels: true
pool:
health_check_ingesters: true
ha_tracker:
enable_ha_tracker: true
kvstore:
store: memberlist
ring:
kvstore:
store: memberlist
ingester:
ring:
kvstore:
store: memberlist
replication_factor: 2
heartbeat_period: 5s
heartbeat_timeout: 1m
tokens_file_path: /data/tokens
memberlist:
join_members:
- 10.33.99.75:7946
- 10.33.99.78:7946
bind_port: 7946
blocks_storage:
backend: s3
s3:
endpoint: s3.amazonaws.com
bucket_name: your-mimir-bucket
region: us-east-1
tsdb:
dir: /data/tsdb
compactor:
data_dir: /data/compactor
ring:
kvstore:
store: memberlist
store_gateway:
sharding_ring:
replication_factor: 2
kvstore:
store: memberlist
ruler:
rule_path: /data/ruler
ring:
kvstore:
store: memberlist
ruler_storage:
backend: s3
s3:
bucket_name: your-mimir-rules-bucket
region: us-east-1
alertmanager:
data_dir: /data/alertmanager
external_url: /alertmanager
sharding_ring:
replication_factor: 2
kvstore:
store: memberlist
alertmanager_storage:
backend: s3
s3:
bucket_name: your-alertmanager-bucket
region: us-east-1
limits:
compactor_blocks_retention_period: 30d
ingestion_rate: 20000
ingestion_burst_size: 40000
Deployment
Ansible
To deploy the stack I used Ansible, which I installed on a separate VM.
sudo apt install ansible -y
Then created the deployment directory.
mkdir -p /opt/lgtm-deployment
cd /opt/lgtm-deployment
Then created the directory structure.
mkdir -p {inventory,scripts,group_vars,host_vars,roles}
mkdir -p roles/{haproxy,keepalived,postgresql,prometheus,grafana,loki,tempo,mimir}
Next, I created the main files.
touch deploy-ha.sh
touch inventory/ha-hosts
touch scripts/health-check.sh
touch {deploy-loadbalancers.yml,deploy-databases.yml,deploy-nfs.yml,deploy-az1.yml,deploy-az2.yml}
I then created the inventory file.
sudo nano inventory/ha-hosts
Here I added my server information.
[loadbalancers]
VM-LB1 ansible_host=10.33.99.71 ansible_user=ubuntu
VM-LB2 ansible_host=10.33.99.72 ansible_user=ubuntu
[databases]
VM-DB1 ansible_host=10.33.99.79 ansible_user=ubuntu
VM-DB2 ansible_host=10.33.99.80 ansible_user=ubuntu
[nfs_servers]
VM-NFS-1 ansible_host=10.33.99.70 ansible_user=ubuntu
[az1_prometheus]
VM-AZ1-1 ansible_host=10.33.99.73 ansible_user=ubuntu
[az1_loki]
VM-AZ1-2 ansible_host=10.33.99.74 ansible_user=ubuntu
[az1_tracing]
VM-AZ1-3 ansible_host=10.33.99.75 ansible_user=ubuntu
[az2_prometheus]
VM-AZ2-1 ansible_host=10.33.99.76 ansible_user=ubuntu
[az2_loki]
VM-AZ2-2 ansible_host=10.33.99.77 ansible_user=ubuntu
[az2_tracing]
VM-AZ2-3 ansible_host=10.33.99.78 ansible_user=ubuntu
# Group definitions
[az1:children]
az1_prometheus
az1_loki
az1_tracing
[az2:children]
az2_prometheus
az2_loki
az2_tracing
[all_monitoring:children]
az1
az2
Then I needed to create an SSH key.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/lgtm_deployment_key
I then copied that key to all servers.
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
ssh-copy-id -i ~/.ssh/lgtm_deployment_key.pub [email protected]
Then created an SSH configuration file.
nano ~/.ssh/config
Host VM-*
User ubuntu
IdentityFile ~/.ssh/lgtm_deployment_key
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
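Connectivity from the Ansible VM can then be verified before running any playbooks:
ansible all -i inventory/ha-hosts -m ping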
First I created a playbook to deploy the load balancers.
sudo nano deploy-loadbalancers.yml
---
- name: Deploy Load Balancers
hosts: loadbalancers
become: yes
tasks:
- name: Update apt cache
apt:
update_cache: yes
- name: Install HAProxy and Keepalived
apt:
name:
- haproxy
- keepalived
state: present
- name: Start and enable services
systemd:
name: "{{ item }}"
state: started
enabled: yes
loop:
- haproxy
- keepalived
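This playbook can also be dry-run on its own before it is wired into the main deployment script; --check reports what would change without applying anything:
ansible-playbook -i inventory/ha-hosts deploy-loadbalancers.yml --check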
Next I created the deployment script that runs all the playbooks in order.
sudo nano deploy-ha.sh
#!/bin/bash
set -e
# Configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
INVENTORY="$SCRIPT_DIR/inventory/ha-hosts"
LOG_FILE="/tmp/lgtm-deployment.log"
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Change to script directory
cd "$SCRIPT_DIR"
log "Starting LGTM HA deployment from $SCRIPT_DIR..."
# Test connectivity first
log "Testing Ansible connectivity..."
if ! ansible all -i "$INVENTORY" -m ping > /dev/null 2>&1; then
log "ERROR: Cannot connect to all hosts. Check inventory and SSH keys."
exit 1
fi
# Deploy in correct order
log "Deploying Load Balancers..."
ansible-playbook -i "$INVENTORY" deploy-loadbalancers.yml
log "Deploying Databases..."
ansible-playbook -i "$INVENTORY" deploy-databases.yml
log "Deploying NFS Server..."
ansible-playbook -i "$INVENTORY" deploy-nfs.yml
log "Deploying AZ1..."
ansible-playbook -i "$INVENTORY" deploy-az1.yml
log "Deploying AZ2..."
ansible-playbook -i "$INVENTORY" deploy-az2.yml
log "Running health checks..."
if [ -f "$SCRIPT_DIR/scripts/health-check.sh" ]; then
"$SCRIPT_DIR/scripts/health-check.sh"
else
log "Health check script not found, skipping..."
fi
log "HA deployment completed successfully!"
log "Logs saved to: $LOG_FILE"
Then I finally ran the deployment.
./deploy-ha.sh