Reliability and Disaster Recovery

The QALITA platform stores critical data that must be backed up and restored in the event of a disaster.

Backup

Platform backup is performed by the system administrator, ensuring the persistence of the following elements:

| Element | Criticality Level | Recommended Backup Frequency |
|---|---|---|
| PostgreSQL | ➕➕➕ ⚠️ Critical ⚠️ | Daily |
| SeaweedFS | ➕➕ Moderate | Weekly |
| Agents | ➕➕ Moderate | Can be backed up via PVC to ensure optimal service continuity |
| Redis | None (stateless) | None |
| Frontend | None (stateless) | None |
| Backend | None (stateless) | None |

PostgreSQL

Backups can be configured via the backup feature of the Bitnami PostgreSQL Helm chart.

warning

By default, with the QALITA Helm chart, this backup is not enabled. Once enabled, the backup is performed in a PVC that must itself be backed up in cold or semi-cold storage.
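
As an illustration, a minimal sketch of the values override that enables this backup in the Bitnami PostgreSQL chart; key names vary between chart versions, so check the chart's values reference before applying it:

# Hedged sketch: enable the Bitnami PostgreSQL built-in backup cron job.
# Key names depend on the chart version; verify them against the chart's values.yaml.
backup:
  enabled: true
  cronjob:
    schedule: "0 2 * * *"   # daily at 02:00, matching the recommended frequency
    storage:
      size: 8Gi             # PVC holding the dumps; back this PVC up to cold storage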

SeaweedFS

SeaweedFS storage is less critical because it only contains:

  • Logs
  • Pack archives whose code is versioned in a VCS (e.g., GitLab)

info

In Kubernetes, back up the PVC containing the data. If you do not manage your cluster, ensure that PVCs are included in the backup policy (contact the cluster administrator).
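
One possible approach (not prescribed here) is Velero; a minimal sketch, assuming Velero is installed in the cluster and the platform runs in the qalita namespace:

# Hedged sketch: snapshot the PVCs of the qalita namespace with Velero (assumes Velero is installed).
velero backup create qalita-storage-$(date +%F) \
  --include-namespaces qalita \
  --include-resources persistentvolumeclaims,persistentvolumes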

Agents

Agents store important information for their operation:

  • Sources (sources-conf.yaml): this file contains the source definitions and must be backed up. In the QALITA deployment, a local agent can be deployed with persistence of the ~/.qalita/ directory (see the sketch after this list).

  • Connection to the platform (.agent): contains recent connection information.

    warning

    ⚠️ These files should not be backed up. Use environment variables for authentication. The .env-<login> files are temporary and sensitive.

  • Execution results (~/.qalita/agent_temp_run/): can be configured to be persistent and backed up.
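
A minimal sketch of such persistence in Kubernetes, assuming the agent runs as a Deployment and its home directory is /root/.qalita (the names, namespace, and mount path below are assumptions, not part of the official chart):

# Hedged sketch: keep ~/.qalita/ (sources-conf.yaml, agent_temp_run/) on a PVC.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qalita-agent-home     # illustrative name
  namespace: qalita
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
# In the agent pod spec, mount the claim at the agent's home directory:
#   volumes:
#     - name: qalita-home
#       persistentVolumeClaim:
#         claimName: qalita-agent-home
#   containers[*].volumeMounts:
#     - name: qalita-home
#       mountPath: /root/.qalita

The PVC can then be included in the storage backup policy described above.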


Restoration

PostgreSQL

Follow the Bitnami PostgreSQL documentation.

SeaweedFS

See the official SeaweedFS documentation.

Agents

  1. Re-register the agent: run qalita agent login, then copy back the backed-up ~/.qalita/ directory.

  2. Restore sources: restore the sources-conf.yaml file.

  3. Synchronize source IDs: use qalita source push to realign local sources with those on the platform (command sketch below).
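
A command-line sketch of this sequence, assuming the backup of ~/.qalita/ was copied to /backup/qalita (the path is illustrative):

# 1. Re-register the agent on the platform
qalita agent login
# 2. Restore the backed-up source definitions (backup path is an assumption)
cp /backup/qalita/sources-conf.yaml ~/.qalita/sources-conf.yaml
# 3. Realign local source IDs with those on the platform
qalita source push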


Degraded Mode

In case of partial loss of a component, the platform can continue to operate in degraded mode. Here are the main scenarios:

| Missing Component | Impact | Possible Workaround |
|---|---|---|
| PostgreSQL | Complete platform blockage | None, mandatory restoration |
| SeaweedFS | Temporary loss of logs and archives | Partial operation possible |
| Agent unavailable | Scheduled or manual executions fail | Restart the agent or use a local agent |
| Web platform (frontend) | Read access impossible | Use the REST API or CLI (if the backend is still active) |
| Backend unavailable | All API access and executions are blocked | None, requires redeployment or restoration |
| Redis | Performance loss on certain operations | Manual re-executions, partially stable operation |

Monitoring and SRE

Observability

Recommended tools:

| Component | Recommended Tool |
|---|---|
| Logs | Loki + Grafana / ELK |
| Metrics | Prometheus + Grafana |
| Uptime / probes | Uptime Kuma / Blackbox |
| Tracing | Jaeger / OpenTelemetry |

Proactive Alerting

Set critical thresholds, for example (an alerting rule sketch follows this list):

  • Backend latency > 2 s
  • HTTP 5xx rate > 2%
  • PostgreSQL PVC usage > 85%
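
As a sketch, the thresholds above could be expressed as Prometheus alerting rules; the metric and job names below are assumptions that depend on the exporters you deploy (backend/ingress HTTP metrics, kubelet volume stats), so adapt them to your stack:

# Hedged sketch: Prometheus alerting rules for the thresholds above.
groups:
  - name: qalita-platform
    rules:
      - alert: BackendHighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="qalita-backend"}[5m])) by (le)) > 2
        for: 5m
        labels: {severity: warning}
      - alert: BackendHigh5xxRate
        expr: sum(rate(http_requests_total{job="qalita-backend", code=~"5.."}[5m])) / sum(rate(http_requests_total{job="qalita-backend"}[5m])) > 0.02
        for: 5m
        labels: {severity: critical}
      - alert: PostgresPvcAlmostFull
        expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*postgresql.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*postgresql.*"} > 0.85
        for: 15m
        labels: {severity: critical}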

Send alerts via:

  • Email
  • Slack / MS Teams
  • Opsgenie / PagerDuty

| Domain | Recommended SRE Practice |
|---|---|
| DB | Backups + regular restoration tests |
| Storage | Weekly backups, volume monitoring |
| Network | LB with health checks + retries in the ingress |
| Deployment | Rolling updates |
| Incidents | Runbooks + postmortems |
| Agents | Deployment with PVC, cron job for automatic restart |

Resilience Tests

  • Voluntary deletion of a pod
  • Simulated DB crash
  • Failover test if replicas
  • Network outage simulation
  • Actual restoration of a backup
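
For example, the first two tests can be run as simple chaos drills (namespace and label selectors are assumptions; adapt them to your deployment):

# Hedged sketch: voluntarily delete a backend pod, then watch it being recreated.
kubectl delete pod -n qalita -l app=qalita-backend
kubectl get pods -n qalita -w
# Simulated DB crash: delete the PostgreSQL pod and check that the backend recovers.
kubectl delete pod <postgresql-pod> -n qalita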

Runbooks

1. Backend Not Responding

Symptoms

  • REST API unavailable (5xx, timeout)
  • Web interface not loading (error 502/504)

Diagnosis

kubectl get pods -n qalita
kubectl logs <pod-backend> -n qalita

Immediate Actions

  • Delete the faulty pod: kubectl delete pod <pod-backend> -n qalita
  • Check resources: kubectl top pod <pod-backend> -n qalita
  • If the error is due to an inaccessible DB: kubectl exec <pod-backend> -- psql <url>

Recovery

  • Pod recreated automatically? ✅
  • API tests: curl <backend-url>/health
  • Test a business API call

Postmortem

  • Reason for the crash? (OOM, application bug, logic error)
  • Need to increase resources? Add a readiness probe?
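
If the readiness probe is missing, a minimal sketch of what it could look like on the backend container, reusing the /health endpoint tested above (the port is an assumption):

# Hedged sketch: readiness probe for the backend container.
readinessProbe:
  httpGet:
    path: /health
    port: 8080          # replace with the backend's actual container port
  initialDelaySeconds: 10
  periodSeconds: 15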

2. PostgreSQL Down

Symptoms

  • Backend crashes in a loop
  • Logs containing "could not connect to server: Connection refused"
  • kubectl get pvc indicates an attachment issue

Diagnosis

kubectl get pods -n qalita
kubectl describe pod <postgresql-pod>
kubectl logs <postgresql-pod> -n qalita

Immediate Actions

  • Delete the pod: kubectl delete pod <postgresql-pod> -n qalita
  • Check the PVC: kubectl describe pvc <postgresql-pvc> -n qalita

Restoration

  • If data is lost → restore from backup:
    helm install postgresql bitnami/postgresql \
    --set postgresql.restore.enabled=true \
    --set postgresql.restore.backupSource=<source>
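
Alternatively, if the backup is a plain-SQL dump, a hedged restore sketch through the running pod (database name and dump path are assumptions):

# Hedged sketch: stream a local plain-SQL dump into the PostgreSQL pod.
kubectl exec -i <postgresql-pod> -n qalita -- \
  psql -U postgres -d qalita < /backup/qalita-platform.sql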

Postmortem

  • Why did the pod crash?
  • Are recent backups valid?
  • Should automated restoration tests be scheduled?

3. SeaweedFS Inaccessible

Symptoms

  • Archive downloads fail
  • Unable to display task logs

Diagnosis

kubectl logs <seaweedfs-pod> -n qalita
kubectl describe pvc <seaweedfs-pvc> -n qalita

Immediate Actions

  • Check PVC status
  • Delete and restart the pod
  • Reattach or restart the volume if a CSI driver is used (EBS, Ceph, ...)
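
In command form (pod and PVC names are placeholders):

# Check the PVC, then recreate the pod; its controller will restart it.
kubectl get pvc <seaweedfs-pvc> -n qalita
kubectl delete pod <seaweedfs-pod> -n qalita
kubectl get pods -n qalita -w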

Recovery

  • Validate that objects are accessible via the platform
  • Rerun a task that generates a log

Postmortem

  • Is this a PVC saturation issue?
  • Did the lack of alerting prolong the outage?

4. Agent Blocked or Offline

Symptoms

  • Tasks no longer execute
  • Agent no longer appears in the interface

Diagnosis

kubectl logs <agent-pod> -n qalita

Immediate Actions

  • Restart the local agent: qalita agent run
  • Re-register: qalita agent login
  • Check network access (can it reach the API?)
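
A quick connectivity check from the agent host, reusing the backend health endpoint mentioned in the previous runbook (the URL is a placeholder):

# Verify that the agent host can reach the platform API.
curl -sf <backend-url>/health && echo "API reachable" || echo "API unreachable"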

Recovery

  • Test a simple task via qalita agent run -p pack_id -s source_id
  • Verify result reception on the platform

Postmortem

  • Is the agent too old? DNS issue?
  • Monitor agents via regular heartbeat

5. Memory or CPU Saturated

Symptoms

  • Pod restarts in a loop (CrashLoopBackOff)
  • High API latency

Diagnosis

kubectl top pod -n qalita
kubectl describe pod <pod-name>

Immediate Actions

  • Increase resources in values.yaml
  • Check if a process is consuming abnormally (profiling via pprof)
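
A hedged example of the corresponding resources section in values.yaml, to be applied by the helm upgrade below (the backend key and the figures are illustrative; align them with the QALITA chart's actual structure):

# Hedged sketch: raise the backend requests/limits in values.yaml.
backend:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi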

Recovery

  • Apply new resource parameters:
    helm upgrade qalita-platform ./chart --values values.yaml

Postmortem

  • Is HPA enabled?
  • Is there a spike related to a specific task?

6. Expired TLS Certificate

Symptoms

  • Unable to access the interface
  • Browser error indicating an insecure connection

Diagnosis

kubectl describe certificate -n qalita
kubectl get cert -n qalita

Immediate Actions

  • Manually renew:
    kubectl cert-manager renew <cert-name>
  • Force redeploy of Traefik/Ingress

Recovery

  • Wait for the certificate to be "Ready":
    kubectl get certificate -n qalita -o wide

Postmortem

  • Is cert-manager working correctly?
  • Set up an alert for expiration at D-15
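
A hedged expression for such a D-15 alert, assuming cert-manager's Prometheus metrics are scraped:

# Hedged sketch: fire 15 days before certificate expiry, based on cert-manager metrics.
- alert: CertificateExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 15 * 24 * 3600
  for: 1h
  labels: {severity: warning}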

7. License Issue

Symptoms

  • Messages such as "Invalid or expired license"
  • API returns 401 on login

Diagnosis

  • Check the QALITA_LICENSE_KEY variable
  • Verification test:
    curl -H "Authorization: Bearer <token>" \
    https://<registry>/v2/_catalog

Immediate Actions

  • Check expiration date (contained in the JWT token)
  • Extend via the portal or contact support
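
Assuming the license key is a standard JWT (as suggested above), its expiry claim can be inspected locally; this sketch decodes the payload without verifying the signature and requires jq:

# Hedged sketch: print the "exp" claim (Unix timestamp) of the license JWT.
payload=$(echo "$QALITA_LICENSE_KEY" | cut -d. -f2 | tr '_-' '/+')
# pad to a multiple of 4 characters so base64 accepts it
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
echo "$payload" | base64 -d | jq .exp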

Recovery

  • Redeploy the backend with the new license (if the variable is mounted via secret/env)

Postmortem

  • Is renewal automated or monitored?
  • Was an alert triggered in advance?

Automation Suggestions

| Incident | Possible Automation |
|---|---|
| Agent offline | Cron job for qalita agent ping with alert |
| Expired TLS cert | Script that checks certificates at D-30 |
| PostgreSQL saturated | Prometheus alerts on pg_stat_activity |
| PVC almost full | Alerting on disk usage via kubelet or metrics |
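
For the first line of the table, a hedged crontab sketch (the webhook URL is a placeholder, and the exact behaviour of qalita agent ping should be checked against the agent CLI):

# Hedged sketch: every 5 minutes, ping the platform from the agent host and alert on failure.
*/5 * * * * qalita agent ping || curl -s -X POST -d '{"text":"QALITA agent offline"}' <alert-webhook-url>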

Tip: Classify Incidents in Grafana OnCall / Opsgenie

  • Category: backend, db, network, tls, storage
  • Priority: P1 (blocking), P2 (degraded), P3 (minor)
  • Responsible: infra, dev, data