Reliability and Disaster Recovery
The QALITA platform stores critical data that must be backed up and restored in the event of a disaster.
Backup
Platform backups are performed by the system administrator and must ensure the persistence of the following elements:
Element | Criticality Level | Recommended Backup Frequency |
---|---|---|
PostgreSQL | ➕➕➕ ⚠️ Critical ⚠️ | Daily |
SeaweedFS | ➕➕ Moderate | Weekly |
Agents | ➕➕ Moderate | (Can be backed up via PVC to ensure optimal service continuity) |
Redis | None (stateless) | None |
Frontend | None (stateless) | None |
Backend | None (stateless) | None |
PostgreSQL
Backup can be configured via the backup feature of the Bitnami Helm chart.
By default, the QALITA Helm chart does not enable this backup. Once enabled, backups are written to a PVC, which must itself be backed up to cold or semi-cold storage.
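For reference, the backup CronJob of the Bitnami sub-chart can typically be enabled through Helm values. A minimal sketch, assuming the values are nested under a `postgresql` key; the exact keys depend on the chart version, so check `helm show values` first:

```bash
# Sketch only: key names depend on your Bitnami chart version and on how the
# QALITA chart nests its PostgreSQL sub-chart values.
helm upgrade qalita-platform ./chart \
  --reuse-values \
  --set postgresql.backup.enabled=true \
  --set postgresql.backup.cronjob.schedule="0 2 * * *" \
  --set postgresql.backup.cronjob.storage.size=8Gi
```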
SeaweedFS
SeaweedFS storage is less critical because it only contains:
- Logs
- Pack archives whose code is versioned in a VCS (e.g., GitLab)
In Kubernetes, back up the PVC containing the data. If you do not manage your cluster, ensure that PVCs are included in the backup policy (contact the cluster administrator).
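If your cluster supports CSI volume snapshots, a snapshot of the SeaweedFS PVC can serve as the backup source. A minimal sketch, assuming the external-snapshotter CRDs and a `VolumeSnapshotClass` are installed (names in angle brackets are placeholders):

```bash
# Sketch: snapshot the SeaweedFS PVC via the CSI snapshot API.
kubectl apply -n qalita -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: seaweedfs-data-snapshot
spec:
  volumeSnapshotClassName: <snapshot-class>
  source:
    persistentVolumeClaimName: <seaweedfs-pvc>
EOF
# Check that the snapshot becomes ReadyToUse
kubectl get volumesnapshot -n qalita
```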
Agents
Agents store information that is important for their operation (a backup sketch follows this list):
- Sources (`sources-conf.yaml`): this file contains the source definitions and must be backed up. In the QALITA deployment, a local agent can be deployed with persistence of the `~/.qalita/` directory.
- Connection to the platform (`.agent`): contains recent connection information. ⚠️ These files should not be backed up; use environment variables for authentication. The `.env-<login>` files are temporary and sensitive.
- Execution results (`~/.qalita/agent_temp_run/`): can be configured to be persistent and backed up.
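A minimal backup sketch for the agent directory, assuming the files above live under `~/.qalita/` and that the sensitive `.agent` and `.env-<login>` files must be excluded:

```bash
# Sketch: archive the agent state while excluding sensitive/temporary files.
# Adjust the paths if your agent home directory differs.
tar -czf qalita-agent-backup-$(date +%F).tar.gz \
  --exclude='.agent' \
  --exclude='.env-*' \
  -C "$HOME" .qalita
```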
Restoration
PostgreSQL
Follow the Bitnami PostgreSQL documentation.
SeaweedFS
See the official SeaweedFS documentation.
Agents
- Re-register the agent: run `qalita agent login` and copy the `~/.qalita/` directory.
- Restore sources: restore the `sources-conf.yaml` file.
- Synchronize source IDs: use `qalita source push` to realign local sources with those on the platform (the full sequence is sketched below).
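The full sequence could look like the following sketch (the backup path is a placeholder):

```bash
qalita agent login                                  # re-register the agent with the platform
cp <backup-dir>/sources-conf.yaml ~/.qalita/        # restore the source definitions
qalita source push                                  # realign local source IDs with the platform
```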
Degraded Mode
In case of partial loss of a component, the platform can continue to operate in degraded mode. Here are the main scenarios:
Missing Component | Impact | Possible Workaround |
---|---|---|
PostgreSQL | Complete platform blockage | None, mandatory restoration |
SeaweedFS | Temporary loss of logs and archives | Partial operation possible |
Agent unavailable | Scheduled or manual executions fail | Restart the agent or use a local agent |
Web platform (frontend) | No access via the web interface | Use the REST API or CLI (if the backend is still active) |
Backend unavailable | All API access and executions are blocked | None, requires redeployment or restoration |
Redis | Performance loss on certain operations | Manual re-executions; operation remains partially stable |
Monitoring and SRE
Observability
Recommended tools:
Component | Recommended Tool |
---|---|
Logs | Loki + Grafana / ELK |
Metrics | Prometheus + Grafana |
Uptime / probes | Uptime Kuma / Blackbox |
Tracing | Jaeger / OpenTelemetry |
Proactive Alerting
Set critical thresholds:
- Backend latency > 2s
- HTTP 5xx rate > 2%
- PostgreSQL PVC > 85% usage
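These thresholds can be encoded as Prometheus alerting rules. A sketch using the Prometheus Operator `PrometheusRule` CRD; the metric names are assumptions and must be adapted to your backend instrumentation and kubelet metrics:

```bash
# Sketch of a PrometheusRule; metric names and label selectors are assumptions.
kubectl apply -n qalita -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: qalita-critical-thresholds
spec:
  groups:
  - name: qalita
    rules:
    - alert: BackendHighLatency
      expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="qalita-backend"}[5m])) by (le)) > 2
      for: 5m
    - alert: BackendHighErrorRate
      expr: sum(rate(http_requests_total{job="qalita-backend",code=~"5.."}[5m])) / sum(rate(http_requests_total{job="qalita-backend"}[5m])) > 0.02
      for: 5m
    - alert: PostgresPvcAlmostFull
      expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*postgresql.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*postgresql.*"} > 0.85
      for: 15m
EOF
```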
Send alerts via:
- Slack / MS Teams
- Opsgenie / PagerDuty
Recommended Resilience
Domain | Recommended SRE Practice |
---|---|
DB | Backups + regular restoration tests |
Storage | Weekly backups, volume monitoring |
Network | LB with health checks + retries in ingress |
Deployment | Rolling update |
Incidents | Runbooks + postmortems |
Agents | Deployment with PVC, cron job for automatic restart |
Resilience Tests
- Voluntary deletion of a pod
- Simulated DB crash
- Failover test (if replicas are configured)
- Network outage simulation
- Actual restoration of a backup
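A few of these tests can be expressed directly as commands; a sketch, with label selectors and names as placeholders:

```bash
kubectl delete pod -n qalita -l app=<backend>      # voluntary deletion of a pod
kubectl delete pod -n qalita <postgresql-pod>      # simulated DB crash (the StatefulSet recreates it)
curl -s <backend-url>/health                       # verify recovery after each test
```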
Runbooks
1. Backend Not Responding
Symptoms
- REST API unavailable (`5xx`, `timeout`)
- Web interface not loading (error 502/504)
Diagnosis
kubectl get pods -n qalita
kubectl logs <pod-backend> -n qalita
Immediate Actions
- Delete the faulty pod: `kubectl delete pod <pod-backend> -n qalita`
- Check resources: `kubectl top pod <pod-backend> -n qalita`
- If the error is due to an inaccessible DB, test the connection: `kubectl exec <pod-backend> -- psql <url>`
Recovery
- Pod recreated automatically? ✅
- API test: `curl <backend-url>/health`
- Test a business API call
Postmortem
- Reason for the crash? (OOM kill, application crash, logic error)
- Need to increase resources? Add a readiness probe?
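If the postmortem concludes that a readiness probe is missing, it can be added with a strategic merge patch. A sketch, where the container name and port are assumptions (the `/health` path matches the check above):

```bash
# Sketch: add a readiness probe to the backend Deployment.
# Verify the container name, path, and port against your chart values first.
kubectl patch deployment <backend-deployment> -n qalita --patch '
spec:
  template:
    spec:
      containers:
      - name: <backend-container>
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
'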
2. PostgreSQL Down
Symptoms
- Backend crashes in a loop
- Logs containing `could not connect to server: Connection refused`
- `kubectl get pvc` indicates an attachment issue
Diagnosis
kubectl get pods -n qalita
kubectl describe pod <postgresql-pod>
kubectl logs <postgresql-pod> -n qalita
Immediate Actions
- Delete the pod: `kubectl delete pod <postgresql-pod> -n qalita`
- Check the PVC: `kubectl describe pvc <postgresql-pvc> -n qalita`
Restoration
- If data is lost → restore from backup:
helm install postgresql bitnami/postgresql \
--set postgresql.restore.enabled=true \
--set postgresql.restore.backupSource=<source>
Postmortem
- Why did the pod crash?
- Are recent backups valid?
- Should automated restoration tests be scheduled?
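One way to schedule automated restoration tests is a periodic restore drill into a disposable instance. A sketch, assuming a custom-format `pg_dump` file (image, paths, and credentials are placeholders):

```bash
# Sketch of a restore drill: load the latest dump into a throwaway PostgreSQL pod
# and run a sanity query, then clean up.
kubectl run pg-restore-test -n qalita --image=postgres:16 \
  --env=POSTGRES_PASSWORD=throwaway --restart=Never
kubectl wait pod/pg-restore-test -n qalita --for=condition=Ready --timeout=120s
kubectl cp <local-backup-dir>/latest.dump qalita/pg-restore-test:/tmp/latest.dump
# Give PostgreSQL a few seconds to initialize if the restore fails immediately
kubectl exec -n qalita pg-restore-test -- pg_restore -U postgres -d postgres --no-owner /tmp/latest.dump
kubectl exec -n qalita pg-restore-test -- psql -U postgres -c "SELECT count(*) FROM information_schema.tables;"
kubectl delete pod -n qalita pg-restore-test
```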
3. SeaweedFS Inaccessible
Symptoms
- Archive downloads fail
- Unable to display task logs
Diagnosis
kubectl logs <seaweedfs-pod> -n qalita
kubectl describe pvc <seaweedfs-pvc> -n qalita
Immediate Actions
- Check the PVC status and fill level (see the sketch below)
- Delete and restart the pod
- If a CSI driver is used (EBS, Ceph, ...), remount/reattach the volume
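A quick saturation check from inside the pod; the `/data` mount path is an assumption, verify it against the chart values:

```bash
# Sketch: check the fill level of the SeaweedFS data volume and recent PVC events.
kubectl exec -n qalita <seaweedfs-pod> -- df -h /data
kubectl get events -n qalita --field-selector involvedObject.name=<seaweedfs-pvc>
```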
Recovery
- Validate that objects are accessible via the platform
- Rerun a task that generates a log
Postmortem
- Is this a PVC saturation issue?
- Did the lack of alerting prolong the outage?
4. Agent Blocked or Offline
Symptoms
- Tasks no longer execute
- Agent no longer appears in the interface
Diagnosis
kubectl logs <agent-pod> -n qalita
Immediate Actions
- Restart the local agent: `qalita agent run`
- Re-register: `qalita agent login`
- Check network access (can it reach the API?)
Recovery
- Test a simple task via `qalita agent run -p pack_id -s source_id`
- Verify result reception on the platform
Postmortem
- Is the agent version too old? Is there a DNS issue?
- Monitor agents with a regular heartbeat (see the sketch below)
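A heartbeat can be as simple as a cron entry that alerts when the check fails; a sketch, where the webhook URL is a placeholder and `qalita agent ping` is the check suggested in the automation table below:

```bash
# Sketch of a user crontab entry: every 5 minutes, alert if the agent does not respond.
*/5 * * * * qalita agent ping || curl -s -X POST -d '{"text":"QALITA agent heartbeat failed"}' <alert-webhook-url>
```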
5. Memory or CPU Saturated
Symptoms
- Pod restarts in a loop (`CrashLoopBackOff`)
- High API latency
Diagnosis
kubectl top pod -n qalita
kubectl describe pod <pod-name>
Immediate Actions
- Increase resources in `values.yaml`
- Check if a process is consuming abnormally (profiling via pprof)
Recovery
- Apply the new resource parameters: `helm upgrade qalita-platform ./chart --values values.yaml`
Postmortem
- Is HPA enabled? (see the sketch below)
- Is there a spike related to a specific task?
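If HPA is not enabled, a sketch for adding one (requires metrics-server; the deployment name and bounds are assumptions):

```bash
# Sketch: autoscale the backend between 2 and 5 replicas at 75% CPU.
kubectl autoscale deployment <backend-deployment> -n qalita --cpu-percent=75 --min=2 --max=5
kubectl get hpa -n qalita
```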
6. Expired TLS Certificate
Symptoms
- Unable to access the interface
- Browser error "unsecured connection"
Diagnosis
kubectl describe certificate -n qalita
kubectl get cert -n qalita
Immediate Actions
- Manually renew the certificate: `kubectl cert-manager renew <cert-name>`
- Force a redeploy of Traefik/Ingress
Recovery
- Wait for the certificate to be `Ready`: `kubectl get certificate -n qalita -o wide`
Postmortem
- Is cert-manager working correctly?
- Set up an alert for expiration at D-15
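A sketch of a D-15 expiration check that can run as a daily cron job (the hostname is a placeholder):

```bash
# Sketch: alert if the certificate served by the platform expires within 15 days.
echo | openssl s_client -connect <platform-host>:443 -servername <platform-host> 2>/dev/null \
  | openssl x509 -noout -checkend $((15*24*3600)) \
  || echo "TLS certificate for <platform-host> expires within 15 days"
```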
7. License Issue
Symptoms
- `Invalid or expired license` messages
- API returns 401 on login
Diagnosis
- Check the `QALITA_LICENSE_KEY` variable
- Verification test: `curl -H "Authorization: Bearer <token>" https://<registry>/v2/_catalog`
Immediate Actions
- Check the expiration date (contained in the JWT token)
- Extend via the portal or contact support
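A sketch for reading the expiration date, assuming `QALITA_LICENSE_KEY` is a standard JWT carrying an `exp` claim (adapt if the license token is obtained differently):

```bash
# Sketch: decode the JWT payload and print the "exp" claim as a date.
echo "$QALITA_LICENSE_KEY" | python3 -c '
import sys, json, base64, datetime
payload = sys.stdin.read().strip().split(".")[1]
payload += "=" * (-len(payload) % 4)          # restore base64url padding
claims = json.loads(base64.urlsafe_b64decode(payload))
print("license expires:", datetime.datetime.fromtimestamp(claims["exp"]))
'
```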
Recovery
- Redeploy the backend with the new license (if the variable is mounted via secret/env)
Postmortem
- Is renewal automated or monitored?
- Was an alert triggered in advance?
Automation Suggestions
Incident | Possible Automation |
---|---|
Agent offline | Cron job for qalita agent ping with alert |
Expired TLS cert | Script that checks certificates at D-30 |
PostgreSQL saturated | Prometheus alerts on pg_stat_activity |
PVC almost full | Alerting on disk usage via kubelet or metrics |
Tip: Classify Incidents in Grafana OnCall / Opsgenie
- Category: `backend`, `db`, `network`, `tls`, `storage`
- Priority: `P1` (blocking), `P2` (degraded), `P3` (minor)
- Responsible: `infra`, `dev`, `data`