
Designing a Database Backup Service

Authors
  • Siddharth Singh

Introduction

Backing up databases is a critical part of any organization's disaster recovery and business continuity plan. Several companies, such as Druva, Commvault, Veeam, and Cohesity, offer database backup as a service.

In this post, we'll look at what we need to consider when designing a robust, secure, and scalable database backup service. We'll cover requirements, high-level architecture, best practices, and common pitfalls, so you can build or evaluate a solution with confidence. This is not a detailed design for the service but a set of high-level notes to keep in mind.


Problem Statement

Design a database backup service.

This open-ended question is common in system design interviews and real-world architecture discussions. To answer it well, you need to clarify requirements, understand constraints, and propose a practical, extensible solution.

It's up to us to settle the scope of the problem by asking questions. If you have studied a similar problem before, it's tempting to jump straight into the design. However, that's not the right approach.

The interviewer may have a different view of the problem and different constraints in mind. So even if your design is sound for the problem you assumed, it may not be the answer the interviewer is looking for.


Clarifying Requirements

Calm your mind. The problem is vague and open-ended by design. You are not wasting your time if you are figuring out the problem scope. Just talk to the interviewer about the problem and understand the constraints they want to put in place to close the 'ends'.

Put yourself in the shoes of a dev lead having their first meeting, titled 'DB backup system requirement', with a product manager.

Before jumping into design, clarify:

  • What databases need backup? (Type, size, criticality)
  • When should backups run? (Frequency, scheduling)
  • How should backups be performed? (Full, incremental, differential)
  • How long should backups be retained? (Retention, compliance)
  • Why are backups needed? (Disaster recovery, compliance, accidental deletion)

Why Are Backups Needed?

Backups protect against:

  • Ransomware and malware attacks
  • Accidental data loss (e.g., dropped tables)
  • Hardware failures (disk, server)
  • Natural disasters (fire, flood)
  • Human error

Key Backup Terminology

  • Full Backup: Complete copy of the database.
  • Differential Backup: Changes since the last full backup.
  • Incremental Backup: Changes since the last backup (full or incremental).
  • Point-in-Time Recovery: Restore to a specific moment.
  • Disaster Recovery: Restore after catastrophic failure.
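
To make the difference between the backup types concrete, here is a minimal Python sketch of how a restore chain could be picked for a target time. The `Backup` class and `restore_chain` function are hypothetical names for illustration, and the sketch assumes that after each full backup a schedule uses either differentials or incrementals, not both.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    backup_id: str
    kind: str      # "full", "differential", or "incremental"
    taken_at: int  # any increasing timestamp, e.g. epoch seconds


def restore_chain(backups: List[Backup], target_time: int) -> List[Backup]:
    """Return the backups to apply, in order, to restore to target_time.

    Assumption: after a full backup, the schedule uses either
    differentials or incrementals, never a mix of the two.
    """
    candidates = sorted(
        (b for b in backups if b.taken_at <= target_time),
        key=lambda b: b.taken_at,
    )
    fulls = [b for b in candidates if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")

    base = fulls[-1]  # most recent full backup
    later = [b for b in candidates if b.taken_at > base.taken_at]
    differentials = [b for b in later if b.kind == "differential"]
    incrementals = [b for b in later if b.kind == "incremental"]

    if differentials:
        # A differential holds all changes since the full: only the latest is needed.
        return [base, differentials[-1]]
    # Incrementals each hold changes since the previous backup: all are needed.
    return [base] + incrementals
```

The trade-off shows up in the return values: a differential restore needs two backups at most, while an incremental restore needs every link in the chain, so one missing incremental breaks it.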

Functional & Non-Functional Requirements

Functional

  1. Enroll databases for backup.
  2. Schedule and execute backups.
  3. Restore databases from backup.
  4. Configurable backup policies.

Non-Functional

  1. Durability: Backups, once stored, must not be lost.
  2. Security: Encryption at rest and in transit.
  3. Scalability: Support many databases and large data volumes.
  4. Reliability: Backups/restores must work when needed.
  5. Auditability: Track all operations.

Example Backup Policy

A backup policy defines how and when backups occur. Here’s a sample in YAML:

database_id: db-prod-1
schedule: "0 2 * * *" # Daily at 2 AM
backup_type: full
retention_days: 30
encryption: true
notify_on_failure: ["admin@domain.com"]
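
A policy like this should be validated before the service accepts it. Below is a small Python sketch of what that check could look like. The `BackupPolicy` class and its rules are assumptions for illustration; the field names mirror the sample above, and the cron check only counts fields rather than fully parsing the expression.

```python
from dataclasses import dataclass, field
from typing import List

VALID_BACKUP_TYPES = {"full", "incremental", "differential"}


@dataclass
class BackupPolicy:
    database_id: str
    schedule: str  # 5-field cron expression, e.g. "0 2 * * *"
    backup_type: str
    retention_days: int
    encryption: bool = True
    notify_on_failure: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the policy is usable."""
        errors = []
        if not self.database_id:
            errors.append("database_id is required")
        if len(self.schedule.split()) != 5:
            errors.append("schedule must be a 5-field cron expression")
        if self.backup_type not in VALID_BACKUP_TYPES:
            errors.append("backup_type must be full, incremental, or differential")
        if self.retention_days <= 0:
            errors.append("retention_days must be positive")
        return errors


policy = BackupPolicy(
    database_id="db-prod-1",
    schedule="0 2 * * *",
    backup_type="full",
    retention_days=30,
    encryption=True,
    notify_on_failure=["admin@domain.com"],
)
```

Returning a list of errors instead of raising on the first one lets an API report every problem in a single response.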

High-Level Architecture

Components

  • Agent: Runs on the DB server, performs backups, uploads to storage.
  • Backup Service: Manages agents, stores metadata, exposes APIs.
  • Storage: Blob store for backup files (e.g., S3, Azure Blob).
  • Metadata DB: Stores backup schedules, status, and policies.

Architecture Diagram

[Figure: Basic diagram]
[Figure: Service diagram]

User Flow

  1. Database Enrollment: Register DB via API.
  2. Policy Configuration: Set backup frequency, retention, etc.
  3. Backup Execution: Agent performs backup, uploads to storage.
  4. Metadata Update: Status and details recorded.
  5. Restore Request: User requests restore; service provides secure download or triggers agent.
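
The enrollment, metadata update, and restore lookup steps above can be sketched against a tiny metadata store. This example uses SQLite only to stay self-contained; the table layout, column names, and status values are assumptions, not a prescribed schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE databases (
    database_id TEXT PRIMARY KEY,
    engine      TEXT NOT NULL
);
CREATE TABLE backups (
    backup_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    database_id TEXT NOT NULL REFERENCES databases(database_id),
    backup_type TEXT NOT NULL,
    status      TEXT NOT NULL,   -- e.g. 'succeeded' or 'failed'
    storage_key TEXT,            -- location of the backup in blob storage
    finished_at TEXT             -- ISO 8601 UTC timestamp, sorts as text
);
"""


def open_metadata_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def enroll(conn, database_id, engine):
    """Step 1: register a database for backup."""
    conn.execute("INSERT INTO databases VALUES (?, ?)", (database_id, engine))


def record_backup(conn, database_id, backup_type, status, storage_key, finished_at):
    """Step 4: record the outcome of a backup run."""
    cur = conn.execute(
        "INSERT INTO backups (database_id, backup_type, status, storage_key, finished_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (database_id, backup_type, status, storage_key, finished_at),
    )
    return cur.lastrowid


def latest_successful(conn, database_id):
    """Step 5: find the newest usable backup for a restore request."""
    row = conn.execute(
        "SELECT storage_key FROM backups "
        "WHERE database_id = ? AND status = 'succeeded' "
        "ORDER BY finished_at DESC LIMIT 1",
        (database_id,),
    ).fetchone()
    return row[0] if row else None
```

In a real service the metadata DB would be a replicated database and would itself need a backup, since losing it means losing the map to every stored backup.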

Failure Scenarios & Handling

  • Backup Failure: Agent retries; alerts are sent after repeated failures.
  • Network Outage: Agent retries upload; backs off with exponential delay.
  • Storage Full: Alert admins; pause new backups until resolved.
  • Corrupted Backup: Validate backups post-upload; mark as invalid if corruption detected.
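
The retry behavior for uploads can be sketched as a small helper. `retry_with_backoff` is a hypothetical name, and the defaults (5 attempts, 1 second base delay, 60 second cap) are assumptions to tune for your environment.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, retry_on=(OSError,), sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...) up to
    max_delay. The last failure is re-raised so the caller can alert.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # repeated failures: surface to alerting
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. In production you would likely also add random jitter so that many agents recovering from the same outage do not retry in lockstep.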

Monitoring & Alerting

  • Track backup/restore job status.
  • Alert on failures, missed schedules, or storage issues.
  • Expose metrics (success rate, duration, storage used) for dashboards.
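
Two of these signals are simple to compute from job history. The sketch below assumes job statuses are stored as strings such as "succeeded", and that a schedule counts as missed once the last success is older than the backup interval plus a grace period; both function names and the 30-minute default are illustrative.

```python
from datetime import datetime, timedelta


def success_rate(statuses):
    """Fraction of jobs that succeeded, or None if there is no history yet."""
    if not statuses:
        return None
    return sum(1 for s in statuses if s == "succeeded") / len(statuses)


def is_schedule_missed(last_success, interval, now, grace=timedelta(minutes=30)):
    """True if no backup has succeeded within one interval plus the grace period."""
    return now - last_success > interval + grace
```

Alerting on the absence of a success, rather than only on reported failures, also catches the case where the agent is down and reports nothing at all.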

Restore Workflow

  1. User selects backup (by date, type, etc.).
  2. Service validates backup integrity.
  3. Provides secure download link or triggers agent to restore.
  4. Optionally supports partial/table-level restores.
  5. Logs and audits all restore actions.
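
For the integrity check in step 2, one common approach is to compare a checksum recorded at backup time with one computed just before the restore. Here is a minimal sketch, assuming the metadata DB stores a SHA-256 hex digest for each backup file; the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so large backups never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_sha256):
    """True if the file on disk matches the checksum recorded at backup time."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

A matching checksum only shows that the bytes are unchanged. It does not prove that the database can actually be restored from them, which is why periodic test restores still matter.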

Security Considerations

  • Air-gapped Storage: Store copies in isolated networks for ransomware protection.
  • Encryption: Use strong encryption for data at rest and in transit.
  • Authentication & Authorization: Only authorized users/agents can access/restore.
  • Audit Logging: Record all backup and restore operations.

Compliance & Retention

  • Support configurable retention policies (e.g., 7, 30, 365 days).
  • Enable legal hold for compliance (GDPR, HIPAA, etc.).
  • Provide audit trails for all operations.
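
Retention and legal hold interact: a backup past its retention period must still be kept while it is under hold. A minimal sketch of that selection logic follows, with `StoredBackup` and `expired_backups` as hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoredBackup:
    backup_id: str
    taken_at: datetime
    legal_hold: bool = False


def expired_backups(backups, retention_days, now):
    """Return backups older than the retention window that are not on legal hold."""
    cutoff = now - timedelta(days=retention_days)
    return [b for b in backups if b.taken_at < cutoff and not b.legal_hold]
```

This sketch treats each backup independently. With incremental or differential backups, a real implementation would also need to keep any full backup that newer, unexpired backups still depend on.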

Performance Optimization

  • Schedule backups during off-peak hours.
  • Use database-native snapshot features if available.
  • Compress and deduplicate backups to save space and bandwidth.
  • Throttle backup jobs to minimize impact on production workloads.
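
Compression and deduplication can be combined by splitting a backup into chunks, storing each distinct chunk once under its hash, and keeping a manifest of hashes per backup. The sketch below uses fixed-size chunks and an in-memory dict as the chunk store, both simplifying assumptions; production tools often use content-defined chunking and a blob store instead.

```python
import gzip
import hashlib


def store_chunks(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Split data into chunks, store each unique chunk once (compressed),
    and return the manifest: the ordered list of chunk hashes."""
    manifest = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:  # identical chunks are stored only once
            store[key] = gzip.compress(chunk)
        manifest.append(key)
    return manifest


def restore_chunks(manifest: list, store: dict) -> bytes:
    """Rebuild the original bytes from a manifest and the chunk store."""
    return b"".join(gzip.decompress(store[key]) for key in manifest)
```

Because the store is shared across backups, a second backup of mostly unchanged data adds only the chunks that differ.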

Cost Considerations

  • Use lifecycle policies to delete old backups and control storage costs.
  • Compress backups to reduce storage and transfer costs.
  • Monitor bandwidth usage for uploads/downloads.

Technology Choices

  • Agents: Custom scripts, open-source tools (e.g., pgBackRest, Percona XtraBackup), or vendor solutions.
  • Storage: AWS S3, Azure Blob, Google Cloud Storage, on-premises object storage.
  • Orchestration: Kubernetes CronJobs, managed schedulers, or custom job runners.
  • Monitoring: Prometheus, Grafana, cloud-native monitoring.

Common Pitfalls

  • Not testing restores regularly.
  • Not encrypting backups.
  • Ignoring storage limits or costs.
  • Failing to monitor backup health.
  • Not updating backup policies as requirements change.

Extending the Design

  • Backup Compression & Deduplication: Save space and costs.
  • Cross-Region Replication: Improve disaster recovery.
  • Self-Service Portal: Let users manage backups/restores.
  • Integration with Monitoring/Alerting: Automated notifications and dashboards.

Conclusion

Designing a database backup service is about more than just copying files. You need to automate, secure, monitor, and regularly test your backups and restores. By considering failure scenarios, compliance, cost, and user experience, you can build a solution that protects your data and your business.


What would you add or change in this design? Let me know in the comments!