Configuring Data Retention

Overview

This document outlines LangDB's data retention strategy for tracing information stored in ClickHouse. The strategy uses materialized views to apply retention periods that depend on each user's subscription tier. Data eviction relies on ClickHouse's TTL (Time-To-Live) mechanism and its background processes:

TTL Definitions: Each table includes TTL expressions that specify when data should expire based on timestamp fields
Background Merge Process: ClickHouse automatically runs background processes that merge data parts and remove expired data during these merge operations
Resource-Efficient: Eviction runs asynchronously in the background as part of regular merges, which limits its impact on query performance
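
As a rough illustration of how expiry interacts with merges, the statements below show ways to inspect and trigger TTL cleanup on one of the tier tables. This is a sketch, not LangDB's documented operational procedure. The table name `langdb.traces_professional` comes from the tier definitions in this document, and the one-hour setting value is only an example.

```sql
-- Show the stored table definition, including its TTL clause.
SELECT name, engine_full
FROM system.tables
WHERE database = 'langdb' AND name = 'traces_professional';

-- merge_with_ttl_timeout is the minimum delay in seconds before a merge with
-- delete TTL is repeated. 3600 is an illustrative value, not a recommendation.
ALTER TABLE langdb.traces_professional
    MODIFY SETTING merge_with_ttl_timeout = 3600;

-- Force an unscheduled merge so expired rows are dropped now instead of later.
-- This rewrites data parts and can be expensive on large tables.
OPTIMIZE TABLE langdb.traces_professional FINAL;
```

Because expired rows are only removed when a merge touches their part, a table can briefly hold data older than its TTL. The forced merge is mainly useful for testing a retention rule.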

Tracing Data Architecture

LangDB stores and analyzes trace data in three layers:

Primary Storage: All trace data is initially stored in the langdb.traces table in ClickHouse
Materialized Views: Tier-specific materialized views filter and retain data based on user subscription levels
Retention Policies: TTL expressions on each tier table enforce its retention period (30 days for Professional, 90 days for Enterprise)

Implementation using Materialized Views

Tier-Specific Materialized Views

Professional Tier View

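-- Create the target table first: a materialized view with a TO clause
-- needs its target table to exist.
CREATE TABLE langdb.traces_professional (
    /* Same structure as base table */
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
TTL timestamp + toIntervalDay(30); -- rows expire 30 days after their timestamp

-- Copy each row newly inserted into langdb.traces into the tier table.
CREATE MATERIALIZED VIEW langdb.traces_professional_mv
TO langdb.traces_professional
AS SELECT *
FROM langdb.traces;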

Enterprise Tier View

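-- Create the target table first: a materialized view with a TO clause
-- needs its target table to exist.
CREATE TABLE langdb.traces_enterprise (
    /* Same structure as base table */
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
TTL timestamp + toIntervalDay(90); -- rows expire 90 days after their timestamp

-- Copy each row newly inserted into langdb.traces into the tier table.
CREATE MATERIALIZED VIEW langdb.traces_enterprise_mv
TO langdb.traces_enterprise
AS SELECT *
FROM langdb.traces;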
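
The view definitions above select every row from `langdb.traces`. If each trace row carried its owner's tier, the view could restrict what it copies. The following is a minimal sketch under that assumption: the `subscription_tier` column and the `'professional'` value are hypothetical and do not appear in the schema shown in this document.

```sql
-- Hypothetical variant of the Professional view: assumes langdb.traces
-- has a subscription_tier column. It would replace the unfiltered definition.
CREATE MATERIALIZED VIEW langdb.traces_professional_mv
TO langdb.traces_professional
AS SELECT *
FROM langdb.traces
WHERE subscription_tier = 'professional';

-- A materialized view only processes rows inserted after it is created,
-- so rows already in the base table need a one-off backfill.
INSERT INTO langdb.traces_professional
SELECT *
FROM langdb.traces
WHERE subscription_tier = 'professional'
  AND timestamp >= now() - INTERVAL 30 DAY;
```

The backfill limits itself to the last 30 days so that it does not copy rows the tier table's TTL would remove again.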

Data Access Flow

New trace data is inserted into the base langdb.traces table
Materialized views automatically filter and copy relevant data to tier-specific tables
TTL mechanisms automatically remove data older than the specified retention period
Data access APIs query the appropriate table based on the user's subscription tier
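
Because TTL cleanup happens during background merges, rows past their retention period can remain readable for a while. A read path that must honor the retention boundary exactly can add its own time filter. Below is a sketch for a Professional-tier user. The `user_id` value is a placeholder, and its string type is an assumption, since the base table's schema is not shown here.

```sql
SELECT *
FROM langdb.traces_professional
WHERE user_id = 'example-user'                -- placeholder value, assumed String
  AND timestamp >= now() - INTERVAL 30 DAY    -- mirrors the table's 30-day TTL
ORDER BY timestamp DESC
LIMIT 100;
```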

Benefits of This Approach

Efficiency: Data is stored only for as long as each customer tier requires
Performance: Queries run against smaller, tier-specific tables rather than the entire dataset
Compliance: Clear retention boundaries help with regulatory compliance
Cost-Effective: Optimizes storage costs by aligning retention with customer value

Backup and Disaster Recovery

While the retention strategy focuses on operational access to trace data, a separate backup strategy ensures data can be recovered in case of system failures:

Daily snapshots of ClickHouse data
Backup retention aligned with the longest tier retention period (365 days)
Geo-redundant storage of backups
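
This document does not say how the snapshots are taken. One possible approach, shown only as a sketch, is ClickHouse's built-in `BACKUP` and `RESTORE` statements. The disk name `backups`, the archive file name, and the restored table name are assumptions. The disk must also be defined and allowed as a backup destination in the server configuration.

```sql
-- Write a snapshot of one tier table to a preconfigured backup disk.
BACKUP TABLE langdb.traces_enterprise
    TO Disk('backups', 'traces_enterprise.zip');

-- Restore the snapshot under a different name so it can be checked
-- without touching the live table.
RESTORE TABLE langdb.traces_enterprise AS langdb.traces_enterprise_restored
    FROM Disk('backups', 'traces_enterprise.zip');
```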

Monitoring and Management

The retention system includes:

Monitoring dashboards for data volume by tier
Alerts for unexpected growth or retention failures
Regular audits to ensure compliance with retention policies
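
The queries below sketch the kind of checks such dashboards and audits could run. They rely only on ClickHouse system tables and the table names used in this document. They are illustrative and not LangDB's actual monitoring queries.

```sql
-- Active rows and on-disk size for each table in the langdb database.
SELECT
    table,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = 'langdb' AND active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;

-- Age of the oldest row in a tier table. A value well above the tier's
-- retention period (30 days here) suggests TTL cleanup is lagging.
SELECT
    min(timestamp) AS oldest_row,
    dateDiff('day', min(timestamp), now()) AS oldest_age_days
FROM langdb.traces_professional;
```

A small overshoot in the second query is expected, since expired rows are removed only when a merge processes their data part.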

Future Enhancements

Implementation of custom retention periods for specific enterprise customers
Cold storage options for extended archival needs
Advanced sampling techniques to retain representative trace data beyond standard periods
