Configuring Data Retention
Overview
This document outlines LangDB's data retention strategy for tracing information stored in ClickHouse. The strategy uses materialized views to apply retention periods that depend on the user's subscription tier. Data eviction relies on ClickHouse's TTL (Time-To-Live) mechanism and its background merge processes:
- TTL Definitions: Each table includes TTL expressions that specify when data should expire based on timestamp fields
- Background Merge Process: ClickHouse automatically runs background processes that merge data parts and remove expired data during these merge operations
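- Resource-Efficient: Eviction runs asynchronously in the background rather than at query time, which limits its impact on query performance. Expired rows can therefore remain on disk for a while after their TTL has passed, until the next merge removes them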
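As a hedged illustration of how these mechanisms can be inspected or nudged, the statements below use standard ClickHouse commands against the langdb.traces_professional table defined later in this document. The one-hour value is an arbitrary example, not a documented LangDB setting.

```sql
-- Minimum delay, in seconds, before ClickHouse repeats a merge that removes
-- expired rows (MergeTree setting; 3600 is an illustrative value only).
ALTER TABLE langdb.traces_professional
    MODIFY SETTING merge_with_ttl_timeout = 3600;

-- Re-evaluate the table's TTL for existing data parts,
-- for example after the TTL expression has been changed.
ALTER TABLE langdb.traces_professional MATERIALIZE TTL;

-- Force an unscheduled merge so that expired rows are dropped now.
-- This is expensive on large tables and is meant for testing, not routine use.
OPTIMIZE TABLE langdb.traces_professional FINAL;
```

In normal operation none of these are required: the background merges apply the TTL on their own schedule.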
Tracing Data Architecture
LangDB stores and analyzes trace data using the following components:
- Primary Storage: All trace data is initially stored in the langdb.traces table in ClickHouse
- Materialized Views: Tier-specific materialized views filter and retain data based on user subscription levels
- Retention Policies: Automated TTL (Time-To-Live) mechanisms enforce retention periods
Implementation using Materialized Views
Tier-Specific Materialized Views
Professional Tier View
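-- Target table: created first, because the view's TO table must already exist.
CREATE TABLE langdb.traces_professional
AS langdb.traces  -- same column structure as the base table
ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
TTL timestamp + toIntervalDay(30);  -- rows expire 30 days after their timestamp

-- View: copies rows newly inserted into the base table to the target table.
-- As written, every inserted row is copied; a tier-specific WHERE clause would go here.
CREATE MATERIALIZED VIEW langdb.traces_professional_mv
TO langdb.traces_professional
AS SELECT *
FROM langdb.traces;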
Enterprise Tier View
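-- Target table: created first, because the view's TO table must already exist.
CREATE TABLE langdb.traces_enterprise
AS langdb.traces  -- same column structure as the base table
ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
TTL timestamp + toIntervalDay(90);  -- rows expire 90 days after their timestamp

-- View: copies rows newly inserted into the base table to the target table.
-- As written, every inserted row is copied; a tier-specific WHERE clause would go here.
CREATE MATERIALIZED VIEW langdb.traces_enterprise_mv
TO langdb.traces_enterprise
AS SELECT *
FROM langdb.traces;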
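A materialized view with a TO clause only processes rows inserted after the view is created, so rows already present in langdb.traces are not copied automatically. A minimal backfill sketch for the Professional tier, assuming timestamp is a DateTime column and that a one-off copy is acceptable:

```sql
-- One-off backfill of rows that predate the materialized view.
-- Only rows still inside the 30-day retention window are copied.
INSERT INTO langdb.traces_professional
SELECT *
FROM langdb.traces
WHERE timestamp >= now() - INTERVAL 30 DAY;
```

If this runs after the view already exists, rows inserted in the meantime may be copied twice, so in practice the WHERE clause would likely also need an upper bound at the view's creation time.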
Data Access Flow
- New trace data is inserted into the base langdb.traces table
- Materialized views automatically filter and copy relevant data to tier-specific tables
- TTL mechanisms automatically remove data older than the specified retention period
- Data access APIs query the appropriate table based on the user's subscription tier
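The last step of this flow might look like the query below for a Professional tier user. This is a sketch, not the actual LangDB API query: the String type of user_id, the parameter name, and the row limit are assumptions. The explicit timestamp filter is included because TTL removal happens during background merges, so expired rows can briefly remain readable.

```sql
-- Read recent traces for one user from the Professional tier table.
-- {user_id:String} is a ClickHouse query parameter; the String type is assumed.
SELECT *
FROM langdb.traces_professional
WHERE user_id = {user_id:String}
  AND timestamp >= now() - INTERVAL 30 DAY  -- enforce the retention window at read time
ORDER BY timestamp DESC
LIMIT 100;
```

An Enterprise tier request would use langdb.traces_enterprise with a 90-day window instead.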
Benefits of This Approach
- Efficiency: Data is stored only for the period required by the customer's tier
- Performance: Queries run against smaller, tier-specific tables rather than the entire dataset
- Compliance: Clear retention boundaries help with regulatory compliance
- Cost-Effective: Optimizes storage costs by aligning retention with customer value
Backup and Disaster Recovery
While the retention strategy focuses on operational access to trace data, a separate backup strategy ensures data can be recovered in case of system failures:
- Daily snapshots of ClickHouse data
- Backup retention of 365 days, which covers the longest tier retention period
- Geo-redundant storage of backups
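This document does not say which tool produces the daily snapshots. If ClickHouse's built-in BACKUP and RESTORE statements were used, a snapshot and a recovery could be sketched as follows. The disk name 'backups' and the archive name are placeholders, and the disk must be declared as an allowed backup destination in the server configuration.

```sql
-- Back up the whole langdb database to a pre-configured backup disk.
-- 'backups' and 'langdb-daily.zip' are placeholder names.
BACKUP DATABASE langdb TO Disk('backups', 'langdb-daily.zip');

-- Restore a single table from that archive after a failure.
RESTORE TABLE langdb.traces_enterprise FROM Disk('backups', 'langdb-daily.zip');
```

Geo-redundancy would then depend on where the backup disk's storage is replicated, which is outside the scope of these statements.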
Monitoring and Management
The retention system includes:
- Monitoring dashboards for data volume by tier
- Alerts for unexpected growth or retention failures
- Regular audits to ensure compliance with retention policies
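The dashboards and alerts above could be fed by queries along these lines. They are illustrative only: the one-day grace period is an assumed threshold, and timestamp is assumed to be a DateTime column.

```sql
-- Data volume per trace table, computed from active data parts.
SELECT
    table,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = 'langdb'
  AND table LIKE 'traces%'
  AND active
GROUP BY table
ORDER BY table;

-- Rows that outlived the 30-day Professional retention by more than one day.
-- A persistently non-zero count would suggest that TTL merges are falling behind.
SELECT count() AS overdue_rows
FROM langdb.traces_professional
WHERE timestamp < now() - INTERVAL 31 DAY;
```

The same overdue-row check applies to langdb.traces_enterprise with a 91-day cutoff.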
Future Enhancements
- Implementation of custom retention periods for specific enterprise customers
- Cold storage options for extended archival needs
- Advanced sampling techniques to retain representative trace data beyond standard periods
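As one possible direction for the cold storage item, ClickHouse TTL rules can move data to another volume before deleting it. The sketch below is hypothetical: it assumes the table uses a storage policy that contains a volume named 'cold', and the 30-day move threshold is an invented example rather than a planned value.

```sql
-- Move rows to a cheaper volume after 30 days, then delete them at 90 days.
-- The 'cold' volume must exist in the table's storage policy.
ALTER TABLE langdb.traces_enterprise
    MODIFY TTL
        timestamp + toIntervalDay(30) TO VOLUME 'cold',
        timestamp + toIntervalDay(90) DELETE;
```

A custom retention period for a specific customer could follow the same pattern, with a dedicated table whose DELETE interval differs from the tier default.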