Technical Deep Dive | 9 min read

Beyond CAEP: Building Continuous Access Evaluation That Scales

CAEP defines the protocol, but building continuous access evaluation that works at enterprise scale requires more — event correlation, decision caching, and graceful degradation. Lessons from processing 10 billion access decisions.

TigerIdentity Team

In January, we published an introduction to CAEP and the Shared Signals Framework. The response was overwhelming — but the most common follow-up question was: "How do you actually build this at scale?"

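CAEP (the Continuous Access Evaluation Profile, originally proposed as the Continuous Access Evaluation Protocol) defines the security events, such as a session revocation or credential change, that tell a relying party to re-evaluate a session. The Shared Signals Framework standardizes how those events are packaged as Security Event Tokens and delivered between systems. Together, they're the foundation.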

But the protocol alone doesn't solve the engineering challenges of evaluating millions of access decisions per second, correlating signals from dozens of sources, and maintaining sub-50ms p95 latency at 99.99% availability. This article covers what we've learned building TigerIdentity's evaluation engine to handle 10 billion+ access decisions.

TigerIdentity Decision Engine

  • 10B+ access decisions processed (across all deployments)
  • <50ms p95 evaluation latency (including policy evaluation)
  • 99.99% availability target (with graceful degradation)
  • 500K+ events/second peak (signal processing throughput)

What CAEP Doesn't Cover

CAEP is a protocol, not an architecture. It defines how to transmit events between systems but leaves the hard engineering problems to the implementer. Here are the four challenges we had to solve beyond the spec:

Event Correlation Across Sources

A single identity threat involves signals from multiple sources — your IdP reports a suspicious login, your EDR flags malware, your SIEM detects data exfiltration. Each signal alone might be low-confidence. Together, they're a confirmed incident.

Challenge: Correlating events across time windows, matching identities across systems with different naming conventions, and scoring multi-source signals without generating false positives.
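
One part of this challenge, matching identities across systems with different naming conventions, can be sketched as a canonicalization step that runs before correlation. The rules below (lowercasing, stripping a `DOMAIN\` prefix or `@domain` suffix) and the `alias_map` parameter are illustrative assumptions, not TigerIdentity's actual matching logic.

```python
from typing import Optional


def canonical_identity(raw: str, alias_map: Optional[dict] = None) -> str:
    """Map a source-specific identifier to one canonical key (hypothetical rules)."""
    ident = raw.strip().lower()
    if "\\" in ident:            # Windows-style DOMAIN\user
        ident = ident.split("\\", 1)[1]
    if "@" in ident:             # email or UPN style user@domain
        ident = ident.split("@", 1)[0]
    # Explicit aliases cover cases no rule can infer (e.g. jsmith2 -> jsmith).
    return (alias_map or {}).get(ident, ident)


events = [
    {"source": "idp", "user": "Alice.Ng@example.com"},
    {"source": "edr", "user": "CORP\\alice.ng"},
    {"source": "siem", "user": "alice.ng"},
]

# Group the reporting sources under the canonical identity.
by_identity = {}
for event in events:
    by_identity.setdefault(canonical_identity(event["user"]), []).append(event["source"])
```

Dropping the domain is only safe when usernames are unique across domains; a multi-domain deployment would more likely keep the domain in the key and lean on the alias map.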

Decision Caching and Consistency

What happens when the policy engine is unreachable for 500ms? Do you fail open (security risk) or fail closed (availability risk)? Caching decisions helps, but cached decisions go stale, especially during active threats.

Challenge: Cache invalidation when policies or context change. Tiered caching strategies based on resource sensitivity. Consistency guarantees across distributed decision points.
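
A minimal sketch of a sensitivity-tiered decision cache, assuming an in-process dictionary and the TTL values from the configuration shown in this article. The class name, method names, and the per-subject invalidation hook are hypothetical, and a distributed deployment would also have to propagate invalidations between decision points.

```python
import time
from typing import Callable, Optional

# TTLs in seconds per sensitivity level; 0 means "never cache".
TTL_BY_SENSITIVITY = {"public": 300, "internal": 120, "confidential": 0, "restricted": 0}


class DecisionCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._entries = {}   # (subject, resource) -> (allowed, expires_at)

    def get(self, subject: str, resource: str) -> Optional[bool]:
        entry = self._entries.get((subject, resource))
        if entry is None:
            return None
        allowed, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[(subject, resource)]   # stale: force re-evaluation
            return None
        return allowed

    def put(self, subject: str, resource: str, sensitivity: str, allowed: bool) -> None:
        ttl = TTL_BY_SENSITIVITY[sensitivity]
        if ttl > 0:                                   # sensitive tiers are never cached
            self._entries[(subject, resource)] = (allowed, self._clock() + ttl)

    def invalidate_subject(self, subject: str) -> None:
        """Drop every cached decision for a subject, e.g. when its risk score changes."""
        for key in [k for k in self._entries if k[0] == subject]:
            del self._entries[key]
```

Injecting the clock keeps expiry testable without sleeping, and invalidating by subject is the simplest way to make sure a risk-score change is not masked by a still-valid TTL.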

Graceful Degradation

Not all resources need the same evaluation rigor. A request to view a public wiki page can use a cached decision. A request to delete production data must be evaluated in real-time, every time.

Challenge: Defining degradation tiers, implementing circuit breakers that activate per-resource-sensitivity, and ensuring degraded mode doesn't become a security bypass.

Multi-Tenant Isolation

One tenant's event storm shouldn't degrade another tenant's evaluation latency. A security incident at Company A — generating thousands of events per second — must not affect Company B's access decisions.

Challenge: Per-tenant rate limiting, resource isolation, and priority queuing without over-provisioning.
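
Per-tenant rate limiting is commonly built on a token bucket per tenant. The sketch below is one way to do that in a single process; the class name, rate, and burst values are assumptions for illustration, and it covers neither priority queuing nor cross-node coordination.

```python
import time
from typing import Callable


class TenantRateLimiter:
    """One token bucket per tenant, so a burst from one tenant cannot drain another's budget."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate, self._burst, self._clock = rate_per_sec, burst, clock
        self._buckets = {}   # tenant -> (tokens, last_seen)

    def allow(self, tenant: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(tenant, (self._burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        if tokens >= 1.0:
            self._buckets[tenant] = (tokens - 1.0, now)
            return True
        self._buckets[tenant] = (tokens, now)
        return False
```

With `rate_per_sec=1.0` and `burst=3.0`, a tenant sending five events at once has the last two rejected, while a second tenant's first event is still admitted.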

Architecture for Scale

TigerIdentity's evaluation architecture is designed around four key layers, each optimized for its specific role in the decision pipeline:

Event Ingestion Layer

NATS JetStream provides the messaging backbone with at-least-once delivery guarantees. Events from IdPs, EDR agents, SIEMs, and HR systems flow into topic-based streams with per-tenant partitioning.

Key design decision: At-least-once over exactly-once. Handling a duplicate event is idempotent (same signal, same score), but missing an event means missing a threat. We chose reliability over deduplication complexity.
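
At-least-once delivery only stays safe if every handler tolerates redelivery. Where a handler is not naturally idempotent (for example, one that adds to a running score), a common workaround is to remember recently seen event IDs. The sketch below assumes each event carries a unique `id` field; that field, the class name, and the additive scoring are hypothetical and not described in this article.

```python
from collections import OrderedDict


class IdempotentConsumer:
    """Apply each event at most once, even if the broker redelivers it."""

    def __init__(self, max_tracked: int = 100_000):
        self._seen = OrderedDict()        # event ID -> None, oldest first
        self._max_tracked = max_tracked
        self.risk_by_user = {}

    def handle(self, event: dict) -> bool:
        event_id = event["id"]            # assumed unique per event
        if event_id in self._seen:
            return False                  # redelivery: skip, state is unchanged
        self._seen[event_id] = None
        if len(self._seen) > self._max_tracked:
            self._seen.popitem(last=False)    # evict the oldest tracked ID
        user = event["user"]
        self.risk_by_user[user] = self.risk_by_user.get(user, 0) + event["score"]
        return True
```

The bounded ID set trades memory for a finite deduplication horizon: a duplicate that arrives after its ID has been evicted would be applied again.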

Decision Engine

Policies are compiled from YAML DSL to executable rules at deploy time — not interpreted at evaluation time. The engine runs entirely in memory with pre-indexed policy lookups. Cache hits resolve in under 5ms.

Key design decision: Compile policies to Go structs rather than interpreting rules at evaluation time. This is 10x faster than runtime interpretation and eliminates an entire class of parsing bugs.
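
The engine itself compiles to Go structs; the Python sketch below illustrates the same idea with closures. Operators and attribute names are resolved once when a policy is loaded, so evaluation does no parsing. The policy shape, operator names, and function names are hypothetical, since the YAML DSL is not shown in this article.

```python
import operator
from typing import Callable

OPS = {"eq": operator.eq, "lt": operator.lt, "gte": operator.ge}

# Hypothetical parsed-policy shape (what a YAML loader might hand over).
RAW_POLICIES = [
    {"resource": "prod-db", "action": "delete",
     "conditions": [{"attr": "role", "op": "eq", "value": "dba"},
                    {"attr": "risk", "op": "lt", "value": 50}]},
]

Rule = Callable[[dict], bool]


def compile_policy(policy: dict) -> Rule:
    """Resolve operators and attribute names once, at deploy time."""
    # An unknown operator raises KeyError here, not during an access check.
    checks = [(c["attr"], OPS[c["op"]], c["value"]) for c in policy["conditions"]]

    def rule(context: dict) -> bool:
        return all(attr in context and op(context[attr], value)
                   for attr, op, value in checks)
    return rule


# Pre-indexed lookup: (resource, action) -> compiled rule.
INDEX = {(p["resource"], p["action"]): compile_policy(p) for p in RAW_POLICIES}


def evaluate(resource: str, action: str, context: dict) -> bool:
    rule = INDEX.get((resource, action))
    return rule(context) if rule else False   # default deny when no policy matches
```

Because malformed policies fail during compilation, a bad deploy is rejected before it can affect a live access decision.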

Signal Correlation

Sliding window analysis correlates signals from multiple sources into composite risk scores. A suspicious login alone might score 30. Add a concurrent EDR alert and it scores 85. Requiring agreement across sources reduces false positives.

Key design decision: 5-minute sliding windows with exponential decay. Recent signals weighted higher than older ones. Window size configurable per tenant.
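
A sketch of decayed scoring over a sliding window. The 5-minute window comes from the design decision above; the 60-second half-life, the additive combination, and the cap at 100 are assumptions chosen so that a login signal (30) plus a concurrent EDR alert (here weighted 55) reproduces the 85 from the example.

```python
import math

WINDOW_SECONDS = 300.0      # 5-minute sliding window
HALF_LIFE_SECONDS = 60.0    # assumed decay rate; the article does not state one


def composite_risk(signals: list, now: float) -> float:
    """signals: (timestamp, base_score) pairs for one identity. Returns 0..100."""
    total = 0.0
    for timestamp, base_score in signals:
        age = now - timestamp
        if age < 0 or age > WINDOW_SECONDS:
            continue                                   # outside the window
        # Each signal loses half its weight every HALF_LIFE_SECONDS.
        total += base_score * math.pow(0.5, age / HALF_LIFE_SECONDS)
    return min(100.0, total)
```

Under these assumptions a lone login signal falls from 30 to 15 after one minute and drops out entirely once it is older than the window.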

Session Management

Distributed session store backed by Redis, with revocation pushed to relying parties rather than polled. When a session is revoked, CAEP events are published to all connected relying parties within 2 seconds.

Key design decision: Redis Pub/Sub for revocation fanout, with persistent storage for audit. Revocation is fire-and-forget from the decision engine's perspective.
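
The fire-and-forget pattern can be sketched with in-memory queues standing in for pub/sub channels and a list standing in for persistent audit storage. Nothing below is a Redis API; the class, method names, and event fields are hypothetical.

```python
import queue


class RevocationFanout:
    """In-memory stand-in for pub/sub fanout plus a durable audit log."""

    def __init__(self):
        self._subscribers = {}            # relying party -> bounded inbox
        self.audit_log = []               # stand-in for persistent storage

    def subscribe(self, relying_party: str) -> queue.Queue:
        return self._subscribers.setdefault(relying_party, queue.Queue(maxsize=1000))

    def revoke(self, session_id: str, reason: str) -> int:
        event = {"type": "session-revoked", "session_id": session_id, "reason": reason}
        self.audit_log.append(event)      # record first, so the audit survives lost deliveries
        delivered = 0
        for inbox in self._subscribers.values():
            try:
                inbox.put_nowait(event)   # never block the decision engine
                delivered += 1
            except queue.Full:
                pass                      # slow subscriber: dropped here, still in the audit log
        return delivered
```

Writing the audit record before fanout is what makes best-effort delivery tolerable: a relying party that missed a push can still be reconciled from the log.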

Patterns That Work

Tiered Evaluation

Not every access decision needs real-time policy evaluation. We tier requests by resource sensitivity:

  • Tier 1 (public/internal): Cached decisions, <5ms response, re-evaluated at least every 5 minutes
  • Tier 2 (confidential): Real-time evaluation, <50ms response, continuous monitoring
  • Tier 3 (restricted): Real-time evaluation + step-up auth, full context check every request

Pre-Computed Access Grants

For predictable access patterns, we evaluate policies ahead of access requests. When a user's context changes (team assignment, risk score update), we recompute their access grants for commonly accessed resources and cache the results.

This turns most access checks into cache lookups — sub-5ms with no policy evaluation overhead.
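
A minimal sketch of the pre-computation step, assuming a policy is a plain function of user context and resource, and that each user's commonly accessed resources are already known. All names here are hypothetical.

```python
from typing import Callable

Policy = Callable[[dict, str], bool]   # (user_context, resource) -> allowed


class GrantCache:
    def __init__(self, policy: Policy, common_resources: dict):
        self._policy = policy
        self._common = common_resources   # user -> frequently accessed resources
        self._grants = {}                 # (user, resource) -> allowed

    def on_context_change(self, user: str, context: dict) -> None:
        """Recompute grants eagerly so later checks are plain lookups."""
        for resource in self._common.get(user, []):
            self._grants[(user, resource)] = self._policy(context, resource)

    def check(self, user: str, resource: str, context: dict) -> bool:
        cached = self._grants.get((user, resource))
        if cached is not None:
            return cached
        return self._policy(context, resource)   # uncommon resource: evaluate on demand
```

The cost moves from the request path to the context-change path, which pays off when context changes are far rarer than access checks.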

Circuit Breakers

If a signal source goes down (EDR agent unreachable, IdP timeout), the circuit breaker activates. Access decisions continue with degraded context, but how they degrade depends on resource sensitivity. Tier 1 resources use cached decisions. Tier 2 resources fall back to the last cached decision and raise an alert. Tier 3 resources fail closed until the signal source recovers.
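
A sketch of a breaker with tier-aware fallback. The failure threshold (5), recovery timeout (30s), and per-tier fallbacks mirror the configuration shown in this article; the class and function names are hypothetical, and denying when no cached decision exists is an assumption.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold, self._timeout, self._clock = failure_threshold, recovery_timeout, clock
        self._failures = 0
        self._opened_at = None

    def record(self, success: bool) -> None:
        if success:
            self._failures, self._opened_at = 0, None
            return
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the timeout, let traffic probe the source again (half-open).
        return self._clock() - self._opened_at < self._timeout


def degraded_decision(tier: int, cached: Optional[bool]) -> tuple:
    """Return (allowed, raise_alert) while a signal source's breaker is open."""
    if tier == 1:
        return (bool(cached), False)      # missing cache entry means deny
    if tier == 2:
        return (bool(cached), True)
    return (False, False)                 # tier 3: fail closed
```

Keeping the fallback in one small function makes it easy to audit that degraded mode can never grant more than the last real evaluation did.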

Eventual Consistency for Analytics

Access decisions must be strongly consistent. Analytics can be eventually consistent. We use PostgreSQL (OLTP) for real-time decision state and ClickHouse (OLAP) for analytics, compliance reports, and trend analysis. Events replicate from PostgreSQL to ClickHouse asynchronously.

Event Processing Pipeline

# TigerIdentity evaluation engine configuration

evaluation:
  engine:
    compiled_policies: true
    cache:
      enabled: true
      ttl_by_sensitivity:
        public: 300s      # 5 minutes
        internal: 120s    # 2 minutes
        confidential: 0s  # No caching — always real-time
        restricted: 0s    # No caching — always real-time
      max_entries: 100000

  signal_correlation:
    window: 300s           # 5-minute sliding window
    decay: exponential
    min_confidence: 0.7
    sources:
      - idp              # Identity provider events
      - edr              # Endpoint detection
      - siem             # Security events
      - hr_system        # Employment status changes
      - device_mgmt      # Device posture signals

  degradation:
    circuit_breaker:
      failure_threshold: 5
      recovery_timeout: 30s
    fallback_by_tier:
      tier_1: use_cached_decision
      tier_2: use_cached_with_alert
      tier_3: fail_closed

  messaging:
    provider: nats_jetstream
    delivery: at_least_once
    partitioning: per_tenant
    retention: 7d

  session_revocation:
    method: caep_push
    propagation_target: 2s
    storage: redis_cluster
    audit_backend: clickhouse

Build on Proven Scale

TigerIdentity's evaluation engine has processed 10 billion+ access decisions at sub-50ms latency. Deploy continuous access evaluation that scales with your enterprise.
