Concepts
Once windows are aligned, comparators can reason about nested containment, coverage with exclusions, transition drift, replay-safe audits, and live provisional rows without hand-written interval joins.
Scenario
Real comparisons rarely produce a single clean overlap. A primary and a backup see different parts of the same outage. A maintenance window partly excludes the outage from the SLA. The incident escalates mid-range, so measurements before and after the boundary must be reported separately. The backup briefly drops out. And the dashboard still has to serve a live number while the window is still open.
Pipeline
Hand-written interval joins tend to blend filtering, normalization, alignment, scoring, and rollup into one query — which is why the answers are hard to audit. Spanfold stages them, so each step is explicit and re-runnable.
1. Scope
Choose what counts as in-scope before asking any analytic question. Narrow by window name, temporal axis, segment values, and tag filters.
2. Normalize
Apply known-at filtering for replay safety, clip open windows to a live horizon, exclude ranges that should not count, and require closed windows for historical runs.
3. Align
Match the target-side and against-side windows on key, source, partition, and temporal axis. Cohorts collapse many members into one derived lane at this stage.
4. Score
Overlap, residual, missing, coverage, gap, containment, lead/lag, and as-of rows come out of this stage. Each row preserves its originating window ids and range.
5. Project
Group rows by segment, tag, source, or bucket. Compute ratios, histograms, and rollups from the evidence rows — not from counters.
6. Publish
Emit JSON, Markdown, debug HTML, or agent context. Final and provisional rows are kept distinct so a live dashboard cannot pretend to be history.
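Laid end to end, the stages compose into a single fluent plan. The following is a minimal sketch assuming the same Compare API shown in the examples further down; the projection and publish helpers (GroupBySegment, ToMarkdown) are hypothetical names used to illustrate the shape, not confirmed library surface.

```
// 1. Scope and 3. Align: pick the windows and the two lanes to compare.
var result = pipeline.History
    .Compare("Outage coverage sketch")
    .Target("primary", s => s.Source("primary"))
    .Against("backup", s => s.Source("backup"))
    .Within(scope => scope
        .Window("PaymentOutage")
        .Axis(TemporalAxis.EventTime))
    // 2. Normalize: a historical run, so require closed windows.
    .Normalize(n => n.RequireClosedWindows())
    // 4. Score: emit coverage evidence rows.
    .Using(c => c.Coverage())
    .Run();

// 5. Project and 6. Publish: hypothetical helpers, shown for shape only.
var report = result.CoverageRows
    .GroupBySegment("region")
    .ToMarkdown();
```

Each stage is a separate call, so any stage can be changed and the plan re-run without touching the others.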
Row families
Coverage: Covered magnitude over eligible target magnitude, with exclusions applied at the normalize stage. Ratios stay drill-downable because every ratio preserves its contributing rows.
Gap: Uncovered spans surfaced as their own rows. Useful with a minimum-magnitude threshold so that micro-flaps below (say) 30 seconds do not trigger alerts but still appear in exports.
Containment: One window fully encloses another. Use it for maintenance-inside-outage exclusions, parent-contains-child explanation, or release-window envelopes. Partial containment surfaces as its own row kind.
Lead/lag: Signed transition deltas with a tolerance band. Group the rows into buckets to build histograms, or filter by direction (target-leads / target-lags / within-tolerance) for SLA assertions.
As-of: Point-in-time match against the previous or next qualifying window. Combined with known-at filtering, as-of rows let audits replay the evidence that was visible at a specific decision moment.
Finality: Rows derived from clipped open windows are labelled provisional and carry the horizon metadata. Final rows from closed history stay stable when the same comparison is replayed later.
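The drill-down property is concrete in code: a rollup is computed from rows that still carry their window ids and ranges, so the same rows feed the audit trail. A minimal sketch, where `result` stands for the value a Compare plan's `.Run()` returns, and the row properties (`TargetWindowId`, `Range`, `CoveredMagnitude`) are assumptions consistent with the examples in this document:

```
// Roll the ratio up from the evidence rows, not from a counter.
var ratio = result.CoverageRows.TotalCoveredMagnitude()
          / result.CoverageRows.TotalTargetMagnitude();

// The exact rows the ratio came from, ready for an audit trail.
foreach (var row in result.CoverageRows)
{
    Console.WriteLine(
        $"{row.TargetWindowId} [{row.Range}] covered={row.CoveredMagnitude}");
}
```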
Hard examples
Each example is a single Compare plan that layers normalization, exclusion, segment filters, and comparator rows. Read them as templates — the shape is the contribution, not the provider names.
Payments outages must be covered by backup at 99.5%, but scheduled maintenance is excluded — unless the incident has already escalated, in which case the exclusion no longer applies.
// Maintenance windows whose ranges may be excluded from the SLA.
var maintenance = pipeline.History
    .Query()
    .Window("MaintenanceWindow")
    .Tag("tier", "business-hours")
    .Windows();

var sla = pipeline.History
    .Compare("Payment coverage SLA")
    .Target("primary", s => s.Source("primary"))
    .Against("backup", s => s.Source("backup"))
    .Within(scope => scope
        .Window("PaymentOutage")
        .Axis(TemporalAxis.EventTime)
        .Segment("lifecycle", "Incident", "Escalated"))
    .Normalize(n => n
        .RequireClosedWindows()
        // The maintenance exclusion stops applying once the incident escalates.
        .ExcludeRanges(maintenance,
            when: ctx => !ctx.Target.Segment("lifecycle").Equals("Escalated")))
    .Using(c => c
        .Coverage()
        .Gap(minimumMagnitude: TimeSpan.FromSeconds(30))
        .Containment(ContainmentPolicy.ExcludeFromDenominator))
    .Run();

// The 99.5% target: covered magnitude over eligible magnitude.
var eligible = sla.CoverageRows.TotalTargetMagnitude();
var covered = sla.CoverageRows.TotalCoveredMagnitude();
var ratio = eligible == TimeSpan.Zero
    ? 1d
    : covered / eligible;

// Gaps longer than five minutes are reportable breaches.
var breaches = sla.GapRows
    .Where(g => g.Range.Duration > TimeSpan.FromMinutes(5))
    .OrderByDescending(g => g.Range.Duration)
    .ToArray();
Two providers should open their outage windows within 500 ms of each other. Bucket the signed deltas into 100 ms buckets and flag any bucket that holds more than 5% of the population.
var drift = pipeline.History
    .Compare("Provider start-drift")
    .Target("primary", s => s.Source("primary"))
    .Against("secondary", s => s.Source("secondary"))
    .Within(scope => scope
        .Window("DeviceOffline")
        .Axis(TemporalAxis.EventTime)
        .Since(DateTimeOffset.UtcNow.AddDays(-30)))
    .Using(c => c.LeadLag(
        LeadLagTransition.Start,
        TemporalAxis.EventTime,
        toleranceMagnitude: TimeSpan.FromMilliseconds(500)))
    .Run();

// Round each signed delta to the nearest 100 ms to form histogram buckets.
var buckets = drift.LeadLagRows
    .GroupBy(row => (long)Math.Round(row.Delta.TotalMilliseconds / 100d) * 100)
    .Select(g => new
    {
        BucketMs = g.Key,
        Count = g.Count(),
        OutOfTolerance = g.Count(r => r.Direction != LeadLagDirection.WithinTolerance)
    })
    .OrderBy(b => b.BucketMs)
    .ToArray();

// Flag buckets holding more than 5% of the out-of-tolerance population.
var total = drift.LeadLagRows.Count;
var hot = buckets.Where(b => b.OutOfTolerance > total * 0.05).ToArray();

// p95 of absolute drift.
var p95Drift = drift.LeadLagRows
    .OrderBy(r => Math.Abs(r.Delta.TotalMilliseconds))
    .ElementAt((int)(total * 0.95))
    .Delta;
Did the risk picture change between two decision points because of late-arriving events? Regenerate the view at both known-at horizons and diff window identities and ranges.
// The same comparison, replayed at two known-at horizons.
var earlier = pipeline.History
    .Compare("Risk view @ 12,000")
    .Target("risk", s => s.Source("risk-service"))
    .Against("market", s => s.Source("market-feed"))
    .Within(scope => scope.Window("HighRisk"))
    .Normalize(n => n.KnownAtPosition(12_000))
    .Using(c => c.Overlap().Residual().Missing())
    .Run();

var later = pipeline.History
    .Compare("Risk view @ 12,847")
    .Target("risk", s => s.Source("risk-service"))
    .Against("market", s => s.Source("market-feed"))
    .Within(scope => scope.Window("HighRisk"))
    .Normalize(n => n.KnownAtPosition(12_847))
    .Using(c => c.Overlap().Residual().Missing())
    .Run();

// Windows visible at the later horizon but not the earlier one: late arrivals.
var retroactive = later.AllWindows()
    .Except(earlier.AllWindows(), WindowIdentityComparer.Default)
    .ToArray();

// Windows present at both horizons whose ranges differ: restatements.
var restated = later.AllWindows()
    .Join(earlier.AllWindows(),
        nw => nw.WindowId, ow => ow.WindowId,
        (nw, ow) => new { nw.WindowId, Before = ow.Range, After = nw.Range })
    .Where(x => x.Before != x.After)
    .ToArray();
The live coverage number is what the dashboard shows; the final number is what survives replay. Keep both, and compute how much of the live number is still provisional.
var horizon = TemporalPoint.ForPosition(currentPosition);

// Clip still-open windows to the live horizon; rows derived from
// the clipped portion come back marked provisional.
var live = pipeline.History
    .Compare("Live coverage")
    .Target("primary", s => s.Source("primary"))
    .Against("backup", s => s.Source("backup"))
    .Within(scope => scope.Window("PaymentOutage"))
    .Normalize(n => n.ClipOpenWindowsTo(horizon))
    .Using(c => c.Coverage())
    .RunLive(horizon);

static double Ratio(IEnumerable<CoverageRow> rows)
{
    var covered = rows.Sum(r => r.CoveredMagnitude.Value);
    var target = rows.Sum(r => r.TargetMagnitude.Value);
    return target == 0 ? 1d : covered / target;
}

var final = live.CoverageRows.Where(r => r.Finality == RowFinality.Final);
var provisional = live.CoverageRows.Where(r => r.Finality == RowFinality.Provisional);

var finalRatio = Ratio(final);
var liveRatio = Ratio(final.Concat(provisional));

// How much of the live number is still provisional.
var provisionalPp = liveRatio - finalRatio;

dashboard.Render(
    finalRatio: finalRatio,
    liveRatio: liveRatio,
    provisionalBanding: provisionalPp,
    horizon: horizon);
Reading order
A comparison result is most useful when ratios come out of the evidence rows instead of replacing them. That is what keeps audits explainable: every summary number can be walked back to the ranges, window ids, segments, and sources that produced it.
The same evidence rows feed live dashboards, incident reports, regulator-facing audits, and retrospective root-cause reviews — with finality preserved, so a provisional live number can never be mistaken for a settled historical one.