# Performance

Performance characteristics of the core data structures in gcatlib, measured
using [nanobench](https://nanobench.ankerl.com) v4.3.11.

## Environment

| Item | Details |
|---|---|
| CPU | AMD Ryzen 9 5900X 12-Core Processor (24 threads) |
| Compiler | g++ with `-O2 -std=c++17` |
| Benchmarking tool | nanobench v4.3.11 |
| CPU governor | `performance` mode (`cpupower frequency-set -g performance`) |
| Core pinning | Single dedicated core (`taskset -c 2`) |

Setting the CPU governor to `performance` and pinning to a single core
eliminates frequency scaling and process migration noise. All error percentages
reported are from nanobench's stability analysis.

---

## StepVector

`StepVector<T>` stores a piecewise-constant function over a 1D integer
coordinate space using `std::map<size_t, T>` to record only the positions
where the value changes.

### add() — sequential non-overlapping

Each interval starts after the previous one ends. No existing map nodes fall
inside the new interval's range so the internal update loop is never entered.
This is the best-case access pattern.

| N | ns / add | Total time |
|---:|---:|---:|
| 1,000 | 191 | 191 µs |
| 10,000 | 311 | 3.1 ms |
| 100,000 | 412 | 41 ms |
| 1,000,000 | 510 | 510 ms |

The per-add cost grows slowly with N because each `add()` performs two
`std::map` operations (`lower_bound` and `insert`), both O(log N).
Total cost is **O(N log N)**.

### add() — random overlapping

Intervals have uniformly random start positions over a range of 100,000 bp
with lengths uniformly distributed in [1, 1,000]. Intervals are pre-generated
to exclude RNG overhead.

| N | ns / add | Total time |
|---:|---:|---:|
| 1,000 | 208 | 208 µs |
| 10,000 | 842 | 8.4 ms |
| 100,000 | 4,561 | 456 ms |

The 10x increase from N=1K to N=10K costs ~4x in per-add time, and from
N=10K to N=100K costs ~5.4x. This super-linear degradation occurs because
as the map fills, each new interval overlaps more existing step boundaries.
The internal update loop — which increments every map node between the
interval's start and end — becomes increasingly expensive as the coordinate
space saturates. See [worst-case stress](#worst-case-wide-interval-over-dense-pre-built-map)
for the limiting case.

### at() — point query

Queries 1,000 random positions uniformly distributed over the populated range
after building the map with N sequential non-overlapping adds.

| Map size (N) | ns / query |
|---:|---:|
| 1,000 | 12.4 |
| 10,000 | 27.4 |
| 100,000 | 70.6 |

Point queries use `std::map::upper_bound`, which is **O(log N)**.
Tripling from N=1K to N=10K (+10x) costs ~2.2x in query time, consistent
with log scaling.

### at_range() — full range query

Queries the entire populated range [0, N×10) after N sequential adds, writing
all step boundaries to an output vector.

| Map size (N) | Time per query |
|---:|---:|
| 1,000 | 10.6 µs |
| 10,000 | 110.8 µs |
| 100,000 | 1.15 ms |

Output size grows linearly with N (each sequential add contributes 2 map
entries), so total time is **O(N)**. The ~10.5x cost increase per 10x
increase in N confirms this.

---

## GenomicStepVector

`GenomicStepVector<T>` maintains one independent `StepVector<T>` per
chromosome, dispatched via `std::unordered_map<string, StepVector<T>>`.

### Single chromosome — dispatch overhead baseline

Single-chromosome performance can be compared directly against `StepVector`
to isolate the cost of the `unordered_map` chromosome dispatch.

| N | StepVector (ns/add) | GenomicStepVector (ns/add) | Overhead |
|---:|---:|---:|---:|
| 1,000 | 191 | 135 | ~0% |
| 10,000 | 311 | 252 | ~0% |
| 100,000 | 412 | 346 | ~0% |
| 1,000,000 | 510 | 485 | ~0% |
| 10,000,000 | — | 576 | — |

The `unordered_map` lookup adds negligible overhead — well within measurement
noise. `GenomicStepVector` is effectively free compared to the underlying
`StepVector` operations.

### Multi-chromosome sequential scaling

Each chromosome receives the same number of sequential non-overlapping adds.
Total adds = n_chrom × n_per_chrom.

| n_chrom | n_per_chrom | Total adds | Total time |
|---:|---:|---:|---:|
| 1 | 1,000 | 1,000 | 121 µs |
| 1 | 10,000 | 10,000 | 2.4 ms |
| 1 | 100,000 | 100,000 | 34 ms |
| 5 | 1,000 | 5,000 | 636 µs |
| 5 | 10,000 | 50,000 | 12.1 ms |
| 5 | 100,000 | 500,000 | 182 ms |
| 20 | 1,000 | 20,000 | 2.7 ms |
| 20 | 10,000 | 200,000 | 49 ms |
| 20 | 100,000 | 2,000,000 | 787 ms |

Scaling is near-linear with both n_chrom and n_per_chrom. Going from 1 to 5
to 20 chromosomes at the same n_per_chrom scales proportionally, confirming
that each chromosome's `StepVector` operates independently.

### Random overlapping — super-linear degradation

Each chromosome receives the same pre-generated set of random overlapping
intervals (positions in [0, 100,000], lengths in [1, 1,000]).

| n_chrom | n_per_chrom | Total adds | Total time |
|---:|---:|---:|---:|
| 1 | 1,000 | 1,000 | 212 µs |
| 1 | 10,000 | 10,000 | 7.6 ms |
| 1 | 100,000 | 100,000 | 433 ms |
| 1 | 1,000,000 | 1,000,000 | 78 s |
| 5 | 1,000 | 5,000 | 1.0 ms |
| 5 | 10,000 | 50,000 | 38.8 ms |
| 5 | 100,000 | 500,000 | 2,139 ms |
| 5 | 1,000,000 | 5,000,000 | 394 s |
| 20 | 1,000 | 20,000 | 4.2 ms |
| 20 | 10,000 | 200,000 | 156 ms |
| 20 | 100,000 | 2,000,000 | 8,658 ms |
| 20 | 1,000,000 | 20,000,000 | 1,560 s |

**Key observation:** The jump from n_per_chrom=10K to 100K costs ~57x per
chromosome (7.6 ms → 433 ms), far worse than the 10x expected from O(N log N).
This is because the coordinate range [0, 100,000] becomes saturated — once
the step boundaries are dense enough, every new `add()` must traverse a large
fraction of the map in the update loop, pushing cost toward O(N²).

Scaling with n_chrom remains proportional (each chromosome's map is independent),
so the total time for 20 chromosomes at n_per_chrom=1M is approximately
20× the single-chromosome case.

### at() query — chromosome lookup cost

Builds a vector with n_per_chrom=10,000 sequential adds per chromosome, then
runs 1,000 random queries across chromosomes.

| n_chrom | Time for 1,000 queries |
|---:|---:|
| 1 | 25.4 ms |
| 5 | 24.8 ms |
| 20 | 26.6 ms |

Query time is essentially flat regardless of chromosome count. The
`unordered_map` chromosome lookup is O(1), and the dominant cost is the
`at_range()` traversal and output merging — not the dispatch.

### Stress tests

#### Worst-case: wide interval over dense pre-built map

Pre-builds N non-overlapping length-1 intervals (creating N map nodes), then
repeatedly adds a single interval spanning the entire range. Each `add()` call
must traverse and update every existing map node — this is the maximum
per-call cost.

| Pre-built nodes (N) | Time per add |
|---:|---:|
| 10,000 | 145 µs ⚠ |
| 100,000 | 864 µs |
| 1,000,000 | 23.9 ms |

⚠ N=10,000 result is flagged as unstable by nanobench (~7.8 iterations
collected). The N=100K and N=1M results are reliable.

The ~28x cost increase from N=100K to N=1M (for 10x more nodes) reflects
**O(N)** traversal with additional cache pressure — once the map exceeds
the CPU cache, random pointer-chasing in `std::map` nodes becomes expensive.

#### Maximally dense: sequential length-1 intervals

Each `add()` creates a new 1-position interval, maximising map size for a
given number of adds. This tests performance under peak memory pressure.

| N | Total time |
|---:|---:|
| 100,000 | 30 ms |
| 1,000,000 | 437 ms |

Scaling of ~14.6x for 10x more adds is slightly super-linear compared to the
sequential 5-bp case (~12x), reflecting the larger and more memory-intensive
map.

---

## Key Findings

1. **`unordered_map` dispatch is free.** The per-chromosome lookup in
   `GenomicStepVector` adds no measurable overhead over raw `StepVector`.
   Multi-chromosome scaling is proportional: N chromosomes costs N× the
   single-chromosome time.

2. **Sequential adds are O(N log N).** When intervals do not overlap,
   each `add()` only touches 2 map entries and the update loop is never
   entered. This is the fast path.

3. **Random overlapping add() is the main bottleneck.** When intervals
   overlap, `add()` must update every map node between the interval's start
   and end. As the coordinate space saturates, this approaches O(N²) behaviour.
   For a 100 kbp coordinate range, saturation begins around N=10,000 adds.
   **Consider pre-sorting or batching overlapping intervals to reduce map
   density if this is a bottleneck.**

4. **Point queries are O(log N) and fast.** Even with 100,000 step boundaries,
   a single `at()` costs ~71 ns. Query performance degrades slowly with map
   size.

5. **Range queries are O(N) in output size.** `at_range()` is dominated by
   the cost of writing output, not by lookup. Reusing the output vector across
   calls (avoiding repeated allocation) is recommended.

6. **Cache pressure at large map sizes.** The worst-case stress test shows that
   traversing 1M `std::map` nodes costs ~28× more than traversing 100K nodes
   — super-linear due to cache misses on pointer-based tree nodes. For very
   large datasets, a cache-friendlier data structure (e.g. sorted vector of
   breakpoints) could improve performance.

---

*This document was co-authored with [Claude](https://claude.ai) (Anthropic).*