StepVector
Classes for storing arbitrary data associated with genomic regions.
StepVector
-
template<typename T>
class StepVector
A memory-efficient step function over integer positions.
StepVector stores a piecewise-constant function over a 1D integer coordinate space. Rather than storing a value at every position, it stores only the positions where the value changes (step boundaries), making it efficient for sparse data.
Values are added over half-open intervals [start, end) and are cumulative: adding to an interval that already has a value accumulates the new value on top of the existing one. Querying a position that has never been written to returns the default-constructed value T{}.
The type T must support the += and + operators and must be default-constructible.
- Template Parameters:
T – The value type stored at each step.
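The boundary-only storage described above can be illustrated with a minimal sketch. It assumes an ordered std::map keyed by position; the class name SketchStepVector and the helpers split() and boundary_count() are hypothetical and are not taken from the library, whose internal layout is not documented here.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical stand-in for illustration only; not the library's StepVector.
// Each map entry (pos, v) means "the value is v from pos up to the next key".
template <typename T>
class SketchStepVector {
public:
    // Accumulate val over the half-open interval [start, end).
    void add(std::size_t start, std::size_t end, const T& val) {
        if (start >= end) return;             // empty interval: no-op
        split(end);                           // pin the value that resumes at end
        split(start);                         // make start a boundary
        auto last = steps_.find(end);
        for (auto it = steps_.find(start); it != last; ++it)
            it->second += val;                // only boundaries are touched
    }

    // Value at pos; T{} if no boundary lies at or before pos.
    T at(std::size_t pos) const {
        auto it = steps_.upper_bound(pos);    // first boundary strictly after pos
        if (it == steps_.begin()) return T{};
        return std::prev(it)->second;
    }

    std::size_t boundary_count() const { return steps_.size(); }

private:
    // Ensure pos is a key, carrying over the value already in effect there.
    // emplace() leaves an existing key untouched.
    void split(std::size_t pos) { steps_.emplace(pos, at(pos)); }

    std::map<std::size_t, T> steps_;
};

// Two overlapping intervals: the overlap [15, 20) holds the sum 5 + 2.
void step_vector_demo() {
    SketchStepVector<int> sv;
    sv.add(10, 20, 5);
    sv.add(15, 30, 2);
    assert(sv.at(12) == 5);
    assert(sv.at(17) == 7);
    assert(sv.at(25) == 2);
    assert(sv.at(5) == 0);                    // never written: T{}
    assert(sv.boundary_count() == 4);         // keys 10, 15, 20, 30
}
```

In this sketch two intervals covering twenty positions are represented by four map entries, which is the kind of saving the class description refers to for sparse data.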
Public Functions
-
StepVector()
Default constructor.
-
void add(const size_t start, const size_t end, const T &val)
Accumulates val over the half-open interval [start, end). If the interval overlaps with previously added intervals, the values are summed. If start >= end, the call is a no-op.
- Parameters:
start – [in] Start of the interval (inclusive, 0-based).
end – [in] End of the interval (exclusive, 0-based).
val – [in] Value to accumulate over the interval.
-
void at_range(const size_t start, const size_t end, vector<pair<size_t, T>> &out) const
Returns the step boundaries and their values for the half-open interval [start, end).
The output is a vector of (position, value) pairs describing the step function within the queried range. Consecutive entries out[i] and out[i+1] define the sub-interval [out[i].first, out[i+1].first), on which the value is constant and equal to out[i].second. The first entry is always (start, value_at_start) and the last entry is always (end, value_at_end); any step boundaries strictly inside (start, end) appear between them. Because end is exclusive, the last entry only closes the final sub-interval: its value belongs to position end, which lies outside the queried range.
If start >= end, out is cleared and left empty.
- Parameters:
start – [in] Start of the query interval (inclusive, 0-based).
end – [in] End of the query interval (exclusive, 0-based).
out – [out] Vector of (position, value) pairs representing the step boundaries within [start, end). Cleared before populating.
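The output format described above can be reproduced in a small sketch. It assumes the step function is held in a std::map from boundary position to value; the names Steps, Boundaries, value_at, range_sketch and covered_sum are hypothetical helpers written for this example, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

using Steps = std::map<std::size_t, int>;   // boundary -> value from there on
using Boundaries = std::vector<std::pair<std::size_t, int>>;

// Value in effect at pos; 0 if no boundary lies at or before pos.
int value_at(const Steps& steps, std::size_t pos) {
    auto it = steps.upper_bound(pos);
    return it == steps.begin() ? 0 : std::prev(it)->second;
}

// Hypothetical re-creation of the documented output format for [start, end).
void range_sketch(const Steps& steps, std::size_t start, std::size_t end,
                  Boundaries& out) {
    out.clear();
    if (start >= end) return;                          // empty query: empty output
    out.emplace_back(start, value_at(steps, start));   // opening entry
    for (auto it = steps.upper_bound(start);
         it != steps.end() && it->first < end; ++it)
        out.emplace_back(it->first, it->second);       // boundaries inside (start, end)
    out.emplace_back(end, value_at(steps, end));       // closing entry
}

// Sum of value * length over the range, read off consecutive pairs.
long long covered_sum(const Boundaries& b) {
    long long total = 0;
    for (std::size_t i = 0; i + 1 < b.size(); ++i) {
        const auto width = static_cast<long long>(b[i + 1].first - b[i].first);
        total += static_cast<long long>(b[i].second) * width;
    }
    return total;
}

void range_demo() {
    const Steps steps = {{10, 5}, {15, 7}, {20, 2}, {30, 0}};
    Boundaries out;
    range_sketch(steps, 12, 25, out);
    // out is (12,5) (15,7) (20,2) (25,2)
    assert(out.size() == 4);
    assert(covered_sum(out) == 5 * 3 + 7 * 5 + 2 * 5);
}
```

Iterating over consecutive pairs, as covered_sum does, is the usual way to consume this kind of output: the closing entry supplies the right edge of the last sub-interval and its value is never read.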
-
T at(const size_t pos) const
Returns the value at a single position. Returns the default-constructed value T{} if no value has been accumulated at or before pos.
- Parameters:
pos – [in] Query position (0-based).
- Returns:
The accumulated value at pos.
-
void print_elements()
Prints all internal step boundaries to stdout as tab-separated position-value pairs, one per line. Intended for debugging.
GenomicStepVector
-
template<typename T>
class GenomicStepVector
A chromosome-aware step function for genomic data.
GenomicStepVector extends StepVector to genomic coordinates by maintaining an independent StepVector per chromosome. Chromosomes are created on demand the first time an interval is added. The accumulation semantics are the same as StepVector: adding to an interval that already has a value sums the new value on top of it.
The main query method at(GenomicRegion, out) returns results as a vector of (GenomicRegion, T) pairs where consecutive step boundaries with equal values are merged into a single region, and zero-valued intervals can optionally be filtered out.
The type T must support the += and + operators, equality comparison, and must be default-constructible.
- Template Parameters:
T – The value type stored at each step.
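The per-chromosome organisation described above can be sketched as a name-to-track map. This is an assumption about one reasonable layout, not the library's actual implementation; PerChromosome, get_or_create, find and the Steps alias are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical helper: owns one independent Track per chromosome name.
template <typename Track>
class PerChromosome {
public:
    // Mutable access for adding: creates an empty Track on first use.
    Track& get_or_create(const std::string& chr) { return tracks_[chr]; }

    // Read-only access for querying: never creates a chromosome.
    const Track* find(const std::string& chr) const {
        auto it = tracks_.find(chr);
        return it == tracks_.end() ? nullptr : &it->second;
    }

    std::size_t chrom_count() const { return tracks_.size(); }

private:
    std::unordered_map<std::string, Track> tracks_;
};

// Stand-in track for the demo: boundary position -> value from there on.
using Steps = std::map<std::size_t, int>;

void per_chromosome_demo() {
    PerChromosome<Steps> tracks;
    tracks.get_or_create("chr1")[100] = 3;   // first use creates chr1
    tracks.get_or_create("chr1")[200] = 0;   // same track, no new chromosome
    tracks.get_or_create("chr2")[50] = 1;
    assert(tracks.chrom_count() == 2);
    assert(tracks.find("chrX") == nullptr);  // a query does not create chrX
    assert(tracks.chrom_count() == 2);
}
```

Separating the creating accessor from the const lookup keeps queries on unknown chromosomes free of side effects, which matches the documented behaviour that only adding an interval creates a chromosome.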
Public Functions
-
GenomicStepVector()
Default constructor. Initializes an empty genomic step vector with no chromosomes.
-
void add(const string chr, const size_t start, const size_t end, const T &val)
Accumulates val over the half-open interval [start, end) on the given chromosome. If the chromosome does not yet exist, it is created. Accumulation semantics are inherited from StepVector: overlapping intervals sum their values.
- Parameters:
chr – [in] Chromosome name.
start – [in] Start of the interval (inclusive, 0-based).
end – [in] End of the interval (exclusive, 0-based).
val – [in] Value to accumulate over the interval.
-
void add(const GenomicRegion &g, const T &val)
Convenience overload of add() that takes a GenomicRegion. Delegates to add(chr, start, end, val) using g.name, g.start, and g.end.
- Parameters:
g – [in] GenomicRegion defining the chromosome and interval.
val – [in] Value to accumulate over the interval.
-
void at(const string chr, const size_t start, const size_t end) const
Prints the step boundaries in [start, end) on the given chromosome to stdout as tab-separated position-value pairs. If the chromosome does not exist, prints a message to stderr. Intended for debugging.
- Parameters:
chr – [in] Chromosome name.
start – [in] Start of the query interval (inclusive, 0-based).
end – [in] End of the query interval (exclusive, 0-based).
-
void at(const GenomicRegion &g, vector<pair<GenomicRegion, T>> &out, bool keep_0 = false) const
Returns the accumulated values within the genomic region g as a vector of (GenomicRegion, T) pairs. Each pair represents a contiguous sub-interval where the value is constant. Adjacent step boundaries with the same value are merged into a single entry.
By default (keep_0 = false), intervals where the value equals T{} (the default-constructed zero value) are excluded from the output. Set keep_0 = true to include them.
If the chromosome in g does not exist in the vector, out is cleared and left empty.
- Parameters:
g – [in] GenomicRegion defining the chromosome and interval to query.
out – [out] Vector of (GenomicRegion, T) pairs representing the merged constant-value sub-intervals within g. Cleared before populating.
keep_0 – [in] If false (default), intervals with value T{} are excluded. If true, all intervals are returned.
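The merging and zero-filtering behaviour described above can be sketched as a pass over a list of step boundaries. The Region struct below is a hypothetical stand-in for GenomicRegion that assumes only the name, start and end fields; merge_steps, Boundaries and Merged are likewise names invented for this example, and the value type is fixed to int for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for GenomicRegion; only name/start/end are assumed.
struct Region {
    std::string name;
    std::size_t start;
    std::size_t end;
};

using Boundaries = std::vector<std::pair<std::size_t, int>>;
using Merged = std::vector<std::pair<Region, int>>;

// Collapse runs of equal values into single regions; optionally drop zeros.
Merged merge_steps(const std::string& chr, const Boundaries& b,
                   bool keep_0 = false) {
    Merged out;
    for (std::size_t i = 0; i + 1 < b.size(); ++i) {
        const std::size_t s = b[i].first;
        const std::size_t e = b[i + 1].first;
        const int v = b[i].second;
        if (s == e) continue;                 // skip empty sub-intervals
        if (!keep_0 && v == 0) continue;      // filter zero-valued intervals
        if (!out.empty() && out.back().second == v &&
            out.back().first.end == s) {
            out.back().first.end = e;         // same value, adjacent: extend
        } else {
            out.emplace_back(Region{chr, s, e}, v);
        }
    }
    return out;
}

void merge_demo() {
    const Boundaries b = {{0, 0}, {10, 5}, {15, 5}, {20, 2}, {30, 0}, {40, 0}};
    const Merged hits = merge_steps("chr1", b);
    assert(hits.size() == 2);                 // [10,20)=5 and [20,30)=2
    assert(hits[0].first.start == 10 && hits[0].first.end == 20);
    assert(merge_steps("chr1", b, true).size() == 4);
}
```

The merge step relies on comparing values for equality, which is why GenomicStepVector requires T to be equality-comparable in addition to the StepVector requirements.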
-
void at(const string chr, vector<pair<GenomicRegion, T>> &out)
Queries the entire chromosome and returns all accumulated intervals as a vector of (GenomicRegion, T) pairs. Equivalent to calling at(GenomicRegion{chr, 0, SIZE_MAX}, out). Zero-valued intervals are excluded (keep_0 = false).
- Parameters:
chr – [in] Chromosome name to query.
out – [out] Vector of (GenomicRegion, T) pairs for the entire chromosome. Cleared before populating.
-
inline size_t chrom_count() const
Returns the number of distinct chromosomes that have been added.
- Returns:
Number of chromosomes.