Readers

Classes related to reading genomic files.

Sam reader

class SamReader

SAM/BAM file reader.

Uses htslib to read the SAM/BAM header and entries. This class only reads the header and entries; parsing is delegated to SamEntry.

See also

SamEntry

Public Functions

SamReader(const string &in_file)

Opens a SAM/BAM file, verifies that it is a sequence data file, requires a header, and initializes htslib handlers.

Parameters:

in_file[in] SAM/BAM file name.

Throws:

std::runtime_error – if the file cannot be opened, is not a sequence data file, has no header, or htslib initialization fails.

~SamReader()

Closes the file and destroys htslib handlers.

bool read_sam_line(SamEntry &entry)

Reads one entry and populates ‘entry’ with the help of SamEntry.

Parameters:

entry[out] SamEntry to populate with the parsed fields. Contents are not valid if false is returned.

Throws:

std::runtime_error – if the entry cannot be parsed.

Returns:

True on successfully reading an entry. False if end of file is reached.

bool read_pe_sam(SamEntry &entry1, SamEntry &entry2)

Reads a pair of entries from a paired-end SAM/BAM file and populates ‘entry1’ and ‘entry2’ in the order the mates appear in the file.

A hash table keeps track of entries whose mates have not yet been read. Each entry is read with ‘read_sam_line’, and its QNAME is looked up in the hash table. If an entry with the same QNAME is found, the stored entry and the current entry are returned as a pair. Otherwise, the current entry is stored in the hash table.

Each read pair is assumed to have exactly two entries in the file. If a QNAME has more than two entries (for example, because of supplementary alignments), an even number of entries yields more than one pair for that QNAME, and an odd number leaves a dangling entry in the hash table.

Parameters:
  • entry1[out] SamEntry with the first read

  • entry2[out] SamEntry with the second read

Returns:

True on successfully reading a pair. False if end of file is reached.
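The hash-table pairing described for read_pe_sam can be sketched as follows. This is a minimal illustration, not the library's implementation: MiniEntry and MatePairer are hypothetical stand-ins, and removing the stored entry once its mate arrives is an assumption inferred from the even/odd behavior described above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for SamEntry; only what is needed for pairing.
struct MiniEntry {
    std::string qname;
    int id;  // illustrative payload to tell the mates apart
};

// Pairs entries by QNAME in the order they arrive.
class MatePairer {
public:
    // Returns true and fills first/second if `e` completes a pair.
    // Otherwise stores `e` as pending and returns false.
    bool add(const MiniEntry &e, MiniEntry &first, MiniEntry &second) {
        auto it = pending_.find(e.qname);
        if (it == pending_.end()) {
            pending_.emplace(e.qname, e);
            return false;
        }
        first = it->second;  // the mate that was read earlier
        second = e;          // the mate that was just read
        pending_.erase(it);  // assumption: a paired entry is removed
        return true;
    }

    // Number of entries still waiting for a mate (dangling entries).
    std::size_t pending() const { return pending_.size(); }

private:
    std::unordered_map<std::string, MiniEntry> pending_;
};
```

With this scheme, a third entry for an already paired QNAME is stored as pending again, which is how a dangling entry arises when a QNAME has an odd number of entries.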

void read_sam_header(string &hdr)

Returns the entire SAM header as a string, including all header lines (HD, SQ, RG, PG, etc.).

Parameters:

hdr[out] String to hold the header.

class SamEntry

A class for holding values in a SAM entry.

Takes an entry read from a SAM/BAM file as a string and parses the mandatory and optional fields.

Public Functions

inline SamEntry()

Default constructor. Initializes SAM fields to default values.

SamEntry(const string &line)

Constructs from a SAM/BAM string by calling parse_entry.

Parameters:

line[in] SAM/BAM line to parse.

~SamEntry()

Default destructor.

void parse_entry(const string &line)

Parses a string into SAM/BAM fields. The required fields are stored in the appropriate member. The optional tags are stored as a TAG:TYPE:VALUE string in a vector.

Parameters:

line[in] SAM/BAM line.
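As a rough illustration of this kind of parsing, the sketch below splits a tab-delimited SAM line into the eleven mandatory fields and keeps the remaining columns as TAG:TYPE:VALUE strings. MiniSamEntry, split_tabs, and parse_sam_line are hypothetical names, and the error handling is an assumption rather than the behavior of parse_entry.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in mirroring the documented SamEntry members.
struct MiniSamEntry {
    std::string qname, rname, cigar, rnext, seq, qual;
    unsigned flag = 0, pos = 0, mapq = 0, pnext = 0;
    int tlen = 0;
    std::vector<std::string> tags;
};

// Splits a tab-delimited line into its fields.
std::vector<std::string> split_tabs(const std::string &line) {
    std::vector<std::string> out;
    std::string field;
    std::istringstream ss(line);
    while (std::getline(ss, field, '\t')) out.push_back(field);
    return out;
}

// Parses the 11 mandatory SAM columns; any further columns become tags.
MiniSamEntry parse_sam_line(const std::string &line) {
    const std::vector<std::string> f = split_tabs(line);
    if (f.size() < 11) {
        throw std::runtime_error("SAM line has fewer than 11 fields");
    }
    MiniSamEntry e;
    e.qname = f[0];
    e.flag = static_cast<unsigned>(std::stoul(f[1]));
    e.rname = f[2];
    e.pos = static_cast<unsigned>(std::stoul(f[3]));
    e.mapq = static_cast<unsigned>(std::stoul(f[4]));
    e.cigar = f[5];
    e.rnext = f[6];
    e.pnext = static_cast<unsigned>(std::stoul(f[7]));
    e.tlen = std::stoi(f[8]);
    e.seq = f[9];
    e.qual = f[10];
    e.tags.assign(f.begin() + 11, f.end());  // TAG:TYPE:VALUE strings
    return e;
}
```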

Public Members

string qname

QNAME. Query template name.

uint16_t flag

FLAG. Bitwise flag. See SamFlags for interpretation.

string rname

RNAME. Reference sequence name.

uint32_t pos

POS. 1-based leftmost mapping position.

uint16_t mapq

MAPQ. Mapping quality. 255 indicates not available.

string cigar

CIGAR. CIGAR string.

string rnext

RNEXT. Reference name of mate.

uint32_t pnext

PNEXT. Position of mate.

int tlen

TLEN. Observed template length.

string seq

SEQ. Segment sequence.

string qual

QUAL. ASCII of Phred-scaled base quality.

vector<string> tags

TAGS. Optional tags.
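Because flag is a bitwise field, individual properties are tested with a bit mask. The sketch below uses the bit values defined in the SAM specification; the constant and function names are hypothetical and may not match those in SamFlags, which is not shown in this section.

```cpp
#include <cstdint>

// FLAG bit values from the SAM specification. The names are
// illustrative and may differ from those used by SamFlags.
namespace sam_bits {
constexpr std::uint16_t kPaired        = 0x1;
constexpr std::uint16_t kProperPair    = 0x2;
constexpr std::uint16_t kUnmapped      = 0x4;
constexpr std::uint16_t kMateUnmapped  = 0x8;
constexpr std::uint16_t kReverse       = 0x10;
constexpr std::uint16_t kMateReverse   = 0x20;
constexpr std::uint16_t kFirstInPair   = 0x40;
constexpr std::uint16_t kSecondInPair  = 0x80;
constexpr std::uint16_t kSecondary     = 0x100;
constexpr std::uint16_t kQcFail        = 0x200;
constexpr std::uint16_t kDuplicate     = 0x400;
constexpr std::uint16_t kSupplementary = 0x800;
}  // namespace sam_bits

// True if the given bit is set in the flag.
inline bool has_flag(std::uint16_t flag, std::uint16_t bit) {
    return (flag & bit) != 0;
}

// True if the entry is neither a secondary nor a supplementary alignment.
inline bool is_primary(std::uint16_t flag) {
    return !has_flag(flag, sam_bits::kSecondary) &&
           !has_flag(flag, sam_bits::kSupplementary);
}
```

A check such as is_primary could be used to skip supplementary alignments before pairing mates by QNAME, since those extra entries break the two-entries-per-pair assumption of read_pe_sam.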

Fastq reader

class FastqReader

FASTQ file reader.

A class to read and parse single-end and paired-end FASTQ files.

Public Functions

FastqReader(const std::string &in_file)

Opens a single-end FASTQ file. Throws a runtime error if the file cannot be opened.

Parameters:

in_file[in] FASTQ file name

FastqReader(const std::string &in_file_1, const std::string &in_file_2)

Opens a pair of paired-end FASTQ files. Throws a runtime error if either file cannot be opened.

Parameters:
  • in_file_1[in] first FASTQ file name

  • in_file_2[in] second FASTQ file name

~FastqReader()

Closes any open files.

bool read_se_entry(FastqEntry &e)

Reads an entry from a single-end FASTQ file and populates a FastqEntry.

Parameters:

e[out] FastqEntry to populate with the FASTQ entry

Returns:

True on successfully reading a FASTQ entry. False if end of file is reached.

bool read_pe_entry(FastqEntry &e1, FastqEntry &e2)

Reads a pair of entries from paired-end FASTQ files and populates two FastqEntry objects.

Parameters:
  • e1[out] FastqEntry to populate with the first FASTQ entry

  • e2[out] FastqEntry to populate with the second FASTQ entry

Returns:

True on successfully reading a pair of FASTQ entries. False if end of file is reached.
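For orientation, reading FASTQ records in lockstep from two inputs might look like the sketch below. MiniFastqEntry and both functions are hypothetical stand-ins that work on any std::istream. Throwing on truncated records or on files of unequal length is an assumption, not documented behavior of FastqReader.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for FastqEntry.
struct MiniFastqEntry {
    std::string name, seq, qual;
};

// Reads one 4-line FASTQ record. Returns false at a clean end of input.
bool read_fastq_record(std::istream &in, MiniFastqEntry &e) {
    std::string header, plus;
    if (!std::getline(in, header)) return false;
    if (!std::getline(in, e.seq) || !std::getline(in, plus) ||
        !std::getline(in, e.qual)) {
        throw std::runtime_error("truncated FASTQ record");
    }
    if (header.empty() || header[0] != '@' || plus.empty() || plus[0] != '+') {
        throw std::runtime_error("malformed FASTQ record");
    }
    e.name = header.substr(1);  // drop the leading '@'
    return true;
}

// Reads one record from each input so that mates stay in step.
bool read_fastq_pair(std::istream &in1, std::istream &in2,
                     MiniFastqEntry &e1, MiniFastqEntry &e2) {
    const bool ok1 = read_fastq_record(in1, e1);
    const bool ok2 = read_fastq_record(in2, e2);
    if (ok1 != ok2) {
        throw std::runtime_error("FASTQ inputs have different record counts");
    }
    return ok1;
}
```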

BED reader

class BedReader

BED file reader.

A class to read and parse BED files. Supports reading BED3 entries (chrom, start, end) and variable-column BED entries where additional columns are returned as a vector of strings.

Public Functions

BedReader(const string &in_file)

Opens a BED file. Throws a runtime error if the file cannot be opened.

Parameters:

in_file[in] BED file name.

~BedReader()

Closes the BED file.

bool read_bed3_line(GenomicRegion &g)

Reads the next BED3 (chrom, start, end) line from the file and populates a GenomicRegion.

Parameters:

g[out] GenomicRegion to populate.

Returns:

True on successfully reading a BED3 entry. False if end of file is reached.

bool read_bed_line(GenomicRegion &g, std::vector<std::string> &fields)

Reads the next BED line and populates a GenomicRegion with the first 3 columns (chrom, start, end). Any remaining columns are returned in a vector of strings. If the line has only 3 columns, the fields vector will be empty.

Parameters:
  • g[out] GenomicRegion to populate with chrom, start, end.

  • fields[out] Vector of strings containing any additional columns beyond the first three.

Returns:

True on successfully reading a BED entry. False if end of file is reached.
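The split between the first three columns and the remaining fields can be sketched as below. MiniRegion and parse_bed_line are hypothetical stand-ins for GenomicRegion and the reader's internal parsing. Tab-delimited input and throwing on short lines are assumptions.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for GenomicRegion.
struct MiniRegion {
    std::string chrom;
    unsigned long start = 0;
    unsigned long end = 0;
};

// Fills `g` from the first three columns and `fields` from the rest.
// `fields` is left empty for a plain BED3 line.
void parse_bed_line(const std::string &line, MiniRegion &g,
                    std::vector<std::string> &fields) {
    std::istringstream ss(line);
    std::string start, end;
    if (!std::getline(ss, g.chrom, '\t') || !std::getline(ss, start, '\t') ||
        !std::getline(ss, end, '\t')) {
        throw std::runtime_error("BED line has fewer than 3 columns");
    }
    g.start = std::stoul(start);
    g.end = std::stoul(end);
    fields.clear();
    std::string extra;
    while (std::getline(ss, extra, '\t')) fields.push_back(extra);
}
```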

void read_bed3_file(std::vector<GenomicRegion> &g)

Reads the entire BED file and populates a vector of GenomicRegion with the BED3 (chrom, start, end) fields.

Parameters:

g[out] Vector of GenomicRegions to populate.

GTF reader

class GtfReader

GTF reader.

A class for reading and parsing GTF files.

Public Functions

GtfReader(const string &in_file)

Opens a GTF file. Throws a runtime error if the file cannot be opened.

Parameters:

in_file[in] GTF file name

~GtfReader()

Closes the GTF file.

bool read_gtf_line(GtfEntry &g)

Reads a line from a GTF file, parses it, and populates a GtfEntry.

Parameters:

g[out] GtfEntry to populate.

Returns:

True on successfully reading a GTF entry. False if end of file is reached.

void read_gtf_file(vector<GtfEntry> &g)

Reads a GTF file, parses each line, and populates a vector of GtfEntry. Note that read_gtf_file and read_gtf_line use the same file handler, so if this is called after calls to read_gtf_line, it reads from the next unread line to the end of the file.

Parameters:

g[out] Vector of GtfEntry to populate.
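The effect of sharing one file handler between a line-by-line read and a whole-file read can be illustrated with a plain stream. LineReader is a hypothetical stand-in, not GtfReader itself. It only shows that a bulk read continues from wherever earlier single-line reads stopped.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical reader with one shared stream for both read styles.
class LineReader {
public:
    explicit LineReader(std::istream &in) : in_(in) {}

    // Reads the next line. Returns false at end of input.
    bool read_line(std::string &line) {
        return static_cast<bool>(std::getline(in_, line));
    }

    // Appends every remaining line, starting at the current position.
    void read_rest(std::vector<std::string> &lines) {
        std::string line;
        while (read_line(line)) lines.push_back(line);
    }

private:
    std::istream &in_;  // shared position: both methods advance it
};
```

If the whole file is needed, the bulk read should therefore be the first call made on a freshly constructed reader.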

bool read_gencode_gtf_line(GencodeGtfEntry &g)
void read_gencode_gtf_file(vector<GencodeGtfEntry> &g)