Read a minimap/minimap2 .paf file including optional tagged extra fields. The optional fields will be parsed into a tidy format, one column per tag.
Arguments
- file
Either a path to a file, a connection, or literal data (either a single string or a raw vector).
Files ending in
.gz
,.bz2
,.xz
, or.zip
will be automatically uncompressed. Files starting withhttp://
,https://
,ftp://
, orftps://
will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed.Literal data is most useful for examples and tests. To be recognised as literal data, the input must be either wrapped with
I()
, be a string containing at least one new line, or be a vector containing at least one string with a new line.Using a value of
clipboard()
will read from the system clipboard.maximum number of optional fields to include
- col_names
column names to use. Defaults to
def_names("gff3")
(seedef_names
).- col_types
column types to use. Defaults to
def_types("gff3")
(seedef_types
).- ...
additional parameters, passed to
read_tsv
Details
Because readr::read_tsv
expects a fixed number of columns, but in .paf the
number of optional fields can differ among records, read_paf
tries to read
at least as many columns as the longest record has (max_tags
). The
resulting warnings for each record with fewer fields of the form "32 columns
expected, only 22 seen" should thus be ignored.
From the minimap2 manual
+—-+——–+———————————————————+ |Col | Type | Description | +—-+——–+———————————————————+ | 1 | string | Query sequence name | | 2 | int | Query sequence length | | 3 | int | Query start coordinate (0-based) | | 4 | int | Query end coordinate (0-based) | | 5 | char | ‘+’ if query/target on the same strand; ‘-’ if opposite | | 6 | string | Target sequence name | | 7 | int | Target sequence length | | 8 | int | Target start coordinate on the original strand | | 9 | int | Target end coordinate on the original strand | | 10 | int | Number of matching bases in the mapping | | 11 | int | Number bases, including gaps, in the mapping | | 12 | int | Mapping quality (0-255 with 255 for missing) | +—-+——–+———————————————————+
+—-+——+——————————————————-+ |Tag | Type | Description | +—-+——+——————————————————-+ | tp | A | Type of aln: P/primary, S/secondary and I,i/inversion | | cm | i | Number of minimizers on the chain | | s1 | i | Chaining score | | s2 | i | Chaining score of the best secondary chain | | NM | i | Total number of mismatches and gaps in the alignment | | MD | Z | To generate the ref sequence in the alignment | | AS | i | DP alignment score | | ms | i | DP score of the max scoring segment in the alignment | | nn | i | Number of ambiguous bases in the alignment | | ts | A | Transcript strand (splice mode only) | | cg | Z | CIGAR string (only in PAF) | | cs | Z | Difference string | | dv | f | Approximate per-base sequence divergence | +—-+——+——————————————————-+
From https://samtools.github.io/hts-specs/SAMtags.pdf type may be one of A (character), B (general array), f (real number), H (hexadecimal array), i (integer), or Z (string).