Type: | Package |
Title: | In-Depth Characterization and Analysis of Mutational Signatures ('ICAMS') |
Version: | 3.0.11 |
Description: | Analysis and visualization of experimentally elucidated mutational signatures – the kind of analysis and visualization in Boot et al., "In-depth characterization of the cisplatin mutational signature in human cell lines and in esophageal and liver tumors", Genome Research 2018, <doi:10.1101/gr.230219.117> and "Characterization of colibactin-associated mutational signature in an Asian oral squamous cell carcinoma and in other mucosal tumor types", Genome Research 2020 <doi:10.1101/gr.255620.119>. 'ICAMS' stands for In-depth Characterization and Analysis of Mutational Signatures. 'ICAMS' has functions to read in variant call files (VCFs) and to collate the corresponding catalogs of mutational spectra and to analyze and plot catalogs of mutational spectra and signatures. Handles both "counts-based" and "density-based" (i.e. representation as mutations per megabase) mutational spectra or signatures. |
License: | GPL-3 | file LICENSE |
URL: | https://github.com/steverozen/ICAMS |
BugReports: | https://github.com/steverozen/ICAMS/issues |
Encoding: | UTF-8 |
LazyData: | true |
Language: | en-US |
Imports: | Biostrings, BSgenome, data.table, dplyr, fuzzyjoin, GenomeInfoDb, GenomicRanges, graphics, grDevices, IRanges, lifecycle, RColorBrewer, stats, stringi, utils, zip |
Depends: | R (≥ 3.5) |
RoxygenNote: | 7.3.2 |
Suggests: | BSgenome.Hsapiens.1000genomes.hs37d5, BSgenome.Hsapiens.UCSC.hg38, BSgenome.Mmusculus.UCSC.mm10, ggplot2, reshape2, rlang, testthat |
NeedsCompilation: | no |
Packaged: | 2025-06-14 02:31:56 UTC; steve |
Author: | Steve Rozen [aut, cre], Nanhai Jiang [aut], Arnoud Boot [aut], Mo Liu [aut], Yang Wu [aut], Mi Ni Huang [aut], Jia Geng Chang [aut] |
Maintainer: | Steve Rozen <steverozen@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-06-15 00:30:15 UTC |
Add and check DBS class in an annotated VCF with the corresponding DBS mutation matrix
Description
Add and check DBS class in an annotated VCF with the corresponding DBS mutation matrix
Usage
AddAndCheckDBSClassInVCF(vcf, mat78, mat136, mat144 = NULL, sample.id)
Arguments
vcf |
An in-memory VCF file annotated with sequence context and
transcript information by function |
mat78 |
The DBS78 mutation count matrix. |
mat136 |
The DBS136 mutation count matrix. |
mat144 |
The DBS144 mutation count matrix. |
sample.id |
Usually the sample id, but defaults to "count". |
Value
The original vcf with three additional columns DBS78.class, DBS136.class and DBS144.class added.
Add and check SBS class in an annotated VCF with the corresponding SBS mutation matrix
Description
Add and check SBS class in an annotated VCF with the corresponding SBS mutation matrix
Usage
AddAndCheckSBSClassInVCF(vcf, mat96, mat1536, mat192 = NULL, sample.id)
Arguments
vcf |
An in-memory VCF file annotated with sequence context and
transcript information by function |
mat96 |
The SBS96 mutation count matrix. |
mat1536 |
The SBS1536 mutation count matrix. |
mat192 |
The SBS192 mutation count matrix. |
sample.id |
Usually the sample id, but defaults to "count". |
Value
The original vcf with three additional columns SBS96.class, SBS192.class and SBS1536.class added.
Add DBS mutation class to an annotated DBS VCF
Description
Add DBS mutation class to an annotated DBS VCF
Usage
AddDBSClass(vcf)
Arguments
vcf |
An in-memory VCF file annotated with sequence context and
transcript information by function |
Value
The original vcf with three additional columns DBS78.class, DBS136.class and DBS144.class added.
Create a run-information text file when generating a zip archive from VCF files.
Description
Create a run-information text file when generating a zip archive from VCF files.
Usage
AddRunInformation(
files,
vcf.names,
zipfile.name,
vcftype,
ref.genome,
region,
mutation.loads,
strand.bias.statistics,
tmpdir
)
Add SBS mutation class to an annotated SBS VCF
Description
Add SBS mutation class to an annotated SBS VCF
Usage
AddSBSClass(vcf)
Arguments
vcf |
An in-memory VCF file annotated with sequence context and
transcript information by function |
Value
The original vcf with three additional columns SBS96.class, SBS192.class and SBS1536.class added.
Add sequence context to a data frame with mutation records
Description
Add sequence context to a data frame with mutation records
Usage
AddSeqContext(df, ref.genome, seq.context.width = 10, name.of.VCF = NULL)
Arguments
df |
An input data frame storing mutation records of a VCF file. |
ref.genome |
A |
seq.context.width |
The number of preceding and following bases to be
extracted around the mutated position from |
Value
A copy of the input data.frame with a new column added that contains sequence context information.
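As a rough illustration of what a sequence context of width seq.context.width means, the sketch below extracts the flanking bases from a toy reference held in a plain string. The helper add_seq_context_sketch, the toy reference and the generated column name are assumptions made for illustration only; the package itself takes a reference genome argument rather than a string.

```r
# Hypothetical helper, not part of ICAMS: add a context column to a data frame
# of mutation records, using a toy reference sequence stored as one string.
add_seq_context_sketch <- function(df, ref.seq, seq.context.width = 10) {
  # Total width is the mutated base plus seq.context.width bases on each side.
  col.name <- paste0("seq.", 2 * seq.context.width + 1, "bases")
  df[[col.name]] <- substr(rep(ref.seq, nrow(df)),
                           df$POS - seq.context.width,
                           df$POS + seq.context.width)
  df
}

toy.ref <- "ACGTACGTACGT"
toy.df  <- data.frame(POS = c(5, 8))
out <- add_seq_context_sketch(toy.df, toy.ref, seq.context.width = 2)
print(out)
```

With a width of 2 the new column holds 5 bases per record, and the central base of each context is the reference base at POS.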
Add transcript information to a data frame with mutation records
Description
Add transcript information to a data frame with mutation records
Usage
AddTranscript(df, trans.ranges = NULL, ref.genome, name.of.VCF = NULL)
Arguments
df |
A data frame storing mutation records of a VCF file. |
trans.ranges |
A |
ref.genome |
A |
name.of.VCF |
Name of the VCF file. |
Value
A data frame with new columns added to the input data frame, which contain the mutated gene's name, range and strand information.
Add sequence context and transcript information to an in-memory DBS VCF
Description
Add sequence context and transcript information to an in-memory DBS VCF
Usage
AnnotateDBSVCF(DBS.vcf, ref.genome, trans.ranges = NULL, name.of.VCF = NULL)
Arguments
DBS.vcf |
An in-memory DBS VCF as a |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
name.of.VCF |
Name of the VCF file. |
Value
An in-memory DBS VCF as a data.table. This has been annotated with the sequence context (column name seq.21bases) and with transcript information in the form of a gene symbol (e.g. "TP53") and transcript strand. This information is in the columns trans.start.pos, trans.end.pos, trans.strand, trans.Ensembl.gene.ID and trans.gene.symbol in the output. These columns are not added if is.null(trans.ranges).
Examples
file <- c(system.file("extdata/Strelka-SBS-vcf",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka")
DBS.vcf <- list.of.vcfs$DBS[[1]]
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
annotated.DBS.vcf <- AnnotateDBSVCF(DBS.vcf, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37)}
Add sequence context and transcript information to an in-memory ID (insertion/deletion) VCF, and confirm that they match the given reference genome
Description
Add sequence context and transcript information to an in-memory ID (insertion/deletion) VCF, and confirm that they match the given reference genome
Usage
AnnotateIDVCF(
ID.vcf,
ref.genome,
trans.ranges = NULL,
flag.mismatches = 0,
name.of.VCF = NULL,
suppress.discarded.variants.warnings = TRUE
)
Arguments
ID.vcf |
An in-memory ID (insertion/deletion) VCF as a
|
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
flag.mismatches |
Deprecated. If there are ID variants whose |
name.of.VCF |
Name of the VCF file. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Value
A list of elements:
- annotated.vcf: The original VCF data frame with two new columns added to the input data frame:
  - seq.context: The sequence embedding the variant.
  - seq.context.width: The width of seq.context to the left.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Examples
file <- c(system.file("extdata/Strelka-ID-vcf/",
"Strelka.ID.GRCh37.s1.vcf",
package = "ICAMS"))
ID.vcf <- ReadAndSplitVCFs(file, variant.caller = "strelka")$ID[[1]]
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
list <- AnnotateIDVCF(ID.vcf, ref.genome = "hg19")
annotated.ID.vcf <- list$annotated.vcf}
Add sequence context and transcript information to an in-memory SBS VCF
Description
Add sequence context and transcript information to an in-memory SBS VCF
Usage
AnnotateSBSVCF(SBS.vcf, ref.genome, trans.ranges = NULL, name.of.VCF = NULL)
Arguments
SBS.vcf |
An in-memory SBS VCF as a |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
name.of.VCF |
Name of the VCF file. |
Value
An in-memory SBS VCF as a data.table. This has been annotated with the sequence context (column name seq.21bases) and with transcript information in the form of a gene symbol (e.g. "TP53") and transcript strand. This information is in the columns trans.start.pos, trans.end.pos, trans.strand, trans.Ensembl.gene.ID and trans.gene.symbol in the output. These columns are not added if is.null(trans.ranges).
Examples
file <- c(system.file("extdata/Strelka-SBS-vcf",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka")
SBS.vcf <- list.of.vcfs$SBS[[1]]
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37)}
Calculate base counts from three mer abundance
Description
Calculate base counts from three mer abundance
Usage
CalBaseCountsFrom3MerAbundance(three.mer.abundance)
Calculate the number of spaces needed to add strand bias statistics to the run-information.txt file.
Description
Calculate the number of spaces needed to add strand bias statistics to the run-information.txt file.
Usage
CalculateNumberOfSpace(list)
Arguments
list |
A list containing strand bias statistics. |
Value
A matrix containing the space information.
Given a deletion and its sequence context, categorize it
Description
This function is primarily for internal use, but we export it to document the underlying logic.
Usage
Canonicalize1Del(context, del.seq, pos, trace = 0)
Arguments
context |
The deleted sequence plus ample surrounding
sequence on each side (at least as long as |
del.seq |
The deleted sequence in |
pos |
The position of |
trace |
If > 0, then generate messages tracing how the computation is carried out. |
Details
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on deletion mutation classification.
This function first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Value
A string that is the canonical representation of the given deletion type. Returns NA and raises a warning if there is an un-normalized representation of the deletion of a repeat unit. See FindDelMH for details. (This seems to be very rare.)
Examples
Canonicalize1Del("xyAAAqr", del.seq = "A", pos = 3) # "DEL:T:1:2"
Canonicalize1Del("xyAAAqr", del.seq = "A", pos = 4) # "DEL:T:1:2"
Canonicalize1Del("xyAqr", del.seq = "A", pos = 3) # "DEL:T:1:0"
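In the three examples above, the last field of the returned string agrees with the number of copies of the deleted unit that remain next to the deletion (2, 2 and 0). The sketch below counts those remaining copies in base R. It is a simplified illustration, not the package's implementation, and count_repeat_copies_sketch is a hypothetical name.

```r
# Hypothetical helper, not part of ICAMS: count how many tandem copies of
# del.seq remain in context immediately left and right of the deleted copy.
count_repeat_copies_sketch <- function(context, del.seq, pos) {
  unit.len <- nchar(del.seq)
  # The deleted sequence must actually occur at pos in the context.
  stopifnot(substr(context, pos, pos + unit.len - 1) == del.seq)
  copies <- 0L
  # Walk rightwards, one repeat unit at a time.
  start <- pos + unit.len
  while (substr(context, start, start + unit.len - 1) == del.seq) {
    copies <- copies + 1L
    start <- start + unit.len
  }
  # Walk leftwards, one repeat unit at a time.
  start <- pos - unit.len
  while (start >= 1 &&
         substr(context, start, start + unit.len - 1) == del.seq) {
    copies <- copies + 1L
    start <- start - unit.len
  }
  copies
}

count_repeat_copies_sketch("xyAAAqr", del.seq = "A", pos = 3)
count_repeat_copies_sketch("xyAAAqr", del.seq = "A", pos = 4)
count_repeat_copies_sketch("xyAqr", del.seq = "A", pos = 3)
count_repeat_copies_sketch("ttCACACACAgg", del.seq = "CA", pos = 3)
```

Counting in both directions makes the result independent of which copy of the repeat the variant caller reported as deleted.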
Given a single insertion or deletion in context, categorize it.
Description
Given a single insertion or deletion in context, categorize it.
Usage
Canonicalize1ID(context, ref, alt, pos, trace = 0)
Arguments
context |
Ample surrounding sequence on each side of the insertion or deletion. |
ref |
The reference allele (vector of length 1) |
alt |
The alternative allele (vector of length 1) |
pos |
The position of |
trace |
If > 0, then generate messages tracing how the computation is carried out. |
Value
A string that is the canonical representation of the type of the given insertion or deletion. Returns NA and raises a warning if there is an un-normalized representation of the deletion of a repeat unit. See FindDelMH for details. (This seems to be very rare.)
Given an insertion and its sequence context, categorize it.
Description
Given an insertion and its sequence context, categorize it.
Usage
Canonicalize1INS(context, ins.sequence, pos, trace = 0)
Arguments
context |
The deleted sequence plus ample surrounding
sequence on each side (at least as long as |
ins.sequence |
The inserted sequence in |
pos |
The position of |
trace |
If > 0, then generate messages tracing how the computation is carried out. |
Value
A string that is the canonical representation of the given insertion type.
Determine the mutation types of insertions and deletions.
Description
Determine the mutation types of insertions and deletions.
Usage
CanonicalizeID(context, ref, alt, pos)
Arguments
context |
A vector of ample surrounding sequence on each side of the variants |
ref |
Vector of reference alleles |
alt |
Vector of alternative alleles |
pos |
Vector of the positions of the insertions and deletions in
|
Value
A vector of strings that are the canonical representations of the given insertions and deletions.
Standard order of row names in a catalog
Description
This data is designed for those who need to create their own catalogs from formats not supported by this package. The rownames denote the mutation types. For example, for SBS96 catalogs, the rowname AGAT represents a mutation from AGA > ATA.
Usage
catalog.row.order
Format
A list of character vectors indicating the standard orders of row names in catalogs.
An object of class list of length 9.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+. In ID83 catalogs, deletion repeat sizes range from 0 to 5.
Examples
catalog.row.order$SBS96
# "ACAA" "ACCA" "ACGA" "ACTA" "CCAA" "CCCA" "CCGA" "CCTA" ...
# There are altogether 96 row names to denote the mutation types
# in SBS96 catalog.
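The four-character row names can be unpacked mechanically: the first three characters are the reference trinucleotide and the fourth is the base that replaces the central one. A minimal sketch follows, using a hypothetical helper name that is not part of the package.

```r
# Hypothetical helper, not part of ICAMS: turn a four-character row name such
# as "AGAT" into a readable "AGA > ATA" description.
decode_sbs_rowname_sketch <- function(rn) {
  stopifnot(all(nchar(rn) == 4))
  ref.context <- substr(rn, 1, 3)
  alt.base    <- substr(rn, 4, 4)
  # Replace the central base of the trinucleotide with the alternative base.
  alt.context <- paste0(substr(rn, 1, 1), alt.base, substr(rn, 3, 3))
  paste(ref.context, ">", alt.context)
}

decode_sbs_rowname_sketch(c("AGAT", "ACAA"))
```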
Check and, if possible, correct the chromosome names in a VCF data.frame.
Description
Check and, if possible, correct the chromosome names in a VCF data.frame.
Usage
CheckAndFixChrNames(vcf.df, ref.genome, name.of.VCF = NULL)
Arguments
vcf.df |
A VCF as a |
ref.genome |
The reference genome with the chromosome names to check
|
name.of.VCF |
Name of the VCF file. |
Value
If the vcf.df$CHROM values are correct or can be corrected, then a vector of chromosome names that can be used as a replacement for vcf.df$CHROM. If the names in vcf.df$CHROM cannot be made to be consistent with the chromosome names in ref.genome, then stop.
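The check-then-fix-or-stop behaviour can be pictured with a small base R sketch. It assumes, purely for illustration, that the only possible repair is adding or removing a "chr" prefix; the package function may handle other naming differences. The helper name fix_chr_names_sketch is hypothetical.

```r
# Hypothetical helper, not part of ICAMS: return chromosome names that are
# consistent with the reference, or stop if no simple repair works.
fix_chr_names_sketch <- function(vcf.chrom, ref.chrom) {
  # Case 1: the names are already consistent.
  if (all(vcf.chrom %in% ref.chrom)) return(vcf.chrom)
  # Case 2 (assumed repair): the reference uses a "chr" prefix.
  with.prefix <- paste0("chr", vcf.chrom)
  if (all(with.prefix %in% ref.chrom)) return(with.prefix)
  # Case 3 (assumed repair): the VCF uses a "chr" prefix, the reference does not.
  without.prefix <- sub("^chr", "", vcf.chrom)
  if (all(without.prefix %in% ref.chrom)) return(without.prefix)
  stop("Chromosome names cannot be made consistent with the reference")
}

fix_chr_names_sketch(c("1", "2", "X"), ref.chrom = c("chr1", "chr2", "chrX"))
fix_chr_names_sketch(c("chr1", "chrX"), ref.chrom = c("1", "2", "X"))
```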
Check and, if possible, correct the chromosome names in a trans.ranges data.table
Description
Check and, if possible, correct the chromosome names in a trans.ranges data.table
Usage
CheckAndFixChrNamesForTransRanges(
trans.ranges,
vcf.df,
ref.genome,
name.of.VCF = NULL
)
Arguments
trans.ranges |
A |
vcf.df |
A VCF as a |
ref.genome |
The reference genome with the chromosome names to check
|
name.of.VCF |
Name of the VCF file. |
Value
If the vcf.df$CHROM values are correct or can be corrected, then a vector of chromosome names that can be used as a replacement for trans.ranges$chrom. If the names in vcf.df$CHROM cannot be made to be consistent with the chromosome names in trans.ranges$chrom, then stop.
Check whether the rownames of object are correct; if so, put the rows in the correct order.
Description
Check whether the rownames of object are correct; if so, put the rows in the correct order.
Usage
CheckAndReorderRownames(object)
Check and return DBS catalogs
Description
Check and return DBS catalogs
Usage
CheckAndReturnDBSCatalogs(
catDBS78,
catDBS136,
catDBS144 = NULL,
discarded.variants,
annotated.vcfs
)
Arguments
catDBS78 |
A DBS78 catalog. |
catDBS136 |
A DBS136 catalog. |
catDBS144 |
A DBS144 catalog. |
discarded.variants |
A list of discarded variants. |
annotated.vcfs |
A list of annotated VCFs. |
Value
A list of DBS catalogs. Also returns the discarded variants and annotated VCFs if they exist.
Check and return the DBS mutation matrix
Description
Check and return the DBS mutation matrix
Usage
CheckAndReturnDBSMatrix(
vcf,
discarded.variants,
mat78,
mat136,
mat144 = NULL,
return.annotated.vcf = FALSE,
sample.id = "counts"
)
Arguments
vcf |
An in-memory VCF file annotated with sequence context and
transcript information by function |
discarded.variants |
A |
mat78 |
The DBS78 mutation count matrix. |
mat136 |
The DBS136 mutation count matrix. |
mat144 |
The DBS144 mutation count matrix. |
return.annotated.vcf |
Whether to return the annotated VCF with additional columns showing the mutation class for each variant. Default is FALSE. |
sample.id |
Usually the sample id, but defaults to "count". |
Value
A list of three 1-column matrices with the names catDBS78, catDBS136, and catDBS144. If trans.ranges is NULL, catDBS144 is not generated. Do not rely on the order of elements in the list. If return.annotated.vcf = TRUE, another element annotated.vcf will appear in the list. If there are DBS variants whose tetranucleotide context contains "N", they will be excluded from the analysis and an additional element discarded.variants will appear in the return list.
Check and return ID catalog
Description
Check and return ID catalog
Usage
CheckAndReturnIDCatalog(catID, catID166, discarded.variants, annotated.vcfs)
Arguments
catID |
An ID catalog. |
catID166 |
An ID166 (genic-intergenic indel) catalog. |
discarded.variants |
A list of discarded variants. |
annotated.vcfs |
A list of annotated VCFs. |
Value
A list of ID catalogs. Also returns the discarded variants and annotated VCFs if they exist.
Check and return the ID mutation matrix
Description
Check and return the ID mutation matrix
Usage
CheckAndReturnIDMatrix(
annotated.vcf,
discarded.variants,
ID.mat,
ID166.mat,
return.annotated.vcf = FALSE
)
Arguments
annotated.vcf |
An annotated ID VCF with additional column
|
discarded.variants |
A |
ID.mat |
The ID mutation count matrix. |
ID166.mat |
The ID166 mutation count matrix. |
return.annotated.vcf |
Whether to return |
Value
A list of two 1-column ID matrices containing the mutation catalog information and the annotated VCF with ID categories information added. If some ID variants were excluded from the analysis, an additional element discarded.variants will appear in the return list.
Check and return SBS catalogs
Description
Check and return SBS catalogs
Usage
CheckAndReturnSBSCatalogs(
catSBS96,
catSBS1536,
catSBS192 = NULL,
discarded.variants,
annotated.vcfs
)
Arguments
catSBS96 |
An SBS96 catalog. |
catSBS1536 |
An SBS1536 catalog. |
catSBS192 |
An SBS192 catalog. |
discarded.variants |
A list of discarded variants. |
annotated.vcfs |
A list of annotated VCFs. |
Value
A list of SBS catalogs. Also returns the discarded variants and annotated VCFs if they exist.
Check and return the SBS mutation matrix
Description
Check and return the SBS mutation matrix
Usage
CheckAndReturnSBSMatrix(
vcf,
discarded.variants,
mat96,
mat1536,
mat192 = NULL,
return.annotated.vcf = FALSE,
sample.id = "counts"
)
Arguments
vcf |
An in-memory VCF file annotated with sequence context and
transcript information by function |
discarded.variants |
A |
mat96 |
The SBS96 mutation count matrix. |
mat1536 |
The SBS1536 mutation count matrix. |
mat192 |
The SBS192 mutation count matrix. |
return.annotated.vcf |
Whether to return the annotated VCF with additional columns showing the mutation class for each variant. Default is FALSE. |
sample.id |
Usually the sample id, but defaults to "count". |
Value
A list of three 1-column matrices with the names catSBS96, catSBS192, catSBS1536. If transcript information is not available in vcf, catSBS192 is not generated. Do not rely on the order of elements in the list. If return.annotated.vcf = TRUE, another element annotated.vcf will appear in the list. If there are SBS variants whose pentanucleotide context contains "N", they will be excluded from the analysis and an additional element discarded.variants will appear in the return list.
Check DBS mutation class in VCF with the corresponding DBS mutation matrix
Description
Check DBS mutation class in VCF with the corresponding DBS mutation matrix
Usage
CheckDBSClassInVCF(vcf, mat, sample.id)
Arguments
vcf |
An annotated DBS VCF with columns of DBS mutation
classes added by |
mat |
The DBS mutation count matrix. |
sample.id |
Usually the sample id, but defaults to "count". |
Check SBS mutation class in VCF with the corresponding SBS mutation matrix
Description
Check SBS mutation class in VCF with the corresponding SBS mutation matrix
Usage
CheckSBSClassInVCF(vcf, mat, sample.id)
Arguments
vcf |
An annotated SBS VCF with columns of SBS mutation
classes added by |
mat |
The SBS mutation count matrix. |
sample.id |
Usually the sample id, but defaults to "count". |
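One way to picture such a consistency check is to tabulate the mutation classes recorded in the VCF and compare the tallies with the corresponding column of the count matrix. The sketch below does this in base R on toy data. The helper name class_counts_match_sketch and the toy inputs are assumptions, and the package function may report mismatches differently.

```r
# Hypothetical helper, not part of ICAMS: TRUE if the per-class tallies of the
# VCF's mutation classes equal the counts stored in one column of the matrix.
class_counts_match_sketch <- function(classes, mat, sample.id = "count") {
  # Every class in the VCF must be a known row name of the matrix.
  if (anyNA(match(classes, rownames(mat)))) return(FALSE)
  # Tally classes in the row order of the matrix, keeping zero counts.
  observed <- table(factor(classes, levels = rownames(mat)))
  all(as.integer(observed) == mat[, sample.id])
}

toy.mat <- matrix(c(2, 0, 1), ncol = 1,
                  dimnames = list(c("ACAA", "ACCA", "ACGA"), "count"))
class_counts_match_sketch(c("ACAA", "ACGA", "ACAA"), toy.mat)  # consistent
class_counts_match_sketch(c("ACAA", "ACGA"), toy.mat)          # one ACAA missing
```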
Check that the sequence context information is consistent with the value of the column REF.
Description
Check that the sequence context information is consistent with the value of the column REF.
Usage
CheckSeqContextInVCF(vcf, column.to.use)
Arguments
vcf |
In-memory VCF as a data.frame; must be an SBS or DBS VCF. |
column.to.use |
The column name as a string of the column in the VCF with the context information. |
Value
Throws error with location information if the value of REF is inconsistent with the value of seq.21bases. Assumes the first base of the reference allele is at position (size(<context string>)-1)/2, and generates error if this is not an integer. Indices are 1-based.
"Collapse" a catalog
Description
Take a mutational spectrum or signature catalog that is based on a fine-grained set of features (for example, single-nucleotide substitutions in the context of the preceding and following 2 bases) and collapse it to a catalog based on a coarser-grained set of features (for example, single-nucleotide substitutions in the context of the immediately preceding and following bases).
Collapse192CatalogTo96: Collapse an SBS 192 catalog to an SBS 96 catalog.
Collapse1536CatalogTo96: Collapse an SBS 1536 catalog to an SBS 96 catalog.
Collapse144CatalogTo78: Collapse a DBS 144 catalog to a DBS 78 catalog.
Usage
Collapse192CatalogTo96(catalog)
Collapse1536CatalogTo96(catalog)
Collapse144CatalogTo78(catalog)
Arguments
catalog |
A catalog as defined in |
Value
A catalog as defined in ICAMS.
Examples
# Create an SBS192 catalog and collapse it to an SBS96 catalog
object <- matrix(1, nrow = 192, ncol = 1,
dimnames = list(catalog.row.order$SBS192))
catSBS192 <- as.catalog(object, region = "transcript")
catSBS96 <- Collapse192CatalogTo96(catSBS192)
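The idea of collapsing can also be shown on a toy matrix without the package. The sketch below assumes, as an illustration only, that a fine-grained row name is the five-base reference context followed by the alternative base, so dropping the outermost context bases gives the four-character coarse name; counts that share a coarse name are then summed. The helper name collapse_rows_sketch and the toy data are hypothetical.

```r
# Hypothetical helper, not part of ICAMS: collapse rows whose names are
# <5-base context><alt base> to <3-base context><alt base> and sum the counts.
collapse_rows_sketch <- function(mat) {
  rn <- rownames(mat)
  # Keep the central trinucleotide (characters 2 to 4) and the alt base (6).
  coarse <- paste0(substr(rn, 2, 4), substr(rn, 6, 6))
  rowsum(mat, group = coarse)
}

toy <- matrix(c(3, 2, 5), ncol = 1,
              dimnames = list(c("AACAAA", "TACATA", "GACTGT"), "sample1"))
collapsed <- collapse_rows_sketch(toy)
print(collapsed)
```

The first two toy rows share the coarse name ACAA, so their counts are added, and the total number of mutations is unchanged by the collapse.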
Convert an ICAMS Catalog to SigProfiler format
Description
Specifically, the row orders in ICAMS internal format (see ICAMS::catalog.row.order) are converted to headers in SigProfiler format.
Usage
ConvertCatalogToSigProfilerFormat(input.catalog, file, sep = "\t")
Arguments
input.catalog |
Either a character string, in which case this is the
path to a file containing a catalog in |
file |
The path of the file to be written. |
sep |
Separator to use in the output file. |
Details
For SigProfiler formats, please see the links below for:
SBS: https://osf.io/s93d5/wiki/5.%20Output%20-%20SBS/
DBS: https://osf.io/s93d5/wiki/5.%20Output%20-%20DBS/
ID: https://osf.io/s93d5/wiki/5.%20Output%20-%20ID/
Note
This function can only transform SBS96, SBS192, SBS1536, DBS78 and ID ICAMS catalogs to SigProfiler format.
Examples
path <- system.file("extdata",
"strelka.regress.cat.sbs.96.csv",
package = "ICAMS")
catSBS96 <- ReadCatalog(path)
ConvertCatalogToSigProfilerFormat(input.catalog = catSBS96,
file = file.path(tempdir(), "sigproCat.txt"))
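For the SBS96 case, the relabelling itself can be sketched in a few lines. The sketch assumes that SigProfiler writes SBS96 mutation types in the bracketed form A[C>A]A and that the ICAMS row name is the reference trinucleotide followed by the alternative base. The helper name is hypothetical, and the sketch covers only the labels, not the file layout.

```r
# Hypothetical helper, not part of ICAMS: rewrite four-character row names
# such as "ACAA" (ACA > AAA) in bracketed form, "A[C>A]A".
to_bracket_label_sketch <- function(rn) {
  stopifnot(all(nchar(rn) == 4))
  paste0(substr(rn, 1, 1), "[",
         substr(rn, 2, 2), ">", substr(rn, 4, 4), "]",
         substr(rn, 3, 3))
}

to_bracket_label_sketch(c("ACAA", "ACTA", "TCTG"))
```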
Convert an ICAMS SBS96 Catalog to SigProfiler format
Description
Convert an ICAMS SBS96 Catalog to SigProfiler format
Usage
ConvertICAMSCatalogToSigProSBS96(input.catalog, file, sep = "\t")
Arguments
input.catalog |
Either a character string, in which case this is the
path to a file containing a catalog in |
file |
The path of the file to be written. |
sep |
Separator to use in the output file. |
Create dinucleotide abundance
Description
Create dinucleotide abundance
Usage
CreateDinucAbundance(file)
Arguments
file |
Path to the file with the nucleotide abundance information with 2 base pairs. |
Value
A numeric vector whose names indicate 10 different types of 2 base pairs combinations while its values indicate the occurrences of each type.
Create exome transcriptionally stranded regions
Description
Create exome transcriptionally stranded regions
Usage
CreateExomeStrandedRanges(file, trans.ranges)
Arguments
file |
Path to a SureSelect BED file which contains unstranded exome ranges. |
trans.ranges |
A data.table which contains transcript range and strand
information. Please refer to |
Value
A data table which contains chromosome name, start, end position, strand information. It is keyed by chrom, start, and end.
Create the matrix for a DBS catalog for *one* sample from an in-memory VCF.
Description
Create the matrix for a DBS catalog for *one* sample from an in-memory VCF.
Usage
CreateOneColDBSMatrix(vcf, sample.id = "count", return.annotated.vcf = FALSE)
Arguments
vcf |
An in-memory VCF file annotated with sequence context and
transcript information by function |
sample.id |
Usually the sample id, but defaults to "count". |
Value
A list of three 1-column matrices with the names catDBS78, catDBS136, and catDBS144. If trans.ranges is NULL, catDBS144 is not generated. Do not rely on the order of elements in the list. If return.annotated.vcf = TRUE, another element annotated.vcf will appear in the list. If there are DBS variants whose tetranucleotide context contains "N", they will be excluded from the analysis and an additional element discarded.variants will appear in the return list.
Note
DBS 144 catalog only contains mutations in transcribed regions.
Create one column of the matrix for an indel catalog from *one* in-memory VCF.
Description
Create one column of the matrix for an indel catalog from *one* in-memory VCF.
Usage
CreateOneColIDMatrix(
ID.vcf,
SBS.vcf = NULL,
sample.id = "count",
return.annotated.vcf = FALSE
)
Arguments
ID.vcf |
An in-memory VCF as a data.frame annotated by the
One design decision for variant callers is the representation of "complex indels", for example the mutation CAT > GC. Some callers represent this as C>G, A>C, and T>_. Others might represent it as CAT > GC. Multiple issues can arise. In PCAWG, overlapping indel/SBS calls from different callers were included in the indel VCFs. |
SBS.vcf |
This argument defaults to |
sample.id |
Usually the sample id, but defaults to "count". |
Value
A list of two 1-column ID matrices containing the mutation catalog information and the annotated VCF with ID categories information added. If some ID variants were excluded from the analysis, an additional element discarded.variants will appear in the return list.
Create the matrix for an SBS catalog for *one* sample from an in-memory VCF.
Description
Create the matrix for an SBS catalog for *one* sample from an in-memory VCF.
Usage
CreateOneColSBSMatrix(vcf, sample.id = "count", return.annotated.vcf = FALSE)
Arguments
vcf |
An in-memory VCF file annotated with sequence context and
transcript information by function |
sample.id |
Usually the sample id, but defaults to "count". |
return.annotated.vcf |
Whether to return the annotated VCF with additional columns showing the mutation class for each variant. Default is FALSE. |
Value
A list of three 1-column matrices with the names catSBS96, catSBS192, catSBS1536. If transcript information is not available in vcf, catSBS192 is not generated. Do not rely on the order of elements in the list. If return.annotated.vcf = TRUE, another element annotated.vcf will appear in the list. If there are SBS variants whose pentanucleotide context contains "N", they will be excluded from the analysis and an additional element discarded.variants will appear in the return list.
Note
catSBS192 only contains mutations in transcribed regions.
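The shape of the result (a 1-column matrix in a fixed row order, with variants whose context contains "N" set aside) can be sketched in base R. All names below are hypothetical and the toy row order has only three classes; the sketch is not the package's implementation.

```r
# Hypothetical helper, not part of ICAMS: tally mutation classes into a
# 1-column matrix with a fixed row order, discarding variants whose sequence
# context contains "N".
one_col_matrix_sketch <- function(classes, contexts, row.order,
                                  sample.id = "count") {
  keep <- !grepl("N", contexts, fixed = TRUE)
  # factor() with fixed levels keeps classes with zero counts in the result.
  counts <- table(factor(classes[keep], levels = row.order))
  list(catalog = matrix(as.integer(counts), ncol = 1,
                        dimnames = list(row.order, sample.id)),
       n.discarded = sum(!keep))
}

toy.order    <- c("ACAA", "ACCA", "ACGA")
toy.classes  <- c("ACAA", "ACAA", "ACGA", "ACCA")
toy.contexts <- c("AACAT", "TACAG", "GACGN", "NACCA")
res <- one_col_matrix_sketch(toy.classes, toy.contexts, toy.order)
print(res$catalog)
print(res$n.discarded)
```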
Create position probability matrix (PPM) for *one* sample from a Variant Call Format (VCF) file.
Description
Create position probability matrix (PPM) for *one* sample from a Variant Call Format (VCF) file.
Usage
CreateOnePPMFromSBSVCF(vcf, ref.genome, seq.context.width)
Arguments
vcf |
One in-memory data frame of pure SBS mutations – no DBS or 3+BS mutations. |
ref.genome |
A |
seq.context.width |
The number of preceding and following bases to be
extracted around the mutated position from |
Value
A position probability matrix (PPM).
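For readers unfamiliar with the term, a position probability matrix gives, for each position in a set of equal-length sequences, the fraction of sequences carrying each base. A minimal base R sketch on toy contexts follows. The helper name ppm_sketch is hypothetical, and the package function derives the contexts from a VCF and a reference genome rather than taking them directly.

```r
# Hypothetical helper, not part of ICAMS: build a 4 x width matrix of base
# proportions from a character vector of equal-length sequences.
ppm_sketch <- function(contexts) {
  stopifnot(length(unique(nchar(contexts))) == 1)
  bases <- c("A", "C", "G", "T")
  # One row per sequence, one column per position.
  chars <- do.call(rbind, strsplit(contexts, ""))
  ppm <- apply(chars, 2, function(col) {
    as.numeric(table(factor(col, levels = bases))) / length(col)
  })
  rownames(ppm) <- bases
  ppm
}

toy.contexts <- c("ACG", "ACT", "AGG", "TCG")
ppm <- ppm_sketch(toy.contexts)
print(ppm)
```

Each column of the result sums to 1 when every sequence contains only A, C, G and T.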
Create position probability matrices (PPM) from a list of SBS vcfs
Description
Create position probability matrices (PPM) from a list of SBS vcfs
Usage
CreatePPMFromSBSVCFs(list.of.SBS.vcfs, ref.genome, seq.context.width)
Arguments
list.of.SBS.vcfs |
List of in-memory data frames of pure SBS mutations – no DBS or 3+BS mutations. |
ref.genome |
A |
seq.context.width |
The number of preceding and following bases to be
extracted around the mutated position from |
Value
A list of position probability matrices (PPM).
Create pentanucleotide abundance
Description
Create pentanucleotide abundance
Usage
CreatePentanucAbundance(file)
Arguments
file |
Path to the file with the nucleotide abundance information with 5 base pairs. |
Value
A numeric vector whose names indicate 512 different types of 5 base pairs combinations while its values indicate the occurrences of each type.
Create stranded dinucleotide abundance
Description
Create stranded dinucleotide abundance
Usage
CreateStrandedDinucAbundance(file)
Arguments
file |
Path to the file with the nucleotide abundance information with 2 base pairs. |
Value
A numeric vector whose names indicate 16 different types of 2 base pairs combinations while its values indicate the occurrences of each type.
Create stranded trinucleotide abundance
Description
Create stranded trinucleotide abundance
Usage
CreateStrandedTrinucAbundance(file)
Arguments
file |
Path to the file with the nucleotide abundance information with 3 base pairs. |
Value
A numeric vector whose names indicate 64 different types of 3 base pairs combinations while its values indicate the occurrences of each type.
Create tetranucleotide abundance
Description
Create tetranucleotide abundance
Usage
CreateTetranucAbundance(file)
Arguments
file |
Path to the file with the nucleotide abundance information with 4 base pairs. |
Value
A numeric vector whose names indicate 136 different types of 4 base pairs combinations while its values indicate the occurrences of each type.
Create a transcript range file from the raw GFF3 File
Description
Create a transcript range file from the raw GFF3 File
Usage
CreateTransRanges(file)
Arguments
file |
The name/path of the raw GFF3 File, or a complete URL. |
Value
A data table which contains chromosome name, start, end position, strand information and gene name. It is keyed by chrom, start, and end. Only genes that are associated with a CCDS ID are kept for transcriptional strand bias analysis.
Create trinucleotide abundance
Description
Create trinucleotide abundance
Usage
CreateTrinucAbundance(file)
Arguments
file |
Path to the file with the nucleotide abundance information with 3 base pairs. |
Value
A numeric vector whose names indicate 32 different types of 3 base pairs combinations while its values indicate the occurrences of each type.
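The counts quoted for the unstranded abundance vectors (10, 32, 136 and 512 types for 2, 3, 4 and 5 bases) agree with the number of k-mers that remain when each k-mer is treated as equivalent to its reverse complement, while the stranded vectors keep all 4^k types (16 and 64). The sketch below checks this arithmetic in base R. It is an illustration of the counting only, with hypothetical helper names, and it does not reproduce which representative of each pair the package uses as a name.

```r
# Hypothetical helpers, not part of ICAMS.
revcomp_sketch <- function(s) {
  chartr("ACGT", "TGCA", paste(rev(strsplit(s, "")[[1]]), collapse = ""))
}

count_strand_collapsed_kmers <- function(k) {
  bases <- c("A", "C", "G", "T")
  # Enumerate all 4^k k-mers.
  kmers <- do.call(paste0,
                   expand.grid(rep(list(bases), k), stringsAsFactors = FALSE))
  rc <- vapply(kmers, revcomp_sketch, character(1), USE.NAMES = FALSE)
  # Pick one representative per {k-mer, reverse complement} pair.
  representative <- ifelse(kmers <= rc, kmers, rc)
  length(unique(representative))
}

vapply(2:5, count_strand_collapsed_kmers, integer(1))
```

For odd k no k-mer equals its own reverse complement, so the count is 4^k / 2; for even k the 4^(k/2) palindromic k-mers pair with themselves, giving (4^k + 4^(k/2)) / 2.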
Return the length of microhomology at a deletion
Description
Return the length of microhomology at a deletion
Usage
FindDelMH(context, deleted.seq, pos, trace = 0, warn.cryptic = TRUE)
Arguments
context |
The deleted sequence plus ample surrounding
sequence on each side (at least as long as |
deleted.seq |
The deleted sequence in |
pos |
The position of |
trace |
If > 0, then generate various messages showing how the computation is carried out. |
warn.cryptic |
if |
Details
This function is primarily for internal use, but we export it to document the underlying logic.
Example:
GGCTAGTT
aligned to GGCTAGAACTAGTT
with
a deletion represented as:
GGCTAGAACTAGTT
GG------CTAGTT   GGCTAGTT   GG[CTAGAA]CTAGTT
                               ----   ----
Presumed repair mechanism leading to this:
  ....
GGCTAGAACTAGTT
CCGATCTTGATCAA
=>
  ....
GGCTAG      TT
CC      GATCAA
        ....
=>
GGCTAGTT
CCGATCAA
Variant-caller software can represent the same deletion in several different, but completely equivalent, ways.
GGC------TAGTT   GGCTAGTT   GGC[TAGAAC]TAGTT
                              * ---  * ---
GGCT------AGTT   GGCTAGTT   GGCT[AGAACT]AGTT
                              ** --  ** --
GGCTA------GTT   GGCTAGTT   GGCTA[GAACTA]GTT
                              *** -  *** -
GGCTAG------TT   GGCTAGTT   GGCTAG[AACTAG]TT
                              ****   ****
This function finds:
(1) The maximum match of undeleted sequence to the left of the deletion that is identical to the right end of the deleted sequence, and
(2) The maximum match of undeleted sequence to the right of the deletion that is identical to the left end of the deleted sequence.
The microhomology sequence is the concatenation of items (1) and (2).
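The two matches described above can be sketched in base R as follows. This is a minimal illustration of the stated logic, not the FindDelMH source code: the name find_del_mh_sketch is hypothetical, and the sketch omits tracing, warnings and the detection of cryptic repeats (the -1 return described in the Warning section).

```r
# Hypothetical helper: length of microhomology at a deletion,
# following the left and right matches described in the text.
find_del_mh_sketch <- function(context, deleted.seq, pos) {
  n <- nchar(deleted.seq)
  # The deleted sequence must sit at position pos in context.
  stopifnot(substr(context, pos, pos + n - 1) == deleted.seq)
  left.flank  <- substr(context, 1, pos - 1)
  right.flank <- substr(context, pos + n, nchar(context))
  nl <- nchar(left.flank)

  # Left flank, read right to left, against the right end of the
  # deleted sequence.
  left.len <- 0
  while (left.len < min(n, nl) &&
         substr(left.flank, nl - left.len, nl - left.len) ==
         substr(deleted.seq, n - left.len, n - left.len)) {
    left.len <- left.len + 1
  }

  # Right flank, read left to right, against the left end of the
  # deleted sequence.
  right.len <- 0
  while (right.len < min(n, nchar(right.flank)) &&
         substr(right.flank, right.len + 1, right.len + 1) ==
         substr(deleted.seq, right.len + 1, right.len + 1)) {
    right.len <- right.len + 1
  }

  left.len + right.len
}

# The equivalent representations of the same deletion give the same length.
ctx <- "GGCTAGAACTAGTT"
find_del_mh_sketch(ctx, "CTAGAA", 3)  # 4
find_del_mh_sketch(ctx, "TAGAAC", 4)  # 4
find_del_mh_sketch(ctx, "AACTAG", 7)  # 4
```

Summing the two match lengths is what makes the result independent of how the variant caller positioned the deletion.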
Warning
A deletion in a repeat can also be represented in several different ways. A deletion in a repeat is abstractly equivalent to a deletion with microhomology that spans the entire deleted sequence. For example:
GACTAGCTAGTT
GACTA----GTT   GACTAGTT   GACTA[GCTA]GTT
                            *** -*** -
is really a repeat
GACTAG----TT   GACTAGTT   GACTAG[CTAG]TT
                            **** ----
GACT----AGTT   GACTAGTT   GACT[AGCT]AGTT
                            ** --** --
This function only flags these "cryptic repeats" with a -1 return; it does not figure out the repeat extent.
Value
The length of the maximum microhomology of deleted.seq in context.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Examples
# GAGAGG[CTAGAA]CTAGTT
# ---- ----
FindDelMH("GGAGAGGCTAGAACTAGTTAAAAA", "CTAGAA", 8, trace = 0) # 4
# A cryptic repeat
#
# TAAATTATTTATTAATTTATTG
# TAAATTA----TTAATTTATTG = TAAATTATTAATTTATTG
#
# equivalent to
#
# TAAATTATTTATTAATTTATTG
# TAAAT----TATTAATTTATTG = TAAATTATTAATTTATTG
#
# and
#
# TAAATTATTTATTAATTTATTG
# TAAA----TTATTAATTTATTG = TAAATTATTAATTTATTG
FindDelMH("TAAATTATTTATTAATTTATTG", "TTTA", 8, warn.cryptic = FALSE) # -1
Return the number of repeat units in which a deletion is embedded
Description
Return the number of repeat units in which a deletion is embedded
Usage
FindMaxRepeatDel(context, rep.unit.seq, pos)
Arguments
context |
A string that embeds |
rep.unit.seq |
A substring of |
pos |
The position of |
Details
This function is primarily for internal use, but we export it to document the underlying logic.
For example, FindMaxRepeatDel("xyaczt", "ac", 3) returns 0.
If substr(context, pos, pos + nchar(rep.unit.seq) - 1) != rep.unit.seq then stop.
If this function returns 0, then it is necessary to look for microhomology using the function FindDelMH.
Warning
This function depends on the variant caller having "aligned" the deletion within the context of the repeat.
For example, a deletion of CAG in the repeat CTCAGCAGCAGGT can have 3 "aligned" representations as follows:
CT---CAGCAGGT
CTCAG---CAGGT
CTCAGCAG---GT
In these cases this function will return 2. (Please note that the return value does not include the rep.unit.seq in the count.)
However, the same deletion can also have an "unaligned" representation, such as
CTCAGC---AGGT
(a deletion of AGC). In this case this function will return 1 (a deletion of AGC in a 2-element repeat of AGC).
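The effect of aligned versus unaligned representations can be reproduced with a short base R sketch that counts copies of the repeat unit immediately to the left and right of the deleted copy. This is an illustration of the behaviour described here, not the FindMaxRepeatDel source; the name count_flanking_repeats is hypothetical, and the context CTCAGCAGCAGGT is the undeleted sequence implied by the three aligned representations.

```r
# Hypothetical helper: number of copies of rep.unit.seq that flank the copy
# at position pos, not counting that copy itself.
count_flanking_repeats <- function(context, rep.unit.seq, pos) {
  n <- nchar(rep.unit.seq)
  # The repeat unit must sit at position pos in context.
  stopifnot(substr(context, pos, pos + n - 1) == rep.unit.seq)
  count <- 0

  # Walk left in steps of one repeat unit.
  p <- pos - n
  while (p >= 1 && substr(context, p, p + n - 1) == rep.unit.seq) {
    count <- count + 1
    p <- p - n
  }

  # Walk right in steps of one repeat unit.
  p <- pos + n
  while (p + n - 1 <= nchar(context) &&
         substr(context, p, p + n - 1) == rep.unit.seq) {
    count <- count + 1
    p <- p + n
  }
  count
}

ctx <- "CTCAGCAGCAGGT"
count_flanking_repeats(ctx, "CAG", 3)  # 2, aligned representation
count_flanking_repeats(ctx, "CAG", 9)  # 2, aligned representation
count_flanking_repeats(ctx, "AGC", 7)  # 1, unaligned representation
```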
Value
The number of repeat units in which rep.unit.seq is embedded, not including the input rep.unit.seq in the count.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Examples
FindMaxRepeatDel("xyACACzt", "AC", 3) # 1
FindMaxRepeatDel("xyACACzt", "CA", 4) # 0
Return the number of repeat units in which an insertion is embedded.
Description
Return the number of repeat units in which an insertion is embedded.
Usage
FindMaxRepeatIns(context, rep.unit.seq, pos)
Arguments
context |
A string into which |
rep.unit.seq |
The inserted sequence and candidate repeat unit sequence. |
pos |
|
Details
For example
rep.unit.seq = ac
pos = 2
context = xyaczt
return 1

rep.unit.seq = ac
pos = 4
context = xyaczt
return 1

rep.unit.seq = cgct
pos = 2
rep.unit.seq = at
return 0

context = gacacacacg
rep.unit.seq = ac
pos = any of 1, 3, 5, 7, 9
return 4
If substr(context, pos, pos + nchar(rep.unit.seq) - 1) != rep.unit.seq, then stop.
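The listed cases are consistent with counting copies of the inserted sequence that end at pos plus copies that start at pos + 1. The base R sketch below implements that reading only. It is an assumption drawn from the examples and the Value description, not the FindMaxRepeatIns source, it performs no input check, and count_repeats_at_insertion is a hypothetical name.

```r
# Hypothetical helper: copies of rep.unit.seq ending at pos plus copies
# starting at pos + 1 (assumed reading of the examples in the text).
count_repeats_at_insertion <- function(context, rep.unit.seq, pos) {
  n <- nchar(rep.unit.seq)
  count <- 0

  # Copies ending exactly at pos, walking left.
  end <- pos
  while (end - n + 1 >= 1 &&
         substr(context, end - n + 1, end) == rep.unit.seq) {
    count <- count + 1
    end <- end - n
  }

  # Copies starting exactly at pos + 1, walking right.
  start <- pos + 1
  while (start + n - 1 <= nchar(context) &&
         substr(context, start, start + n - 1) == rep.unit.seq) {
    count <- count + 1
    start <- start + n
  }
  count
}

count_repeats_at_insertion("xyaczt", "ac", 2)      # 1
count_repeats_at_insertion("xyaczt", "ac", 4)      # 1
count_repeats_at_insertion("gacacacacg", "ac", 5)  # 4
```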
Value
If the same sequence as rep.unit.seq occurs ending at pos or starting at pos + 1, then the number of repeat units before the insertion; otherwise 0.
Example gene expression data from two cell lines
Description
These data are designed to be used as an example in the functions PlotTransBiasGeneExp and PlotTransBiasGeneExpToPdf.
Usage
gene.expression.data.HepG2
gene.expression.data.MCF10A
Format
A data.table which contains the expression values of genes.
Each of the two datasets is an object of class data.table (inherits from data.frame) with 57736 rows and 4 columns.
Examples
gene.expression.data.HepG2
# Ensembl.gene.ID gene.symbol counts TPM
# ENSG00000000003 TSPAN6 6007 33.922648455
# ENSG00000000005 TNMD 0 0.000000000
# ENSG00000000419 DPM1 4441 61.669371091
# ENSG00000000457 SCYL3 1368 3.334619195
# ENSG00000000460 C1orf112 916 2.416263423
# ... ... ... ...
Generate an empty matrix of k-mer abundance
Description
Generate an empty matrix of k-mer abundance
Usage
GenerateEmptyKmerCounts(k)
Arguments
k |
Length of k-mers (k>=2) |
Value
An empty matrix of k-mer abundance
Generate all possible k-mers of length k.
Description
Generate all possible k-mers of length k.
Usage
GenerateKmer(k)
Arguments
k |
Length of k-mers (k>=2) |
Value
Character vector containing all possible k-mers.
Generate PFMmatrix (Position Frequency Matrix) from a given list of sequences
Description
Generate PFMmatrix (Position Frequency Matrix) from a given list of sequences
Usage
GeneratePlotPFMmatrix(
sequences,
indel.class,
flank.length = 5,
plot.dir = NULL,
plot.title = NULL
)
Arguments
sequences |
A list of strings returned from
|
indel.class |
A single character string that denotes a 1 base pair
insertion or deletion, as taken from |
flank.length |
The length of flanking bases around the position or homopolymer targeted by the indel. |
plot.dir |
If provided, make a dot-line plot for PFMmatrix. |
plot.title |
The title of the dot-line plot |
Value
A matrix recording the frequency of each base (A, C, G, T) on each position of the sequence.
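As an illustration of the kind of matrix described here, a position frequency matrix can be built in base R by tabulating the bases at each position of equal-length sequences. This is a generic sketch and not the GeneratePlotPFMmatrix implementation; pfm_sketch is a hypothetical name, and the sketch ignores indel.class, flank.length and plotting.

```r
# Hypothetical helper: rows are bases A, C, G, T; columns are positions.
pfm_sketch <- function(sequences) {
  sequences <- unlist(sequences)
  # All sequences must have the same length.
  stopifnot(length(unique(nchar(sequences))) == 1)
  # One row per sequence, one column per position.
  chars <- do.call(rbind, strsplit(sequences, ""))
  bases <- c("A", "C", "G", "T")
  pfm <- vapply(seq_len(ncol(chars)), function(j) {
    as.integer(table(factor(chars[, j], levels = bases)))
  }, integer(4))
  dimnames(pfm) <- list(bases, seq_len(ncol(chars)))
  pfm
}

pfm_sketch(c("ACGT", "ACGA", "TCGA"))
```

Each column of the result sums to the number of input sequences.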
Generate reconstructed VCFs from indel (small insertions and deletions) simple file
Description
Generate reconstructed VCFs from indel (small insertions and deletions) simple file
Usage
GenerateVCFsFromIndelSimpleFile1(file, output.dir, max.mc.cores = 1)
Arguments
file |
The name/path of the simple indel file, or a complete URL. |
output.dir |
The directory where the reconstructed VCFs will be saved. |
max.mc.cores |
The maximum number of cores to use. On Microsoft Windows machines it is silently changed to 1. |
Generate reconstructed VCFs from indel (small insertions and deletions) simple files
Description
Generate reconstructed VCFs from indel (small insertions and deletions) simple files
Usage
GenerateVCFsFromIndelSimpleFiles(
files,
output.dir,
num.parallel.files = 1,
mc.cores.per.file = 1
)
Arguments
files |
Character vector of file paths to the indel simple files. |
output.dir |
The directory where the reconstructed VCFs will be saved. |
num.parallel.files |
The (maximum) number of files to run in
parallel. On Microsoft Windows machines it is silently changed to 1. Each
file in turn can require multiple cores, as governed by
|
mc.cores.per.file |
The maximum number of cores to use for each file. On Microsoft Windows machines it is silently changed to 1. |
Get all the sequence contexts of the indels in a given 1 base-pair indel class
Description
Get all the sequence contexts of the indels in a given 1 base-pair indel class
Usage
Get1BPIndelFlanks(sequence, ref, alt, indel.class, flank.length = 5)
Arguments
sequence |
A string from |
ref |
A string from |
alt |
A string from |
indel.class |
A single character string that denotes a 1 base pair
insertion or deletion, as taken from |
flank.length |
The length of flanking bases around the position or homopolymer targeted by the indel. |
Value
A string for the specified sequence and indel.class.
Generate custom k-mer abundance from a given reference genome
Description
Generate custom k-mer abundance from a given reference genome
Usage
GetCustomKmerCounts(k, ref.genome, custom.ranges, filter.path, verbose = FALSE)
Arguments
k |
Length of k-mers (k>=2) |
ref.genome |
A |
custom.ranges |
A keyed data table which has custom ranges information. It
has three columns: chrom, start and end. It should use one-based coordinate
system. You can use the internal function in this package
|
filter.path |
If given, homopolymers will be masked from genome(sequence). Only simple repeat masking is accepted now. |
verbose |
If |
Value
Matrix of the counts of custom k-mer across the ref.genome
Generate k-mer abundance from a given genome
Description
Generate k-mer abundance from a given genome
Usage
GetGenomeKmerCounts(k, ref.genome, filter.path, verbose = FALSE)
Arguments
k |
Length of k-mers (k>=2) |
ref.genome |
A |
filter.path |
If given, homopolymers will be masked from genome(sequence). Only simple repeat masking is accepted now. |
verbose |
If |
Value
Matrix of the counts of each k-mer across the ref.genome
Get mutation loads information from Mutect VCF files.
Description
Get mutation loads information from Mutect VCF files.
Usage
GetMutationLoadsFromMutectVCFs(catalogs)
Arguments
catalogs |
A list generated by calling function
|
Value
A list containing mutation loads information from Mutect VCF files:
- total.variants: Total number of mutations.
- SBS: Number of single base substitutions.
- DBS: Number of double base substitutions.
- ID: Number of small insertions and deletions.
- discarded.variants: Number of other types of mutations which are excluded in the analysis in ICAMS.
Get mutation loads information from Strelka ID VCF files.
Description
Get mutation loads information from Strelka ID VCF files.
Usage
GetMutationLoadsFromStrelkaIDVCFs(catalogs)
Arguments
catalogs |
A list generated by calling function
|
Value
A list containing mutation loads information from Strelka ID VCF files:
- total.variants: Total number of mutations.
- SBS: Number of single base substitutions.
- DBS: Number of double base substitutions.
- ID: Number of small insertions and deletions.
- excluded.variants: Number of other types of mutations which are excluded in the analysis in ICAMS.
Get mutation loads information from Strelka SBS VCF files.
Description
Get mutation loads information from Strelka SBS VCF files.
Usage
GetMutationLoadsFromStrelkaSBSVCFs(catalogs)
Arguments
catalogs |
A list generated by calling function
|
Value
A list containing mutation loads information from Strelka SBS VCF files:
- total.variants: Total number of mutations.
- SBS: Number of single base substitutions.
- DBS: Number of double base substitutions.
- ID: Number of small insertions and deletions.
- discarded.variants: Number of other types of mutations which are excluded in the analysis in ICAMS.
Generate k-mer abundance from given nucleotide sequences
Description
Generate k-mer abundance from given nucleotide sequences
Usage
GetSequenceKmerCounts(sequences, k)
Arguments
sequences |
A vector of nucleotide sequences |
k |
Length of k-mers (k>=2) |
Value
Matrix of the counts of each k-mer inside sequences
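The idea of k-mer counting over a set of sequences can be sketched in base R with a sliding window. This is a generic illustration, not the GetSequenceKmerCounts implementation: count_kmers_sketch is a hypothetical name, and unlike a full abundance matrix it reports only the k-mers that actually occur.

```r
# Hypothetical helper: one-column matrix of counts of the observed k-mers.
count_kmers_sketch <- function(sequences, k) {
  kmers <- as.character(unlist(lapply(sequences, function(s) {
    n <- nchar(s)
    if (n < k) return(character(0))
    # Every window of width k, starting at positions 1 .. n - k + 1.
    starts <- seq_len(n - k + 1)
    substring(s, starts, starts + k - 1)
  })))
  counts <- table(kmers)
  matrix(as.integer(counts), ncol = 1,
         dimnames = list(names(counts), "count"))
}

count_kmers_sketch(c("ACGTAC", "ACG"), k = 2)
```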
Generate stranded k-mer abundance from a given genome and gene annotation file
Description
Generate stranded k-mer abundance from a given genome and gene annotation file
Usage
GetStrandedKmerCounts(
k,
ref.genome,
stranded.ranges,
filter.path,
verbose = FALSE
)
Arguments
k |
Length of k-mers (k>=2) |
ref.genome |
A |
stranded.ranges |
A keyed data table which has stranded ranges information. It has four columns: chrom, start, end and strand. It should use one-based coordinate system. |
filter.path |
If given, homopolymers will be masked from genome(sequence). Only simple repeat masking is accepted now. |
verbose |
If |
Value
Matrix of the counts of each stranded k-mer across the ref.genome
Extract the VAFs (variant allele frequencies) and read depth information from a VCF file
Description
Extract the VAFs (variant allele frequencies) and read depth information from a VCF file
Usage
GetStrelkaVAF(vcf, name.of.VCF = NULL)
GetMutectVAF(vcf, name.of.VCF = NULL, tumor.col.name = NA)
GetFreebayesVAF(vcf, name.of.VCF = NULL)
GetPCAWGConsensusVAF(vcf, mc.cores = 1)
Arguments
vcf |
An in-memory VCF data frame. |
name.of.VCF |
Name of the VCF file. |
tumor.col.name |
Optional. Only applicable to Mutect VCF. Name
or index of the column in Mutect VCF which contains the tumor
sample information. It must have quotation marks if specifying the
column name. If |
mc.cores |
The number of cores to use. Not available on Windows
unless |
Value
The original vcf with two additional columns added which contain the VAF (variant allele frequency) and read depth information.
Note
GetPCAWGConsensusVAF is analogous to GetMutectVAF, calculating VAF and read depth from PCAWG7 consensus VCFs.
Examples
file <- c(system.file("extdata/Strelka-SBS-vcf",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
MakeDataFrameFromVCF <- getFromNamespace("MakeDataFrameFromVCF", "ICAMS")
df <- MakeDataFrameFromVCF(file)
df1 <- GetStrelkaVAF(df)
Generate Haplotype plot from a given list of sequences
Description
Generate Haplotype plot from a given list of sequences
Usage
HaplotypePlot(
sequences,
indel.class,
flank.length = 5,
title = "Haplotype Plot"
)
Arguments
sequences |
A list of strings returned from
|
indel.class |
A single character string that denotes a 1 base pair
insertion or deletion, as taken from |
flank.length |
The length of flanking bases around the position or homopolymer targeted by the indel. |
title |
The title of the haplotype plot |
Value
A ggplot2 object
ICAMS: In-depth Characterization and Analysis of Mutational Signatures
Description
Analysis and visualization of experimentally elucidated mutational signatures
– the kind of analysis and visualization in Boot et al., "In-depth
characterization of the cisplatin mutational signature in human cell lines
and in esophageal and liver tumors",
Genome Research 2018 https://doi.org/10.1101/gr.230219.117 and
"Characterization of colibactin-associated mutational signature in an
Asian oral squamous cell carcinoma and in other mucosal tumor types",
Genome Research 2020, https://doi.org/10.1101/gr.255620.119.
"ICAMS" stands for In-depth Characterization and
Analysis of Mutational Signatures. "ICAMS" has functions to read in variant
call files (VCFs) and to collate the corresponding catalogs of mutational
spectra and to analyze and plot catalogs of mutational spectra and
signatures.
Details
"ICAMS" can read in VCFs generated by Strelka, Mutect or other variant callers, and collate the mutations into "catalogs" of mutational spectra. "ICAMS" can create and plot catalogs of mutational spectra or signatures for single base substitutions (SBS), doublet base substitutions (DBS), and small insertions and deletions (ID). It can also read and write these catalogs.
Catalogs
A key data type in "ICAMS" is a "catalog" of mutation counts, of mutation densities (see below), or of mutational signatures.
Catalogs are S3 objects of class matrix and one of several additional classes that specify the types of the mutations represented in the catalog. The additional class is one of
- SBS96Catalog (strand-agnostic single base substitutions in trinucleotide context)
- SBS192Catalog (transcription-stranded single-base substitutions in trinucleotide context)
- SBS1536Catalog
- DBS78Catalog
- DBS144Catalog
- DBS136Catalog
- IndelCatalog
- ID166Catalog (genic-intergenic indel catalog)
as.catalog is the main constructor.
Conceptually, a catalog also has one of the following types, indicated by the attribute catalog.type:
- Matrix of mutation counts (one column per sample), representing (counts-based) mutational spectra (catalog.type = "counts").
- Matrix of mutation densities, i.e. mutations per occurrences of source sequences (one column per sample), representing (density-based) mutational spectra (catalog.type = "density").
- Matrix of mutational signatures, which are similar to spectra. However, where spectra consist of counts or densities of mutations in each mutation class (e.g. ACA > AAA, ACA > AGA, ACA > ATA, ACC > AAC, ...), signatures consist of the proportions of mutations in each class (with all the proportions summing to 1). A mutational signature can be based on either:
  - mutation counts (a "counts-based mutational signature", catalog.type = "counts.signature"), or
  - mutation densities (a "density-based mutational signature", catalog.type = "density.signature").
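The relationship between a counts-based spectrum and a counts-based signature can be shown with a toy matrix in base R. This is a sketch only: the three-row matrix stands in for a real catalog, the counts are invented for illustration, and no ICAMS classes or attributes are set.

```r
# Toy counts: rows are mutation classes (ACA > AAA, ACA > AGA, ACA > ATA),
# columns are samples. The numbers are invented.
counts <- matrix(c(10, 30, 60,
                    5,  5, 10),
                 ncol = 2,
                 dimnames = list(c("ACAA", "ACAG", "ACAT"),
                                 c("sample1", "sample2")))

# A signature holds proportions: divide each column by its total.
counts.signature <- sweep(counts, 2, colSums(counts), "/")

counts.signature
colSums(counts.signature)  # each column sums to 1
```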
Catalogs also have the attribute abundance, which contains the counts of different source sequences for mutations. For example, for SBSs in trinucleotide context, the abundances would be the counts of each trinucleotide in the human genome, exome, or in the transcribed region of the genome. See TransformCatalog for more information. Abundances logically depend on the species in question and on the part of the genome being analyzed.
In "ICAMS" abundances can sometimes be inferred from the catalog class attribute and the function arguments region, ref.genome, and catalog.type. Otherwise abundances can be provided as an abundance argument. See all.abundance for examples.
Possible values for region are the strings genome, transcript, exome, and unknown; transcript includes entire transcribed regions, i.e. the introns as well as the exons.
If you need to create a catalog from a source other than this package (i.e. other than with ReadCatalog or VCFsToCatalogs, VCFsToZipFile, etc.), then use as.catalog.
Subscripting catalogs
To subscript specific columns from a catalog, call library(ICAMS) beforehand so that the ICAMS catalog attributes are preserved. Otherwise the ICAMS functions that write or plot catalogs may not work properly.
Creating catalogs from variant call files (VCF files)
* VCFsToCatalogs creates 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and an ID (small insertions and deletions) catalog from the VCFs.
Plotting catalogs
* PlotCatalog function plots the mutational spectrum for one sample or plots one mutational signature.
* PlotCatalogToPdf function plots catalogs of mutational spectra or of mutational signatures to a PDF file.
Wrapper function to create catalogs from VCFs and plot the catalogs to PDF files
* VCFsToCatalogsAndPlotToPdf creates all types of SBS, DBS and ID catalogs from VCFs and plots the catalogs.
Wrapper function to create a zip file which contains catalogs and plot PDFs from VCF files
* VCFsToZipFile creates a zip file which contains SBS, DBS and ID catalogs and plot PDFs from VCF files.
The ref.genome (reference genome) argument
Many functions take the argument ref.genome.
To create a mutational spectrum catalog from a VCF file, "ICAMS" needs the reference genome sequence that matches the VCF file. The ref.genome argument provides this.
ref.genome must be one of
- A variable from the Bioconductor BSgenome package that contains a particular reference genome, for example BSgenome.Hsapiens.1000genomes.hs37d5.
- The strings "hg38" or "GRCh38", which specify BSgenome.Hsapiens.UCSC.hg38.
- The strings "hg19" or "GRCh37", which specify BSgenome.Hsapiens.1000genomes.hs37d5.
- The strings "mm10" or "GRCm38", which specify BSgenome.Mmusculus.UCSC.mm10.
All needed reference genomes must be installed separately by the user.
Further instructions are at
https://bioconductor.org/packages/release/bioc/html/BSgenome.html.
Use of "ICAMS" with reference genomes other than the 2 human genomes and 1 mouse genome specified above is restricted to catalog.type of counts or counts.signature unless the user also creates the necessary abundance vectors. See all.abundance.
Use available.genomes() to get the list of available genomes.
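Because the reference genome packages are installed separately, it can help to check for them before building catalogs. The base R sketch below maps the accepted strings to the package names listed above and tests whether a package is available. It is an illustration under that mapping only; genome.aliases and resolve_genome_pkg are hypothetical names, not part of ICAMS.

```r
# Mapping of accepted ref.genome strings to BSgenome package names,
# as listed in the text above.
genome.aliases <- c(
  hg38   = "BSgenome.Hsapiens.UCSC.hg38",
  GRCh38 = "BSgenome.Hsapiens.UCSC.hg38",
  hg19   = "BSgenome.Hsapiens.1000genomes.hs37d5",
  GRCh37 = "BSgenome.Hsapiens.1000genomes.hs37d5",
  mm10   = "BSgenome.Mmusculus.UCSC.mm10",
  GRCm38 = "BSgenome.Mmusculus.UCSC.mm10"
)

# Hypothetical helper: package name for a ref.genome string.
resolve_genome_pkg <- function(ref.genome) {
  pkg <- unname(genome.aliases[ref.genome])
  if (is.na(pkg)) stop("Unrecognized ref.genome string: ", ref.genome)
  pkg
}

pkg <- resolve_genome_pkg("GRCh37")
if (!requireNamespace(pkg, quietly = TRUE)) {
  message("Reference genome package not installed: ", pkg)
}
```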
Writing catalogs to files
* WriteCatalog function writes a catalog to a file.
Reading catalogs
* ReadCatalog function reads a file that contains a catalog in standardized format.
Transforming catalogs
TransformCatalog function transforms catalogs of mutational spectra or signatures to account for differing abundances of the source sequence of the mutations in the genome.
For example, mutations from ACG are much rarer in the human genome than mutations from ACC simply because CG dinucleotides are rare in the genome. Consequently, there are two possible representations of mutational spectra or signatures. One representation is based on mutation counts as observed in a given genome or exome, and this approach is widely used, as, for example, at https://cancer.sanger.ac.uk/signatures/, which presents signatures based on observed mutation counts in the human genome. We call these "counts-based spectra" or "counts-based signatures".
Alternatively, mutational spectra or signatures can be represented as mutations per source sequence, for example the number of ACT > AGT mutations occurring at all ACT 3-mers in a genome. We call these "density-based spectra" or "density-based signatures".
This function can also transform spectra based on observed genome-wide counts to "density"-based catalogs. In density-based catalogs mutations are expressed as mutations per source sequences. For example, a density-based catalog represents the proportion of ACCs mutated to ATCs, the proportion of ACGs mutated to ATGs, etc. This is different from counts-based mutational spectra catalogs, which contain the number of ACC > ATC mutations, the number of ACG > ATG mutations, etc.
This function can also transform observed-count based spectra or signatures from genome to exome based counts, or between different species (since the abundances of source sequences vary between genome and exome and between species).
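The arithmetic behind these transformations can be sketched in base R with invented numbers. The example assumes the rowname convention described under Data (source trinucleotide followed by the alternate base), uses made-up abundances rather than values from all.abundance, and is not the TransformCatalog implementation, whose scaling and rounding may differ.

```r
# Invented counts for two mutation classes: ACC > ATC and ACG > ATG.
counts <- c(ACCT = 40, ACGT = 10)

# Invented abundances of the source trinucleotides in a "genome"
# and in an "exome".
genome.abundance <- c(ACC = 4000, ACG = 500)
exome.abundance  <- c(ACC = 100,  ACG = 20)

# The source sequence is the first three characters of each rowname.
source.seq <- substr(names(counts), 1, 3)

# Density: mutations per occurrence of the source sequence.
density <- counts / genome.abundance[source.seq]
density  # ACCT 0.01, ACGT 0.02

# Counts expected under the exome abundances at the same density.
inferred.exome.counts <- density * exome.abundance[source.seq]
inferred.exome.counts  # ACCT 1.0, ACGT 0.4
```

Although ACC > ATC has four times as many observed mutations, the density of ACG > ATG is twice as high, which is the distinction between counts-based and density-based spectra.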
Collapsing catalogs
CollapseCatalog function
- Takes a mutational spectrum or signature catalog that is based on a fine-grained set of features (for example, single-nucleotide substitutions in the context of the preceding and following 2 bases).
- Collapses it to a catalog based on a coarser-grained set of features (for example, single-nucleotide substitutions in the context of the immediately preceding and following bases).
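Collapsing amounts to summing the rows of the fine-grained catalog that share the same coarse-grained class. The base R sketch below shows this on a toy matrix. It assumes, for illustration only, that a pentanucleotide-context rowname is the 5-base source sequence followed by the alternate base; it is not the CollapseCatalog implementation and handles no catalog attributes.

```r
# Toy pentanucleotide-context counts (invented). Assumed rowname layout:
# 5-base source sequence followed by the alternate base.
fine <- matrix(c(1, 2, 3, 4), ncol = 1,
               dimnames = list(c("AACAAT", "CACAGT", "GACATT", "TTCGAA"),
                               "sample1"))

# Trinucleotide-context class: the middle 3 bases plus the alternate base.
coarse.class <- paste0(substr(rownames(fine), 2, 4),
                       substr(rownames(fine), 6, 6))

# Sum the rows that fall into the same coarse class.
coarse <- rowsum(fine, group = coarse.class)
coarse
```

The total number of mutations per sample is unchanged by collapsing.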
Data
- CatalogRowOrder: Standard order of rownames in a catalog. The rownames encode the type of each mutation. For example, for SBS96 catalogs, the rowname AGAT represents a mutation from AGA > ATA.
- TranscriptRanges: Transcript ranges and strand information for a particular reference genome.
- all.abundance: The counts of different source sequences for mutations.
- GeneExpressionData: Example gene expression data from two cell lines.
Infer abundance given a matrix-like object and additional information.
Description
Infer abundance given a matrix-like object and additional information.
Usage
InferAbundance(object, ref.genome, region, catalog.type)
Arguments
object |
A numeric matrix, numeric data frame, or |
ref.genome |
A |
region |
A character string designating a genomic region;
see |
catalog.type |
A character string for |
Value
A value that can be set as the abundance attribute of a catalog (which may be NULL if no abundance can be inferred).
These two functions are applicable only to internal ICAMS-formatted catalog objects.
Description
These two functions are applicable only to internal ICAMS-formatted catalog objects.
Usage
InferCatalogClassPrefix(object)
This function converts a data.table imported from an external catalog text file into an ICAMS internal catalog object of the appropriate type.
Description
This function converts a data.table imported from an external catalog text file into an ICAMS internal catalog object of the appropriate type.
Usage
InferCatalogInfo(object)
Infer reference genome name from a character string
Description
Infer reference genome name from a character string
Usage
InferRefGenomeName(ref.genome)
Arguments
ref.genome |
A character string indicating the reference genome. |
Value
The inferred reference genome name.
Infer the correct rownames for a matrix based on its number of rows
Description
Infer the correct rownames for a matrix based on its number of rows
Usage
InferRownames(object)
Test if object is BSgenome.Hsapiens.1000genomes.hs37d5.
Description
Test if object is BSgenome.Hsapiens.1000genomes.hs37d5.
Usage
IsGRCh37(x)
Arguments
x |
Object to test. |
Value
TRUE if x is BSgenome.Hsapiens.1000genomes.hs37d5.
Test if object is BSgenome.Hsapiens.UCSC.hg38.
Description
Test if object is BSgenome.Hsapiens.UCSC.hg38.
Usage
IsGRCh38(x)
Arguments
x |
Object to test. |
Value
TRUE if x is BSgenome.Hsapiens.UCSC.hg38.
Test if object is BSgenome.Mmusculus.UCSC.mm10.
Description
Test if object is BSgenome.Mmusculus.UCSC.mm10.
Usage
IsGRCm38(x)
Arguments
x |
Object to test. |
Value
TRUE if x is BSgenome.Mmusculus.UCSC.mm10.
Check whether an R object contains one of the ICAMS catalog classes
Description
Check whether an R object contains one of the ICAMS catalog classes
Usage
IsICAMSCatalog(object)
Arguments
object |
An R object. |
Value
A logical value.
Examples
# Create a matrix with all values being 1
object <- matrix(1, nrow = 96, ncol = 1,
dimnames = list(catalog.row.order$SBS96))
IsICAMSCatalog(object) # FALSE
# Use as.catalog to add class attribute to object
catalog <- as.catalog(object)
IsICAMSCatalog(catalog) # TRUE
Check whether the BSgenome package is installed
Description
Check whether the BSgenome package is installed
Usage
IsRefGenomeInstalled(ref.genome)
Arguments
ref.genome |
A |
Value
A logical value indicating whether the BSgenome package is installed.
Read in the data lines of a Variant Call Format (VCF) file
Description
Read in the data lines of a Variant Call Format (VCF) file
Usage
MakeDataFrameFromVCF(file)
Arguments
file |
The name/path of the VCF file, or a complete URL. |
Value
A data frame storing mutation records of a VCF file.
MakeVCFDBSdf Take DBS ranges and the original VCF and generate a VCF with dinucleotide REF and ALT alleles.
Description
MakeVCFDBSdf Take DBS ranges and the original VCF and generate a VCF with dinucleotide REF and ALT alleles.
Usage
MakeVCFDBSdf(DBS.range.df, SBS.vcf.dt)
Arguments
DBS.range.df |
Data frame with columns CHROM, LOW, HIGH |
SBS.vcf.dt |
A data table containing the VCF from which
|
Value
A minimal VCF with only the columns CHROM, POS, ID, REF, ALT, VAF, read.depth.
[Deprecated, use VCFsToCatalogs(variant.caller = "mutect") instead] Create SBS, DBS and Indel catalogs from Mutect VCF files
Description
[Deprecated, use VCFsToCatalogs(variant.caller = "mutect") instead]
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the Mutect VCFs specified by files
Usage
MutectVCFFilesToCatalog(
files,
ref.genome,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
tumor.col.names = NA,
flag.mismatches = 0,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
files |
Character vector of file paths to the Mutect VCF files. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Vector of column names or column indices in
VCFs which contain the tumor sample information. The order of elements in
|
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls VCFsToSBSCatalogs, VCFsToDBSCatalogs and VCFsToIDCatalogs.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- catID: Matrix of ID (small insertions and deletions) catalog.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
  - ID: ID VCF annotated by AnnotateIDVCF with one new column ID.class showing the mutation class for each ID variant.
If trans.ranges is not provided by the user and cannot be inferred by ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has attributes added. See as.catalog for more details.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Comments
To add or change attributes of the catalog, you can use the function attr. For example, attr(catalog, "abundance") <- custom.abundance.
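The attribute assignment mentioned in the Comments can be tried on a toy object. This is a minimal sketch in base R: the 4-row matrix and the abundance values are invented for illustration and are not a real ICAMS catalog, which would normally be created by as.catalog or ReadCatalog.

```r
# Toy stand-in for a catalog: a 1-column numeric matrix (hypothetical values).
catalog <- matrix(c(5, 2, 7, 1), ncol = 1,
                  dimnames = list(c("ACAA", "ACCA", "ACGA", "ACTA"), "sample1"))

# Hypothetical custom abundance vector, named by source sequence context.
custom.abundance <- c(ACA = 100, ACC = 80, ACG = 20, ACT = 90)

# Set (or overwrite) the attribute, then read it back.
attr(catalog, "abundance") <- custom.abundance
attr(catalog, "abundance")
```

Setting an attribute this way leaves the matrix contents and dimensions untouched; whether a given abundance vector is valid for a particular catalog class is something as.catalog documents.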
Examples
## Not run:
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <- MutectVCFFilesToCatalog(file, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37,
region = "genome")}
## End(Not run)
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "mutect") instead] Create SBS, DBS and Indel catalogs from Mutect VCF files and plot them to PDF
Description
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "mutect") instead]
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and an Indel catalog from the Mutect VCFs specified by files and plot them to PDF.
Usage
MutectVCFFilesToCatalogAndPlotToPdf(
files,
ref.genome,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
tumor.col.names = NA,
output.file = "",
flag.mismatches = 0,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
files |
Character vector of file paths to the Mutect VCF files. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Vector of column names or column indices in
VCFs which contain the tumor sample information. The order of elements in
|
output.file |
Optional. The base name of the PDF files to be produced;
multiple files will be generated, each ending in |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls MutectVCFFilesToCatalog and PlotCatalogToPdf.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- catID: Matrix of ID (small insertions and deletions) catalog.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
  - ID: ID VCF annotated by AnnotateIDVCF with one new column ID.class showing the mutation class for each ID variant.
If trans.ranges is not provided by the user and cannot be inferred by ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has attributes added. See as.catalog for more details.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Comments
To add or change attributes of the catalog, you can use the function attr. For example, attr(catalog, "abundance") <- custom.abundance.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and, if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Examples
## Not run:
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <-
MutectVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37,
region = "genome",
output.file =
file.path(tempdir(), "Mutect"))}
## End(Not run)
[Deprecated, use VCFsToZipFile(variant.caller = "mutect") instead] Create a zip file which contains catalogs and plot PDFs from Mutect VCF files
Description
[Deprecated, use VCFsToZipFile(variant.caller = "mutect") instead]
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and an Indel catalog from the Mutect VCFs specified by dir, save the catalogs as CSV files, plot them to PDF and generate a zip archive of all the output files.
Usage
MutectVCFFilesToZipFile(
dir,
zipfile,
ref.genome,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
tumor.col.names = NA,
base.filename = "",
flag.mismatches = 0,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
dir |
Pathname of the directory which contains only the Mutect
VCF files. Each Mutect VCF must have a file extension ".vcf" (case
insensitive) and share the same |
zipfile |
Pathname of the zip file to be created. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Vector of column names or column indices in
VCFs which contain the tumor sample information. The order of elements in
|
base.filename |
Optional. The base name of the CSV and PDF files to be
produced; multiple files will be generated, each ending in
|
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls MutectVCFFilesToCatalog, PlotCatalogToPdf, WriteCatalog and zip::zipr.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- catID: Matrix of ID (small insertions and deletions) catalog.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
  - ID: ID VCF annotated by AnnotateIDVCF with one new column ID.class showing the mutation class for each ID variant.
If trans.ranges is not provided by the user and cannot be inferred by ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has attributes added. See as.catalog for more details.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and, if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Comments
To add or change attributes of the catalog, you can use the function attr. For example, attr(catalog, "abundance") <- custom.abundance.
Examples
## Not run:
dir <- c(system.file("extdata/Mutect-vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <-
MutectVCFFilesToZipFile(dir,
zipfile = file.path(tempdir(), "test.zip"),
ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37,
region = "genome",
base.filename = "Mutect")
unlink(file.path(tempdir(), "test.zip"))}
## End(Not run)
Take strings representing a genome and return the BSgenome object.
Description
Take strings representing a genome and return the BSgenome object.
Usage
NormalizeGenomeArg(ref.genome)
Arguments
ref.genome |
A |
Value
If ref.genome is a BSgenome object, return it. Otherwise return the BSgenome object identified by the string ref.genome.
Plot the SBS96 part of a SignatureAnalyzer COMPOSITE signature or catalog
Description
Plot the SBS96 part of a SignatureAnalyzer COMPOSITE signature or catalog
Usage
Plot96PartOfCompositeToPDF(catalog, name, type = "density")
Arguments
catalog |
Catalog or signature matrix |
name |
Name of file to print to. |
type |
See |
Plot one spectrum or signature
Description
Plot the spectrum of one sample or plot one signature. The type of graph is based on attribute("catalog.type") of the input catalog. You can first use TransformCatalog to get different types of catalog and then do the plotting.
Usage
PlotCatalog(
catalog,
plot.SBS12 = NULL,
cex = NULL,
grid = NULL,
upper = NULL,
xlabels = NULL,
ylabels = NULL,
ylim = NULL
)
Arguments
catalog |
A catalog as defined in |
plot.SBS12 |
Only meaningful for class |
cex |
Has the usual meaning. Taken from |
grid |
A logical value indicating whether to draw grid lines. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
upper |
A logical value indicating whether to draw horizontal lines and the names of major mutation class on top of graph. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
xlabels |
A logical value indicating whether to draw x axis labels. Only
implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog.
If |
ylabels |
A logical value indicating whether to draw y axis labels. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
ylim |
Has the usual meaning. Only implemented for SBS96Catalog, IndelCatalog, ID166Catalog. |
Value
An invisible list whose first element is a logical value indicating whether the plot is successful. For SBS96Catalog, SBS192Catalog, DBS78Catalog, DBS144Catalog and IndelCatalog, the list will have a second element, which is a numeric vector giving the coordinates of all the bar midpoints drawn, useful for adding to the graph. For SBS192Catalog with "counts" catalog.type, non-NULL abundance and plot.SBS12 = TRUE, the list will have an additional element, which is a list containing the strand bias statistics.
Comments
For SBS192Catalog with "counts" catalog.type, non-NULL abundance and plot.SBS12 = TRUE, the strand bias statistics are Benjamini-Hochberg q-values based on two-sided binomial tests of the mutation counts on the transcribed and untranscribed strands relative to the actual abundances of C and T on the transcribed strand. On the SBS12 plot, asterisks indicate q-values as follows: *, Q<0.05; **, Q<0.01; ***, Q<0.001.
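The strand bias statistics described in the Comments can be approximated with base R. The sketch below is illustrative only: the counts are invented, the expected proportion of 0.5 is a placeholder for the value ICAMS derives from the abundances of C and T on the transcribed strand, and QToStars is a hypothetical helper, not an ICAMS function.

```r
# Hypothetical mutation counts per strand for the six SBS12 mutation types.
transcribed   <- c("C>A" = 80, "C>G" = 22, "C>T" = 55,
                   "T>A" = 18, "T>C" = 41, "T>G" = 12)
untranscribed <- c("C>A" = 30, "C>G" = 20, "C>T" = 50,
                   "T>A" = 16, "T>C" = 20, "T>G" = 11)

# Assumption: null proportion of mutations on the transcribed strand.
# ICAMS bases this on actual C and T abundances; 0.5 is only a placeholder.
expected.prop <- 0.5

# Two-sided binomial test for each mutation type.
p.values <- mapply(
  function(x, n) {
    binom.test(x, n, p = expected.prop, alternative = "two.sided")$p.value
  },
  transcribed, transcribed + untranscribed)

# Benjamini-Hochberg adjustment gives the q-values.
q.values <- p.adjust(p.values, method = "BH")

# Hypothetical helper mapping q-values to the asterisk codes listed above.
QToStars <- function(q) {
  ifelse(q < 0.001, "***", ifelse(q < 0.01, "**", ifelse(q < 0.05, "*", "")))
}
stars <- QToStars(q.values)
```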
Note
The sizes of repeats involved in deletions range from 0 to 5+ in the mutational-spectra and signature catalog rownames, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
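The difference between the two numbering schemes in the Note can be pictured as a shift by one. The sketch below assumes exactly that (row-name size plus one, with the top bin shown as "6+"); DisplayRepeatSize is a hypothetical helper written for this illustration and is not part of ICAMS.

```r
# Hypothetical helper: convert deletion repeat sizes as encoded in catalog
# row names (0 to 5, where 5 stands for "5+") to the sizes shown on plots
# (1 to "6+"). Assumes a plain shift by one.
DisplayRepeatSize <- function(size, top.bin = 5) {
  label <- as.character(size + 1)
  label[size >= top.bin] <- paste0(top.bin + 1, "+")
  label
}

DisplayRepeatSize(0:5)
```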
Examples
file <- system.file("extdata",
"strelka.regress.cat.sbs.96.csv",
package = "ICAMS")
catSBS96 <- ReadCatalog(file)
colnames(catSBS96) <- "sample"
PlotCatalog(catSBS96)
Plot catalog to a PDF file
Description
Plot catalog to a PDF file. The type of graph is based on attribute("catalog.type") of the input catalog. You can first use TransformCatalog to get different types of catalog and then do the plotting.
Usage
PlotCatalogToPdf(
catalog,
file,
plot.SBS12 = NULL,
cex = NULL,
grid = NULL,
upper = NULL,
xlabels = NULL,
ylabels = NULL,
ylim = NULL
)
Arguments
catalog |
A catalog as defined in |
file |
The name of the PDF file to be produced. |
plot.SBS12 |
Only meaningful for class |
cex |
Has the usual meaning. Taken from |
grid |
A logical value indicating whether to draw grid lines. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
upper |
A logical value indicating whether to draw horizontal lines and the names of major mutation class on top of graph. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
xlabels |
A logical value indicating whether to draw x axis labels. Only
implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog.
If |
ylabels |
A logical value indicating whether to draw y axis labels. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
ylim |
Has the usual meaning. Only implemented for SBS96Catalog, IndelCatalog, ID166Catalog. |
Value
An invisible list whose first element is a logical value indicating whether the plot is successful. For SBS192Catalog with "counts" catalog.type, non-NULL abundance and plot.SBS12 = TRUE, the list will have a second element, which is a list containing the strand bias statistics.
Comments
For SBS192Catalog with "counts" catalog.type, non-NULL abundance and plot.SBS12 = TRUE, the strand bias statistics are Benjamini-Hochberg q-values based on two-sided binomial tests of the mutation counts on the transcribed and untranscribed strands relative to the actual abundances of C and T on the transcribed strand. On the SBS12 plot, asterisks indicate q-values as follows: *, Q<0.05; **, Q<0.01; ***, Q<0.001.
Note
The sizes of repeats involved in deletions range from 0 to 5+ in the mutational-spectra and signature catalog rownames, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Examples
file <- system.file("extdata",
"strelka.regress.cat.sbs.96.csv",
package = "ICAMS")
catSBS96 <- ReadCatalog(file)
colnames(catSBS96) <- "sample"
PlotCatalogToPdf(catSBS96, file = file.path(tempdir(), "test.pdf"))
Generate dot-line plot for sequence context of 1bp indel
Description
Generate dot-line plot for sequence context of 1bp indel
Usage
PlotPFMmatrix(PFMmatrix, title, cex.main = 1.5, cex.lab = 1.25, cex.axis = 1)
Arguments
PFMmatrix |
An object returned from |
title |
A string providing the title of the plot. |
cex.main |
Passed to R plot function. Title size |
cex.lab |
Passed to R plot function. Axis label size |
cex.axis |
Passed to R plot function. Axis text size |
Value
An invisible list.
Plot position probability matrix (PPM) for *one* sample from a Variant Call Format (VCF) file.
Description
Plot position probability matrix (PPM) for *one* sample from a Variant Call Format (VCF) file.
Usage
PlotPPM(ppm, title)
Arguments
ppm |
A position probability matrix (PPM) for *one* sample. |
title |
The main title of the plot. |
Value
invisible(TRUE)
Plot position probability matrices (PPM) to a PDF file
Description
Plot position probability matrices (PPM) to a PDF file
Usage
PlotPPMToPdf(list.of.ppm, file, titles = names(list.of.ppm))
Arguments
list.of.ppm |
A list of position probability matrices (PPM) |
file |
The name of the PDF file to be produced. |
titles |
A vector of titles on top of each PPM plot. |
Value
invisible(TRUE)
Plot transcription strand bias with respect to gene expression values
Description
Plot transcription strand bias with respect to gene expression values
Usage
PlotTransBiasGeneExp(
annotated.SBS.vcf,
expression.data,
Ensembl.gene.ID.col,
expression.value.col,
num.of.bins,
plot.type,
damaged.base = NULL,
ymax = NULL
)
Arguments
annotated.SBS.vcf |
An SBS VCF annotated by
|
expression.data |
A |
Ensembl.gene.ID.col |
Name of column which has the Ensembl gene ID
information in |
expression.value.col |
Name of column which has the gene expression
values in |
num.of.bins |
The number of bins that will be plotted on the graph. |
plot.type |
A character string indicating one mutation type to be plotted. It should be one of "C>A", "C>G", "C>T", "T>A", "T>C", "T>G". |
damaged.base |
One of |
ymax |
Limit for the y axis. If not specified, it defaults to NULL and the y axis limit equals 1.5 times the maximum mutation count for the specific mutation type. |
Value
A list whose first element is a logical value indicating whether the plot is successful. The second element is a named numeric vector containing the p-values printed on the plot.
Note
The p-values are calculated by logistic regression using the function glm. The dependent variable is coded "1" if the mutation from annotated.SBS.vcf falls on the untranscribed strand and "0" if it falls on the transcribed strand. The independent variable is the binary logarithm of the gene expression value from expression.data plus one, i.e. log_2(x + 1), where x stands for the gene expression value.
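The regression described in this Note can be sketched on simulated data. Everything below is an assumption made for illustration: the expression values and strand labels are randomly generated, and reading the p-value from the Wald test in summary() is one plausible choice, not necessarily what ICAMS does internally.

```r
set.seed(1)

# Simulated gene expression values (e.g. TPM) for 200 hypothetical mutations.
n <- 200
expression.value <- rexp(n, rate = 1 / 20)

# Independent variable: binary logarithm of expression plus one.
log2.expr <- log2(expression.value + 1)

# Dependent variable: 1 = mutation on the untranscribed strand,
# 0 = mutation on the transcribed strand (simulated with a mild trend).
on.untranscribed <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.3 * log2.expr))

# Logistic regression of strand on log2 expression.
fit <- glm(on.untranscribed ~ log2.expr, family = binomial)

# p-value for the expression coefficient (Wald test).
p.value <- summary(fit)$coefficients["log2.expr", "Pr(>|z|)"]
```

A small p-value here would suggest that the strand on which mutations fall depends on expression level, which is the pattern the plot is meant to reveal.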
Examples
file <- c(system.file("extdata/Strelka-SBS-vcf/",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka")
SBS.vcf <- list.of.vcfs$SBS[[1]]
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37)
PlotTransBiasGeneExp(annotated.SBS.vcf = annotated.SBS.vcf,
expression.data = gene.expression.data.HepG2,
Ensembl.gene.ID.col = "Ensembl.gene.ID",
expression.value.col = "TPM",
num.of.bins = 4, plot.type = "C>A")
}
Plot transcription strand bias with respect to gene expression values to a PDF file
Description
Plot transcription strand bias with respect to gene expression values to a PDF file
Usage
PlotTransBiasGeneExpToPdf(
annotated.SBS.vcf,
file,
expression.data,
Ensembl.gene.ID.col,
expression.value.col,
num.of.bins,
plot.type = c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G"),
damaged.base = NULL
)
Arguments
annotated.SBS.vcf |
An SBS VCF annotated by
|
file |
The name of output file. |
expression.data |
A |
Ensembl.gene.ID.col |
Name of column which has the Ensembl gene ID
information in |
expression.value.col |
Name of column which has the gene expression
values in |
num.of.bins |
The number of bins that will be plotted on the graph. |
plot.type |
A character vector indicating the mutation types to be plotted. It can be one or more of "C>A", "C>G", "C>T", "T>A", "T>C", "T>G". The default is to plot all six mutation types. |
damaged.base |
One of |
Value
A list whose first element is a logical value indicating whether the plot is successful. The second element is a named numeric vector containing the p-values printed on the plot.
Note
The p-values are calculated by logistic regression using the function glm. The dependent variable is coded "1" if the mutation from annotated.SBS.vcf falls on the untranscribed strand and "0" if it falls on the transcribed strand. The independent variable is the binary logarithm of the gene expression value from expression.data plus one, i.e. log_2(x + 1), where x stands for the gene expression value.
Examples
file <- c(system.file("extdata/Strelka-SBS-vcf/",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka")
SBS.vcf <- list.of.vcfs$SBS[[1]]
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37)
PlotTransBiasGeneExpToPdf(annotated.SBS.vcf = annotated.SBS.vcf,
expression.data = gene.expression.data.HepG2,
Ensembl.gene.ID.col = "Ensembl.gene.ID",
expression.value.col = "TPM",
num.of.bins = 4,
plot.type = c("C>A","C>G","C>T","T>A","T>C"),
file = file.path(tempdir(), "test.pdf"))
}
[Deprecated, use ReadAndSplitVCFs(variant.caller = "mutect") instead] Read and split Mutect VCF files
Description
[Deprecated, use ReadAndSplitVCFs(variant.caller = "mutect") instead] Read and split Mutect VCF files
Usage
ReadAndSplitMutectVCFs(
files,
names.of.VCFs = NULL,
tumor.col.names = NA,
suppress.discarded.variants.warnings = TRUE
)
Arguments
files |
Character vector of file paths to the Mutect VCF files. |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Vector of column names or column indices in
VCFs which contain the tumor sample information. The order of elements in
|
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Value
A list containing the following objects:
- SBS: List of VCFs with only single base substitutions.
- DBS: List of VCFs with only doublet base substitutions as called by Mutect.
- ID: List of VCFs with only small insertions and deletions.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Examples
## Not run:
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadAndSplitMutectVCFs(file)
## End(Not run)
[Deprecated, use ReadAndSplitVCFs(variant.caller = "strelka") instead] Read and split Strelka SBS VCF files
Description
[Deprecated, use ReadAndSplitVCFs(variant.caller = "strelka") instead] The function will find and merge adjacent SBS pairs into DBS if their VAFs are very similar. The default threshold value for VAF is 0.02.
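The merging rule in this description can be illustrated with a small base R sketch. The data frame, its column names and the use of "less than or equal" for the threshold are assumptions for illustration; the sketch also ignores runs of three or more adjacent substitutions, which the real function treats separately.

```r
# Hypothetical SBS calls with variant allele frequencies.
sbs <- data.frame(
  CHROM = c("1", "1", "1", "2", "2"),
  POS   = c(100, 101, 500, 200, 201),
  REF   = c("C", "T", "G", "A", "C"),
  ALT   = c("A", "G", "T", "T", "G"),
  VAF   = c(0.30, 0.31, 0.25, 0.40, 0.10),
  stringsAsFactors = FALSE
)
max.vaf.diff <- 0.02

# Sort, then compare each row with the next one.
sbs <- sbs[order(sbs$CHROM, sbs$POS), ]
i <- seq_len(nrow(sbs) - 1)
is.pair <- sbs$CHROM[i] == sbs$CHROM[i + 1] &
  sbs$POS[i + 1] - sbs$POS[i] == 1 &
  abs(sbs$VAF[i] - sbs$VAF[i + 1]) <= max.vaf.diff

# Merge each qualifying adjacent pair into one DBS record.
first <- i[is.pair]
dbs <- data.frame(
  CHROM = sbs$CHROM[first],
  POS   = sbs$POS[first],
  REF   = paste0(sbs$REF[first], sbs$REF[first + 1]),
  ALT   = paste0(sbs$ALT[first], sbs$ALT[first + 1]),
  VAF   = (sbs$VAF[first] + sbs$VAF[first + 1]) / 2,
  stringsAsFactors = FALSE
)
```

In this toy input only the pair at chromosome 1, positions 100 and 101, is merged; the adjacent pair on chromosome 2 is left alone because its VAFs differ by more than the threshold.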
Usage
ReadAndSplitStrelkaSBSVCFs(
files,
names.of.VCFs = NULL,
suppress.discarded.variants.warnings = TRUE
)
Arguments
files |
Character vector of file paths to the Strelka SBS VCF files. |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Value
A list of elements as follows:
- SBS.vcfs: List of data.frames of pure SBS mutations – no DBS or 3+BS mutations.
- DBS.vcfs: List of data.frames of pure DBS mutations – no SBS or 3+BS mutations.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Examples
## Not run:
file <- c(system.file("extdata/Strelka-SBS-vcf",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadAndSplitStrelkaSBSVCFs(file)
## End(Not run)
Read and split VCF files
Description
Read and split VCF files
Usage
ReadAndSplitVCFs(
files,
variant.caller = "unknown",
num.of.cores = 1,
names.of.VCFs = NULL,
tumor.col.names = NA,
filter.status = DefaultFilterStatus(variant.caller),
get.vaf.function = NULL,
...,
max.vaf.diff = 0.02,
suppress.discarded.variants.warnings = TRUE,
always.merge.SBS = FALSE,
chr.names.to.process = NULL
)
Arguments
files |
Character vector of file paths to the VCF files. |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
max.vaf.diff |
Not applicable if |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
always.merge.SBS |
If |
chr.names.to.process |
A character vector specifying the chromosome names in the VCF whose variants will be kept and processed; variants on other chromosomes will be discarded. If NULL (default), all variants will be kept except those on chromosomes with names that contain the strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt". |
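The default behaviour of chr.names.to.process can be pictured as a substring filter on chromosome names. The sketch below assumes case-sensitive matching with grepl and uses invented chromosome names; it is not the package's internal code.

```r
# Substrings that mark chromosomes to discard by default (from the text above).
excluded.patterns <- c("GL", "KI", "random", "Hs", "M", "JH", "fix", "alt")

# Hypothetical chromosome names as they might appear in a VCF.
chr.names <- c("1", "chr2", "X", "chrY", "MT", "chrM",
               "GL000207.1", "chr1_KI270706v1_random", "chr5_alt")

# Keep a name only if it contains none of the excluded substrings.
pattern <- paste(excluded.patterns, collapse = "|")
keep <- !grepl(pattern, chr.names)
chr.names[keep]
```

Note that under this rule any name containing "M", such as the mitochondrial chromosome, is dropped unless chr.names.to.process is set explicitly.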
Value
A list containing the following objects:
- SBS: List of VCFs with only single base substitutions.
- DBS: List of VCFs with only doublet base substitutions.
- ID: List of VCFs with only small insertions and deletions.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Examples
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")
Read chromosome and position information from a bed format file.
Description
Read chromosome and position information from a bed format file.
Usage
ReadBedRanges(file)
Arguments
file |
Path to the file in bed format. |
Value
A data.table keyed by chrom, start, and end. It uses one-based coordinates.
Note
Only chromosomes 1-22 and X and Y will be kept.
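BED files conventionally use zero-based, half-open intervals, so a one-based result implies adding one to each start position. The sketch below shows that conversion and the chromosome filter in base R on an inline toy file; stripping a "chr" prefix and returning a plain data frame (rather than a keyed data.table) are assumptions made to keep the example self-contained.

```r
# Toy BED content: chrom, zero-based start, end (tab-separated).
bed.text <- "chr1\t0\t100\nchrX\t49\t60\nchrUn_gl000220\t10\t20\n2\t5\t15"

bed <- read.table(text = bed.text, sep = "\t",
                  col.names = c("chrom", "start", "end"),
                  colClasses = c("character", "integer", "integer"),
                  stringsAsFactors = FALSE)

# Assumption: normalize names by dropping a leading "chr".
bed$chrom <- sub("^chr", "", bed$chrom)

# Keep only chromosomes 1-22, X and Y.
bed <- bed[bed$chrom %in% c(as.character(1:22), "X", "Y"), ]

# Convert zero-based starts to one-based coordinates; ends stay as they are.
bed$start <- bed$start + 1L
```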
Read catalog
Description
Read a catalog in standardized format from path.
Usage
ReadCatalog(
file,
ref.genome = NULL,
region = "unknown",
catalog.type = "counts",
strict = NULL,
stop.on.error = TRUE
)
Arguments
file |
Path to a catalog on disk in a standardized format. The recognized formats are:
|
ref.genome |
A |
region |
A character string designating a genomic region;
see |
catalog.type |
One of "counts", "density", "counts.signature", "density.signature". |
strict |
Ignored and deprecated. |
stop.on.error |
If TRUE, call |
Details
See also WriteCatalog
Value
A catalog as an S3 object; see as.catalog
.
Comments
To add or change attributes of the catalog, you can use the function attr. For example, attr(catalog, "abundance") <- custom.abundance.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Examples
file <- system.file("extdata",
"strelka.regress.cat.sbs.96.csv",
package = "ICAMS")
catSBS96 <- ReadCatalog(file)
Get error message and either stop or create a null error output for read catalog
Description
Get error message and either stop or create a null error output for read catalog
Usage
ReadCatalogErrReturn(err.info, nrow, stop.on.error = TRUE, do.message = TRUE)
Arguments
err.info |
The information passed to the |
nrow |
The number of rows to put in the 1-column NA return matrix. |
stop.on.error |
If |
do.message |
If |
Internal read catalog function to be wrapped in a tryCatch
Description
Internal read catalog function to be wrapped in a tryCatch
Usage
ReadCatalogInternal(
file,
ref.genome = NULL,
region = "unknown",
catalog.type = "counts"
)
Arguments
file |
Path to a catalog on disk in a standardized format. The recognized formats are:
|
ref.genome |
A |
region |
A character string designating a genomic region;
see |
catalog.type |
One of "counts", "density", "counts.signature", "density.signature". |
Read a 192-channel spectra (or signature) catalog in Duke-NUS format
Description
WARNING: will not work with region = "genome". For this you must first read with region = "unknown", then convert the cat96 return to "genome" and ignore the cat192 return, which is nonsensical.
Usage
ReadDukeNUSCat192(
file,
ref.genome = NULL,
region = "unknown",
catalog.type = "counts",
abundance = NULL
)
Details
The file needs to have the column names Before, Ref, After and Var in the first 4 columns.
Value
A list with two elements
Read in the data lines of a Variant Call Format (VCF) file created by Mutect
Description
Read in the data lines of a Variant Call Format (VCF) file created by Mutect
Usage
ReadMutectVCF(file, name.of.VCF = NULL, tumor.col.name = NA)
Arguments
file |
The name/path of the VCF file, or a complete URL. |
name.of.VCF |
Name of the VCF file. If |
tumor.col.name |
Name or index of the column in VCF which contains the
tumor sample information. It must have quotation marks if
specifying the column name. If |
Value
A data frame storing data lines of a VCF file with two additional columns added which contain the VAF (variant allele frequency) and read depth information.
Read Mutect VCF files.
Description
Read Mutect VCF files.
Usage
ReadMutectVCFs(files, names.of.VCFs = NULL, tumor.col.names = NA)
Arguments
files |
Character vector of file paths to the VCF files. |
names.of.VCFs |
Character vector of names of the VCF files. The order of
names in |
tumor.col.names |
Vector of column names or column indices in VCFs which
contain the tumor sample information. The order of elements in
|
Value
A list of data frames which store data lines of VCF files with two additional columns added which contain the VAF (variant allele frequency) and read depth information.
Read a 96-channel spectra (or signature) catalog where rownames are e.g. "A[C>A]T"
Description
The file needs to have the rownames in the first column.
Usage
ReadStapleGT96SBS(
file,
ref.genome = NULL,
region = "unknown",
catalog.type = "counts",
abundance = NULL,
sep = "\t"
)
Read in the data lines of an ID VCF created by Strelka version 1
Description
Read in the data lines of an ID VCF created by Strelka version 1
Usage
ReadStrelkaIDVCF(file, name.of.VCF = NULL)
Arguments
file |
The name/path of the VCF file, or a complete URL. |
name.of.VCF |
Name of the VCF file. If |
Value
A data frame storing data lines of the VCF file.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
[Deprecated, use ReadAndSplitVCFs(variant.caller = "strelka") instead] Read Strelka ID (small insertions and deletions) VCF files
Description
[Deprecated, use ReadAndSplitVCFs(variant.caller = "strelka") instead] Read Strelka ID (small insertions and deletions) VCF files
Usage
ReadStrelkaIDVCFs(files, names.of.VCFs = NULL)
Arguments
files |
Character vector of file paths to the VCF files. |
names.of.VCFs |
Character vector of names of the VCF files. The order of
names in |
Value
A list of data frames containing data lines of the VCF files.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Examples
## Not run:
file <- c(system.file("extdata/Strelka-ID-vcf",
"Strelka.ID.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadStrelkaIDVCFs(file)
## End(Not run)
Read in the data lines of an SBS VCF created by Strelka version 1
Description
Read in the data lines of an SBS VCF created by Strelka version 1
Usage
ReadStrelkaSBSVCF(file, name.of.VCF = NULL)
Arguments
file |
The name/path of the VCF file, or a complete URL. |
name.of.VCF |
Name of the VCF file. If |
Value
A data frame storing data lines of a VCF file with two additional columns added which contain the VAF (variant allele frequency) and read depth information.
Read Strelka SBS (single base substitutions) VCF files.
Description
Read Strelka SBS (single base substitutions) VCF files.
Usage
ReadStrelkaSBSVCFs(files, names.of.VCFs = NULL)
Arguments
files |
Character vector of file paths to the VCF files. |
names.of.VCFs |
Character vector of names of the VCF files. The order of
names in |
Value
A list of data frames which store data lines of VCF files with two additional columns added which contain the VAF (variant allele frequency) and read depth information.
Read transcript ranges and strand information from a gff3 format file. Use this one for the new, cut down gff3 file (2018 11 24)
Description
Read transcript ranges and strand information from a gff3 format file. Use this one for the new, cut down gff3 file (2018 11 24)
Usage
ReadTranscriptRanges(file)
Arguments
file |
Path to the file with the transcript information with 1-based start and end positions of genomic ranges. |
Value
A data.table keyed by chrom, start, and end.
Read in the data lines of a Variant Call Format (VCF) file
Description
Read in the data lines of a Variant Call Format (VCF) file
Usage
ReadVCF(
file,
variant.caller = "unknown",
name.of.VCF = NULL,
tumor.col.name = NA,
filter.status = DefaultFilterStatus(variant.caller),
get.vaf.function = NULL,
...
)
Arguments
file |
The name/path of the VCF file, or a complete URL. |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
name.of.VCF |
Name of the VCF file. If |
tumor.col.name |
Optional. Only applicable to Mutect VCF. Name
or index of the column in Mutect VCF which contains the tumor
sample information. It must have quotation marks if specifying the
column name. If |
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
Value
A data frame storing data lines of the VCF file with two additional columns added which contain the VAF (variant allele frequency) and read depth information.
Read VCF files
Description
Read VCF files
Usage
ReadVCFs(
files,
variant.caller = "unknown",
num.of.cores = 1,
names.of.VCFs = NULL,
tumor.col.names = NA,
filter.status = DefaultFilterStatus(variant.caller),
get.vaf.function = NULL,
...
)
Arguments
files |
Character vector of file paths to the VCF files. |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
Value
A list of data frames storing data lines of the VCF files with two additional columns added which contain the VAF (variant allele frequency) and read depth information.
Examples
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadVCFs(file, variant.caller = "mutect")
Remove ranges that fall on both strands
Description
Remove ranges that fall on both strands
Usage
RemoveRangesOnBothStrand(stranded.ranges)
Arguments
stranded.ranges |
A keyed data table which has stranded ranges information. It has four columns: chrom, start, end and strand. |
Value
A data table with the ranges that fall on both strands removed from the input stranded.ranges.
Is there any column in df with name "end"? If there is, change its name to "end_old" so that it will not conflict with code in other parts of the ICAMS package.
Description
Is there any column in df with name "end"? If there is, change its name to "end_old" so that it will not conflict with code in other parts of the ICAMS package.
Usage
RenameColumnsWithNameEnd(df)
Is there any column in df with name "start"? If there is, change its name to "start_old" so that it will not conflict with code in other parts of the ICAMS package.
Description
Is there any column in df with name "start"? If there is, change its name to "start_old" so that it will not conflict with code in other parts of the ICAMS package.
Usage
RenameColumnsWithNameStart(df)
Is there any column in df with name "strand"? If there is, change its name to "strand_old" so that it will not conflict with code in other parts of the ICAMS package.
Description
Is there any column in df with name "strand"? If there is, change its name to "strand_old" so that it will not conflict with code in other parts of the ICAMS package.
Usage
RenameColumnsWithNameStrand(df)
Is there any column in df with name "VAF"? If there is, change its name to "VAF_old" so that it will not conflict with code in other parts of the ICAMS package.
Description
Is there any column in df with name "VAF"? If there is, change its name to "VAF_old" so that it will not conflict with code in other parts of the ICAMS package.
Usage
RenameColumnsWithNameVAF(df)
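The four renaming helpers above share one idea: a pre-existing column whose name ICAMS wants to use is moved aside by appending "_old". A minimal base R sketch of that idea follows; the helper name rename_reserved_column and the example data frame are hypothetical and are not taken from the ICAMS source.

```r
# Hypothetical helper (not the ICAMS implementation): if a column named
# `reserved` exists, rename it to "<reserved>_old"; otherwise return df as is.
rename_reserved_column <- function(df, reserved) {
  hit <- colnames(df) == reserved
  if (any(hit)) {
    colnames(df)[hit] <- paste0(reserved, "_old")
  }
  df
}

# Example data frame with two clashing column names ("end" and "VAF").
vcf <- data.frame(CHROM = "1", POS = 100L, end = 105L, VAF = 0.3,
                  stringsAsFactors = FALSE)

for (nm in c("end", "start", "strand", "VAF")) {
  vcf <- rename_reserved_column(vcf, nm)
}
colnames(vcf)
```

Columns that are absent ("start" and "strand" here) are simply skipped, so the loop is safe to run on any data frame.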
Convert 1536-channel mutation-type identifiers like this "ACCGTA" -> "AC[C>A]GT"
Description
This is an internal function needed for generating "non-canonical" row number formats for catalogs.
Usage
Restaple1536(c1)
Arguments
c1 |
A vector of character strings with the first 5 characters
being the source pentanucleotide and the last character being the
mutated (center) nucleotide. E.g. |
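The conversion in the title ("ACCGTA" -> "AC[C>A]GT") can be sketched with substr alone. This is an illustration under the assumption that every input string has exactly 6 characters; the function name restaple_1536 is hypothetical and is not the ICAMS source.

```r
# Hypothetical re-implementation for illustration only.
# Characters 1-5 are the source context, character 6 is the mutated center base.
restaple_1536 <- function(c1) {
  stopifnot(all(nchar(c1) == 6))
  paste0(substr(c1, 1, 2),          # two 5' flanking bases
         "[", substr(c1, 3, 3),     # reference center base
         ">", substr(c1, 6, 6),     # mutated center base
         "]", substr(c1, 4, 5))     # two 3' flanking bases
}

restaple_1536("ACCGTA")
```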
Convert 96-channel mutation-type identifiers like this "ACTA" -> "A[C>A]T"
Description
This is an internal function needed for generating "non-canonical" row number formats for catalogs.
Usage
Restaple96(c1)
Arguments
c1 |
A vector of character strings with the first 3 characters
being the source trinucleotide and the last character being the
mutated (center) nucleotide. E.g. |
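The 96-channel case in the title ("ACTA" -> "A[C>A]T") follows the same pattern with one flanking base on each side. A minimal sketch, assuming 4-character inputs; restaple_96 is a hypothetical name, not the ICAMS source.

```r
# Hypothetical re-implementation for illustration only.
# Characters 1-3 are the source trinucleotide, character 4 is the mutated base.
restaple_96 <- function(c1) {
  stopifnot(all(nchar(c1) == 4))
  paste0(substr(c1, 1, 1),
         "[", substr(c1, 2, 2), ">", substr(c1, 4, 4), "]",
         substr(c1, 3, 3))
}

restaple_96("ACTA")
```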
Reverse complement strings that represent stranded DBSs
Description
Reverse complement strings that represent stranded DBSs
Usage
RevcDBS144(mutstring)
Arguments
mutstring |
A vector of 4-character strings representing stranded DBSs, for example "AATC" represents AA > TC mutations. |
Value
Return the vector of reverse complements of the first 2 characters concatenated with the reverse complement of the second 2 characters, e.g. "AATC" returns "TTGA".
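The rule in the Value section can be checked with a small base R sketch. The helper names revc and revc_dbs144 are hypothetical; the package itself may rely on other tools, so treat this as an illustration of the rule rather than the ICAMS code.

```r
# Reverse complement of each string in a character vector (A, C, G, T only).
revc <- function(s) {
  comp <- chartr("ACGT", "TGCA", s)
  vapply(strsplit(comp, ""),
         function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

# Reverse complement the reference doublet and the mutated doublet separately,
# then concatenate them in the original order.
revc_dbs144 <- function(mutstring) {
  stopifnot(all(nchar(mutstring) == 4))
  paste0(revc(substr(mutstring, 1, 2)), revc(substr(mutstring, 3, 4)))
}

revc_dbs144("AATC")
```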
Reverse complement strings that represent stranded SBSs
Description
Reverse complement strings that represent stranded SBSs
Usage
RevcSBS96(mutstring)
Arguments
mutstring |
A vector of 4-character strings representing stranded SBSs in trinucleotide context, for example "AATC" represents AAT > ACT mutations. |
Value
Return the vector of reverse complements of the first 3 characters concatenated with the reverse complement of the last character, e.g. "AATC" returns "ATTG".
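As a sketch of the rule above: the trinucleotide context is reverse complemented, while the single mutated base only needs to be complemented. The function names below are hypothetical and the code is an illustration, not the ICAMS implementation.

```r
# Complement a string of A, C, G, T.
complement <- function(s) chartr("ACGT", "TGCA", s)

# Reverse complement each string in a character vector.
revc <- function(s) {
  vapply(strsplit(complement(s), ""),
         function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

revc_sbs96 <- function(mutstring) {
  stopifnot(all(nchar(mutstring) == 4))
  context <- substr(mutstring, 1, 3)  # source trinucleotide
  target  <- substr(mutstring, 4, 4)  # mutated center base
  paste0(revc(context), complement(target))
}

revc_sbs96("AATC")
```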
Select variants according to chromosome names specified by user
Description
Select variants according to chromosome names specified by user
Usage
SelectVariantsByChromName(df, chr.names.to.process, name.of.VCF = NULL)
Arguments
df |
An in-memory data.frame representing a VCF. |
chr.names.to.process |
A character vector specifying the chromosome
names in |
name.of.VCF |
Name of the VCF file. |
Value
A list with the elements:
- df: A data frame with variants only from chromosomes specified by chr.names.to.process.
- discarded.variants: Non-NULL only if there are variants that are from chromosomes not specified by chr.names.to.process.
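The shape of the returned list can be illustrated with a short base R sketch. It assumes the chromosome names live in a column called CHROM and does not add any extra bookkeeping columns; select_by_chrom is a hypothetical name, not the ICAMS function.

```r
# Hypothetical sketch: keep rows whose CHROM is in chr.names.to.process and
# report the remaining rows (or NULL if nothing was discarded).
select_by_chrom <- function(df, chr.names.to.process) {
  keep <- df$CHROM %in% chr.names.to.process
  discarded <- df[!keep, , drop = FALSE]
  list(df = df[keep, , drop = FALSE],
       discarded.variants = if (nrow(discarded) > 0) discarded else NULL)
}

vcf <- data.frame(CHROM = c("1", "2", "MT"),
                  POS   = c(100L, 200L, 300L),
                  stringsAsFactors = FALSE)
res <- select_by_chrom(vcf, chr.names.to.process = c("1", "2"))
res$df$CHROM
res$discarded.variants$CHROM
```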
Read a VCF file into a data frame with minimal processing.
Description
Read a VCF file into a data frame with minimal processing.
Usage
SimpleReadVCF(file)
Arguments
file |
The name/path of the VCF file, or a complete URL. |
Details
Header lines beginning "##" are removed, and column "#CHROM" is renamed to "CHROM". Other column names are unchanged. Columns "#CHROM", "POS", "REF", and "ALT" must be in the input.
Value
A data frame storing mutation records of a VCF file.
Examples
file <- c(system.file("extdata/Strelka-SBS-vcf",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
df <- SimpleReadVCF(file)
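The behaviour described under Details can be approximated in base R as follows. This is a sketch under the assumption that the file is tab-separated; simple_read_vcf is a hypothetical name, and the package may use a different reader internally.

```r
# Hypothetical sketch of "minimal processing": drop "##" header lines,
# rename "#CHROM" to "CHROM", and check the required columns.
simple_read_vcf <- function(file) {
  lines <- readLines(file)
  lines <- lines[!startsWith(lines, "##")]
  # Read everything as character so that, e.g., a REF column containing only
  # "T" is not converted to logical TRUE.
  df <- read.delim(text = lines, header = TRUE, check.names = FALSE,
                   colClasses = "character", quote = "")
  stopifnot(all(c("#CHROM", "POS", "REF", "ALT") %in% colnames(df)))
  colnames(df)[colnames(df) == "#CHROM"] <- "CHROM"
  df
}

# Tiny made-up VCF written to a temporary file.
tmp <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.2",
             "#CHROM\tPOS\tID\tREF\tALT",
             "1\t100\t.\tC\tT",
             "2\t200\t.\tG\tA"), tmp)
df <- simple_read_vcf(tmp)
colnames(df)
```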
Split each Mutect VCF into SBS, DBS, and ID VCFs (plus VCF-like data frame with left-over rows)
Description
Split each Mutect VCF into SBS, DBS, and ID VCFs (plus VCF-like data frame with left-over rows)
Usage
SplitListOfMutectVCFs(
list.of.vcfs,
suppress.discarded.variants.warnings = TRUE
)
Arguments
list.of.vcfs |
List of VCFs as in-memory data.frames. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Value
A list containing the following objects:
- SBS: List of VCFs with only single base substitutions.
- DBS: List of VCFs with only doublet base substitutions as called by Mutect.
- ID: List of VCFs with only small insertions and deletions.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Split a list of in-memory Strelka SBS VCF into SBS, DBS, and variants involving > 2 consecutive bases
Description
SBSs are single base substitutions, e.g. C>T, A>G,.... DBSs are double base substitutions, e.g. CC>TT, AT>GG, ... Variants involving > 2 consecutive bases are rare, so this function just records them. These would be variants such as ATG>CCT, AGAT>TCTA, ...
Usage
SplitListOfStrelkaSBSVCFs(
list.of.vcfs,
suppress.discarded.variants.warnings = TRUE
)
Arguments
list.of.vcfs |
A list of in-memory data frames containing Strelka SBS VCF file contents. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Value
A list of elements as follows:
- SBS.vcfs: List of data.frames of pure SBS mutations – no DBS or 3+BS mutations.
- DBS.vcfs: List of data.frames of pure DBS mutations – no SBS or 3+BS mutations.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Split each VCF into SBS, DBS, and ID VCFs (plus VCF-like data frame with left-over rows)
Description
Split each VCF into SBS, DBS, and ID VCFs (plus VCF-like data frame with left-over rows)
Usage
SplitListOfVCFs(
list.of.vcfs,
variant.caller,
max.vaf.diff = 0.02,
num.of.cores = 1,
suppress.discarded.variants.warnings = TRUE,
always.merge.SBS = FALSE,
chr.names.to.process = NULL
)
Arguments
list.of.vcfs |
List of VCFs as in-memory data frames. The VCFs should
have |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
max.vaf.diff |
The maximum difference of VAF, default value is 0.02. If
the absolute difference of VAFs for adjacent SBSs is bigger than
|
num.of.cores |
The number of cores to use. Not available on Windows
unless |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
always.merge.SBS |
If |
chr.names.to.process |
A character vector specifying the chromosome
names in VCF whose variants will be kept and processed, other chromosome
variants will be discarded. If |
Value
A list containing the following objects:
- SBS: List of VCFs with only single base substitutions.
- DBS: List of VCFs with only doublet base substitutions as called by Mutect.
- ID: List of VCFs with only small insertions and deletions.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Examples
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.vcfs <- ReadVCFs(file, variant.caller = "mutect")
split.vcfs <- SplitListOfVCFs(list.of.vcfs, variant.caller = "mutect")
Split a mutect2 VCF into SBS, DBS, and ID VCFs, plus a list of other mutations
Description
Split a mutect2 VCF into SBS, DBS, and ID VCFs, plus a list of other mutations
Usage
SplitOneMutectVCF(vcf.df, name.of.VCF = NULL, chr.names.to.process = NULL)
Arguments
vcf.df |
An in-memory data.frame representing a Mutect VCF, including
VAFs, which are added by |
name.of.VCF |
Name of the VCF file. |
chr.names.to.process |
A character vector specifying the chromosome
names in VCF whose variants will be kept and processed, other chromosome
variants will be discarded. If |
Value
A list with 3 in-memory VCFs and discarded variants that were not incorporated into the first 3 VCFs:
- SBS: VCF with only single base substitutions.
- DBS: VCF with only doublet base substitutions as called by Mutect.
- ID: VCF with only small insertions and deletions.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Split a VCF into SBS, DBS, and ID VCFs, plus a list of other mutations
Description
Split a VCF into SBS, DBS, and ID VCFs, plus a list of other mutations
Usage
SplitOneVCF(
vcf.df,
max.vaf.diff = 0.02,
name.of.VCF = NULL,
always.merge.SBS = FALSE,
chr.names.to.process = NULL
)
Arguments
vcf.df |
An in-memory data.frame representing a VCF, including
VAFs, which are added by |
max.vaf.diff |
The maximum difference of VAF, default value is 0.02. If
the absolute difference of VAFs for adjacent SBSs is bigger than
|
name.of.VCF |
Name of the VCF file. |
always.merge.SBS |
If |
chr.names.to.process |
A character vector specifying the chromosome
names in VCF whose variants will be kept and processed, other chromosome
variants will be discarded. If |
Value
A list with 3 in-memory VCFs and discarded variants that were not incorporated into the first 3 VCFs:
- SBS: VCF with only single base substitutions.
- DBS: VCF with only doublet base substitutions.
- ID: VCF with only small insertions and deletions.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Split an in-memory SBS VCF into pure SBSs, pure DBSs, and variants involving > 2 consecutive bases
Description
SBSs are single base substitutions, e.g. C>T, A>G,.... DBSs are double base substitutions, e.g. CC>TT, AT>GG, ... Variants involving > 2 consecutive bases are rare, so this function just records them. These would be variants such as ATG>CCT, AGAT>TCTA, ...
Usage
SplitSBSVCF(vcf.df, max.vaf.diff = 0.02, name.of.VCF = NULL, always.merge.SBS)
Arguments
vcf.df |
An in-memory data frame containing an SBS VCF file contents. |
max.vaf.diff |
The maximum difference of VAF, default value is 0.02. If
the absolute difference of VAFs for adjacent SBSs is bigger than
|
name.of.VCF |
Name of the VCF file. |
always.merge.SBS |
If |
Value
A list of in-memory objects with the elements:
- SBS.vcf: Data frame of pure SBS mutations – no DBS or 3+BS mutations.
- DBS.vcf: Data frame of pure DBS mutations – no SBS or 3+BS mutations.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
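To make the role of max.vaf.diff concrete, the sketch below flags pairs of SBSs at consecutive positions on the same chromosome whose VAFs differ by at most the threshold, and reports them as candidate DBSs. It is a simplified illustration under stated assumptions: the column names CHROM, POS, REF, ALT and VAF, the function name find_dbs_candidates, and the example values are all hypothetical, and runs of 3 or more adjacent SBSs (which the real function records separately) are not handled.

```r
# Hypothetical sketch: pair each row with the next row after sorting, and keep
# pairs that are adjacent on the same chromosome with similar VAFs.
find_dbs_candidates <- function(vcf.df, max.vaf.diff = 0.02) {
  vcf.df <- vcf.df[order(vcf.df$CHROM, vcf.df$POS), ]
  n <- nrow(vcf.df)
  first  <- vcf.df[-n, ]   # rows 1 .. n-1
  second <- vcf.df[-1, ]   # rows 2 .. n
  adjacent <- first$CHROM == second$CHROM & (second$POS - first$POS) == 1
  similar  <- abs(first$VAF - second$VAF) <= max.vaf.diff
  hit <- adjacent & similar
  data.frame(CHROM = first$CHROM[hit],
             POS   = first$POS[hit],
             REF   = paste0(first$REF[hit], second$REF[hit]),
             ALT   = paste0(first$ALT[hit], second$ALT[hit]),
             stringsAsFactors = FALSE)
}

sbs <- data.frame(CHROM = c("1", "1", "1", "2", "2"),
                  POS   = c(100L, 101L, 500L, 200L, 201L),
                  REF   = c("C", "C", "G", "A", "T"),
                  ALT   = c("T", "T", "A", "G", "G"),
                  VAF   = c(0.30, 0.31, 0.40, 0.30, 0.50),
                  stringsAsFactors = FALSE)
dbs <- find_dbs_candidates(sbs)
dbs
```

In this made-up input only the pair at positions 100 and 101 on chromosome 1 qualifies; the adjacent pair on chromosome 2 is left as two SBSs because its VAF difference (0.20) exceeds 0.02.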
Split an in-memory Strelka VCF into SBS, DBS, and variants involving > 2 consecutive bases
Description
SBSs are single base substitutions, e.g. C>T, A>G,.... DBSs are double base substitutions, e.g. CC>TT, AT>GG, ... Variants involving > 2 consecutive bases are rare, so this function just records them. These would be variants such as ATG>CCT, AGAT>TCTA, ...
Usage
SplitStrelkaSBSVCF(
vcf.df,
max.vaf.diff = 0.02,
name.of.VCF = NULL,
always.merge.SBS = FALSE
)
Arguments
vcf.df |
An in-memory data frame containing a Strelka VCF file contents. |
max.vaf.diff |
The maximum difference of VAF, default value is 0.02. If
the absolute difference of VAFs for adjacent SBSs is bigger than
|
name.of.VCF |
Name of the VCF file. |
always.merge.SBS |
If |
Value
A list of in-memory objects with the elements:
- SBS.vcf: Data frame of pure SBS mutations – no DBS or 3+BS mutations.
- DBS.vcf: Data frame of pure DBS mutations – no SBS or 3+BS mutations.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
Standardize the chromosome name annotations for a data frame.
Description
Standardize the chromosome name annotations for a data frame.
Usage
StandardChromName(df)
Arguments
df |
A data frame whose first column contains the Chromosome name |
Value
A data frame whose Chromosome names are only in the form of 1:22, "X" and "Y".
Standardize the chromosome name annotations for a data frame.
Description
Standardize the chromosome name annotations for a data frame.
Usage
StandardChromNameNew(df, name.of.VCF = NULL)
Arguments
df |
An in-memory data.frame representing a VCF. |
name.of.VCF |
Name of the VCF file. |
Value
A list with the elements:
- df: A data frame with variants that had "legal" chromosome names (see below for illegal chromosome names).
- discarded.variants: Non-NULL only if there are variants with illegal chromosome names; these are names that contain the strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt".
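The "illegal name" test described above amounts to a substring match. A minimal sketch follows, assuming the chromosome names are in a column called CHROM; it only separates legal from illegal rows and does not perform any further standardization. The function name and the example names are hypothetical.

```r
# Substrings that mark a chromosome name as illegal, as listed in the Value
# section above.
illegal.pattern <- "GL|KI|random|Hs|M|JH|fix|alt"

# Hypothetical sketch: split a VCF-like data frame by chromosome-name legality.
split_by_chrom_legality <- function(df) {
  illegal <- grepl(illegal.pattern, df$CHROM)
  discarded <- df[illegal, , drop = FALSE]
  list(df = df[!illegal, , drop = FALSE],
       discarded.variants = if (nrow(discarded) > 0) discarded else NULL)
}

vcf <- data.frame(
  CHROM = c("1", "X", "chrM", "GL000220.1", "chr1_KI270706v1_random"),
  POS   = c(10L, 20L, 30L, 40L, 50L),
  stringsAsFactors = FALSE)
res <- split_by_chrom_legality(vcf)
res$df$CHROM
res$discarded.variants$CHROM
```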
Stop if catalog.type is illegal.
Description
Stop if catalog.type is illegal.
Usage
StopIfCatalogTypeIllegal(catalog.type)
Arguments
catalog.type |
Character string to check. |
Stop if the number of rows in object is illegal.
Description
Stop if the number of rows in object is illegal.
Usage
StopIfNrowIllegal(object)
Arguments
object |
A |
Stop if region is illegal.
Description
Stop if region is illegal.
Usage
StopIfRegionIllegal(region)
Arguments
region |
Character string to check. |
Stop if region is illegal for an in-transcript catalog.
Description
Stop if region is illegal for an in-transcript catalog.
Usage
StopIfTranscribedRegionIllegal(region)
Arguments
region |
The region to test (a character string) |
[Deprecated, use VCFsToCatalogs(variant.caller = "strelka") instead] Create ID (small insertions and deletions) catalog from Strelka ID VCF files
Description
[Deprecated, use VCFsToCatalogs(variant.caller = "strelka") instead]
Create ID (small insertions and deletions) catalog from the Strelka ID VCFs specified by files.
Usage
StrelkaIDVCFFilesToCatalog(
files,
ref.genome,
region = "unknown",
names.of.VCFs = NULL,
flag.mismatches = 0,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
files |
Character vector of file paths to the Strelka ID VCF files. |
ref.genome |
A |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls VCFsToIDCatalogs
Value
A list of elements:
- catalog: The ID (small insertions and deletions) catalog with attributes added. See as.catalog for more details.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of data frames which contain the original VCF's ID mutation rows with three additional columns seq.context.width, seq.context and ID.class added. The category assignment of each ID mutation in the VCF can be obtained from the ID.class column.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
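The offset described in this note amounts to adding one to each internal repeat size. A minimal sketch, assuming the internal sizes are held as the strings "0" to "5+"; the helper name to_display_repeat_size is hypothetical and is not part of ICAMS.

```r
# Internal deletion repeat sizes as used in ID catalogs (per the note above).
internal.sizes <- c("0", "1", "2", "3", "4", "5+")

# Hypothetical helper: shift each size by one for plotting and end-user
# documentation, preserving the open-ended "+" suffix.
to_display_repeat_size <- function(x) {
  open.ended <- endsWith(x, "+")
  n <- as.integer(sub("\\+$", "", x)) + 1L
  paste0(n, ifelse(open.ended, "+", ""))
}

to_display_repeat_size(internal.sizes)
```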
Examples
## Not run:
file <- c(system.file("extdata/Strelka-ID-vcf",
"Strelka.ID.GRCh37.s1.vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catID <- StrelkaIDVCFFilesToCatalog(file, ref.genome = "hg19",
region = "genome")}
## End(Not run)
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "strelka") instead] Create ID (small insertions and deletions) catalog from Strelka ID VCF files and plot them to PDF
Description
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "strelka") instead]
Create ID (small insertions and deletions) catalog from the Strelka ID VCFs specified by files and plot them to PDF.
Usage
StrelkaIDVCFFilesToCatalogAndPlotToPdf(
files,
ref.genome,
region = "unknown",
names.of.VCFs = NULL,
output.file = "",
flag.mismatches = 0,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
files |
Character vector of file paths to the Strelka ID VCF files. |
ref.genome |
A |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
output.file |
Optional. The base name of the PDF file to be produced;
the file is ending in |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls StrelkaIDVCFFilesToCatalog and PlotCatalogToPdf.
Value
A list of elements:
- catalog: The ID (small insertions and deletions) catalog with attributes added. See as.catalog for more details.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of data frames which contain the original VCF's ID mutation rows with three additional columns seq.context.width, seq.context and ID.class added. The category assignment of each ID mutation in the VCF can be obtained from the ID.class column.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Examples
## Not run:
file <- c(system.file("extdata/Strelka-ID-vcf",
"Strelka.ID.GRCh37.s1.vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catID <-
StrelkaIDVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19",
region = "genome",
output.file =
file.path(tempdir(), "StrelkaID"))}
## End(Not run)
[Deprecated, use VCFsToZipFile(variant.caller = "strelka") instead] Create a zip file which contains ID (small insertions and deletions) catalog and plot PDF from Strelka ID VCF files
Description
[Deprecated, use VCFsToZipFile(variant.caller = "strelka") instead]
Create ID (small insertions and deletions) catalog from the Strelka ID VCFs specified by dir, save the catalog as a CSV file, plot it to PDF and generate a zip archive of all the output files.
Usage
StrelkaIDVCFFilesToZipFile(
dir,
zipfile,
ref.genome,
region = "unknown",
names.of.VCFs = NULL,
base.filename = "",
flag.mismatches = 0,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
dir |
Pathname of the directory which contains only the Strelka
ID VCF files. Each Strelka ID VCF must have a file extension
".vcf" (case insensitive) and share the same |
zipfile |
Pathname of the zip file to be created. |
ref.genome |
A |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
base.filename |
Optional. The base name of the CSV and PDF file to be
produced; the file is ending in |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls StrelkaIDVCFFilesToCatalog, PlotCatalogToPdf, WriteCatalog and zip::zipr.
Value
A list of elements:
- catalog: The ID (small insertions and deletions) catalog with attributes added. See as.catalog for more details.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of data frames which contain the original VCF's ID mutation rows with three additional columns seq.context.width, seq.context and ID.class added. The category assignment of each ID mutation in the VCF can be obtained from the ID.class column.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles deletions in homopolymers, then handles deletions in simple repeats with longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Examples
## Not run:
dir <- c(system.file("extdata/Strelka-ID-vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <-
StrelkaIDVCFFilesToZipFile(dir,
zipfile = file.path(tempdir(), "test.zip"),
ref.genome = "hg19",
region = "genome",
base.filename = "Strelka-ID")
unlink(file.path(tempdir(), "test.zip"))}
## End(Not run)
[Deprecated, use VCFsToCatalogs(variant.caller = "strelka") instead] Create SBS and DBS catalogs from Strelka SBS VCF files
Description
[Deprecated, use VCFsToCatalogs(variant.caller = "strelka") instead]
Create 3 SBS catalogs (96, 192, 1536) and 3 DBS catalogs (78, 136, 144) from the Strelka SBS VCFs specified by files. The function will find and merge adjacent SBS pairs into DBS if their VAFs are very similar. The default threshold for the VAF difference is 0.02.
Usage
StrelkaSBSVCFFilesToCatalog(
files,
ref.genome,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
files |
Character vector of file paths to the Strelka SBS VCF files. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls VCFsToSBSCatalogs and VCFsToDBSCatalogs.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
If trans.ranges is not provided by the user and cannot be inferred by ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has attributes added. See as.catalog for more details.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions.
Comments
To add or change attributes of the catalog, you can use function attr. For example, attr(catalog, "abundance") <- custom.abundance.
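The attr assignment mentioned above works on any R matrix, so it can be tried without ICAMS installed. In the sketch below the catalog is a plain toy matrix and custom.abundance is a made-up named vector; real catalogs and abundances produced by ICAMS will have different dimensions and contents.

```r
# Toy stand-in for a catalog: 3 mutation types by 1 sample (made-up counts).
catalog <- matrix(c(10, 5, 3), ncol = 1,
                  dimnames = list(c("ACAA", "ACCA", "ACGA"), "sample1"))

# Made-up abundance values keyed by source sequence.
custom.abundance <- c(ACA = 100, ACC = 80, ACG = 20)

# Add (or overwrite) the attribute, then read it back.
attr(catalog, "abundance") <- custom.abundance
attr(catalog, "abundance")
```

Setting an attribute this way leaves the matrix values and dimnames untouched.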
Examples
## Not run:
file <- c(system.file("extdata/Strelka-SBS-vcf",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <- StrelkaSBSVCFFilesToCatalog(file, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37,
region = "genome")}
## End(Not run)
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "strelka") instead] Create SBS and DBS catalogs from Strelka SBS VCF files and plot them to PDF
Description
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "strelka") instead]
Create 3 SBS catalogs (96, 192, 1536) and 3 DBS catalogs (78, 136, 144) from the Strelka SBS VCFs specified by files and plot them to PDF. The function will find and merge adjacent SBS pairs into DBS if their VAFs are very similar. The default threshold for the VAF difference is 0.02.
Usage
StrelkaSBSVCFFilesToCatalogAndPlotToPdf(
files,
ref.genome,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
output.file = "",
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
files |
Character vector of file paths to the Strelka SBS VCF files. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
output.file |
Optional. The base name of the PDF files to be produced;
multiple files will be generated, each ending in |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls StrelkaSBSVCFFilesToCatalog and PlotCatalogToPdf.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
If trans.ranges is not provided by the user and cannot be inferred by ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has attributes added. See as.catalog for more details.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions.
Comments
To add or change attributes of the catalog, you can use function attr.
For example, attr(catalog, "abundance") <- custom.abundance.
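As a minimal sketch of the assignment above, the following uses a toy matrix in place of a real catalog. The row names, sample names and abundance values are made up for illustration, and a real ICAMS catalog carries further attributes that are not reproduced here.

```r
# Toy stand-in for a catalog: two mutation types (rows) by two samples (columns).
catalog <- matrix(c(10, 4, 7, 2), nrow = 2,
                  dimnames = list(c("ACAA", "ACCA"), c("s1", "s2")))

# Hypothetical abundance vector: how often each source 3-mer occurs.
custom.abundance <- c(ACA = 1000, ACC = 800)

# Attach (or overwrite) the attribute, then read it back.
attr(catalog, "abundance") <- custom.abundance
attr(catalog, "abundance")
```

Setting an attribute this way does not change the counts stored in the matrix.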
Examples
## Not run:
file <- c(system.file("extdata/Strelka-SBS-vcf",
"Strelka.SBS.GRCh37.s1.vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <-
StrelkaSBSVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37,
region = "genome",
output.file =
file.path(tempdir(), "StrelkaSBS"))}
## End(Not run)
[Deprecated, use VCFsToZipFile(variant.caller = "strelka") instead] Create a zip file which contains catalogs and plot PDFs from Strelka SBS VCF files
Description
[Deprecated, use VCFsToZipFile(variant.caller = "strelka") instead]
Create 3 SBS catalogs (96, 192, 1536) and 3 DBS catalogs (78, 136, 144) from the
Strelka SBS VCFs specified by dir, save the catalogs as CSV files,
plot them to PDF and generate a zip archive of all the output files. The
function will find and merge adjacent SBS pairs into DBS if their VAFs are
very similar. The default threshold value for VAF is 0.02.
Usage
StrelkaSBSVCFFilesToZipFile(
dir,
zipfile,
ref.genome,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
base.filename = "",
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
dir |
Pathname of the directory which contains only the Strelka
SBS VCF files. Each Strelka SBS VCF must have a file extension
".vcf" (case insensitive) and share the same |
zipfile |
Pathname of the zip file to be created. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
base.filename |
Optional. The base name of the CSV and PDF files to be
produced; multiple files will be generated, each ending in
|
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Details
This function calls StrelkaSBSVCFFilesToCatalog, PlotCatalogToPdf, WriteCatalog and zip::zipr.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
If trans.ranges is not provided by the user and cannot be inferred by
ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has
attributes added. See as.catalog for more details.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions.
Comments
To add or change attributes of the catalog, you can use function attr.
For example, attr(catalog, "abundance") <- custom.abundance.
Examples
## Not run:
dir <- c(system.file("extdata/Strelka-SBS-vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <-
StrelkaSBSVCFFilesToZipFile(dir,
zipfile = file.path(tempdir(), "test.zip"),
ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37,
region = "genome",
base.filename = "Strelka-SBS")
unlink(file.path(tempdir(), "test.zip"))}
## End(Not run)
Get all the sequence contexts of the indels in a given 1 base-pair indel class from a VCF
Description
Get all the sequence contexts of the indels in a given 1 base-pair indel class from a VCF
Usage
SymmetricalContextsFor1BPIndel(annotated.vcf, indel.class, flank.length = 5)
Arguments
annotated.vcf |
An in-memory |
indel.class |
A single character string that denotes a 1 base pair
insertion or deletion, as taken from |
flank.length |
The length of flanking bases around the position or homopolymer targeted by the indel. |
Value
A list of all sequence contexts for the specified indel.class.
Source catalog type is counts or counts.signature
Description
counts.signature -> density.signature, counts.signature; counts -> anything
Usage
TCFromCouSigCou(s, t)
density -> <anything>; density.signature -> density.signature, counts.signature
Description
density -> <anything>; density.signature -> density.signature, counts.signature
Usage
TCFromDenSigDen(s, t)
This function makes catalogs from the sample Mutect VCF file and compares them with the expected catalog information.
Description
This function makes catalogs from the sample Mutect VCF file and compares them with the expected catalog information.
Usage
TestMakeCatalogFromMutectVCFs()
This function is to make catalogs from the sample Strelka ID VCF files to compare with the expected catalog information.
Description
This function is to make catalogs from the sample Strelka ID VCF files to compare with the expected catalog information.
Usage
TestMakeCatalogFromStrelkaIDVCFs()
This function is to make catalogs from the sample Strelka SBS VCF files to compare with the expected catalog information.
Description
This function is to make catalogs from the sample Strelka SBS VCF files to compare with the expected catalog information.
Usage
TestMakeCatalogFromStrelkaSBSVCFs()
Plot a SignatureAnalyzer COMPOSITE signature or catalog into separate PDFs
Description
Plot a SignatureAnalyzer COMPOSITE signature or catalog into separate PDFs
Usage
TestPlotCatCOMPOSITE(catalog, filename.header, type, id = colnames(catalog))
Arguments
catalog |
Catalog or signature matrix |
filename.header |
Contain path and the beginning part of the file name.
The name of the pdf files will be:
filename.header |
type |
See |
id |
A vector containing the identifiers of the samples
or signatures in |
For indels, convert ICAMS/PCAWG7 rownames into SigProfiler rownames
Description
For indels, convert ICAMS/PCAWG7 rownames into SigProfiler rownames
Usage
TransRownames.ID.PCAWG.SigPro(vector.of.rownames)
Examples
ICAMS:::TransRownames.ID.PCAWG.SigPro("DEL:C:1:0") # 1:Del:C:0;
ICAMS:::TransRownames.ID.PCAWG.SigPro("INS:repeat:2:5+") # 2:Ins:R:5
For indels, convert SigProfiler rownames into ICAMS/PCAWG7 rownames
Description
For indels, convert SigProfiler rownames into ICAMS/PCAWG7 rownames
Usage
TransRownames.ID.SigPro.PCAWG(vector.of.rownames)
Examples
ICAMS:::TransRownames.ID.SigPro.PCAWG("1:Del:C:0") # DEL:C:1:0;
ICAMS:::TransRownames.ID.SigPro.PCAWG("2:Ins:R:5") # INS:repeat:2:5+
Transcript ranges data
Description
Transcript ranges and strand information for a particular reference genome.
Usage
trans.ranges.GRCh37
trans.ranges.GRCh38
trans.ranges.GRCm38
Format
A data.table which contains transcript range and strand information for a
particular reference genome. The colnames are chrom, start, end, strand,
Ensembl.gene.ID, gene.symbol. It uses one-based coordinates.
An object of class data.table (inherits from data.frame) with 19083 rows and 6 columns.
An object of class data.table (inherits from data.frame) with 19096 rows and 6 columns.
An object of class data.table (inherits from data.frame) with 20325 rows and 6 columns.
Details
This information is needed to generate catalogs that depend on transcriptional
strand information, for example catalogs of class SBS192Catalog.
trans.ranges.GRCh37: Human GRCh37.
trans.ranges.GRCh38: Human GRCh38.
trans.ranges.GRCm38: Mouse GRCm38.
For these two tables, only genes that are associated with a CCDS ID are kept for transcriptional strand bias analysis.
This information is needed for StrelkaSBSVCFFilesToCatalog,
StrelkaSBSVCFFilesToCatalogAndPlotToPdf, MutectVCFFilesToCatalog,
MutectVCFFilesToCatalogAndPlotToPdf, VCFsToSBSCatalogs and VCFsToDBSCatalogs.
Source
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gff3.gz
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M21/gencode.vM21.annotation.gff3.gz
Examples
trans.ranges.GRCh37
# chrom start end strand Ensembl.gene.ID gene.symbol
# 1 65419 71585 + ENSG00000186092 OR4F5
# 1 367640 368634 + ENSG00000235249 OR4F29
# 1 621059 622053 - ENSG00000284662 OR4F16
# 1 859308 879961 + ENSG00000187634 SAMD11
# 1 879583 894689 - ENSG00000188976 NOC2L
# ... ... ... ... ... ...
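To show how a table of this shape can be queried, here is a small base R sketch that copies the rows printed above into a plain data frame and looks up which transcripts cover a given one-based position. The helper genes_at is hypothetical and is not part of ICAMS; how ICAMS itself treats positions covered by transcripts on both strands is not shown here.

```r
# Toy copy of the rows shown above, as a plain data frame.
toy.ranges <- data.frame(
  chrom = "1",
  start = c(65419, 367640, 621059, 859308, 879583),
  end   = c(71585, 368634, 622053, 879961, 894689),
  strand = c("+", "+", "-", "+", "-"),
  gene.symbol = c("OR4F5", "OR4F29", "OR4F16", "SAMD11", "NOC2L"),
  stringsAsFactors = FALSE)

# Hypothetical helper: genes (and strands) overlapping a one-based position.
genes_at <- function(ranges, chrom, pos) {
  hit <- ranges$chrom == chrom & ranges$start <= pos & ranges$end >= pos
  ranges[hit, c("gene.symbol", "strand")]
}

genes_at(toy.ranges, "1", 70000)   # inside OR4F5 on the + strand
genes_at(toy.ranges, "1", 879600)  # inside both SAMD11 (+) and NOC2L (-)
genes_at(toy.ranges, "1", 100)     # not in any listed transcript: zero rows
```

The second query illustrates that transcript ranges on opposite strands can overlap, so a single position may not have a unique transcriptional strand.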
Transform between counts and density spectrum catalogs and counts and density signature catalogs
Description
Transform between counts and density spectrum catalogs and counts and density signature catalogs
Usage
TransformCatalog(
catalog,
target.ref.genome = NULL,
target.region = NULL,
target.catalog.type = NULL,
target.abundance = NULL
)
Arguments
catalog |
An SBS or DBS catalog as described in |
target.ref.genome |
A |
target.region |
A |
target.catalog.type |
A character string acting as a catalog type
identifier, one of "counts", "density", "counts.signature",
"density.signature"; see |
target.abundance |
A vector of counts, one for each source K-mer for mutations (e.g. for
strand-agnostic single nucleotide substitutions in trinucleotide – i.e.
3-mer – context, one count each for ACA, ACC, ACG, ... TTT). See
|
Details
Only the following transformations are legal:
- counts -> counts (deprecated, generates a warning; we strongly suggest that you work with densities if comparing spectra or signatures generated from data with different underlying abundances.)
- counts -> density
- counts -> (counts.signature, density.signature)
- density -> counts (the semantics are to infer the genome-wide or exome-wide counts based on the densities)
- density -> density (a null operation, generates a warning)
- density -> (counts.signature, density.signature)
- counts.signature -> counts.signature (used to transform between the source abundance and target.abundance)
- counts.signature -> density.signature
- counts.signature -> (counts, density) (generates an error)
- density.signature -> density.signature (a null operation, generates a warning)
- density.signature -> counts.signature
- density.signature -> (counts, density) (generates an error)
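The list of transformations above can be restated as a lookup table. The sketch below encodes it as a 4 x 4 character matrix; the matrix and the helper transform_status are hypothetical conveniences for reading the rules and are not part of ICAMS, which performs its own checks inside TransformCatalog.

```r
types <- c("counts", "density", "counts.signature", "density.signature")

# Rows are source catalog types, columns are target catalog types.
# "ok" = performed, "warning" = performed or null operation with a warning,
# "error" = not allowed.
status <- matrix(
  c("warning", "ok",      "ok", "ok",
    "ok",      "warning", "ok", "ok",
    "error",   "error",   "ok", "ok",
    "error",   "error",   "ok", "warning"),
  nrow = 4, byrow = TRUE, dimnames = list(source = types, target = types))

# Hypothetical helper, not part of ICAMS.
transform_status <- function(source.type, target.type) {
  status[source.type, target.type]
}

transform_status("counts", "density")
transform_status("density", "density")
transform_status("density.signature", "counts")
```

Read row-wise, the table shows that a signature can never be turned back into a spectrum, while a spectrum can always be turned into a signature.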
Value
A catalog as defined in ICAMS.
Rationale
The TransformCatalog
function transforms catalogs of mutational spectra or
signatures to account for differing abundances of the source
sequence of the mutations in the genome.
For example, mutations from ACG are much rarer in the human genome than mutations from ACC simply because CG dinucleotides are rare in the genome. Consequently, there are two possible representations of mutational spectra or signatures. One representation is based on mutation counts as observed in a given genome or exome, and this approach is widely used, as, for example, at https://cancer.sanger.ac.uk/cosmic/signatures, which presents signatures based on observed mutation counts in the human genome. We call these "counts-based spectra" or "counts-based signatures".
Alternatively, mutational spectra or signatures can be represented as mutations per source sequence, for example the number of ACT > AGT mutations occurring at all ACT 3-mers in a genome. We call these "density-based spectra" or "density-based signatures".
This function can also transform spectra based on observed genome-wide counts to "density"-based catalogs. In density-based catalogs mutations are expressed as mutations per source sequences. For example, a density-based catalog represents the proportion of ACCs mutated to ATCs, the proportion of ACGs mutated to ATGs, etc. This is different from counts-based mutational spectra catalogs, which contain the number of ACC > ATC mutations, the number of ACG > ATG mutations, etc.
This function can also transform observed-count based spectra or signatures from genome to exome based counts, or between different species (since the abundances of source sequences vary between genome and exome and between species).
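The difference between the two representations can be illustrated with a few lines of base R arithmetic. All numbers below are invented for illustration, and the sketch divides counts by source abundance only; any further scaling that ICAMS applies to densities is deliberately left out, so this is not a substitute for TransformCatalog.

```r
# Toy numbers, not real mutation counts or genome abundances.
counts <- c("ACC>ATC" = 50, "ACG>ATG" = 10)
genome.abundance <- c(ACC = 1000, ACG = 100)  # occurrences of each source 3-mer

# The source 3-mer is the first three characters of each mutation name.
source.3mer <- substr(names(counts), 1, 3)

# counts -> density: mutations per occurrence of the source sequence.
density <- counts / genome.abundance[source.3mer]
density

# density -> counts: infer counts from densities and the same abundances.
counts.again <- density * genome.abundance[source.3mer]

# The same densities applied to different (hypothetical) exome abundances.
exome.abundance <- c(ACC = 30, ACG = 5)
exome.counts <- density * exome.abundance[source.3mer]
exome.counts
```

In counts, ACC > ATC dominates (50 versus 10), but per source sequence ACG > ATG is the more frequent mutation (0.10 versus 0.05), which is the distinction the rationale draws.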
Examples
file <- system.file("extdata",
"strelka.regress.cat.sbs.96.csv",
package = "ICAMS")
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catSBS96.counts <- ReadCatalog(file, ref.genome = "hg19",
region = "genome",
catalog.type = "counts")
catSBS96.density <- TransformCatalog(catSBS96.counts,
target.ref.genome = "hg19",
target.region = "genome",
target.catalog.type = "density")}
Convert SBS1536-channel mutations-type identifiers like this "AC[C>A]GT" -> "ACCGTA"
Description
Convert SBS1536-channel mutations-type identifiers like this "AC[C>A]GT" -> "ACCGTA"
Usage
Unstaple1536(c1)
Arguments
c1 |
A vector of character strings with the mutation indicated by
e.g. |
Convert DBS78-channel mutations-type identifiers like this "AC>GA" -> "ACGA"
Description
Convert DBS78-channel mutations-type identifiers like this "AC>GA" -> "ACGA"
Usage
Unstaple78(c1)
Arguments
c1 |
A vector of character strings with a |
Convert SBS96-channel mutations-type identifiers like this "A[C>A]T" -> "ACTA"
Description
Convert SBS96-channel mutations-type identifiers like this "A[C>A]T" -> "ACTA"
Usage
Unstaple96(c1)
Arguments
c1 |
A vector of character strings with the mutation indicated by
e.g. |
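The three identifier conversions named in the titles of Unstaple1536, Unstaple78 and Unstaple96 can be sketched with base R regular expressions. The functions below are hypothetical re-implementations written only from the examples in those titles; the actual ICAMS code may be written differently and may validate its input.

```r
# Hypothetical re-implementations for illustration; not the ICAMS functions.

# "A[C>A]T" -> "ACTA": 5' base, reference base, 3' base, then the variant base.
unstaple96_sketch <- function(c1) {
  sub("^(.)\\[(.)>(.)\\](.)$", "\\1\\2\\4\\3", c1)
}

# "AC[C>A]GT" -> "ACCGTA": same idea with two flanking bases on each side.
unstaple1536_sketch <- function(c1) {
  sub("^(..)\\[(.)>(.)\\](..)$", "\\1\\2\\4\\3", c1)
}

# "AC>GA" -> "ACGA": drop the ">" separator.
unstaple78_sketch <- function(c1) {
  sub(">", "", c1, fixed = TRUE)
}

unstaple96_sketch("A[C>A]T")
unstaple1536_sketch("AC[C>A]GT")
unstaple78_sketch("AC>GA")
```

All three sketches are vectorized over c1, because sub operates on character vectors.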
Create SBS, DBS and Indel catalogs from VCFs
Description
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the VCFs specified by files.
Usage
VCFsToCatalogs(
files,
ref.genome,
variant.caller = "unknown",
num.of.cores = 1,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
tumor.col.names = NA,
filter.status = DefaultFilterStatus(variant.caller),
get.vaf.function = NULL,
...,
max.vaf.diff = 0.02,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE,
chr.names.to.process = NULL
)
Arguments
files |
Character vector of file paths to the VCF files. |
ref.genome |
A |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
max.vaf.diff |
Not applicable if |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
chr.names.to.process |
A character vector specifying the chromosome names in VCF whose variants will be kept and processed; variants on other chromosomes will be discarded. If NULL (default), all variants will be kept except those on chromosomes with names that contain strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt". |
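The default chromosome filter described for chr.names.to.process can be sketched as a substring match. The helper keep_chromosome is hypothetical and only mirrors the description above; the ICAMS implementation may differ in details such as case handling.

```r
# Substrings listed in the description of chr.names.to.process.
excluded.patterns <- c("GL", "KI", "random", "Hs", "M", "JH", "fix", "alt")

# Hypothetical helper, not an ICAMS function.
keep_chromosome <- function(chrom, chr.names.to.process = NULL) {
  if (!is.null(chr.names.to.process)) {
    # Explicit list given: keep only the named chromosomes.
    return(chrom %in% chr.names.to.process)
  }
  # Default: drop any name containing one of the excluded substrings.
  !grepl(paste(excluded.patterns, collapse = "|"), chrom)
}

chrom <- c("1", "X", "GL000192.1", "MT", "chr1_KI270706v1_random", "chrM")
chrom[keep_chromosome(chrom)]
chrom[keep_chromosome(chrom, chr.names.to.process = "MT")]
```

Because "M" is one of the excluded substrings, mitochondrial names such as "MT" and "chrM" are dropped by default and have to be requested explicitly.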
Details
This function calls VCFsToSBSCatalogs, VCFsToDBSCatalogs and VCFsToIDCatalogs.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- catID: Matrix of ID (small insertions and deletions) catalog.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
  - ID: ID VCF annotated by AnnotateIDVCF with one new column ID.class showing the mutation class for each ID variant.
If trans.ranges is not provided by the user and cannot be inferred by
ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has
attributes added. See as.catalog for more details.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the
deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Comments
To add or change attributes of the catalog, you can use function attr.
For example, attr(catalog, "abundance") <- custom.abundance.
Examples
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <- VCFsToCatalogs(file, ref.genome = "hg19",
variant.caller = "mutect", region = "genome")}
Create SBS, DBS and Indel catalogs from VCFs and plot them to PDF
Description
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the VCFs specified by files and plot them to PDF.
Usage
VCFsToCatalogsAndPlotToPdf(
files,
output.dir,
ref.genome,
variant.caller = "unknown",
num.of.cores = 1,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
tumor.col.names = NA,
filter.status = DefaultFilterStatus(variant.caller),
get.vaf.function = NULL,
...,
max.vaf.diff = 0.02,
base.filename = "",
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE,
chr.names.to.process = NULL
)
Arguments
files |
Character vector of file paths to the VCF files. |
output.dir |
The directory where the PDF files will be saved. |
ref.genome |
A |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
max.vaf.diff |
Not applicable if |
base.filename |
Optional. The base name of the PDF files to be produced;
multiple files will be generated, each ending in |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
chr.names.to.process |
A character vector specifying the chromosome names in VCF whose variants will be kept and processed; variants on other chromosomes will be discarded. If NULL (default), all variants will be kept except those on chromosomes with names that contain strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt". |
Details
This function calls VCFsToCatalogs and PlotCatalogToPdf.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- catID: Matrix of ID (small insertions and deletions) catalog.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
  - ID: ID VCF annotated by AnnotateIDVCF with one new column ID.class showing the mutation class for each ID variant.
If trans.ranges is not provided by the user and cannot be inferred by
ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has
attributes added. See as.catalog for more details.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the
deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Comments
To add or change attributes of the catalog, you can use function attr.
For example, attr(catalog, "abundance") <- custom.abundance.
Examples
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <-
VCFsToCatalogsAndPlotToPdf(file, ref.genome = "hg19",
output.dir = tempdir(),
variant.caller = "mutect",
region = "genome",
base.filename = "Mutect")}
Create DBS catalogs from VCFs
Description
Create a list of 3 catalogs (one each for DBS78, DBS144 and DBS136) out of the contents in list.of.DBS.vcfs. The VCFs must not contain any type of mutation other than DBSs.
Usage
VCFsToDBSCatalogs(
list.of.DBS.vcfs,
ref.genome,
num.of.cores = 1,
trans.ranges = NULL,
region = "unknown",
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
list.of.DBS.vcfs |
List of in-memory data frames of pure DBS mutations – no SBS or 3+BS mutations. The list names will be the sample ids in the output catalog. |
ref.genome |
A |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Value
A list containing the following objects:
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
If trans.ranges is not provided by user and cannot be inferred by
ICAMS, DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog for more details.
Comments
To add or change attributes of the catalog, you can use function attr.
For example, attr(catalog, "abundance") <- custom.abundance.
Note
DBS 144 catalog only contains mutations in transcribed regions.
Examples
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.DBS.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")$DBS
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs.DBS <- VCFsToDBSCatalogs(list.of.DBS.vcfs, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37,
region = "genome")}
Create ID (small insertions and deletions) catalog from ID VCFs
Description
Create ID (small insertions and deletions) catalog from ID VCFs
Usage
VCFsToIDCatalogs(
list.of.vcfs,
ref.genome,
num.of.cores = 1,
trans.ranges = NULL,
region = "unknown",
flag.mismatches = 0,
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
list.of.vcfs |
List of in-memory ID VCFs. The list names will be the sample ids in the output catalog. |
ref.genome |
A |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string acting as a region identifier, one of "genome", "exome". |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Value
A list of elements:
- catalog: The ID (small insertions and deletions) catalog with attributes added. See as.catalog for details.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of data frames which contain the original VCF's ID mutation rows with three additional columns seq.context.width, seq.context and ID.class added. The category assignment of each ID mutation in VCF can be obtained from ID.class column.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and if the
deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID and the functions it calls for handling of insertions.
Examples
file <- c(system.file("extdata/Strelka-ID-vcf/",
"Strelka.ID.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.ID.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka")$ID
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5",
quietly = TRUE)) {
catID <- VCFsToIDCatalogs(list.of.ID.vcfs, ref.genome = "hg19",
region = "genome")}
Create SBS catalogs from SBS VCFs
Description
Create a list of 3 catalogs (one each for 96, 192, 1536) out of the contents in list.of.SBS.vcfs. The SBS VCFs must not contain DBSs, indels, or other types of mutations.
Usage
VCFsToSBSCatalogs(
list.of.SBS.vcfs,
ref.genome,
num.of.cores = 1,
trans.ranges = NULL,
region = "unknown",
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Arguments
list.of.SBS.vcfs |
List of in-memory data frames of pure SBS mutations – no DBS or 3+BS mutations. The list names will be the sample ids in the output catalog. |
ref.genome |
A |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
If trans.ranges is not provided by user and cannot be inferred by
ICAMS, SBS 192 catalog will not be generated. Each catalog has attributes
added. See as.catalog for more details.
Comments
To add or change attributes of the catalog, you can use function attr.
For example, attr(catalog, "abundance") <- custom.abundance.
Note
SBS 192 catalogs only contain mutations in transcribed regions.
Examples
file <- c(system.file("extdata/Mutect-vcf",
"Mutect.GRCh37.s1.vcf",
package = "ICAMS"))
list.of.SBS.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")$SBS
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs.SBS <- VCFsToSBSCatalogs(list.of.SBS.vcfs, ref.genome = "hg19",
trans.ranges = trans.ranges.GRCh37,
region = "genome")}
Create a zip file which contains catalogs and plot PDFs from VCFs
Description
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the VCFs specified by dir, save the catalogs
as CSV files, plot them to PDF and generate a zip archive of all the output files.
Usage
VCFsToZipFile(
dir,
files,
zipfile,
ref.genome,
variant.caller = "unknown",
num.of.cores = 1,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
tumor.col.names = NA,
filter.status = DefaultFilterStatus(variant.caller),
get.vaf.function = NULL,
...,
max.vaf.diff = 0.02,
base.filename = "",
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE,
chr.names.to.process = NULL
)
Arguments
dir |
Pathname of the directory which contains VCFs that come from the
same variant caller. Each VCF must have a file extension
".vcf" (case insensitive) and share the same |
files |
Character vector of file paths to the VCF files. Only
one of argument |
zipfile |
Pathname of the zip file to be created. |
ref.genome |
A |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
max.vaf.diff |
Not applicable if |
base.filename |
Optional. The base name of the CSV and PDF files to be
produced; multiple files will be generated, each ending in
|
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
chr.names.to.process |
A character vector specifying the chromosome names in the VCF whose variants will be kept and processed; variants on other chromosomes will be discarded. If NULL (default), all variants will be kept except those on chromosomes with names that contain the strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt". |
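The default filtering described for chr.names.to.process can be sketched in base R as follows. This is not the ICAMS implementation; it assumes case-sensitive substring matching, and the chromosome names are made up for illustration:

```r
# Substrings listed in the documentation for the default (NULL) behaviour.
excluded.patterns <- c("GL", "KI", "random", "Hs", "M", "JH", "fix", "alt")

# Hypothetical chromosome names as they might appear in a VCF.
chr.names <- c("1", "chr2", "X", "MT", "chrM",
               "GL000207.1", "chr1_KI270706v1_random")

# Assumption: a name is excluded if it contains any of the substrings,
# matched case-sensitively. None of the patterns contain regex metacharacters,
# so they can be joined with "|" into a single regular expression.
is.excluded <- grepl(paste(excluded.patterns, collapse = "|"), chr.names)

kept <- chr.names[!is.excluded]
kept
```

Under this assumption, mitochondrial names such as "MT" and "chrM" are dropped because they contain "M".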
Details
This function calls VCFsToCatalogs, PlotCatalogToPdf, WriteCatalog and
zip::zipr.
Value
A list containing the following objects:
- catSBS96, catSBS192, catSBS1536: Matrix of 3 SBS catalogs (one each for 96, 192, and 1536).
- catDBS78, catDBS136, catDBS144: Matrix of 3 DBS catalogs (one each for 78, 136, and 144).
- catID: Matrix of ID (small insertions and deletions) catalog.
- discarded.variants: Non-NULL only if there are variants that were excluded from the analysis. See the added extra column discarded.reason for more details.
- annotated.vcfs: Non-NULL only if return.annotated.vcfs = TRUE. A list of elements:
  - SBS: SBS VCF annotated by AnnotateSBSVCF with three new columns SBS96.class, SBS192.class and SBS1536.class showing the mutation class for each SBS variant.
  - DBS: DBS VCF annotated by AnnotateDBSVCF with three new columns DBS78.class, DBS136.class and DBS144.class showing the mutation class for each DBS variant.
  - ID: ID VCF annotated by AnnotateIDVCF with one new column ID.class showing the mutation class for each ID variant.
If trans.ranges is not provided by the user and cannot be inferred by
ICAMS, the SBS 192 and DBS 144 catalogs will not be generated. Each catalog has
attributes added. See as.catalog for more details.
ID classification
See https://github.com/steverozen/ICAMS/blob/v3.0.9-branch/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del, which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units (e.g. CACACACA, see FindMaxRepeatDel), and, if the
deletion is not in a simple repeat, looks for microhomology (see FindDelMH).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
Note
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
Comments
To add or change attributes of the catalog, you can use function attr.
For example, attr(catalog, "abundance") <- custom.abundance.
Examples
dir <- c(system.file("extdata/Mutect-vcf",
package = "ICAMS"))
if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) {
catalogs <-
VCFsToZipFile(dir,
zipfile = file.path(tempdir(), "test.zip"),
ref.genome = "hg19",
variant.caller = "mutect",
region = "genome",
base.filename = "Mutect")
unlink(file.path(tempdir(), "test.zip"))}
Analogous to VCFsToZipFile, also generates density CSV and PDF files in the zip archive.
Description
Analogous to VCFsToZipFile, but also generates density CSV and PDF files in the zip
archive.
Usage
VCFsToZipFileXtra(
dir,
zipfile,
ref.genome,
variant.caller = "unknown",
num.of.cores = 1,
trans.ranges = NULL,
region = "unknown",
names.of.VCFs = NULL,
tumor.col.names = NA,
filter.status = DefaultFilterStatus(variant.caller),
get.vaf.function = NULL,
...,
max.vaf.diff = 0.02,
base.filename = "",
return.annotated.vcfs = FALSE,
suppress.discarded.variants.warnings = TRUE
)
Write a catalog to a file.
Description
This internal function is called by exported functions to do the actual writing of the catalog.
Usage
WriteCat(catalog, file, num.row, row.order, row.header, strict, sep = ",")
Arguments
catalog |
A catalog as defined in |
file |
The path of the file to be written. |
num.row |
The number of rows in the file to be written. |
row.order |
The row order to be used for writing the file. |
row.header |
The row header to be used for writing the file. |
strict |
If TRUE, then stop if additional checks on the input fail. |
Write a catalog
Description
Write a catalog to a file.
Usage
WriteCatalog(catalog, file, strict = TRUE)
Arguments
catalog |
A catalog as defined in |
file |
The path to the file to be created. |
strict |
If TRUE, do additional checks on the input, and stop if the checks fail. |
Details
See also ReadCatalog.
Note
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
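The relationship between the two size conventions in this note can be sketched as below. It assumes the displayed size is the catalog size plus one, with the open-ended top bin "5+" shown as "6+"; the variable names are hypothetical and this is not ICAMS code:

```r
# Assumption: catalog (internal) repeat sizes 0..5 correspond to displayed
# sizes 1..6, and the last bin is open-ended in both conventions.
internal.size <- 0:5

internal.label <- ifelse(internal.size == 5, "5+", as.character(internal.size))
display.label  <- ifelse(internal.size == 5, "6+", as.character(internal.size + 1))

# Side-by-side view of the two labelings.
data.frame(internal = internal.label, display = display.label)
```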
Examples
file <- system.file("extdata",
"strelka.regress.cat.sbs.96.csv",
package = "ICAMS")
catSBS96 <- ReadCatalog(file)
WriteCatalog(catSBS96, file = file.path(tempdir(), "catSBS96.csv"))
Write Indel Catalogs in SigProExtractor format
Description
Write Indel Catalogs in SigProExtractor format to a file.
Usage
WriteCatalogIndelSigPro(catalog, file, strict = TRUE, sep = "\t")
Arguments
catalog |
A catalog as defined in |
file |
The path to the file to be created. |
strict |
If TRUE, do additional checks on the input, and stop if the checks fail. |
sep |
Separator to use in the output file. Older versions of SigProfiler read comma-separated files; as of May 2020 it reads tab-separated files. |
Note
In ID (small insertions and deletions) catalogs in SigProExtractor format, deletion repeat sizes range from 0 to 5, rather than 0 to 5+.
K-mer abundances
Description
An R list with one element each for BSgenome.Hsapiens.1000genomes.hs37d5,
BSgenome.Hsapiens.UCSC.hg38 and BSgenome.Mmusculus.UCSC.mm10.
Each element is in turn a sub-list keyed by exome, transcript, and genome.
Each element of the sub-list is keyed by the number of rows in the catalog
class (as a string, e.g. "78", not 78). The keys are:
78 (DBS78Catalog), 96 (SBS96Catalog), 136 (DBS136Catalog),
144 (DBS144Catalog), 192 (SBS192Catalog), and 1536 (SBS1536Catalog).
So, for example, to get the exome abundances for SBS96 catalogs for
BSgenome.Hsapiens.UCSC.hg38 one would reference
all.abundance[["BSgenome.Hsapiens.UCSC.hg38"]][["exome"]][["96"]]
or all.abundance$BSgenome.Hsapiens.UCSC.hg38$exome$"96".
The value of the abundance is an integer vector with the K-mers
as names and each value being the count of that K-mer.
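The keying scheme can be tried on a small stand-in list without loading the package. The list toy.abundance below is hypothetical and its counts are invented; it only mirrors the nesting described above (genome, then region, then number of catalog rows as a string):

```r
# Hypothetical miniature of the nested structure; counts are made up.
toy.abundance <- list(
  BSgenome.Hsapiens.UCSC.hg38 = list(
    exome = list(
      "78" = c(AC = 10L, AT = 12L),
      "96" = c(ACA = 7L, ACC = 5L)
    )
  )
)

# The two access styles from the description are equivalent.
by.brackets <- toy.abundance[["BSgenome.Hsapiens.UCSC.hg38"]][["exome"]][["96"]]
by.dollar   <- toy.abundance$BSgenome.Hsapiens.UCSC.hg38$exome$"96"
identical(by.brackets, by.dollar)

# The key must be the string "96": a numeric 96 would be treated as a
# position in the list, not as a name.
names(toy.abundance$BSgenome.Hsapiens.UCSC.hg38$exome)
```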
Usage
all.abundance
Format
See Description.
Examples
all.abundance$BSgenome.Hsapiens.UCSC.hg38$transcript$`144`
# AA AC AG AT CA CC ...
# 90769160 57156295 85738416 87552737 83479655 63267896 ...
# There are 90769160 AAs on the sense strands of transcripts in
# this genome.
Create a catalog from a matrix, data.frame, or vector
Description
Create a catalog from a matrix, data.frame, or vector
Usage
as.catalog(
object,
ref.genome = NULL,
region = "unknown",
catalog.type = "counts",
abundance = NULL,
infer.rownames = FALSE
)
Arguments
object |
A numeric |
ref.genome |
A |
region |
A character string designating a region, one of
|
catalog.type |
One of "counts", "density", "counts.signature", "density.signature". |
abundance |
If |
infer.rownames |
If |
Value
A catalog as described in ICAMS.
Examples
# Create an SBS96 catalog with all mutation counts equal to 1.
object <- matrix(1, nrow = 96, ncol = 1,
dimnames = list(catalog.row.order$SBS96))
catSBS96 <- as.catalog(object)
Reverse complement every string in string.vec
Description
Based on reverseComplement.
Handles IUPAC ambiguity codes but not "u" (uracil)
(see <https://en.wikipedia.org/wiki/Nucleic_acid_notation>).
Usage
revc(string.vec)
Arguments
string.vec |
A character vector. |
Value
A character vector with the reverse complement of every
string in string.vec.
Examples
revc("aTgc") # GCAT
# A vector and strings with ambiguity codes
revc(c("ATGC", "aTGc", "wnTCb")) # GCAT GCAT VGANW
## Not run:
revc("ACGU") # An error
## End(Not run)
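For readers who want to see the idea without Biostrings, the following base-R sketch reproduces the outputs shown in the examples above. The function revc.sketch is hypothetical and is not how revc is implemented; unlike revc, it passes unrecognized characters such as "U" through unchanged instead of raising an error:

```r
# Hypothetical base-R illustration; ICAMS itself relies on reverseComplement.
revc.sketch <- function(string.vec) {
  upper <- toupper(string.vec)
  # Complement each IUPAC code; S, W and N are their own complements,
  # so they are left untouched by chartr.
  comp <- chartr("ACGTRYKMBDHV", "TGCAYRMKVHDB", upper)
  # Reverse each complemented string character by character.
  vapply(strsplit(comp, ""),
         function(ch) paste(rev(ch), collapse = ""),
         character(1), USE.NAMES = FALSE)
}

revc.sketch("aTgc")
revc.sketch(c("ATGC", "aTGc", "wnTCb"))
```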