Type: | Package |
Title: | Convert TCR Gene Names |
Description: | Convert T Cell Receptor (TCR) gene names between the 10X Genomics, Adaptive Biotechnologies, and ImMunoGeneTics (IMGT) nomenclatures. |
Version: | 1.0 |
License: | MIT + file LICENSE |
URL: | https://github.com/seshadrilab/tcrconvertr, https://seshadrilab.github.io/tcrconvertr/ |
BugReports: | https://github.com/seshadrilab/tcrconvertr/issues |
Encoding: | UTF-8 |
Imports: | stats, utils, rappdirs |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown, roxyglobals, testthat (≥ 3.0.0), mockery |
Config/testthat/edition: | 3 |
Config/roxyglobals/filename: | globals.R |
Config/roxyglobals/unique: | FALSE |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-04-09 17:05:56 UTC; emmabishop |
Author: | Emma Bishop |
Maintainer: | Emma Bishop <emmab5@uw.edu> |
Repository: | CRAN |
Date/Publication: | 2025-04-17 07:20:08 UTC |
Add -01 to gene names lacking gene-level info
Description
Some gene names carry only the IMGT subgroup (e.g. TRBV2) and allele (e.g. *01)
designations. The Adaptive format always includes an IMGT gene-level designation
(e.g. -01), with "-01" as the apparent default. add_dash_one() adds this
default gene-level designation if it is missing.
Usage
add_dash_one(gene_str)
Arguments
gene_str |
A string, the gene name. |
Value
A string, the updated gene name.
Examples
add_dash_one("TRBV2*01")
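For intuition, the rule can be sketched in base R. This is a minimal illustration, not the package's implementation: add_dash_one_sketch() is a hypothetical name, and the sketch assumes that a name without a "-" lacks a gene-level designation and that the allele, if present, starts at the first "*".

```r
# Hypothetical helper for illustration; not part of TCRconvertR.
add_dash_one_sketch <- function(gene_str) {
  # Assumption: a hyphen means a gene-level designation is already present.
  if (grepl("-", gene_str, fixed = TRUE)) {
    return(gene_str)
  }
  if (grepl("*", gene_str, fixed = TRUE)) {
    # Insert "-01" just before the allele designation.
    return(sub("*", "-01*", gene_str, fixed = TRUE))
  }
  # No allele present: append the default gene-level designation.
  paste0(gene_str, "-01")
}

add_dash_one_sketch("TRBV2*01")
add_dash_one_sketch("TRBV29-1*01")
```

Names that already contain a hyphen are returned unchanged, so the helper is safe to apply to every gene in a column.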
Create lookup tables
Description
build_lookup_from_fastas() processes the IMGT reference FASTA files in a given
folder to generate the lookup tables used for gene name conversion. It
extracts all gene names and transforms them into the 10X and Adaptive formats
following predefined conversion rules. The following files are created:
- lookup.csv: IMGT gene names and their 10X and Adaptive equivalents.
- lookup_from_tenx.csv: Gene names aggregated by their 10X identifiers, with one representative allele (*01) for each.
- lookup_from_adaptive.csv: Adaptive gene names, with or without alleles and gene designations, and their IMGT and 10X equivalents.
The files are stored in a given subfolder (species) within the appropriate
application folder via rappdirs. For example:
macOS: ~/Library/Application Support/<AppName>
Windows: C:\Documents and Settings\<User>\Application Data\Local Settings\<AppAuthor>\<AppName>
Linux: ~/.local/share/<AppName>
If a folder named species already exists in that location, it will be replaced.
Usage
build_lookup_from_fastas(data_dir, species)
Arguments
data_dir |
A string, the directory containing FASTA files. |
species |
A string, the name of the species that will be used when running TCRconvert with these lookup tables. |
Details
Key transformations from IMGT:
- 10X: Remove allele information (e.g., *01) and modify /DV occurrences.
- Adaptive: Apply renaming rules, such as adding gene-level designations and zero-padding single-digit numbers. Convert constant genes to "NoData" (Adaptive only captures VDJ), which become NA after the merge in convert_gene().
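The 10X rule above can be sketched in base R. This is a rough illustration under stated assumptions, not the package's code: imgt_to_tenx_sketch() is a hypothetical name, and the handling of "/DV" (dropping the slash) is an assumption, since the text only says that these occurrences are modified. The Adaptive renaming rules are not covered here.

```r
# Hypothetical helper for illustration; not part of TCRconvertR.
imgt_to_tenx_sketch <- function(imgt_names) {
  # Drop the allele designation: everything from the first "*" onward.
  no_allele <- sub("\\*.*$", "", imgt_names)
  # Assumption: "/DV" is rewritten as "DV" (the slash is removed).
  gsub("/DV", "DV", no_allele, fixed = TRUE)
}

imgt <- c("TRAC*01", "TRAV1-1*01", "TRAV1-1*02", "TRAV14/DV4*01")
tenx <- imgt_to_tenx_sketch(imgt)

# Several alleles collapse onto one 10X name, so deduplicate.
unique(tenx)
```

Because alleles are dropped, the mapping from IMGT to 10X is many-to-one, which is why a separate table keyed by 10X names needs a single representative allele per gene.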
Value
A string, the path to the new lookup directory.
Examples
# For the example, create and use a temporary folder
fastadir <- file.path(tempdir(), "TCRconvertR_tmp")
dir.create(fastadir, showWarnings = FALSE, recursive = TRUE)
trav <- get_example_path("fasta_dir/test_trav.fa")
trbv <- get_example_path("fasta_dir/test_trbv.fa")
file.copy(c(trav, trbv), fastadir)
# Build lookup tables
build_lookup_from_fastas(fastadir, "rabbit")
# Clean up temporary folder
unlink(fastadir, recursive = TRUE)
Choose lookup table
Description
choose_lookup() determines which CSV lookup table to use based on the
input format (frm) and returns the path to that file.
Usage
choose_lookup(frm, to, species = "human", verbose = TRUE)
Arguments
frm |
A string, the input format of TCR data. Must be one of
|
to |
A string, the output format of TCR data. Must be one of
|
species |
A string, the species. Optional; defaults to "human". |
verbose |
A boolean, whether to show messages. Optional; defaults to TRUE. |
Value
A string, the path to the correct lookup table.
Examples
choose_lookup("imgt", "adaptive")
Convert gene names
Description
convert_gene() converts T-cell receptor (TCR) gene names between the IMGT,
10X, and Adaptive formats. It determines the columns to convert based
on the input format (frm) unless specified by the user (frm_cols). It
returns a modified version of the input data frame with converted gene names
while preserving row order.
Usage
convert_gene(df, frm, to, species = "human", frm_cols = NULL, verbose = TRUE)
Arguments
df |
A dataframe containing TCR gene names. |
frm |
A string, the input format of TCR data. Must be one of
|
to |
A string, the output format of TCR data. Must be one of
|
species |
A string, the species. Optional; defaults to "human". |
frm_cols |
A character vector of custom gene column names. Optional; defaults to NULL. |
verbose |
A boolean, whether to display messages. Optional; defaults to TRUE. |
Details
Gene names are converted by performing a merge between the relevant
input columns and a species-specific lookup table containing IMGT reference
genes in all three formats.
Behavioral Notes
- If a gene name cannot be mapped, it is replaced with NA and a warning is raised.
- If frm is 'imgt' and frm_cols is not provided, 10X column names are assumed.
- Constant (C) genes are set to NA when converting to Adaptive formats, as Adaptive does not capture constant regions.
- The input does not need to include all gene types; partial inputs (e.g., only V genes) are supported.
- If no values in a custom column can be mapped (e.g. a CDR3 column), it is skipped and a warning is raised.
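The lookup-based conversion and the NA-plus-warning behavior can be illustrated with a small base R sketch. It is not the package's implementation: convert_column_sketch() and the toy lookup table are hypothetical, and match() is swapped in for the merge because it acts like a left join on one column while keeping the input order.

```r
# Toy lookup table; the rows are illustrative only.
lookup <- data.frame(
  tenx = c("TRAV1-2", "TRBV6-1", "TRAC"),
  imgt = c("TRAV1-2*01", "TRBV6-1*01", "TRAC*01"),
  stringsAsFactors = FALSE
)

df <- data.frame(
  v_gene = c("TRBV6-1", "TRAV1-2", "NOT_A_GENE"),
  cdr3 = c("CASS", "CAVR", "CASR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper for illustration; not part of TCRconvertR.
convert_column_sketch <- function(values, lookup, frm, to) {
  # Look each value up in the `frm` column; unmatched values become NA.
  converted <- lookup[[to]][match(values, lookup[[frm]])]
  unmapped <- unique(values[is.na(converted) & !is.na(values)])
  if (length(unmapped) > 0) {
    warning("These genes could not be mapped and were set to NA: ",
            paste(unmapped, collapse = ", "))
  }
  converted
}

out <- df
# The warning is suppressed here only to keep the example output clean.
out$v_gene <- suppressWarnings(
  convert_column_sketch(df$v_gene, lookup, "tenx", "imgt")
)
out
```

Rows stay in their original order and untouched columns such as cdr3 are carried through unchanged.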
Standard Column Names
If frm_cols is not provided, these column names will be used if present:
- IMGT: "v_gene", "d_gene", "j_gene", "c_gene"
- 10X: "v_gene", "d_gene", "j_gene", "c_gene"
- Adaptive: "v_resolved", "d_resolved", "j_resolved"
- Adaptive v2: "vMaxResolved", "dMaxResolved", "jMaxResolved"
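One way to picture the column selection is a named list keyed by input format. The sketch below is an assumption-laden illustration, not the package's code: pick_cols_sketch() is a hypothetical name, and the format keys (in particular "adaptivev2") are assumed for the example.

```r
# Standard column names per input format, as listed above.
# The list keys are assumptions made for this sketch.
standard_cols <- list(
  imgt = c("v_gene", "d_gene", "j_gene", "c_gene"),
  tenx = c("v_gene", "d_gene", "j_gene", "c_gene"),
  adaptive = c("v_resolved", "d_resolved", "j_resolved"),
  adaptivev2 = c("vMaxResolved", "dMaxResolved", "jMaxResolved")
)

# Hypothetical helper for illustration; not part of TCRconvertR.
pick_cols_sketch <- function(df, frm, frm_cols = NULL) {
  # Custom columns take precedence over the standard ones.
  candidates <- if (is.null(frm_cols)) standard_cols[[frm]] else frm_cols
  # Keep only the candidates that are actually present in the data frame.
  candidates[candidates %in% names(df)]
}

df <- data.frame(
  barcode = "AAAC", v_gene = "TRAV1-2", j_gene = "TRAJ33", cdr3 = "CAV"
)
pick_cols_sketch(df, "tenx")
```

Filtering against names(df) is what allows partial inputs, such as a table with only V and J genes.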
Value
A dataframe with converted TCR gene names.
Examples
tcr_file <- get_example_path("tenx.csv")
df <- read.csv(tcr_file)[c("barcode", "v_gene", "j_gene", "cdr3")]
df
convert_gene(df, "tenx", "adaptive", verbose = FALSE)
Extract all gene names from a folder of FASTAs
Description
extract_imgt_genes() first runs parse_imgt_fasta() on all FASTA files in
a given folder to pull out the gene names. Then it returns those names in an
alphabetically sorted dataframe.
Usage
extract_imgt_genes(data_dir)
Arguments
data_dir |
A string, the path to the directory containing FASTA files. |
Value
A dataframe of gene names.
Examples
# Given a folder with FASTA files containing these headers:
# >SomeText|TRAC*01|MoreText|
# >SomeText|TRAV1-1*01|MoreText|
# >SomeText|TRAV1-1*02|MoreText|
# >SomeText|TRAV1-2*01|MoreText|
# >SomeText|TRAV14/DV4*01|MoreText|
# >SomeText|TRAV38-1*01|MoreText|
# >SomeText|TRAV38-2/DV8*01|MoreText|
# >SomeText|TRBV29-1*01|MoreText|
# >SomeText|TRBV29-1*02|MoreText|
# >SomeText|TRBV29/OR9-2*01|MoreText|
fastadir <- get_example_path("fasta_dir/")
extract_imgt_genes(fastadir)
Get full path to an example file or directory
Description
get_example_path() takes a file or folder name that is expected to be
located under the TCRconvertR examples directory and gets the full path
to that item.
Usage
get_example_path(file_name)
Arguments
file_name |
A string, the name of the example file or directory. |
Value
A string, the path to the example file or directory.
Examples
# Will probably be in a temp folder for the function example
get_example_path("tenx.csv")
Add a 0 to single-digit gene-level designation
Description
pad_single_digit() takes a gene name and ensures that any single-digit
number following a sequence of letters is padded with a leading zero.
This is to match the Adaptive format.
Usage
pad_single_digit(gene_str)
Arguments
gene_str |
A string, the gene name. |
Value
A string, the updated gene name.
Examples
pad_single_digit("TCRBV1-2")
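The padding rule as described can be approximated with a single regular expression. This is a sketch based on the description only, not the package's implementation; pad_single_digit_sketch() is a hypothetical name.

```r
# Hypothetical helper for illustration; not part of TCRconvertR.
pad_single_digit_sketch <- function(gene_str) {
  # Match a digit that directly follows a letter and is not itself
  # followed by another digit, then prefix it with "0".
  gsub("(?<=[A-Za-z])([0-9])(?![0-9])", "0\\1", gene_str, perl = TRUE)
}

pad_single_digit_sketch("TCRBV1-2")
pad_single_digit_sketch("TCRBV12-3")
```

The lookbehind restricts the match to digits that follow letters, so the number after the hyphen is left alone by this helper.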
Extract gene names from a reference FASTA
Description
parse_imgt_fasta() extracts the second element from a "|"-delimited FASTA
header, which will be the gene name for IMGT reference FASTAs.
Usage
parse_imgt_fasta(infile)
Arguments
infile |
A string, the path to the FASTA file. |
Value
A character vector of gene names.
Examples
# Given a FASTA file containing these headers:
# >SomeText|TRBV29-1*01|MoreText|
# >SomeText|TRBV29-1*02|MoreText|
# >SomeText|TRBV29/OR9-2*01|MoreText|
fasta <- get_example_path("fasta_dir/test_trbv.fa")
parse_imgt_fasta(fasta)
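The header parsing described above can be reproduced with a few lines of base R. This is a minimal sketch, not the package's code: parse_headers_sketch() is a hypothetical name, and the temporary FASTA file with placeholder sequences exists only to make the example self-contained.

```r
# Hypothetical helper for illustration; not part of TCRconvertR.
parse_headers_sketch <- function(infile) {
  lines <- readLines(infile)
  # FASTA header lines start with ">".
  headers <- lines[startsWith(lines, ">")]
  # Split each header on "|" and keep the second field.
  fields <- strsplit(headers, "|", fixed = TRUE)
  vapply(fields, function(x) x[2], character(1))
}

# Write a small FASTA file with placeholder sequences.
fasta <- tempfile(fileext = ".fa")
writeLines(c(
  ">SomeText|TRBV29-1*01|MoreText|",
  "ACGT",
  ">SomeText|TRBV29-1*02|MoreText|",
  "ACGT"
), fasta)

genes <- parse_headers_sketch(fasta)
genes

# Clean up the temporary file
unlink(fasta)
```

Sequence lines are ignored entirely, since only the header fields carry gene names.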
Save a lookup table to a CSV file
Description
save_lookup() saves a data frame as a CSV file (without row names) in the
specified directory.
Usage
save_lookup(df, savedir, name)
Arguments
df |
A data frame containing the lookup table data. |
savedir |
A string, the path to the save directory. |
name |
A string, the file name (should end in .csv). |
Value
Nothing
Examples
# Create a temp save directory and load an example
save_dir <- file.path(tempdir(), "TCRconvertR_tmp")
dir.create(save_dir, showWarnings = FALSE, recursive = TRUE)
dat <- read.csv(get_example_path("fasta_dir/lookup.csv"))
save_lookup(dat, save_dir, "newlookup.csv")
# Clean up temporary folder
unlink(save_dir, recursive = TRUE)
Determine input columns to use
Description
which_frm_cols() determines the columns that are expected to hold gene
name information in the input file based on the input format (frm). It
returns a vector of those column names.
Usage
which_frm_cols(df, frm, frm_cols = NULL, verbose = TRUE)
Arguments
df |
Dataframe containing TCR gene names. |
frm |
A string, the input format of TCR data. Must be one of
|
frm_cols |
A character vector, the custom column names to use. |
verbose |
A boolean, whether to show messages. Optional; defaults to TRUE. |
Value
A character vector, column names to use.
Examples
tcr_file <- get_example_path("tenx.csv")
df <- read.csv(tcr_file)
which_frm_cols(df, "tenx")