Version: | 0.6.0 |
Title: | Import Data from EDC Software |
Description: | A convenient toolbox to import data exported from Electronic Data Capture (EDC) software 'TrialMaster'. |
License: | GPL-3 |
URL: | https://github.com/DanChaltiel/EDCimport, https://danchaltiel.github.io/EDCimport/ |
BugReports: | https://github.com/DanChaltiel/EDCimport/issues |
Depends: | R (≥ 3.6.0) |
Imports: | cli, dplyr, forcats, fs, glue, ggplot2, haven, lubridate, purrr, readr, rlang, scales, stats, stringr, tibble, tidyr, tidyselect, utils, lifecycle |
Suggests: | bslib, callr, crosstable, DT, gtools, htmlwidgets, janitor, knitr, openxlsx, patchwork, plotly, quarto, rmarkdown, rstudioapi, testthat (≥ 3.1.8), shiny, usethis, vdiffr, withr |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Config/testthat/edition: | 3 |
Config/testthat/parallel: | true |
Config/testthat/start-first: | local, trialmaster, utils |
VignetteBuilder: | quarto |
NeedsCompilation: | no |
Packaged: | 2025-06-24 12:14:37 UTC; Dan |
Author: | Dan Chaltiel |
Maintainer: | Dan Chaltiel <dan.chaltiel@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-06-24 12:30:02 UTC |
EDCimport: Import Data from EDC Software
Description
A convenient toolbox to import data exported from Electronic Data Capture (EDC) software 'TrialMaster'.
Author(s)
Maintainer: Dan Chaltiel dan.chaltiel@gmail.com (ORCID)
See Also
Useful links:
Report bugs at https://github.com/DanChaltiel/EDCimport/issues
Assert that a dataframe has one row per patient
Description
Check that there is no duplicate on the column holding patient ID in a pipeable style.
Mostly useful after joining two datasets.
Usage
assert_no_duplicate(df, by = NULL, id_col = get_subjid_cols())
Arguments
df |
a dataframe |
by |
(optional) grouping columns |
id_col |
the name of the columns holding patient ID |
Value
the df
dataset, unchanged
Examples
## Not run:
#without duplicate => no error, continue the pipeline
tibble(subjid=c(1:10)) %>% assert_no_duplicate() %>% nrow()
#with duplicate => throws an error
tibble(subjid=c(1:10, 1:2)) %>% assert_no_duplicate() %>% nrow()
#By groups
df = tibble(subjid=rep(1:10, 4), visit=rep(c("V1", "V2"), 2, each=10),
group=rep(c("A", "B"), each=20))
df %>% assert_no_duplicate() #error
df %>% assert_no_duplicate(by=c(visit, group)) #no error
## End(Not run)
Clean up the names of all datasets
Description
Clean the names of all the datasets in the database. By default, it converts names to lowercase letters, numbers, and underscores only.
Usage
edc_clean_names(database, clean_fun = NULL)
Arguments
database |
an edc_database object, from |
clean_fun |
a cleaning function to be applied to column names. |
Value
an edc_database object
Examples
#db = read_trialmaster("filename.zip", pw="xx")
db = edc_example() %>%
edc_clean_names()
names(db$enrol)
Show the current CRF status distribution
Description
Generate a barplot showing the distribution of CRF status (Complete, Incomplete, ...) for each dataset of the database.
Usage
edc_crf_plot(
crfstat_col = "CRFSTAT",
...,
details = FALSE,
pal = edc_pal_crf(),
reverse = FALSE,
x_label = "{dataset}",
treat_as_worst = NULL,
datasets = get_datasets(),
lookup = edc_lookup()
)
edc_pal_crf()
Arguments
crfstat_col |
the column name of the CRF status |
... |
unused |
details |
whether to show all the CRF status levels. When |
pal |
the palette, defaulting to the helper |
reverse |
whether to reverse the CRF status level order. |
x_label |
a glue pattern determining the tick label in the x axis. Available variables are the ones of |
treat_as_worst |
a regex for levels that should be treated as worst in the ordering. |
datasets , lookup |
internal |
Value
a ggplot
Source
ggsci:::ggsci_db$lancet[["lanonc"]] %>% dput()
Examples
## Not run:
#import a TM database and use load_database(), then:
edc_crf_plot() + ggtitle(date_extraction)
edc_crf_plot(reverse=TRUE)
edc_crf_plot(details=TRUE, treat_as_worst="No Data")
edc_crf_plot(x_label="{crfname} (N={n_id}, n={nrow})")
p = edc_crf_plot(details=TRUE)
p$data$crfstat %>% unique()
#> [1] "Incomplete" "No Data Locked" "No Data" "Signed"
#> [5] "Partial Monitored" "Monitored" "Complete Locked" "Complete"
## End(Not run)
Standardized warning system
Description
When checking your data, filter your dataset to get only problematic rows.
Then, use either:
-
edc_data_warn()
to generate a standardized warning that can be forwarded to the datamanager. -
edc_data_stop()
to abort the script if the problem is too serious.
Each time edc_data_warn is used, the warning is saved internally so that a summary of all your warnings can be retrieved using edc_data_warnings.
The result can be saved into an Excel file using save_edc_data_warnings()
.
Usage
edc_data_warn(
df,
message,
...,
issue_n = "xx",
max_subjid = 5,
csv_path = FALSE,
envir = parent.frame(),
col_subjid = get_subjid_cols()
)
edc_data_stop(df, message, ..., issue_n, max_subjid, csv_path, envir, col_subjid)
edc_data_warnings()
Arguments
df |
the filtered dataframe |
message |
the message. Can use cli formats. |
... |
unused |
issue_n |
identifying row number |
max_subjid |
max number of subject ID to show in the message |
csv_path |
a path to save |
envir |
the environment to evaluate |
col_subjid |
column name for subject ID. Set to |
Value
df
invisibly
Examples
library(dplyr)
db = edc_example()
load_database(db)
enrol %>%
filter(age>70) %>%
edc_data_warn("Age should not be >70", issue_n=1)
enrol %>%
filter(age<25) %>%
edc_data_warn("Age should not be <25", issue_n=2)
data1 %>%
filter(n()>1, .by=subjid) %>%
edc_data_warn("There are duplicated patients in `data1` ({nrow(.data)} rows)", issue_n=3)
enrol %>%
filter(age<25) %>%
edc_data_warn("Age should not be <25", issue_n=NULL)
edc_data_warnings()
## Not run:
enrol %>%
filter(age<25) %>%
edc_data_warn("Age should not be <25", csv_path="check/check_age_25.csv")
enrol %>%
filter(age<25) %>%
edc_data_stop("Age should *never* be <25")
## End(Not run)
EDCimport Database
Description
This class of object represents a database, as the result of an EDCimport reading function.
It has its own print()
method.
Functions returning edc_database
objects
As per now, reading functions are: read_trialmaster()
, read_all_sas()
, read_all_xpt()
, and read_all_csv()
.
Structure
While it is not usually useful to query them, an edc_database
object is a named list containing:
all the datasets from the source files
-
datetime_extraction
anddate_extraction
the inferred date of data extraction -
.lookup
a temporary copy of the lookup table
See Also
Save the database as an Excel file
Description
Because RStudio is not very good at showing data, it can be more convenient to browse the
database using MS Excel. This function turns the whole TM export (or any named list of datasets)
into an Excel workbook, with one tab for each dataset.
Use edc_db_to_excel()
to create the file and edc_browse_excel()
to open it.
Usage
edc_db_to_excel(
filename = tempfile(fileext = ".xlsx"),
...,
datasets = get_datasets(),
overwrite = FALSE,
open = FALSE
)
edc_browse_excel()
Arguments
filename |
the path to the Excel output file. Default to a temporary file. Use the special value |
... |
unused |
datasets |
a named list of dataframes. Default to the TM export. |
overwrite |
whether to overwrite any existing file. Default to |
open |
whether to open the Excel file afterward. Default to |
Value
nothing
Examples
## Not run:
db = edc_example()
load_database(db)
edc_db_to_excel() #default arguments are usually OK
edc_db_to_excel(filename=TRUE)
## End(Not run)
Example database
Description
A list of tables that simulates the extraction of a clinical database. Used in EDCimport
examples and tests.
Usage
edc_example(N = 50, seed = 42, outdated = FALSE)
Arguments
N |
the number of patients |
seed |
the random seed |
outdated |
whether to simulate times after the data extraction date |
Value
A list of tables of class edc_database
.
Search the whole database
Description
Find a keyword in columns or values, in all the datasets of the database.
Usage
edc_find_value(
keyword,
ignore_case = TRUE,
data = get_datasets(),
lookup = edc_lookup()
)
edc_find_column(keyword, ignore_case = TRUE, lookup = edc_lookup())
Arguments
keyword |
The keyword to search for. Regular expressions are only supported in |
ignore_case |
Logical. If |
data |
A list of datasets. |
lookup |
A lookup table. |
Value
a tibble
Examples
db = edc_example()
load_database(db)
edc_find_value("respi")
edc_find_value(2010)
edc_find_column("ad")
edc_find_column("date")
#with regex
edc_find_column("\\d")
edc_find_column("\\(") #you need to escape special characters
Shows how many code you wrote
Description
Shows how many code you wrote
Usage
edc_inform_code(main = "main.R", Rdir = "R/")
Arguments
main |
the main R file, which sources the other ones |
Rdir |
the R directory, where sourced R files are located |
Value
Nothing
Join within the EDCimport framework
Description
Perform a join with default by
to the Subject ID and default suffix to the
name of the y
dataset. See [dplyr::mutate-joins]
for the description of the
join logic.
Usage
edc_left_join(
x,
y,
by = NULL,
suffix = NULL,
cols = everything(),
remove_dups = FALSE
)
Arguments
x , y |
Data frames to join |
by |
The key to join on, as character. Defaults to |
suffix |
The disambiguation suffix. Defaults to the actual name of the |
cols |
<tidy-select> The columns to select in |
remove_dups |
Whether to remove columns in |
Value
a dataframe
Examples
db = edc_example()
load_database(db)
data1$common = data2$common = "Common"
x = enrol %>%
edc_left_join(data2) %>%
edc_right_join(data1)
#crfname get a suffix, common
names(x)
Retrieve the lookup table from options
Description
Retrieve the lookup table from options
Usage
edc_lookup(..., check = TRUE)
Arguments
... |
passed on to |
check |
whether to check for internal consistency |
Value
the lookup dataframe summarizing the database import
Examples
db = edc_example()
load_database(db)
edc_lookup()
edc_lookup(dataset)
Set global options for EDCimport
Description
Use this function to manage your EDCimport
parameters globally while taking advantage of autocompletion.
Use edc_peek_options()
to see which option is currently set and edc_reset_options()
to set all options back to default.
Usage
edc_options(
...,
trialmaster_pw,
path_7zip,
edc_lookup,
edc_subjid_ref,
edc_plotly,
edc_fct_yesno,
edc_cols_subjid,
edc_cols_meta,
edc_cols_id,
edc_cols_crfname,
edc_meta_cols_pct,
edc_warn_max_subjid,
edc_read_verbose,
edc_correction_verbose,
edc_get_key_cols_verbose,
edc_lookup_overwrite_warn,
.local = FALSE
)
Arguments
... |
unused |
trialmaster_pw |
the password of the trialmaster zip archive. For instance, you can use |
path_7zip |
the path to the 7zip executable. Default to |
edc_lookup |
(Internal) a reference to the lookup table (usually |
edc_subjid_ref |
used in edc_warn_patient_diffs the vector of the reference subject IDs. You should usually write |
edc_plotly |
used in edc_swimmerplot whether to use plotly to visualize the plot. |
edc_fct_yesno |
used in fct_yesno list of values to be considered as Yes/No values. Defaults to |
edc_cols_subjid , edc_cols_meta |
the name of the columns holding the subject id (default to |
edc_cols_id , edc_cols_crfname |
deprecated |
edc_meta_cols_pct |
The minimal proportion of datasets a column has to reach to be considered "meta" |
edc_warn_max_subjid |
The max number of subject IDs to show in edc_data_warn |
edc_read_verbose , edc_correction_verbose , edc_get_key_cols_verbose |
the verbosity of the output of functions read_trialmaster and read_all_xpt, and manual_correction. For example, set |
edc_lookup_overwrite_warn |
default to TRUE. Whether there should be warning when overwriting |
.local |
if TRUE, the effect will only apply to the local frame (internally using |
Value
Nothing, called for its side effects
Patient gridplot
Description
Draw a gridplot giving, for each patient and each dataset, whether the patient is present in the dataset. Data are drawn from get_datasets.
Usage
edc_patient_gridplot(
sort_rows = TRUE,
sort_cols = TRUE,
gradient = FALSE,
axes_flip = FALSE,
show_grid = TRUE,
preprocess = NULL,
palette = c(Yes = "#00468BFF", No = "#ED0000FF"),
datasets = get_datasets(),
lookup = edc_lookup()
)
Arguments
sort_rows |
whether to sort patients from "present in most datasets" to "present in least datasets" |
sort_cols |
whether to sort datasets from "containing the most patients" to "containing the least patients" |
gradient |
whether to add a color gradient for repeating measures |
axes_flip |
whether to flip the axes, so that patients are on the Y axis and datasets on the X axis |
show_grid |
whether to show the grid |
preprocess |
a function to preprocess the patient ID, e.g. |
palette |
the colors to use |
datasets , lookup |
internal |
Value
a ggplot
object
Examples
## Not run:
tm = read_trialmaster("path/to/archive.zip")
load_database(db)
edc_patient_gridplot(sort_rows=FALSE, sort_cols=FALSE)
edc_patient_gridplot(axes_flip=TRUE, show_grid=TRUE,
preprocess=~str_remove(.x, "\\D*")) #remove all non-digits
## End(Not run)
See which EDCimport
option is currently set
Description
See which EDCimport
option is currently set
Usage
edc_peek_options(keep_null = FALSE)
Arguments
keep_null |
set to TRUE to get a list |
Value
A named list of EDCimport options
Plot the populations
Description
In a RCT, you usually have several populations of analysis, and this function allow to show which patient is in which population graphically.
Usage
edc_population_plot(x, id_per_row = 50, ref = "first")
Arguments
x |
a named list of subject ID, as numeric or factor. |
id_per_row |
number of patients per rows. |
ref |
the whole population. Default to the first member of |
Value
a ggplot
Examples
#in real word code, use filter and pull to get these vectors
pop_total = c(1:180) %>% setdiff(55) #screen failure, no patient 55
pop_itt = pop_total %>% setdiff(10) #patient 10 has had the wrong treatment
pop_safety = pop_total %>% setdiff(c(40,160)) #patients 40 and 160 didn't receive any treatment
pop_m_itt = pop_total %>% setdiff(c(40,160,80)) #patient 80 had a wrong inclusion criterion
pop_evaluable = pop_total %>% setdiff(c(40,160,101,147,186)) #patients with no recist evaluation
l = list(
"Total population"=pop_total,
"ITT population"=pop_itt,
"Safety population"=pop_safety,
"mITT population"=pop_m_itt,
"Evaluable population"=pop_evaluable
)
edc_population_plot(l)
edc_population_plot(l[-1], ref=pop_total)
edc_population_plot(l, ref=1:200)
edc_population_plot(l, id_per_row=60)
Reset all EDCimport
options
Description
Reset all EDCimport
options
Usage
edc_reset_options(
except = c("edc_lookup", "trialmaster_pw", "path_7zip"),
quiet = FALSE
)
Arguments
except |
options that are not reset by default |
quiet |
set to |
Value
Nothing, called for its side effects
Split mixed datasets
Description
Split mixed tables, i.e. tables that hold both long data (N values per patient) and short data (one value per patient, duplicated on N lines), into one long table and one short table.
Usage
edc_split_mixed(
database,
datasets = everything(),
...,
ignore_cols = NULL,
verbose = FALSE
)
Arguments
database |
an edc_database object, from |
datasets |
<tidy-select> datasets to split in the database |
... |
not used, ensure arguments are named |
ignore_cols |
columns to ignore in long tables. Default to |
verbose |
whether to print informations about the process. |
Value
an edc_database object
Examples
#db = read_trialmaster("filename.zip", pw="xx")
db = edc_example() %>%
edc_split_mixed(c(ae, starts_with("long")),
ignore_cols="crfstat")
names(db)
edc_lookup()
db$ae #`aesoc`, `aegr`, and `sae` are long, but `n_ae` is short
db$ae_short
db$ae_long
Swimmer plot of all dates columns
Description
Join all tables on id
with only date columns to build a ggplot (or a
plotly if plotly=TRUE
) showing all dates for each subject.
This allows outliers to be easily identified.
Usage
edc_swimmerplot(
...,
group = NULL,
origin = NULL,
include = NULL,
exclude = NULL,
id_subset = "all",
id_sort = FALSE,
id_cols = get_subjid_cols(),
time_unit = c("days", "weeks", "months", "years"),
aes_color = c("variable", "label"),
plotly = getOption("edc_plotly", FALSE),
id = "deprecated",
id_lim = "deprecated",
.lookup = "deprecated"
)
Arguments
... |
not used |
group |
a grouping variable, given as "dataset$column". |
origin |
a variable to consider as time 0, given as "dataset$column". |
include , exclude |
a character vector of variables to exclude/include, in the form |
id_subset |
the subjects to include in the plot. |
id_sort |
whether to sort subjects by date (or time). |
id_cols |
the subject identifiers columns. Identifiers be coerced as numeric if possible. See get_subjid_cols if needed. |
time_unit |
if |
aes_color |
either |
plotly |
whether to use |
id |
deprecated |
id_lim |
deprecated |
.lookup |
deprecated |
Value
either a plotly
or a ggplot
Examples
#db = read_trialmaster("filename.zip", pw="xx")
db = edc_example()
load_database(db)
edc_swimmerplot(id_lim=c(5,45))
edc_swimmerplot(origin="enrol$enrol_date", time_unit="months",
include=c("data1", "data3"),
exclude=c("DATA1$DATE2", "data3$date\\d\\d"),
id_sort=TRUE)
edc_swimmerplot(group="enrol$arm", id_subset=1:10, aes_color="label")
## Not run:
p = edc_swimmerplot(plotly=TRUE)
save_plotly(p, "edc_swimmerplot.html")
## End(Not run)
Harmonize the subject ID of the database
Description
Turns the subject ID columns of all datasets into a factor containing levels for all
the subjects of the database. Avoid problems when joining tables, and some checks can
be performed on the levels. See vignette("postprocessing")
for a real-life case.
Usage
edc_unify_subjid(
database,
preprocess = NULL,
mode = c("factor", "numeric"),
col_subjid = NULL
)
Arguments
database |
an edc_database object, from |
preprocess |
an optional function to modify the subject ID column (at the character level). Default behavior is only to remove trailing zeros if numeric. |
mode |
the output type of the subject ID columns |
col_subjid |
names of the subject ID columns (as character) |
Value
database, with subject id modified
Examples
db = edc_example()
db$enrol$subjid %>% head() #double vector
db2 = edc_unify_subjid(db)
db2$enrol$subjid %>% head() #factor with 50 levels
db3 = edc_unify_subjid(db, preprocess=function(x) paste0("#", x))
db3$enrol$subjid %>% head()
#use numeric mode to get a numeric output
db4 = edc_unify_subjid(db, preprocess=function(x) as.numeric(x)+1, mode="numeric")
db4$enrol$subjid %>% head()
Shiny data explorer
Description
Run a Shiny application that allows to browse the datasets.
Usage
edc_viewer(data = NULL, background = TRUE, port = 1209)
Arguments
data |
A list of dataframes to view. If |
background |
Whether the app should run in a background process. |
port |
The TCP port that the application should listen on. |
Warn if extraction is too old
Description
Warn if extraction is too old
Usage
edc_warn_extraction_date(max_days = 30)
Arguments
max_days |
the max acceptable age of the data |
Value
nothing
Examples
db = edc_example()
load_database(db)
edc_warn_extraction_date()
Check the validity of the subject ID column
Description
Compare a subject ID vector to the study's reference subject ID (usually something like enrolres$subjid
), and warn if any patient is missing or extra.
check_subjid()
is the old, deprecated name.
Usage
edc_warn_patient_diffs(
x,
ref = getOption("edc_subjid_ref"),
issue_n = "xx",
data_name = NULL,
col_subjid = get_subjid_cols()
)
Arguments
x |
the subject ID vector to check, or a dataframe which ID column will be guessed |
ref |
the reference for subject ID. Should usually be set through |
issue_n |
identifying row number |
data_name |
the name of the data (for the warning message) |
col_subjid |
name of the subject ID column if |
Value
nothing, called for errors/warnings
Examples
db = edc_example()
load_database(db)
options(edc_subjid_ref=enrol$subjid)
#usually, you set something like:
#options(edc_subjid_ref=enrolres$subjid)
edc_warn_patient_diffs(data1)
data1 %>% dplyr::filter(subjid>1) %>% edc_warn_patient_diffs(issue_n=NULL)
edc_warn_patient_diffs(c(data1$subjid, 99, 999))
Format factor levels as Yes/No
Description
Format factor levels as arbitrary values of Yes/No (with Yes always first) while leaving untouched all vectors that contain other information.
Usage
fct_yesno(
x,
input = list(yes = c("Yes", "Oui"), no = c("No", "Non"), na = c("NA", "")),
output = c("Yes", "No"),
strict = FALSE,
mutate_character = TRUE,
fail = TRUE
)
Arguments
x |
a vector of any type/class. |
input |
list of values to be considered as "yes", "no", and |
output |
the output factor levels. |
strict |
whether to match the input strictly or use stringr::str_detect to find them. Can also be "ignore_case" to just ignore the case. |
mutate_character |
whether to turn characters into factor. |
fail |
whether to fail if some levels cannot be recoded to yes/no. |
Value
a factor, or x
untouched.
Examples
fct_yesno(c("No", "Yes")) #levels are in order
set.seed(42)
N=6
x = tibble(
a=sample(c("Yes", "No"), size=N, replace=TRUE),
b=sample(c("Oui", "Non"), size=N, replace=TRUE),
c=sample(0:1, size=N, replace=TRUE),
d=sample(c(TRUE, FALSE), size=N, replace=TRUE),
e=sample(c("1-Yes", "0-No", "2-NA"), size=N, replace=TRUE),
y=sample(c("aaa", "bbb", "ccc"), size=N, replace=TRUE),
z=1:N,
)
x
#y and z are left untouched (or throw an error if fail=TRUE)
sapply(x, fct_yesno, fail=FALSE, simplify=FALSE)
# as "1-Yes" is not in `input`, x$e is untouched/fails if strict=TRUE
fct_yesno(x$e)
fct_yesno(x$e, strict=TRUE, fail=FALSE)
fct_yesno(x$e, output=c("Ja", "Nein"))
Get columns that are common to multiple datasets
Description
Attempt to list all columns in the database and group the ones that are
common to some datasets.
Useful to find keys to pivot or summarise data.
Usage
get_common_cols(lookup = edc_lookup(), min_datasets = 3)
## S3 method for class 'common_cols'
summary(object, ...)
Arguments
lookup |
the lookup table, default to |
min_datasets |
the minimal number of datasets to be considered |
object |
an object of class "common_cols" |
... |
unused |
Value
a tibble of class "common_cols"
Examples
db = edc_example()
load_database(db)
x = get_common_cols(min_datasets=1)
x
summary(x)
Retrieve the datasets as a list of data.frames
Description
Get the datasets from the lookup table as a list of data.frames.
Usage
get_datasets(lookup = edc_lookup(), envir = parent.frame())
Arguments
lookup |
the lookup table |
envir |
(internal use) |
Value
a list of all datasets
Get key column names
Description
Retrieve names of patient ID and CRF name from the actual names of the datasets, without respect of the case. Default values should be set through options.
Usage
get_subjid_cols(lookup = edc_lookup())
Arguments
lookup |
the lookup table |
Value
a character vector
options
Use edc_options()
to set default values:
-
edc_cols_subjid
defaults toc("SUBJID", "PTNO")
-
edc_cols_crfname
defaults toc("FORMDESC", "CRFNAME")
Examples
#get_subjid_cols()
#get_crfname_cols()
Get a table with the latest date for each patient
Description
This function search for date columns in every tables and returns the latest date for each patient with the variable it comes from. Useful in survival analysis to get the right censoring time.
Usage
lastnews_table(
except = NULL,
with_ties = FALSE,
show_delta = FALSE,
numeric_id = TRUE,
prefer = NULL,
regex = FALSE,
warn_if_future = TRUE
)
Arguments
except |
the datasets/columns that should not be searched. Example: a scheduled visit for which the patient may have died before attending should not be considered. |
with_ties |
in case of tie, whether to return the first |
show_delta |
whether to compute the difference between the last |
numeric_id |
set to FALSE if the patient ID column is not numeric |
prefer |
preferred origins in the event of a tie. Usually the followup table. |
regex |
whether to consider |
warn_if_future |
whether to show a warning about dates that are after the extraction date. Can also be a csv file path to save the warning as csv (see |
Value
a dataframe
Examples
db = edc_example()
load_database(db)
lastnews_table()
lastnews_table(except="data3")
lastnews_table(except="data3$date9")
lastnews_table(prefer="date10", show_delta=TRUE)
lastnews_table() %>%
dplyr::count(origin = glue::glue("{origin_data}${origin_col}"),
sort=TRUE)
csv_file = tempfile(fileext=".csv")
lastnews_table(prefer="date9", warn_if_future=csv_file)
Load a list in an environment
Description
Load a list in an environment
Usage
load_database(db, env = parent.frame(), remove = TRUE)
Arguments
db |
an edc_database object (to be fair, any list would do) |
env |
the environment onto which the list should be loaded |
remove |
if |
Value
nothing, called for its side-effect
Examples
db = edc_example()
load_database(db, remove=FALSE)
print(db)
print(lengths(db))
Manual correction
Description
When finding wrong or unexpected values in an exported dataset, it can be useful to temporarily correct them by hard-coding a value. However, this manual correction should be undone as soon as the central database is updated with the correction.
-
manual_correction()
applies a correction in a specific dataset column location and throws an error if the correction is already in place. This check applies only once per R session so you can source your script without errors. -
reset_manual_correction()
resets all checks. For instance, it is called byread_trialmaster()
.
Usage
manual_correction(
data,
col,
rows,
wrong,
correct,
verbose = getOption("edc_correction_verbose", TRUE)
)
reset_manual_correction()
Arguments
data , col , rows |
the rows of a column of a dataframe where the error lies |
wrong |
the actual wrong value |
correct |
the temporary correction value |
verbose |
whether to print informations (once) |
Value
Nothing, used for side effects
Examples
library(dplyr)
x = iris %>% mutate(id=row_number(), .before=1) %>% as_tibble()
x$Sepal.Length[c(1,3,5)]
#1st correction is silent
manual_correction(x, Sepal.Length, rows=c(1,3,5),
wrong=c(5.1, 4.7, 5.0), correct=c(5, 4, 3))
x$Sepal.Length[c(1,3,5)]
#further correction is silent
manual_correction(x, Sepal.Length, rows=c(1,3,5),
wrong=c(5.1, 4.7, 5.0), correct=c(5, 4, 3))
#if the database is corrected, an error is thrown
## Not run:
reset_manual_correction()
x$Sepal.Length[c(1,3,5)] = c(5, 4, 3) #mimics db correction
manual_correction(x, Sepal.Length, rows=c(1,3,5),
wrong=c(5.1, 4.7, 5.0), correct=c(5, 4, 3))
## End(Not run)
Read all .csv
files in a directory
Description
Read all .csv
files in a directory, with labels if specified.
Usage
read_all_csv(
path,
...,
labels_from = NULL,
format_file = NULL,
subdirectories = FALSE,
read_fun = "guess",
datetime_extraction = "guess",
verbose = getOption("edc_read_verbose", 1),
clean_names_fun = NULL
)
Arguments
path |
[ |
... |
unused |
labels_from |
[ |
format_file |
[ |
subdirectories |
[ |
read_fun |
[ |
datetime_extraction |
[ |
verbose |
[ |
clean_names_fun |
|
Value
a list containing one dataframe for each .csv
file in the folder, the extraction date (datetime_extraction
), and a summary of all imported tables (.lookup
).
Labels file
labels_from
should contain the information about column labels. It should be a data file (.csv
) containing 2 columns: one for the column name and the other for its associated label. Use options(edc_col_name="xxx", edc_col_label="xxx")
to specify the names of the columns.
Format file
format_file
should contain the information about SAS formats. It can be either:
a
procformat.sas
file, containing the whole PROC FORMATor a data file (
.csv
or.sas7bdat
) containing 3 columns:-
FMTNAME
the SAS format name (repeated) -
START
the variable level -
LABEL
the label associated to the level
You can get this datafile from SAS using
PROC FORMAT
with optionCNTLOUT
. Otherwise, you can useoptions(edc_var_format_name="xxx", edc_var_level="xxx", edc_var_label="xxx")
to specify different column names.-
See Also
Other EDCimport reading functions:
read_all_sas()
,
read_all_xpt()
,
read_trialmaster()
Examples
# Create a directory with multiple csv files and a label lookup.
path = paste0(tempdir(), "/read_all_csv")
dir.create(paste0(path, "/subdir"), recursive=TRUE)
write.csv(iris, paste0(path, "/iris.csv"))
write.csv(mtcars, paste0(path, "/mtcars.csv"))
write.csv(mtcars, paste0(path, "/subdir/mtcars.csv"))
write.csv(airquality, paste0(path, "/airquality.csv"))
labs = c(iris, mtcars, airquality) %>% names()
write.csv(data.frame(name=labs, label=toupper(labs)), paste0(path, "/labels.csv"))
db = read_all_csv(path, labels_from="labels.csv", subdirectories=TRUE) %>%
set_project_name("My great project")
db
edc_lookup()
Read all .sas7bdat
files in a directory
Description
Read all .sas7bdat
files in a directory. Formats (factors levels) can be applied from a procformat.sas
SAS file, or from a format dictionary. See the "Format file" section below. Column labels are read directly from the .sas7bdat
files.
Usage
read_all_sas(
path,
...,
format_file = "procformat.sas",
subdirectories = FALSE,
datetime_extraction = "guess",
verbose = getOption("edc_read_verbose", 1),
clean_names_fun = NULL
)
Arguments
path |
[ |
... |
unused |
format_file |
[ |
subdirectories |
[ |
datetime_extraction |
[ |
verbose |
[ |
clean_names_fun |
|
Value
a list containing one dataframe for each .xpt
file in the folder, the extraction date (datetime_extraction
), and a summary of all imported tables (.lookup
).
Format file
format_file
should contain the information about SAS formats. It can be either:
a
procformat.sas
file, containing the whole PROC FORMATor a data file (
.csv
or.sas7bdat
) containing 3 columns:-
FMTNAME
the SAS format name (repeated) -
START
the variable level -
LABEL
the label associated to the level
You can get this datafile from SAS using
PROC FORMAT
with optionCNTLOUT
. Otherwise, you can useoptions(edc_var_format_name="xxx", edc_var_level="xxx", edc_var_label="xxx")
to specify different column names.-
See Also
Other EDCimport reading functions:
read_all_csv()
,
read_all_xpt()
,
read_trialmaster()
Examples
# Create a directory with multiple sas files.
path = paste0(tempdir(), "/read_all_sas")
dir.create(paste0(path, "/subdir"), recursive=TRUE)
haven::write_sas(attenu, paste0(path, "/attenu.sas7bdat"))
haven::write_sas(mtcars, paste0(path, "/mtcars.sas7bdat"))
haven::write_sas(mtcars, paste0(path, "/subdir/mtcars.sas7bdat"))
haven::write_sas(esoph, paste0(path, "/esoph.sas7bdat"))
db = read_all_sas(path, format_file=NULL, subdirectories=TRUE) %>%
set_project_name("My great project")
db
edc_lookup()
Read all .xpt
files in a directory
Description
Read all .xpt
files in a directory (unzipped TrialMaster archive).
If 7zip
is installed, you should probably rather use read_trialmaster()
instead.
Formats (factors levels) can be applied from a procformat.sas
SAS file, or from a format dictionary. See the "Format file" section below. Column labels are read directly from the .xpt
files.
Usage
read_all_xpt(
path,
...,
format_file = "procformat.sas",
datetime_extraction = "guess",
subdirectories = FALSE,
verbose = getOption("edc_read_verbose", 1),
clean_names_fun = NULL,
directory = "deprecated",
key_columns = "deprecated"
)
Arguments
path |
[ |
... |
unused |
format_file |
[ |
datetime_extraction |
[ |
subdirectories |
[ |
verbose |
[ |
clean_names_fun |
|
directory |
deprecated in favor for |
key_columns |
deprecated |
Value
a list containing one dataframe for each .xpt
file in the folder, the extraction date (datetime_extraction
), and a summary of all imported tables (.lookup
).
Format file
format_file
should contain the information about SAS formats. It can be either:
a
procformat.sas
file, containing the whole PROC FORMATor a data file (
.csv
or.sas7bdat
) containing 3 columns:-
FMTNAME
the SAS format name (repeated) -
START
the variable level -
LABEL
the label associated to the level
You can get this datafile from SAS using
PROC FORMAT
with optionCNTLOUT
. Otherwise, you can useoptions(edc_var_format_name="xxx", edc_var_level="xxx", edc_var_label="xxx")
to specify different column names.-
See Also
Other EDCimport reading functions:
read_all_csv()
,
read_all_sas()
,
read_trialmaster()
Examples
# Create a directory with multiple .xpt files.
path = paste0(tempdir(), "/read_all_xpt")
dir.create(paste0(path, "/subdir"), recursive=TRUE)
haven::write_xpt(attenu, paste0(path, "/attenu.xpt"))
haven::write_xpt(mtcars, paste0(path, "/mtcars.xpt"))
haven::write_xpt(mtcars, paste0(path, "/subdir/mtcars.xpt"))
haven::write_xpt(esoph, paste0(path, "/esoph.xpt"))
db = read_all_xpt(path, format_file=NULL, subdirectories=TRUE) %>%
set_project_name("My great project")
db
edc_lookup()
Read the .zip
archive of a TrialMaster export
Description
Import the .zip
archive of a TrialMaster trial export as a list of dataframes. The archive filename should be leaved untouched as it contains the project name and the date of extraction.
Generate a .rds
cache file for future reads.
If 7zip
is not installed or available, use read_all_xpt()
instead.
The TM export should be of type SAS Xport
, with the checkbox
"Include Codelists" ticked.
Usage
read_trialmaster(
archive,
...,
use_cache = "write",
clean_names_fun = NULL,
subdirectories = FALSE,
pw = getOption("trialmaster_pw"),
verbose = getOption("edc_read_verbose", 1),
key_columns = "deprecated"
)
Arguments
archive |
[ |
... |
unused |
use_cache |
[ |
clean_names_fun |
|
subdirectories |
[ |
pw |
[ |
verbose |
[ |
key_columns |
deprecated |
Value
a list containing one dataframe for each .xpt
file in the folder, the extraction date (datetime_extraction
), and a summary of all imported tables (.lookup
).
See Also
Other EDCimport reading functions:
read_all_csv()
,
read_all_sas()
,
read_all_xpt()
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
Save EDCimport warning to Excel
Description
Each time edc_data_warn is used, the warning is saved internally so that a summary can be retrieved using edc_data_warnings. This summary can then be saved into a .xlsx
file using save_edc_data_warnings()
.
Usage
save_edc_data_warnings(
edc_warnings = edc_data_warnings(),
path = "edc_data_warnings.xlsx",
overwrite = TRUE,
open = FALSE
)
Arguments
edc_warnings |
the result of edc_data_warnings |
path |
a |
overwrite |
If |
open |
If |
Value
a logical(1), whether the file could be written, invisibly
Save a plotly to an HTML file
Description
Save a plotly to an HTML file
Usage
save_plotly(p, file, ...)
Arguments
p |
a plot object ( |
file |
a file path to save the HTML file |
... |
passed on to htmlwidgets::saveWidget |
Value
nothing, used for side effect
Examples
## Not run:
db = edc_example()
load_database(db)
p = edc_swimmerplot(id_lim=c(5,45))
save_plotly(p, "graph/swimplots/edc_swimmerplot.html", title="My Swimmerplot")
## End(Not run)
Save sessionInfo()
output
Description
Save sessionInfo()
output into a text file.
Usage
save_sessioninfo(path = "check/session_info.txt", with_date = TRUE)
Arguments
path |
target path to write the file |
with_date |
whether to insert the date before the file extension |
Value
nothing
Examples
## Not run:
save_sessioninfo()
## End(Not run)
Search for newer data
Description
Search in some folders if a TrialMaster database more recent than the current extraction is present. By default, it will search the "data" folder and the OS usual "Downloads" folder. If a newer database is found, user will be asked if they want to move it to the "data" folder.
Usage
search_for_newer_data(
archive,
...,
source = path_home("Downloads"),
target = "data",
ask = TRUE,
advice = TRUE
)
Arguments
archive |
TM archive path, giving the project name and date |
... |
unused |
source |
the path vector to be searched, default to both "data" and the usual "Downloads" folder |
target |
the path where files should be copied |
ask |
whether to ask the user to move the file to "data" |
advice |
whether to advice how to move it instead, if |
Value
the path to the newer file, invisibly.
Examples
## Not run:
archive = "data/MYPROJECT_ExportTemplate_xxx_SAS_XPORT_2024_06_01_12_00.zip"
#tm = read_trialmaster(archive)
search_for_newer_data(archive)
## End(Not run)
Select only distinct columns
Description
Select all columns that has only one level for a given grouping scope. Useful when dealing with mixed datasets containing both long data and repeated short data.
Usage
select_distinct(df, .by)
Arguments
df |
a dataframe |
.by |
optional grouping columns |
Value
df
with less columns
Examples
db = edc_example()
db$ae %>% colnames()
#`crfname` has one level for the whole dataset
db$ae %>% select_distinct() %>% colnames()
#`n_ae` has one level per patient
db$ae %>% select_distinct(.by=subjid) %>% colnames()
Set the project name
Description
Set or override the project name
Usage
set_project_name(db, name)
Arguments
db |
the edc_database |
name |
the project name |
Value
nothing
Examples
db = edc_example() %>%
set_project_name("My great project")
edc_lookup()
Identify if a dataframe has a long or a wide format
Description
A dataset is either in the wide format or in the long format. This function identifies the format of a dataframe with respect to a subject ID. If a dataframe has some wide and long columns, it is considered "mixed".
Usage
table_format(
df,
id = get_subjid_cols(),
...,
ignore_cols = get_meta_cols(0.95),
na_rm = FALSE,
warn = TRUE
)
Arguments
df |
a dataframe |
id |
the identifying subject ID |
... |
not used |
ignore_cols |
columns to ignore. |
na_rm |
whether to consider missing values |
warn |
whether to warn if ID is not found |
Value
a string value in c("wide", "long", "mixed)
See Also
https://tidyr.tidyverse.org/articles/pivot.html
Examples
db = edc_example()
sapply(db, table_format, warn=FALSE)
Unify a vector
Description
Turn a vector of length N to a vector of length 1 after checking that there
is only one unique value.
Useful to safely flatten a duplicated table.
Preserves the label
attribute if set.
Usage
unify(x, collapse_chr = FALSE, warn = TRUE)
Arguments
x |
a vector |
collapse_chr |
whether to collapse non-unique character values |
warn |
whether to warn if non-unique values were found |
Value
a vector of length 1
Examples
unify(c(1,1,1,1))
#unify(c(1,1,2,1)) #warning
library(dplyr)
set.seed(42)
x=tibble(id=rep(letters[1:5],10), value=rep(1:5,10),
value2=sample(letters[6:10], 50, replace=TRUE))
x %>% summarise(value=unify(value), .by=id) #safer than `value=value[1]`
x %>% summarise(value2=unify(value2, collapse_chr=TRUE, warn=FALSE), .by=id)
x$value[2]=1
x %>% summarise(value2=unify(value2), .by=id) #warning about that non-unique value