Getting started with the codyna package

This vignette demonstrates some basic usage of the codyna package. First, we load the package.

library(codyna)

We also load the engagement data available in the package (see ?engagement for further information).

data("engagement", package = "codyna")

Pattern Discovery

The codyna package provides an extensive set of features for discovering patterns in sequence data, such as n-grams, gapped patterns, or repeated sequences of the same state, using the function discover_patterns. The argument type selects the kind of pattern to look for. The argument len specifies the pattern lengths, and the argument gap specifies the gap sizes for gapped patterns.

discover_patterns(engagement, type = "ngram", len = 2:3)
#> # A tibble: 36 × 6
#>    pattern                          length count proportion contained_in support
#>    <chr>                             <int> <int>      <dbl>        <int>   <dbl>
#>  1 Active->Active                        2 10218     0.434           969   0.969
#>  2 Active->Active->Active                3  8386     0.372           931   0.931
#>  3 Disengaged->Disengaged                2  5186     0.220           811   0.811
#>  4 Disengaged->Disengaged->Disenga…      3  3925     0.174           706   0.706
#>  5 Average->Average                      2  2774     0.118           789   0.789
#>  6 Average->Active                       2  1545     0.0656          853   0.853
#>  7 Average->Average->Average             3  1439     0.0638          545   0.545
#>  8 Average->Active->Active               3  1265     0.0561          806   0.806
#>  9 Disengaged->Average                   2  1092     0.0464          709   0.709
#> 10 Active->Average                       2  1071     0.0455          695   0.695
#> # ℹ 26 more rows
discover_patterns(engagement, type = "gapped", gap = 1)
#> # A tibble: 9 × 6
#>   pattern                   length count proportion contained_in support
#>   <chr>                      <dbl> <int>      <dbl>        <int>   <dbl>
#> 1 Active->*->Active              3  8718     0.387           934   0.934
#> 2 Disengaged->*->Disengaged      3  4063     0.180           722   0.722
#> 3 Average->*->Active             3  2129     0.0944          850   0.85 
#> 4 Average->*->Average            3  1712     0.0759          611   0.611
#> 5 Active->*->Average             3  1534     0.0680          719   0.719
#> 6 Disengaged->*->Average         3  1412     0.0626          677   0.677
#> 7 Active->*->Disengaged          3  1126     0.0499          611   0.611
#> 8 Average->*->Disengaged         3  1021     0.0453          533   0.533
#> 9 Disengaged->*->Active          3   840     0.0372          529   0.529
discover_patterns(engagement, type = "repeated", len = 2:3)
#> # A tibble: 6 × 6
#>   pattern                           length count proportion contained_in support
#>   <chr>                              <int> <int>      <dbl>        <int>   <dbl>
#> 1 Active->Active                         2 10218      0.562          969   0.969
#> 2 Active->Active->Active                 3  8386      0.610          931   0.931
#> 3 Disengaged->Disengaged                 2  5186      0.285          811   0.811
#> 4 Disengaged->Disengaged->Disengag…      3  3925      0.285          706   0.706
#> 5 Average->Average                       2  2774      0.153          789   0.789
#> 6 Average->Average->Average              3  1439      0.105          545   0.545

The returned data frames show the length of the pattern (length), the number of times it occurred across all sequences (count), its proportion among patterns of the same length (proportion), the number of sequences that contained the pattern (contained_in), and the proportion of sequences that contained the pattern (support). The function discover_patterns can also be used to look for specific patterns, for example:

discover_patterns(engagement, pattern = "Active->*")
#> # A tibble: 3 × 6
#>   pattern            length count proportion contained_in support
#>   <chr>               <int> <int>      <dbl>        <int>   <dbl>
#> 1 Active->Active          2 10218     0.859           969   0.969
#> 2 Active->Average         2  1071     0.0900          695   0.695
#> 3 Active->Disengaged      2   605     0.0509          508   0.508

Here, the wildcard * matches any state, i.e., we are looking for patterns that start with the Active state and the following state can be any state.
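
To make the output columns and the wildcard more concrete, the following base R sketch recomputes count, proportion, contained_in and support for bigrams on a small made-up data set, and then filters them with a pattern containing *. The toy sequences, the helper bigrams and the regular expression translation are illustrative assumptions, not the internals of discover_patterns. The renormalisation of proportion over the matched patterns is inferred from the output above, where the proportions sum to one.

```r
# Toy data (hypothetical, not the engagement data): one sequence per row.
seqs <- rbind(
  c("Active", "Active", "Average", "Active"),
  c("Average", "Average", "Active", "Active"),
  c("Disengaged", "Active", "Active", "Active")
)

# All bigrams of one sequence, written in the "A->B" style used above.
bigrams <- function(x) paste(x[-length(x)], x[-1], sep = "->")

per_seq <- lapply(seq_len(nrow(seqs)), function(i) bigrams(seqs[i, ]))
all_bigrams <- unlist(per_seq)

count <- table(all_bigrams)
patterns <- names(count)

# Number of sequences in which each bigram occurs at least once.
contained_in <- vapply(
  patterns,
  function(p) sum(vapply(per_seq, function(b) p %in% b, logical(1))),
  integer(1)
)

result <- data.frame(
  pattern = patterns,
  count = as.integer(count),
  proportion = as.numeric(count) / length(all_bigrams),
  contained_in = contained_in,
  support = contained_in / nrow(seqs),
  row.names = NULL
)
result <- result[order(-result$count), ]

# Wildcard search: translate "*" into "any single state" (assumed semantics).
rx <- paste0("^", gsub("*", "[^>]+", "Active->*", fixed = TRUE), "$")
matched <- result[grepl(rx, result$pattern), ]
# Assumption: proportions are taken among the matched patterns only.
matched$proportion <- matched$count / sum(matched$count)
matched
```

In this toy example, support is 1 for Active->Active because every sequence contains that bigram at least once, even though its count differs between sequences.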

We can also compute various sequence indices with the function sequence_indices, which returns one row per sequence:

sequence_indices(engagement)
#> # A tibble: 1,000 × 23
#>    valid_n valid_proportion unique_states mean_spell_duration max_spell_duration
#>      <int>            <dbl>         <int>               <dbl>              <dbl>
#>  1      23                1             3                3.83                 11
#>  2      23                1             3                3.29                 11
#>  3      24                1             3                3.43                  8
#>  4      24                1             3                4                     9
#>  5      24                1             3                3.43                 12
#>  6      23                1             3                5.75                 13
#>  7      23                1             3                2.88                  7
#>  8      23                1             3                3.29                  8
#>  9      23                1             3                2.88                  7
#> 10      24                1             3                8                    20
#> # ℹ 990 more rows
#> # ℹ 18 more variables: longitudinal_entropy <dbl>, simpson_diversity <dbl>,
#> #   self_loop_tendency <dbl>, transition_rate <dbl>,
#> #   transition_complexity <dbl>, initial_state_persistence <dbl>,
#> #   initial_state_proportion <dbl>, initial_state_influence_decay <dbl>,
#> #   cyclic_feedback_strength <dbl>, first_state <chr>, last_state <chr>,
#> #   dominant_state <chr>, dominant_proportion <dbl>, …
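
As a rough illustration of what indices of this kind measure, the sketch below computes a few comparable quantities for a single made-up sequence in base R. The definitions used here (run lengths from rle, Shannon entropy with the natural logarithm, share of consecutive pairs that differ) are assumptions chosen for illustration. The exact definitions and normalisations used by sequence_indices may differ.

```r
# One hypothetical sequence of states.
x <- c("Active", "Active", "Active", "Average", "Average",
       "Disengaged", "Active", "Active")

# Spells are runs of the same state.
runs <- rle(x)
mean_spell <- mean(runs$lengths)
max_spell <- max(runs$lengths)

# Shannon entropy of the state distribution (natural log, an assumption).
p <- as.numeric(table(x)) / length(x)
entropy <- -sum(p * log(p))

# Share of consecutive time points where the state changes.
transitions <- sum(x[-1] != x[-length(x)])
transition_rate <- transitions / (length(x) - 1)

# The most frequent state and its share of the sequence.
dominant_state <- names(which.max(table(x)))
dominant_proportion <- max(p)

c(mean_spell = mean_spell, max_spell = max_spell,
  entropy = entropy, transition_rate = transition_rate)
```

Under this run-length definition, the mean spell duration is the sequence length divided by the number of spells, which agrees with the first row of the output above (23 valid observations and a mean spell duration of 3.83, i.e. six spells).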

Early Warning Signals

The codyna package provides methods for the detection of early warning signals (EWS). These methods have been adapted from the EWSmethods package with a focus on high performance. Instead of explicit rolling window calculations, codyna implements the measures using update formulas, resulting in up to a 1000-fold reduction in computation time in some instances. First, we prepare some simple time series data for analysis.

set.seed(123)
ts_data <- stats::arima.sim(list(order = c(1, 1, 0), ar = 0.6), n = 200)

Both rolling window and expanding window methods are supported.

ews_roll <- detect_warnings(ts_data, method = "rolling")
ews_exp <- detect_warnings(ts_data, method = "expanding")
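
The idea behind update formulas can be illustrated with the variance, a commonly used EWS metric. The base R sketch below is not codyna's implementation. It only shows, under the assumption of a fixed window width w, how a rolling variance can be updated in constant time per step instead of being recomputed from scratch, and how an expanding variance can be updated with Welford's recurrence. All function names are hypothetical.

```r
# Explicit rolling variance: recomputes every window, O(n * w) work.
roll_var_naive <- function(x, w) {
  vapply(seq_len(length(x) - w + 1),
         function(i) var(x[i:(i + w - 1)]), numeric(1))
}

# Rolling variance via update formulas: O(1) work per new observation.
roll_var_update <- function(x, w) {
  n <- length(x)
  out <- numeric(n - w + 1)
  m <- mean(x[1:w])               # mean of the current window
  s <- sum((x[1:w] - m)^2)        # sum of squared deviations
  out[1] <- s / (w - 1)
  for (i in seq_len(n - w)) {
    old <- x[i]                   # value leaving the window
    new <- x[i + w]               # value entering the window
    m_new <- m + (new - old) / w
    s <- s + (new - old) * (new - m_new + old - m)
    m <- m_new
    out[i + 1] <- s / (w - 1)
  }
  out
}

# Expanding variance via Welford's recurrence; NA for the first point.
expand_var_update <- function(x) {
  n <- length(x)
  out <- rep(NA_real_, n)
  m <- x[1]
  s <- 0
  for (k in seq_len(n)[-1]) {
    d <- x[k] - m
    m <- m + d / k
    s <- s + d * (x[k] - m)
    out[k] <- s / (k - 1)
  }
  out
}

# Both approaches agree up to floating point error on a simulated series.
set.seed(1)
x <- cumsum(rnorm(200))
all.equal(roll_var_update(x, 20), roll_var_naive(x, 20))
```

The trade-off is that repeated updates can accumulate small floating point errors over very long series, whereas the explicit version pays for its robustness with a cost that grows with the window width.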

The function detect_warnings returns an object of class ews, and the results can be easily visualized with the plot method of this class.

plot(ews_roll)

plot(ews_exp)

Regime Detection

One of the core features of codyna is regime detection for time series data. Various methods are included with a user-friendly interface and automated parameter selection based on sensitivity. We continue with the example time series data.

regimes <- detect_regimes(
  data = ts_data,
  method = "threshold",
  sensitivity = "medium"
)
regimes
#> # A tibble: 201 × 9
#>     value  time change    id type          magnitude confidence stability  score
#>  *  <dbl> <dbl> <lgl>  <int> <chr>             <dbl> <lgl>      <chr>      <dbl>
#>  1  0         1 FALSE      1 none               0    NA         Initial   NA    
#>  2  0.623     2 TRUE       2 threshold_me…      0.25 NA         Unstable   0.225
#>  3  0.441     3 FALSE      2 none               0    NA         Transiti…  0.35 
#>  4  2.12      4 FALSE      2 none               0    NA         Transiti…  0.475
#>  5  3.62      5 FALSE      2 none               0    NA         Transiti…  0.6  
#>  6  2.56      6 FALSE      2 none               0    NA         Transiti…  0.725
#>  7  2.62      7 FALSE      2 none               0    NA         Transiti…  0.6  
#>  8  2.19      8 FALSE      2 none               0    NA         Transiti…  0.475
#>  9  0.858     9 FALSE      2 none               0    NA         Transiti…  0.35 
#> 10 -0.158    10 FALSE      2 none               0    NA         Unstable   0.225
#> # ℹ 191 more rows

The columns value and time list the original time series values and time points. The column change shows when regime changes occur, and the column type describes the type of regime change (which depends on the applied method). The id column provides the regime identifiers. The column magnitude quantifies the magnitude of the regime shift, and confidence is a method-dependent measure of the likelihood of an actual regime shift. In addition, regime stability is described by the stability column, along with a stability score provided in the score column. The resulting object is of class regimes, which has a customized plot method for visualizing the stability of the regimes along the original time series data.

plot(regimes)
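
The relationship between the change and id columns can be sketched in base R. The example below assumes that the regime identifier starts at 1 and increases by one at every time point where change is TRUE, which is consistent with the first rows printed above but is not confirmed for all methods. The change vector is made up for illustration.

```r
# Hypothetical change flags for six time points.
change <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Assumed rule: a new regime starts wherever change is TRUE.
id <- 1L + cumsum(change)

# Time points belonging to each regime, and the regime lengths.
segments <- split(seq_along(change), id)
regime_lengths <- lengths(segments)
regime_lengths
```

Grouping by id in this way gives a simple route to per-regime summaries, such as the duration of each regime.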