Type: | Package |
Title: | Inferring Latent Diffusion Networks |
Version: | 1.2.4 |
Date: | 2019-02-27 |
Description: | This is an R implementation of the netinf algorithm (Gomez Rodriguez, Leskovec, and Krause, 2010)<doi:10.1145/1835804.1835933>. Given a set of events that spread between a set of nodes the algorithm infers the most likely stable diffusion network that is underlying the diffusion process. |
License: | MIT + file LICENSE |
Imports: | Rcpp (≥ 0.12.5), assertthat, checkmate, ggplot2, ggrepel, stats |
LinkingTo: | Rcpp, RcppProgress |
BugReports: | https://github.com/desmarais-lab/NetworkInference/issues |
Suggests: | testthat, knitr, rmarkdown, pander, igraph, utils, dplyr |
RoxygenNote: | 6.1.1 |
SystemRequirements: | C++11 |
LazyData: | true |
VignetteBuilder: | knitr |
NeedsCompilation: | yes |
Packaged: | 2019-02-28 03:27:18 UTC; fridolinlinder |
Author: | Fridolin Linder [aut, cre], Bruce Desmarais [ctb] |
Maintainer: | Fridolin Linder <fridolin.linder@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2019-02-28 05:50:06 UTC |
NetworkInference: Inferring latent diffusion networks
Description
This package provides an R implementation of the netinf
algorithm
created by Gomez Rodriguez, Leskovec, and Krause (2010). Given a set of
events that spread between a set of nodes the algorithm infers the most likely
stable diffusion network that is underlying the diffusion process.
Details
The package provides three groups of functions: 1) data preparation 2) estimation and 3) interpretation.
Data preparation
The core estimation function netinf
requires an object of class
cascade
(see as_cascade_long and as_cascade_wide).
Cascade data contains information on the potential nodes in the network as
well as on event times for each node in each cascade.
Estimation
Diffusion networks are estimated using the netinf
function. It
produces a diffusion network in form of an edgelist (of class
data.frame
).
Interpretation and Visualization
Cascade data can be visualized with the plot
method of the cascade
class (diffnet, plot.cascade
). Results of the estimation process can
be visualized using the plotting method of the diffnet
class.
Performance
If higher performance is required and for very large data sets, a faster pure C++ implementation is available in the Stanford Network Analysis Project (SNAP). The software can be downloaded at http://snap.stanford.edu/netinf/.
Convert a cascade object to a data frame
Description
Generates a data frame containing the cascade information in the cascade object.
Usage
## S3 method for class 'cascade'
as.data.frame(x, row.names = NULL, optional = FALSE,
...)
Arguments
x |
Cascade object to convert. |
row.names |
NULL or a character vector giving the row names for the data frame. Missing values are not allowed. |
optional |
logical. If TRUE, setting row names and converting column names (to syntactic names: see make.names) is optional. (Not supported) |
... |
Additional arguments passed to |
Value
A data frame with three columns. Containing 1) The names of
the nodes ("node_name"
) that experience an event in each cascade,
2) the event time ("event_time"
) of the corresponding node,
3) the cascade identifier "cascade_id"
.
Examples
data(cascades)
as.data.frame(cascades)
Convert a cascade object to a matrix
Description
Generates a matrix
containing the cascade information in the
cascade object in wide format. Missing values are used for nodes that do not
experience an event in a cascade.
Usage
## S3 method for class 'cascade'
as.matrix(x, ...)
Arguments
x |
cascade object to convert. |
... |
additional arguments to be passed to or from methods. (Currently not supported.) |
Value
A matrix containing all cascade information in wide format. That is,
each row of the matrix corresponds to a node and each column to a cascade.
Cell entries are event times. Censored nodes have NA
for their entry.
Examples
data(cascades)
as.matrix(cascades)
Transform long data to cascade
Description
Create a cascade object from data in long format.
Usage
as_cascade_long(data, cascade_node_name = "node_name",
event_time = "event_time", cascade_id = "cascade_id",
node_names = NULL)
Arguments
data |
data.frame, containing the cascade data
with column names corresponding to the arguments provided to
|
cascade_node_name |
character, column name of |
event_time |
character, column name of |
cascade_id |
character, column name of the cascade identifier. |
node_names |
character, factor or numeric vector containing the names for each node. Optional. If not provided, node names are inferred from the cascade data. |
Details
Each row of the data describes one event in the cascade. The data must contain at least three columns:
Cascade node name: The identifier of the node that experiences the event.
Event time: The time when the node experiences the event. Note that if the time column is of class date or any other special time class, it will be converted to an integer with 'as.numeric()'.
Cascade id: The identifier of the cascade that the event pertains to.
The default names for these columns are node_name
, event_time
and cascade_id
. If other names are used in the data
object the
names have to be specified in the corresponding arguments (see argument
documentation)
Value
An object of class cascade
. This is a list containing three
(named) elements:
-
"node_names"
A character vector of node names. -
"cascade_nodes"
A list with one character vector per cascade containing the node names in order of the events. -
"cascade_times"
A list with one element per cascade containing the event times for the nodes in"cascade_names"
.
Examples
df <- simulate_rnd_cascades(10, n_nodes = 20)
cascades <- as_cascade_long(df)
is.cascade(cascades)
Transform wide data to cascade
Description
Create a cascade object from data in wide format.
Usage
as_cascade_wide(data, node_names = NULL)
Arguments
data |
data.frame or matrix, rows corresponding to nodes, columns to cascades. Matrix entries are the event times for each node, cascade pair. Missing values indicate censored observations, that is, nodes that did not have an event). Specify column and row names if cascade and node ids other than integer sequences are desired. Note that, if the time column is of class date or any other special time class, it will be converted to an integer with 'as.numeric()'. |
node_names |
character, factor or numeric vector, containing names for each node. Optional. If not provided, node names are inferred from the provided data. |
Details
If data is in wide format, each row corresponds to a node and each column to
a cascade. Each cell indicates the event time for a node - cascade
combination. If a node did not experience an event for a cascade (the node
is censored) the cell entry must be NA
.
Value
An object of class cascade
. This is a list containing three
(named) elements:
-
"node_names"
A character vector of node names. -
"cascade_nodes"
A list with one character vector per cascade containing the node names in order of the events. -
"cascade_times"
A list with one element per cascade containing the event times for the nodes in"cascade_names"
.
Examples
data("policies")
cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
event_time = 'adopt_year', cascade_id = 'policy')
wide_policies = as.matrix(cascades)
cascades <- as_cascade_wide(wide_policies)
is.cascade(cascades)
Example cascades
Description
An example dataset of 31 nodes and 54 cascades. From the original netinf implementation in SNAP.
Usage
data(cascades)
Format
An object of class cascade
containing 4 objects
- node_names
Character node names
- cascade_nodes
A list of integer vectors. Each containing the names of the nodes infected in this cascades in the order of infection
- cascade_times
A list of numeric vectors. Each containing the infection times for the corresponding nodes in cascade_nodes
Source
https://github.com/snap-stanford/snap/blob/master/examples/netinf/example-cascades.txt
Count the number of possible edges in the dataset
Description
Across all cascades, count the edges that are possible. An edge from node
u
to node v
is only possible if in at least one cascade u
experienced an event
before v
.
Usage
count_possible_edges(cascades)
Arguments
cascades |
Object of class cascade containing the data. |
Value
An integer count.
Examples
data(cascades)
count_possible_edges(cascades)
Drop nodes from a cascade object
Description
Drop nodes from a cascade object
Usage
drop_nodes(cascades, nodes, drop = TRUE)
Arguments
cascades |
cascade, object to drop nodes from. |
nodes |
character or integer, vector of node_ids to drop. |
drop |
logical, Should empty cascades be dropped. |
Value
An object of class cascade containing the cascades without the dropped nodes.
Examples
data(policies)
cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
event_time = 'adopt_year', cascade_id = 'policy')
new_cascades <- drop_nodes(cascades, c("California", "New York"))
Is the object of class cascade?
Description
Is the object of class cascade?
Usage
is.cascade(object)
Arguments
object |
the object to be tested. |
Value
TRUE
if object is a cascade, FALSE
otherwise.
Examples
data(cascades)
is.cascade(cascades)
# > TRUE
is.cascade(1)
# > FALSE
Is the object of class diffnet?
Description
Tests if an object is of class diffnet. The class diffnet is appended to the
object returned by netinf
for dispatch of appropriate plotting
methods.
Usage
is.diffnet(object)
Arguments
object |
the object to be tested. |
Value
TRUE
if object is a diffnet, FALSE
otherwise.
Examples
data(cascades)
result <- netinf(cascades, n_edges = 6, params = 1)
is.diffnet(result)
Infer latent diffusion network
Description
Infer a network of diffusion ties from a set of cascades. Each cascade is defined by pairs of node ids and infection times.
Usage
netinf(cascades, trans_mod = "exponential", n_edges = NULL,
p_value_cutoff = NULL, params = NULL, quiet = FALSE,
trees = FALSE)
Arguments
cascades |
an object of class cascade containing node and cascade
information. See |
trans_mod |
character, indicating the choice of model:
|
n_edges |
integer, number of edges to infer. Leave unspecified if using
|
p_value_cutoff |
numeric, in the interval (0, 1). If
specified, edges are inferred in each iteration until the Vuong test for
edge addition reaches the p-value cutoff or when the maximum
possible number of edges is reached. Leave unspecified if using
|
params |
numeric, Parameters for diffusion model. If left unspecified reasonable parameters are inferred from the data. See details for how to specify parameters for the different distributions. |
quiet |
logical, Should output on progress by suppressed. |
trees |
logical, Should the inferred cascade trees be returned. Note, that this will lead to a different the structure of the function output. See section Value for details. |
Details
The algorithm is describe in detail in Gomez-Rodriguez et al. (2010). Additional information can be found on the netinf website (http://snap.stanford.edu/netinf/).
Exponential distribution:
trans_mod = "exponential"
,params = c(lambda)
. Parametrization:\lambda e^{-\lambda x}
.Rayleigh distribution:
trans_mod = "rayleigh"
,params = c(alpha)
. Parametrization:\frac{x}{\alpha^2} \frac{e^{-x^2}}{2\alpha^2}
.Log-normal distribution:
trans_mod = "log-normal"
,params = c(mu, sigma)
. Parametrization:\frac{1}{x\sigma\sqrt{2\pi}}e^{-\frac{(ln x - \mu)^2}{2\sigma^2}}
.
If higher performance is required and for very large data sets, a faster pure C++ implementation is available in the Stanford Network Analysis Project (SNAP). The software can be downloaded at http://snap.stanford.edu/netinf/.
Value
Returns the inferred diffusion network as an edgelist in an object of
class diffnet
and data.frame
. The first
column contains the sender, the second column the receiver node. The
third column contains the improvement in fit from adding the edge that is
represented by the row. The output additionally has the following
attributes:
-
"diffusion_model"
: The diffusion model used to infer the diffusion network. -
"diffusion_model_parameters"
: The parameters for the model that have been inferred by the approximate profile MLE procedure.
If the argument trees
is set to TRUE
, the output is a list
with the first element being the data.frame
described above, and
the second element being the trees in edge-list form in a single
data.frame
.
References
M. Gomez-Rodriguez, J. Leskovec, A. Krause. Inferring Networks of Diffusion and Influence.The 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2010.
Examples
# Data already in cascades format:
data(cascades)
out <- netinf(cascades, trans_mod = "exponential", n_edges = 5, params = 1)
# Starting with a dataframe
df <- simulate_rnd_cascades(10, n_nodes = 20)
cascades2 <- as_cascade_long(df, node_names = unique(df$node_name))
out <- netinf(cascades2, trans_mod = "exponential", n_edges = 5, params = 1)
Plot a cascade object
Description
Allows plotting of one or multiple, labeled or unlabeled cascades.
Usage
## S3 method for class 'cascade'
plot(x, label_nodes = TRUE, selection = NULL, ...)
Arguments
x |
object of class cascade to be plotted. |
label_nodes |
logical, indicating if should the nodes in each cascade be
labeled. If the cascades are very dense setting this to |
selection |
a vector of cascade ids to plot. |
... |
additional arguments passed to plot. |
Details
The function returns a ggplot plot object (class gg, ggplot
) which
can be modified like any other ggplot. See the ggplot documentation and the
examples below for more details.
Value
A ggplot plot object.
Examples
data(cascades)
plot(cascades, selection = names(cascades$cascade_nodes)[1:5])
plot(cascades, label_nodes = FALSE, selection = sample(1:54, 20))
# Modify resulting ggplot object
library(ggplot2)
p <- plot(cascades, label_nodes = FALSE, selection = sample(1:54, 20))
## Add a title
p <- p + ggtitle('Your Title')
p
## Change Axis
p <- p + xlab("Your modified y axis label") #x and y labels are flipped here
p <- p + ylab("Your modified x axis label") #x and y labels are flipped here
p
Visualize netinf output
Description
Visualize the inferred diffusion network or the marginal gain in fit obtained by addition of each edge.
Usage
## S3 method for class 'diffnet'
plot(x, type = "network", ...)
Arguments
x |
object of class diffnet to be plotted. |
type |
character, one of |
... |
additional arguments. |
Details
If 'type = improvement' a ggplot object is returned. It can be modified like any other ggplot. See the ggplot documentation and the examples in plot.cascade.
Value
A ggplot plot object if type = "improvement"
otherwise an
igraph plot.
Examples
## Not run:
data(cascades)
res <- netinf(cascades, quiet = TRUE)
plot(res, type = "network")
plot(res, type = "improvement")
plot(res, type = "p-value")
## End(Not run)
US State Policy Adoption (SPID)
Description
The SPID data includes information on the year of adoption for over 700 policies in the American states.
Usage
data(policies)
Format
The data comes in two objects of class data.frame
. The first
object, named policies
contains the adoption events. Each row
corresponds to an adoption event. Each adoption event is described by
the three columns:
-
statenam
: Name of the adopting state. -
policy
: Name of the policy. -
adopt_year
: Year when the state adopted the policy.
The second object (policies_metadata
) contains more details on each
of the policies. It contains these columns:
-
policy
: Name of the policy. -
source
: Original source of the data. -
first_year
: First year any state adopted this policy. -
last_year
: Last year any state adopted this policy. -
adopt_count
: Number of states that adopted this policy. -
description
: Description of the policy. -
majortopic
: Topic group the policy belongs to.
Both data.frame
objects can be joined (merged) on the common column
policy
(see example code).
Details
This version 1.0 of the database. For each policy we document the year of first adoption for each state. Adoption dates range from 1691 to 2017 and includes all fifty states. Policies are adopted by anywhere from 1 to 50 states, with an average of 24 adoptions. The data were assembled from a variety of sources, including academic publications and policy advocacy/information groups. Policies were coded according to the Policy Agendas Project major topic code. Additional information on policies is available at the source repository.
Source
https://doi.org/10.7910/DVN/CVYSR7
References
Boehmke, Frederick J.; Mark Brockway; Bruce A. Desmarais; Jeffrey J. Harden; Scott LaCombe; Fridolin Linder; and Hanna Wallach. 2018. "A New Database for Inferring Public Policy Innovativeness and Diffusion Networks." Working paper.
Examples
data('policies')
# Join the adoption events with the metadata
merged_policies <- merge(policies, policies_metadata, by = 'policy')
Larger simulated validation network.
Description
A network from simulated data. For testing purposes.
Usage
data(sim_validation)
Format
An object of class data.frame
with 4 columns, containing:
- origin_node
Origin of diffusion edge.
- destination_node
Destination node of diffusion edge.
- improvement
Improvement in score for the edge
- p-value
p-value for vuong test
Source
See code below.
Simulate cascades from a diffusion network
Description
Simulate diffusion cascades based on the generative model underlying netinf and a diffusion network.
Usage
simulate_cascades(diffnet, nsim = 1, max_time = Inf,
start_probabilities = NULL, partial_cascade = NULL, params = NULL,
model = NULL, nodes = NULL)
Arguments
diffnet |
object of class |
nsim |
integer, number of cascades to simulate. |
max_time |
numeric, the maximum time after which observations are censored |
start_probabilities |
a vector of probabilities for each node in diffnet,
to be the node with the first event. If |
partial_cascade |
object of type cascade, containing one partial cascades for which further development should be simulated. |
params |
numeric, (optional) parameters for diffusion time distribution.
See the details section of |
model |
character, diffusion model to use. One of |
nodes |
vector of node ids if different from nodes included in
|
Value
A data frame with three columns. Containing 1) The names of
the nodes ("node_name"
) that experience an event in each cascade,
2) the event time ("event_time"
) of the corresponding node,
3) the cascade identifier "cascade_id"
.
Examples
data(cascades)
out <- netinf(cascades, trans_mod = "exponential", n_edges = 5, params = 1)
simulated_cascades <- simulate_cascades(out, nsim = 10)
# Simulation from partial cascade
Simulate a set of random cascades
Description
Simulate random cascades, for testing and demonstration purposes. No actual diffusion model is underlying these cascades.
Usage
simulate_rnd_cascades(n_cascades, n_nodes)
Arguments
n_cascades |
Number of cascades to generate. |
n_nodes |
Number of nodes in the system. |
Value
A data frame containing (in order of columns) node ids, event time and cascade identifier.
Examples
df <- simulate_rnd_cascades(10, n_nodes = 20)
head(df)
Select a subset of cascades from cascade object
Description
Select a subset of cascades from cascade object
Usage
subset_cascade(cascade, selection)
Arguments
cascade |
cascade, object to select from |
selection |
character or integer, vector of cascade_ids to select |
Value
An object of class cascade containing just the selected cascades
Examples
data(policies)
cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
event_time = 'adopt_year', cascade_id = 'policy')
cascade_names <- names(cascades$cascade_times)
subset_cascade(cascades, selection = cascade_names[1:10])
Subset a cascade object in time
Description
Remove each all events occurring outside the desired subset for each cascade in a cascade object.
Usage
subset_cascade_time(cascade, start_time, end_time, drop = TRUE)
Arguments
cascade |
cascade, object to subset. |
start_time |
numeric, start time of the subset. |
end_time |
numeric, end time of the subset. |
drop |
logical, should empty sub-cascades be dropped? |
Value
An object of class cascade, where only events are included that have
times start_time
<= t < end_time
.
Examples
data(cascades)
sub_cascades <- subset_cascade_time(cascades, 10, 20, drop=TRUE)
Summarize a cascade object
Description
Generates summary statistics for single cascades and across cascades in a collection, contained in a cascades object.
Usage
## S3 method for class 'cascade'
summary(object, quiet = FALSE, ...)
Arguments
object |
object of class cascade to be summarized. |
quiet |
logical, if |
... |
Additional arguments passed to summary. |
Value
Prints cascade summary information to the screen
(if quiet = FALSE
). '# cascades'
is the number of cascades in
the object, '# nodes'
is the number of nodes in the system (nodes
that can theoretically experience an event), '# nodes in cascades'
is
the number of unique nodes of the system that experienced an event and
'# possible edges'
is the number of edges that are possible given
the cascade data (see count_possible_edges
for details.).
Additional summaries for each cascade are returned invisibly.
cascade), length
(length of the cascade as an integer of how many
nodes experienced and event) and n_ties
(number of tied event
times per cascade).
Examples
data(cascades)
summary(cascades)
Validation output from netinf source.
Description
Contains output from original netinf C++ implementation, executed on
cascades
. For testing purposes.
Usage
data(validation)
Format
An object of class data.frame
with 6 columns, containing:
- origin_node
Origin of diffusion edge.
- destination_node
Destination node of diffusion edge.
- volume
??
- marginal_gain
Marginal gain from edge.
- median_time_difference
Median time between events in origin and destination
- mean_time_difference
Mean time between events in origin and destination
Source
Output from netinf example program (https://github.com/snap-stanford/snap/tree/master/examples/netinf).