% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/outpack_tools.R
\name{orderly_metadata_extract}
\alias{orderly_metadata_extract}
\title{Extract metadata from orderly packets}
\usage{
orderly_metadata_extract(
  expr = NULL,
  name = NULL,
  location = NULL,
  allow_remote = NULL,
  fetch_metadata = FALSE,
  extract = NULL,
  options = NULL,
  root = NULL
)
}
\arguments{
\item{expr}{The query expression. A \code{NULL} expression matches everything.}

\item{name}{Optionally, the name of the packet to scope the query on. This
will be intersected with the \code{scope} argument and is a shorthand way of
running \code{scope = list(name = "name")}.}

\item{location}{Optional vector of locations to pull from. We
might in future expand this to allow wildcards or exceptions.}

\item{allow_remote}{Logical, indicating if we should allow packets
to be found that are not currently unpacked (i.e., are known
only to a location that we have metadata from). If this is
\code{TRUE}, then in conjunction with \code{\link[=orderly_dependency]{orderly_dependency()}}
you might pull a large quantity of data.  The default is \code{NULL}, which
is treated as \code{TRUE} if remote locations are listed explicitly as a
character vector in the \code{location} argument, or if you have specified
\code{fetch_metadata = TRUE}, and as \code{FALSE} otherwise.}

\item{fetch_metadata}{Logical, indicating if we should pull
metadata immediately before the search. If \code{location} is given,
then we will pass this through to
\code{\link[=orderly_location_fetch_metadata]{orderly_location_fetch_metadata()}} to filter locations
to update.  If pulling many packets in sequence, you \emph{will} want
to update this option to \code{FALSE} after the first pull, otherwise
it will update the metadata between every packet, which will be
needlessly slow.}

\item{extract}{A character vector of columns to extract, possibly
named. See Details for the format.}

\item{options}{\strong{DEPRECATED}. Please don't use this any more, and
instead use the arguments \code{location}, \code{allow_remote} and
\code{fetch_metadata} directly.}

\item{root}{The path to the root directory, or \code{NULL} (the
default) to search for one from the current working
directory. This function does not require that the directory is
configured for orderly, and can be any \code{outpack} root (see
\code{\link[=orderly_init]{orderly_init()}} for details).}
}
\value{
A \code{data.frame}, the columns of which vary based on the
names of \code{extract}; see Details for more information.
}
\description{
Extract metadata from a group of packets.  This is an
\strong{experimental} high-level function for interacting with the
metadata in a way that we hope will be useful. We'll expand this a
bit as time goes on, based on the feedback we get, so let us know what
you think.  See Details for how to use this.
}
\details{
Extracting data from outpack metadata is challenging to do in a
way that works in data structures familiar to R users, because it
is naturally tree structured, and because not all metadata may be
present in all packets (e.g., a packet that does not depend on
another will not have a dependency section, and one that was run
in a context without git will not have git metadata). If you just
want the raw tree-structured data, you can always use
\code{\link[=orderly_metadata]{orderly_metadata()}} to load the full metadata for any
packet (even one that is not currently available on your computer,
but merely known about) and the structure of the data will remain
fairly constant across orderly versions.

However, sometimes we want to extract data in order to ask
specific questions like:
\itemize{
\item what parameter combinations are available across a range of packets?
\item when were a particular set of packets used?
\item what files did these packets produce?
}

Later we'd like to ask even more complex questions like:
\itemize{
\item at what version did the file \code{graph.png} change?
\item what inputs changed between these versions?
}

...but being able to answer these questions requires a similar
approach to interrogating metadata across a range of packets.

The \code{orderly_metadata_extract} function aims to simplify the
process of pulling out bits of metadata and arranging it into a
\code{data.frame} (of sorts) for you.  It has a little mini-language in
the \code{extract} argument for doing some simple rewriting of results,
but you can always do this yourself.

In order to use this function you need to know what metadata are
available; we will expand the vignette with more worked examples
here to make this easier to understand. The function works on
top-level keys, of which there are:
\itemize{
\item id: the packet id (this is always returned)
\item name: the packet name
\item parameters: a key-value pair of values, with string keys and
atomic values. There is no guarantee about presence of keys
between packets, or their types.
\item time: a key-value pair of times, with string keys and time
values (see \link{DateTimeClasses}; these are stored as seconds since
1970 in the actual metadata). At present \code{start} and \code{end} are
always present.
\item files: files present in each packet. This is a \code{data.frame} (per
packet), each with columns \code{path} (relative), \code{size} (in bytes)
and \code{hash}.
\item depends: dependencies used by each packet. This is a \code{data.frame}
(per packet), each with columns \code{packet} (id), \code{query} (string,
used to find \code{packet}) and \code{files} (another \code{data.frame} with
columns \code{there} and \code{here} corresponding to filenames upstream
and in this packet, respectively).
\item git: either metadata about the state of git or \code{null}. If given
then \code{sha} and \code{branch} are strings, while \code{url} is an array of
strings/character vector (can have zero, one or more elements).
\item session: some information about the session that the packet was run in
(this is unstandardised, and even the orderly version may change)
\item custom: additional metadata added by its respective engine.  For
packets run by \code{orderly}, there will be an \code{orderly} field here,
which is itself a list:
\itemize{
\item artefacts: A \link{data.frame} with artefact information, containing
columns \code{description} (a string) and \code{paths} (a list column of paths).
\item shared: A \link{data.frame} of the copied shared resources with
their original name (\code{there}) and name as copied into the packet
(\code{here}).
\item role: A \link{data.frame} of identified roles of files, with columns \code{path}
and \code{role}.
\item description: A list of information from
\code{\link[=orderly_description]{orderly_description()}} with human-readable descriptions and
tags.
\item session: A list of information about the session as run,
with a list \code{platform} containing information about the platform
(R version as \code{version}, operating system as \code{os} and system name
as \code{system}) and \code{packages} containing columns \code{package},
\code{version} and \code{attached}.
}
}

The nesting here makes providing a universally useful data format
difficult; if considering files we have a \code{data.frame} with a
\code{files} column, which is a list of \code{data.frame}s; similar
nestedness applies to \code{depends} and the orderly custom
data. However, you should be able to fairly easily process the
data into the format you need it in.

The simplest extraction uses names of top-level keys:

\if{html}{\out{<div class="sourceCode">}}\preformatted{extract = c("name", "parameters", "files")
}\if{html}{\out{</div>}}

This creates a data.frame with columns corresponding to these
keys, one row per packet. Because \code{name} is always a string, it
will be a character vector, but because \code{parameters} and \code{files}
are more complex, these will be list columns.

You must not provide \code{id}; it is always returned and always first
as a character vector column.  If your extraction could possibly
return data from locations (i.e., you have \code{allow_remote = TRUE}
or have given a value for \code{location}) then we add a logical column
\code{local} which indicates if the packet is local to your archive,
meaning that you have all the files from it locally.

You can rename the columns by providing a name to entries within
\code{extract}, for example:

\if{html}{\out{<div class="sourceCode">}}\preformatted{extract = c("name", pars = "parameters", "files")
}\if{html}{\out{</div>}}

is the same as above, except that the \code{parameters} column has
been renamed \code{pars}.

More interestingly, we can index into a structure like
\code{parameters}; suppose we want the value of the parameter \code{x}, we
could write:

\if{html}{\out{<div class="sourceCode">}}\preformatted{extract = c(x = "parameters.x")
}\if{html}{\out{</div>}}

which is allowed because for \emph{each packet} the \code{parameters}
element is a list.

However, we do not know what type \code{x} is (and it might vary
between packets). We can add that information ourselves though and write:

\if{html}{\out{<div class="sourceCode">}}\preformatted{extract = c(x = "parameters.x is number")
}\if{html}{\out{</div>}}

to create a numeric column. If any packet has a value of \code{x} that
is non-numeric, your call to \code{orderly_metadata_extract} will fail
with an error, and if a packet lacks a value of \code{x}, a missing
value of the appropriate type will be added.

Note that this does not do any coercion to number; it will error
if a non-NULL non-numeric value is found.  Valid types for use
with \verb{is <type>} are \code{boolean}, \code{number} and \code{string} (note that
these differ slightly from R's names because we want to emphasise
that these are \emph{scalar} quantities; also note that there is no
\code{integer} here as this may produce unexpected errors with
integer-like numeric values). You can also use \code{list} but this is
the default.  Things in the schema that are known to be scalar
atomics (such as \code{name}) will be automatically simplified.

You can index into the array-valued elements (\code{files} and
\code{depends}) in the same way as for the object-valued elements:

\if{html}{\out{<div class="sourceCode">}}\preformatted{extract = c(file_path = "files.path", file_hash = "files.hash")
}\if{html}{\out{</div>}}

would get you a list column of file names per packet and another
of hashes, but this is probably less useful than the \code{data.frame}
you'd get from extracting just \code{files} because you no longer have
the hash information aligned.

You can index fairly deeply; it should be possible to get the
orderly "display name" with:

\if{html}{\out{<div class="sourceCode">}}\preformatted{extract = c(display = "custom.orderly.description.display is string")
}\if{html}{\out{</div>}}

If the path you need to extract has a dot in it (most likely a
package name for a plugin, such as \code{custom.orderly.db}) you need
to escape the dot with a backslash (so, \verb{custom.orderly\\.db}). In R
code you will probably need to double the backslash, or use a raw
string (available from R 4.0.0).
}
\section{Custom 'orderly' metadata}{


Within \code{custom.orderly}, additional fields can be extracted. The
format of this is subject to change, both in the stored metadata
and schema (in the short term) and in the way we deserialise it.
It is probably best not to rely on this right now, and we will
expand this section when we can.
}

\examples{
path <- orderly_example()

# Generate a bunch of packets:
suppressMessages({
  orderly_run("data", echo = FALSE, root = path)
  for (n in c(2, 4, 6, 8)) {
    orderly_run("parameters", list(max_cyl = n), echo = FALSE, root = path)
  }
})

# Without a query, we get a summary over all packets; this will
# often be too much:
orderly_metadata_extract(root = path)

# Pass in a query to limit things:
meta <- orderly_metadata_extract(quote(name == "parameters"), root = path)
meta

# The parameters are present as a list column:
meta$parameters

# You can also lift values from the parameters into columns of their own:
orderly_metadata_extract(
  quote(name == "parameters"),
  extract = c(max_cyl = "parameters.max_cyl is number"),
  root = path)
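
# A hedged sketch of renaming columns, following the Details section: naming
# an entry in extract renames the resulting column. The variable name
# "renamed" is ours, for illustration only; we expect a "pars" list column
# in place of "parameters":
renamed <- orderly_metadata_extract(
  quote(name == "parameters"),
  extract = c("name", pars = "parameters"),
  root = path)
names(renamed)

# Similarly, assuming "files" extracts as described in Details, each element
# of the list column should be a data.frame describing the files of one
# packet (again, "with_files" is just an illustrative name):
with_files <- orderly_metadata_extract(
  quote(name == "parameters"),
  extract = c("name", "files"),
  root = path)
with_files$files[[1]]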
}
