% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dictionary_dtm.R
\name{dictionary_dtm}
\alias{dictionary_dtm}
\title{Making DTM/TDM for Groups of Words}
\usage{
dictionary_dtm(x, dictionary, type = "dtm", simple_sum = FALSE,
  return_dictionary = FALSE, checks = TRUE)
}
\arguments{
\item{x}{an object of class DocumentTermMatrix or TermDocumentMatrix created by
\code{\link[chinese.misc]{corp_or_dtm}}, \code{tm::DocumentTermMatrix} or 
\code{tm::TermDocumentMatrix}. It can also be a numeric matrix, in which case you have to specify 
what it represents through the argument \code{type}.}

\item{dictionary}{a dictionary telling the function how the words are grouped. It can be a list, matrix, data.frame 
or character vector. See Details for how to set this argument.}

\item{type}{if \code{x} is a matrix, you have to tell the function whether it represents a document term matrix or a 
term document matrix. A character string starting with "D" or "d" stands for document term matrix, and one starting 
with "T" or "t" stands for term document matrix. The default is "dtm".}

\item{simple_sum}{if \code{FALSE} (default), a DTM/TDM is returned. If \code{TRUE}, the result does not 
show term frequencies for each text. Instead, a numeric vector is returned, each element of which 
is the sum of the corresponding group of words over the corpus as a whole.}

\item{return_dictionary}{if \code{TRUE}, a modified dictionary is also returned, which contains only the words that
actually exist in the DTM/TDM. The default is \code{FALSE}.}

\item{checks}{the default is \code{TRUE}, which checks whether \code{x} and \code{dictionary} are valid.
For \code{dictionary}, if the input is not a list of character vectors, the function will try to convert it. Do not set 
this to \code{FALSE} unless you are sure that your input is valid.}
}
\value{
If \code{return_dictionary = FALSE}, an object of class DocumentTermMatrix or TermDocumentMatrix is 
returned. If it is \code{TRUE}, a list is returned: the 1st element is the DTM/TDM, and the 2nd
element is a named list of words. In both cases, if \code{simple_sum = TRUE}, the DTM/TDM 
is replaced by a numeric vector.
}
\description{
A dictionary contains several groups of words. Sometimes what we want is not the term frequency of each single word, 
but the total frequency of all the words that belong to the same group. 
Given a dictionary, this function sums up the frequencies of the words in each group, 
so that you do not need to do it manually.
}
\details{
The argument \code{dictionary} can be set in different ways:

\itemize{
  \item (1) list: each element represents a group of words. The element should be a character vector; if it
is not, the function will try to convert it. However, the length of the element should be > 0 and it has 
to contain at least 1 non-NA word.
  \item (2) matrix or data.frame: each entry of the input should be character; if it is not, the function will try to convert it.
At least one of the entries should not be \code{NA}. Each column (not row) represents a group of words.
  \item (3) character vector: it represents one group of words.
  \item (4) Note: you do not need to worry about the same word appearing more than once in a group, because the function
will count it only once. Nor do you need to worry about words in a group that do not
exist in the DTM/TDM, because the function will simply ignore them. If none of the words 
of a group exists, the group will still appear in the final result, but the frequencies of that group 
will all be 0. By setting \code{return_dictionary = TRUE}, you can see which words do exist.
}
}
\examples{
x <- c(
  "Hello, what do you want to drink and eat?", 
  "drink a bottle of milk", 
  "drink a cup of coffee", 
  "drink some water", 
  "eat a cake", 
  "eat a piece of pizza"
)
dtm <- corp_or_dtm(x, from = "v", type = "dtm")
# Use = (not <-) inside list() so that each group gets a name.
D1 <- list(
  aa = c("drink", "eat"),
  bb = c("cake", "pizza"),
  cc = c("cup", "bottle")
)
y1 <- dictionary_dtm(dtm, D1, return_dictionary = TRUE)
#
# NA, duplicated words, non-existent words, 
# non-character elements do not affect the
# result.
D2 <- list(
  has_na = c("drink", "eat", NA),
  this_is_factor = factor(c("cake", "pizza")),
  this_is_duplicated = c("cup", "bottle", "cup", "bottle"), 
  do_not_exist = c("tiger", "dream")
)
y2 <- dictionary_dtm(dtm, D2, return_dictionary = TRUE)
#
# A data.frame dictionary can be 
# read from a csv file.
# Each column represents a group.
D3 <- data.frame(
  aa = c("drink", "eat", NA, NA),
  bb = c("cake", "pizza", NA, NA),
  cc = c("cup", "bottle", NA, NA),
  dd = c("do", "to", "of", "and")
)
y3 <- dictionary_dtm(dtm, D3, simple_sum = TRUE)
#
# If it is a matrix:
mt <- t(as.matrix(dtm))
y4 <- dictionary_dtm(mt, D3, type = "t", return_dictionary = TRUE)
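#
# A minimal base R sketch of the idea behind the grouped sums,
# for illustration only. It does not call dictionary_dtm and is
# not the package's implementation. The names toy_m, toy_dict,
# group_sum and total_sum are hypothetical.
toy_m <- matrix(
  c(1, 0, 2,
    0, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("drink", "eat", "cake"))
)
toy_dict <- list(
  aa = c("drink", "eat", "drink", "tiger", NA),
  bb = "cake",
  zz = "dream"
)
# One column per group, one row per document.
group_sum <- sapply(toy_dict, function(w) {
  # Drop NA, duplicated words and words absent from the matrix.
  w <- intersect(unique(w[!is.na(w)]), colnames(toy_m))
  rowSums(toy_m[, w, drop = FALSE])
})
# Summing over documents is roughly analogous to what
# simple_sum = TRUE is described to return.
total_sum <- colSums(group_sum)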
}
