Version: | 0.97.2 |
Title: | Textual Statistics for the Quantitative Analysis of Textual Data |
Description: | Textual statistics functions formerly in the 'quanteda' package. Textual statistics for characterizing and comparing textual data. Includes functions for measuring term and document frequency, the co-occurrence of words, similarity and distance between features and documents, feature entropy, keyword occurrence, readability, and lexical diversity. These functions extend the 'quanteda' package and are specially designed for sparse textual data. |
License: | GPL-3 |
Depends: | R (≥ 3.5.0) |
Imports: | quanteda (≥ 4.0.0), Matrix (≥ 1.5-0), methods, nsyllable, proxyC (≥ 0.1.4), Rcpp (≥ 0.12.12), stringi |
LinkingTo: | Rcpp, RcppArmadillo (≥ 0.7.600.1.0), quanteda |
Suggests: | entropy, ExPosition, proxy, rmarkdown, spelling, svs, testthat, knitr, covr |
URL: | https://quanteda.io |
Encoding: | UTF-8 |
BugReports: | https://github.com/quanteda/quanteda.textstats/issues |
LazyData: | TRUE |
Language: | en-GB |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | yes |
Packaged: | 2024-09-03 10:13:18 UTC; kbenoit |
Author: | Kenneth Benoit |
Maintainer: | Kenneth Benoit <kbenoit@lse.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2024-09-03 12:20:04 UTC |
quanteda.textstats: Textual Statistics for the Quantitative Analysis of Textual Data
Description
Textual statistics functions formerly in the 'quanteda' package. Textual statistics for characterizing and comparing textual data. Includes functions for measuring term and document frequency, the co-occurrence of words, similarity and distance between features and documents, feature entropy, keyword occurrence, readability, and lexical diversity. These functions extend the 'quanteda' package and are specially designed for sparse textual data.
Author(s)
Maintainer: Kenneth Benoit kbenoit@lse.ac.uk (ORCID) [copyright holder]
Authors:
Kohei Watanabe watanabe.kohei@gmail.com (ORCID)
Haiyan Wang whyinsa@yahoo.com (ORCID)
Jiong Wei Lua J.W.Lua@lse.ac.uk
Jouni Kuha j.kuha@lse.ac.uk (ORCID)
Other contributors:
European Research Council (ERC-2011-StG 283794-QUANTESS) [funder]
See Also
Useful links:
https://quanteda.io
Report bugs at https://github.com/quanteda/quanteda.textstats/issues
textstat_simil/dist coercion methods
Description
Coercion methods for objects created by textstat_simil()
and
textstat_dist()
.
Usage
## S3 method for class 'textstat_proxy'
as.list(x, sorted = TRUE, n = NULL, diag = FALSE, ...)
## S3 method for class 'textstat_proxy'
as.data.frame(
x,
row.names = NULL,
optional = FALSE,
diag = FALSE,
upper = FALSE,
...
)
Arguments
x |
any R object. |
sorted |
sort results in descending order if TRUE |
n |
the top n highest-ranking pairs will be returned; if NULL (the default), all pairs are returned |
diag |
logical; if TRUE, include same-pair (diagonal) comparisons |
... |
additional arguments to be passed to or from methods. |
row.names |
NULL or a character vector giving the row names for the data frame |
optional |
logical. If TRUE, setting row names and converting column names is optional |
upper |
logical; if TRUE, return pairs as both (A, B) and (B, A) |
Value
as.list for a textstat_simil or textstat_dist object returns a list equal in length to the columns of the simil or dist object, with the rows and their values as named elements. By default, this list excludes same-pair comparisons (when diag = FALSE) and sorts the values in descending order (when sorted = TRUE).
as.data.frame for a textstat_simil or textstat_dist object returns a data.frame of pairwise combinations and their similarity or distance values.
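For example (a minimal sketch, not from the package's own examples; it assumes the quanteda inaugural corpus):
library("quanteda")
dfmat <- dfm(tokens(data_corpus_inaugural[1:3]))
sim <- textstat_simil(dfmat, method = "cosine", margin = "documents")
as.list(sim, n = 2)   # top 2 most similar documents for each document
as.data.frame(sim)    # long-format data.frame of pairwise similarities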
as.matrix method for textstat_simil_sparse
Description
as.matrix method for textstat_simil_sparse
Usage
## S4 method for signature 'textstat_simil_sparse'
as.matrix(x, omitted = NA, ...)
## S4 method for signature 'textstat_simil_symm_sparse'
as.matrix(x, omitted = NA, ...)
Arguments
x |
an object returned by textstat_simil() when min_simil is specified |
omitted |
value that will replace the omitted cells |
... |
unused |
Value
a matrix object
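A brief sketch (illustrative, not from the package's own examples): when min_simil is used, similarity values below the threshold are dropped from the sparse object, and as.matrix() fills those cells with the omitted value.
library("quanteda")
dfmat <- dfm(tokens(data_corpus_inaugural[1:4]))
sim <- textstat_simil(dfmat, method = "cosine", min_simil = 0.5)
as.matrix(sim)               # dropped cells become NA
as.matrix(sim, omitted = 0)  # or any other replacement value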
Check arguments passed to other functions via ...
Description
Check arguments passed to other functions via ...
Usage
check_dots(..., method = NULL)
Arguments
... |
dots to check |
method |
the names of functions |
Compute lexical diversity from a dfm or tokens
Description
Internal functions used in textstat_lexdiv()
, for computing
lexical diversity measures on dfms or tokens objects
Usage
compute_lexdiv_dfm_stats(x, measure = NULL, log.base = 10)
compute_lexdiv_tokens_stats(
x,
measure = c("MATTR", "MSTTR"),
MATTR_window,
MSTTR_segment
)
Arguments
x |
a dfm object (for compute_lexdiv_dfm_stats) or a tokens object (for compute_lexdiv_tokens_stats) |
measure |
a list of lexical diversity measures. |
log.base |
a numeric value defining the base of the logarithm (for measures using logs) |
MATTR_window |
a numeric value defining the size of the moving window for computation of the Moving-Average Type-Token Ratio (Covington & McFall, 2010) |
MSTTR_segment |
a numeric value defining the size of each segment for the computation of the Mean Segmental Type-Token Ratio (Johnson, 1944) |
Details
compute_lexdiv_dfm_stats is an internal function that computes the lexical diversity measures from a dfm input.
compute_lexdiv_tokens_stats is an internal function that computes the lexical diversity measures from a tokens input.
Value
a data.frame
with a document
column containing the
input document name, followed by columns with the lexical diversity
statistic, in the order in which they were supplied as the measure
argument.
Compute the Moving-Average Type-Token Ratio (MATTR)
Description
From a tokens object, computes the Moving-Average Type-Token Ratio (MATTR)
from Covington & McFall (2010), averaging all of the sequential moving
windows of tokens of size MATTR_window
across the text, returning the
average as the MATTR.
Usage
compute_mattr(x, MATTR_window = 100L)
Arguments
x |
a tokens object |
MATTR_window |
integer; the size of the moving window for computation of TTR, between 1 and the number of tokens of the document |
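A conceptual base-R sketch of the calculation (illustrative only; not the package's internal implementation), where toks is a character vector of tokens:
mattr_sketch <- function(toks, window = 100L) {
  n <- length(toks)
  if (n < window) window <- n
  ttrs <- vapply(seq_len(n - window + 1), function(i) {
    w <- toks[i:(i + window - 1)]
    length(unique(w)) / window        # TTR of this window
  }, numeric(1))
  mean(ttrs)                          # MATTR = mean of the window TTRs
}
mattr_sketch(c("a", "b", "b", "a", "c", "d", "a"), window = 3)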
Compute the Mean Segmental Type-Token Ratio (MSTTR)
Description
Compute the Mean Segmental Type-Token Ratio (Johnson 1944) for a tokens input.
Usage
compute_msttr(x, MSTTR_segment)
Arguments
x |
input tokens |
MSTTR_segment |
a numeric value defining the size of each segment for the computation of the Mean Segmental Type-Token Ratio (Johnson, 1944) |
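A conceptual base-R sketch of the calculation (illustrative only; not the package's internal implementation): split the tokens into non-overlapping segments, compute the TTR of each, and average, ignoring trailing tokens that do not fill a complete segment:
msttr_sketch <- function(toks, segment = 100L) {
  n_full <- floor(length(toks) / segment)
  if (n_full < 1) return(NA_real_)
  idx <- rep(seq_len(n_full), each = segment)
  ttrs <- tapply(toks[seq_len(n_full * segment)], idx,
                 function(w) length(unique(w)) / length(w))
  mean(ttrs)
}
msttr_sketch(c("a", "b", "b", "c", "a", "d", "e", "e"), segment = 4)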
Word lists for readability statistics
Description
data_char_wordlists
provides word lists used in some readability indexes;
it is a named list of character vectors where each list element
corresponds to a different readability index.
Usage
data_char_wordlists
Format
A list of length two:
DaleChall
The long Dale-Chall list of 3,000 familiar (English) words needed to compute the Dale-Chall Readability Formula.
Spache
The revised Spache word list (see Klare 1975, 73; Spache 1974) needed to compute the Spache Revised Formula of readability (Spache 1953).
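For example, the word lists can be inspected directly:
names(data_char_wordlists)
head(data_char_wordlists$DaleChall)
head(data_char_wordlists$Spache)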
References
Chall, J.S., & Dale, E. (1995). Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.
Dale, E. & Chall, J.S. (1948). A Formula for Predicting Readability. Educational Research Bulletin, 27(1): 11–20.
Dale, E. & Chall, J.S. (1948). A Formula for Predicting Readability: Instructions. Educational Research Bulletin, 27(2): 37–54.
Klare, G.R. (1975). Assessing Readability. Reading Research Quarterly 10(1), 62–102.
Spache, G. (1953). A New Readability Formula for Primary-Grade Reading Materials. The Elementary School Journal, 53, 410–413.
Spache, G. (1974). Good reading for poor readers. (Rvd. 9th Ed.) Champaign, Illinois: Garrard, 1974.
Split a dfm's hyphenated features into constituent parts
Description
Takes a dfm that contains features with hyphenated words, such as "split-second", and turns them into features that split the elements in the same way as tokens(x, remove_hyphens = TRUE) would have done.
Usage
dfm_split_hyphenated_features(x)
Arguments
x |
input dfm |
convert same-value pairs to NA in a textstat_proxy object
Description
Converts the diagonal, or the same-pair equivalent in an object where the columns have been selected, to NA.
Usage
diag2na(x)
Arguments
x |
the return from textstat_proxy() |
Value
sparse Matrix format with same-pair values replaced with NA
Internal function to extract docvars
Description
Internal function to extract docvars
Usage
get_docvars(x, field = NULL, user = TRUE, system = FALSE, drop = FALSE)
Arguments
x |
an object from which docvars are extracted |
field |
name of docvar fields |
user |
if TRUE, return user-defined docvars |
system |
if TRUE, return system-level docvars |
drop |
if TRUE, drop the data.frame structure and return a vector when a single field is selected |
Return the first or last part of a textstat_proxy object
Description
For a similarity or distance object computed via textstat_simil or
textstat_dist, returns the first or last n
rows.
Usage
## S3 method for class 'textstat_proxy'
head(x, n = 6L, ...)
## S3 method for class 'textstat_proxy'
tail(x, n = 6L, ...)
Arguments
x |
a textstat_simil/textstat_dist object |
n |
a single, positive integer. If positive, size for the resulting object: number of first/last documents for the dfm. If negative, all but the n last/first number of documents of x. |
... |
unused |
Value
A matrix corresponding to the subset defined
by n
.
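For example (a minimal sketch, assuming the quanteda inaugural corpus):
library("quanteda")
dfmat <- dfm(tokens(data_corpus_inaugural[1:5]))
tstat <- textstat_dist(dfmat)
head(tstat, 2)   # first two rows of the distance object
tail(tstat, 2)   # last two rows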
Count the Scrabble letter values of text
Description
Tally the Scrabble letter values of text given a user-supplied function, such as the sum (default) or mean of the character values.
Usage
nscrabble(x, FUN = sum)
Arguments
x |
a character vector |
FUN |
function to be applied to the character values in the text;
default is sum |
Value
a (named) integer vector of Scrabble letter values, computed using
FUN
, corresponding to the input text(s)
Note
Character values are only defined for non-accented Latin a-z, A-Z letters. Lower-casing is unnecessary.
We would be happy to add more languages to this extremely useful function if you send us the values for your language!
Author(s)
Kenneth Benoit
Examples
nscrabble(c("muzjiks", "excellency"))
nscrabble(quanteda::data_corpus_inaugural[1:5], mean)
nsyllable methods for tokens
Description
Extends nsyllable()
methods for tokens objects.
Usage
## S3 method for class 'tokens'
nsyllable(
x,
language = "en",
syllable_dictionary = nsyllable::data_syllables_en,
use.names = FALSE
)
Arguments
x |
character vector whose syllables will be counted. This will count all syllables in a character vector without regard to separating tokens, so it is recommended that x be individual terms. |
language |
specify the language for syllable counts by ISO 639-1 code. The default is English, using the data object nsyllable::data_syllables_en |
syllable_dictionary |
optional named integer vector of syllable counts where the names are lower case tokens. This can be used to override the language setting, when set to a dictionary other than the default |
use.names |
logical; if TRUE, assign the tokens as names of the syllable count vector |
Examples
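A minimal illustration (a sketch; it assumes the nsyllable package is installed, as it is an Import, and that the tokens method registered by this package is available):
library("quanteda")
toks <- tokens("It was a dark and stormy night.", remove_punct = TRUE)
# count syllables for each token, keeping the tokens as names
nsyllable::nsyllable(toks, use.names = TRUE)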
Identify and score multi-word expressions
Description
Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
Usage
textstat_collocations(
x,
method = "lambda",
size = 2,
min_count = 2,
smoothing = 0.5,
tolower = TRUE,
...
)
Arguments
x |
a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with tokens_remove(x, pattern, padding = TRUE) |
method |
association measure for detecting collocations. Currently this is limited to "lambda" |
size |
integer; the length of the collocations to be scored |
min_count |
numeric; minimum frequency of collocations that will be scored |
smoothing |
numeric; a smoothing parameter added to the observed counts (default is 0.5) |
tolower |
logical; if TRUE, form collocations as lower-cased combinations |
... |
additional arguments passed to tokens() |
Details
Documents are grouped for the purposes of scoring, but collocations will not span sentences. If x is a tokens object and some tokens have been removed, this should be done using tokens_remove(x, pattern, padding = TRUE) so that counts will still be accurate, but the pads will prevent those collocations from being scored.
The lambda computed for a size = K-word target multi-word expression is the coefficient for the K-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), where all multi-word expressions are considered (rather than just verbs, as in that paper). The z is the Wald z-statistic, computed as the quotient of lambda and its estimated standard error, as described below.
In detail:
Consider a K-word target expression x, and let z be any K-word expression. Define a comparison function c(x,z) = (j_{1}, \dots, j_{K}) = c such that the kth element of c is 1 if the kth word in z is equal to the kth word in x, and 0 otherwise. Let c_{i} = (j_{i1}, \dots, j_{iK}), i = 1, \dots, 2^{K} = M, be the possible values of c(x,z), with c_{M} = (1, 1, \dots, 1). Consider the set of c(x,z_{r}) across all expressions z_{r} in a corpus of text, and let n_{i}, for i = 1, \dots, M, denote the number of the c(x,z_{r}) which equal c_{i}, plus the smoothing constant smoothing. The n_{i} are the counts in a 2^{K} contingency table whose dimensions are defined by the c_{i}.
\lambda: the K-way interaction parameter in the saturated loglinear model fitted to the n_{i}. It can be calculated as
\lambda = \sum_{i=1}^{M} (-1)^{K - b_{i}} \log n_{i}
where b_{i} is the number of the elements of c_{i} which are equal to 1.
z: the Wald test z-statistic, calculated as
z = \frac{\lambda}{\left[ \sum_{i=1}^{M} n_{i}^{-1} \right]^{1/2}}
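As a numerical illustration of these formulas, a hand calculation for a bigram (K = 2) with hypothetical smoothed counts (a sketch; not the package's internal implementation):
# hypothetical smoothed counts n_i, ordered by the comparison vectors
# c_i = (0,0), (0,1), (1,0), (1,1) for a target bigram "a b"
n <- c(1000.5, 40.5, 30.5, 20.5)
b <- c(0, 1, 1, 2)                   # number of 1s in each c_i
K <- 2
lambda <- sum((-1)^(K - b) * log(n)) # K-way interaction (log odds ratio for K = 2)
z <- lambda / sqrt(sum(1 / n))       # Wald z-statistic
c(lambda = lambda, z = z)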
Value
textstat_collocations
returns a data.frame of collocations and
their scores and statistics. This consists of the collocations, their
counts, length, and \lambda
and z
statistics. When size
is a
vector, then count_nested
counts the lower-order collocations that occur
within a higher-order collocation (but this does not affect the
statistics).
Author(s)
Kenneth Benoit, Jouni Kuha, Haiyan Wang, and Kohei Watanabe
References
Blaheta, D. & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
Examples
library("quanteda")
corp <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10)
head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10)
# extracting multi-part proper nouns (capitalized terms)
toks1 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE)
toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
case_insensitive = FALSE, padding = TRUE)
tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE)
head(tstat, 10)
# vectorized size
txt <- c(". . . . a b c . . a b c . . . c d e",
"a b . . a b . . a b . . a b . a b",
"b c d . . b c . b c . . . b c")
textstat_collocations(txt, size = 2:3)
# compounding tokens from collocations
toks <- tokens("This is the European Union.")
colls <- tokens("The new European Union is not the old European Union.") %>%
textstat_collocations(size = 2, min_count = 1, tolower = FALSE)
colls
tokens_compound(toks, colls, case_insensitive = FALSE)
# from a collocations object
(coll <- textstat_collocations(tokens("a b c a b d e b d a b")))
phrase(coll)
Compute entropies of documents or features
Description
Compute entropies of documents or features
Usage
textstat_entropy(x, margin = c("documents", "features"), base = 2)
Arguments
x |
a dfm object |
margin |
character indicating for which margin to compute entropy |
base |
base for logarithm function |
Value
a data.frame of entropies for the given document or feature
Examples
library("quanteda")
textstat_entropy(data_dfm_lbgexample)
textstat_entropy(data_dfm_lbgexample, "features")
Tabulate feature frequencies
Description
Produces counts and document frequencies summaries of the features in a dfm, optionally grouped by a docvars variable or other supplied grouping variable.
Usage
textstat_frequency(
x,
n = NULL,
groups = NULL,
ties_method = c("min", "average", "first", "random", "max", "dense"),
...
)
Arguments
x |
a dfm object |
n |
(optional) integer specifying the top n features to be returned, within group if groups is specified |
groups |
grouping variable for sampling, equal in length to the number of documents. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This also changes previous behaviours for groups. |
ties_method |
character string specifying how ties are treated. See data.table::frank() for details. |
... |
additional arguments passed to dfm_group().
This can be useful in passing force = TRUE, for instance, if you are grouping a dfm that has been weighted. |
Value
a data.frame containing the following variables:
feature
(character) the feature
frequency
count of the feature
rank
rank of the feature, where 1 indicates the greatest frequency
docfreq
document frequency of the feature, as a count (the number of documents in which this feature occurred at least once)
group
(only if groups is specified) the label of the group. If the features have been grouped, then all counts, ranks, and document frequencies are within group. If groups is not specified, the group column is omitted from the returned data.frame.
textstat_frequency
returns a data.frame of features and
their term and document frequencies within groups.
Examples
library("quanteda")
set.seed(20)
dfmat1 <- dfm(tokens(c("a a b b c d", "a d d d", "a a a")))
textstat_frequency(dfmat1)
textstat_frequency(dfmat1, groups = c("one", "two", "one"), ties_method = "first")
textstat_frequency(dfmat1, groups = c("one", "two", "one"), ties_method = "average")
dfmat2 <- corpus_subset(data_corpus_inaugural, President == "Obama") %>%
tokens(remove_punct = TRUE) %>%
tokens_remove(stopwords("en")) %>%
dfm()
tstat1 <- textstat_frequency(dfmat2)
head(tstat1, 10)
dfmat3 <- head(data_corpus_inaugural) %>%
tokens(remove_punct = TRUE) %>%
tokens_remove(stopwords("en")) %>%
dfm()
textstat_frequency(dfmat3, n = 2, groups = President)
## Not run:
# plot 20 most frequent words
library("ggplot2")
ggplot(tstat1[1:20, ], aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
coord_flip() +
labs(x = NULL, y = "Frequency")
# plot relative frequencies by group
dfmat3 <- data_corpus_inaugural %>%
corpus_subset(Year > 2000) %>%
tokens(remove_punct = TRUE) %>%
tokens_remove(stopwords("en")) %>%
dfm() %>%
dfm_group(groups = President) %>%
dfm_weight(scheme = "prop")
# calculate relative frequency by president
tstat2 <- textstat_frequency(dfmat3, n = 10, groups = President)
# plot frequencies
ggplot(data = tstat2, aes(x = factor(nrow(tstat2):1), y = frequency)) +
geom_point() +
facet_wrap(~ group, scales = "free") +
coord_flip() +
scale_x_discrete(breaks = nrow(tstat2):1,
labels = tstat2$feature) +
labs(x = NULL, y = "Relative frequency")
## End(Not run)
Calculate keyness statistics
Description
Calculate "keyness", a score for features that occur differentially across different categories. Here, the categories are defined by reference to a "target" document index in the dfm, with the reference group consisting of all other documents.
Usage
textstat_keyness(
x,
target = 1L,
measure = c("chi2", "exact", "lr", "pmi"),
sort = TRUE,
correction = c("default", "yates", "williams", "none"),
...
)
Arguments
x |
a dfm containing the features to be examined for keyness |
target |
the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference |
measure |
(signed) association measure to be used for computing keyness.
Currently available: "chi2", "exact" (Fisher's exact test), "lr" (likelihood ratio), or "pmi" (pointwise mutual information) |
sort |
logical; if TRUE sort features scored in descending order of the measure, otherwise leave in original feature order |
correction |
if "default", the Yates correction is applied to "chi2" and the Williams correction is applied to "lr", with no correction applied for "exact" and "pmi"; "yates", "williams", or "none" can also be specified directly |
... |
not used |
Value
a data.frame of computed statistics and associated p-values, where the rows are named by the scored features and include the number of occurrences in both the target and reference groups. For measure = "chi2" this is the chi-squared value, signed positively if the observed value in the target exceeds its expected value; for measure = "exact" this is the estimate of the odds ratio; for measure = "lr" this is the likelihood ratio G2 statistic; for "pmi" this is the pointwise mutual information statistic.
textstat_keyness
returns a data.frame of features and
their keyness scores and frequency counts.
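The chi-squared variant can be reproduced by hand from the 2 x 2 table of a feature's count versus all other features, in the target versus the reference documents (a sketch with hypothetical counts; chisq.test() applies the Yates continuity correction by default):
# hypothetical counts: rows = target, reference; cols = feature, all other features
tab <- matrix(c(30, 9970,
                10, 9990), nrow = 2, byrow = TRUE)
unname(chisq.test(tab)$statistic)  # unsigned chi2; textstat_keyness() signs it
                                   # positively when observed > expected in the target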
References
Bondi, M. & Scott, M. (eds) (2010). Keyness in Texts. Amsterdam, Philadelphia: John Benjamins.
Stubbs, M. (2010). Three Concepts of Keywords. In Keyness in Texts, Bondi, M. & Scott, M. (eds): 1–42. Amsterdam, Philadelphia: John Benjamins.
Scott, M. & Tribble, C. (2006). Textual Patterns: Keyword and Corpus Analysis in Language Education. Amsterdam: Benjamins: 55.
Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1): 61–74.
Examples
library("quanteda")
# compare pre- v. post-war terms using grouping
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
dfmat1 <- tokens(data_corpus_inaugural) %>%
dfm() %>%
dfm_group(groups = period)
head(dfmat1) # make sure 'post-war' is in the first row
head(tstat1 <- textstat_keyness(dfmat1), 10)
tail(tstat1, 10)
# compare pre- v. post-war terms using logical vector
dfmat2 <- dfm(tokens(data_corpus_inaugural))
head(textstat_keyness(dfmat2, docvars(data_corpus_inaugural, "Year") >= 1945), 10)
# compare Trump 2017 to other post-war presidents
dfmat3 <- dfm(tokens(corpus_subset(data_corpus_inaugural, period == "post-war")))
head(textstat_keyness(dfmat3, target = "2017-Trump"), 10)
# using the likelihood ratio method
head(textstat_keyness(dfm_smooth(dfmat3), measure = "lr", target = "2017-Trump"), 10)
Calculate lexical diversity
Description
Calculate the lexical diversity of text(s).
Usage
textstat_lexdiv(
x,
measure = c("TTR", "C", "R", "CTTR", "U", "S", "K", "I", "D", "Vm", "Maas", "MATTR",
"MSTTR", "all"),
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_hyphens = FALSE,
log.base = 10,
MATTR_window = 100L,
MSTTR_segment = 100L,
...
)
Arguments
x |
a dfm or tokens input object for whose documents lexical diversity will be computed |
measure |
a character vector defining the measure to compute |
remove_numbers |
logical; if TRUE remove features or tokens consisting only of numerals |
remove_punct |
logical; if TRUE remove features or tokens consisting only of punctuation characters |
remove_symbols |
logical; if TRUE remove features or tokens consisting only of symbol characters |
remove_hyphens |
logical; if TRUE split hyphenated features or tokens into their constituent parts |
log.base |
a numeric value defining the base of the logarithm (for measures using logarithms) |
MATTR_window |
a numeric value defining the size of the moving window for computation of the Moving-Average Type-Token Ratio (Covington & McFall, 2010) |
MSTTR_segment |
a numeric value defining the size of each segment for the computation of the Mean Segmental Type-Token Ratio (Johnson, 1944) |
... |
not used directly |
Details
textstat_lexdiv
calculates the lexical diversity of documents
using a variety of indices.
In the following formulas, N
refers to the total number of
tokens, V
to the number of types, and f_v(i, N)
to the numbers
of types occurring i
times in a sample of length N
.
"TTR"
:The ordinary Type-Token Ratio:
TTR = \frac{V}{N}
"C"
:Herdan's C (Herdan, 1960, as cited in Tweedie & Baayen, 1998; sometimes referred to as LogTTR):
C = \frac{\log{V}}{\log{N}}
"R"
:Guiraud's Root TTR (Guiraud, 1954, as cited in Tweedie & Baayen, 1998):
R = \frac{V}{\sqrt{N}}
"CTTR"
:Carroll's Corrected TTR:
CTTR = \frac{V}{\sqrt{2N}}
"U"
:Dugast's Uber Index (Dugast, 1978, as cited in Tweedie & Baayen, 1998):
U = \frac{(\log{N})^2}{\log{N} - \log{V}}
"S"
:Summer's index:
S = \frac{\log{\log{V}}}{\log{\log{N}}}
"K"
:Yule's K (Yule, 1944, as presented in Tweedie & Baayen, 1998, Eq. 16) is calculated by:
K = 10^4 \times \left[ -\frac{1}{N} + \sum_{i=1}^{V} f_v(i, N) \left( \frac{i}{N} \right)^2 \right]
"I"
:Yule's I (Yule, 1944) is calculated by:
I = \frac{V^2}{M_2 - V}
M_2 = \sum_{i=1}^{V} i^2 * f_v(i, N)
"D"
:Simpson's D (Simpson 1949, as presented in Tweedie & Baayen, 1998, Eq. 17) is calculated by:
D = \sum_{i=1}^{V} f_v(i, N) \frac{i}{N} \frac{i-1}{N-1}
"Vm"
:Herdan's V_m (Herdan 1955, as presented in Tweedie & Baayen, 1998, Eq. 18) is calculated by:
V_m = \sqrt{ \sum_{i=1}^{V} f_v(i, N) (i/N)^2 - \frac{1}{V} }
"Maas"
:Maas' indices (a, \log{V_0}, and \log_{e}{V_0}):
a^2 = \frac{\log{N} - \log{V}}{\log{N}^2}
\log{V_0} = \frac{\log{V}}{\sqrt{1 - \frac{\log{V}}{\log{N}}^2}}
The measure was derived from a formula by Mueller (1969, as cited in Maas, 1972). \log_{e}{V_0} is equivalent to \log{V_0}, only with e as the base for the logarithms. Also calculated are a, \log{V_0} (both not the same as before) and V' as measures of relative vocabulary growth while the text progresses. To calculate these measures, the first half of the text and the full text will be examined (see Maas, 1972, p. 67 ff. for details). Note: for the current method (for a dfm) there is no computation on separate halves of the text.
"MATTR"
:The Moving-Average Type-Token Ratio (Covington & McFall, 2010) calculates TTRs for a moving window of tokens from the first to the last token, computing a TTR for each window. The MATTR is the mean of the TTRs of each window.
"MSTTR"
:Mean Segmental Type-Token Ratio (sometimes referred to as Split TTR) splits the tokens into segments of the given size, TTR for each segment is calculated and the mean of these values returned. When this value is < 1.0, it splits the tokens into equal, non-overlapping sections of that size. When this value is > 1, it defines the segments as windows of that size. Tokens at the end which do not make a full segment are ignored.
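The simpler indices above can be checked by hand from the token and type counts (a sketch using quanteda's ntoken() and ntype()):
library("quanteda")
toks <- tokens("one two two three three three")
N <- as.numeric(ntoken(toks))
V <- as.numeric(ntype(toks))
c(TTR = V / N,
  C   = log(V) / log(N),
  R   = V / sqrt(N),
  CTTR = V / sqrt(2 * N))
# compare with textstat_lexdiv(toks, c("TTR", "C", "R", "CTTR"))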
Value
A data.frame of documents and their lexical diversity scores.
Author(s)
Kenneth Benoit and Jiong Wei Lua. Many of the formulas have been reimplemented from functions written by Meik Michalke in the koRpus package.
References
Covington, M.A. & McFall, J.D. (2010). Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR) Journal of Quantitative Linguistics, 17(2), 94–100. doi:10.1080/09296171003643098
Herdan, G. (1955). A New Derivation and Interpretation of Yule's 'Characteristic' K. Zeitschrift für angewandte Mathematik und Physik, 6(4): 332–334.
Maas, H.D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73–96.
McCarthy, P.M. & Jarvis, S. (2007). vocd: A Theoretical and Empirical Evaluation. Language Testing, 24(4), 459–488. doi:10.1177/0265532207080767
McCarthy, P.M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A Validation Study of Sophisticated Approaches to Lexical Diversity Assessment. Behaviour Research Methods, 42(2), 381–392.
Michalke, M. (2014). koRpus: An R Package for Text Analysis (Version 0.05-4). Available from https://reaktanz.de/?c=hacking&s=koRpus.
Simpson, E.H. (1949). Measurement of Diversity. Nature, 163: 688. doi:10.1038/163688a0
Tweedie. F.J. and Baayen, R.H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 32(5), 323–352. doi:10.1023/A:1001749303137
Yule, G. U. (1944) The Statistical Study of Literary Vocabulary. Cambridge: Cambridge University Press.
Examples
library("quanteda")
txt <- c("Anyway, like I was sayin', shrimp is the fruit of the sea. You can
barbecue it, boil it, broil it, bake it, saute it.",
"There's shrimp-kabobs,
shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's
pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup,
shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp
sandwich.")
tokens(txt) %>%
textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
dfm(tokens(txt)) %>%
textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
toks <- tokens(corpus_subset(data_corpus_inaugural, Year > 2000))
textstat_lexdiv(toks, c("CTTR", "TTR", "MATTR"), MATTR_window = 100)
[Experimental] Compute document/feature proximity
Description
This is an underlying function for textstat_dist
and
textstat_simil
but returns TsparseMatrix
.
Usage
textstat_proxy(
x,
y = NULL,
margin = c("documents", "features"),
method = c("cosine", "correlation", "jaccard", "ejaccard", "dice", "edice", "hamann",
"simple matching", "euclidean", "chisquared", "hamming", "kullback", "manhattan",
"maximum", "canberra", "minkowski"),
p = 2,
min_proxy = NULL,
rank = NULL,
use_na = FALSE
)
Arguments
y |
if a dfm object is provided, proximity between documents or features in x and those in y is computed |
margin |
identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features |
method |
character; the method identifying the similarity or distance measure to be used; see Details. |
p |
The power of the Minkowski distance. |
min_proxy |
the minimum proximity value to be recoded. |
rank |
an integer value specifying top-n most proximity values to be recorded. |
use_na |
if TRUE, return NA for proximity to empty vectors |
See Also
textstat_dist()
, textstat_simil()
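A brief illustration (a sketch; the function is experimental and returns a TsparseMatrix from the Matrix package rather than a textstat_simil/dist object):
library("quanteda")
dfmat <- dfm(tokens(data_corpus_inaugural[1:5]))
textstat_proxy(dfmat, method = "cosine", min_proxy = 0.5)
textstat_proxy(dfmat, method = "euclidean", rank = 2)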
textstat_simil/dist classes
Description
Sparse classes for similarity and distance matrices created by textstat_simil() and textstat_dist().
Print/show method for objects created by textstat_simil and textstat_dist.
Usage
validate_min_simil(object)
## S4 method for signature 'textstat_proxy'
show(object)
Arguments
object |
the textstat_proxy object to be printed |
Slots
.Data
a sparse Matrix object, symmetric if selection is
NULL
method
the method used for computing similarity or distance
min_simil
numeric; a threshold for the similarity values below which similarity values are not computed
margin
identifies the margin of the dfm on which similarity or difference was computed: "documents" for documents or "features" for word/term features.
type
either "textstat_simil" or "textstat_dist"
selection
target units, if any
Calculate readability
Description
Calculate the readability of text(s) using one of a variety of computed indexes.
Usage
textstat_readability(
x,
measure = "Flesch",
remove_hyphens = TRUE,
min_sentence_length = 1,
max_sentence_length = 10000,
intermediate = FALSE,
...
)
Arguments
x |
a character or corpus object containing the texts |
measure |
character vector defining the readability measure to calculate. Matches are case-insensitive. See other valid measures under Details. |
remove_hyphens |
if TRUE, treat constituent words in hyphenated words as separate terms, for purposes of computing syllables |
min_sentence_length , max_sentence_length |
set the minimum and maximum sentence lengths (in tokens, excluding punctuation) to include in the computation of readability. This makes it easy to exclude "sentences" that may not really be sentences, such as section titles, table elements, and other cruft that might be in the texts following conversion. For finer-grained control, consider filtering sentences first, including through pattern-matching, using corpus_trim(). |
intermediate |
if TRUE, include intermediate quantities in the output |
... |
not used |
Details
The following readability formulas have been implemented, where
Nw = n_{w} = number of words
Nc = n_{c} = number of characters
Nst = n_{st} = number of sentences
Nsy = n_{sy} = number of syllables
Nwf = n_{wf} = number of words matching the Dale-Chall List of 3000 "familiar words"
ASL = Average Sentence Length: number of words / number of sentences
AWL = Average Word Length: number of characters / number of words
AFW = Average Familiar Words: count of words matching the Dale-Chall list of 3000 "familiar words" / number of all words
Nwd = n_{wd} = number of "difficult" words not matching the Dale-Chall list of "familiar" words
"ARI"
:Automated Readability Index (Senter and Smith 1967)
0.5 ASL + 4.71 AWL - 21.34
"ARI.Simple"
:A simplified version of Senter and Smith's (1967) Automated Readability Index.
ASL + 9 AWL
"Bormuth.MC"
:Bormuth's (1969) Mean Cloze Formula.
0.886593 - 0.03640 \times AWL + 0.161911 \times AFW - 0.21401 \times ASL - 0.000577 \times ASL^2 - 0.000005 \times ASL^3
"Bormuth.GP"
:Bormuth's (1969) Grade Placement score.
4.275 + 12.881M - 34.934M^2 + 20.388 M^3 + 26.194 CCS - 2.046 CCS^2 - 11.767 CCS^3 - 42.285(M \times CCS) + 97.620(M \times CCS)^2 - 59.538(M \times CCS)^3
where
M
is the Bormuth Mean Cloze Formula as in"Bormuth"
above, andCCS
is the Cloze Criterion Score (Bormuth, 1968)."Coleman"
:Coleman's (1971) Readability Formula 1.
1.29 \times \frac{100 \times n_{wsy=1}}{n_{w}} - 38.45
where
n_{wsy=1}
= Nwsy1 = the number of one-syllable words. The scaling by 100 in this and the other Coleman-derived measures arises because the Coleman measures are calculated on a per 100 words basis."Coleman.C2"
:Coleman's (1971) Readability Formula 2.
1.16 \times \frac{100 \times n_{wsy=1}}{n_{w}} + 1.48 \times \frac{100 \times n_{st}}{n_{w}} - 37.95
"Coleman.Liau.ECP"
:Coleman-Liau Estimated Cloze Percent (ECP) (Coleman and Liau 1975).
141.8401 - 0.214590 \times 100 \times AWL + 1.079812 \times \frac{n_{st} \times 100}{n_{w}}
"Coleman.Liau.grade"
:Coleman-Liau Grade Level (Coleman and Liau 1975).
-27.4004 \times \mathtt{Coleman.Liau.ECP} / 100 + 23.06395
"Coleman.Liau.short"
:Coleman-Liau Index (Coleman and Liau 1975).
5.88 \times AWL + 29.6 \times \frac{n_{st}}{n_{w}} - 15.8
"Dale.Chall"
:The New Dale-Chall Readability formula (Chall and Dale 1995).
64 - (0.95 \times 100 \times \frac{n_{wd}}{n_{w}}) - (0.69 \times ASL)
"Dale.Chall.Old"
:The original Dale-Chall Readability formula (Dale and Chall 1948).
0.1579 \times 100 \times \frac{n_{wd}}{n_{w}} + 0.0496 \times ASL [+ 3.6365]
The additional constant 3.6365 is only added if (Nwd / Nw) > 0.05.
"Dale.Chall.PSK"
:The Powers-Sumner-Kearl Variation of the Dale and Chall Readability formula (Powers, Sumner and Kearl, 1958).
(0.1155 \times 100 \times \frac{n_{wd}}{n_{w}}) + (0.0596 \times ASL) + 3.2672
"Danielson.Bryan"
:Danielson-Bryan's (1963) Readability Measure 1.
(1.0364 \times \frac{n_{c}}{n_{blank}}) + (0.0194 \times \frac{n_{c}}{n_{st}}) - 0.6059
where
n_{blank}
= Nblank = the number of blanks."Danielson.Bryan2"
:Danielson-Bryan's (1963) Readability Measure 2.
131.059- (10.364 \times \frac{n_{c}}{n_{blank}}) + (0.0194 \times \frac{n_{c}}{n_{st}})
where
n_{blank}
= Nblank = the number of blanks."Dickes.Steiwer"
:Dickes-Steiwer Index (Dicks and Steiwer 1977).
235.95993 - (7.3021 \times AWL) - (12.56438 \times ASL) - (50.03293 \times TTR)
where TTR is the Type-Token Ratio (see
textstat_lexdiv()
)"DRP"
:Degrees of Reading Power.
(1 - Bormuth.MC) * 100
where Bormuth.MC refers to Bormuth's (1969) Mean Cloze Formula (documented above)
"ELF"
:Easy Listening Formula (Fang 1966):
\frac{n_{wsy>=2}}{n_{st}}
where
n_{wsy>=2}
= Nwmin2sy = the number of words with 2 syllables or more."Farr.Jenkins.Paterson"
:Farr-Jenkins-Paterson's Simplification of Flesch's Reading Ease Score (Farr, Jenkins and Paterson 1951).
-31.517 - (1.015 \times ASL) + (1.599 \times \frac{n_{wsy=1}}{n_{w}})
where
n_{wsy=1}
= Nwsy1 = the number of one-syllable words."Flesch"
:Flesch's Reading Ease Score (Flesch 1948).
206.835 - (1.015 \times ASL) - (84.6 \times \frac{n_{sy}}{n_{w}})
"Flesch.PSK"
:The Powers-Sumner-Kearl's Variation of Flesch Reading Ease Score (Powers, Sumner and Kearl, 1958).
(0.0778 \times ASL) + (4.55 \times \frac{n_{sy}}{n_{w}}) - 2.2029
"Flesch.Kincaid"
:Flesch-Kincaid Readability Score (Flesch and Kincaid 1975).
0.39 \times ASL + 11.8 \times \frac{n_{sy}}{n_{w}} - 15.59
"FOG"
:Gunning's Fog Index (Gunning 1952).
0.4 \times (ASL + 100 \times \frac{n_{wsy>=3}}{n_{w}})
where
n_{wsy>=3}
= Nwmin3sy = the number of words with 3-syllables or more. The scaling by 100 arises because the original FOG index is based on just a sample of 100 words)"FOG.PSK"
:The Powers-Sumner-Kearl Variation of Gunning's Fog Index (Powers, Sumner and Kearl, 1958).
3.0680 \times (0.0877 \times ASL) +(0.0984 \times 100 \times \frac{n_{wsy>=3}}{n_{w}})
where
n_{wsy>=3}
= Nwmin3sy = the number of words with 3-syllables or more. The scaling by 100 arises because the original FOG index is based on just a sample of 100 words)"FOG.NRI"
:The Navy's Adaptation of Gunning's Fog Index (Kincaid, Fishburne, Rogers and Chissom 1975).
(\frac{(n_{wsy<3} + 3 \times n_{wsy=3})}{(100 \times \frac{N_{st}}{N_{w}})} - 3) / 2
where
n_{wsy<3}
= Nwless3sy = the number of words with less than 3 syllables, andn_{wsy=3}
= Nw3sy = the number of 3-syllable words. The scaling by 100 arises because the original FOG index is based on just a sample of 100 words)"FORCAST"
:FORCAST (Simplified Version of FORCAST.RGL) (Caylor and Sticht 1973).
20 - \frac{n_{wsy=1} \times 150}{n_{w} \times 10}
where
n_{wsy=1}
= Nwsy1 = the number of one-syllable words. The scaling by 150 arises because the original FORCAST index is based on just a sample of 150 words."FORCAST.RGL"
:FORCAST.RGL (Caylor and Sticht 1973).
20.43 - 0.11 \times \frac{n_{wsy=1} \times 150}{n_{w} \times 10}
where
n_{wsy=1}
= Nwsy1 = the number of one-syllable words. The scaling by 150 arises because the original FORCAST index is based on just a sample of 150 words."Fucks"
:Fucks' (1955) Stilcharakteristik (Style Characteristic).
AWL * ASL
"Linsear.Write"
:Linsear Write (Klare 1975).
\frac{[(100 - (\frac{100 \times n_{wsy<3}}{n_{w}})) + (3 \times \frac{100 \times n_{wsy>=3}}{n_{w}})]}{(100 \times \frac{n_{st}}{n_{w}})}
where
n_{wsy<3}
= Nwless3sy = the number of words with less than 3 syllables, andn_{wsy>=3}
= Nwmin3sy = the number of words with 3-syllables or more. The scaling by 100 arises because the original Linsear.Write measure is based on just a sample of 100 words)"LIW"
:Björnsson's (1968) Läsbarhetsindex (For Swedish Texts).
ASL + \frac{100 \times n_{wsy>=7}}{n_{w}}
where
n_{wsy>=7}
= Nwmin7sy = the number of words with 7-syllables or more. The scaling by 100 arises because the Läsbarhetsindex index is based on just a sample of 100 words)"nWS"
:Neue Wiener Sachtextformeln 1 (Bamberger and Vanecek 1984).
19.35 \times \frac{n_{wsy>=3}}{n_{w}} + 0.1672 \times ASL + 12.97 \times \frac{n_{wchar>=6}}{n_{w}} - 3.27 \times \frac{n_{wsy=1}}{n_{w}} - 0.875
where
n_{wsy>=3}
= Nwmin3sy = the number of words with 3 syllables or more,n_{wchar>=6}
= Nwmin6char = the number of words with 6 characters or more, andn_{wsy=1}
= Nwsy1 = the number of one-syllable words."nWS.2"
:Neue Wiener Sachtextformeln 2 (Bamberger and Vanecek 1984).
20.07 \times \frac{n_{wsy>=3}}{n_{w}} + 0.1682 \times ASL + 13.73 \times \frac{n_{wchar>=6}}{n_{w}} - 2.779
where
n_{wsy>=3}
= Nwmin3sy = the number of words with 3 syllables or more, andn_{wchar>=6}
= Nwmin6char = the number of words with 6 characters or more."nWS.3"
:Neue Wiener Sachtextformeln 3 (Bamberger and Vanecek 1984).
29.63 \times \frac{n_{wsy>=3}}{n_{w}} + 0.1905 \times ASL - 1.1144
where
n_{wsy>=3}
= Nwmin3sy = the number of words with 3 syllables or more."nWS.4"
:Neue Wiener Sachtextformeln 4 (Bamberger and Vanecek 1984).
27.44 \times \frac{n_{wsy>=3}}{n_{w}} + 0.2656 \times ASL - 1.693
where
n_{wsy>=3}
= Nwmin3sy = the number of words with 3 syllables or more."RIX"
:Anderson's (1983) Readability Index.
\frac{n_{wsy>=7}}{n_{st}}
where
n_{wsy>=7}
= Nwmin7sy = the number of words with 7-syllables or more."Scrabble"
:Scrabble Measure.
Mean Scrabble Letter Values of All Words
. Scrabble values are for English. There is no reference for this, as we created it experimentally. It's not part of any accepted readability index!
"SMOG"
:Simple Measure of Gobbledygook (SMOG) (McLaughlin 1969).
1.043 \times \sqrt{n_{wsy>=3} \times \frac{30}{n_{st}}} + 3.1291
where
n_{wsy>=3}
= Nwmin3sy = the number of words with 3 syllables or more. This measure is regression equation D in McLaughlin's original paper."SMOG.C"
:SMOG (Regression Equation C) (McLaughlin 1969)
0.9986 \times \sqrt{Nwmin3sy \times \frac{30}{n_{st}} + 5} + 2.8795
where
n_{wsy>=3}
= Nwmin3sy = the number of words with 3 syllables or more. This measure is regression equation C in McLaughlin's original paper."SMOG.simple"
:Simplified Version of McLaughlin's (1969) SMOG Measure.
\sqrt{Nwmin3sy \times \frac{30}{n_{st}}} + 3
"SMOG.de"
:Adaptation of McLaughlin's (1969) SMOG Measure for German Texts.
\sqrt{Nwmin3sy \times \frac{30}{n_{st}}-2}
"Spache"
:Spache's (1952) Readability Measure.
0.121 \times ASL + 0.082 \times \frac{n_{wnotinspache}}{n_{w}} + 0.659
where
n_{wnotinspache}
= Nwnotinspache = number of unique words not in the Spache word list."Spache.old"
:Spache's (1952) Readability Measure (Old).
0.141 \times ASL + 0.086 \times \frac{n_{wnotinspache}}{n_{w}} + 0.839
where
n_{wnotinspache}
= Nwnotinspache = number of unique words not in the Spache word list."Strain"
:Strain Index (Solomon 2006).
n_{sy} / \frac{n_{st}}{3} /10
The scaling by 3 arises because the original Strain index is based on just the first 3 sentences.
"Traenkle.Bailer"
:Tränkle & Bailer's (1984) Readability Measure 1.
224.6814 - (79.8304 \times AWL) - (12.24032 \times ASL) - (1.292857 \times 100 \times \frac{n_{prep}}{n_{w}})
where
n_{prep}
= Nprep = the number of prepositions. The scaling by 100 arises because the original Tränkle & Bailer index is based on just a sample of 100 words."Traenkle.Bailer2"
:Tränkle & Bailer's (1984) Readability Measure 2.
Tränkle.Bailer2 = 234.1063 - (96.11069 \times AWL) - (2.05444 \times 100 \times \frac{n_{prep}}{n_{w}}) - (1.02805 \times 100 \times \frac{n_{conj}}{n_{w}})
where
n_{prep}
= Nprep = the number of prepositions,n_{conj}
= Nconj = the number of conjunctions, The scaling by 100 arises because the original Tränkle & Bailer index is based on just a sample of 100 words)"Wheeler.Smith"
:Wheeler & Smith's (1954) Readability Measure.
ASL \times 10 \times \frac{n_{wsy>=2}}{n_{words}}
where
n_{wsy>=2}
= Nwmin2sy = the number of words with 2 syllables or more."meanSentenceLength"
:Average Sentence Length (ASL).
\frac{n_{w}}{n_{st}}
"meanWordSyllables"
:Average Word Syllables (AWL).
\frac{n_{sy}}{n_{w}}
Value
textstat_readability
returns a data.frame of documents and
their readability scores.
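For instance, the Flesch score can be recomputed by hand from the word, sentence, and syllable counts (a sketch; it assumes the same tokenization and syllable counts as the package defaults, so small discrepancies are possible):
library("quanteda")
txt <- "The cat sat on the mat. It was happy."
nw  <- as.numeric(ntoken(tokens(txt, remove_punct = TRUE)))
nst <- as.numeric(nsentence(txt))
nsy <- sum(nsyllable::nsyllable(as.character(tokens(txt, remove_punct = TRUE))),
           na.rm = TRUE)
206.835 - 1.015 * (nw / nst) - 84.6 * (nsy / nw)
# compare with textstat_readability(txt, measure = "Flesch")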
Author(s)
Kenneth Benoit, re-engineered from Meik Michalke's koRpus package.
References
Anderson, J. (1983). Lix and rix: Variations on a little-known readability
index. Journal of Reading, 26(6),
490–496. https://www.jstor.org/stable/40031755
Bamberger, R. & Vanecek, E. (1984). Lesen-Verstehen-Lernen-Schreiben. Wien: Jugend und Volk.
Björnsson, C. H. (1968). Läsbarhet. Stockholm: Liber.
Bormuth, J.R. (1969). Development of Readability Analysis.
Bormuth, J.R. (1968). Cloze test readability: Criterion reference
scores. Journal of educational
measurement, 5(3), 189–196. https://www.jstor.org/stable/1433978
Caylor, J.S. (1973). Methodologies for Determining Reading Requirements of
Military Occupational Specialities. https://eric.ed.gov/?id=ED074343
Caylor, J.S. & Sticht, T.G. (1973). Development of a Simple Readability
Index for Job Reading Material
https://archive.org/details/ERIC_ED076707
Coleman, E.B. (1971). Developing a technology of written instruction: Some determiners of the complexity of prose. Verbal learning research and the technology of written instruction, 155–204.
Coleman, M. & Liau, T.L. (1975). A Computer Readability Formula Designed for Machine Scoring. Journal of Applied Psychology, 60(2), 283. doi:10.1037/h0076540
Dale, E. and Chall, J.S. (1948). A Formula for Predicting Readability:
Instructions. Educational Research
Bulletin, 37-54. https://www.jstor.org/stable/1473169
Chall, J.S. and Dale, E. (1995). Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.
Dickes, P. & Steiwer, L. (1977). Ausarbeitung von Lesbarkeitsformeln für die Deutsche Sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie 9(1), 20–28.
Danielson, W.A., & Bryan, S.D. (1963). Computer Automation of Two Readability Formulas. Journalism Quarterly, 40(2), 201–206. doi:10.1177/107769906304000207
DuBay, W.H. (2004). The Principles of Readability.
Fang, I. E. (1966). The "Easy listening formula". Journal of Broadcasting & Electronic Media, 11(1), 63–68. doi:10.1080/08838156609363529
Farr, J. N., Jenkins, J.J., & Paterson, D.G. (1951). Simplification of Flesch Reading Ease Formula. Journal of Applied Psychology, 35(5): 333. doi:10.1037/h0057532
Flesch, R. (1948). A New Readability Yardstick. Journal of Applied Psychology, 32(3), 221. doi:10.1037/h0057532
Fucks, W. (1955). Der Unterschied des Prosastils von Dichtern und anderen Schriftstellern. Sprachforum, 1, 233-244.
Gunning, R. (1952). The Technique of Clear Writing. New York: McGraw-Hill.
Klare, G.R. (1975). Assessing Readability. Reading Research Quarterly, 10(1), 62-102. doi:10.2307/747086
Kincaid, J. P., Fishburne Jr, R.P., Rogers, R.L., & Chissom, B.S. (1975). Derivation of New Readability Formulas (Automated Readability Index, FOG count and Flesch Reading Ease Formula) for Navy Enlisted Personnel.
McLaughlin, G.H. (1969). SMOG Grading: A New Readability Formula. Journal of Reading, 12(8), 639-646.
Michalke, M. (2014). koRpus: An R Package for Text Analysis (Version 0.05-4). Available from https://reaktanz.de/?c=hacking&s=koRpus.
Powers, R.D., Sumner, W.A., and Kearl, B.E. (1958). A Recalculation of Four Adult Readability Formulas. Journal of Educational Psychology, 49(2), 99. doi:10.1037/h0043254
Senter, R. J., & Smith, E. A. (1967). Automated readability index. Wright-Patterson Air Force Base. Report No. AMRL-TR-6620.
*Solomon, N. W. (2006). Qualitative Analysis of Media Language. India.
Spache, G. (1953). "A new readability formula for primary-grade reading
materials." The Elementary School Journal, 53, 410–413.
https://www.jstor.org/stable/998915
Tränkle, U. & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von Lesbarkeitsformeln für die deutsche Sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 16(3), 231–244.
Wheeler, L.R. & Smith, E.H. (1954). A Practical Readability Formula for the
Classroom Teacher in the Primary Grades. Elementary English, 31,
397–399. https://www.jstor.org/stable/41384251
*Nimaldasan is the pen name of N. Watson Solomon, Assistant Professor of Journalism, School of Media Studies, SRM University, India.
Examples
txt <- c(doc1 = "Readability zero one. Ten, Eleven.",
doc2 = "The cat in a dilapidated tophat.")
textstat_readability(txt, measure = "Flesch")
textstat_readability(txt, measure = c("FOG", "FOG.PSK", "FOG.NRI"))
textstat_readability(quanteda::data_corpus_inaugural[48:58],
measure = c("Flesch.Kincaid", "Dale.Chall.old"))
Select rows of textstat objects by glob, regex or fixed patterns
Description
Users can subset the output object of textstat_collocations, textstat_keyness, or textstat_frequency based on "glob", "regex", or "fixed" patterns using this method.
Usage
textstat_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE
)
Arguments
x |
a textstat_collocations, textstat_keyness, or textstat_frequency output object |
pattern |
a character vector, list of character vectors, dictionary, or collocations object whose values will be matched against the feature labels |
selection |
whether to keep or remove the rows matched by pattern |
valuetype |
the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching |
case_insensitive |
logical; if TRUE, ignore case when matching a pattern |
Examples
library("quanteda")
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
dfmat <- tokens(data_corpus_inaugural) %>%
dfm() %>%
dfm_group(groups = period)
tstat <- textstat_keyness(dfmat)
textstat_select(tstat, 'america*')
Similarity and distance computation between documents or features
Description
These functions compute matrixes of distances and similarities between documents or features from a dfm and return a matrix of similarities or distances in a sparse format. These methods are fast and robust because they operate directly on the sparse dfm objects. The output can easily be coerced to an ordinary matrix, a data.frame of pairwise comparisons, or a dist format.
Usage
textstat_simil(
x,
y = NULL,
selection = NULL,
margin = c("documents", "features"),
method = c("correlation", "cosine", "jaccard", "ejaccard", "dice", "edice", "hamann",
"simple matching"),
min_simil = NULL,
...
)
textstat_dist(
x,
y = NULL,
selection = NULL,
margin = c("documents", "features"),
method = c("euclidean", "manhattan", "maximum", "canberra", "minkowski"),
p = 2,
...
)
Arguments
x , y |
dfm objects; y is an optional second dfm against which the documents or features of x are compared |
selection |
(deprecated - use y instead) |
margin |
identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features |
method |
character; the method identifying the similarity or distance measure to be used; see Details. |
min_simil |
numeric; a threshold for the similarity values below which similarity values will not be returned |
... |
unused |
p |
The power of the Minkowski distance. |
Details
textstat_simil
options are: "correlation"
(default),
"cosine"
, "jaccard"
, "ejaccard"
, "dice"
,
"edice"
, "simple matching"
, and "hamann"
.
textstat_dist
options are: "euclidean"
(default),
"manhattan"
, "maximum"
, "canberra"
,
and "minkowski"
.
Value
A sparse matrix from the Matrix package that will be symmetric
unless y
is specified.
Conversion to other data types
The output objects from textstat_simil()
and textstat_dist()
can be
transformed easily into a list format using
as.list()
, which returns a list for each unique
element of the second of the pairs, a data.frame using
as.data.frame()
, which returns pairwise
scores, as.dist()
for a dist object,
or as.matrix()
to convert it into an ordinary matrix.
Note
If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
See Also
as.list.textstat_proxy()
, as.data.frame.textstat_proxy()
,
stats::as.dist()
Examples
# similarities for documents
library("quanteda")
dfmat <- corpus_subset(data_corpus_inaugural, Year > 2000) %>%
tokens(remove_punct = TRUE) %>%
tokens_remove(stopwords("english")) %>%
dfm()
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
as.list(tstat1)
as.list(tstat1, diag = TRUE)
# min_simil
(tstat2 <- textstat_simil(dfmat, method = "cosine", margin = "documents", min_simil = 0.6))
as.matrix(tstat2)
# similarities for specific documents
textstat_simil(dfmat, dfmat["2017-Trump", ], margin = "documents")
textstat_simil(dfmat, dfmat["2017-Trump", ], method = "cosine", margin = "documents")
textstat_simil(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents")
# compute some term similarities
tstat3 <- textstat_simil(dfmat, dfmat[, c("fair", "health", "terror")], method = "cosine",
margin = "features")
head(as.matrix(tstat3), 10)
as.list(tstat3, n = 6)
# distances for documents
(tstat4 <- textstat_dist(dfmat, margin = "documents"))
as.matrix(tstat4)
as.list(tstat4)
as.dist(tstat4)
# distances for specific documents
textstat_dist(dfmat, dfmat["2017-Trump", ], margin = "documents")
(tstat5 <- textstat_dist(dfmat, dfmat[c("2009-Obama" , "2013-Obama"), ], margin = "documents"))
as.matrix(tstat5)
as.list(tstat5)
## Not run:
# plot a dendrogram after converting the object into distances
plot(hclust(as.dist(tstat4)))
## End(Not run)
Summarize documents as syntactic and lexical feature counts
Description
Count syntactic and lexical features of documents such as tokens, types, sentences, and character categories.
Usage
textstat_summary(x, ...)
Arguments
x |
a corpus, tokens, or dfm object to be summarized |
... |
additional arguments passed through to dfm() |
Details
Count the total number of characters, tokens and sentences as well as special tokens such as numbers, punctuation marks, symbols, tags and emojis.
chars = number of characters; equal to nchar()
sents = number of sentences; equal to ntoken(tokens(x), what = "sentence")
tokens = number of tokens; equal to ntoken()
types = number of unique tokens; equal to ntype()
puncts = number of punctuation marks (^\p{P}+$)
numbers = number of numeric tokens (^\p{Sc}{0,1}\p{N}+([.,]*\p{N})*\p{Sc}{0,1}$)
symbols = number of symbols (^\p{S}$)
tags = number of tags; sum of pattern_username and pattern_hashtag in quanteda::quanteda_options()
emojis = number of emojis (^\p{Emoji_Presentation}+$)
Examples
if (Sys.info()["sysname"] != "SunOS") {
library("quanteda")
corp <- data_corpus_inaugural[1:5]
textstat_summary(corp)
toks <- tokens(corp)
textstat_summary(toks)
dfmat <- dfm(toks)
textstat_summary(dfmat)
}