Title: | Estimating Speakers of Texts |
---|---|
Description: | Estimates the authors or speakers of texts. Methods developed in Huang, Perry, and Spirling (2020) <doi:10.1017/pan.2019.49>. The model is built on a Bayesian framework in which the distinctiveness of each speaker is defined by how different, on average, the speaker's terms are to everyone else in the corpus of texts. An optional cross-validation method is implemented to select the subset of terms that generate the most accurate speaker predictions. Once a set of terms is selected, the model can be estimated. Speaker distinctiveness and term influence can be recovered from parameters in the model using package functions. Once fitted, the model can be used to predict authorship of new texts. |
Authors: | Christian Baehr [aut, cre, cph], Arthur Spirling [aut, cph], Leslie Huang [aut] |
Maintainer: | Christian Baehr <[email protected]> |
License: | GPL-3 |
Version: | 0.1 |
Built: | 2025-02-18 05:12:46 UTC |
Source: | https://github.com/cran/stylest2 |
A dataset of text from English novels by Jane Austen, George Eliot, and Elizabeth Gaskell.
data(novels)
A dataframe with 21 rows and 3 variables.
Novel excerpts obtained from Project Gutenberg full texts in the public domain in the USA. http://gutenberg.org
A dataset of text from English novels by Jane Austen, George Eliot, and Elizabeth Gaskell. It has been tokenized and processed as a document-feature matrix in quanteda.
data(novels_dfm)
A quanteda dfm with a document variable titled "author".
Novel excerpts obtained from Project Gutenberg full texts in the public domain in the USA. http://gutenberg.org
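A dfm like this one can be rebuilt from the raw excerpts. The sketch below assumes that novels has columns named text and author (the column names are not listed above, so the code checks for them) and applies only punctuation removal. The preprocessing actually used to build novels_dfm is not documented here and may differ.

```r
library(stylest2)
library(quanteda)

data(novels)

# Assumption: the data frame has `text` and `author` columns.
stopifnot(all(c("text", "author") %in% names(novels)))

# Tokenize the excerpts; dropping punctuation is an illustrative choice,
# not necessarily the preprocessing behind novels_dfm.
toks <- tokens(as.character(novels$text), remove_punct = TRUE)

# Build the document-feature matrix and attach the author labels
# as the "author" document variable.
my_dfm <- dfm(toks)
docvars(my_dfm, "author") <- novels$author
```

The object name my_dfm is hypothetical; any dfm with an "author" document variable should be usable in the same way as novels_dfm.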
stylest2 provides a set of functions for fitting a model of speaker distinctiveness, including tools for selecting the optimal vocabulary for the model and predicting the most likely speaker (author) of a new text.
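The pieces described in this overview can be chained together roughly as follows. This is a minimal sketch using only the functions and the cutoff_pct_best element documented in this manual; the seed is set on the assumption that cross-validation folds are assigned at random, and the prediction is run in-sample purely for illustration.

```r
library(stylest2)

data(novels_dfm)

# Assumption: folds are assigned randomly, so fix the seed for repeatable results.
set.seed(1234)

# 1. Cross-validate candidate cutoffs on the term frequency distribution.
cv <- stylest2_select_vocab(dfm = novels_dfm)

# 2. Keep the terms above the best-performing cutoff.
vocab <- stylest2_terms(dfm = novels_dfm, cutoff = cv$cutoff_pct_best)

# 3. Fit the model on the selected vocabulary.
mod <- stylest2_fit(dfm = novels_dfm, terms = vocab)

# 4. Predict authorship (in-sample here; a real application would use new texts).
pred <- stylest2_predict(dfm = novels_dfm, model = mod)
```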
This function generates a model of speaker/author attribution, given a document-feature matrix.
stylest2_fit( dfm, smoothing = 0.5, terms = NULL, term_weights = NULL, fill_weight = NULL )
Argument | Description |
---|---|
dfm | A quanteda dfm object. |
smoothing | Smoothing parameter applied to the dfm. Must be a numeric scalar; defaults to 0.5. |
terms | If not |
term_weights | Named vector of distances (or any other weights), one per term in the vocabulary. Names must match the terms. |
fill_weight | Numeric value used as the weight for any term that does not have a weight specified in term_weights. |
An S3 object: a model containing each term that occurs in the texts, its frequency of use by each author, and the frequency of that term's occurrence across the texts.
data(novels_dfm)
stylest2_fit(dfm = novels_dfm)
This function generates predicted probabilities of authorship for a set of texts. It takes as an input a document-feature matrix of texts for which authorship is to be predicted, as well as a stylest2 model containing potential authors.
stylest2_predict( dfm, model, speaker_odds = FALSE, term_influence = FALSE, prior = NULL )
Argument | Description |
---|---|
dfm | A quanteda dfm object. |
model | A stylest2 model. |
speaker_odds | Should the model return the log odds of authorship for each text, in addition to posterior probabilities? |
term_influence | Should the model return the influence of each term in determining authorship over the prediction set, in addition to posterior probabilities? |
prior | Prior probability; defaults to NULL. |
A list object:
data(novels_dfm)
mod <- stylest2_fit(novels_dfm)
stylest2_predict(dfm = novels_dfm, model = mod)
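The optional outputs can be requested by switching on the two logical arguments. The sketch below only inspects the top level of the returned list rather than assuming particular element names, since the return value is not itemized here.

```r
library(stylest2)

data(novels_dfm)
mod <- stylest2_fit(dfm = novels_dfm)

# Request log odds and term influence alongside the posterior probabilities.
pred <- stylest2_predict(
  dfm = novels_dfm,
  model = mod,
  speaker_odds = TRUE,
  term_influence = TRUE
)

# Inspect which components were returned, without assuming their names.
names(pred)
str(pred, max.level = 1)
```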
K-fold cross validation to determine the optimal cutoff on the term frequency distribution under which to drop terms.
stylest2_select_vocab( dfm, smoothing = 0.5, cutoffs = c(50, 60, 70, 80, 90, 99), nfold = 5, terms = NULL, term_weights = NULL, fill = FALSE, fill_weight = NULL, suppress_warning = TRUE )
Argument | Description |
---|---|
dfm | A quanteda dfm object. |
smoothing | Smoothing parameter applied to the dfm. Must be a numeric scalar; defaults to 0.5. |
cutoffs | Numeric vector of candidate cutoffs. |
nfold | Number of folds for the cross-validation. |
terms | If not |
term_weights | Named vector of distances (or any other weights), one per term in the vocabulary. Names must match the terms. |
fill | Should missing values in term weights be filled? Defaults to FALSE. |
fill_weight | Numeric value used as the weight for any term that does not have a weight specified in term_weights. |
suppress_warning | TRUE/FALSE; whether to suppress warnings from |
List of: best cutoff percent with the best speaker classification rate; cutoff percentages that were tested; matrix of the mean percentage of incorrectly identified speakers for each cutoff percent and fold; and the number of folds for cross-validation.
data(novels_dfm)
stylest2_select_vocab(dfm = novels_dfm)
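The candidate cutoffs and the number of folds can be adjusted through the documented arguments. In the sketch below, the particular cutoff values and the seed are arbitrary illustrative choices; the seed is set on the assumption that fold assignment is random.

```r
library(stylest2)

data(novels_dfm)

# Assumption: fold assignment is random, so fix the seed for repeatable results.
set.seed(2024)

# Try a coarser grid of cutoffs than the default.
cv <- stylest2_select_vocab(
  dfm = novels_dfm,
  smoothing = 0.5,
  cutoffs = c(50, 75, 90),
  nfold = 5
)

# Best-performing cutoff percent, and a top-level view of the full result.
cv$cutoff_pct_best
str(cv, max.level = 1)
```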
A function to select terms for inclusion in a stylest2 model, based on a document-feature matrix of texts to predict and a specified cutoff.
stylest2_terms(dfm, cutoff)
Argument | Description |
---|---|
dfm | A quanteda dfm object. |
cutoff | A single numeric value: the quantile of term frequency under which to drop terms. |
A character vector of terms falling above the term frequency cutoff.
data(novels_dfm)
best_cut <- stylest2_select_vocab(dfm = novels_dfm)
stylest2_terms(dfm = novels_dfm, cutoff = best_cut$cutoff_pct_best)
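To get a feel for how the cutoff affects the vocabulary, the number of retained terms can be compared across a few values. This is a small sketch; the cutoff values are arbitrary and are given on the same percent scale as the defaults of stylest2_select_vocab.

```r
library(stylest2)

data(novels_dfm)

# Illustrative cutoffs, on the percent scale.
cutoffs <- c(50, 90, 99)

# Count the terms retained at each cutoff.
vocab_sizes <- vapply(
  cutoffs,
  function(q) length(stylest2_terms(dfm = novels_dfm, cutoff = q)),
  integer(1)
)
names(vocab_sizes) <- cutoffs

vocab_sizes
```

Because terms below the cutoff are dropped, a higher cutoff should leave a smaller (or equal) vocabulary.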