% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/lexical.R
\name{lexical_summary}
\alias{lexical_summary}
\title{lexical_summary}
\usage{
lexical_summary(dtm, corpus, variable = NULL, unit = c("document",
  "global"))
}
\arguments{
\item{dtm}{A \code{DocumentTermMatrix} containing the terms to summarize,
which may have been stemmed.}

\item{corpus}{A \code{Corpus} object containing the original texts from which
\code{dtm} was constructed.}

\item{variable}{An optional vector with one element per document indicating the
category to which that document belongs. If \code{NULL}, per-document measures are returned.}

\item{unit}{When \code{variable} is not \code{NULL}, defines how measures are aggregated
into per-category values (see Details).}
}
\value{
A \code{table} object with the following information for each document or
each category of documents in the corpus:
\itemize{
\item total number of terms
\item number and percent of unique terms (i.e. appearing at least once)
\item number and percent of hapax legomena (i.e. terms appearing once and only once)
\item total number of words
\item number and percent of long words (defined as at least seven characters)
\item number and percent of very long words (defined as at least ten characters)
\item average word length
}
}
\description{
Build a lexical summary table, optionally aggregated by the categories of a variable.
}
\details{
\emph{Words} are defined as the forms of two or more characters present in the texts
before stemming and stopword removal. In contrast, unique \emph{terms} are extracted
from \code{dtm}, which means that they do not include words that were removed from it, and that
words that differ in the original text may become identical terms if stemming was performed.
Please note that percentages for terms and for words are computed relative to
the total number of terms and of words, respectively, so the denominators are not the
same for all measures.

When \code{variable} is not \code{NULL}, \code{unit} defines two different ways of
aggregating per-document statistics into per-category measures:
\itemize{
\item "document": values computed for each document are simply averaged for each category.
\item "global": values are computed for each category taken as a whole: word counts are summed
for each category, and ratios and averages are calculated for this level only, from
the summed counts.
}

This distinction does not apply when \code{variable = NULL}: in this case, "level"
in the above explanation corresponds to "document", and two additional columns describe
the whole corpus:
\itemize{
\item "Corpus mean" is simply the average value of each measure over all documents.
\item "Corpus total" treats the corpus as a single document: it gives the summed numbers
of terms, the percentages of terms (ratios of the summed numbers of terms) and the
average word length.
}
}
\examples{

file <- system.file("texts", "reut21578-factiva.xml", package="tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language="en")
dtm <- build_dtm(corpus)
lexical_summary(dtm, corpus)
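
# A minimal sketch of the two aggregation modes described in Details.
# It uses made-up counts for two documents of one category and base R only;
# it illustrates the arithmetic, not the package's internal implementation.
words <- c(doc1 = 100, doc2 = 300)      # hypothetical total words per document
long_words <- c(doc1 = 10, doc2 = 90)   # hypothetical long words per document

# unit = "document": the ratio is computed per document, then averaged
ratio_document <- mean(long_words / words)

# unit = "global": counts are summed first, then a single ratio is computed
ratio_global <- sum(long_words) / sum(words)

# The two values differ because longer documents weigh more in "global"
c(document = ratio_document, global = ratio_global)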

}
