% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/extract_tables.R
\name{extract_tables}
\alias{extract_tables}
\title{extract_tables}
\usage{
extract_tables(file, pages = NULL, area = NULL, columns = NULL,
  guess = TRUE, method = c("decide", "lattice", "stream"),
  output = "matrix", password = NULL, encoding = NULL, copy = FALSE,
  ...)
}
\arguments{
\item{file}{A character string specifying the path or URL to a PDF file.}

\item{pages}{An optional integer vector specifying pages to extract from.}

\item{area}{An optional list, of length equal to the number of pages specified, where each entry contains a four-element numeric vector of coordinates (top, left, bottom, right) bounding the table for the corresponding page. As a convenience, a list of length 1 can be used to extract the same area from all (specified) pages. Specify either \code{area} or \code{columns}, but not both.}

\item{columns}{An optional list, of length equal to the number of pages specified, where each entry contains a numeric vector of horizontal (x) coordinates separating columns of data for the corresponding page. As a convenience, a list of length 1 can be used to specify the same columns for all (specified) pages. Specify either \code{area} or \code{columns}, but not both.}

\item{guess}{A logical indicating whether to guess the locations of tables on each page. If \code{FALSE}, \code{area} or \code{columns} must be specified; if \code{TRUE}, \code{columns} is ignored.}

\item{method}{A string identifying the preferred method of table extraction.
\itemize{
  \item \code{method = "decide"} (default) automatically decide (for each page) whether spreadsheet-like formatting is present and "lattice" is appropriate
  \item \code{method = "lattice"} use Tabula's spreadsheet extraction algorithm
  \item \code{method = "stream"} use Tabula's basic extraction algorithm
}}

\item{output}{A character string specifying how to coerce the Java response object (a Java ArrayList of Tabula Tables) to an output format. The default, \dQuote{matrix}, returns a list of character matrices. See Details for other options.}

\item{password}{Optionally, a character string containing a user password to access a secured PDF.}

\item{encoding}{Optionally, a character string specifying an encoding for the text, to be passed to the assignment method of \code{\link[base]{Encoding}}.}

\item{copy}{Specifies whether the original local file(s) should be copied to
\code{tempdir()} before processing. \code{FALSE} by default. The argument is
ignored if \code{file} is a URL.}

\item{\dots}{Additional arguments passed to the internal functions dispatched by \code{output}.}
}
\value{
By default, a list of character matrices. This can be changed by specifying an alternative value of \code{output} (see Details).
}
\description{
Extract tables from a file
}
\details{
This function mimics the behavior of the Tabula command line utility. By default, it returns a list of R character matrices containing the tables extracted from a file. This behavior can be changed with the following values of \code{output}:
\itemize{
  \item \code{output = "character"} returns a list of single-element character vectors, where each vector is a tab-delimited, line-separated string of concatenated table cells.
  \item \code{output = "data.frame"} attempts to coerce the structure returned by \code{output = "character"} into a list of data.frames and returns character strings where this fails.
  \item \code{output = "csv"} writes the tables to comma-separated (CSV) files using Tabula's CSVWriter method in the same directory as the original PDF. \code{output = "tsv"} does the same but with tab-separated (TSV) files using Tabula's TSVWriter, and \code{output = "json"} does the same using Tabula's JSONWriter method. Each of these three options returns the path to the directory containing the extracted table files.
  \item \code{output = "asis"} returns the Java object reference, which can be useful for debugging or for writing a custom parser.
}
\code{\link{extract_areas}} implements this functionality in an interactive mode allowing the user to specify extraction areas for each page.
}
\examples{
\donttest{
# simple demo file
f <- system.file("examples", "data.pdf", package = "tabulizer")

# extract all tables
extract_tables(f)

# extract tables from only second page
extract_tables(f, pages = 2)

# extract areas from a page
## full table
extract_tables(f, pages = 2, area = list(c(126, 149, 212, 462)))
## part of the table
extract_tables(f, pages = 2, area = list(c(126, 284, 174, 417)))

# return data.frames
extract_tables(f, pages = 2, output = "data.frame")
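
# The calls below are a sketch of other documented options, not verified
# output. The area coordinates are reused from the example above and are
# only assumed to be meaningful on pages other than page 2.

# force a specific extraction algorithm instead of the default "decide"
extract_tables(f, pages = 2, method = "stream")

# return tab-delimited character strings instead of matrices
extract_tables(f, pages = 2, output = "character")

# reuse one area for every requested page by passing a list of length 1
extract_tables(f, pages = c(1, 2), guess = FALSE,
               area = list(c(126, 149, 212, 462)))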
}
}
\references{
\href{http://tabula.technology/}{Tabula}
}
\seealso{
\code{\link{extract_areas}}, \code{\link{get_page_dims}}, \code{\link{make_thumbnails}}, \code{\link{split_pdf}}
}
\author{
Thomas J. Leeper <thosjleeper@gmail.com>
}
