% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tbl_dbi.R
\name{diagnose_category.tbl_dbi}
\alias{diagnose_category.tbl_dbi}
\title{Diagnose data quality of categorical variables in the DBMS}
\usage{
\method{diagnose_category}{tbl_dbi}(
  .data,
  ...,
  top = 10,
  type = c("rank", "n")[1],
  in_database = TRUE,
  collect_size = Inf
)
}
\arguments{
\item{.data}{a tbl_dbi.}

\item{...}{one or more unquoted expressions separated by commas.
You can treat variable names like they are positions.
Positive values select variables; negative values to drop variables.
If the first expression is negative, diagnose_category() will automatically
start with all variables.
These arguments are automatically quoted and evaluated in a context where
column names represent column positions.
They support unquoting and splicing.}

\item{top}{an integer. Specifies the upper top rank to extract.
Default is 10.}

\item{type}{a character string specifying how result are extracted.
Default is "rank" that extract top n ranks by decreasing frequency. 
In this case, if there are ties in rank, more rows than the number specified 
by the top argument are returned.
"n" extract top n rows by decreasing frequency. 
If there are too many rows to be returned because there are too many ties, 
you can adjust the returned rows appropriately by using "n".}

\item{in_database}{Specifies whether to perform in-database operations. 
If TRUE, most operations are performed in the DBMS. if FALSE, 
table data is taken in R and operated in-memory.}

\item{collect_size}{a integer. The number of data samples from the DBMS to R. 
Applies only if in_database = FALSE.}
}
\value{
an object of tbl_df.
}
\description{
The diagnose_category() produces information for 
diagnosing the quality of the character(CHAR, VARCHAR, VARCHAR2, etc.) 
column of the DBMS table through tbl_dbi.
}
\details{
The scope of the diagnosis is the occupancy status of the levels
in categorical data. If a certain level of occupancy is close to 100%,
then the removal of this variable in the forecast model will have to be
considered. Also, if the occupancy of all levels is close to 0%, this
variable is likely to be an identifier.
You can use grouped_df as the group_by() function.
}
\section{Categorical diagnostic information}{

The information derived from the categorical data diagnosis is as follows.

\itemize{
\item variables : variable names
\item levels: level names
\item N : number of observation
\item freq : number of observation at the levels
\item ratio : percentage of observation at the levels
\item rank : rank of occupancy ratio of levels
}

See vignette("diagonosis") for an introduction to these concepts.
}

\examples{
# If you have the 'DBI' and 'RSQLite' packages installed, perform the code block:
if (FALSE) {
library(dplyr)

# connect DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy jobchange to the DBMS with a table named TB_JOBCHANGE
copy_to(con_sqlite, jobchange, name = "TB_JOBCHANGE", overwrite = TRUE)

# Using pipes ---------------------------------
# Diagnosis of all categorical variables
con_sqlite \%>\% 
  tbl("TB_JOBCHANGE") \%>\% 
  diagnose_category()
  
# Positive values select variables
con_sqlite \%>\% 
  tbl("TB_JOBCHANGE") \%>\% 
  diagnose_category(company_type, job_chnge)
  
# Negative values to drop variables, and In-memory mode
con_sqlite \%>\% 
  tbl("TB_JOBCHANGE") \%>\% 
  diagnose_category(-company_type, -job_chnge, in_database = FALSE)
  
# Positions values select variables, and In-memory mode and collect size is 200
con_sqlite \%>\% 
  tbl("TB_JOBCHANGE") \%>\% 
  diagnose_category(7, in_database = FALSE, collect_size = 200) 
  
# Negative values to drop variables
con_sqlite \%>\% 
  tbl("TB_JOBCHANGE") \%>\% 
  diagnose_category(-7)
  
# Top rank levels with top argument
con_sqlite \%>\% 
  tbl("TB_JOBCHANGE") \%>\% 
  diagnose_category(top = 2)

# Using pipes & dplyr -------------------------
# Extraction of level that is more than 60\% of categorical data
con_sqlite \%>\% 
  tbl("TB_JOBCHANGE") \%>\% 
  diagnose_category()  \%>\%
  filter(ratio >= 60)
  
# Using group_by() ---------------------------- 
con_sqlite \%>\% 
  tbl("TB_JOBCHANGE") \%>\% 
  group_by(job_chnge) \%>\% 
  diagnose_category(company_type) 
  
# Using type argument -------------------------
 dfm <- data.frame(alpabet = c(rep(letters[1:5], times = 5), "c")) 
 
# copy dfm to the DBMS with a table named TB_EXAMPLE
copy_to(con_sqlite, dfm, name = "TB_EXAMPLE", overwrite = TRUE)  
 
# extract rows that less than equal rank 10
# default of top argument is 10
con_sqlite \%>\% 
  tbl("TB_EXAMPLE") \%>\% 
  diagnose_category()
   
# extract rows that less than equal rank 2
con_sqlite \%>\% 
  tbl("TB_EXAMPLE") \%>\% 
  diagnose_category(top = 2, type = "rank")
   
# extract rows that less than equal rank 2
# default of type argument is "rank"
con_sqlite \%>\% 
  tbl("TB_EXAMPLE") \%>\% 
  diagnose_category(top = 2)
 
# extract only 2 rows
con_sqlite \%>\% 
  tbl("TB_EXAMPLE") \%>\% 
  diagnose_category(top = 2, type = "n")

# Disconnect DBMS   
DBI::dbDisconnect(con_sqlite)
}

}
\seealso{
\code{\link{diagnose_category.data.frame}}, \code{\link{diagnose.tbl_dbi}}, 
\code{\link{diagnose_category.tbl_dbi}}, \code{\link{diagnose_numeric.tbl_dbi}}, 
\code{\link{diagnose_outlier.tbl_dbi}}.
}
