---
title: "Staggered DiD with Nonlinear Outcomes: Methods and Usage"
author: "NonlinearDiD Package"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Staggered DiD with Nonlinear Outcomes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width  = 7,
  fig.height = 4.5,
  warning    = FALSE,
  message    = FALSE
)
library(NonlinearDiD)
library(ggplot2)
set.seed(42)
```

# Introduction

The **NonlinearDiD** package implements difference-in-differences estimators for
staggered treatment adoption with **binary, count, and other nonlinear outcomes**.

## Why Nonlinear Outcomes Require Special Methods

The landmark Callaway & Sant'Anna (2021) framework assumes that parallel trends
holds on the **probability scale**:

$$E[Y_{it}(0) - Y_{it-1}(0) \mid G = g] = E[Y_{it}(0) - Y_{it-1}(0) \mid G = \infty]$$

For continuous outcomes (wages, test scores), this is natural. But for binary
outcomes (employment, hospitalization, default), parallel trends in probabilities
implies **parallel trends in log-odds** only under a very specific functional
form restriction.

Roth & Sant'Anna (2023) show this creates two distinct problems:

1. **Sensitivity to functional form**: Whether pre-treatment trends appear
   parallel depends entirely on which scale you test.

2. **Treatment effect heterogeneity conflation**: In nonlinear models, the
   "treatment effect" includes both the true effect and Jensen's inequality
   adjustments.

This package addresses both problems.

---

# Installation

```r
# From CRAN (forthcoming)
install.packages("NonlinearDiD")

# Development version from GitHub
remotes::install_github("example/NonlinearDiD")
```

---

# Binary Outcomes: Staggered DiD

## Simulating Data

```{r simulate}
dat <- sim_binary_panel(
  n            = 800,
  nperiods     = 8,
  prop_treated = 0.6,
  n_cohorts    = 3,
  true_att     = c(0.15, 0.25, 0.20),  # heterogeneous treatment effects
  base_prob    = 0.3,
  seed         = 42
)

head(dat)
cat("Treatment cohorts:", table(dat$g[dat$period == 1]), "\n")
cat("Baseline outcome rate:", round(mean(dat$y[dat$D == 0]), 3), "\n")
```

## Estimating ATT(g,t) with Logit Model

```{r attgt}
res <- nonlinear_attgt(
  data          = dat,
  yname         = "y",
  tname         = "period",
  idname        = "id",
  gname         = "g",
  xformla       = ~ x1 + x2,
  outcome_model = "logit",
  estimand      = "att",
  control_group = "nevertreated",
  doubly_robust = TRUE
)

summary(res)
```

The `nonlinear_attgt()` function estimates a separate ATT for each
(group, time) pair. The key difference from the linear case: we model the
**change in log-odds** for control units as the counterfactual, not the
change in probability.

## Event-Study Aggregation

```{r eventstudy, fig.cap="Event-study plot for binary outcome DiD"}
agg_dynamic <- nonlinear_aggte(res, type = "dynamic")
print(agg_dynamic)
plot(agg_dynamic)
```

## Group-Level ATT

```{r groupatt}
agg_group <- nonlinear_aggte(res, type = "group")
print(agg_group)
```

The group-level ATT tells us: cohort 4 experienced the largest treatment effect
(they adopted treatment earlier, when baseline rates were lower).

## Pre-Trends Test

```{r pretest}
pt <- nonlinear_pretest(res, plot = FALSE)
print(pt)
```

The pre-trends test examines whether ATT(g,t) = 0 for all pre-treatment
periods. **Critical caveat**: A failure to reject pre-trends does NOT validate
parallel trends — it only fails to find evidence against it. For binary
outcomes, the test is conducted on the same scale as the estimator.

---

# Odds-Ratio DiD

For some binary outcome applications, the **odds-ratio DiD** is preferable
because:
- It is invariant to the choice of reference group
- It does not require parallel trends in probabilities (only in log-odds)
- It has a natural multiplicative interpretation

```{r ordata}
# Use simple 2-period case for clarity
dat2 <- dat[dat$period %in% c(3, 5), ]

res_or <- odds_ratio_did(
  data           = dat2,
  yname          = "y",
  tname          = "period",
  idname         = "id",
  treat_period   = 5,
  control_period = 3,
  gname          = "g"
)

print(res_or)
```

An OR-DiD of 1 means no treatment effect. Values > 1 indicate the treatment
increased the odds of Y = 1.

---

# Count Outcomes: Poisson DiD

For count outcomes (patents, hospital visits, crimes), we assume parallel
trends on the **log scale** (multiplicative parallel trends):

$$\frac{E[Y_{it}(0)]}{E[Y_{it-1}(0)]} = \frac{E[Y_{jt}(0)]}{E[Y_{jt-1}(0)]}$$

```{r countdata}
count_dat <- sim_count_panel(
  n            = 600,
  nperiods     = 6,
  prop_treated = 0.5,
  true_rr      = c(1.5, 2.0, 1.3),  # rate ratios per cohort
  base_rate    = 10,
  seed         = 7
)

summary(count_dat$y)
```

```{r countdid}
# Staggered Poisson DiD
res_count <- nonlinear_attgt(
  data          = count_dat,
  yname         = "y",
  tname         = "period",
  idname        = "id",
  gname         = "g",
  outcome_model = "poisson",
  estimand      = "att",
  control_group = "nevertreated"
)

agg_count <- nonlinear_aggte(res_count, type = "dynamic")
plot(agg_count)
```

```{r pois2x2}
# Simple 2x2 Poisson DiD (Wooldridge QMLE)
count_sub <- count_dat[count_dat$period %in% c(2, 4), ]
res_pois  <- count_did_poisson(
  count_sub,
  yname          = "y",
  tname          = "period",
  idname         = "id",
  treat_period   = 4,
  control_period = 2,
  gname          = "g"
)
print(res_pois)
```

---

# Doubly-Robust Estimation

The doubly-robust estimator combines an outcome model with a propensity
score model. It is consistent if **either** (but not necessarily both)
is correctly specified.

```{r drlogit}
res_dr <- binary_did_dr(
  data           = dat[dat$period %in% c(3, 5), ],
  yname          = "y",
  tname          = "period",
  idname         = "id",
  treat_period   = 5,
  control_period = 3,
  gname          = "g",
  xformla        = ~ x1 + x2,
  outcome_model  = "logit"
)
print(res_dr)
```

---

# Nonparametric Bounds

When we are uncertain about functional form, we can compute **sharp bounds**
on the ATT. Under parallel trends alone (with no functional form assumption),
the ATT for binary outcomes is **point identified**. But under weaker
assumptions (e.g., only monotone treatment response), we get intervals.

```{r bounds}
bounds <- nonlinear_bounds(
  data          = dat,
  yname         = "y",
  tname         = "period",
  idname        = "id",
  gname         = "g",
  bound_type    = "manski",  # widest (no assumptions)
  control_group = "nevertreated"
)

# Show post-treatment bounds
post_bounds <- bounds[bounds$post, ]
head(post_bounds[, c("group", "time", "lb", "ub", "identified")])
```

---

# Comparison: Linear vs. Nonlinear DiD

A critical question: does it matter? When baseline probabilities are
far from 0 and 1, the answer is **yes** — the linear DiD can be severely
biased.

```{r comparison, eval = FALSE}
# Linear comparison (for illustration)
res_linear <- nonlinear_attgt(
  data          = dat,
  yname         = "y",
  tname         = "period",
  idname        = "id",
  gname         = "g",
  outcome_model = "linear",
  estimand      = "att"
)

agg_lin  <- nonlinear_aggte(res_linear, type = "dynamic")
agg_logit <- nonlinear_aggte(res, type = "dynamic")

# The two produce different estimates when baseline rates are moderate
cat("Linear overall ATT:",  round(agg_lin$overall_att, 4), "\n")
cat("Logit overall ATT:",   round(agg_logit$overall_att, 4), "\n")
cat("True ATT (avg):    ",  round(mean(c(0.15, 0.25, 0.20)), 4), "\n")
```

---

# Methodological Details

## Identifying Assumption

For the logit-scale ATT(g,t) estimator, the key identifying assumption is:

**Parallel Trends on the Logit Scale**: For all $t \geq 2$ and all treated
cohorts $g$:

$$[\text{logit}(E[Y_{it}(0)]) - \text{logit}(E[Y_{it-1}(0)])] \cdot \mathbf{1}_{G_i = g}$$
$$= [\text{logit}(E[Y_{it}(0)]) - \text{logit}(E[Y_{it-1}(0)])] \cdot \mathbf{1}_{G_i = \infty}$$

This is a **stronger assumption** than CS2021's linear parallel trends when
treatment effects are heterogeneous (it restricts both the means and the
full distribution).

## DR Estimator

The doubly-robust estimator for binary outcomes uses the influence function:

$$\hat{\psi}^{DR}_{g,t} = \frac{1}{n} \sum_i \left[
  \frac{D_i}{\hat{p}} (\Delta Y_i - \hat{\mu}(X_i)) -
  \frac{(1-D_i)\hat{\pi}(X_i)}{1-\hat{\pi}(X_i)} \cdot \frac{1}{\hat{p}} (\Delta Y_i - \hat{\mu}(X_i))
\right]$$

where $\Delta Y_i = Y_{it} - Y_{it-1}$, $\hat{\pi}$ is the propensity score,
and $\hat{\mu}$ is the outcome regression on controls.

---

# References

Callaway, B., & Sant'Anna, P. H. C. (2021). Difference-in-differences with
multiple time periods. *Journal of Econometrics*, 225(2), 200-230.

Roth, J., & Sant'Anna, P. H. C. (2023). When is parallel trends sensitive to
functional form? *Econometrica*, 91(2), 737-747.

Wooldridge, J. M. (2023). Simple approaches to nonlinear difference-in-
differences with panel data. *The Econometrics Journal*, 26(3), 31-66.

Sant'Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences
estimators. *Journal of Econometrics*, 219(1), 101-122.

Manski, C. F. (1990). Nonparametric bounds on treatment effects.
*American Economic Review*, 80(2), 319-323.