---
title: "Getting Started: OFH Synthetic Cohort Generation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started: OFH Synthetic Cohort Generation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## Overview

This vignette shows how to generate synthetic cohort datasets for method development before using real health data.

The package-style API supports:

- configurable cohort size
- reproducible generation via seed
- optional ICD-10 / OPCS4 / BNF code restrictions
- configurable dataset coverage, record density, and field-level generation probabilities
- control over whether to save CSVs and/or return R objects

## 1. Load the package

```{r load-api, eval = FALSE}
library(ofhsyn)
```

## 2. Generate a basic cohort

```{r basic-run, eval = FALSE}
out <- generate_ofh_cohort(
  n = 1000,
  seed = 123
)

names(out)
```

This returns a named list of data frames and writes CSVs to an output folder in your current working directory.
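
A quick way to see what came back is to tabulate the size of each element. The sketch below uses a small stand-in list (`out_demo`, a hypothetical name) so that it runs without the package. The element names are borrowed from the examples in this vignette, and the columns are placeholders rather than the real schema.

```{r inspect-output, eval = FALSE}
# Stand-in for the list returned by generate_ofh_cohort().
# Column names and values here are placeholders, not the package schema.
out_demo <- list(
  questionnaire_data = data.frame(pid = 1:5, age = c(34, 51, 47, 62, 29)),
  nhse_inpat_data = data.frame(
    pid = c(2, 2, 5),
    diag = c("I210", "I500", "I210"),
    stringsAsFactors = FALSE
  )
)

# One row per dataset: its name, number of rows, and number of columns.
size_summary <- data.frame(
  dataset = names(out_demo),
  n_rows = unname(vapply(out_demo, nrow, integer(1))),
  n_cols = unname(vapply(out_demo, ncol, integer(1))),
  stringsAsFactors = FALSE
)

size_summary
```

Replacing `out_demo` with the real `out` should give the same kind of summary, assuming every element of the returned list is a data frame.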

To return objects only (without writing CSV files):

```{r objects-only, eval = FALSE}
out_objects_only <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  save_csv = FALSE,
  return_objects = TRUE
)
```

If you run this interactively, the generated data frames are also available in your R environment (for example `questionnaire_data`, `clinic_measurements_data`, `nhse_inpat_data`).

## 3. Restrict to specific code lists

```{r code-lists, eval = FALSE}
out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10 = c(
    I210 = "STEMI of anterolateral wall",
    I500 = "Congestive heart failure"
  ),
  opcs4 = c(
    K491 = "Percutaneous transluminal balloon angioplasty of one coronary artery"
  ),
  bnf_codes = data.frame(
    BNFCode = c("0212000B0", "0601022B0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)
```

You can also provide code files:

- ICD-10 and OPCS4 files must include both `code` and `description`, as either CSV (`code,description`) or tab-separated TXT (`code<TAB>description`).
- BNF files must be CSV with `BNFCode`, `BNFName`, and `Formulation`; `Strength` is optional.

```{r code-files, eval = FALSE}
out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10_file = "icd10_codes.txt",
  opcs4_file = "opcs4_codes.txt",
  bnf_codes_file = "bnf_medications.csv"
)
```
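
To try the file-based call without real code lists, small input files can be written with base R. This is a minimal sketch: the header row in the TXT file is an assumption (the format description above does not say whether one is expected), and the files go to a temporary directory rather than the project folder.

```{r make-code-files, eval = FALSE}
# Write into a temporary directory so the sketch leaves the project untouched.
code_dir <- tempdir()

icd10_codes <- data.frame(
  code = c("I210", "I500"),
  description = c("STEMI of anterolateral wall", "Congestive heart failure"),
  stringsAsFactors = FALSE
)

# Tab-separated TXT. Including a header row is an assumption.
icd10_path <- file.path(code_dir, "icd10_codes.txt")
write.table(icd10_codes, icd10_path, sep = "\t", quote = FALSE, row.names = FALSE)

bnf_meds <- data.frame(
  BNFCode = "0212000B0",
  BNFName = "Atorvastatin 20 mg tablets",
  Formulation = "tablets",
  Strength = "20 mg",
  stringsAsFactors = FALSE
)

# CSV with the BNF columns named in the list above.
bnf_path <- file.path(code_dir, "bnf_medications.csv")
write.csv(bnf_meds, bnf_path, row.names = FALSE)

# Read both files back to confirm the column names survived the round trip.
icd10_check <- read.delim(icd10_path, stringsAsFactors = FALSE)
bnf_check <- read.csv(bnf_path, stringsAsFactors = FALSE)
names(icd10_check)
names(bnf_check)
```

The resulting paths (`icd10_path`, `bnf_path`) can then be passed as `icd10_file` and `bnf_codes_file`.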

## 4. Configure dataset generation probabilities

The three list arguments map onto the controls listed in the overview: `proportions` for dataset coverage, `record_multipliers` for record density, and `code_config` for field-level generation probabilities.

```{r probabilities, eval = FALSE}
out_custom <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  proportions = list(
    nhse_outpat = 0.25,
    nhse_inpat = 0.20,
    nhse_ed = 0.30,
    nhse_primcare_meds = 0.75
  ),
  record_multipliers = list(
    nhse_outpat = 1.2,
    nhse_inpat = 1.1,
    nhse_ed = 1.3
  ),
  code_config = list(
    nhse_outpat_data = list(diag_4_02_missing_prob = 0.70),
    nhse_inpat_data = list(single_diag_prob = 0.85)
  )
)
```
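
Before passing a configuration like this, it can help to sanity-check the values. The sketch below assumes that each entry of `proportions` is a fraction between 0 and 1 and that each entry of `record_multipliers` is a positive scaling factor. Neither constraint is stated in this vignette, and the package may run its own validation. The helpers `is_probability()` and `is_positive_number()` are hypothetical and not part of `ofhsyn`.

```{r check-config, eval = FALSE}
proportions <- list(
  nhse_outpat = 0.25,
  nhse_inpat = 0.20,
  nhse_ed = 0.30,
  nhse_primcare_meds = 0.75
)
record_multipliers <- list(nhse_outpat = 1.2, nhse_inpat = 1.1, nhse_ed = 1.3)

# Hypothetical helper: TRUE for a single non-missing number in [0, 1].
is_probability <- function(x) {
  is.numeric(x) && length(x) == 1 && !is.na(x) && x >= 0 && x <= 1
}

# Hypothetical helper: TRUE for a single non-missing number above zero.
is_positive_number <- function(x) {
  is.numeric(x) && length(x) == 1 && !is.na(x) && x > 0
}

prop_ok <- vapply(proportions, is_probability, logical(1))
mult_ok <- vapply(record_multipliers, is_positive_number, logical(1))

# Show the names of any entries that fail the assumed constraints.
names(prop_ok)[!prop_ok]
names(mult_ok)[!mult_ok]
```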

## 5. Use the OOP interface directly

```{r oop-run, eval = FALSE}
syn <- OFHCohortSynthesizer$new(project_root = ".", seed = 123)

syn$set_code_pools(
  icd10 = c(I210 = "STEMI of anterolateral wall"),
  opcs4 = c(K491 = "Percutaneous transluminal balloon angioplasty of one coronary artery"),
  bnf_meds = data.frame(
    BNFCode = c("0212000B0", "0601022B0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)

out <- syn$run_all(n = 800)
```

## 6. Practical tips for researchers

- Start with small `n` (for example, 200 to 1000) while developing.
- Fix `seed` for reproducibility during method testing.
- Check row counts and `pid` linkage assumptions in your analysis scripts.
- Expand code lists as your phenotype definitions evolve.
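
The row-count and `pid` linkage checks can be sketched with base R. The example uses two stand-in data frames (`cohort_demo` and `inpat_demo`, both hypothetical) and assumes only that linked tables share a `pid` column. Which generated table holds the full cohort is not stated here, so substitute the appropriate one.

```{r check-linkage, eval = FALSE}
# Stand-ins: a full cohort table and a linked table covering a subset of it.
cohort_demo <- data.frame(pid = 1:6)
inpat_demo <- data.frame(pid = c(2, 2, 5, 6), admission = 1:4)

# pids in the linked table that are missing from the cohort (should be none).
orphan_pids <- setdiff(inpat_demo$pid, cohort_demo$pid)

# Share of cohort members with at least one linked record.
coverage <- mean(cohort_demo$pid %in% inpat_demo$pid)

# Number of linked records per participant.
records_per_pid <- table(inpat_demo$pid)

length(orphan_pids)
coverage
records_per_pid
```

A coverage value below 1 is expected for datasets that are intentional subsets of the cohort, whereas a non-empty `orphan_pids` would point to a linkage problem.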

## 7. Notes

- Some datasets are intentional subsets of the full cohort.
- Questionnaire output includes a small proportion of v1 records by design.
- Primary care meds include prescribed-but-not-dispensed rows.
