# leakR

Welcome to **leakR**, an R package designed to help researchers, data scientists, and machine learning practitioners rigorously detect and diagnose data leakage in their workflows.

Data leakage is a pervasive yet often overlooked problem: information that would not be available at prediction time "leaks" between the training and testing phases, which undermines the validity and reproducibility of predictive models. leakR provides a modular, extensible toolkit for detecting the most common and impactful forms of leakage. It starts with tabular data contamination, target leakage, and temporal misalignment, and lays the foundation for a general leakage detection framework across other data domains.

## Installation

### From CRAN (Recommended)

```r
install.packages("leakr")
```

### From GitHub (Development Version)

For the latest features and bug fixes:

```r
# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install leakR from GitHub
devtools::install_github("cherylisabella/leakR")
```

## Quick Start

```r
library(leakr)

# Basic audit of your dataset
report <- leakr_audit(iris, target = "Species")

# View summary of issues found
leakr_summarise(report)

# Generate diagnostic visualizations
leakr_plot(report)

# Access detailed results
print(report)
```

## Main Functions

| Function | Purpose |
|----------|---------|
| `leakr_audit()` | Main auditing function - detects leakage across your dataset |
| `leakr_summarise()` | Generate human-readable summaries of detected issues |
| `leakr_plot()` | Create diagnostic visualizations highlighting problems |
| `leakr_from_caret()` | Import and audit caret workflow objects |
| `leakr_from_tidymodels()` | Import and audit tidymodels workflow objects |
| `leakr_from_mlr3()` | Import and audit mlr3 workflow objects |

## Learn More

Get started with the comprehensive vignettes:

```r
# Getting started guide
vignette("getting-started", package = "leakr")

# Advanced detection techniques
vignette("advanced-detection", package = "leakr")

# Framework integration examples
vignette("framework-integration", package = "leakr")
```

## Why leakR?

- **Automates leakage detection**, filling a key methodological gap
- **Designed for clarity, reproducibility, and transparent ML research**
- **Modular architecture** supports gradual expansion (time series, NLP, images)
- **Useful for both academic and industry workflows**

## What leakR Detects

- **Train/test contamination** - Overlapping records between training and test sets
- **Target leakage** - Features that contain information about the target variable that wouldn't be available at prediction time
- **Duplicate rows/records** - Exact and near-duplicate observations that can inflate performance metrics
- **Temporal misalignments** - Features or records that carry information from after the prediction time, such as future observations in time series data
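
To make these categories concrete, here is a minimal base R sketch of what each kind of check looks like on a toy dataset. It does not use leakR or show how leakR implements its detectors. The data, the column names (`x1`, `x2`, `leaky`, `date`), and the 0.95 correlation threshold are all assumptions made up for illustration.

```r
# Hypothetical toy data, built only to illustrate the four leakage types.
set.seed(42)
n <- 100
dat <- data.frame(
  id   = seq_len(n),
  date = as.Date("2024-01-01") + seq_len(n) - 1,
  x1   = rnorm(n),
  x2   = rnorm(n)
)
dat$y <- as.integer(dat$x1 + rnorm(n, sd = 0.5) > 0)
# A deliberately leaky feature: the target plus a little noise
dat$leaky <- dat$y + rnorm(n, sd = 0.01)

# Split by row order, then copy 5 training rows into the test set on purpose
train <- dat[1:70, ]
test  <- rbind(dat[71:100, ], dat[1:5, ])

feature_cols <- c("x1", "x2", "y", "leaky")

# 1. Train/test contamination: test rows whose values also occur in train
row_key <- function(df) do.call(paste, c(df[feature_cols], sep = "\r"))
n_overlap <- sum(row_key(test) %in% row_key(train))

# 2. Duplicate rows: exact duplicates in the combined data
full <- rbind(train, test)
n_duplicates <- sum(duplicated(full[feature_cols]))

# 3. Target leakage (crude heuristic): features almost perfectly
#    correlated with the target; 0.95 is an arbitrary threshold
cors <- sapply(train[c("x1", "x2", "leaky")],
               function(col) abs(cor(col, train$y)))
suspicious <- cors[cors > 0.95]

# 4. Temporal misalignment: test rows dated no later than the
#    most recent training row
n_early <- sum(test$date <= max(train$date))

print(list(
  overlapping_rows  = n_overlap,
  duplicate_rows    = n_duplicates,
  suspicious_cols   = names(suspicious),
  early_test_rows   = n_early
))
```

Hand-rolled checks like these catch only the simplest cases (exact matches and linear relationships), which is part of the motivation for a dedicated auditing tool.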

## Key Features

- **Visual summaries** of suspicious patterns and leakage hotspots
- **Detailed leakage reports** suitable for audits, peer review, or publications
- **Clean APIs** for seamless integration into existing ML workflows
- **Example vignettes** demonstrating real leakage phenomena with code illustrations
- **Framework integration** with caret, tidymodels, and mlr3

## Development Roadmap

- **Phase 1**: Core tabular leakage detectors ✓
- **Phase 2**: Time series leakage detection (in progress)
- **Phase 3**: Domain-specific extensions (NLP, image pipelines)
- **Phase 4**: Pipeline integration and multi-language support

## Citation

If you use leakR in your research, please cite:

```bibtex
@Manual{leakr2025,
  title = {leakR: Data Leakage Detection Tools for Machine Learning},
  author = {Cheryl Isabella Lim},
  year = {2025},
  note = {R package version 0.1.0},
  url = {https://github.com/cherylisabella/leakR},
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

leakR is currently under development. Feedback and contributions from the community are welcome!