
Cross-Validation Strategies

Cross-validation (CV) ensures your bias predictor generalizes to unseen data. WIFA-UQ supports multiple CV strategies suited to different data structures and validation goals.

Overview

Cross-validation estimates how well your model will perform on new data by systematically holding out portions of your dataset for testing.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Cross-Validation Process                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Full Dataset                                                               │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ Case 1 │ Case 2 │ Case 3 │ Case 4 │ Case 5 │ ... │ Case N │        │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  Fold 1: ████████████████████████████████████████  │░░░░░░░░│              │
│  Fold 2: ████████████████████████████  │░░░░░░░░│  ██████████              │
│  Fold 3: ████████████████  │░░░░░░░░│  ██████████████████████              │
│  Fold 4: ████████  │░░░░░░░░│  ██████████████████████████████              │
│  Fold 5: │░░░░░░░░│  ████████████████████████████████████████              │
│                                                                             │
│          ████ = Training data    ░░░░ = Test data                          │
│                                                                             │
│  Metrics computed on each test fold, then averaged                         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
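
If you want to see the mechanics outside of WIFA-UQ, the loop below sketches the same pattern with plain scikit-learn: split, fit on the training fold, score on the held-out fold, then average. It is a standalone illustration with synthetic data and a generic regressor, not WIFA-UQ's internal implementation.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                                   # synthetic features
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=100)

fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on training folds
    pred = model.predict(X[test_idx])                           # predict held-out fold
    fold_rmse.append(mean_squared_error(y[test_idx], pred) ** 0.5)

print(f"RMSE per fold: {np.round(fold_rmse, 3)}")
print(f"Mean ± std:    {np.mean(fold_rmse):.3f} ± {np.std(fold_rmse):.3f}")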

Available Strategies

Strategy              splitting_mode      Best For                       Preserves
K-Fold (ordered)      kfold               Time series, sequential data   Temporal ordering
K-Fold (shuffled)     kfold_shuffled      i.i.d. data, general use       Nothing (random splits)
Leave-One-Group-Out   LeaveOneGroupOut    Multi-farm, grouped data       Group integrity
Leave-One-Out         LeaveOneOut         Small datasets                 Maximum test coverage

Quick Start

cross_validation:
  splitting_mode: kfold_shuffled    # CV strategy
  n_splits: 5                       # Number of folds
  random_state: 42                  # Reproducibility

K-Fold Cross-Validation

Divides data into K equal folds; each fold serves as test set once.

K-Fold Ordered (kfold)

Preserves the original data order when creating folds. Use when temporal structure matters.

cross_validation:
  splitting_mode: kfold
  n_splits: 5

Visualization (example with n_splits: 6 on 12 monthly cases):

Original order: [Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec]

Fold 1: Train=[Mar-Dec]           Test=[Jan, Feb]
Fold 2: Train=[Jan-Feb, May-Dec]  Test=[Mar, Apr]
Fold 3: Train=[Jan-Apr, Jul-Dec]  Test=[May, Jun]
Fold 4: Train=[Jan-Jun, Sep-Dec]  Test=[Jul, Aug]
Fold 5: Train=[Jan-Aug, Nov-Dec]  Test=[Sep, Oct]
Fold 6: Train=[Jan-Oct]           Test=[Nov, Dec]

When to use:

  • Data has temporal structure (seasons, campaigns)
  • You want adjacent cases in the same fold
  • Testing generalization to different time periods
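
A minimal standalone sketch of the ordered behaviour, assuming it maps onto scikit-learn's KFold with shuffle=False (reproducing the 12-month, 6-fold example from the visualization above):

from sklearn.model_selection import KFold

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# shuffle=False keeps the original order, so each test fold is a contiguous block
kf = KFold(n_splits=6, shuffle=False)
for fold, (train_idx, test_idx) in enumerate(kf.split(months), start=1):
    print(f"Fold {fold}: Test={[months[i] for i in test_idx]}")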

K-Fold Shuffled (kfold_shuffled)

Randomly shuffles data before splitting. The default choice for most applications.

cross_validation:
  splitting_mode: kfold_shuffled
  n_splits: 5
  random_state: 42                  # For reproducibility

Visualization:

Original: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Shuffled: [7, 2, 9, 4, 1, 8, 3, 10, 6, 5]

Fold 1: Train=[9,4,1,8,3,10,6,5]  Test=[7, 2]
Fold 2: Train=[7,2,1,8,3,10,6,5]  Test=[9, 4]
...

When to use:

  • Data points are independent (i.i.d.)
  • No specific grouping structure
  • General-purpose validation
  • The most common choice
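
A minimal standalone sketch, assuming the shuffled mode maps onto scikit-learn's KFold with shuffle=True; note that the actual permutation depends on random_state, so it will not reproduce the illustrative shuffle above exactly.

from sklearn.model_selection import KFold

data = list(range(1, 11))   # the toy dataset [1..10] above

# shuffle=True randomizes fold assignment; random_state fixes the permutation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    print(f"Fold {fold}: Test={[data[i] for i in test_idx]}")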

Choosing n_splits

Dataset Size    Recommended n_splits   Notes
< 50 cases      3-5                    Larger test folds for reliable estimates
50-200 cases    5-10                   Standard choice
> 200 cases     5-10                   More folds give similar results

Trade-offs:

  • More splits: More training data per fold, but smaller test sets (higher variance in metrics)
  • Fewer splits: Larger test sets (more stable metrics), but less training data
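
The trade-off is easy to see by printing fold sizes for a fixed dataset size; a quick standalone sketch (the 120-case dataset size is arbitrary):

import numpy as np
from sklearn.model_selection import KFold

n_cases = 120                 # arbitrary example dataset size
X = np.zeros((n_cases, 1))    # placeholder data; only the length matters here

for n_splits in (3, 5, 10):
    test_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=n_splits).split(X)]
    print(f"n_splits={n_splits:2d}: test folds of {sorted(set(test_sizes))} cases, "
          f"train folds of ~{n_cases - test_sizes[0]} cases")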

Leave-One-Group-Out (LOGO)

Holds out entire groups (e.g., wind farms) for testing. Essential when you have grouped data and want to test generalization to new groups.

Configuration

cross_validation:
  splitting_mode: LeaveOneGroupOut
  group_key: wind_farm              # Column defining groups

Visualization:

Dataset with 3 wind farms:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Farm A: [A1, A2, A3, ..., A50]                                             │
│  Farm B: [B1, B2, B3, ..., B40]                                             │
│  Farm C: [C1, C2, C3, ..., C60]                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Fold 1: Train=[Farm B + Farm C]  (100 cases)   Test=[Farm A]  (50 cases)
Fold 2: Train=[Farm A + Farm C]  (110 cases)   Test=[Farm B]  (40 cases)
Fold 3: Train=[Farm A + Farm B]  ( 90 cases)   Test=[Farm C]  (60 cases)

Required when:

  • Training on multiple wind farms
  • Cases within a farm are correlated (same turbines, similar terrain)
  • The goal is predicting performance at new farms

Why it matters:

Without LOGO, cases from the same farm appear in both train and test sets, leading to data leakage and overly optimistic metrics:

❌ KFold with multi-farm data:
   Train: [A1, A3, B1, B3, C1, C3, ...]
   Test:  [A2, A4, B2, B4, C2, C4, ...]

   Problem: Model learns farm-specific patterns from A1, A3 and
            "predicts" A2, A4 using that leaked information.

✓ LOGO with multi-farm data:
   Train: [All of Farm B, All of Farm C]
   Test:  [All of Farm A]

   Correct: Model must generalize to Farm A without seeing ANY Farm A data.
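
A minimal standalone sketch of group-wise splitting, assuming scikit-learn's LeaveOneGroupOut (which the LeaveOneGroupOut mode is named after); the farm labels and case counts here are illustrative, not taken from a real dataset.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# 6 cases from 3 hypothetical farms
groups = np.array(["alpha", "alpha", "beta", "beta", "gamma", "gamma"])
X = np.arange(len(groups)).reshape(-1, 1)   # placeholder features

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, groups=groups):
    held_out = groups[test_idx][0]
    print(f"Test farm: {held_out:6s}  train cases: {train_idx}  test cases: {test_idx}")
# Each farm is held out exactly once; no farm appears in both train and test.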

Multi-Farm Configuration

Complete multi-farm setup with LOGO:

# Multi-farm configuration
farms:
  - name: farm_alpha
    system_config: /path/to/alpha/system.yaml
  - name: farm_beta
    system_config: /path/to/beta/system.yaml
  - name: farm_gamma
    system_config: /path/to/gamma/system.yaml

cross_validation:
  splitting_mode: LeaveOneGroupOut
  group_key: wind_farm

Group Key Selection

The group_key must exist as a coordinate in your stacked dataset:

import xarray as xr

dataset = xr.load_dataset("results_stacked_hh.nc")
print(dataset.coords)
# Coordinates:
#   case_index    (case_index) int64 ...
#   wind_farm     (case_index) <U10 'alpha' 'alpha' ... 'gamma'  ← group_key
#   sample        (sample) int64 ...

Common group keys:

  • wind_farm: Different wind farm sites
  • campaign: Different measurement campaigns
  • season: Seasonal grouping (if cases per season are sufficient)
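
Before enabling LOGO it is worth confirming that the group coordinate exists and that each group has a reasonable number of cases; a small standalone check with xarray (wind_farm is just the example coordinate used above):

import numpy as np
import xarray as xr

dataset = xr.load_dataset("results_stacked_hh.nc")

# Verify the group_key coordinate exists and count cases per group
assert "wind_farm" in dataset.coords, "group_key coordinate missing from dataset"
farms, counts = np.unique(dataset["wind_farm"].values, return_counts=True)
for farm, count in zip(farms, counts):
    print(f"{farm}: {count} cases")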


Leave-One-Out (LOO)

Extreme case where each sample is its own test fold. Maximizes training data but computationally expensive.

Configuration

cross_validation:
  splitting_mode: LeaveOneOut

Visualization:

Dataset: [1, 2, 3, 4, 5]

Fold 1: Train=[2,3,4,5]  Test=[1]
Fold 2: Train=[1,3,4,5]  Test=[2]
Fold 3: Train=[1,2,4,5]  Test=[3]
Fold 4: Train=[1,2,3,5]  Test=[4]
Fold 5: Train=[1,2,3,4]  Test=[5]
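
The same folds can be reproduced with scikit-learn's LeaveOneOut; a minimal standalone sketch of the 5-case toy example above:

import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(1, 6).reshape(-1, 1)   # the 5-case toy dataset

loo = LeaveOneOut()
for fold, (train_idx, test_idx) in enumerate(loo.split(X), start=1):
    print(f"Fold {fold}: Train={X[train_idx].ravel().tolist()}  "
          f"Test={X[test_idx].ravel().tolist()}")
# N folds for N samples: each case is the test set exactly once.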

When to Use

  • Very small datasets (< 30 cases)
  • Need maximum training data per fold
  • Computational cost is acceptable (N model fits)

Limitations

  • Computationally expensive: N folds for N samples
  • High variance: Single-sample test sets are noisy
  • Not recommended for large datasets

Metrics Computed

Cross-validation computes these metrics on each test fold:

Metric   Description                      Ideal Value
RMSE     Root mean squared error          Lower is better
MAE      Mean absolute error              Lower is better
R²       Coefficient of determination     Higher is better (max 1.0)
Bias     Mean error (systematic offset)   Close to 0
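
For reference, a standalone sketch of how these metrics can be computed from fold predictions with scikit-learn and numpy; the arrays are made-up placeholder values, and WIFA-UQ computes the metrics internally:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# y_true: observed bias values in the test fold; y_pred: predictor output
y_true = np.array([0.010, -0.020, 0.035, 0.000, -0.015])
y_pred = np.array([0.012, -0.018, 0.030, 0.004, -0.020])

rmse = mean_squared_error(y_true, y_pred) ** 0.5
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
bias = np.mean(y_pred - y_true)   # mean error; sign convention may differ from WIFA-UQ's

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R²={r2:.3f}  Bias={bias:+.4f}")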

Output Format

# cv_metrics.yaml
mean_rmse: 0.0234
std_rmse: 0.0045
mean_mae: 0.0189
std_mae: 0.0038
mean_r2: 0.847
std_r2: 0.052
per_fold_metrics:
  fold_0:
    rmse: 0.0212
    mae: 0.0178
    r2: 0.891
  fold_1:
    rmse: 0.0256
    mae: 0.0201
    r2: 0.823
  # ...

Interpreting Results

Good generalization:

  • Low std_rmse relative to mean_rmse (< 20%)
  • Consistent R² across folds
  • No single fold dramatically worse than others

Signs of problems:

  • High variance across folds → insufficient data or data heterogeneity
  • One fold much worse → possible outliers or a distinct subpopulation
  • Training metrics much better than CV metrics → overfitting


Comparison of Strategies

By Data Structure

Data Structure                  Recommended Strategy
Single farm, i.i.d. cases       kfold_shuffled
Single farm, temporal data      kfold
Multiple farms                  LeaveOneGroupOut
Very small dataset (< 30)       LeaveOneOut
Seasonal patterns to preserve   kfold or custom groups

By Validation Goal

Goal                                 Recommended Strategy
General performance estimate         kfold_shuffled
Generalization to new farms          LeaveOneGroupOut
Generalization to new time periods   kfold (ordered)
Maximum training data                LeaveOneOut

Computational Cost

Strategy               Number of Folds   Relative Cost
kfold_shuffled (k=5)   5                 Low
kfold (k=10)           10                Low-Medium
LeaveOneGroupOut       # of groups       Medium
LeaveOneOut            # of samples      High

API Usage

Running Cross-Validation

from wifa_uq.postprocessing.cross_validation import run_cross_validation
from wifa_uq.postprocessing.calibration import MinBiasCalibrator
from wifa_uq.postprocessing.error_prediction import BiasPredictor
import xarray as xr

database = xr.load_dataset("results_stacked_hh.nc")

# Configure CV
cv_config = {
    "splitting_mode": "kfold_shuffled",
    "n_splits": 5,
    "random_state": 42
}

# Run cross-validation
results = run_cross_validation(
    database=database,
    calibrator_class=MinBiasCalibrator,
    predictor_class=BiasPredictor,
    predictor_kwargs={"regressor_name": "XGB", "feature_names": ["ABL_height", "wind_veer"]},
    cv_config=cv_config
)

print(f"Mean RMSE: {results['mean_rmse']:.4f} ± {results['std_rmse']:.4f}")

Custom Cross-Validation Splitter

from sklearn.model_selection import StratifiedKFold
import numpy as np

# Create custom splitter (e.g., stratified by stability class)
stability_labels = np.digitize(
    database["lapse_rate"].isel(sample=0).values,
    bins=[-0.005, 0, 0.005]  # lapse-rate bin edges; digitize yields four classes from unstable to stable
)

custom_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Use in cross-validation
results = run_cross_validation(
    database=database,
    calibrator_class=MinBiasCalibrator,
    predictor_class=BiasPredictor,
    predictor_kwargs={...},
    cv_splitter=custom_cv,
    stratify_labels=stability_labels
)

Accessing Per-Fold Results

# Per-fold predictions and actuals
for fold_idx, fold_result in enumerate(results["fold_results"]):
    print(f"\nFold {fold_idx}:")
    print(f"  Test indices: {fold_result['test_indices'][:5]}...")
    print(f"  RMSE: {fold_result['rmse']:.4f}")
    print(f"  Predictions shape: {fold_result['predictions'].shape}")

Best Practices

1. Match Strategy to Data Structure

# Single farm
cross_validation:
  splitting_mode: kfold_shuffled
  n_splits: 5

# Multi-farm
cross_validation:
  splitting_mode: LeaveOneGroupOut
  group_key: wind_farm

2. Use Sufficient Folds

Too few folds → high variance in estimates:

# Compare different n_splits
for n_splits in [3, 5, 10]:
    results = run_cross_validation(..., cv_config={"n_splits": n_splits})
    print(f"n_splits={n_splits}: RMSE = {results['mean_rmse']:.4f} ± {results['std_rmse']:.4f}")

3. Set Random State for Reproducibility

cross_validation:
  splitting_mode: kfold_shuffled
  n_splits: 5
  random_state: 42    # Always set this!

4. Check for Leakage in Grouped Data

If you have grouped data but use K-Fold, verify metrics aren't artificially inflated:

# Compare KFold vs LOGO
kfold_results = run_cross_validation(..., cv_config={"splitting_mode": "kfold_shuffled"})
logo_results = run_cross_validation(..., cv_config={"splitting_mode": "LeaveOneGroupOut", "group_key": "wind_farm"})

print(f"KFold RMSE:  {kfold_results['mean_rmse']:.4f}")
print(f"LOGO RMSE:   {logo_results['mean_rmse']:.4f}")

# If LOGO RMSE >> KFold RMSE, there was likely leakage

5. Examine Fold-Level Variation

High variation suggests data heterogeneity:

import numpy as np

rmse_values = [fold['rmse'] for fold in results['fold_results']]
cv_coefficient = np.std(rmse_values) / np.mean(rmse_values)

if cv_coefficient > 0.3:
    print("WARNING: High CV coefficient suggests heterogeneous data")
    print("Consider: more data, different features, or stratified CV")

Troubleshooting

"LOGO gives much worse metrics than KFold"

This is expected and correct! LOGO tests true generalization to unseen groups. The gap indicates how much the model relies on group-specific patterns.

Solutions:

  • Accept the gap as realistic performance on new farms
  • Focus on features that generalize across farms
  • Consider farm-specific calibration as a separate step

"High variance across folds"

Causes:

  • Insufficient data per fold
  • Heterogeneous data (different conditions across folds)
  • Outliers affecting individual folds

Solutions:

  • Reduce the number of splits
  • Use stratified splitting to balance conditions
  • Identify and examine problematic folds
  • Check for outliers in the data

"One fold is much worse than others"

Causes:

  • That fold contains unusual conditions
  • Outliers concentrated in that fold
  • Data quality issues in that subset

Solutions:

# Identify the bad fold
worst_fold_idx = np.argmax([f['rmse'] for f in results['fold_results']])
worst_fold = results['fold_results'][worst_fold_idx]

# Examine test cases in that fold
test_indices = worst_fold['test_indices']
test_features = database.isel(case_index=test_indices, sample=0)

# Check for anomalies
print(test_features[["ABL_height", "wind_veer"]].to_dataframe().describe())

"Too few groups for LOGO"

Problem: With few groups (e.g., 2 farms), each fold uses only 1 farm for training.

Solutions:

  • Combine with nested CV if possible
  • Use repeated LOGO with bootstrapping (see the sketch after this list)
  • Consider KFold but acknowledge its limitations
  • Acquire more farm data if possible
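
One reading of "repeated LOGO with bootstrapping" is to keep holding each farm out as usual but resample the training cases with replacement on every repeat, which gives a spread of scores rather than one number per held-out farm. The sketch below is a hand-rolled illustration of that idea with scikit-learn and numpy, not a WIFA-UQ feature; the data and regressor are synthetic.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))                                      # synthetic features
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=80)
groups = np.array(["farm_a"] * 40 + ["farm_b"] * 40)              # only two groups

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, groups=groups):
    for _ in range(20):                                           # bootstrap repeats per held-out farm
        boot = rng.choice(train_idx, size=len(train_idx), replace=True)
        model = LinearRegression().fit(X[boot], y[boot])
        rmse = mean_squared_error(y[test_idx], model.predict(X[test_idx])) ** 0.5
        scores.append(rmse)

print(f"RMSE over bootstrap repeats: {np.mean(scores):.3f} ± {np.std(scores):.3f}")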


Advanced Topics

Nested Cross-Validation

For hyperparameter tuning without leakage:

Outer loop: Model evaluation (e.g., 5-fold)
  Inner loop: Hyperparameter selection (e.g., 3-fold)

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from xgboost import XGBRegressor

# X, y: your feature matrix and bias targets

# Outer CV for evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner CV for hyperparameter tuning (wrapped in GridSearchCV)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

param_grid = {"max_depth": [2, 3, 4], "n_estimators": [50, 100, 200]}

# This gives an unbiased performance estimate with tuned hyperparameters
nested_scores = cross_val_score(
    GridSearchCV(XGBRegressor(), param_grid, cv=inner_cv),
    X, y, cv=outer_cv, scoring="neg_root_mean_squared_error"
)

Time-Series Cross-Validation

For strictly temporal data where future data shouldn't be used:

from sklearn.model_selection import TimeSeriesSplit

# Only uses past data to predict future
ts_cv = TimeSeriesSplit(n_splits=5)

# X: your time-ordered feature matrix
for train_idx, test_idx in ts_cv.split(X):
    print(f"Train: {train_idx.min()}-{train_idx.max()}, Test: {test_idx.min()}-{test_idx.max()}")
# Example output (exact ranges depend on dataset size):
# Train: 0-19, Test: 20-29
# Train: 0-29, Test: 30-39
# ...

See Also