Preprocessing
Preprocessing transforms raw atmospheric data into derived quantities that are useful for bias prediction. This page explains the preprocessing pipeline, available steps, and when to use them.
Overview
The preprocessing stage operates on raw NetCDF resource files (atmospheric profiles) and produces an enriched dataset with derived quantities like ABL height, wind veer, and lapse rate.
Raw Resource (resource.nc) Processed Resource
┌─────────────────────────┐ ┌─────────────────────────┐
│ wind_speed(time, height)│ │ wind_speed(time, height)│
│ wind_direction(...) │ ──────► │ wind_direction(...) │
│ potential_temperature │ │ potential_temperature │
│ k (TKE) │ │ k (TKE) │
└─────────────────────────┘ │ + ABL_height(time) │
│ + wind_veer(time, height│
│ + lapse_rate(time) │
│ + turbulence_intensity │
└─────────────────────────┘
Configuration
Enable preprocessing in your YAML config:
preprocessing:
run: true
steps:
- recalculate_params # Calculate derived quantities
If run: false or no steps are specified, the raw resource file is used directly.
Available Steps
recalculate_params
The primary preprocessing step. Calculates multiple derived quantities from vertical profiles.
What it computes:
| Variable | Computation Method | Required Inputs |
|---|---|---|
ABL_height |
Height where wind speed reaches 99% of max | wind_speed, height |
wind_veer |
dθ/dz (wind direction gradient) | wind_direction, height |
turbulence_intensity |
√(2k/3) / U | k, wind_speed |
lapse_rate |
dΘ/dz from capping inversion fitting | potential_temperature, height |
capping_inversion_strength |
Temperature jump at ABL top | potential_temperature, height |
capping_inversion_thickness |
Depth of capping inversion | potential_temperature, height |
Example:
preprocessing:
run: true
steps:
- recalculate_params
Derived Quantities Explained
ABL Height
Definition: The height at which wind speed first reaches 99% of the maximum velocity in the profile.
Physical meaning: Captures the top of the atmospheric boundary layer, including cases with low-level jets where the maximum wind speed occurs below the free atmosphere.
Algorithm:
1. Find max_wind_speed = max(U(z)) for each time step
2. Compute threshold = 0.99 × max_wind_speed
3. Find lowest height where U ≥ threshold
Fallback: If temperature data is available but velocity-based calculation fails, the temperature-based ABL height from capping inversion fitting is used.
Typical range: 200–2000 m depending on stability and time of day.
Wind Veer
Definition: The rate of change of wind direction with height (dθ/dz).
Physical meaning: Describes Ekman spiral effects and can indicate the presence of baroclinicity or complex atmospheric structures.
Algorithm:
1. Convert wind direction to radians
2. Unwrap angles to handle 0°/360° crossing
3. Compute gradient: veer = d(direction)/d(height)
4. Convert back to degrees per meter
Units: degrees per meter (deg/m)
Typical range: -0.01 to +0.05 deg/m (positive = veering, clockwise with height in Northern Hemisphere)
Lapse Rate
Definition: The vertical gradient of potential temperature (dΘ/dz).
Physical meaning: Indicates atmospheric stability. Positive values indicate stable stratification; near-zero indicates neutral conditions.
Algorithm: Uses the ci_fitting function from the WAYVE library to fit the capping inversion structure and extract the lapse rate below the ABL top.
Units: K/m (Kelvin per meter)
Typical range: -0.01 to +0.01 K/m - Stable: +0.003 to +0.01 K/m - Neutral: ≈ 0 K/m - Unstable: negative values
Turbulence Intensity
Definition: Ratio of turbulent velocity fluctuations to mean wind speed.
Physical meaning: Describes the intensity of turbulent mixing, which affects wake recovery rates.
Algorithm:
TI = √(2k/3) / U
where:
k = turbulent kinetic energy (m²/s²)
U = mean wind speed (m/s)
Units: dimensionless (often expressed as percentage)
Typical range: 0.02–0.20 (2%–20%)
Capping Inversion Strength
Definition: The temperature jump across the capping inversion at the ABL top.
Physical meaning: Stronger inversions indicate more stable conditions that suppress vertical mixing.
Units: K (Kelvin)
Typical range: 0–10 K
Capping Inversion Thickness
Definition: The vertical extent of the temperature inversion layer.
Physical meaning: Thin inversions create sharper ABL tops; thick inversions indicate gradual transitions.
Units: m (meters)
Typical range: 50–500 m
Input Requirements
For preprocessing to work correctly, your input NetCDF file should contain:
Required Variables
| Variable | Dimensions | Units | Description |
|---|---|---|---|
wind_speed |
(time, height) | m/s | Horizontal wind speed profile |
height |
(height,) | m | Vertical coordinate (AGL) |
Recommended Variables
| Variable | Dimensions | Units | Description |
|---|---|---|---|
wind_direction |
(time, height) | degrees | Wind direction profile |
potential_temperature |
(time, height) | K | Potential temperature profile |
k |
(time, height) | m²/s² | Turbulent kinetic energy |
Optional Variables
| Variable | Dimensions | Units | Description |
|---|---|---|---|
LMO |
(time,) | m | Monin-Obukhov length (stability) |
If LMO is not present, neutral stability (LMO = 10¹⁰ m) is assumed.
Example NetCDF Structure
A typical input resource file:
Dimensions:
time: 100
height: 50
Variables:
height (height): [0, 10, 20, 30, ..., 500] # meters AGL
wind_speed (time, height): float32
wind_direction (time, height): float32
potential_temperature (time, height): float32
k (time, height): float32 # TKE
Preprocessing API
For programmatic use:
from pathlib import Path
from wifa_uq.preprocessing.preprocessing import PreprocessingInputs
# Initialize preprocessor
preprocessor = PreprocessingInputs(
ref_resource_path=Path("resource.nc"),
output_path=Path("processed_resource.nc"),
steps=["recalculate_params"]
)
# Run the pipeline
output_path = preprocessor.run_pipeline()
print(f"Processed data saved to: {output_path}")
Handling Missing Data
The preprocessing pipeline handles missing inputs gracefully:
| Missing Variable | Behavior |
|---|---|
k |
TI calculation skipped |
wind_direction |
Wind veer calculation skipped |
potential_temperature |
Temperature-based calculations skipped |
height |
Error (required for any derived quantity) |
Warnings are printed for skipped calculations:
Skipping TI recalculation: 'k' or 'wind_speed' not found.
Skipping wind veer calculation: Missing 'wind_direction'
When to Use Preprocessing
Always Use When:
- Your resource file only contains raw vertical profiles
- You need ABL height, wind veer, or lapse rate as ML features
- Reference data comes from LES or mesoscale simulations
Skip Preprocessing When:
- Your resource file already contains all required derived quantities
- You're using pre-processed data from a previous run
- You only need features available in the raw data
Configuration Example: Skip Preprocessing
preprocessing:
run: false
# Use raw resource directly
paths:
reference_resource: already_processed_data.nc
Or use a previously processed file:
preprocessing:
run: false
paths:
# Point to existing processed file
reference_resource: previous_run/processed_physical_inputs.nc
Troubleshooting
"ValueError: ci_fitting failed"
Cause: The capping inversion fitting algorithm couldn't converge, often due to unusual temperature profiles.
Solution:
1. Check that potential_temperature increases with height (stable atmosphere)
2. Verify height coordinates are reasonable (0–1000+ m)
3. The preprocessor will still output other variables; only lapse_rate and capping inversion metrics will be missing
"Wind veer shows unrealistic values"
Cause: Wind direction data crossing 0°/360° boundary wasn't properly unwrapped.
Solution: This should be handled automatically. If issues persist, check that wind_direction is in degrees (0–360), not radians.
"ABL_height shows NaN values"
Cause: Wind speed profile is monotonic (always increasing or decreasing with height).
Solution: For profiles without a clear velocity maximum, ABL_height defaults to the maximum height in the profile. Consider using temperature-based ABL height instead.
"TI values seem too high/low"
Cause: TKE (k) might be in unexpected units or represent something different.
Solution: Verify TKE units are m²/s². Some models output k in different forms.
Best Practices
-
Validate input data before preprocessing:
python import xarray as xr ds = xr.load_dataset("resource.nc") print(ds) # Check dimensions and variables -
Inspect output after preprocessing:
python ds_out = xr.load_dataset("processed_physical_inputs.nc") print(ds_out["ABL_height"].min(), ds_out["ABL_height"].max()) -
Use consistent height grids across datasets when combining multiple farms
-
Document preprocessing choices in your workflow description for reproducibility
Integration with Database Generation
Preprocessed variables become available as features in the model error database:
# These features are available after preprocessing
error_prediction:
features:
- ABL_height # From recalculate_params
- wind_veer # From recalculate_params
- lapse_rate # From recalculate_params
- turbulence_intensity # From recalculate_params
- Blockage_Ratio # Added during database_gen
- Blocking_Distance # Added during database_gen
The database generator interpolates height-dependent variables (like wind_veer, turbulence_intensity) to hub height automatically.
See Also
- Configuration Reference — Full YAML options
- Database Generation — Next step in the pipeline
- Feature Engineering — Using derived features