The first time most people split data into training and validation sets, they shuffle everything together and randomly hold out a portion for validation. For many tasks, that’s fine. But the moment your data carries a time dimension, especially when it spans a disruptive period like the COVID pandemic that upends the entire environment, this approach can quietly let your model cheat, and the good-looking scores will only make you more confident in it.
This note covers three things: the traps you fall into when time is a variable, how to handle a disruptive period like the pandemic, and a set of practices you can actually follow.
1. Random Split and Temporal Split Are Not Measuring the Same Thing
There are two common ways to split. A random split pools every year’s data together, shuffles it, and randomly divides it into train and test. A temporal split respects chronological order: train on the past, predict the future.
They are not testing the same ability:
| | Random split | Temporal split |
|---|---|---|
| Hidden assumption | Data is static; time does not matter | The environment evolves in one direction over time |
| What it tests | Interpolation: can it fill in the blanks? | Extrapolation: can it travel through time? |
| Realistic? | Leaks the future into the past | Matches real deployment: past predicts future |
| Score behavior | Tends to be optimistic | Usually lower, but more honest |
In one line: a random split tests whether the model can recite; a temporal split tests whether it actually understands.
2. The Most Common Traps When Time Is a Variable
Trap 1: Random shuffling causes time travel. Once later-year samples enter training while earlier-year samples sit in validation, the model has effectively seen tomorrow’s answer key. The symptom is familiar: offline scores look unusually clean, then degrade when the model faces a real future.
Trap 2: Treating independent samples as permission to shuffle time. Even when patients, transactions, or visits never repeat, the system around them still changes. What leaks is not the individual row; it is the future macro-environment. Independent samples do not make eras independent.
Trap 3: Future information sneaks into features or preprocessing. A temporal split is not enough if scaling, imputation, encoding, or summary features were learned from the full dataset. The rule is simple: every value may only depend on information available at prediction time.
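To make Trap 3 concrete, here is a minimal sketch of fitting a preprocessing step on the training period only. The helper names `fit_scaler` and `apply_scaler` and the toy values are illustrative assumptions, not part of any particular library.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the values are constant
    return mu, sigma

def apply_scaler(values, params):
    """Standardize values with parameters that were learned elsewhere."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Toy feature values (assumed): training rows come before the cutoff.
train_x = [1.0, 2.0, 3.0]
valid_x = [10.0, 12.0]

# Correct: parameters come from the past only, then get reused on the future.
params = fit_scaler(train_x)
train_scaled = apply_scaler(train_x, params)
valid_scaled = apply_scaler(valid_x, params)

# Leaky: the future's distribution shapes what the model sees in training.
leaky_params = fit_scaler(train_x + valid_x)
```

The same pattern applies to imputation values, category encodings, and any rolling or summary feature: learn it from the training period, then apply it unchanged to later data.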
Trap 4: Watching discrimination but ignoring calibration. AUC may stay stable while predicted probabilities drift. The model can still rank cases correctly while the absolute risks become unusable. Under temporal drift, you have to watch both.
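A small sketch can show how discrimination and calibration come apart. The functions `auc` and `calibration_gap` below are hypothetical helpers written for illustration, and the labels and probabilities are made-up toy data. The gap used here (mean predicted risk minus observed event rate) is only one simple calibration check among several.

```python
def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def calibration_gap(y_true, y_prob):
    """Mean predicted risk minus observed event rate (0 means no overall bias)."""
    return sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)

# Toy data (assumed): the drifted scores keep the ranking but halve every risk.
y = [0, 0, 1, 0, 1, 1]
p_ok = [0.1, 0.3, 0.4, 0.5, 0.7, 0.9]
p_drifted = [p * 0.5 for p in p_ok]

print(auc(y, p_ok), auc(y, p_drifted))  # identical: ranking is unchanged
print(calibration_gap(y, p_ok), calibration_gap(y, p_drifted))  # the gap widens
```

Because AUC depends only on the ordering of scores, any monotone distortion of the probabilities leaves it untouched, which is exactly why it cannot be the only metric under drift.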
3. How to Do Temporal Validation Properly
Split by time, not by row number. Pick a cutoff, train on what came before, and validate on what came after. That mirrors deployment: history predicts an unknown future.
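As a minimal sketch of a cutoff-based split, assuming each record is a dictionary with a `date` field (the field name, the `temporal_split` helper, and the sample rows are all illustrative assumptions):

```python
from datetime import date

def temporal_split(rows, cutoff, time_key="date"):
    """Train on rows strictly before `cutoff`; validate on rows at or after it."""
    train = [r for r in rows if r[time_key] < cutoff]
    valid = [r for r in rows if r[time_key] >= cutoff]
    return train, valid

# Toy records (assumed), deliberately out of chronological order.
rows = [
    {"date": date(2018, 3, 1), "y": 0},
    {"date": date(2021, 6, 1), "y": 1},
    {"date": date(2019, 7, 15), "y": 1},
    {"date": date(2020, 1, 10), "y": 0},
]

train, valid = temporal_split(rows, cutoff=date(2020, 1, 1))

# Sanity check worth keeping in a real pipeline: no training row is later
# than any validation row.
assert max(r["date"] for r in train) < min(r["date"] for r in valid)
```

The split keys on the timestamp rather than on row position, so it stays correct even if the file is not sorted by time.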
Use walk-forward evaluation when one cutoff is too fragile. Keep sliding the training window forward and predict only the segment immediately after it. The names vary by field, but the principle is constant: the training set always sits in the validation set’s past.
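One way to sketch walk-forward folds, assuming the data can be grouped into ordered periods such as years. The generator `walk_forward_folds` and its `min_train` and `expanding` parameters are hypothetical names chosen for this example.

```python
def walk_forward_folds(periods, min_train=2, expanding=True):
    """Yield (train_periods, test_period) pairs in chronological order.

    expanding=True keeps all history in training; expanding=False uses a
    sliding window of the `min_train` most recent periods.
    """
    ordered = sorted(set(periods))
    for i in range(min_train, len(ordered)):
        start = 0 if expanding else i - min_train
        yield ordered[start:i], ordered[i]

folds = list(walk_forward_folds([2016, 2017, 2018, 2019, 2020]))
for train_years, test_year in folds:
    print(train_years, "->", test_year)
# [2016, 2017] -> 2018
# [2016, 2017, 2018] -> 2019
# [2016, 2017, 2018, 2019] -> 2020
```

Whether to use an expanding or a sliding window is a judgment call: expanding windows use more data, while sliding windows forget older regimes faster. Either way, every test period sits strictly after its training periods.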
Prefer external validation when the claim is external. Temporal validation asks whether the model generalizes across time. Geographic validation asks whether it generalizes across environments. Internal cross-validation alone cannot answer those questions.
4. How to Handle a Special Period Like the Pandemic
COVID was a textbook regime shift. Bed availability, staffing, discharge thresholds, and access to community care all changed at once. Inputs shifted, baseline outcome rates shifted, and even feature-outcome relationships shifted.
The tempting move is to train on the crisis because it looks like a useful stress case. But that can backfire.
Train in peacetime, test in wartime. We usually want the model to learn durable regularities, not the distortions of one damaged environment. During the pandemic, admission thresholds, discharge timing, and community care were warped by scarcity. If that period is mixed into training, the model may learn the wound of the era as if it were the disease of the patient.
The cleaner design is to train on relatively stable pre-pandemic data, then treat pandemic and post-pandemic years as stress-test arenas. If the model still identifies high-risk cases there, it has captured something more durable than the quirks of one year.
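This design can be sketched as an era-based partition. The boundary dates below are placeholders chosen purely for illustration; the right boundaries depend on the setting and are not specified here. The names `era_of` and `split_by_era` are likewise hypothetical.

```python
from datetime import date

# Hypothetical era boundaries; choose these from domain knowledge.
PANDEMIC_START = date(2020, 3, 1)
PANDEMIC_END = date(2022, 1, 1)

def era_of(d):
    """Label a date as pre-pandemic, pandemic, or post-pandemic."""
    if d < PANDEMIC_START:
        return "pre"
    if d < PANDEMIC_END:
        return "pandemic"
    return "post"

def split_by_era(rows, time_key="date"):
    """Return pre-pandemic rows for training and later eras as test arenas."""
    buckets = {"pre": [], "pandemic": [], "post": []}
    for r in rows:
        buckets[era_of(r[time_key])].append(r)
    arenas = {"pandemic": buckets["pandemic"], "post": buckets["post"]}
    return buckets["pre"], arenas

# Toy records (assumed).
rows = [
    {"date": date(2019, 5, 1)},
    {"date": date(2020, 6, 1)},
    {"date": date(2022, 8, 1)},
    {"date": date(2018, 1, 1)},
]
train, arenas = split_by_era(rows)
```

Reporting metrics separately for each arena, rather than pooling them, keeps the stress-test result from being averaged into the ordinary one.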
So keep this distinction in mind:
Disruptive data used for testing is a touchstone for the model’s mettle; used for training, it drags the model’s baseline off course.
When drift appears, recalibrate before rebuilding. A temporal split makes drift visible; a mixed split often averages it away. In many cases, shifting the overall risk level, and sometimes the calibration slope, is cheaper and cleaner than retraining a large model from scratch. Temporal validation is not just an evaluation trick. It is a rehearsal for how the model will age.
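A minimal sketch of the cheapest form of recalibration, shifting the overall risk level, is below. It adds a single constant to the logit of every prediction so that the mean predicted risk matches the observed event rate, which is the same condition that intercept-only logistic recalibration solves. The function names and the toy data are assumptions, and the sketch requires predicted probabilities strictly between 0 and 1 and an observed rate that is neither 0 nor 1. Adjusting the calibration slope as well would need a two-parameter fit, which is not shown here.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_intercept_shift(y_true, y_prob, lo=-10.0, hi=10.0, iters=60):
    """Find the logit shift that makes mean predicted risk equal the event rate.

    Uses bisection, which works because the mean prediction increases
    monotonically with the shift.
    """
    target = sum(y_true) / len(y_true)
    logits = [logit(p) for p in y_prob]
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        mean_pred = sum(sigmoid(z + mid) for z in logits) / len(logits)
        if mean_pred < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def recalibrate(y_prob, shift):
    """Apply the fitted shift on the logit scale; the ranking is preserved."""
    return [sigmoid(logit(p) + shift) for p in y_prob]

# Toy data (assumed): the model underestimates risk in the new period.
y = [0, 0, 1, 0, 1, 1]
p_drifted = [0.05, 0.15, 0.20, 0.25, 0.35, 0.45]

shift = fit_intercept_shift(y, p_drifted)
p_recal = recalibrate(p_drifted, shift)
```

In practice the shift should be fitted on a recent labeled sample and then checked on data that comes after it, for the same reason the main model is validated forward in time.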
The takeaway: a random split tests recitation, a temporal split tests real understanding; disruptive data is a touchstone when you test on it and a poison when you train on it; samples can be independent, but eras are not.