Power_Outage_Prediction
By Alex Collier
Introduction to the Dataset and Prediction Question
The dataset we are using focuses on major power outages across various regions. It contains detailed records of outages, including the causes, durations, affected populations, and geographic information. This dataset offers insights into the factors that contribute to power outages and the severity of their impacts.
My central question is: "What causes most major power outages, and can we predict the most likely cause of an outage?"
Understanding the root causes of power outages and predicting them is crucial for improving power grid reliability, preparing for natural disasters, and reducing the socioeconomic impacts of outages. Power outages can disrupt daily life, damage economies, and threaten public safety. Utilities, policymakers, and researchers can leverage insights from this analysis to better allocate resources, design preventative measures, and strengthen infrastructure.
Dataset Variables and Descriptions
The original DataFrame contains 1534 rows, corresponding to 1534 outages, and 57 columns. Listed below are just some of the columns provided to us.
Column Name | Description |
---|---|
YEAR | Year when the outage occurred. |
MONTH | Month when the outage occurred. |
U.S._STATE | Full name of the state where the outage occurred. |
POSTAL.CODE | Two-letter postal abbreviation for the state. |
CLIMATE.REGION | Climate region associated with the location of the outage. |
ANOMALY.LEVEL | Severity level of climate anomalies during the outage. |
OUTAGE.START.DATE | Date when the outage began. |
OUTAGE.RESTORATION.DATE | Date when the outage was restored. |
CAUSE.CATEGORY | General category of the cause of the outage (e.g., severe weather). |
OUTAGE.DURATION | Total duration of the outage in minutes. |
DEMAND.LOSS.MW | Total megawatt loss in demand during the outage. |
CUSTOMERS.AFFECTED | Total number of customers affected by the outage. |
RES.PRICE | Residential electricity price in cents per kilowatt-hour. |
POPULATION | Population of the affected region. |
POPDEN_URBAN | Population density in urban areas (persons per square mile). |
POPDEN_RURAL | Population density in rural areas (persons per square mile). |
Data Cleaning and Preprocessing
The following steps were undertaken to clean and preprocess the dataset, ensuring its readiness for analysis and modeling:
- Handling Missing Values
  - Action Taken: Missing values in critical columns such as CAUSE.CATEGORY, OUTAGE.DURATION, and CUSTOMERS.AFFECTED were either filled with appropriate imputed values (like median for numeric columns) or removed if imputation wasn't feasible.
  - Reasoning: Missing values could bias analyses, especially when predicting causes or durations of outages. Imputation or removal ensured the integrity of the dataset for training machine learning models.
  - Effect on Analysis: Reduced the dataset size slightly, but ensured clean input for models and accurate results.
- Correcting Data Types
  - Action Taken: Converted columns like OUTAGE.START.DATE and OUTAGE.START.TIME into a single datetime column (OUTAGE.START). Similarly, OUTAGE.RESTORATION was created as a combined column.
  - Reasoning: Ensured that timestamps could be used effectively in duration calculations and time-based analyses.
  - Effect on Analysis: Enabled easy calculations of OUTAGE.DURATION by subtracting OUTAGE.START from OUTAGE.RESTORATION.
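The two cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact code used for the project: the helper name clean_outages, the OUTAGE.RESTORATION.TIME column, the string date format, and the toy rows are all assumptions made for the example.

```python
import pandas as pd

def clean_outages(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute numeric gaps and build combined timestamps."""
    # A missing cause label cannot be imputed sensibly, so drop those rows.
    out = df.dropna(subset=["CAUSE.CATEGORY"]).copy()
    # Fill numeric gaps with the column median.
    for col in ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED"]:
        out[col] = out[col].fillna(out[col].median())
    # Combine separate date and time strings into single datetime columns.
    # OUTAGE.RESTORATION.TIME is assumed to exist alongside the date column.
    out["OUTAGE.START"] = pd.to_datetime(
        out["OUTAGE.START.DATE"] + " " + out["OUTAGE.START.TIME"], errors="coerce"
    )
    out["OUTAGE.RESTORATION"] = pd.to_datetime(
        out["OUTAGE.RESTORATION.DATE"] + " " + out["OUTAGE.RESTORATION.TIME"],
        errors="coerce",
    )
    return out

# Toy rows, invented purely to exercise the helper.
toy = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather", None, "intentional attack"],
    "OUTAGE.DURATION": [120.0, 60.0, None],
    "CUSTOMERS.AFFECTED": [1000.0, 500.0, None],
    "OUTAGE.START.DATE": ["2011-07-01", "2012-01-05", "2013-03-10"],
    "OUTAGE.START.TIME": ["17:00:00", "08:30:00", "23:15:00"],
    "OUTAGE.RESTORATION.DATE": ["2011-07-01", "2012-01-05", "2013-03-11"],
    "OUTAGE.RESTORATION.TIME": ["19:00:00", "09:30:00", "01:15:00"],
})

cleaned = clean_outages(toy)
# Duration in minutes recomputed from the combined timestamps.
elapsed_minutes = (
    cleaned["OUTAGE.RESTORATION"] - cleaned["OUTAGE.START"]
).dt.total_seconds() / 60
print(cleaned[["OUTAGE.START", "OUTAGE.RESTORATION", "OUTAGE.DURATION"]])
```

Dropping rows before computing the median means the imputed value reflects only the rows that remain in the dataset.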
Data Exploration
Univariate: Outages by Cause Category
This bar chart displays the distribution of power outages by cause category. I wanted to see which causes account for most major power outages.
Bivariate: Outage Duration by Cause Category
The plot below shows the relation between outage duration and cause category. It shows that outages with the longest duration tend to be from a fuel supply emergency.
Interesting Aggregates: Average Outage Duration by Climate Region and Cause Category
This heatmap displays the average outage duration across various climate regions and cause categories. It highlights that outages caused by severe weather in the East North Central region tend to last significantly longer, emphasizing the need for targeted infrastructure resilience strategies in that area.
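An aggregate like the one behind this heatmap could be built with a pivot table. The sketch below assumes the column names from the table above and uses a few made-up rows, so the numbers are illustrative only.

```python
import pandas as pd

# Toy rows, invented for illustration; real values come from the outage dataset.
toy = pd.DataFrame({
    "CLIMATE.REGION": [
        "East North Central", "East North Central", "Northwest", "Northwest",
    ],
    "CAUSE.CATEGORY": [
        "severe weather", "severe weather", "intentional attack", "severe weather",
    ],
    "OUTAGE.DURATION": [3000.0, 5000.0, 100.0, 1200.0],
})

# Mean outage duration (minutes) for each region/cause pair.
avg_duration = toy.pivot_table(
    index="CLIMATE.REGION",
    columns="CAUSE.CATEGORY",
    values="OUTAGE.DURATION",
    aggfunc="mean",
)
print(avg_duration)
```

Region/cause pairs with no recorded outages show up as NaN, which a heatmap would typically render as blank cells.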
Prediction Problem and Type
This is a regression problem, as the goal is to predict the duration of a power outage (a continuous variable) based on various characteristics of the region, the cause of the outage, and population-specific metrics.
Response Variable
The response variable is OUTAGE.DURATION, which measures the length of a power outage in minutes. This variable was chosen because understanding and predicting outage duration is crucial for resource allocation, planning restoration efforts, and minimizing disruption caused by power outages.
Metric for Evaluation
The primary evaluation metric is Mean Absolute Error (MAE), as it provides an interpretable measure of the average error in minutes, which is directly relevant to stakeholders. MAE was chosen over other metrics (e.g., RMSE) because it is less sensitive to large errors and better captures the typical error magnitude.
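To make the difference between the two metrics concrete, here is a small worked example with made-up durations in which one long outage is badly underpredicted. The function names and numbers are illustrative assumptions, not results from the project.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: squaring gives large errors extra weight."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Durations in minutes; the last prediction misses by 1000 minutes.
actual = [100, 200, 300, 1400]
predicted = [110, 190, 310, 400]

print(f"MAE:  {mae(actual, predicted):.1f} minutes")
print(f"RMSE: {rmse(actual, predicted):.1f} minutes")
```

Three of the four errors are only 10 minutes, yet the single large miss pulls RMSE to roughly twice the MAE, which is the sensitivity described above.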
Features Used for Prediction
Features included in the model are CAUSE.CATEGORY, CLIMATE.REGION, NERC.REGION, POPULATION, and other region- or cause-specific metrics such as POPDEN_URBAN and POPDEN_RURAL. These features are selected because they are available at the time of prediction and provide critical context about the outage's circumstances. Features like restoration times are excluded as they would not be known before the outage occurs.
Hypothesis
The hypothesis is that the duration of a power outage can be accurately predicted using information about the region's climate, population metrics, and the cause of the outage. For instance, outages caused by severe weather in densely populated areas might have longer durations due to infrastructure challenges and resource constraints.
Baseline Model
My model is a multiclass classifier using the features CLIMATE.REGION and OUTAGE.DURATION to predict the cause of a major outage. This information is essential for energy companies and policymakers to better understand the underlying causes of outages and allocate resources effectively to mitigate future disruptions.
The features are:
- CLIMATE.REGION (nominal): Represents the geographic climate region of the outage, which can influence the type of weather or environmental factors leading to outages.
- OUTAGE.DURATION (quantitative): Measures the length of the outage in minutes, providing insight into the severity of the event and potentially the type of cause.
The target variable was CAUSE.CATEGORY, which includes categories such as "severe weather," "intentional attack," and others. These were left as-is to allow the model to handle multiclass classification.
The performance of this model was moderate, with an overall accuracy of 65% on the test set. The model performed well for large categories like Severe Weather and Intentional Attack but struggled with less frequent causes such as Equipment Failure and Fuel Supply Emergency. While this provides a good starting point, improvements are needed to handle class imbalances and better predict underrepresented categories.
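A baseline of this shape could be set up as a scikit-learn pipeline that one-hot encodes CLIMATE.REGION and passes OUTAGE.DURATION through unchanged. The write-up does not name the baseline algorithm, so the DecisionTreeClassifier below is an assumption, and the toy rows are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "CLIMATE.REGION": ["Northeast", "Northeast", "Northwest", "Northwest",
                       "South", "South"],
    "OUTAGE.DURATION": [3000.0, 2500.0, 60.0, 45.0, 4000.0, 30.0],
    "CAUSE.CATEGORY": ["severe weather", "severe weather",
                       "intentional attack", "intentional attack",
                       "severe weather", "intentional attack"],
})

# One-hot encode the nominal feature; the numeric feature passes through.
preprocess = ColumnTransformer(
    transformers=[
        ("region", OneHotEncoder(handle_unknown="ignore"), ["CLIMATE.REGION"]),
    ],
    remainder="passthrough",
)

# Assumed classifier: the source does not say which algorithm the baseline used.
baseline = Pipeline([
    ("preprocess", preprocess),
    ("clf", DecisionTreeClassifier(max_depth=3, random_state=0)),
])

X = toy[["CLIMATE.REGION", "OUTAGE.DURATION"]]
y = toy["CAUSE.CATEGORY"]
baseline.fit(X, y)
predictions = baseline.predict(X)
print(predictions)
```

In practice the pipeline would be fit on a training split and scored on a held-out test split, as the reported test accuracy implies.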
Final Model
My final model is a multiclass classifier designed to predict the cause of major power outages. I incorporated advanced features: CLIMATE.REGION, OUTAGE.DURATION, DEMAND.LOSS.MW, POP_DENSITY, and an engineered interaction feature (INTERACTION). These features were selected based on their relevance to the data-generating process and their potential to enhance the model’s predictive capabilities. For instance:
- CLIMATE.REGION: Nominal feature capturing geographic climate variations that influence the types of outage causes prevalent in each region.
- OUTAGE.DURATION: Quantitative feature reflecting the length of outages, which varies based on the cause and recovery resources.
- DEMAND.LOSS.MW: Quantitative feature representing the scale of energy loss, providing insights into the severity of the outage.
- POP_DENSITY: Quantitative feature accounting for how population density affects infrastructure stress and recovery times.
- INTERACTION: Engineered feature combining DEMAND.LOSS.MW and POP_DENSITY, highlighting the interplay between energy demand and population concentration.
These features improved the model by aligning closely with the real-world factors influencing power outages. For example, densely populated regions with high energy demands may experience longer durations during severe weather events due to infrastructure challenges. Adding features that capture these relationships enhances the model's ability to distinguish between outage causes effectively.
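One plausible way to build the INTERACTION feature is an element-wise product of the two columns. The write-up only says the columns were combined, so the product below is an assumption, as are the helper name add_interaction and the toy values.

```python
import pandas as pd

def add_interaction(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a product interaction term as a new column."""
    out = df.copy()
    # Assumed definition: demand loss multiplied by population density.
    out["INTERACTION"] = out["DEMAND.LOSS.MW"] * out["POP_DENSITY"]
    return out

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "DEMAND.LOSS.MW": [100.0, 250.0],
    "POP_DENSITY": [2000.0, 50.0],
})
with_interaction = add_interaction(toy)
print(with_interaction)
```

A product term lets the model treat a large demand loss differently in dense and sparse areas, though tree ensembles can often approximate such interactions on their own.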
Modeling Algorithm and Hyperparameter Selection
I selected a Random Forest Classifier for the final model due to its ability to handle mixed feature types and capture non-linear relationships.
Hyperparameter tuning was performed using GridSearchCV to identify the best configuration. The optimal hyperparameters were:
- max_depth: 25 (lets each tree grow up to 25 levels deep to capture complex patterns, while the cap limits overfitting).
- n_estimators: 100 (uses 100 trees for better ensemble stability).
- min_samples_split: 5 (requires at least 5 samples before a node can be split, which keeps splits from fitting individual outliers).
These hyperparameters improved the model's ability to generalize without overfitting, allowing it to perform well across all classes.
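A grid search of this kind might look like the sketch below. The candidate values in param_grid, the 3-fold setting, and the synthetic data from make_classification are assumptions for illustration; only the three hyperparameter names and the reported optimal values come from the write-up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the outage features and cause labels.
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Assumed search grid; it includes the reported optimal values (25, 100, 5).
param_grid = {
    "max_depth": [5, 25],
    "n_estimators": [50, 100],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Because every combination is refit once per fold, the cost grows quickly with the grid size, so a small grid around plausible values is usually a sensible starting point.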
Performance Comparison
The Final Model achieved an accuracy of 0.86, a significant improvement over the Baseline Model's accuracy of 0.65. The F1 score for most classes exceeded 0.85. This demonstrates the model's enhanced ability to distinguish between outage causes, even in less frequent categories like "fuel supply emergency" and "islanding."
Visualization of Performance
A confusion matrix provides a detailed view of the model’s performance across all classes. The matrix highlights high true positive rates for dominant classes like "severe weather" and "intentional attack," while smaller categories such as "equipment failure" are correctly predicted in most cases. This reinforces the model's robustness in handling multiclass classification tasks.
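For readers who want to reproduce this kind of evaluation, the sketch below computes a confusion matrix and per-class F1 scores with scikit-learn. The labels and predictions are a small invented example, not the model's actual output.

```python
from sklearn.metrics import confusion_matrix, f1_score

labels = ["severe weather", "intentional attack", "equipment failure"]

# Invented true and predicted causes for six outages.
y_true = ["severe weather", "severe weather", "severe weather",
          "intentional attack", "intentional attack", "equipment failure"]
y_pred = ["severe weather", "severe weather", "intentional attack",
          "intentional attack", "intentional attack", "severe weather"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# One F1 score per class; zero_division=0 scores a never-predicted class as 0.
per_class_f1 = f1_score(
    y_true, y_pred, labels=labels, average=None, zero_division=0
)

print(cm)
for name, score in zip(labels, per_class_f1):
    print(f"{name}: F1 = {score:.2f}")
```

Reporting F1 per class alongside accuracy makes it visible when a rare category, like the single equipment failure here, is never predicted correctly.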
The inclusion of advanced features and hyperparameter tuning addressed critical nuances in the data, leading to substantial performance gains. These insights can guide policymakers and utility companies in targeting infrastructure improvements and disaster mitigation strategies.