Power_Outage_Prediction

An EECS398 project

By Alex Collier

Introduction to the Dataset and Prediction Question

This project uses a dataset of major power outages across U.S. states. It contains detailed records of each outage, including its cause, duration, affected population, and geographic information. The dataset offers insight into the factors that contribute to power outages and the severity of their impacts.

My central question is: "What causes most major power outages, and can we predict the most likely cause of an outage?"

Understanding the root causes of power outages and predicting them is crucial for improving power grid reliability, preparing for natural disasters, and reducing the socioeconomic impacts of outages. Power outages can disrupt daily life, damage economies, and threaten public safety. Utilities, policymakers, and researchers can leverage insights from this analysis to better allocate resources, design preventative measures, and strengthen infrastructure.

Dataset Variables and Descriptions

The original DataFrame contains 1534 rows, one per outage, and 57 columns. The table below lists a subset of those columns.

Column Name	Description
YEAR	Year when the outage occurred.
MONTH	Month when the outage occurred.
U.S._STATE	Full name of the state where the outage occurred.
POSTAL.CODE	Two-letter postal abbreviation for the state.
CLIMATE.REGION	Climate region associated with the location of the outage.
ANOMALY.LEVEL	Severity level of climate anomalies during the outage.
OUTAGE.START.DATE	Date when the outage began.
OUTAGE.RESTORATION.DATE	Date when the outage was restored.
CAUSE.CATEGORY	General category of the cause of the outage (e.g., severe weather).
OUTAGE.DURATION	Total duration of the outage in minutes.
DEMAND.LOSS.MW	Total megawatt loss in demand during the outage.
CUSTOMERS.AFFECTED	Total number of customers affected by the outage.
RES.PRICE	Residential electricity price in cents per kilowatt-hour.
POPULATION	Population of the affected region.
POPDEN_URBAN	Population density in urban areas (persons per square mile).
POPDEN_RURAL	Population density in rural areas (persons per square mile).

Data Cleaning and Preprocessing

The following steps were undertaken to clean and preprocess the dataset, ensuring its readiness for analysis and modeling:

Handling Missing Values
- Action Taken: Missing values in critical columns such as CAUSE.CATEGORY, OUTAGE.DURATION, and CUSTOMERS.AFFECTED were either filled with appropriate imputed values (like median for numeric columns) or removed if imputation wasn't feasible.
- Reasoning: Missing values could bias analyses, especially when predicting causes or durations of outages. Imputation or removal ensured the integrity of the dataset for training machine learning models.
- Effect on Analysis: Reduced the dataset size slightly, but ensured clean input for models and accurate results.
Correcting Data Types
- Action Taken: Combined OUTAGE.START.DATE and OUTAGE.START.TIME into a single datetime column (OUTAGE.START). The restoration date and time were combined in the same way into OUTAGE.RESTORATION.
- Reasoning: Ensured that timestamps could be used effectively in duration calculations and time-based analyses.
- Effect on Analysis: Enabled easy calculations of OUTAGE.DURATION by subtracting OUTAGE.START from OUTAGE.RESTORATION.
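
The two cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration rather than the project's actual code: the toy rows are invented, the date and time columns are assumed to be strings, OUTAGE.RESTORATION.TIME is assumed to exist alongside OUTAGE.RESTORATION.DATE, and `clean_outages` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows standing in for the real data; every value here is invented.
raw = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11", "2010-10-26"],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00", "20:00:00"],
    "OUTAGE.RESTORATION.DATE": ["2011-07-03", "2014-05-11", "2010-10-28"],
    "OUTAGE.RESTORATION.TIME": ["20:00:00", "18:39:00", "22:00:00"],
    "CAUSE.CATEGORY": ["severe weather", "intentional attack", None],
    "CUSTOMERS.AFFECTED": [70000.0, None, 68200.0],
})


def clean_outages(df):
    """Hypothetical helper: combine timestamps, impute, and drop bad rows."""
    out = df.copy()
    # Merge each date/time pair into one datetime column.
    for prefix in ("OUTAGE.START", "OUTAGE.RESTORATION"):
        out[prefix] = pd.to_datetime(
            out[f"{prefix}.DATE"] + " " + out[f"{prefix}.TIME"]
        )
    # Duration in minutes, from the two combined timestamps.
    elapsed = out["OUTAGE.RESTORATION"] - out["OUTAGE.START"]
    out["OUTAGE.DURATION"] = elapsed.dt.total_seconds() / 60
    # Median imputation for a numeric column.
    median_customers = out["CUSTOMERS.AFFECTED"].median()
    out["CUSTOMERS.AFFECTED"] = out["CUSTOMERS.AFFECTED"].fillna(median_customers)
    # Rows without a cause cannot be used to train a cause classifier.
    return out.dropna(subset=["CAUSE.CATEGORY"])


clean = clean_outages(raw)
print(clean[["OUTAGE.START", "OUTAGE.DURATION", "CUSTOMERS.AFFECTED"]])
```

In this sketch the median is computed before rows are dropped, so the dropped row still contributes to the imputed value; whether that ordering matches the original analysis is not stated above.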

Data Exploration

Univariate: Outages by Cause Category

This bar chart displays the distribution of power outages by cause category. I wanted to see which causes account for most major outages.

Bivariate: Outage Duration by Cause Category

The plot below shows the relationship between outage duration and cause category. The longest outages tend to come from fuel supply emergencies.

Interesting Aggregates: Average Outage Duration by Climate Region and Cause Category

This heatmap displays the average outage duration across various climate regions and cause categories. It highlights that outages caused by severe weather in the East North Central region tend to last significantly longer, emphasizing the need for targeted infrastructure resilience strategies in that area.
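
An aggregate like the one behind this heatmap can be computed with a pandas pivot table. The sketch below uses a handful of invented rows, so the numbers are illustrative only and do not reproduce the heatmap's values.

```python
import pandas as pd

# Invented rows; durations are in minutes.
toy = pd.DataFrame({
    "CLIMATE.REGION": [
        "East North Central", "East North Central",
        "Northeast", "Northeast", "Northeast",
    ],
    "CAUSE.CATEGORY": [
        "severe weather", "severe weather",
        "severe weather", "intentional attack", "intentional attack",
    ],
    "OUTAGE.DURATION": [4000.0, 6000.0, 3000.0, 100.0, 300.0],
})

# One row per climate region, one column per cause, mean duration in each cell.
avg_duration = toy.pivot_table(
    index="CLIMATE.REGION",
    columns="CAUSE.CATEGORY",
    values="OUTAGE.DURATION",
    aggfunc="mean",
)
print(avg_duration)
```

Region and cause combinations with no outages show up as NaN, which a heatmap typically renders as an empty cell.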

Prediction Problem and Type

This is a regression problem, as the goal is to predict the duration of a power outage (a continuous variable) based on various characteristics of the region, the cause of the outage, and population-specific metrics.

Response Variable

The response variable is OUTAGE.DURATION, which measures the length of a power outage in minutes. This variable was chosen because understanding and predicting outage duration is crucial for resource allocation, planning restoration efforts, and minimizing disruption caused by power outages.

Metric for Evaluation

The primary evaluation metric is Mean Absolute Error (MAE), as it provides an interpretable measure of the average error in minutes, which is directly relevant to stakeholders. MAE was chosen over other metrics (e.g., RMSE) because it is less sensitive to large errors and better captures the typical error magnitude.
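
To make the difference between the two metrics concrete, here is a small worked example in plain Python. The durations are invented; the point is only how one large miss moves each metric.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: squares errors before averaging."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


actual = [60, 120, 180, 240]       # outage durations in minutes (invented)
steady = [90, 150, 150, 210]       # every prediction is off by 30 minutes
one_miss = [60, 120, 180, 1240]    # three exact predictions, one off by 1000

print(mae(actual, steady), rmse(actual, steady))      # 30.0 30.0
print(mae(actual, one_miss), rmse(actual, one_miss))  # 250.0 500.0
```

When all errors are the same size the two metrics agree. With a single 1000-minute miss, RMSE is twice the MAE, because squaring gives that one error much more weight.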

Features Used for Prediction

Features included in the model are CAUSE.CATEGORY, CLIMATE.REGION, NERC.REGION, POPULATION, and other region- or cause-specific metrics such as POPDEN_URBAN and POPDEN_RURAL. These features are selected because they are available at the time of prediction and provide critical context about the outage's circumstances. Features like restoration times are excluded as they would not be known before the outage occurs.

Hypothesis

The hypothesis is that the duration of a power outage can be accurately predicted using information about the region's climate, population metrics, and the cause of the outage. For instance, outages caused by severe weather in densely populated areas might have longer durations due to infrastructure challenges and resource constraints.

Baseline Model

My model is a multiclass classifier using the features CLIMATE.REGION and OUTAGE.DURATION to predict the cause of a major outage. This information is essential for energy companies and policymakers to better understand the underlying causes of outages and allocate resources effectively to mitigate future disruptions.

The features are:

CLIMATE.REGION (nominal): Represents the geographic climate region of the outage, which can influence the type of weather or environmental factors leading to outages.
OUTAGE.DURATION (quantitative): Measures the length of the outage in minutes, providing insight into the severity of the event and potentially the type of cause.

The target variable was CAUSE.CATEGORY, which includes categories such as "severe weather," "intentional attack," and others. These were left as-is to allow the model to handle multiclass classification.

The performance of this model was moderate, with an overall accuracy of 65% on the test set. The model performed well for large categories like Severe Weather and Intentional Attack but struggled with less frequent causes such as Equipment Failure and Fuel Supply Emergency. While this provides a good starting point, improvements are needed to handle class imbalances and better predict underrepresented categories.
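
A baseline of this shape can be sketched as a scikit-learn pipeline that one-hot encodes CLIMATE.REGION and passes OUTAGE.DURATION through unchanged. The estimator used for the baseline is not named above, so the `DecisionTreeClassifier` here is an assumption, and the data is synthetic; the accuracy it prints has no relation to the 65% reported.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data: durations are loosely tied to the cause.
rng = np.random.default_rng(0)
n = 300
cause = rng.choice(
    ["severe weather", "intentional attack", "equipment failure"],
    size=n, p=[0.55, 0.35, 0.10],
)
typical_minutes = pd.Series(cause).map(
    {"severe weather": 3000, "intentional attack": 200, "equipment failure": 1000}
).to_numpy()
toy = pd.DataFrame({
    "CLIMATE.REGION": rng.choice(["Northeast", "South", "West"], size=n),
    "OUTAGE.DURATION": rng.exponential(typical_minutes),
    "CAUSE.CATEGORY": cause,
})

X = toy[["CLIMATE.REGION", "OUTAGE.DURATION"]]
y = toy["CAUSE.CATEGORY"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One-hot encode the nominal feature; leave the quantitative one as-is.
preprocess = ColumnTransformer(
    [("region", OneHotEncoder(handle_unknown="ignore"), ["CLIMATE.REGION"])],
    remainder="passthrough",
)
baseline = Pipeline([
    ("preprocess", preprocess),
    # Assumption: the original baseline estimator is not specified.
    ("clf", DecisionTreeClassifier(max_depth=5, random_state=0)),
])
baseline.fit(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

The split is stratified so that the rare class appears in both the training and test sets, which matters when some causes are infrequent.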

Final Model

My final model is a multiclass classifier designed to predict the cause of major power outages. I incorporated advanced features: CLIMATE.REGION, OUTAGE.DURATION, DEMAND.LOSS.MW, POP_DENSITY, and an engineered interaction feature (INTERACTION). These features were selected based on their relevance to the data-generating process and their potential to enhance the model’s predictive capabilities. For instance:

CLIMATE.REGION: Nominal feature capturing geographic climate variations that influence the types of outage causes prevalent in each region.
OUTAGE.DURATION: Quantitative feature reflecting the length of outages, which varies based on the cause and recovery resources.
DEMAND.LOSS.MW: Quantitative feature representing the scale of energy loss, providing insights into the severity of the outage.
POP_DENSITY: Quantitative feature accounting for how population density affects infrastructure stress and recovery times.
INTERACTION: Engineered feature combining DEMAND.LOSS.MW and POP_DENSITY, highlighting the interplay between energy demand and population concentration.

These features improved the model by aligning closely with the real-world factors influencing power outages. For example, densely populated regions with high energy demands may experience longer durations during severe weather events due to infrastructure challenges. Adding features that capture these relationships enhances the model's ability to distinguish between outage causes effectively.
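
The text does not say how INTERACTION combines its two inputs. One common choice is an element-wise product, which the sketch below assumes; the rows are invented and `add_interaction` is a hypothetical helper.

```python
import pandas as pd

# Invented rows for illustration.
toy = pd.DataFrame({
    "DEMAND.LOSS.MW": [250.0, 20.0, 1000.0],
    "POP_DENSITY": [2000.0, 150.0, 50.0],
})


def add_interaction(df):
    """Hypothetical helper. Assumption: the interaction is a simple product."""
    out = df.copy()
    out["INTERACTION"] = out["DEMAND.LOSS.MW"] * out["POP_DENSITY"]
    return out


with_interaction = add_interaction(toy)
print(with_interaction)
```

A product is large only when both demand loss and population density are large, which is one way to express the interplay described above.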

Modeling Algorithm and Hyperparameter Selection

I selected a Random Forest Classifier for the final model due to its ability to handle mixed feature types and capture non-linear relationships. Hyperparameter tuning was performed using GridSearchCV to identify the best configuration. The optimal hyperparameters were:

max_depth: 25 (let trees grow deep enough to capture complex patterns while still capping depth to limit overfitting).
n_estimators: 100 (increased the number of trees for better ensemble stability).
min_samples_split: 5 (required at least five samples in a node before splitting it, which keeps splits fairly fine-grained while avoiding splits on very small groups).

These hyperparameters improved the model's ability to generalize without overfitting, allowing it to perform well across all classes.
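
A search of this kind might look like the sketch below. The grid values other than the three reported optima, the 3-fold cross-validation, and the synthetic data from `make_classification` are all assumptions, so the best parameters it prints need not match the ones listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the outage features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

# Assumed grid: it includes the reported optima plus one alternative each.
param_grid = {
    "max_depth": [5, 25],
    "n_estimators": [50, 100],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # assumed number of folds
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```

GridSearchCV fits every combination in the grid on each fold and then refits the best one on all of the data, so `search.best_estimator_` is ready to use for prediction.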

Performance Comparison

The Final Model achieved an accuracy of 0.86, a significant improvement over the Baseline Model's accuracy of 0.65. The F1 score for most classes exceeded 0.85. This demonstrates the model's enhanced ability to distinguish between outage causes, even in less frequent categories like "fuel supply emergency" and "islanding."

Visualization of Performance

A confusion matrix provides a detailed view of the model’s performance across all classes. The matrix highlights high true positive rates for dominant classes like "severe weather" and "intentional attack," while smaller categories such as "equipment failure" are correctly predicted in most cases. This reinforces the model's robustness in handling multiclass classification tasks.
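
As a small illustration of how a confusion matrix and per-class F1 scores relate to overall accuracy, the sketch below uses ten hand-written labels. They are invented and are not the model's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ["severe weather", "intentional attack", "equipment failure"]

# Invented labels: 5 severe weather, 3 intentional attack, 2 equipment failure.
y_true = (
    ["severe weather"] * 5
    + ["intentional attack"] * 3
    + ["equipment failure"] * 2
)
y_pred = [
    "severe weather", "severe weather", "severe weather", "severe weather",
    "intentional attack",
    "intentional attack", "intentional attack", "intentional attack",
    "equipment failure", "severe weather",
]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(dict(zip(labels, per_class_f1.round(2))))
print(f"accuracy: {accuracy:.2f}")
```

Here accuracy is 0.80, yet the F1 score for the rarest class is only about 0.67, which is why per-class scores are worth reading alongside the headline accuracy.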

The inclusion of advanced features and hyperparameter tuning addressed critical nuances in the data, leading to substantial performance gains. These insights can guide policymakers and utility companies in targeting infrastructure improvements and disaster mitigation strategies.