Regression Analysis: Prediction from Data
What Is Regression Analysis and Why Do Engineers Need It?
Imagine you are monitoring an electric motor on a production line. You have sensor data: temperature, rotation speed, current, vibration. Can you predict the remaining motor life from these readings? Can you determine which factor affects performance the most?
Regression analysis is the mathematical tool that answers these questions. It builds a mathematical model describing the relationship between independent variables (inputs: temperature, load...) and a dependent variable (output: remaining life, energy consumption...).
Simply put: regression draws the best line (or curve) through the data points.
Simple Linear Regression: The Simplest Predictive Model
Simple linear regression describes a relationship between one independent variable x and a dependent variable y:
y = B0 + B1*x + e
where:
- B0 = Intercept -- the value of y when x = 0
- B1 = Slope -- how much y changes per unit change in x
- e = Random error -- what the model does not explain
Industrial example: The relationship between motor load (%) and power consumption (kW):
import numpy as np
# Measurement data from a factory motor
load_percent = np.array([20, 30, 40, 50, 60, 70, 80, 90, 100])
power_kw = np.array([4.2, 5.8, 7.1, 8.9, 10.5, 12.3, 13.8, 15.6, 17.2])
# Calculate regression coefficients using least squares
n = len(load_percent)
x_mean = np.mean(load_percent)
y_mean = np.mean(power_kw)
# Slope B1
numerator = np.sum((load_percent - x_mean) * (power_kw - y_mean))
denominator = np.sum((load_percent - x_mean) ** 2)
beta_1 = numerator / denominator
# Intercept B0
beta_0 = y_mean - beta_1 * x_mean
print(f"Equation: Power = {beta_0:.2f} + {beta_1:.4f} * Load")
print(f"Interpretation: each 1% load increase adds {beta_1:.4f} kW")
# Prediction at 75% load
load_new = 75
power_predicted = beta_0 + beta_1 * load_new
print(f"\nPrediction at {load_new}% load: {power_predicted:.2f} kW")
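As a sanity check, the hand-computed coefficients can be compared against NumPy's built-in np.polyfit, which solves the same least-squares problem (a sketch reusing the measurement data above):

```python
import numpy as np

# Same measurement data as above
load_percent = np.array([20, 30, 40, 50, 60, 70, 80, 90, 100])
power_kw = np.array([4.2, 5.8, 7.1, 8.9, 10.5, 12.3, 13.8, 15.6, 17.2])

# np.polyfit returns coefficients highest degree first: [B1, B0]
beta_1, beta_0 = np.polyfit(load_percent, power_kw, deg=1)
print(f"polyfit: Power = {beta_0:.2f} + {beta_1:.4f} * Load")
```

The two approaches agree to floating-point precision; the library call is simply the manual formula packaged up.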
Least Squares Method: How Do We Find the Best Line?
Among the infinitely many possible lines, we want the one that minimizes the Sum of Squared Errors (SSE):
SSE = sum of (yi - y_hat_i)^2 = sum of (yi - B0 - B1*xi)^2
Why squared rather than absolute values?
- Squares penalize large errors more heavily
- The function is differentiable -- the minimum can be found analytically
- Connected to the Gaussian distribution of errors
# Step-by-step least squares calculation
y_predicted = beta_0 + beta_1 * load_percent
residuals = power_kw - y_predicted
SSE = np.sum(residuals ** 2)
SST = np.sum((power_kw - y_mean) ** 2) # total sum of squares
print("Residuals (errors):")
for i in range(n):
    print(f"  Load {load_percent[i]}%: actual={power_kw[i]:.1f}, "
          f"predicted={y_predicted[i]:.2f}, error={residuals[i]:+.2f} kW")
print(f"\nSum of Squared Errors (SSE): {SSE:.4f}")
print(f"Total Sum of Squares (SST): {SST:.4f}")
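One property worth knowing: when the model includes an intercept, the least-squares residuals always sum to zero and are orthogonal to x (this follows from the normal equations). A quick numerical check on the motor data, refitting the line with np.polyfit:

```python
import numpy as np

load_percent = np.array([20, 30, 40, 50, 60, 70, 80, 90, 100])
power_kw = np.array([4.2, 5.8, 7.1, 8.9, 10.5, 12.3, 13.8, 15.6, 17.2])

beta_1, beta_0 = np.polyfit(load_percent, power_kw, deg=1)
residuals = power_kw - (beta_0 + beta_1 * load_percent)

# With an intercept, least-squares residuals sum to (numerically) zero...
print(f"Sum of residuals: {np.sum(residuals):.2e}")
# ...and are orthogonal to the input
print(f"Sum of x * residuals: {np.sum(load_percent * residuals):.2e}")
```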
R-Squared: Is the Model Any Good?
R-squared measures the proportion of variance in the data explained by the model:
R^2 = 1 - (SSE / SST)
| R-squared Value | Interpretation |
|---|---|
| 0.95 -- 1.00 | Excellent -- model explains most variance |
| 0.80 -- 0.95 | Good -- useful for prediction |
| 0.50 -- 0.80 | Moderate -- useful for trend understanding |
| < 0.50 | Weak -- additional variables needed |
R_squared = 1 - (SSE / SST)
print(f"R-squared = {R_squared:.4f}")
print(f"The model explains {R_squared*100:.1f}% of power consumption variance")
if R_squared > 0.95:
    print("Assessment: Excellent!")
elif R_squared > 0.80:
    print("Assessment: Good")
else:
    print("Assessment: Needs improvement")
Important warning: A high R-squared does not necessarily mean the model is correct. Always inspect the residuals visually -- if they are randomly scattered around zero, the model is appropriate; if a pattern appears, you may need a nonlinear model.
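To see why the warning matters, here is an illustrative sketch with synthetic curved data: a straight line earns a high R-squared, yet leaves an obvious U-shaped pattern in the residuals:

```python
import numpy as np

# Synthetic curved data: truly quadratic, no noise
x = np.linspace(0, 10, 20)
y = 0.5 * x**2

# Fit a straight line anyway
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)
R2 = 1 - np.sum(residuals**2) / np.sum((y - np.mean(y))**2)
print(f"R-squared of the straight line: {R2:.3f}")

# Despite the high R-squared, the residual signs form a clear U-shape:
# positive at both ends, negative in the middle
print("Residual signs:", "".join("+" if r > 0 else "-" for r in residuals))
```

The R-squared alone looks reassuring; only the residual pattern reveals that the model family is wrong.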
Multiple Regression: More Than One Independent Variable
In industrial reality, the output rarely depends on a single variable. Multiple regression uses several independent variables:
y = B0 + B1*x1 + B2*x2 + B3*x3 + ... + e
Industrial example: Predicting motor power consumption from load, ambient temperature, and humidity:
import numpy as np
# Data from 12 measurements
# [Load%, Temperature C, Humidity%]
X_data = np.array([
[40, 25, 50], [50, 28, 55], [60, 30, 60],
[70, 32, 45], [80, 35, 70], [45, 22, 40],
[55, 27, 65], [65, 33, 55], [75, 36, 50],
[85, 38, 75], [50, 24, 45], [90, 40, 60]
])
y_data = np.array([7.2, 8.8, 10.3, 12.1, 14.0, 7.8, 9.5, 11.5, 13.2, 15.1, 8.5, 16.0])
# Add column of ones for B0
n = len(y_data)
X_matrix = np.column_stack([np.ones(n), X_data])
# Solve least squares using linear algebra
# beta = (X^T X)^(-1) X^T y
XtX = X_matrix.T @ X_matrix
Xty = X_matrix.T @ y_data
beta = np.linalg.solve(XtX, Xty)
print("Multiple regression coefficients:")
print(f" B0 (intercept): {beta[0]:.4f}")
print(f" B1 (load): {beta[1]:.4f} kW per 1%")
print(f" B2 (temperature): {beta[2]:.4f} kW per 1 C")
print(f" B3 (humidity): {beta[3]:.4f} kW per 1%")
# Calculate R-squared
y_pred = X_matrix @ beta
SSE = np.sum((y_data - y_pred) ** 2)
SST = np.sum((y_data - np.mean(y_data)) ** 2)
R2 = 1 - SSE / SST
print(f"\nR-squared = {R2:.4f}")
# Prediction: load 72%, temperature 31 C, humidity 55%
new_input = np.array([1, 72, 31, 55])
prediction = new_input @ beta
print(f"\nPrediction (72%, 31 C, 55%): {prediction:.2f} kW")
Which variable matters most? Look at coefficient magnitudes after standardization -- the largest standardized coefficient points to the most influential variable.
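A minimal sketch of this idea on the motor data above, standardizing each variable to zero mean and unit variance before refitting (note: with correlated inputs such as load and temperature, even standardized coefficients should be read with care):

```python
import numpy as np

# Same multiple-regression data as above
X_data = np.array([
    [40, 25, 50], [50, 28, 55], [60, 30, 60],
    [70, 32, 45], [80, 35, 70], [45, 22, 40],
    [55, 27, 65], [65, 33, 55], [75, 36, 50],
    [85, 38, 75], [50, 24, 45], [90, 40, 60]
])
y_data = np.array([7.2, 8.8, 10.3, 12.1, 14.0, 7.8, 9.5, 11.5, 13.2, 15.1, 8.5, 16.0])

# Standardize every variable to zero mean and unit variance
X_std = (X_data - X_data.mean(axis=0)) / X_data.std(axis=0)
y_std = (y_data - y_data.mean()) / y_data.std()

# No intercept column needed: standardized data passes through the origin
beta_std, *_ = np.linalg.lstsq(X_std, y_std, rcond=None)
for name, b in zip(["load", "temperature", "humidity"], beta_std):
    print(f"  {name:12s}: {b:+.3f}")
```

Because all variables are now on the same scale, the coefficients are directly comparable in magnitude.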
Polynomial Regression: When the Relationship Is Curved
Not all relationships are linear. Polynomial regression fits curves:
y = B0 + B1*x + B2*x^2 + B3*x^3 + ...
Industrial example: Pump efficiency versus flow rate -- for a centrifugal pump this relationship is characteristically curved (an arch): efficiency rises to a peak at the best-efficiency point, then falls off on either side:
import numpy as np
# Centrifugal pump efficiency data
flow_rate = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]) # m3/h
efficiency = np.array([45, 62, 74, 82, 87, 89, 88, 84, 77, 65]) # %
# Linear regression (for comparison)
X_lin = np.column_stack([np.ones(10), flow_rate])
beta_lin = np.linalg.lstsq(X_lin, efficiency, rcond=None)[0]
y_lin = X_lin @ beta_lin
R2_lin = 1 - np.sum((efficiency - y_lin)**2) / np.sum((efficiency - np.mean(efficiency))**2)
# Quadratic polynomial regression
X_poly = np.column_stack([np.ones(10), flow_rate, flow_rate**2])
beta_poly = np.linalg.lstsq(X_poly, efficiency, rcond=None)[0]
y_poly = X_poly @ beta_poly
R2_poly = 1 - np.sum((efficiency - y_poly)**2) / np.sum((efficiency - np.mean(efficiency))**2)
print(f"Linear regression: R-squared = {R2_lin:.4f}")
print(f"Quadratic regression: R-squared = {R2_poly:.4f}")
print(f"\nEfficiency equation: eta = {beta_poly[0]:.2f} + {beta_poly[1]:.4f}Q + {beta_poly[2]:.6f}Q^2")
# Optimal flow rate (where derivative = 0)
Q_optimal = -beta_poly[1] / (2 * beta_poly[2])
eta_max = beta_poly[0] + beta_poly[1]*Q_optimal + beta_poly[2]*Q_optimal**2
print(f"\nOptimal flow rate: {Q_optimal:.1f} m3/h")
print(f"Maximum efficiency: {eta_max:.1f}%")
Warning: Do not increase the polynomial degree beyond what is needed -- a very high degree causes overfitting, where the model follows noise instead of the true trend.
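The overfitting effect is easy to demonstrate on the pump data: training R-squared can only improve as the degree grows, and with 10 points a degree-9 polynomial reproduces every point exactly. A sketch (the flow rate is rescaled purely for numerical conditioning):

```python
import numpy as np

flow_rate = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
efficiency = np.array([45, 62, 74, 82, 87, 89, 88, 84, 77, 65], dtype=float)

x = flow_rate / 100.0  # rescale to keep the Vandermonde matrix well conditioned
r2_by_degree = {}
for degree in (2, 5, 9):
    coeffs = np.polyfit(x, efficiency, deg=degree)
    y_fit = np.polyval(coeffs, x)
    r2 = 1 - np.sum((efficiency - y_fit) ** 2) / np.sum((efficiency - efficiency.mean()) ** 2)
    r2_by_degree[degree] = r2
    print(f"degree {degree}: training R-squared = {r2:.5f}")
# Degree 9 through 10 points is exact interpolation: the model has
# memorized the noise, which is not the same as learning the trend
```

Rising training R-squared is therefore no argument for a higher degree; only behavior on data the model has not seen can justify the extra terms.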
Residual Analysis: Validating the Model
Residuals are the differences between actual and predicted values. Analyzing them reveals model problems:
import numpy as np
# Residuals from the quadratic pump model
residuals = efficiency - y_poly
print("Residual analysis:")
print(f" Mean: {np.mean(residuals):.4f} (should be near zero)")
print(f" Standard deviation: {np.std(residuals):.4f}")
print(f" Max absolute: {np.max(np.abs(residuals)):.4f}")
# Approximate normality test
skewness = np.mean(((residuals - np.mean(residuals)) / np.std(residuals)) ** 3)
print(f" Skewness: {skewness:.4f} (should be near zero)")
# Pattern check
print("\nResiduals per point:")
for i in range(len(residuals)):
    bar = "+" * int(abs(residuals[i]) * 5)
    sign = "+" if residuals[i] > 0 else "-"
    print(f"  Q={flow_rate[i]:3d}: {residuals[i]:+.2f} {sign}{bar}")
Healthy residual rules:
- Mean close to zero
- No visible pattern (random scatter)
- Approximately constant variance (homoscedasticity)
- No large outliers
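One informal way to quantify "no visible pattern" is to count sign changes between consecutive residuals: random scatter flips sign roughly half the time, while a leftover trend produces long runs of one sign. This is a rough heuristic, not a formal test such as Durbin-Watson:

```python
import numpy as np

# Residuals from the quadratic pump fit above
flow_rate = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
efficiency = np.array([45, 62, 74, 82, 87, 89, 88, 84, 77, 65], dtype=float)
coeffs = np.polyfit(flow_rate, efficiency, deg=2)
residuals = efficiency - np.polyval(coeffs, flow_rate)

# Count sign changes between neighboring residuals; very few changes
# (long runs of one sign) suggest structure the model missed
signs = np.sign(residuals)
changes = np.sum(signs[:-1] != signs[1:])
print(f"Sign changes: {changes} out of {len(residuals) - 1} possible")
```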
Industrial Application: Building a Predictive Maintenance Model
The end goal: predicting when a motor needs maintenance based on sensor data:
import numpy as np
# Training data: 20 motors with varying ages
np.random.seed(42)
n_motors = 20
# Independent variables
vibration_rms = np.random.uniform(1.5, 8.0, n_motors) # mm/s
temperature = np.random.uniform(60, 120, n_motors) # C
current_deviation = np.random.uniform(0, 15, n_motors) # % from reference
hours_running = np.random.uniform(1000, 15000, n_motors) # operating hours
# Dependent variable: days until failure (true equation + noise)
days_to_failure = (500
- 30 * vibration_rms
- 1.5 * temperature
+ 0.01 * hours_running
- 5 * current_deviation
+ np.random.normal(0, 20, n_motors))
days_to_failure = np.maximum(days_to_failure, 1) # clamp to at least 1 day
# Build the model
X = np.column_stack([
np.ones(n_motors),
vibration_rms,
temperature,
current_deviation,
hours_running
])
beta = np.linalg.lstsq(X, days_to_failure, rcond=None)[0]
# Evaluate the model
y_pred = X @ beta
SSE = np.sum((days_to_failure - y_pred)**2)
SST = np.sum((days_to_failure - np.mean(days_to_failure))**2)
R2 = 1 - SSE/SST
print("Predictive maintenance model:")
print(f" Days = {beta[0]:.1f}")
print(f" {beta[1]:+.2f} * vibration (mm/s)")
print(f" {beta[2]:+.2f} * temperature (C)")
print(f" {beta[3]:+.2f} * current deviation (%)")
print(f" {beta[4]:+.4f} * running hours")
print(f" R-squared = {R2:.4f}")
# Prediction for a new motor
new_motor = np.array([1, 4.5, 85, 8, 6000])
prediction = new_motor @ beta
print(f"\nPrediction (vibration=4.5, temp=85, deviation=8%, hours=6000):")
print(f" Remaining days: {prediction:.0f}")
if prediction < 30:
    print("  Status: Urgent maintenance required!")
elif prediction < 90:
    print("  Status: Schedule maintenance within one month")
else:
    print("  Status: Stable -- monitoring is sufficient")
Summary and Practical Rules
| Type | Equation | When to Use |
|---|---|---|
| Simple linear | y = B0 + B1*x | One variable, linear relationship |
| Multiple | y = B0 + B1*x1 + B2*x2 + ... | Several variables, linear relationship |
| Polynomial | y = B0 + B1*x + B2*x^2 + ... | Curved relationship |
| Metric | Meaning | Good Value |
|---|---|---|
| R-squared | Proportion of variance explained | > 0.80 |
| Residuals | Model errors | Random, zero-mean |
| RMSE | Root mean squared error | As small as possible |
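RMSE appears in the table but was not computed earlier; it is simply the square root of the mean squared residual, and unlike R-squared it carries the units of the output. A sketch on the simple motor data:

```python
import numpy as np

load_percent = np.array([20, 30, 40, 50, 60, 70, 80, 90, 100])
power_kw = np.array([4.2, 5.8, 7.1, 8.9, 10.5, 12.3, 13.8, 15.6, 17.2])

beta_1, beta_0 = np.polyfit(load_percent, power_kw, deg=1)
residuals = power_kw - (beta_0 + beta_1 * load_percent)

# RMSE is in the same units as the output (kW here), which makes it
# easier to communicate on the factory floor than R-squared
rmse = np.sqrt(np.mean(residuals ** 2))
print(f"RMSE = {rmse:.3f} kW")
```

A statement like "the model is typically off by about 0.1 kW" is often more meaningful to an operator than any R-squared value.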
Practical tip: On the factory floor, a simple model that is understood and explainable is better than a complex model with slightly higher accuracy. The operator needs to trust the model -- and trust comes from understanding. Start with linear regression and add complexity only when you genuinely need it.