Industrial Data Collection: The First Step Toward Intelligence
Why Every Smart Industrial Revolution Starts with Data
Imagine a cement plant with 200 sensors measuring temperature, pressure, and vibration every second — over 17 million readings per day. But is all this data useful? Only if you collect it properly, clean it of errors, and store it in a way that enables analysis.
Industrial data collection is the foundation of every AI system in a factory. Without clean and reliable data, even the smartest algorithms will produce wrong results — as the saying goes: "Garbage In, Garbage Out."
Types of Industrial Sensors and Data Sources
In any modern factory, data comes from diverse sources:
Direct Physical Sensors:
- Temperature: Thermocouples for furnace temperatures (Type K up to about 1200 °C; Type J for lower ranges, up to about 750 °C)
- Pressure: Pressure transducers for monitoring pipe and boiler pressure
- Vibration: Accelerometers for detecting bearing and motor faults
- Flow: Flow meters for measuring liquid and gas flow rates
- Level: Level sensors for raw material and product tanks
Digital Sources:
- SCADA and PLC systems controlling production lines
- ERP systems recording work orders and inventory
- Quality logs and inspection reports
- Machine Vision cameras
| Sensor Type | Typical Sampling Rate | Daily Data Volume |
|---|---|---|
| Temperature (Thermocouple) | 1-10 Hz | ~5 MB |
| Vibration (Accelerometer) | 1-50 kHz | ~2 GB |
| Vision Camera | 30 fps | ~50 GB |
| Flow Meter | 1 Hz | ~1 MB |
| Power Meter | 1 Hz | ~2 MB |
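The volumes in the table can be estimated from first principles. A minimal sketch, assuming uncompressed 4-byte float samples on a single channel (the table's figures also reflect sample width, compression, and channel count, so they differ from the raw numbers):

```python
def daily_volume_mb(sampling_rate_hz, bytes_per_sample=4):
    """Raw daily data volume for one sensor channel, in MB (uncompressed)."""
    samples_per_day = sampling_rate_hz * 60 * 60 * 24
    return samples_per_day * bytes_per_sample / 1024**2

# Flow meter at 1 Hz: under 1 MB/day, matching the table
print(f"Flow meter: {daily_volume_mb(1):.2f} MB/day")
# Vibration at 25 kHz: several GB/day before compression
print(f"Vibration: {daily_volume_mb(25_000) / 1024:.1f} GB/day")
```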
Sampling Rate: How Often Should We Read the Sensor?
Consider monitoring a furnace temperature that changes slowly — one reading every 10 seconds is perfectly adequate. But if you are monitoring vibration on a motor spinning at 3000 RPM, you need thousands of readings per second to catch any anomaly.
The Nyquist Theorem states: the sampling rate must be at least twice the highest frequency in the signal. In practice, we use 5 to 10 times the maximum frequency for reliable results.
```python
# Calculating the required sampling rate
motor_rpm = 3000                          # Motor speed
fundamental_freq = motor_rpm / 60         # 50 Hz
# To detect harmonics up to the 10th order
max_freq = fundamental_freq * 10          # 500 Hz
# Nyquist with a safety margin: 2.56x is the factor
# commonly used in vibration analysis practice
sampling_rate = max_freq * 2.56           # 1280 Hz
print(f"Fundamental frequency: {fundamental_freq} Hz")
print(f"Max frequency of interest: {max_freq} Hz")
print(f"Required sampling rate: {sampling_rate} Hz")
print(f"Readings per minute: {sampling_rate * 60:,.0f}")
```
Practical Guidelines:
- Slow processes (temperature, level): 0.1 - 1 Hz
- Medium processes (pressure, flow): 1 - 100 Hz
- Fast processes (vibration, acoustics): 1 - 50 kHz
Data Quality: The Hidden Enemy
In industrial reality, data is rarely perfect. Imagine a temperature sensor suddenly reporting -999.0 — did the temperature actually drop, or did the sensor disconnect?
Missing Values
These occur when sensor communication drops or the device temporarily fails.
```python
import numpy as np

# Temperature data with missing values
readings = [78.5, 79.1, None, None, 80.3, 77.8, None, 81.2, 80.5, 79.9]

# Method 1: fill each gap by averaging the nearest valid neighbours
# (approximates linear interpolation for short gaps)
def interpolate_missing(data):
    result = data.copy()
    for i, val in enumerate(result):
        if val is None:
            # Find nearest valid values on each side
            prev_val = next((result[j] for j in range(i - 1, -1, -1)
                             if result[j] is not None), None)
            next_val = next((result[j] for j in range(i + 1, len(result))
                             if result[j] is not None), None)
            # Compare against None explicitly: a valid reading of 0.0 is falsy
            if prev_val is not None and next_val is not None:
                result[i] = (prev_val + next_val) / 2
            elif prev_val is not None:
                result[i] = prev_val  # Forward fill
            elif next_val is not None:
                result[i] = next_val  # Backward fill at the start
    return result

clean_data = interpolate_missing(readings)
print(f"Original data: {readings}")
print(f"After cleaning: {clean_data}")
```
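In production pipelines, the same gap-filling is usually a one-liner in pandas. A sketch, assuming pandas is available:

```python
import pandas as pd

readings = [78.5, 79.1, None, None, 80.3, 77.8, None, 81.2, 80.5, 79.9]
series = pd.Series(readings, dtype=float)  # None becomes NaN

# True linear interpolation across each gap, then fill any gaps at the edges
clean = series.interpolate(method="linear").ffill().bfill()
print(clean.tolist())
```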
Outliers
Readings far from the expected range — they could be sensor errors or real events worth investigating.
```python
def detect_outliers_iqr(data):
    """Detect outliers using the Interquartile Range (IQR)"""
    clean = [x for x in data if x is not None]
    q1 = np.percentile(clean, 25)
    q3 = np.percentile(clean, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = [(i, v) for i, v in enumerate(data)
                if v is not None and (v < lower_bound or v > upper_bound)]
    return outliers, lower_bound, upper_bound

# Pressure data with an outlier
pressure = [4.2, 4.3, 4.1, 4.4, 12.8, 4.2, 4.3, 4.1, 4.5, 4.2]
outliers, low, high = detect_outliers_iqr(pressure)
print(f"Bounds: [{low:.1f}, {high:.1f}]")
print(f"Outliers: {outliers}")
```
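An alternative to IQR is the Z-score: flag any reading more than a few standard deviations from the mean. A minimal sketch on the same pressure data (note that a single large outlier inflates the standard deviation, which is why a threshold below the usual 3 is used here):

```python
import numpy as np

def detect_outliers_zscore(data, threshold=2.0):
    """Flag readings more than `threshold` standard deviations from the mean."""
    clean = np.array([x for x in data if x is not None], dtype=float)
    mean, std = clean.mean(), clean.std()
    return [(i, v) for i, v in enumerate(data)
            if v is not None and abs(v - mean) > threshold * std]

# Same pressure data: one suspicious reading of 12.8 bar
pressure = [4.2, 4.3, 4.1, 4.4, 12.8, 4.2, 4.3, 4.1, 4.5, 4.2]
# The outlier itself inflates the standard deviation, so a
# threshold of 3 would just miss it here; 2 catches it
print(detect_outliers_zscore(pressure, threshold=2.0))  # [(4, 12.8)]
```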
Noise
Small random fluctuations in sensor readings. We use filters to smooth the signal:
```python
def moving_average_filter(data, window=5):
    """Moving average filter for smoothing data"""
    filtered = []
    for i in range(len(data)):
        start = max(0, i - window // 2)
        end = min(len(data), i + window // 2 + 1)
        window_data = [x for x in data[start:end] if x is not None]
        filtered.append(sum(window_data) / len(window_data) if window_data else None)
    return filtered
```
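When there are no missing values, the same smoothing can be done with NumPy's convolution. A sketch (with `mode="same"`, the edges are zero-padded, so the first and last values of the result are biased toward zero):

```python
import numpy as np

def moving_average_np(data, window=5):
    """Centered moving average via convolution (assumes no missing values)."""
    kernel = np.ones(window) / window
    # mode="same" keeps the output the same length as the input,
    # at the cost of zero-padding the edges
    return np.convolve(data, kernel, mode="same")

signal = [4.2, 4.3, 4.1, 4.4, 4.6, 4.2, 4.3, 4.1, 4.5, 4.2]
print(np.round(moving_average_np(signal), 2))
```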
| Quality Issue | Common Cause | Treatment Method |
|---|---|---|
| Missing values | Communication loss | Linear interpolation / Forward fill |
| Outliers | Sensor malfunction | IQR or Z-score |
| Noise | Electrical interference | Moving average filter |
| Drift | Sensor degradation | Periodic calibration |
| Time delay | Slow network | NTP synchronization |
Data Labeling
To train AI models, we need labeled data — every reading or image must be tagged with a description like "normal" or "faulty."
Consider training a model to detect bearing faults from vibration data. You need to tell the model: "this signal = healthy bearing" and "this signal = worn bearing."
Labeling Methods:
- Manual: An expert reviews and classifies data — accurate but slow and expensive
- Semi-automatic: The expert labels a small sample, then a preliminary model labels the rest
- Rule-based: For example: "if vibration exceeds 10 mm/s = fault"
- From maintenance logs: Linking recorded failures with sensor data at the same timestamps
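The rule-based approach can be sketched in a few lines. The thresholds here are illustrative only; real limits depend on machine size and mounting class (vibration severity standards such as ISO 10816 define the actual zones):

```python
def label_vibration(rms_mm_s):
    """Rule-based label from vibration RMS velocity (illustrative thresholds)."""
    if rms_mm_s <= 2.8:
        return "normal"
    elif rms_mm_s <= 10.0:
        return "warning"
    else:
        return "fault"

readings_mm_s = [1.2, 3.5, 11.4, 2.1]
print([label_vibration(v) for v in readings_mm_s])
# ['normal', 'warning', 'fault', 'normal']
```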
The ETL Process: Extract, Transform, Load
ETL is the pipeline that moves data from its raw sources into the analytical database.
```
┌──────────┐      ┌──────────────┐      ┌──────────────┐
│ Extract  │───> │ Transform    │───> │ Load         │
├──────────┤      ├──────────────┤      ├──────────────┤
│ PLC/SCADA│      │ Clean        │      │ Historian DB │
│ Sensors  │      │ Unify units  │      │ Time-series  │
│ ERP      │      │ Compute KPIs │      │ Data Lake    │
│ CSV/API  │      │ Flag outliers│      │              │
└──────────┘      └──────────────┘      └──────────────┘
```
```python
# Simplified industrial ETL pipeline (helper methods are illustrative stubs)
class IndustrialETL:
    def extract(self, source):
        """Extract data from multiple sources"""
        if source == "opc_ua":
            return self.read_opc_ua()       # Industrial standard
        elif source == "modbus":
            return self.read_modbus()
        elif source == "csv":
            return self.read_csv_files()

    def transform(self, raw_data):
        """Clean and transform data"""
        data = self.remove_duplicates(raw_data)
        data = self.fill_missing_values(data)
        data = self.convert_units(data)     # PSI to bar, F to C
        data = self.flag_outliers(data)
        data = self.calculate_kpis(data)    # OEE, energy consumption
        return data

    def load(self, clean_data, target):
        """Load into target database"""
        if target == "historian":
            self.write_to_historian(clean_data)
        elif target == "data_lake":
            self.write_parquet(clean_data)  # Columnar compressed storage
```
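As an illustration of the `convert_units` step, the two conversions mentioned in the code comments are simple closed-form formulas:

```python
def psi_to_bar(psi):
    """1 psi = 0.0689476 bar (to the digits shown)."""
    return psi * 0.0689476

def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

print(f"{psi_to_bar(60):.2f} bar")             # 4.14 bar
print(f"{fahrenheit_to_celsius(212):.0f} °C")  # 100 °C
```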
Historian Databases
In the factory world, we do not use general-purpose relational databases like MySQL for sensor data — we use Time-Series Databases (historians) optimized for this purpose.
| Feature | Standard Database (SQL) | Time-Series DB (Historian) |
|---|---|---|
| Write speed | Thousands/sec | Millions/sec |
| Compression | Standard | Specialized 10:1 to 100:1 |
| Time-based queries | Slow | Highly optimized |
| Aggregation | Manual | Automatic (minute/hour/day) |
| Data retention | No policy | Automatic (age-based deletion) |
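The "automatic aggregation" row corresponds to downsampling raw readings into per-period statistics, which historians do transparently. A plain-Python sketch of per-minute averaging (timestamps and values are hypothetical):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw (timestamp, value) temperature readings
raw = [
    (datetime(2024, 1, 1, 8, 0, 15), 78.2),
    (datetime(2024, 1, 1, 8, 0, 45), 78.6),
    (datetime(2024, 1, 1, 8, 1, 10), 79.0),
    (datetime(2024, 1, 1, 8, 1, 50), 79.4),
]

# Bucket readings by the minute they fall in, then average each bucket
buckets = defaultdict(list)
for ts, value in raw:
    buckets[ts.replace(second=0, microsecond=0)].append(value)

per_minute = {minute: sum(vals) / len(vals) for minute, vals in buckets.items()}
for minute, avg in sorted(per_minute.items()):
    print(minute.strftime("%H:%M"), f"{avg:.1f}")
```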
Popular Historian Databases:
- InfluxDB: Open-source, excellent for mid-size projects
- TimescaleDB: Built on PostgreSQL, combines SQL with time-series
- OSIsoft PI: The industrial standard in large plants
- Siemens Historian: Integrated with Siemens systems
From Sensor to Decision: The Full Picture
Trace the complete journey of a single sensor reading:
```
Temperature Sensor -> PLC -> OPC-UA Server -> ETL Pipeline
    -> Cleaning and Quality Check -> Historian DB
        -> Live Dashboard
        -> AI Model for Fault Prediction
            -> Automatic Maintenance Alert
```
Time from Reading to Decision:
- Live monitoring: < 1 second
- Anomaly detection: 1-5 seconds
- Predictive maintenance: minutes to hours (batch analysis)
Practical Tips for Engineers
- Start with existing data: Most factories already have PLCs recording data — look for it before buying new sensors
- Quality over quantity: 100 clean readings are better than a million error-filled ones
- Document everything: Which sensor, which unit, which sampling rate — because you will forget in 6 months
- Plan for storage: Vibration data at 10 kHz fills disks fast — use compression and retention policies
- Test your pipeline on real data: Simulated data will not contain the problems you will face in production
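The storage tip above is easy to quantify. A sketch, assuming 16-bit samples and a 10:1 historian compression ratio (both are assumptions, not measurements):

```python
def yearly_storage_gb(rate_hz, bytes_per_sample=2, compression=10):
    """Per-channel yearly storage in GB (assumed sample width and compression)."""
    raw_bytes = rate_hz * bytes_per_sample * 60 * 60 * 24 * 365
    return raw_bytes / compression / 1024**3

# One vibration channel at 10 kHz still needs tens of GB per year
print(f"{yearly_storage_gb(10_000):.0f} GB/year after compression")
```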