Industrial Data Collection: The First Step Toward Intelligence
Why Every Smart Industrial Revolution Starts with Data
Imagine a cement plant with 200 sensors measuring temperature, pressure, and vibration every second — over 17 million readings per day. But is all this data useful? Only if you collect it properly, clean it of errors, and store it in a way that enables analysis.
Industrial data collection is the foundation of every AI system in a factory. Without clean and reliable data, even the smartest algorithms will produce wrong results — as the saying goes: "Garbage In, Garbage Out."
Types of Industrial Sensors and Data Sources
In any modern factory, data comes from diverse sources:
Direct Physical Sensors:
- Temperature: Thermocouples for furnace temperatures (Type K up to about 1200 °C; Type J for lower ranges, up to about 750 °C)
- Pressure: Pressure transducers for monitoring pipe and boiler pressure
- Vibration: Accelerometers for detecting bearing and motor faults
- Flow: Flow meters for measuring liquid and gas flow rates
- Level: Level sensors for raw material and product tanks
Digital Sources:
- SCADA and PLC systems controlling production lines
- ERP systems recording work orders and inventory
- Quality logs and inspection reports
- Machine Vision cameras
| Sensor Type | Typical Sampling Rate | Daily Data Volume |
|---|---|---|
| Temperature (Thermocouple) | 1-10 Hz | ~5 MB |
| Vibration (Accelerometer) | 1-50 kHz | ~2 GB |
| Vision Camera | 30 fps | ~50 GB |
| Flow Meter | 1 Hz | ~1 MB |
| Power Meter | 1 Hz | ~2 MB |
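The volumes in the table can be estimated from first principles. A minimal sketch, assuming uncompressed 4-byte float samples on a single channel (the table's figures also reflect sample width, compression, and channel count, so they differ from the raw numbers):

```python
def daily_volume_mb(sampling_rate_hz, bytes_per_sample=4):
    """Raw daily data volume for one sensor channel, in MB (uncompressed)."""
    samples_per_day = sampling_rate_hz * 60 * 60 * 24
    return samples_per_day * bytes_per_sample / 1024**2

# Flow meter at 1 Hz: under 1 MB/day, matching the table
print(f"Flow meter: {daily_volume_mb(1):.2f} MB/day")
# Vibration at 25 kHz: several GB/day before compression
print(f"Vibration: {daily_volume_mb(25_000) / 1024:.1f} GB/day")
```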
Sampling Rate: How Often Should We Read the Sensor?
Consider monitoring a furnace temperature that changes slowly — one reading every 10 seconds is perfectly adequate. But if you are monitoring vibration on a motor spinning at 3000 RPM, you need thousands of readings per second to catch any anomaly.
The Nyquist Theorem states: the sampling rate must be at least twice the highest frequency in the signal. In practice, we use 5 to 10 times the maximum frequency for reliable results.
```python
# Calculating the required sampling rate
motor_rpm = 3000                          # Motor speed
fundamental_freq = motor_rpm / 60         # 50 Hz
# To detect harmonics up to the 10th order
max_freq = fundamental_freq * 10          # 500 Hz
# Nyquist with a safety margin: 2.56x is the factor
# commonly used in vibration analysis practice
sampling_rate = max_freq * 2.56           # 1280 Hz
print(f"Fundamental frequency: {fundamental_freq} Hz")
print(f"Max frequency of interest: {max_freq} Hz")
print(f"Required sampling rate: {sampling_rate} Hz")
print(f"Readings per minute: {sampling_rate * 60:,.0f}")
```
Practical Guidelines:
- Slow processes (temperature, level): 0.1 - 1 Hz
- Medium processes (pressure, flow): 1 - 100 Hz
- Fast processes (vibration, acoustics): 1 - 50 kHz
Data Quality: The Hidden Enemy
In industrial reality, data is rarely perfect. Imagine a temperature sensor suddenly reporting -999.0 — did the temperature actually drop, or did the sensor disconnect?
Missing Values
These occur when sensor communication drops or the device temporarily fails.
```python
import numpy as np

# Temperature data with missing values
readings = [78.5, 79.1, None, None, 80.3, 77.8, None, 81.2, 80.5, 79.9]

# Method 1: fill each gap by averaging the nearest valid neighbours
# (approximates linear interpolation for short gaps)
def interpolate_missing(data):
    result = data.copy()
    for i, val in enumerate(result):
        if val is None:
            # Find nearest valid values on each side
            prev_val = next((result[j] for j in range(i - 1, -1, -1)
                             if result[j] is not None), None)
            next_val = next((result[j] for j in range(i + 1, len(result))
                             if result[j] is not None), None)
            # Compare against None explicitly: a valid reading of 0.0 is falsy
            if prev_val is not None and next_val is not None:
                result[i] = (prev_val + next_val) / 2
            elif prev_val is not None:
                result[i] = prev_val  # Forward fill
            elif next_val is not None:
                result[i] = next_val  # Backward fill at the start
    return result

clean_data = interpolate_missing(readings)
print(f"Original data: {readings}")
print(f"After cleaning: {clean_data}")
```
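In production pipelines, the same gap-filling is usually a one-liner in pandas. A sketch, assuming pandas is available:

```python
import pandas as pd

readings = [78.5, 79.1, None, None, 80.3, 77.8, None, 81.2, 80.5, 79.9]
series = pd.Series(readings, dtype=float)  # None becomes NaN

# True linear interpolation across each gap, then fill any gaps at the edges
clean = series.interpolate(method="linear").ffill().bfill()
print(clean.tolist())
```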
Outliers
Readings far from the expected range — they could be sensor errors or real events worth investigating.
```python
def detect_outliers_iqr(data):
    """Detect outliers using the Interquartile Range (IQR)"""
    clean = [x for x in data if x is not None]
    q1 = np.percentile(clean, 25)
    q3 = np.percentile(clean, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = [(i, v) for i, v in enumerate(data)
                if v is not None and (v < lower_bound or v > upper_bound)]
    return outliers, lower_bound, upper_bound

# Pressure data with an outlier
pressure = [4.2, 4.3, 4.1, 4.4, 12.8, 4.2, 4.3, 4.1, 4.5, 4.2]
outliers, low, high = detect_outliers_iqr(pressure)
print(f"Bounds: [{low:.1f}, {high:.1f}]")
print(f"Outliers: {outliers}")
```
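An alternative to IQR is the Z-score: flag any reading more than a few standard deviations from the mean. A minimal sketch on the same pressure data (note that a single large outlier inflates the standard deviation, which is why a threshold below the usual 3 is used here):

```python
import numpy as np

def detect_outliers_zscore(data, threshold=2.0):
    """Flag readings more than `threshold` standard deviations from the mean."""
    clean = np.array([x for x in data if x is not None], dtype=float)
    mean, std = clean.mean(), clean.std()
    return [(i, v) for i, v in enumerate(data)
            if v is not None and abs(v - mean) > threshold * std]

# Same pressure data: one suspicious reading of 12.8 bar
pressure = [4.2, 4.3, 4.1, 4.4, 12.8, 4.2, 4.3, 4.1, 4.5, 4.2]
# The outlier itself inflates the standard deviation, so a
# threshold of 3 would just miss it here; 2 catches it
print(detect_outliers_zscore(pressure, threshold=2.0))  # [(4, 12.8)]
```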
Noise
Small random fluctuations in sensor readings. We use filters to smooth the signal:
```python
def moving_average_filter(data, window=5):
    """Moving average filter for smoothing data"""
    filtered = []
    for i in range(len(data)):
        start = max(0, i - window // 2)
        end = min(len(data), i + window // 2 + 1)
        window_data = [x for x in data[start:end] if x is not None]
        filtered.append(sum(window_data) / len(window_data) if window_data else None)
    return filtered
```
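When there are no missing values, the same smoothing can be done with NumPy's convolution. A sketch (with `mode="same"`, the edges are zero-padded, so the first and last values of the result are biased toward zero):

```python
import numpy as np

def moving_average_np(data, window=5):
    """Centered moving average via convolution (assumes no missing values)."""
    kernel = np.ones(window) / window
    # mode="same" keeps the output the same length as the input,
    # at the cost of zero-padding the edges
    return np.convolve(data, kernel, mode="same")

signal = [4.2, 4.3, 4.1, 4.4, 4.6, 4.2, 4.3, 4.1, 4.5, 4.2]
print(np.round(moving_average_np(signal), 2))
```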
| Quality Issue | Common Cause | Treatment Method |
|---|---|---|
| Missing values | Communication loss | Linear interpolation / Forward fill |
| Outliers | Sensor malfunction | IQR or Z-score |
| Noise | Electrical interference | Moving average filter |
| Drift | Sensor degradation | Periodic calibration |
| Time delay | Slow network | NTP synchronization |
Data Labeling
To train AI models, we need labeled data — every reading or image must be tagged with a description like "normal" or "faulty."
Consider training a model to detect bearing faults from vibration data. You need to tell the model: "this signal = healthy bearing" and "this signal = worn bearing."
Labeling Methods:
- Manual: An expert reviews and classifies data — accurate but slow and expensive
- Semi-automatic: The expert labels a small sample, then a preliminary model labels the rest
- Rule-based: For example: "if vibration exceeds 10 mm/s = fault"
- From maintenance logs: Linking recorded failures with sensor data at the same timestamps
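The rule-based approach can be sketched in a few lines. The thresholds here are illustrative only; real limits depend on machine size and mounting class (vibration severity standards such as ISO 10816 define the actual zones):

```python
def label_vibration(rms_mm_s):
    """Rule-based label from vibration RMS velocity (illustrative thresholds)."""
    if rms_mm_s <= 2.8:
        return "normal"
    elif rms_mm_s <= 10.0:
        return "warning"
    else:
        return "fault"

readings_mm_s = [1.2, 3.5, 11.4, 2.1]
print([label_vibration(v) for v in readings_mm_s])
# ['normal', 'warning', 'fault', 'normal']
```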
The ETL Process: Extract, Transform, Load
ETL is the pipeline that moves data from its raw sources into the analytical database.
```
┌──────────┐      ┌──────────────┐      ┌──────────────┐
│ Extract  │───> │ Transform    │───> │ Load         │
├──────────┤      ├──────────────┤      ├──────────────┤
│ PLC/SCADA│      │ Clean        │      │ Historian DB │
│ Sensors  │      │ Unify units  │      │ Time-series  │
│ ERP      │      │ Compute KPIs │      │ Data Lake    │
│ CSV/API  │      │ Flag outliers│      │              │
└──────────┘      └──────────────┘      └──────────────┘
```
```python
# Simplified industrial ETL pipeline (helper methods are illustrative stubs)
class IndustrialETL:
    def extract(self, source):
        """Extract data from multiple sources"""
        if source == "opc_ua":
            return self.read_opc_ua()       # Industrial standard
        elif source == "modbus":
            return self.read_modbus()
        elif source == "csv":
            return self.read_csv_files()

    def transform(self, raw_data):
        """Clean and transform data"""
        data = self.remove_duplicates(raw_data)
        data = self.fill_missing_values(data)
        data = self.convert_units(data)     # PSI to bar, F to C
        data = self.flag_outliers(data)
        data = self.calculate_kpis(data)    # OEE, energy consumption
        return data

    def load(self, clean_data, target):
        """Load into target database"""
        if target == "historian":
            self.write_to_historian(clean_data)
        elif target == "data_lake":
            self.write_parquet(clean_data)  # Columnar compressed storage
```
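As an illustration of the `convert_units` step, the two conversions mentioned in the code comments are simple closed-form formulas:

```python
def psi_to_bar(psi):
    """1 psi = 0.0689476 bar (to the digits shown)."""
    return psi * 0.0689476

def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

print(f"{psi_to_bar(60):.2f} bar")             # 4.14 bar
print(f"{fahrenheit_to_celsius(212):.0f} °C")  # 100 °C
```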
Historian Databases
In the factory world, we do not use general-purpose relational databases like MySQL for sensor data — we use Time-Series Databases (historians) optimized for this purpose.
| Feature | Standard Database (SQL) | Time-Series DB (Historian) |
|---|---|---|
| Write speed | Thousands/sec | Millions/sec |
| Compression | Standard | Specialized 10:1 to 100:1 |
| Time-based queries | Slow | Highly optimized |
| Aggregation | Manual | Automatic (minute/hour/day) |
| Data retention | No policy | Automatic (age-based deletion) |
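The "automatic aggregation" row corresponds to downsampling raw readings into per-period statistics, which historians do transparently. A plain-Python sketch of per-minute averaging (timestamps and values are hypothetical):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw (timestamp, value) temperature readings
raw = [
    (datetime(2024, 1, 1, 8, 0, 15), 78.2),
    (datetime(2024, 1, 1, 8, 0, 45), 78.6),
    (datetime(2024, 1, 1, 8, 1, 10), 79.0),
    (datetime(2024, 1, 1, 8, 1, 50), 79.4),
]

# Bucket readings by the minute they fall in, then average each bucket
buckets = defaultdict(list)
for ts, value in raw:
    buckets[ts.replace(second=0, microsecond=0)].append(value)

per_minute = {minute: sum(vals) / len(vals) for minute, vals in buckets.items()}
for minute, avg in sorted(per_minute.items()):
    print(minute.strftime("%H:%M"), f"{avg:.1f}")
```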
Popular Historian Databases:
- InfluxDB: Open-source, excellent for mid-size projects
- TimescaleDB: Built on PostgreSQL, combines SQL with time-series
- OSIsoft PI: The industrial standard in large plants
- Siemens Historian: Integrated with Siemens systems
From Sensor to Decision: The Full Picture
Trace the complete journey of a single sensor reading:
```
Temperature Sensor -> PLC -> OPC-UA Server -> ETL Pipeline
    -> Cleaning and Quality Check -> Historian DB
        -> Live Dashboard
        -> AI Model for Fault Prediction
            -> Automatic Maintenance Alert
```
Time from Reading to Decision:
- Live monitoring: < 1 second
- Anomaly detection: 1-5 seconds
- Predictive maintenance: minutes to hours (batch analysis)
Practical Tips for Engineers
- Start with existing data: Most factories already have PLCs recording data — look for it before buying new sensors
- Quality over quantity: 100 clean readings are better than a million error-filled ones
- Document everything: Which sensor, which unit, which sampling rate — because you will forget in 6 months
- Plan for storage: Vibration data at 10 kHz fills disks fast — use compression and retention policies
- Test your pipeline on real data: Simulated data will not contain the problems you will face in production
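The storage tip above is easy to quantify. A sketch, assuming 16-bit samples and a 10:1 historian compression ratio (both are assumptions, not measurements):

```python
def yearly_storage_gb(rate_hz, bytes_per_sample=2, compression=10):
    """Per-channel yearly storage in GB (assumed sample width and compression)."""
    raw_bytes = rate_hz * bytes_per_sample * 60 * 60 * 24 * 365
    return raw_bytes / compression / 1024**3

# One vibration channel at 10 kHz still needs tens of GB per year
print(f"{yearly_storage_gb(10_000):.0f} GB/year after compression")
```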