
Industrial Data Collection: The First Step Toward Intelligence

Why Every Smart Industrial Revolution Starts with Data

Imagine a cement plant with 200 sensors measuring temperature, pressure, and vibration every second — that is over 17 million readings per day. But is all this data useful? Only if you collect it properly, clean it of errors, and store it in a way that enables analysis.

Industrial data collection is the foundation of every AI system in a factory. Without clean and reliable data, even the smartest algorithms will produce wrong results — as the saying goes: "Garbage In, Garbage Out."

Types of Industrial Sensors and Data Sources

In any modern factory, data comes from diverse sources:

Direct Physical Sensors:

  • Temperature: Thermocouples (Type K or J) for furnace temperatures up to 1200 degrees C
  • Pressure: Pressure transducers for monitoring pipe and boiler pressure
  • Vibration: Accelerometers for detecting bearing and motor faults
  • Flow: Flow meters for measuring liquid and gas flow rates
  • Level: Level sensors for raw material and product tanks

Digital Sources:

  • SCADA and PLC systems controlling production lines
  • ERP systems recording work orders and inventory
  • Quality logs and inspection reports
  • Machine Vision cameras

Sensor Type                  Typical Sampling Rate   Daily Data Volume
Temperature (Thermocouple)   1-10 Hz                 ~5 MB
Vibration (Accelerometer)    1-50 kHz                ~2 GB
Vision Camera                30 fps                  ~50 GB
Flow Meter                   1 Hz                    ~1 MB
Power Meter                  1 Hz                    ~2 MB
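
The volumes in the table follow directly from sampling rate and sample size. A back-of-the-envelope sketch, assuming 4-byte float samples and no compression (both illustrative assumptions — real historians compress heavily, so stored volumes are smaller):

```python
def daily_volume_mb(sampling_rate_hz, bytes_per_sample=4):
    """Uncompressed daily data volume for one sensor channel, in MB."""
    seconds_per_day = 86_400
    return sampling_rate_hz * bytes_per_sample * seconds_per_day / 1e6

# Vibration at the top of the 1-50 kHz range
print(f"Vibration @ 50 kHz: {daily_volume_mb(50_000) / 1000:.1f} GB/day")
# A slow 1 Hz channel such as a flow meter
print(f"Flow meter @ 1 Hz:  {daily_volume_mb(1):.2f} MB/day")
```

Raw high-rate vibration quickly dominates storage, which is why compression and retention policies (discussed later) matter most for those channels.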

Sampling Rate: How Often Should We Read the Sensor?

Consider monitoring a furnace temperature that changes slowly — one reading every 10 seconds is perfectly adequate. But if you are monitoring vibration on a motor spinning at 3000 RPM, you need thousands of readings per second to catch any anomaly.

The Nyquist Theorem states: the sampling rate must be at least twice the highest frequency in the signal. In practice, we use 5 to 10 times the maximum frequency for reliable results.

# Calculating the required sampling rate
motor_rpm = 3000          # Motor speed
fundamental_freq = motor_rpm / 60  # 50 Hz
# To detect harmonics up to the 10th order
max_freq = fundamental_freq * 10   # 500 Hz
# Applying Nyquist with safety margin
sampling_rate = max_freq * 2.56    # 1280 Hz (ISO standard)

print(f"Fundamental frequency: {fundamental_freq} Hz")
print(f"Max frequency of interest: {max_freq} Hz")
print(f"Required sampling rate: {sampling_rate} Hz")
print(f"Readings per minute: {sampling_rate * 60:,.0f}")

Practical Guidelines:

  • Slow processes (temperature, level): 0.1 - 1 Hz
  • Medium processes (pressure, flow): 1 - 100 Hz
  • Fast processes (vibration, acoustics): 1 - 50 kHz

Data Quality: The Hidden Enemy

In industrial reality, data is rarely perfect. Imagine a temperature sensor suddenly reporting -999.0 — did the temperature actually drop, or did the sensor disconnect?

Missing Values

These occur when sensor communication drops or the device temporarily fails.

import numpy as np

# Temperature data with missing values
readings = [78.5, 79.1, None, None, 80.3, 77.8, None, 81.2, 80.5, 79.9]

# Method 1: Linear interpolation (midpoint fill between nearest valid neighbors)
def interpolate_missing(data):
    result = data.copy()
    for i, val in enumerate(result):
        if val is None:
            # Find the nearest valid values on either side
            prev_val = next((result[j] for j in range(i-1, -1, -1)
                           if result[j] is not None), None)
            next_val = next((result[j] for j in range(i+1, len(result))
                           if result[j] is not None), None)
            # Compare to None explicitly: a valid reading of 0.0 is falsy
            if prev_val is not None and next_val is not None:
                result[i] = (prev_val + next_val) / 2
            elif prev_val is not None:
                result[i] = prev_val  # Forward fill
    return result

clean_data = interpolate_missing(readings)
print(f"Original data:  {readings}")
print(f"After cleaning: {clean_data}")
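
For longer runs of missing values, true linear interpolation across the whole gap is usually preferable to pairwise averaging. A sketch using NumPy's `interp` on the same hypothetical readings:

```python
import numpy as np

readings = [78.5, 79.1, None, None, 80.3, 77.8, None, 81.2, 80.5, 79.9]

# Convert None to NaN so we can index the valid points
values = np.array([np.nan if r is None else r for r in readings])
idx = np.arange(len(values))
valid = ~np.isnan(values)

# np.interp draws straight lines between the surrounding valid points
filled = np.interp(idx, idx[valid], values[valid])
print(np.round(filled, 2))
# → [78.5 79.1 79.5 79.9 80.3 77.8 79.5 81.2 80.5 79.9]
```

Note how the two consecutive gaps get evenly spaced values (79.5, 79.9) between their neighbors, rather than a repeated midpoint.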

Outliers

Readings far from the expected range — they could be sensor errors or real events worth investigating.

def detect_outliers_iqr(data):
    """Detect outliers using the Interquartile Range (IQR)"""
    clean = [x for x in data if x is not None]
    q1 = np.percentile(clean, 25)
    q3 = np.percentile(clean, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    outliers = [(i, v) for i, v in enumerate(data)
                if v is not None and (v < lower_bound or v > upper_bound)]
    return outliers, lower_bound, upper_bound

# Pressure data with an outlier
pressure = [4.2, 4.3, 4.1, 4.4, 12.8, 4.2, 4.3, 4.1, 4.5, 4.2]
outliers, low, high = detect_outliers_iqr(pressure)
print(f"Bounds: [{low:.1f}, {high:.1f}]")
print(f"Outliers: {outliers}")

Noise

Small random fluctuations in sensor readings. We use filters to smooth the signal:

def moving_average_filter(data, window=5):
    """Moving average filter for smoothing data"""
    filtered = []
    for i in range(len(data)):
        start = max(0, i - window // 2)
        end = min(len(data), i + window // 2 + 1)
        window_data = [x for x in data[start:end] if x is not None]
        filtered.append(sum(window_data) / len(window_data) if window_data else None)
    return filtered

Quality Issue    Common Cause              Treatment Method
Missing values   Communication loss        Linear interpolation / Forward fill
Outliers         Sensor malfunction        IQR or Z-score
Noise            Electrical interference   Moving average filter
Drift            Sensor degradation        Periodic calibration
Time delay       Slow network              NTP synchronization
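
The table mentions Z-score as an alternative to IQR; a minimal sketch (the threshold values are common conventions, not universal rules):

```python
import numpy as np

def detect_outliers_zscore(data, threshold=3.0):
    """Flag readings more than `threshold` standard deviations from the mean."""
    clean = np.array([x for x in data if x is not None])
    mean, std = clean.mean(), clean.std()
    return [(i, v) for i, v in enumerate(data)
            if v is not None and abs(v - mean) > threshold * std]

pressure = [4.2, 4.3, 4.1, 4.4, 12.8, 4.2, 4.3, 4.1, 4.5, 4.2]
print(detect_outliers_zscore(pressure, threshold=2.5))  # → [(4, 12.8)]
```

One caveat worth knowing: a single large outlier inflates the standard deviation itself, so at the classic 3-sigma threshold this very spike slips through undetected — here it takes threshold=2.5 to catch it. IQR does not suffer from this masking effect, which is why it is often preferred for small samples.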

Data Labeling

To train AI models, we need labeled data — every reading or image must be tagged with a description like "normal" or "faulty."

Consider training a model to detect bearing faults from vibration data. You need to tell the model: "this signal = healthy bearing" and "this signal = worn bearing."

Labeling Methods:

  1. Manual: An expert reviews and classifies data — accurate but slow and expensive
  2. Semi-automatic: The expert labels a small sample, then a preliminary model labels the rest
  3. Rule-based: For example: "if vibration exceeds 10 mm/s = fault"
  4. From maintenance logs: Linking recorded failures with sensor data at the same timestamps
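
Methods 3 and 4 are the easiest to automate. A sketch combining both, assuming hypothetical timestamped vibration readings and an illustrative maintenance log (the 10 mm/s threshold comes from the rule above; all data values are made up):

```python
from datetime import datetime

# Hypothetical readings: (timestamp, RMS vibration velocity in mm/s)
readings = [
    (datetime(2024, 3, 1, 8, 0), 3.2),
    (datetime(2024, 3, 1, 9, 0), 11.5),
    (datetime(2024, 3, 2, 10, 0), 4.1),
]

# Illustrative maintenance log: recorded failure windows
failures = [(datetime(2024, 3, 1, 8, 30), datetime(2024, 3, 1, 10, 0))]

def label(ts, value, threshold=10.0):
    """Rule-based: value exceeds threshold -> fault.
    Log-based: timestamp falls inside a recorded failure window -> fault."""
    if value > threshold:
        return "fault"
    if any(start <= ts <= end for start, end in failures):
        return "fault"
    return "normal"

for ts, v in readings:
    print(ts, v, label(ts, v))
```

In practice the two sources disagree surprisingly often — a reading inside a failure window can look numerically normal — which is exactly why an expert should review a sample of the automatic labels.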

The ETL Process: Extract, Transform, Load

ETL is the pipeline that moves data from its raw sources into the analytical database.

┌──────────┐    ┌──────────────┐    ┌──────────────┐
│ Extract  │───>│  Transform   │───>│    Load      │
├──────────┤    ├──────────────┤    ├──────────────┤
│ PLC/SCADA│    │ Clean        │    │ Historian DB │
│ Sensors  │    │ Unify units  │    │ Time-series  │
│ ERP      │    │ Compute KPIs │    │ Data Lake    │
│ CSV/API  │    │ Flag outliers│    │              │
└──────────┘    └──────────────┘    └──────────────┘
# Simplified industrial ETL pipeline
class IndustrialETL:
    def extract(self, source):
        """Extract data from multiple sources"""
        if source == "opc_ua":
            return self.read_opc_ua()    # Industrial standard
        elif source == "modbus":
            return self.read_modbus()
        elif source == "csv":
            return self.read_csv_files()

    def transform(self, raw_data):
        """Clean and transform data"""
        data = self.remove_duplicates(raw_data)
        data = self.fill_missing_values(data)
        data = self.convert_units(data)       # PSI to bar, F to C
        data = self.flag_outliers(data)
        data = self.calculate_kpis(data)      # OEE, energy consumption
        return data

    def load(self, clean_data, target):
        """Load into target database"""
        if target == "historian":
            self.write_to_historian(clean_data)
        elif target == "data_lake":
            self.write_parquet(clean_data)     # Columnar compressed storage
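
The `transform` stage above is where most of the real work hides. A standalone sketch of two of its steps — duplicate removal and the PSI-to-bar conversion named in the comment (the helper names and sample data are illustrative, not part of the class):

```python
def psi_to_bar(psi):
    """1 psi ≈ 0.0689476 bar."""
    return psi * 0.0689476

def transform(raw):
    """Minimal transform stage: drop duplicate timestamps, convert units."""
    seen, out = set(), []
    for ts, psi in raw:
        if ts in seen:
            continue                                  # remove_duplicates
        seen.add(ts)
        out.append((ts, round(psi_to_bar(psi), 3)))   # convert_units
    return out

raw = [("08:00", 60.0), ("08:00", 60.0), ("08:01", 61.5)]
print(transform(raw))  # → [('08:00', 4.137), ('08:01', 4.24)]
```

Keeping each step a small pure function like this makes the pipeline easy to unit-test against known sensor values before it ever touches production data.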

Historian Databases

In the factory world, we do not use standard databases like MySQL for sensor data — we use Time-Series Databases optimized for this purpose.

Feature              Standard Database (SQL)   Time-Series DB (Historian)
Write speed          Thousands/sec             Millions/sec
Compression          Standard                  Specialized 10:1 to 100:1
Time-based queries   Slow                      Highly optimized
Aggregation          Manual                    Automatic (minute/hour/day)
Data retention       No policy                 Automatic (age-based deletion)

Popular Historian Databases:

  • InfluxDB: Open-source, excellent for mid-size projects
  • TimescaleDB: Built on PostgreSQL, combines SQL with time-series
  • OSIsoft PI: The industrial standard in large plants
  • Siemens Historian: Integrated with Siemens systems
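
The automatic aggregation these databases provide is conceptually simple — a sketch of minute-level downsampling, assuming readings arrive as (epoch seconds, value) pairs (the bucketing scheme here is illustrative; each historian has its own implementation):

```python
from collections import defaultdict

def downsample_minutely(readings):
    """Average all readings that fall into the same one-minute bucket."""
    buckets = defaultdict(list)
    for epoch_s, value in readings:
        buckets[epoch_s // 60].append(value)   # bucket key = minute index
    return {minute * 60: round(sum(v) / len(v), 3)
            for minute, v in sorted(buckets.items())}

readings = [(0, 4.2), (15, 4.4), (59, 4.0), (60, 5.0), (75, 5.2)]
print(downsample_minutely(readings))  # → {0: 4.2, 60: 5.1}
```

Historians precompute these rollups at write time, which is why a "daily average over five years" query returns instantly instead of scanning billions of raw points.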

From Sensor to Decision: The Full Picture

Trace the complete journey of a single sensor reading:

Temperature Sensor -> PLC -> OPC-UA Server -> ETL Pipeline
    -> Cleaning and Quality Check -> Historian DB
    -> Live Dashboard
    -> AI Model for Fault Prediction
    -> Automatic Maintenance Alert

Time from Reading to Decision:

  • Live monitoring: < 1 second
  • Anomaly detection: 1-5 seconds
  • Predictive maintenance: minutes to hours (batch analysis)

Practical Tips for Engineers

  1. Start with existing data: Most factories already have PLCs recording data — look for it before buying new sensors
  2. Quality over quantity: 100 clean readings are better than a million error-filled ones
  3. Document everything: Which sensor, which unit, which sampling rate — because you will forget in 6 months
  4. Plan for storage: Vibration data at 10 kHz fills disks fast — use compression and retention policies
  5. Test your pipeline on real data: Simulated data will not contain the problems you will face in production