22.10.2021

Motivation

  • Detecting contaminants in water distribution systems
    • Measuring specific contaminants: expensive
    • Measuring all contaminants: impossible
  • Also: detecting other irregularities
    • e.g., equipment/sensor failures

The Data

Source

Variables: Inputs

  • pH — pH value
  • Redox — Redox potential
  • Leit — Electric conductivity
  • Trueb — Turbidity
  • Cl — Chlorine dioxide 1
  • Cl_2 — Chlorine dioxide 2
  • Tp — Temperature
  • Fm — Flow rate 1
  • Fm_2 — Flow rate 2

The Task

  • Challenge: detect events (supervised)
  • Task now: detect anomalies (unsupervised)

Preprocessing

  • Data split: training, validation, test
  • Denoise: moving average
  • Detrend: convert to differences between time steps
  • Missing values: last observation carried forward
  • Scaling: zero mean, unit variance
    • Based on training data mean and variance
  • Reshape: sequence blocks (64 samples x 9 features)
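The preprocessing chain above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the moving-average window size (5) is an assumption, and the steps are applied in one plausible order (missing values must be filled before smoothing, or NaNs would propagate).

```python
import numpy as np

def preprocess(x, train_mean, train_std, window=5, block_len=64):
    """Sketch of the preprocessing chain; window size is an assumption."""
    x = x.copy()
    # Missing values: last observation carried forward (per feature column)
    for j in range(x.shape[1]):
        for i in range(1, x.shape[0]):
            if np.isnan(x[i, j]):
                x[i, j] = x[i - 1, j]
    # Denoise: moving average along the time axis
    kernel = np.ones(window) / window
    x = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, x)
    # Detrend: differences between consecutive time steps
    x = np.diff(x, axis=0)
    # Scale: zero mean, unit variance, using *training-data* statistics only
    x = (x - train_mean) / train_std
    # Reshape: non-overlapping sequence blocks (block_len samples x features)
    n_blocks = x.shape[0] // block_len
    return x[: n_blocks * block_len].reshape(n_blocks, block_len, x.shape[1])
```

With the nine input variables listed above, each block has shape (64, 9), matching the slide.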

The Model

Autoencoder (AE)

  • AE reconstructs its own input
  • Trained on ‘normal’ data
  • ‘Large’ reconstruction error \(\rightarrow\) anomaly
  • Here: 1D-convolutional layers
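The anomaly decision rule can be sketched independently of the network itself. Here `reconstruct` stands in for the trained 1D-convolutional AE (assumed to map an array of blocks to reconstructions of the same shape); the mean-squared-error score and the fixed threshold are illustrative choices.

```python
import numpy as np

def flag_anomalies(blocks, reconstruct, threshold):
    """Flag blocks whose mean squared reconstruction error exceeds threshold.
    `reconstruct` is a stand-in for the trained autoencoder's forward pass."""
    recon = reconstruct(blocks)
    errors = np.mean((blocks - recon) ** 2, axis=(1, 2))  # one score per block
    return errors > threshold, errors
```

Since the AE is trained only on 'normal' data, it reconstructs normal blocks well and anomalous blocks poorly, which is what the threshold on the error exploits.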

Handling the Data

  • Initially trained for 20 epochs
    • Using only non-event training samples

  • Validation data analyzed in batches
    • Flag anomalies in batch
    • Update AE with non-anomalies
    • Go to next batch
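The batch-wise loop above can be sketched as follows. `reconstruct` and `update` are stand-ins (assumptions) for the trained model's forward pass and a short fine-tuning step with the reduced update learning rate; only the control flow is taken from the slide.

```python
import numpy as np

def stream_detect(batches, reconstruct, update, threshold):
    """Analyze data batch by batch: flag anomalies in each batch, then
    update the AE only on the non-anomalous blocks before moving on."""
    all_flags = []
    for batch in batches:
        err = np.mean((batch - reconstruct(batch)) ** 2, axis=(1, 2))
        flags = err > threshold
        all_flags.append(flags)
        normal = batch[~flags]
        if len(normal) > 0:
            update(normal)  # fine-tune on non-anomalies only
    return np.concatenate(all_flags)
```

Updating only on non-anomalies keeps the AE's notion of 'normal' current without letting it learn to reconstruct the anomalies themselves.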

Parameters?

Name        Description                          Min          Max
batch_size  batch size                           16           1024
nfilter     # of filters in first/last layer     10           100
lr          learning rate (training)             \(10^{-5}\)  \(10^{-2}\)
lrup        learning rate (update, lr x lrup)    \(10^{-2}\)  \(10^{0}\)
activation  activation function                  {relu, swish, sigmoid}
drp         dropout rate                         \(10^{-3}\)  \(10^{-0.3}\)
num_layers  # of layers in encoder and decoder   1            4

Parameter Tuning

Method: Surrogate Model-Based Optimization

  • Learn relation between performance and parameters
  • Here:
    • 150 evaluations
      • Measure: Area under the ROC curve (AUC), on validation data
    • Surrogate model: Gaussian process regression
    • Search: evolutionary algorithm
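A heavily simplified sketch of this loop on a single parameter in [0, 1]: fit a Gaussian-process surrogate to the (parameter, performance) pairs seen so far, pick the candidate with the best predicted value, evaluate it, and repeat. Everything here is an illustrative assumption — a tiny hand-rolled RBF-kernel GP replaces a full GP library, random candidates stand in for the evolutionary search, and the budget is far smaller than the 150 evaluations on the slide.

```python
import numpy as np

def gp_mean(X, y, Xq, length=0.3, noise=1e-5):
    """Posterior mean of a Gaussian process with an RBF kernel (minimal sketch)."""
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    return k(Xq, X) @ np.linalg.solve(K, y)

def smbo(objective, n_init=5, n_iter=20, seed=0):
    """Surrogate model-based optimization sketch: the surrogate proposes
    the next parameter to evaluate with the (expensive) objective."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 1, n_init)               # initial design
    y = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        cand = rng.uniform(0, 1, 200)           # stand-in for the EA search
        x_next = cand[np.argmax(gp_mean(X, y, cand))]
        X = np.append(X, x_next)
        y = np.append(y, objective(x_next))
    return X[np.argmax(y)], y.max()
```

The point of the surrogate is that each real evaluation (training and validating the AE) is expensive, while querying the GP is cheap, so the search budget goes where the model predicts good performance.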

Tuning Result: Best Parameters

Name        Description                          Min          Max            Best
batch_size  batch size                           16           1024           82
nfilter     # of filters in first/last layer     10           100            66
lr          learning rate (training)             \(10^{-5}\)  \(10^{-2}\)    \(10^{-3.30}\)
lrup        learning rate (update, lr x lrup)    \(10^{-2}\)  \(10^{0}\)     \(10^{-1.87}\)
activation  activation function                  {relu, swish, sigmoid}      relu
drp         dropout rate                         \(10^{-3}\)  \(10^{-0.3}\)  \(10^{-2.56}\)
num_layers  # of layers in encoder and decoder   1            4              1

Tuning Result: Performance

Performance Comparison on Unseen Test Data

  • AE: Autoencoder
  • IF: Isolation Forest
  • LOCF: Last Observation Carried Forward (baseline)
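The LOCF baseline can be sketched as follows: each time step is 'predicted' by carrying the last observation forward, and the size of the resulting prediction error is used as the anomaly score. Aggregating over features with the maximum is an assumption for illustration.

```python
import numpy as np

def locf_scores(x):
    """LOCF baseline anomaly scores for a (time steps x features) array:
    score each step by its deviation from the previous observation."""
    err = np.abs(np.diff(x, axis=0))         # deviation from last observation
    scores = err.max(axis=1)                 # worst feature per time step
    return np.concatenate([[0.0], scores])   # first step has no prediction
```

Despite its simplicity, such a baseline is useful for judging whether the AE's performance gain justifies its cost.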

Example Outputs (pH)

Example Outputs (Redox)

Conclusions

  • Reasonable comparative performance on test data
  • Still: no ‘perfect’ fit
  • Some open issues:
    • Performance measures, benchmarking
    • Unknown ‘true events’ in data
    • Searching for better network architectures
    • Include parameters of preprocessing steps
    • Threshold update

Thanks for your attention. Questions?

Tuning Result: Progress

Tuning Result: Sensitivity

Tuning Result: F1