Contrastive learning method for leak detection in water distribution networks
This study presents a contrastive-learning (CL)-based approach for acoustic water leak detection in WDNs. Figure 6 illustrates the three-step framework for CL-based leak detection modeling, outlining the key stages of the proposed methodology.
The initial step in the framework encompasses data collection and processing. Acoustic signals from the WDNs are acquired using suitable sensors or devices. These signals serve as a valuable source of information for detecting the presence of leaks. The collected data is then preprocessed and prepared for analysis. In the second step, the CL-based approach is applied to establish a self-supervised CL-based model for leak detection. This approach leverages the principle of self-supervision: the encoder is trained with self-supervised learning to acquire representations of the acoustic signals without relying on labeled data.
By pre-training the encoder, the model can capture crucial acoustic features and patterns that distinguish each acoustic signal, laying the foundation for effective leak condition classification. Once pre-trained, the encoder serves as the foundational component for the subsequent leak detection task. Fine-tuning is then applied to adapt the model specifically for leak detection, further optimizing its performance. This fine-tuning process ensures that the model becomes specialized in accurately and efficiently identifying leak states by utilizing the features acquired during the CL phase.
Cross-entropy, leak detection accuracy, and F1-score are considered when evaluating the proposed model’s performance, and the augmentation strategy is optimized accordingly. Comparative analyses of leak detection performance are performed to assess the efficacy of the CL approach relative to SL. Additionally, ablation experiments are conducted to evaluate the influence of the model architecture and gain insights into its impact. t-SNE visualization illustrates the model’s ability to cluster and differentiate instances associated with different leak conditions. Furthermore, out-of-sample validation is undertaken to assess the generalization ability of the model when confronted with unseen data. The detailed procedures and outcomes of the steps above are presented in the subsequent sections of this study.
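For concreteness, the evaluation metrics named above can be computed as in the following minimal sketch using scikit-learn; the function and variable names are illustrative assumptions, not the study’s actual implementation.

```python
# Minimal sketch of the evaluation metrics (cross-entropy, accuracy, F1-score);
# names and the 0.5 decision threshold are illustrative assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

def evaluate(y_true, y_prob):
    """y_true: ground-truth labels (0 = non-leak, 1 = leak);
    y_prob: predicted leak probabilities from the model."""
    y_pred = (np.asarray(y_prob) >= 0.5).astype(int)
    return {
        "cross_entropy": log_loss(y_true, y_prob),  # binary cross-entropy
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```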
Data collection and processing
Hong Kong, China, is a densely populated city with extensive WDNs. However, the average age of the pipeline network is over thirty years, leading to pipeline deterioration, leaks, and water supply instability. It therefore provides opportunities to collect leak and non-leak samples for leak detection experiments. As depicted in Fig. 7, the research team deployed noise loggers in the chambers close to the potential leak points. The noise loggers are connected to the valve or pipeline to collect vibroacoustic signals caused by leaks in the pipe. Signals were collected at a sampling rate of 4096 Hz for 10 s per sample. To avoid the influence of external noise on leak detection, signal collection occurred at midnight. For each leak site, the collection period lasted several days, depending on site conditions, to ensure a sufficient data volume and avoid unexpected noise.
In total, 1004 samples were collected, including 439 leak samples and 565 non-leak samples. Each 10-s sample was divided into 1-s segments, taken as the basic modeling units, to enrich the dataset. Ultimately, 4390 leak samples and 5650 non-leak samples were obtained for subsequent modeling.
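The segmentation step can be sketched as follows (a minimal NumPy illustration, assuming each 10-s recording is a one-dimensional array of 40,960 points; names are illustrative):

```python
# Split each 10-s recording (sampling rate 4096 Hz) into ten 1-s segments.
import numpy as np

FS = 4096            # sampling rate (Hz)
SEG_LEN = FS * 1     # one 1-s segment = 4096 points

def segment(recording: np.ndarray) -> np.ndarray:
    """Split one 10-s signal of shape (40960,) into ten 1-s segments."""
    n_segments = len(recording) // SEG_LEN
    return recording[: n_segments * SEG_LEN].reshape(n_segments, SEG_LEN)
```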
Acoustic leak detection modeling
CL is introduced to train the model to capture the intrinsic features and semantic details of the data32, thereby enabling the learning of representative representations. The pre-trained blocks enhance the model’s generalization and robustness33 in downstream tasks, including classification and fault detection. As depicted in Fig. 8, the CL modeling procedure can be divided into two phases: i) CL pre-training and ii) leak detection modeling (downstream task fine-tuning).
Figure 9 shows the conceptual framework for CL. Self-supervised CL utilizes unlabeled data and designs its own supervisory signals to learn data representations, significantly reducing annotation costs. First, based on an existing sample x, the approach generates new samples \(\hat{x}_i\) and \(\hat{x}_j\) by applying two different augmentations (\(t \sim \mathcal{T}\) and \(t^{\prime} \sim \mathcal{T}\)), enhancing the model’s understanding of the data. For time-series signals, the basic augmentation tools include Gaussian noise, time shift augmentation, pitch shift augmentation, jitter, and adjacent sample augmentation34,35. Generally, the two signals generated from the same sample are considered a positive pair, while signals generated from different samples are considered negative pairs. These augmentation operations increase the diversity between samples, thereby helping the model learn more robust representations.
Subsequently, the base encoder f(·) projects the input x into the representation h, which is then passed through the projection head g(·) to obtain the vector z. The model is trained using the InfoNCE loss to maximize the agreement between positive pairs and minimize the agreement between negative pairs, as depicted in Eq. (5):
$$\ell_{i,j}=-\log \frac{\exp \left(\mathrm{sim}(\mathbf{z}_{i},\mathbf{z}_{j})/\tau \right)}{\sum_{k=1}^{2N}\mathrm{I}_{[k\ne i]}\exp \left(\mathrm{sim}(\mathbf{z}_{i},\mathbf{z}_{k})/\tau \right)}$$
(5)
where \(\mathrm{I}\) is the indicator function, which outputs 1 when \(k\ne i\) and 0 when \(k=i\).
The function \(\mathrm{sim}(\cdot)\) denotes the similarity score between two vectors. This study utilized cosine similarity, which measures the cosine of the angle between two vectors and effectively captures relationships between feature representations in high-dimensional space36. τ is a temperature parameter that controls the sharpness of the distribution, N is the batch size, and \(\mathbf{z}_i\) and \(\mathbf{z}_j\) denote the projections of a signal pair.
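A minimal PyTorch sketch of the loss in Eq. (5) is given below; the batch layout (the two augmented views concatenated into 2N rows) and all names are illustrative assumptions:

```python
# Sketch of the InfoNCE (NT-Xent) loss of Eq. (5): cosine similarity, temperature
# scaling, and exclusion of the k = i self-similarity term.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, d) projections of the two augmented views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N rows, unit norm
    sim = z @ z.t() / tau                               # cosine similarity / tau
    sim.fill_diagonal_(float("-inf"))                   # exclude the k = i terms
    n = z1.size(0)
    # the positive of row i is its counterpart in the other view
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)                # mean of -log softmax over rows
```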
The CL architecture in this study consists of an encoder and a projector. A CNN is employed as the backbone of the encoder because of its proficiency in processing sequential data, including seismic signals37, time-series data38, and speech recognition39. Convolutional layers are the core of CNNs: they utilize learnable filters (kernels) to extract local and global features from input signals40,41, enabling the model to capture local patterns and learn representations at different feature levels. The operation within a convolutional layer is depicted as:
$$y_i(j)=K_i\cdot x(j)+b_i$$
(6)
where \(K_i\) is the i-th kernel, \(b_i\) is its bias, \(x(j)\) denotes the input values within the receptive field at position j, and \(y_i(j)\) is the corresponding kernel output.
Specifically, as illustrated in Fig. 8, the encoder consists of several convolutional blocks that process and extract features from the input signals. Each block begins with a one-dimensional convolutional layer (Conv1d) with a kernel size of 4, stride of 2, and padding of 1, allowing the model to capture local patterns while progressively downsampling the signal40,41. Following the convolutional layer, batch normalization is applied; this step stabilizes and accelerates training by normalizing the inputs of each layer42. Subsequently, a Rectified Linear Unit (ReLU) activation function introduces non-linearity into the model, enabling it to learn complex patterns in the data43. After that, a max pooling layer (MaxPool1d) with a kernel size of 2 and a stride of 2 is employed. Max pooling reduces the dimensionality of the feature map and mitigates the impact of data noise, thereby reducing computational complexity and aiding in capturing dominant features in the input signal40. Each convolutional block progressively projects the channels to higher dimensions: from 1 to 32 in the first block, to 64 in the second block, and to 512 in the subsequent blocks. This gradual increase in model depth allows the model to capture high-level features from the acoustic signals.
Following the final convolutional block, a projector layer is connected. The projector is a fully connected layer that maps the high-dimensional features extracted by the convolutional blocks into a lower-dimensional space suitable for similarity measurement. This component is critical for tasks such as CL, where accurately measuring the similarity between pairs of signals is essential. The projector transforms the encoder output into a 1024-dimensional vector. By utilizing the InfoNCE loss (Eq. (5)) and extracting critical features from the signals, CL enhances the encoder’s feature capture capabilities and provides valuable insights for subsequent tasks.
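The encoder and projector described above can be sketched in PyTorch as follows; the kernel, stride, pooling, and 1 → 32 → 64 → 512 channel settings follow the text, while the remaining details (number of blocks, input length, flattened feature size) are assumptions:

```python
# Sketch of the CL encoder (stacked convolutional blocks) and the projector head.
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """Conv1d(k=4, s=2, p=1) -> BatchNorm -> ReLU -> MaxPool1d(k=2, s=2)."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=2, stride=2),
    )

class Encoder(nn.Module):
    def __init__(self, channels=(1, 32, 64, 512, 512)):  # channel schedule from the text
        super().__init__()
        self.blocks = nn.Sequential(
            *[conv_block(channels[i], channels[i + 1]) for i in range(len(channels) - 1)]
        )

    def forward(self, x):              # x: (batch, 1, 4096), one 1-s segment
        h = self.blocks(x)             # (batch, 512, L')
        return h.flatten(start_dim=1)  # representation h

class Projector(nn.Module):
    """Fully connected head mapping h to the 1024-dim vector z."""
    def __init__(self, in_dim: int, out_dim: int = 1024):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, h):
        return self.fc(h)
```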
After the initial CL pre-training, fine-tuning is conducted to adapt the encoder to the leak detection task. Specifically, the encoder is connected to a Multilayer Perceptron (MLP), which outputs the leak condition of the input samples. The settings of the downstream modeling are adjusted according to the purpose of each experiment.
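A hedged sketch of this downstream architecture follows, assuming a two-class output and an illustrative hidden size:

```python
# Sketch of the fine-tuning stage: pre-trained encoder plus an MLP head.
import torch.nn as nn

class LeakDetector(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, n_classes: int = 2):
        super().__init__()
        self.encoder = encoder        # weights initialized from CL pre-training
        self.head = nn.Sequential(    # MLP head; hidden size 256 is an assumption
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        return self.head(self.encoder(x))  # logits for leak / non-leak
```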
Model evaluation and experiments
In this section, the evaluation and experimental results of the proposed model for leak detection are presented. A comprehensive set of experiments was conducted to assess its performance and effectiveness, including data augmentation experiments, model ablation experiments, labeled-data comparison experiments, and out-of-sample validation.
Regarding data augmentation experiments, CL requires data augmentation to generate positive and negative samples for in-depth feature capture. The combination of augmentations strongly influences the performance of CL. Therefore, a series of experiments was conducted to explore the optimal combination of augmentation methods and reach higher accuracy. Due to limited computational resources, this study excluded ML-based augmentation methods such as autoencoders and GANs. Instead, following the augmentation approaches employed in previous studies44,45, this study focused on signal-based augmentation methods, including masking, WGN, flip y, flip x, translation, and amplitude scaling. Detailed explanations of these methods are provided in Eqs. (7) to (12), and an illustrative code sketch follows the list.
1) Masking (Eq. (7)): The operation randomly occludes or masks out a portion of the input signal with probability p. This can improve the model’s robustness to partially missing data.
$$X=\mathrm{dropout}(X,p)$$
(7)
2) Adding WGN (Eq. (8)): The operation overlays the input signals with random noise following a Gaussian distribution \(N(0,\sigma )\). This study generated the WGN at 0 dB46,47,48, enhancing the model’s robustness to noisy inputs.
$$X=X+\varepsilon ,\,\varepsilon \sim N(0,\sigma )$$
(8)
3) Flipping along the y-axis (Flip y, Eq. (9)): This operation vertically flips (negates) the input signals, increasing the model’s adaptability to vertical variations.
$$X=-X$$
(9)
4) Flipping along the x-axis (Flip x, Eq. (10)): The operation horizontally flips the input signals. This can enhance the model’s adaptability to horizontal variations.
$$X=\mathrm{reverse}(X)$$
(10)
5) Translation (Eq. (11)): The operation randomly shifts the input signal by i units in the spatial or temporal dimension, where i is a random integer ranging from 1 to the total length of the input vector minus one. This operation can improve the model’s robustness to changes in input positions.
$$X=\mathrm{concat}(X[i:\mathrm{end}],\,X[0:i-1])$$
(11)
6) Amplitude Scaling (Eq. (12)): The operation randomly scales the amplitude of the input signals through the coefficient \(\alpha\), enhancing the model’s robustness to changes in input amplitudes. In this study, the standard deviation \(\sigma\) was set to 1.1.
$$X=\alpha X,\,\alpha \sim N(1,\sigma )$$
(12)
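The following NumPy sketch illustrates Eqs. (7) to (12); defaults are illustrative assumptions except where the text specifies them (e.g., σ = 1.1 for amplitude scaling), and the 0 dB noise level for WGN is simplified to a plain σ parameter here:

```python
# Illustrative versions of the six signal-based augmentations, Eqs. (7)-(12).
import numpy as np

rng = np.random.default_rng()

def masking(x, p=0.1):                   # Eq. (7): randomly zero out points with prob. p
    return np.where(rng.random(x.shape) < p, 0.0, x)

def add_wgn(x, sigma=1.0):               # Eq. (8): additive white Gaussian noise
    return x + rng.normal(0.0, sigma, x.shape)

def flip_y(x):                           # Eq. (9): vertical flip (negation)
    return -x

def flip_x(x):                           # Eq. (10): horizontal flip (time reversal)
    return x[::-1]

def translation(x):                      # Eq. (11): circular shift by i in [1, len-1]
    i = rng.integers(1, len(x))
    return np.concatenate([x[i:], x[:i]])

def amplitude_scaling(x, sigma=1.1):     # Eq. (12): random gain alpha ~ N(1, sigma)
    return rng.normal(1.0, sigma) * x
```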
Meanwhile, model ablation is conducted to understand the contribution and significance of different components within the model architecture. By systematically removing or altering specific parts of the model, the experiments reveal the impact of these changes on overall performance, guiding the optimization of the model architecture.
This study employed ablation experiments to evaluate the leak detection capacity of models with different numbers of convolutional blocks, and the optimal number was selected based on the leak detection results. Specifically, as shown in Fig. 10, the ablation experiments were conducted on models with i) two, ii) three, iii) four, iv) five, and v) six convolutional blocks. Each model was pre-trained with CL and fine-tuned on the 10% labeled dataset for downstream leak detection.
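Reusing the parameterized Encoder from the earlier sketch, these variants can be instantiated as follows; the channel schedules beyond the 1 → 32 → 64 → 512 progression stated in the text are assumptions:

```python
# Sketch of the ablation variants with two to six convolutional blocks.
schedules = {
    2: (1, 32, 64),
    3: (1, 32, 64, 512),
    4: (1, 32, 64, 512, 512),
    5: (1, 32, 64, 512, 512, 512),
    6: (1, 32, 64, 512, 512, 512, 512),
}
variants = {n: Encoder(channels=ch) for n, ch in schedules.items()}
```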
Regarding the comparison of labeled datasets, CL enables the model to be trained with a limited number of labeled samples. To demonstrate this point, the experiments employed the CNN as the backbone for leak detection modeling, and both CL and SL were used to train the model under different proportions of labeled data. During the experiments, a small proportion of the labeled dataset was used for leak detection training, while the remainder was used for validation. Specifically, this study extracted 5%, 10%, 15%, 20%, 25%, and 30% of the labeled dataset to train the model under each learning method. The objective is to evaluate the effectiveness of CL in leveraging unlabeled acoustic signals and achieving comparable or superior performance with a smaller amount of labeled data. The comparison also reveals the influence of data volume on leak detection performance.
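A minimal sketch of these stratified labeled-fraction splits, assuming arrays X (signals) and y (labels) exist and using scikit-learn:

```python
# Extract stratified labeled fractions (5%-30%) for fine-tuning; the remainder
# of each split is held out for validation. X and y are assumed to exist.
from sklearn.model_selection import train_test_split

for frac in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=frac, stratify=y, random_state=0
    )
    # fine-tune with (X_train, y_train) under CL or SL; validate on (X_val, y_val)
```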
Out-of-sample validation is a critical methodology for assessing the model’s performance on novel, unseen data distinct from the data used during training. This evaluation determines the model’s ability to accurately classify previously unencountered instances. In this study, supplementary field experiments with the same experimental settings were conducted on the WDNs, and an independent dataset was assembled, comprising 670 leak samples and 890 non-leak samples. Notably, these samples were obtained from previously unexplored sites, ensuring their novelty and relevance for the evaluation. Through this rigorous evaluation, the robustness of the model is thoroughly examined, with particular attention to its capacity for accurately classifying signals associated with leak conditions in unexplored real-world scenarios.