Our paper, “Rethinking Autoencoders for Medical Anomaly Detection from a Theoretical Perspective”, has been accepted at MICCAI 2024. It provides a theoretical foundation for autoencoder (AE)-based anomaly detection (AD), clarifying the principles that should guide AE design for this task. The analysis shows that minimizing the entropy of the latent space is crucial for preventing the reconstruction of anomalous regions, and experiments on four datasets spanning two image modalities validate the theoretical findings. To our knowledge, this is the first theoretical elucidation of the principles and design concepts of autoencoders for anomaly detection. The details of the paper follow.
Figure 1: Schematic of AE-based anomaly detection
As shown in Fig. 1, an AE \(\phi\) comprises an encoder \(f_e\) and a decoder \(f_d\). The encoder compresses the input image \(\mathbf{X} \in \mathbb{R}^{C \times H \times W}\) into a compact latent vector \(\mathbf{Z} = f_e(\mathbf{X}) \in \mathbb{R}^d\), and the decoder maps the latent vector back to the image space, \(\hat{\mathbf{X}} = f_d(\mathbf{Z}) \in \mathbb{R}^{C \times H \times W}\). We denote a normal image as \(\mathbf{X}_n\) and an abnormal image as \(\mathbf{X}_a\). For typical medical images such as chest X-rays and brain MRIs, each abnormal image \(\mathbf{X}_a\) can be understood as the corresponding healthy version \(\mathbf{X}_n\) with lesion regions \(\delta\) added, i.e., \(\mathbf{X}_a = \mathbf{X}_n + \delta\). The training objective is to minimize the reconstruction loss on normal images:
\[\min \mathbb{E}[\Vert \phi(\mathbf{X}_n) - \mathbf{X}_n \Vert^2] \tag{1}\]
Ideally, the trained model is expected to achieve the following goals:
\[\begin{align} & \hat{\mathbf{X}}_n = \phi(\mathbf{X}_n) \xrightarrow{} \mathbf{X}_n \tag{2} \\ & \hat{\mathbf{X}}_a = \phi(\mathbf{X}_a) = \phi(\mathbf{X}_n+\delta) \xrightarrow{} \mathbf{X}_n, \tag{3} \label{eq:obj2} \end{align}\]
where \(\hat{\mathbf{X}}_n\) and \(\hat{\mathbf{X}}_a\) represent the reconstructions generated by the AE. Consequently, the reconstruction error, \(\mathcal{A}_{rec}(\mathbf{X}) = \Vert \hat{\mathbf{X}} - \mathbf{X} \Vert^2\), will be \(\mathbf{0}\) if \(\mathbf{X}=\mathbf{X}_n\) and \(\delta^2\) if \(\mathbf{X}=\mathbf{X}_a\), thereby highlighting the abnormal regions.
However, the training objective (Equation (1)) does not align with the ideal task objectives (Equations (2) and (3)). The training objective encourages the AE to produce reconstruction results identical to the model input, but the ideal reconstruction result for anomalous images during inference differs from the model input. This discrepancy can cause the AE to successfully reconstruct some anomalous regions, leading to false-negative predictions.
In the extreme case, if the AE learns the function \(\phi(\mathbf{X})=\mathbf{X}, \forall \mathbf{X}\), it perfectly satisfies the training objective but cannot detect any anomalies. This phenomenon is called the “identical shortcut”.
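The contrast between the ideal behavior and the identical shortcut can be sketched on a toy image. This is only an illustration under stated assumptions: `ideal_ae` and `identity_ae` are hypothetical stand-ins rather than trained networks, and the 4x4 image and lesion values are made up.

```python
import numpy as np

def anomaly_map(x, x_hat):
    """Pixel-wise squared reconstruction error."""
    return (x_hat - x) ** 2

# Toy 1x4x4 "image": flat healthy background plus a small synthetic lesion.
x_n = np.full((1, 4, 4), 0.5)
delta = np.zeros_like(x_n)
delta[0, 1:3, 1:3] = 0.4          # hypothetical lesion region
x_a = x_n + delta

def ideal_ae(x):
    # Idealized stand-in: always returns the healthy image X_n.
    return x_n.copy()

def identity_ae(x):
    # Identical shortcut: phi(X) = X for every input.
    return x.copy()

ideal_map = anomaly_map(x_a, ideal_ae(x_a))        # equals delta**2
shortcut_map = anomaly_map(x_a, identity_ae(x_a))  # all zeros: lesion is missed
train_loss_shortcut = anomaly_map(x_n, identity_ae(x_n)).mean()  # zero training loss
```

The shortcut model reaches zero training loss on the normal image, yet its anomaly map is zero everywhere, so the lesion goes undetected.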
To better understand this issue and seek potential solutions, we conduct a theoretical analysis of the properties of AEs below.
Proposition 1: For the AE shown in Figure 1, let \(\mathbf{Z}_0 \in \mathbb{R}^D\) be the feature vector before the latent vector \(\mathbf{Z} \in \mathbb{R}^d\), and let \(\hat{\mathbf{Z}}_0 \in \mathbb{R}^D\) be the feature vector after \(\mathbf{Z}\). Then, if \(d < \frac{D}{2}\), the AE cannot learn an identity mapping. The proof is provided in the original paper.
Proposition 1 reveals that an AE with an appropriate latent dimension can effectively circumvent the undesired identical shortcut. Consequently, we contend that there is no need to introduce more complex modules to address this problem.
Although an AE with \(d<\frac{D}{2}\) does not suffer from the identical shortcut, we observe that some abnormal regions are still reconstructed, which is undesirable and stems from the generalization ability of the network. This motivates us to analyze the cause theoretically and, ideally, to find the optimal solution to guide the design process. Our argument, with theoretical evidence, is presented in Proposition 2.
Proposition 2: Given an AE for anomaly detection, let \(\mathbf{X}_n\) be the normal image, \(\mathbf{X}_a\) be the abnormal image, and \(\mathbf{Z}\) be the latent vector of the AE. An optimal AE should satisfy: (1) \(I(\mathbf{X}_n; \mathbf{Z})=H(\mathbf{X}_n)\), and (2) \(I(\mathbf{X}_a; \mathbf{Z})=H(\mathbf{X}_n)\). Here, \(I\) denotes the mutual information and \(H\) denotes the information entropy.
Proof is provided in the original text.
Proposition 2 states the conditions that the latent vector of an optimal AE should satisfy: (1) it should carry all the information content of normal data, but (2) it should not carry any information about abnormalities, as depicted in Fig. 2(c). A conventional AE trained with only Eq. 1 is optimized to achieve \(\hat{\mathbf{X}}_n = \mathbf{X}_n\), which gives \(I(\mathbf{X}_n; \mathbf{Z}) = H(\mathbf{X}_n)\) and \(I(\mathbf{X}_a; \mathbf{Z}) \geq H(\mathbf{X}_n)\). Such an AE therefore satisfies Proposition 2(1) but fails to fulfill (2). The Venn diagram in Fig. 2(b) depicts this situation: \(H(\mathbf{Z})\) extends beyond the scope of \(H(\mathbf{X}_n)\) and inadvertently provides information about lesions, \(H(\mathbf{X}_a|\mathbf{X}_n)\), leading to false negatives. This is an intractable issue because abnormal images are unavailable during training. To address it and achieve Proposition 2(2), \(H(\mathbf{Z})\) should ideally be minimized until it approaches the level of \(H(\mathbf{X}_n)\), moving the scope of \(H(\mathbf{Z})\) from Fig. 2(b) to Fig. 2(c).
In summary, our theory suggests that in AD, AE tends to benefit from minimizing the entropy of the latent space, aiming to approach the entropy of normal data. This ensures that anomalies cannot be represented and reconstructed by the model. Meanwhile, for more complex datasets with a higher information content, it is required to increase the entropy of the latent space to match that of the normal data. In practice, this can be achieved explicitly through latent dimension adjustment, or implicitly by enforcing latent space restrictions.
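The two conditions of Proposition 2 can be checked numerically on a small discrete toy problem. The sketch below is an assumption-laden illustration, not the paper's setup: images are reduced to hypothetical `(anatomy, lesion)` pairs with uniform probabilities, and `leaky_encoder` and `tight_encoder` are made-up deterministic encoders that respectively keep and discard the lesion component.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())

def mutual_information(xs, zs):
    """I(X; Z) = H(X) + H(Z) - H(X, Z) for paired samples."""
    return entropy(xs) + entropy(zs) - entropy(list(zip(xs, zs)))

# Toy world: 4 equally likely healthy anatomies; lesion 0 means "no lesion",
# lesions 1 and 2 are two equally likely abnormalities.
x_n = [(a, 0) for a in range(4)]                  # normal images
x_a = [(a, l) for a in range(4) for l in (1, 2)]  # abnormal images

def leaky_encoder(x):
    return x        # latent keeps the lesion component

def tight_encoder(x):
    return x[0]     # latent keeps only the anatomy

h_n = entropy(x_n)  # H(X_n)
for name, enc in [("leaky", leaky_encoder), ("tight", tight_encoder)]:
    i_n = mutual_information(x_n, [enc(x) for x in x_n])
    i_a = mutual_information(x_a, [enc(x) for x in x_a])
    print(f"{name}: I(X_n;Z)={i_n:.1f}  I(X_a;Z)={i_a:.1f}  H(X_n)={h_n:.1f}")
```

Both encoders satisfy condition (1), but only the encoder whose latent entropy matches \(H(\mathbf{X}_n)\) also satisfies condition (2); the leaky one carries one extra bit that describes the lesion.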
Verifying Proposition 1
Fig. 3 presents reconstruction errors w.r.t. the latent dimension on the RSNA dataset; the trends align with our theory. Firstly, when \(d\) is small, increasing \(d\) decreases the reconstruction errors, whereas once \(d>\frac{D}{2}=512\), a further increase in \(d\) does not lead to smaller errors. This validates that an AE with a small \(d\) does not encounter the identical shortcut, while for \(d>\frac{D}{2}\), the bottleneck becomes saturated. Secondly, even when \(d>\frac{D}{2}\), errors on normal training data are smaller than those on normal testing data, which, in turn, are smaller than errors on abnormal testing data. This suggests that even if the bottleneck is saturated, identical mapping does not occur in the AE due to the limited capacity of the network.
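A latent-dimension sweep of this kind can be mimicked with a linear stand-in. The sketch below swaps the convolutional AE for a PCA-based linear autoencoder on synthetic data, so it does not reproduce the \(\frac{D}{2}\) threshold of Proposition 1 or the RSNA results; a linear bottleneck only becomes an exact identity at \(d = D\). All sizes, noise levels, and helper names (`sample_normal`, `sample_abnormal`, `recon_error`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 16, 4  # toy feature size and intrinsic dimension of "normal" data

# Normal data: a k-dimensional signal embedded in R^D plus small noise.
basis, _ = np.linalg.qr(rng.normal(size=(D, k)))  # D x k, orthonormal columns

def sample_normal(n):
    return rng.normal(size=(n, k)) @ basis.T + 0.05 * rng.normal(size=(n, D))

def sample_abnormal(n):
    # Abnormal = normal + delta, with a synthetic "lesion" of norm 2.
    delta = rng.normal(size=(n, D))
    delta *= 2.0 / np.linalg.norm(delta, axis=1, keepdims=True)
    return sample_normal(n) + delta

train, test_n, test_a = sample_normal(500), sample_normal(200), sample_abnormal(200)

# Linear AE stand-in: project onto the top-d principal directions of the training set.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)  # vt has shape D x D

def recon_error(x, d):
    v = vt[:d].T                           # D x d, shared encoder/decoder weights
    x_hat = mean + (x - mean) @ v @ v.T    # encode to d dims, then decode
    return float(np.mean(np.sum((x_hat - x) ** 2, axis=1)))

dims = [1, 2, 4, 8, 12, 16]
errors = {d: (recon_error(train, d), recon_error(test_n, d), recon_error(test_a, d))
          for d in dims}
for d, (e_tr, e_n, e_a) in errors.items():
    print(f"d={d:2d}  train={e_tr:.4f}  normal-test={e_n:.4f}  abnormal-test={e_a:.4f}")
```

In this toy setting the errors never increase as \(d\) grows, the gap between normal and abnormal errors is large near the intrinsic dimension, and at \(d = D\) the model reconstructs everything, including the synthetic lesions.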
Verifying Proposition 2
We validate Proposition 2 by controlling \(H(\mathbf{Z})\) via latent dimension adjustment. Tab. 1 presents the AD performance of AEs with different \(d\), which aligns with our proposition. Firstly, reducing \(d\) from 128 to 1 initially improves the performance and then leads to deterioration, with the optimal \(d\) typically being quite small. Notably, the performance of \(d=4\) surpasses that of \(d=128\) by more than 10% AUC on both the RSNA and VinDr-CXR datasets, indicating that normal information \(H(\mathbf{X}_n)\) can be represented by a compact vector. On the other hand, too large a value of \(d\) may result in generalization to abnormal samples, while too small a value cannot sufficiently represent normal information; both lead to performance deterioration.
Secondly, the optimal \(d\) varies across different image modalities, reflecting differences in \(H(\mathbf{X}_n)\). For RSNA and VinDr-CXR datasets, the optimal \(d\) is 4, whereas for Brain Tumor and BraTS2021 datasets, it is 32 and 16, respectively. This disparity can be attributed to the fact that MRIs offer more information content compared to X-rays. MRIs are volumetric scans that capture detailed tissue information, exhibiting greater variations among healthy subjects and encompassing variations among axial slices. These characteristics enable MRIs to surpass the information content of X-rays, necessitating a larger \(d\) to effectively expand \(H(\mathbf{Z})\) for MRI datasets.
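Choosing \(d\) by AUC, as in the comparison above, can be sketched as follows. The helper names `auc` and `select_latent_dim` are hypothetical, and the scores are made-up reconstruction errors for illustration only, not results from the paper.

```python
def auc(normal_scores, abnormal_scores):
    """Probability that a random abnormal sample scores higher than a
    random normal one (ties count as 0.5); equals the ROC AUC."""
    wins = 0.0
    for a in abnormal_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(normal_scores) * len(abnormal_scores))

def select_latent_dim(scores_by_dim):
    """scores_by_dim maps d -> (normal_scores, abnormal_scores).
    Returns the d with the highest AUC and the AUC for every d."""
    aucs = {d: auc(n, a) for d, (n, a) in scores_by_dim.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs

# Made-up anomaly scores (reconstruction errors), illustration only.
toy_scores = {
    1: ([0.9, 0.8, 0.7], [0.85, 0.95, 0.75]),       # too small: everything is poorly reconstructed
    4: ([0.2, 0.3, 0.1], [0.6, 0.5, 0.7]),          # normal reconstructed well, abnormal not
    128: ([0.05, 0.04, 0.06], [0.05, 0.07, 0.03]),  # too large: abnormal reconstructed as well
}
best_d, aucs = select_latent_dim(toy_scores)
```

Note that this selection needs labeled evaluation data, since the AUC is computed from both normal and abnormal scores.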
This paper presents a theoretical analysis of AEs in anomaly detection. We prove that an appropriate latent dimension can avoid the “identical shortcut” in AEs. By leveraging information theory, we derive the conditions that an optimal AE should satisfy. Our findings indicate that, apart from the reconstruction loss, imposing a constraint on the entropy of the latent space is crucial for preventing the reconstruction of abnormal information. Experiments validate our theoretical framework and highlight the efficacy of simple latent dimension reduction in constraining the entropy and achieving significant performance improvements. Overall, this paper provides a theoretical foundation for guiding the design of AEs in anomaly detection, facilitating the development of more effective and reliable anomaly detection methods.
However, the current approach for adjusting the latent dimension relies on evaluation results to find the optimal value, which is undesirable. To overcome this limitation, our future work aims to quantify the information entropy of normal training data, \(H(\mathbf{X}_n)\), and develop self-adaptive methods that dynamically constrain \(H(\mathbf{Z})\) to approach \(H(\mathbf{X}_n)\) on different datasets. This approach would eliminate the need for manual selection of the latent dimension and enhance the adaptability of AE in various anomaly detection scenarios.