[IEEE-TIP] Rethinking Self-training for Semi-supervised Landmark Detection: A Selection-free Approach

Recently, a study on semi-supervised landmark detection from the HKUST Smart Lab was accepted by IEEE Transactions on Image Processing, a top journal in computer vision. The work proposes a novel self-training method for semi-supervised landmark detection that addresses the issues caused by pseudo-label selection in traditional self-training.

Background

Landmark detection is a vision task that aims to localize predefined points on a given image. Fully supervised learning for this task does not scale to big-data scenarios because annotations are expensive. To address this issue, semi-supervised learning (SSL) is often adopted to exploit both labeled and unlabeled data. Among existing SSL methods, self-training and its variants are arguably the most widely used due to their simplicity and effectiveness; they iteratively enlarge the training set by adding confident pseudo-labels via sample-selection strategies.

However, when applied to landmark detection, selection-based self-training causes several problems. First, the selected confident pseudo-labels may introduce data bias due to the biased selection, which can degrade performance. Additionally, it is difficult to choose a proper selection threshold, especially for a fine-grained localization task that is more sensitive to noisy labels. Furthermore, coordinate regression-based landmark detectors have recently obtained state-of-the-art (SOTA) results by adopting transformers, yet they do not output confidence scores, which makes selection-based self-training infeasible.

In this work, we aim to rethink self-training for landmark detection by asking the following question: Is pseudo-label selection a necessary component of self-training? We argue that a successful self-training strategy should take care of confirmation bias by utilizing reliable information first, while obtaining such information is not necessarily restricted to the selection of confident pseudo-labels. For example, one may conduct a simpler but more confident task on pseudo-labels to extract reliable information. Figure 1 shows the overall framework of our method.

Figure 1. Self-Training for Landmark Detection (STLD). (1) Model is first trained on the labeled data with supervised learning, (2) then estimates pseudo-labels of unlabeled data, and (3) is retrained on both labeled and pseudo-labeled data with the constructed task curriculum.

A. Self-training

Self-training is a classic SSL method that iteratively enlarges the GT training set with confident pseudo-labeled samples to obtain better performance. Due to its simplicity and effectiveness, it has been successfully applied to various vision tasks such as image recognition, object detection, and semantic segmentation. However, past works have mostly adopted self-training with a sample-selection strategy, using either fixed or dynamic thresholds, to address the confirmation bias issue. In contrast, we explore another line of research by proposing a self-training method without explicit pseudo-label selection.

B. Supervised Landmark Detection

Supervised landmark detection methods can be categorized as either coordinate regression or heatmap regression. Coordinate regression directly regresses the coordinates of landmarks, while heatmap regression uses heatmaps as a proxy of coordinates for landmark prediction. Thanks to various backbones designed for high-resolution representation, heatmap regression held the lead in model performance for a long time. Later, transformers were introduced to this task for end-to-end coordinate regression and showed performance superior to that of heatmap regression. However, these methods rely on fully labeled data for supervised learning, which is not scalable in practice. In this work, we focus on semi-supervised rather than supervised landmark detection, as the former is much more label-efficient. It is worth noting that the proposed SSL method is applicable to both heatmap and coordinate regression.

C. Semi-supervised Landmark Detection

Semi-supervised landmark detection methods can be divided into three types. 1) Consistency-regularization-based methods force models to learn consistency across different geometric or style transformations to obtain supervision for unlabeled images. 2) Adversarial-learning-based methods attempt to learn more generalizable representations or more reasonable predictions from unlabeled data. 3) Self-training-based methods iteratively estimate and utilize pseudo-labels of unlabeled data. Compared to these existing works, our method is a more general solution that 1) applies to both heatmap and coordinate regression, and 2) relies on neither sample selection nor confidence regularization. We attribute these advantages to the carefully designed task-level curriculum.

Preliminaries

In supervised landmark detection, we aim to learn a function $f_{\boldsymbol{\theta}}$ on the labeled set $(\mathcal{X}_l, \mathcal{Y}_l)$ by minimizing the following risk:

$$\mathcal{R}^{\text{SL}}(f_{\boldsymbol{\theta}}) = \sum_{(\boldsymbol{x}_i,\boldsymbol{y}_i) \in (\mathcal{X}_l, \mathcal{Y}_l)} \mathcal{L}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x}_i), \boldsymbol{y}_i\big)$$

In the semi-supervised setting, additional unlabeled images $\mathcal{X}_u$ can be used to improve the generalization of the model with the following objective:

$$\mathcal{R}^{\text{SSL}}(f_{\boldsymbol{\theta}}) = \mathcal{R}^{\text{SL}}(f_{\boldsymbol{\theta}}) + \sum_{(\boldsymbol{x}_i,\hat{\boldsymbol{y}}_i) \in (\mathcal{X}_u, \widehat{\mathcal{Y}}_u)} \mathcal{L}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x}_i), \hat{\boldsymbol{y}}_i\big)$$

To solve a semi-supervised learning problem, a typical self-training method reduces it to supervised learning by iteratively estimating and selecting confident pseudo-labeled samples to expand the labeled set. Its training pipeline can be described as follows: 1) train a model on the labeled set with supervised learning; 2) use the trained model to estimate pseudo-labels for the images in the unlabeled set; 3) select confident pseudo-labeled samples based on a confidence threshold and add them to the labeled set; 4) repeat steps 1 to 3 until no more data can be added. The objective function at step 1 can be written as

$$\mathcal{R}^{\text{ST}}(f_{\boldsymbol{\theta}}) = \mathcal{R}^{\text{SL}}(f_{\boldsymbol{\theta}}) + \sum_{(\boldsymbol{x}_i,\hat{\boldsymbol{y}}_i) \in \mathcal{A}} \mathcal{L}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x}_i), \hat{\boldsymbol{y}}_i\big), \qquad \mathcal{A} \subseteq (\mathcal{X}_u, \widehat{\mathcal{Y}}_u),$$

where $\mathcal{A}$ denotes the set of confident pseudo-labeled samples selected so far.
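
For concreteness, the selection-based pipeline above can be sketched as follows. This is a minimal illustration, assuming hypothetical helpers `train_fn` (supervised training), `predict_fn` (returns landmarks plus a confidence score), and in-memory lists of samples; it is not the interface of any particular codebase.

```python
def selection_based_self_training(train_fn, predict_fn, labeled, unlabeled,
                                  tau=0.9, max_rounds=5):
    """Classic self-training with confidence-based pseudo-label selection (steps 1-4)."""
    model = None
    for _ in range(max_rounds):
        model = train_fn(labeled)                    # step 1: supervised training on the labeled set
        remaining, added = [], 0
        for image in unlabeled:
            landmarks, confidence = predict_fn(model, image)   # step 2: estimate pseudo-labels
            if confidence >= tau:                    # step 3: keep confident samples only
                labeled.append((image, landmarks))
                added += 1
            else:
                remaining.append(image)
        unlabeled = remaining
        if added == 0:                               # step 4: stop when nothing new is added
            break
    return model
```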

As we can see from the pipeline, self-training may reinforce the model's mistakes as it attempts to teach itself (i.e., confirmation bias). Step 3 plays an important role in easing this issue by retaining only confident pseudo-labels, which are less likely to contain mistakes. Although selecting confident pseudo-labels mitigates confirmation bias, it may introduce data bias. As shown in Figure 2, the label distribution of confident pseudo-labels differs significantly from that of the GT. In other words, selecting confident pseudo-labels reduces the errors from noisy pseudo-labels but also introduces data bias due to the biased selection. Additionally, it is difficult to choose a proper threshold for sample selection, especially for a fine-grained localization task that is more sensitive to noisy labels. Moreover, coordinate regression-based methods usually output predictions without confidence scores, which makes classic self-training inapplicable. Thus, we pose the natural question: Is pseudo-label selection a necessary component of self-training?
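
As a rough illustration of the data-bias measurement behind Figure 2, one could compare label density maps of different data groups with KL divergence, e.g. as below; the binning and normalization are our own choices for illustration, not necessarily those used in the paper.

```python
import numpy as np

def label_density_map(landmarks, bins=64, eps=1e-8):
    """2D histogram of landmark coordinates (normalized to [0, 1]) as a label density map."""
    xy = np.asarray(landmarks).reshape(-1, 2)
    hist, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins, range=[[0, 1], [0, 1]])
    hist = hist + eps                                # avoid log(0) in the KL computation
    return hist / hist.sum()

def kl_divergence(p, q):
    """KL(p || q) between two density maps of the same shape."""
    return float(np.sum(p * np.log(p / q)))

# e.g., kl_divergence(label_density_map(unlabeled_gt), label_density_map(confident_pseudo))
```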

Figure 2. (a) Visualized label density maps of four data groups from 300W. (b) The KL divergence between the unlabeled GT and three data groups: 1) labeled data, 2) confident pseudo-labels, and 3) all pseudo-labels, calculated based on the label density maps over 300W.

To answer the question above, we first reformulate self-training as a selection-free method:

$$\boldsymbol{\theta}_{\text{STLD}} = \arg\min_{\boldsymbol{\theta}} \sum_{(\boldsymbol{x}_i,\boldsymbol{y}_i) \in (\mathcal{X}_l, \mathcal{Y}_l)} \mathcal{L}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x}_i), \boldsymbol{y}_i\big) + \sum_{(\boldsymbol{x}_i,\hat{\boldsymbol{y}}_i) \in (\mathcal{X}_u, \widehat{\mathcal{Y}}_u)} \mathcal{L}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x}_i), \hat{\boldsymbol{y}}_i\big),$$

where the model is trained on both labeled and pseudo-labeled data without explicit pseudo-label selection.
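
In code, the selection-free objective simply sums the same loss over a labeled batch and a pseudo-labeled batch, with no confidence filtering. Below is a minimal PyTorch sketch; the batch layout and the L1 loss are illustrative assumptions rather than the paper's exact setup.

```python
import torch
from torch import nn

def selection_free_loss(model, labeled_batch, pseudo_batch, loss_fn=nn.L1Loss()):
    """Sum of the supervised loss on GT labels and the same loss on all pseudo-labels."""
    (x_l, y_l), (x_u, y_hat_u) = labeled_batch, pseudo_batch
    return loss_fn(model(x_l), y_l) + loss_fn(model(x_u), y_hat_u)
```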

We attribute the success of traditional self-training to its data-level curriculum, where confident pseudo-labels are utilized first, to ease the confirmation bias. Therefore, we believe the key to self-training is to utilize reliable information first, while obtaining such information is not necessarily restricted to the selection of confident pseudo-labels. Instead, we could construct a task curriculum, where confident tasks are conducted first on pseudo-labels to provide reliable information. To this end, we propose pseudo pretraining and shrink regression as two essential components of the curriculum, which will be detailed below.

Method

A. Pseudo Pretraining for Better Initialization

During the pseudo pretraining stage, i.e., the simplest task of the curriculum, the model is not asked to fit the pseudo-labels directly, but rather to obtain a better initialization from them. To this end, the model, initialized from ImageNet pretrained weights, is trained on all the pseudo-labeled data with the following objective:

$$\boldsymbol{\theta}_{\text{pre}} = \arg\min_{\boldsymbol{\theta}} \sum_{(\boldsymbol{x}_i,\hat{\boldsymbol{y}}_i) \in (\mathcal{X}_u, \widehat{\mathcal{Y}}_u)} \mathcal{L}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x}_i), \hat{\boldsymbol{y}}_i\big)$$

After obtaining a better initialization with pseudo pretraining, the model continues with the second training stage, which will be described later. Despite its simplicity, pseudo pretraining is able to leverage all the pseudo-labels while handling their noise adaptively. Our experimental analysis verifies this: the noisier pseudo-labels learned during the pseudo pretraining stage are more likely to be forgotten after the second training stage. Intuitively, when the model learns incorrect information from the pseudo-labels at the pretraining stage, it gradually forgets it during the second training stage because that information is inconsistent with the ‘correct’ labels.
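
A minimal sketch of the two training stages is given below, assuming a coordinate-regression head on a ResNet-18 backbone and data loaders `pseudo_loader` / `labeled_loader` that yield (image, landmarks) batches; the hyperparameters and the L1 loss are placeholders, not the paper's exact configuration.

```python
import torch
from torch import nn, optim
import torchvision

def make_model(num_points=68):
    """ResNet-18 with ImageNet weights and a simple coordinate-regression head."""
    net = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    net.fc = nn.Linear(net.fc.in_features, num_points * 2)
    return net

def run_stage(model, loader, epochs, lr=1e-4, loss_fn=nn.L1Loss()):
    opt = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, targets in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), targets.flatten(1))
            loss.backward()
            opt.step()
    return model

# Stage 1: pseudo pretraining on all pseudo-labeled data (the objective above).
# Stage 2: continue from the same weights on the labeled set; no re-initialization in between.
# model = make_model()
# model = run_stage(model, pseudo_loader, epochs=10)
# model = run_stage(model, labeled_loader, epochs=60)
```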

B. Shrink Regression for Noise Resistance

Although pseudo pretraining handles noisy samples well, it does not fully leverage the pseudo-labels, as they are only used for model initialization. To achieve better performance, it is necessary to use pseudo-labels directly for model training while taking care of the noise. To this end, we first inspect the characteristics of noisy labels by visualizing the offsets of pseudo-labels relative to the GTs in 2D histograms, as shown in Figure 3. The figure indicates that pseudo-labels are distributed around the GTs for both heatmap and coordinate regression across different labeled ratios. For pseudo-labels that are relatively far from the GTs, if the model fits them directly at the standard granularity, it learns the noisy information introduced by their deviation from the GTs. On the other hand, if a coarse granularity is used for regression, the model fits a broader region centered on the pseudo-labels, which is more likely to cover the GTs. Accordingly, we argue that the standard regression task on pseudo-labels can be made more confident by reducing the granularity of the regression targets. In other words, it sacrifices the precision of regression for the reliability of the learning targets. Following this observation, we propose shrink regression, which starts with coarse regression and then progressively increases the granularity until it reaches the level of standard regression. Different instantiations of shrink regression for heatmap and coordinate regression are detailed below.

Figure 3. 2D histograms of the offsets of pseudo-labels relative to GTs, trained on 300W with different labeled ratios. We analyzed both heatmap ((a)-(b)) and coordinate ((c)-(d)) models.

  • Heatmap Regression

    When shrink regression is applied, we adjust the granularity of heatmap regression by adjusting its training targets. Specifically, heatmap regression uses heatmaps as training proxies and maps the label $\boldsymbol{y}$ to 2D Gaussians centered at the landmark coordinates with a given standard deviation. By increasing the standard deviation, the positive areas in the heatmap are enlarged, making the task coarser and more tolerant to small prediction errors (a sketch is given after this list). Therefore, shrink regression for heatmap regression can be formulated as

    $$\mathcal{L}^{\text{SR}}_{\text{heat}}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x}), \boldsymbol{y}, t\big) = \mathcal{L}_{\text{heat}}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x}), \boldsymbol{H}(\boldsymbol{y}; \sigma_t)\big),$$

    where $\boldsymbol{H}(\boldsymbol{y}; \sigma_t)$ denotes the target heatmaps generated from $\boldsymbol{y}$ with standard deviation $\sigma_t$, which shrinks back to the standard value as the stage index $t$ increases.

  • Coordinate Regression

    Since coordinate regression is optimized end-to-end w.r.t. the coordinates, we adjust its granularity via the loss function. In supervised learning, the L1 loss is widely used for coordinate-based models since it has been shown to be superior to the L2 loss. Formally, L1 and L2 are defined as

    $$\mathcal{L}_1(\hat{\boldsymbol{y}}, \boldsymbol{y}) = |\hat{\boldsymbol{y}} - \boldsymbol{y}|, \qquad \mathcal{L}_2(\hat{\boldsymbol{y}}, \boldsymbol{y}) = (\hat{\boldsymbol{y}} - \boldsymbol{y})^2.$$

    We further write the gradients of the two losses w.r.t. $\boldsymbol{\theta}$ as

    $$\frac{\partial \mathcal{L}_1(\hat{\boldsymbol{y}}, \boldsymbol{y})}{\partial \boldsymbol{\theta}} = \pm \nabla_{\boldsymbol{\theta}} \hat{\boldsymbol{y}}, \qquad \frac{\partial \mathcal{L}_2(\hat{\boldsymbol{y}}, \boldsymbol{y})}{\partial \boldsymbol{\theta}} = 2(\hat{\boldsymbol{y}} - \boldsymbol{y}) \nabla_{\boldsymbol{\theta}} \hat{\boldsymbol{y}}.$$

    From the above equations, we can see that in the L2 loss the gradient weight becomes small when the prediction error becomes small, while the weight in the L1 loss always equals 1. This may explain why L1 is preferred to L2 for supervised learning, as the former provides more precise results by focusing more on small errors. However, we argue that L2 is more robust to noisy labels (for errors within $[-1, 1]$) because it downweights the gradients of small errors. That is to say, the L2 loss has a coarser granularity than L1. To make the granularity adjustable, we first generalize the loss to an Lp loss, formulated as

    $$\mathcal{L}_p(\hat{\boldsymbol{y}}, \boldsymbol{y}) = |\hat{\boldsymbol{y}} - \boldsymbol{y}|^p.$$

    Figure 4. Visualization of shrink loss and its gradient weight with different power p. (a) The loss function. (b) The gradient weight.

    Figure 4 visualizes the Lp loss and its gradient weights with varying p. As can be seen, a larger p yields smaller gradient weights on small prediction errors; it is thus coarser and more tolerant to small errors. With the Lp loss, we obtain the shrink loss by shrinking p over the stage index t (a sketch is given after this list):

    $$\mathcal{L}^{\text{SR}}_{\text{coord}}(\hat{\boldsymbol{y}}, \boldsymbol{y}, t) = |\hat{\boldsymbol{y}} - \boldsymbol{y}|^{p_t},$$

    where $p_t$ starts from a coarse (larger) value and decreases to 1, i.e., the standard L1 loss, as the stage index $t$ increases.

    The formulation of the shrink loss is also supported by our analysis of gradient correlations between the pseudo-labels and GTs, showing that the gradients of small errors from pseudo-labels are indeed noisier.
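
For heatmap regression, the shrink can be realized simply by enlarging the standard deviation used to render the target Gaussians for pseudo-labels and shrinking it back over stages. The sketch below illustrates this; the grid size and the sigma schedule are illustrative assumptions, not values from the paper.

```python
import torch

def gaussian_heatmaps(coords, size=64, sigma=2.0):
    """Render one 2D Gaussian per landmark; a larger sigma gives coarser, more tolerant targets."""
    ys = torch.arange(size, dtype=torch.float32).view(size, 1)
    xs = torch.arange(size, dtype=torch.float32).view(1, size)
    maps = [torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
            for cx, cy in coords]
    return torch.stack(maps)                         # (num_points, size, size)

# Illustrative shrink schedule over stages t = 0, 1, 2 for pseudo-labeled targets:
# sigmas = [4.0, 3.0, 2.0]
# target = gaussian_heatmaps(pseudo_coords, sigma=sigmas[t])
```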
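
For coordinate regression, the shrink loss amounts to an Lp loss whose power is scheduled from a coarse value down to 1 (standard L1). The sketch below shows one possible implementation; the linear schedule and its endpoints are our own assumptions.

```python
import torch

def lp_loss(pred, target, p=1.0, eps=1e-6):
    """|pred - target|^p, averaged; a larger p downweights gradients of small errors."""
    return ((pred - target).abs() + eps).pow(p).mean()

def shrink_power(t, num_stages, p_max=2.0):
    """Illustrative schedule: shrink p from p_max toward 1 (standard L1) over the stages."""
    return 1.0 + (p_max - 1.0) * (1.0 - t / max(num_stages - 1, 1))

# On a pseudo-labeled batch at stage t:
# loss = lp_loss(model(images), pseudo_targets, p=shrink_power(t, num_stages=3))
```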

C. Training Pipeline

Similar to selection-based self-training, STLD alternately trains the model and estimates pseudo-labels for unlabeled images, but it differs in how the pseudo-labels are utilized. Instead of adding confident pseudo-labels to the labeled set, STLD utilizes all of them via a task curriculum. Specifically, each round of STLD consists of two stages: the first stage conducts pseudo pretraining on all the pseudo-labeled data, and the second stage directly leverages the labeled set via source-aware mixed training. The mixed training in the first round only conducts standard regression on GTs; from the second round on, shrink regression is added to the mixed training to leverage the pseudo-labels.
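
Putting the pieces together, the rounds of STLD can be summarized as below. Here `init_train`, `pseudo_label`, `pretrain`, and `mixed_train` are stand-ins for the routines described in this section rather than an actual API.

```python
def stld(init_train, pseudo_label, pretrain, mixed_train, labeled, unlabeled, rounds=3):
    """Selection-free self-training: every round uses all pseudo-labels via the task curriculum."""
    model = init_train(labeled)                        # supervised training on the labeled set
    for r in range(1, rounds + 1):
        pseudo = pseudo_label(model, unlabeled)        # estimate pseudo-labels for all unlabeled images
        model = pretrain(pseudo)                       # stage 1: pseudo pretraining on all pseudo-labels
        shrink_set = pseudo if r >= 2 else None        # shrink regression joins from round 2 onward
        model = mixed_train(model, labeled, shrink_set)  # stage 2: source-aware mixed training
    return model
```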

Experiments

Datasets:

  • Facial landmark: 300W, WFLW, AFLW
  • Medical landmark: Hand

Facial landmark data with semi-supervised learning

Based on Table 1, STLD obtains the best results on most entries. Specifically, STLD-HM-R18 and STLD-TF-R18 both outperform their ResNet-18 counterparts on almost all entries, demonstrating the effectiveness of the proposed method. When a heavier backbone such as HRNet is used, the performance of STLD is further boosted. On WFLW and AFLW, STLD-HM-HRNet achieves the best results with 5% labeled data, while STLD-TF-HRNet performs better at higher labeled ratios, thanks to the representation capability of transformers.

Table 1. Comparison with existing semi-supervised landmark detection methods on 300W, WFLW, and AFLW. Results are in NME (%).

Medical landmark data with omni-supervised learning

Compared to facial landmark detection, medical landmark detection benefits more from omni-supervised learning because its labeled data is usually limited by high labeling cost. Therefore, we compare the omni-learning results with existing supervised methods in Table 2 to see the improvement from using extra unlabeled images. As can be seen, without extra unlabeled data, HM and TF obtain MREs of 0.68 and 0.69 mm, respectively, both slightly worse than SCN (0.66). By leveraging 12K extra unlabeled images from RSNA, the results of HM and TF are boosted to 0.64 and 0.61, respectively, where the latter outperforms SCN by a large margin (i.e., a 7.5% improvement). These results show that omni-supervised learning is a promising way to boost medical landmark detection models when labeled data is insufficient.

Table 2. Comparison with existing supervised learning methods on Hand in the omni setting.

Model Analysis

  • How pseudo pretraining works

    One may worry that the noisy samples learned during pseudo pretraining would be memorized by the model and thus lead to confirmation bias. To understand how pseudo pretraining works, we perform an analysis of example forgetting by computing the difference between the pseudo-labels used for training and the model's predictions after the two-stage learning. Note that shrink regression is not used in the second stage here, as we want to focus on pseudo pretraining. For comparison, we also run the analysis for the naive baseline. Figure 5 shows the results for both HM (top) and TF (bottom) with different labeled ratios, where the pseudo-labels are grouped by noise level. First, we can see that the naive baseline forgets noisier samples more, which is consistent with the findings of previous works. As expected, the model with the two-stage training does not memorize the pseudo-labels as much as the naive baseline does, because the pseudo-labels are not used in the second stage of the proposed method. Interestingly, the forgetting of noisier samples is enlarged compared to the naive baseline (i.e., the pink area), indicating that the two-stage design with pseudo pretraining handles noisy pseudo-labels implicitly. This characteristic helps pseudo pretraining effectively utilize the task-specific knowledge in the pseudo-labels while avoiding the influence of noisy samples.
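
For reference, the forgetting measurement can be approximated as follows, where pseudo-labels, GTs, and final predictions are arrays of shape (N, num_points, 2); the per-sample error metric and the quantile-based grouping are our own simplifications of the analysis described above.

```python
import numpy as np

def forgetting_by_noise(pseudo, gt, final_pred, num_groups=5):
    """Average drift of final predictions away from the trained-on pseudo-labels, per noise group."""
    noise = np.linalg.norm(pseudo - gt, axis=-1).mean(-1)           # pseudo-label error per sample
    drift = np.linalg.norm(final_pred - pseudo, axis=-1).mean(-1)   # how much the model 'forgot'
    edges = np.quantile(noise, np.linspace(0, 1, num_groups + 1))
    groups = np.digitize(noise, edges[1:-1])
    return [float(drift[groups == g].mean()) for g in range(num_groups)]
```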

Figure 5. Analysis of example forgetting by comparing the naive method and pseudo pretraining on prediction difference of pseudo-labels. (top) Heatmap-based model with 5% and 10% labeled data, respectively. (bottom) Transformer-based model with 5% and 10% labeled data, respectively.

  • How shrink loss works

    While shrink regression for heatmap-based models is straightforward, it is less obvious how the shrink loss applied to coordinate-based models adjusts task confidence. To verify this, we perform a correlation analysis of the gradients between GTs and pseudo-labels. We record the gradients of the last layer when training with GTs and pseudo-labels, respectively, and group them by pseudo-label noise. For each group, we compute the Pearson correlation of the gradients between GTs and pseudo-labels. Figures 6a and 6b show the results under the 5% and 10% labeled settings of 300W, respectively. We can see that the correlation decreases as the loss scale decreases, which indicates that the gradients of small errors trained on pseudo-labels are indeed noisier and less reliable. Therefore, it is reasonable to increase the confidence of the regression task by downweighting the gradients of small prediction errors.
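
This measurement can be sketched as follows: compute the last-layer gradient once with the GT and once with the pseudo-label for the same image, then take their Pearson correlation. The helper names, the choice of layer, and the L1 loss are assumptions for illustration only.

```python
import numpy as np
import torch

def last_layer_grad(model, last_layer, image, target, loss_fn=torch.nn.L1Loss()):
    """Gradient of the loss w.r.t. the last layer's weights, flattened to a vector."""
    model.zero_grad()
    loss_fn(model(image), target).backward()
    return last_layer.weight.grad.detach().flatten().cpu().numpy()

def grad_correlation(model, last_layer, image, gt, pseudo):
    g_gt = last_layer_grad(model, last_layer, image, gt)
    g_pl = last_layer_grad(model, last_layer, image, pseudo)
    return float(np.corrcoef(g_gt, g_pl)[0, 1])      # Pearson correlation of the two gradients
```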

Figure 6. Pearson correlation coefficient of the gradients between pseudo-labels and GTs of TF model, trained on 300W. (a) 5% labeled. (b) 10% labeled.

  • Qualitative results

    Lastly, we show the superiority of STLD via qualitative results. We first show its advantage in pseudo-label refinement in Figure 7a using 300W. The results of the naive baseline (left) and CL (middle) are also given for comparison. For each image, we plot the initial (red dots) and the last-round pseudo-labels (blue dots) to show the improvement. We can see that CL generally performs better than the naive method, and STLD is the best of the three. Specifically, in the first row, the naive method and CL barely change the pseudo-labels inside the rectangle area due to confirmation bias, while STLD largely refines them and thus reduces the NME from 7.38 to 4.45. In the second row, both the naive method and CL refine the pseudo-labels, but the quality is unsatisfactory; in contrast, our method obtains high-quality pseudo-labels in the last round, as evidenced by the visualized predictions and the significant improvement in NME (from 8.03 to 4.86). To summarize, the baseline and CL are not able to obtain high-quality pseudo-labels due to either confirmation bias (rows 1 and 3) or unsatisfactory estimates (rows 2 and 4). In contrast, the proposed STLD handles confirmation bias properly through the task curriculum, which further boosts the quality of the predicted pseudo-labels. We then show the advantage of STLD in accurate prediction in Figure 7b using Hand test samples. SCN and the HM baseline are included as references. We plot ground-truths (green dots), model predictions (red dots), and the distances (blue lines) between them for easy comparison. The figures show that STLD consistently makes more accurate predictions than the two counterparts, which validates the promise of applying STLD with omni-supervised learning to boost medical landmark detection.

Figure 7. (a) Visualized unlabeled samples from 300W, with pseudo-labels from the first (red dots) and the last self-training round (blue dots). The naive method (left), CL (middle), and STLD (right) all use the same base model HM, trained with 1.6% labeled data. (b) Visualized test samples from Hand, with ground-truths (green dots), predictions (red dots), and the distances between them (blue lines). SCN (left) and the baseline HM (middle) are trained in supervised learning while STLD (right) is trained in omni-supervised learning.

Conclusion

In this work, we proposed STLD, a two-stage self-training method for semi-supervised landmark detection. Without explicit pseudo-label selection, the method is still able to leverage pseudo-labels effectively by constructing a task curriculum that transitions from more confident to less confident tasks. Compared to selection-based self-training, the advantages of STLD are three-fold: 1) it does not introduce the data bias caused by sample selection; 2) it avoids choosing a confidence threshold, which can be sensitive; 3) it is applicable to coordinate regression methods that do not output confidence scores. Experiments on four landmark detection benchmarks have demonstrated the effectiveness of the method in both semi- and omni-supervised settings. We highlight that the contribution of this paper is not limited to STLD itself, but also includes a new way of applying self-training. It would be interesting to see more instantiations of the task curriculum that work for landmark detection and beyond (e.g., object detection and semantic segmentation), which we leave for future work.

References

H. Jin, H. Che, and H. Chen. Rethinking Self-training for Semi-supervised Landmark Detection: A Selection-free Approach. IEEE Transactions on Image Processing, doi: 10.1109/TIP.2024.3451937.