1 Introduction
Modern face recognition systems mainly rely on high-capacity deep neural networks coupled with massive annotated data to learn effective face representations
[26, 14, 21, 29, 11, 3, 32]. From CelebFaces [25] to MegaFace [13] and MS-Celeb-1M [9], face databases of ever larger scale have been collected and labeled. Though impressive results have been achieved, we are now trapped in a dilemma where hundreds of thousands of manual labeling hours are consumed behind each percentage point of accuracy gain. To make things worse, it becomes harder and harder to scale the current annotation sizes up to even more identities. In reality, nearly all existing large-scale face databases suffer from a certain level of annotation noise [5], which leads us to question how reliable human annotation really is.

To alleviate the aforementioned challenges, we shift the focus from obtaining more manual labels to leveraging more unlabeled data. Unlike large-scale identity annotations, unlabeled face images are extremely easy to obtain. For example, a web crawler facilitated by an off-the-shelf face detector can produce abundant in-the-wild face images or videos [24]. The critical question then becomes how to leverage this huge amount of existing unlabeled data to boost the performance of large-scale face recognition. The problem is reminiscent of conventional semi-supervised learning (SSL) [34], but differs from SSL significantly in two aspects. First, the unlabeled data are collected from unconstrained environments, where pose, illumination, and occlusion variations are extremely large; it is non-trivial to reliably compute the similarity between unlabeled samples in this in-the-wild scenario. Second, there is usually no identity overlap between the collected unlabeled data and the existing labeled data, so the popular label propagation paradigm [35] is no longer feasible here.

In this work, we study this challenging yet meaningful semi-supervised face recognition problem, which can be formally described as follows. In addition to some labeled data with known face identities, we also have access to a massive number of in-the-wild unlabeled samples whose identities are exclusive from the labeled ones. Our goal is to maximize the utility of the unlabeled data so that the final performance closely matches the performance obtained when all the samples are labeled. One key insight is that although unlabeled data do not provide us with straightforward semantic classes, their inner structure, which can be represented by a graph, reflects the distribution of high-dimensional face representations. The idea of using a graph to reflect structure is also adopted in cross-task tuning [31]. With the graph, we can sample instances and their relations to establish an auxiliary loss for training our model.
Finding a reliable inner structure in noisy face data is non-trivial. It is well-known that the representation induced by a single model is usually prone to bias and sensitive to noise. To address this challenge, we take a bottom-up approach that constructs the graph by first identifying positive pairs reliably. Specifically, we propose a novel Consensus-Driven Propagation (CDP) approach for graph construction on massive unlabeled data (project page: http://mmlab.ie.cuhk.edu.hk/projects/CDP/; code: https://github.com/XiaohangZhan/cdp/). It consists of two modules: a “committee” that provides multi-view information on each proposal pair, and a “mediator” that aggregates all the information for a final decision.
The “committee” module is inspired by query-by-committee (QBC) [22], which was originally proposed for active learning. Unlike QBC, which measures disagreement, we collect consent from a committee comprising a base model and several auxiliary models. The heterogeneity of the committee reveals different views on the structure of the unlabeled data. Positive pairs are then selected as the pairs that the committee members most agree upon, rather than those the base model is most confident about. Hence, the committee module is capable of selecting meaningful and hard positive pairs from the unlabeled data, not just easy ones, complementing the model trained on labeled data alone. Going beyond the simple voting scheme practiced by most QBC methods, we formulate a novel and more effective “mediator” to aggregate opinions from the committee. The mediator is a binary classifier that makes the final decision on whether to select a pair. We carefully design the inputs to the mediator so that they cover distributional information about the inner structure. The inputs include 1) the voting results of the committee, 2) the similarity between the pair, and 3) the local density around the pair. The last two inputs are measured across all committee members and the base model. Thanks to the “committee” and “mediator” modules, we construct a robust consensus-driven graph on the unlabeled data. Finally, we propagate pseudo labels on the graph to form an auxiliary task for training our base model with unlabeled data.
To summarize, we investigate the usage of massive unlabeled data (over 6M images) for large-scale face recognition. Our setting closely resembles real-world scenarios, where the unlabeled data are collected from unconstrained environments and their identities are exclusive from the labeled ones. We propose Consensus-Driven Propagation (CDP) to tackle this challenging problem with two carefully-designed modules, the “committee” and the “mediator”, which select positive face pairs robustly by aggregating multi-view information. We show that a wise usage of unlabeled data can complement scarce manual labels to achieve compelling results. With consensus-driven propagation, we achieve results comparable to the fully-supervised counterpart while using only 9% of the labels.
2 Related Work
Semi-supervised Face Recognition. Semi-supervised learning [34, 4] is proposed to leverage large-scale unlabeled data given a handful of labeled data. It typically aims at propagating labels from the limited labeled set to the whole dataset in various ways, including self-training [30, 19], co-training [2, 16], multi-view learning [20], expectation-maximization [6], and graph-based methods [36]. For face recognition, Roli and Marcialis [18] adopt a self-training strategy with PCA-based classifiers, in which the labels of unlabeled data are inferred with an initial classifier and added to augment the labeled dataset. Zhao et al. [33] employ Linear Discriminant Analysis (LDA) as the classifier and similarly use self-training to infer labels. Gao et al. [8] propose a semi-supervised sparse-representation-based method to handle the few-shot learning problem in which labeled examples are corrupted by nuisance variables such as bad lighting or glasses. All the aforementioned methods assume that the set of categories is shared between labeled and unlabeled data. However, as mentioned before, this assumption is impractical when the number of face identities becomes massive.

Query-by-Committee. Query-by-committee (QBC) [22] is a strategy that relies on multiple discriminant models to explore disagreements, thereby mining meaningful examples for machine learning tasks. Argamon-Engelson and Dagan [1] extend the QBC paradigm to the context of probabilistic classification and apply it to natural language processing tasks. Loy et al. [15] extend QBC to discover unknown classes via a framework for joint exploration-exploitation active learning. These previous works make use of the disagreements of the committee for threshold-free selection. In contrast, we exploit the consensus of the committee and extend it to the semi-supervised learning scenario.

3 Methodology
We first provide an overview of the proposed approach. Our approach consists of three stages:
1) Supervised initialization. Given a small portion of labeled data $\mathcal{D}_l$, we separately train the base model $B$ and the committee members $\{C_i \mid i = 1, \dots, N\}$ in a fully-supervised manner. More precisely, the base model and all the committee members learn a mapping from image space to feature space using the labeled data. For the base model this can be denoted as $\mathcal{F}_B: \mathcal{D}_l \mapsto \mathcal{Z}_B$, and for the committee members as $\mathcal{F}_{C_i}: \mathcal{D}_l \mapsto \mathcal{Z}_{C_i}$, $i \in \{1, \dots, N\}$.
2) Consensus-driven propagation. CDP is applied to the unlabeled data to select valuable samples and conjecture labels for them. The framework is shown in Fig. 1
. We use the trained models from the first stage to extract deep features from the unlabeled data and create kNN graphs. The “committee” ensures the diversity of the graphs. Then a “mediator” network is designed to aggregate the diverse opinions in the local structures of the kNN graphs and select meaningful pairs. With the selected pairs, a consensus-driven graph is created on the unlabeled data, and its nodes are assigned pseudo labels via our label propagation algorithm.
3) Joint training using labeled and unlabeled data. Finally, we retrain the base model with the labeled data and the pseudo-labeled unlabeled data in a multi-task learning framework.
3.1 ConsensusDriven Propagation
In this section, we formally introduce the detailed steps of CDP.
i. Building kNN Graphs. We feed the unlabeled data $\mathcal{D}_u$ to the base model and all the committee members and extract the deep features $\mathcal{F}_B(\mathcal{D}_u)$ and $\mathcal{F}_{C_i}(\mathcal{D}_u)$, $i \in \{1, \dots, N\}$. With these features, we find the $k$ nearest neighbors of each sample in $\mathcal{D}_u$ by cosine similarity. This results in $N + 1$ different versions of the kNN graph: $\mathcal{G}_B$ for the base model and $\mathcal{G}_{C_i}$ for each committee member. The nodes in the graphs are samples of the unlabeled data. Each edge in a kNN graph defines a pair, and all the pairs from the base model’s graph form the candidates for the subsequent selection, as shown in Fig. 1.

ii. Collecting Opinions from the Committee. Committee members map the unlabeled data to the feature space via the different mapping functions $\{\mathcal{F}_{C_i}\}$. Consider two arbitrary nodes $n_0$ and $n_1$ connected in the graph created by the base model; they are represented by the different versions of deep features $\mathcal{F}_{C_i}(n_0)$ and $\mathcal{F}_{C_i}(n_1)$. The committee provides the following factors:
1) The relationship $R$ between the two nodes. Intuitively, it can be understood as whether the two nodes are neighbors in the view of each committee member:

$$R_{C_i}(n_0, n_1) = \begin{cases} 1, & (n_0, n_1) \in \mathcal{E}(\mathcal{G}_{C_i}) \\ 0, & \text{otherwise,} \end{cases} \quad i \in \{1, \dots, N\}, \qquad (1)$$

where $\mathcal{G}_{C_i}$ is the kNN graph of the $i$-th committee model and $\mathcal{E}(\cdot)$ denotes the set of edges of a graph.
2) The affinity $A$ between the two nodes. It can be computed as the similarity measured in the feature space with the mapping functions defined by the committee members. Assuming cosine similarity as the metric,

$$A_{C_i}(n_0, n_1) = \cos\langle \mathcal{F}_{C_i}(n_0), \mathcal{F}_{C_i}(n_1) \rangle, \quad i \in \{1, \dots, N\}. \qquad (2)$$
3) The local structure w.r.t. each node. This notion refers to the distribution of a node’s first-order, second-order, and even higher-order neighbors, among which the first-order neighbors play the most important role in representing the “local structure” of a node. Such a distribution can be approximated by the distribution of similarities between a node $x$ and each of its first-order neighbors $x_k \in \mathcal{N}(x)$:

$$D_{C_i}(x) = \left\{ \cos\langle \mathcal{F}_{C_i}(x), \mathcal{F}_{C_i}(x_k) \rangle, \ \forall x_k \in \mathcal{N}(x) \right\}, \quad i \in \{1, \dots, N\}. \qquad (3)$$
As illustrated in Fig. 2, given a pair of nodes extracted from the base model’s graph, the committee members provide diverse opinions on the relationship, the affinity, and the local structures, owing to their inherent heterogeneity. From these diverse opinions, we seek a consensus through a mediator in the next step.
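As a concrete sketch of steps i and ii, the snippet below builds a cosine-similarity kNN graph and collects the three committee factors for a candidate pair. It is illustrative only: the function and variable names are ours, not from the paper, and features are assumed to be L2-normalized so that dot products equal cosine similarities.

```python
import numpy as np

def knn_graph(feats, k):
    """Directed kNN edge set over L2-normalized features (cosine similarity)."""
    sims = feats @ feats.T                    # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)           # a node is not its own neighbor
    nbrs = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest neighbors
    return {(i, int(j)) for i in range(len(feats)) for j in nbrs[i]}

def committee_opinions(n0, n1, committee_feats, committee_graphs):
    """Relationship R, affinity A, and neighbor-similarity sets D for a pair."""
    # Eq. (1): is (n0, n1) an edge in each member's graph?
    R = [int((n0, n1) in G or (n1, n0) in G) for G in committee_graphs]
    # Eq. (2): cosine affinity in each member's feature space
    A = [float(F[n0] @ F[n1]) for F in committee_feats]
    # Eq. (3): similarities between each node and its graph neighbors
    D = [{x: [float(F[x] @ F[m]) for (i, m) in G if i == x] for x in (n0, n1)}
         for F, G in zip(committee_feats, committee_graphs)]
    return R, A, D
```

In practice the kNN search over millions of samples would use an approximate-nearest-neighbor library rather than the dense similarity matrix shown here.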
iii. Aggregating Opinions via the Mediator. The role of the mediator is to aggregate and convey the committee members’ opinions for pair selection. We formulate the mediator as a Multi-Layer Perceptron (MLP) classifier, albeit other types of classifiers are applicable. Recall that all pairs extracted from the base model’s graph constitute the candidates. The mediator shall re-weight the opinions of the committee members and make a final decision by assigning each pair a probability indicating whether it shares the same identity, i.e., is positive, or has different identities, i.e., is negative.

The input to the mediator for each pair $(n_0, n_1)$ is a concatenated vector containing three parts:

1) the “relationship vector” $I_R \in \mathbb{R}^N$, composed of $R_{C_i}(n_0, n_1)$ from the committee;

2) the “affinity vector” $I_A \in \mathbb{R}^{N+1}$, composed of the affinities from both the base model and the committee;

3) the “neighbor distribution vectors”, comprising the “mean vector” $I_{D_{mean}} \in \mathbb{R}^{2(N+1)}$ and the “variance vector” $I_{D_{var}} \in \mathbb{R}^{2(N+1)}$,

$$I_{D_{mean}} = \big(\dots, E(D(x)), \dots\big), \qquad I_{D_{var}} = \big(\dots, \sigma^2(D(x)), \dots\big), \quad x \in \{n_0, n_1\}, \qquad (4)$$

from both the base model and the committee, for each node. This results in $6N + 5$ dimensions for the input vector. The mediator is trained on $\mathcal{D}_l$, and the objective is to minimize the corresponding cross-entropy loss function. For testing, pairs from $\mathcal{D}_u$ are fed into the mediator, and those with a high probability of being positive are collected. Since most of the positive pairs are redundant, we set a high threshold for selecting pairs, thus sacrificing recall to obtain positive pairs with high precision.

iv. Pseudo-Label Propagation. The pairs selected by the mediator in the previous step compose a “consensus-driven graph”, whose edges are weighted by the pairs’ probability of being positive. Note that the graph does not need to be connected. Unlike conventional label propagation algorithms, we do not assume any labeled nodes on the graph. To prepare for the subsequent model training, we propagate pseudo labels based on the connectivity of nodes, using a simple yet effective algorithm that identifies connected components. At first, we find the connected components based on the current edges in the graph and add them to a queue. For each identified component, if its number of nodes is larger than a predefined value, we eliminate low-score edges in the component, find the connected components of what remains, and add the new disjoint components to the queue. If the number of nodes in a component is below the predefined value, we annotate all nodes in the component with a new pseudo label. We iterate this process until the queue is empty, when all eligible components have been labeled.
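The propagation procedure can be sketched as follows. This is a simplified implementation under our own assumptions: the fraction of low-score edges dropped per round (the bottom ~10%) and the fate of nodes that lose all their edges (they remain unlabeled) are illustrative choices the paper does not specify.

```python
from collections import defaultdict, deque

def propagate_pseudo_labels(edges, max_size):
    """Assign pseudo labels by splitting connected components larger than max_size.

    `edges` maps a pair (a, b) to its positive-pair probability from the mediator.
    Oversized components are broken up by discarding their lowest-score edges;
    requires max_size >= 1. Nodes that end up with no edges stay unlabeled.
    """
    def components(edge_set):
        adj = defaultdict(set)
        for a, b in edge_set:
            adj[a].add(b); adj[b].add(a)
        seen, comps = set(), []
        for start in adj:
            if start in seen:
                continue
            comp, stack = set(), [start]
            while stack:               # iterative DFS over one component
                n = stack.pop()
                if n not in comp:
                    comp.add(n)
                    stack.extend(adj[n] - comp)
            seen |= comp
            comps.append(comp)
        return comps

    labels, next_label = {}, 0
    queue = deque((c, set(edges)) for c in components(set(edges)))
    while queue:
        comp, edge_set = queue.popleft()
        if len(comp) <= max_size:      # small enough: give it a fresh pseudo label
            for n in comp:
                labels[n] = next_label
            next_label += 1
            continue
        # keep only this component's edges, drop the lowest-scoring ones, re-split
        comp_edges = {e for e in edge_set if e[0] in comp and e[1] in comp}
        cutoff = sorted(edges[e] for e in comp_edges)[len(comp_edges) // 10]
        kept = {e for e in comp_edges if edges[e] > cutoff}
        queue.extend((c, kept) for c in components(kept))
    return labels
```

Each round removes at least the lowest-scoring edge of an oversized component, so the loop always terminates.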
3.2 Joint Training using Labeled and Unlabeled Data
Once the unlabeled data are assigned pseudo labels, we can use them to augment the labeled data and update the base model. Since the identity intersection of the two data sets is unknown, we formulate the learning in a multi-task training fashion, as shown in Fig. 3. The CNN architectures for the two tasks are exactly the same as the base model, and their weights are shared. Both CNNs are followed by a fully-connected layer that maps deep features into the respective label space. The overall optimization objective is $L = \lambda L_l + (1 - \lambda) L_u$, where each task loss is the same as the one used for training the base model and the committee members; in the following experiments, we employ the softmax loss. Note that there is no restriction on which loss can be equipped with CDP; in Section 4.3, we show that CDP still helps considerably with more advanced loss functions. Here $L_l$ denotes the loss on the labeled data, while $L_u$ denotes the loss on the unlabeled data with their assigned labels. $\lambda$ is the weight balancing the two components; its value is fixed following the proportion of images in the labeled and unlabeled sets. The model is trained from scratch.
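A minimal NumPy sketch of such a joint objective, assuming a convex combination of two softmax cross-entropy task losses over shared-backbone features (the weighting form is our reading of the text; all names are ours, and this is a forward pass only, with no back-propagation):

```python
import numpy as np

def softmax_xent(logits, labels):
    """Numerically stable mean softmax cross-entropy."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())

def joint_loss(feats_l, y_l, feats_u, y_u, W_l, W_u, lam):
    """Weighted multi-task objective over features from one shared backbone.

    feats_l / feats_u come from the same shared CNN; W_l and W_u are the two
    task-specific fully-connected heads mapping features to each label space.
    """
    return (lam * softmax_xent(feats_l @ W_l, y_l)
            + (1 - lam) * softmax_xent(feats_u @ W_u, y_u))
```

With `lam` set by the labeled/unlabeled image proportion, the same gradient step updates the shared backbone from both label spaces at once.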
4 Experiments
Training Set. MS-Celeb-1M [9] is a large-scale face recognition dataset. To address the original annotation noise, we clean up the official training set and crawl images of more identities, producing over 6M images in total. We randomly split the cleaned dataset into 11 balanced parts by identity, so as to ensure that there is no identity overlap between different parts. Note that though our experiments adopt this harder setting, our approach can readily be applied to identity-overlapping settings, since it makes no assumptions about the identities. Among the different parts, one part is regarded as labeled and the other ten parts as unlabeled. We also use one of the unlabeled parts as a validation set to adjust hyper-parameters and perform the ablation study. The model trained only on the labeled part is regarded as the lower-bound performance, while the fully-supervised version is trained with full labels from all 11 parts. To investigate the utility of the unlabeled data, we compare different methods with 2, 4, 6, 8, and 10 parts of unlabeled data included, respectively.
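The identity-disjoint split described above can be sketched as below. The round-robin partition by identity is our own simplification; the paper's exact balancing procedure is not specified.

```python
import random

def split_by_identity(image_ids, num_parts=11, seed=0):
    """Partition a dataset into identity-disjoint parts.

    `image_ids` maps image name -> identity. Identities (not images) are
    shuffled and dealt round-robin, so no identity appears in two parts.
    """
    identities = sorted(set(image_ids.values()))
    random.Random(seed).shuffle(identities)
    part_of = {idt: i % num_parts for i, idt in enumerate(identities)}
    parts = [[] for _ in range(num_parts)]
    for img, idt in image_ids.items():
        parts[part_of[idt]].append(img)
    return parts
```

This balances the identity count per part; balancing image counts as well would require a bin-packing step on top.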
Testing Sets. MegaFace [13] is currently the largest public benchmark for face identification. It includes a gallery set of one million images and a probe set from FaceScrub [17] with 3,530 images. However, since FaceScrub contains some noisy images, we use the noise list provided by InsightFace (https://github.com/deepinsight/insightface/tree/master/src/megaface) to clean it. We adopt the rank-1 identification rate of the MegaFace benchmark, i.e., the average top-1 hit rate when retrieving the best-matching image from the gallery. IJB-A [17] is a face verification benchmark containing 5,712 images from 500 identities. We report the true positive rate at a false positive rate of 0.001.
Committee Setup. To create a “committee” with high heterogeneity, we employ popular CNN architectures including ResNet18 [10], ResNet34, ResNet50, ResNet101, DenseNet121 [12], VGG16 [23], Inception V3 [28], Inception-ResNet V2 [27], and a smaller variant of NASNet-A [37]. The number of committee members is eight in our experiments, but we also explore varying the number of committee members from 0 to 8. We train all the architectures on the labeled part of the data; their performance and numbers of parameters are listed in Table 1. Tiny NASNet-A shows the best performance among all the architectures while using the smallest number of parameters. Model ensemble results are also presented. Empirically, the best ensemble combination assembles the four top-performing models, i.e., Tiny NASNet-A, Inception-ResNet V2, DenseNet121, and ResNet101, yielding 69.86% and 76.97% on the two benchmarks. We select Tiny NASNet-A as our base architecture and the other 8 models as committee members. The following experiments demonstrate that the “committee” helps even though its members are weaker than the base architecture. In Section 4.3, we also show that our approach is widely applicable by switching the base architecture.
| Role | Architecture | MegaFace | IJB-A | Parameters |
|---|---|---|---|---|
| Base | Tiny NASNet-A | 61.78 | 75.87 | |
| Committee | VGG16 | 50.22 | 70.75 | |
| Committee | ResNet18 | 51.48 | 69.23 | |
| Committee | ResNet34 | 52.44 | 72.52 | |
| Committee | Inception V3 | 52.82 | 75.53 | |
| Committee | ResNet50 | 56.16 | 73.21 | |
| Committee | ResNet101 | 57.87 | 74.52 | |
| Committee | Inception-ResNet V2 | 58.68 | 75.13 | |
| Committee | DenseNet121 | 60.77 | 69.78 | |
| Ensemble | (multiple) | 69.86 | 76.97 | |
Implementation Details. The “mediator” is an MLP classifier with a few fully-connected hidden layers, using ReLU as the activation function. At test time, we set a high probability threshold to select high-confidence pairs. More details can be found in the supplementary material.

4.1 Comparisons and Results
Competing Methods. 1) Supervised deep feature extractor + hierarchical clustering: We prepare a strong baseline by applying hierarchical clustering to features from a supervised deep feature extractor. Hierarchical clustering is a practical way to deal with massive data compared with other clustering methods. The clusters are assigned pseudo labels and used to augment the training set. For best performance, we carefully adjust the clustering threshold on the validation set and discard clusters containing just a single image. 2) Pair selection by naive committee voting: A pair is selected if it is voted for by all the committee members (the best setting empirically). A vote is counted if there is an edge between the pair in the kNN graph of a committee member.

Benchmarking. As shown in Fig. 4, the proposed CDP method achieves impressive results on both benchmarks. From the results, we observe the following:
1) Compared to the lower bound (unlabeled:labeled ratio of 0:1) with no unlabeled data, CDP obtains significant and steady improvements across different quantities of unlabeled data.
2) CDP surpasses the “Hierarchical Clustering” baseline by a large margin, obtaining competitive or even better results than the fully-supervised counterpart. On the MegaFace benchmark, with 10-fold unlabeled data added, CDP yields a 78.18% identification rate. Compared to the lower bound without unlabeled data, which yields 61.78%, CDP obtains a 16.4% improvement. Notably, there is only a 0.34% gap between CDP and the fully-supervised setting, which reaches 78.52%. The results suggest that CDP is capable of maximizing the utility of the unlabeled data.
3) CDP with the “mediator” performs better than with naive voting, indicating that the “mediator” is more capable of aggregating committee opinions.
4) On the IJB-A face verification task, both settings of CDP surpass the fully-supervised counterpart. The poorer results of the fully-supervised baseline suggest the vulnerability of this task to noisy annotations in the training set, as discussed in Section 1. By contrast, our method is more resilient to noise, which we discuss next based on Fig. 9.
Visual Results. We visualize the results of CDP in Fig. 9. It can be observed that CDP is highly precise in identity label assignment, regardless of the diverse backgrounds, expressions, poses, and illuminations. CDP is also selective in choosing samples as pair candidates: it automatically discards 1) wrongly-annotated faces not belonging to any identity, and 2) samples of extremely low quality, including heavily blurred and cartoon images. This explains why CDP outperforms the fully-supervised baseline on the IJB-A face verification task (Fig. 4).
4.2 Ablation Study
We perform an ablation study on the validation set to show the gain of each component, as shown in Table 2. Several indicators are included for comparison. Higher recall and precision of the selected pairs result in a better consensus-driven graph and hence improve the quality of the assigned labels. For the assigned labels, pairwise recall and precision reflect label quality and correlate directly with the final performance on the two benchmarks. Higher pairwise recall indicates more true examples per category, which is important for the subsequent training; higher pairwise precision indicates less noise per category.
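For reference, pairwise recall and precision of assigned labels can be computed as below, treating every sample pair that shares a pseudo label as a predicted positive and every pair sharing a ground-truth identity as an actual positive. This is a common pair-counting definition; the paper's exact formulation may differ.

```python
from itertools import combinations

def pairwise_metrics(true_labels, pseudo_labels):
    """Pairwise recall/precision of assigned labels over all sample pairs."""
    tp = pred = pos = 0
    for a, b in combinations(range(len(true_labels)), 2):
        same_true = true_labels[a] == true_labels[b]
        same_pred = pseudo_labels[a] == pseudo_labels[b]
        tp += same_true and same_pred   # pair correct under both labelings
        pred += same_pred               # predicted-positive pairs
        pos += same_true                # ground-truth positive pairs
    recall = tp / pos if pos else 0.0
    precision = tp / pred if pred else 0.0
    return recall, precision
```

The exhaustive pair enumeration is quadratic; at million scale one would count pairs per (true label, pseudo label) contingency cell instead.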
The Effectiveness of the “Committee”. When varying the number of committee members, we adjust the pair similarity threshold to obtain a fixed recall for convenience. As the number of committee members increases, an interesting observation is that the peak of precision occurs at 4 members. However, this does not yield the best quality of assigned labels, which occurs at 6 to 8 members. This shows that more committee members bring more meaningful pairs rather than merely correct pairs, consistent with our assumption that the committee selects more hard positive pairs relative to the base model.
The Effectiveness of the “Mediator”. For the “mediator”, we study the influence of different input settings. With only the “relationship vector” as input, the indicator values are close to those of direct voting. Adding the “affinity vector” remarkably improves the recall and precision of the selected pairs, and also improves both the pairwise recall and precision of the assigned labels. The “neighbor distribution vectors” (mean and variance) further boost the quality of the assigned labels. The improvements originate in the information carried by these respective inputs, and hence the “mediator” performs better than naive voting.
| Methods | Committee number | Mediator inputs | Pair recall | Pair precision | Pairwise recall | Pairwise precision |
|---|---|---|---|---|---|---|
| Clustering | – | – | – | – | 0.558 | 0.950 |
| Voting | 0 | – | 0.313 | 0.966 | 0.680 | 0.829 |
| Voting | 2 | – | 0.313 | 0.986 | 0.783 | 0.849 |
| Voting | 4 | – | 0.313 | 0.987 | 0.791 | 0.862 |
| Voting | 6 | – | 0.313 | 0.984 | 0.801 | 0.877 |
| Voting | 8 | – | 0.313 | 0.979 | 0.807 | 0.876 |
| Mediator | 8 | relationship | 0.318 | 0.975 | 0.825 | 0.822 |
| Mediator | 8 | + affinity | 0.561 | 0.982 | 0.832 | 0.888 |
| Mediator | 8 | + distribution | 0.527 | 0.983 | 0.825 | 0.912 |
4.3 Further Analysis
Different Base Architectures. In the previous experiments, we chose Tiny NASNet-A as the base model and the other architectures as committee members. To investigate the influence of the base model, we now switch it to ResNet18, ResNet50, and Inception-ResNet V2, respectively, and list the performance in Table 3. We observe consistent and large improvements over the lower bound for all base architectures. Specifically, with the high-capacity Inception-ResNet V2, CDP achieves 81.88% and 92.07% on the MegaFace and IJB-A benchmarks, improvements of 23.20% and 16.94%, respectively. This is significant considering that CDP uses the same amount of labeled data as the lower bound (9% of all labels). Our performance is also much higher than the ensemble of the base model and the committee, indicating that CDP actually exploits the intrinsic structure of the unlabeled data to learn effective representations.
| Base | Benchmark | Lower Bound | CDP | Supervised |
|---|---|---|---|---|
| ResNet18 | MegaFace | 51.48 | 72.75 | 73.88 |
| ResNet18 | IJB-A | 69.23 | 86.23 | 85.08 |
| ResNet50 | MegaFace | 56.16 | 75.66 | 77.13 |
| ResNet50 | IJB-A | 73.12 | 88.34 | 87.92 |
| Tiny NASNet-A | MegaFace | 61.78 | 78.18 | 78.52 |
| Tiny NASNet-A | IJB-A | 75.87 | 90.64 | 89.40 |
| Inception-ResNet V2 | MegaFace | 58.68 | 81.88 | 84.74 |
| Inception-ResNet V2 | IJB-A | 75.13 | 92.07 | 91.90 |
Different $k$ in kNN. Here we inspect the effect of $k$ in the kNN graphs. In this comparison, the probability threshold for a pair to be positive is fixed. As shown in Table 4, a higher $k$ results in more selected pairs and thus a denser consensus-driven graph, while the precision is almost unchanged. Note that the recall drops because the number of true pairs increases faster than the number of selected pairs. In fact, it is unnecessary to pursue a high recall rate once enough pairs are selected. For the assigned labels, a denser graph brings higher pairwise recall and lower pairwise precision. Hence, varying $k$ trades off pairwise recall against pairwise precision of the assigned labels.
| $k$ | Selected pairs | Pair recall | Pair precision | Pairwise recall | Pairwise precision |
|---|---|---|---|---|---|
| 10 | 1.61M | 0.601 | 0.985 | 0.810 | 0.940 |
| 20 | 2.54M | 0.527 | 0.983 | 0.825 | 0.912 |
| 30 | 2.96M | 0.507 | 0.982 | 0.834 | 0.886 |
| 40 | 3.17M | 0.464 | 0.982 | 0.837 | 0.874 |
Committee Heterogeneity. To study the influence of committee heterogeneity, we conduct experiments with a homogeneous committee. The homogeneous committee consists of eight ResNet50 models trained with different data-feeding orders, and the base model is identical to the one in the heterogeneous setting. For a fair comparison, the model capacity of ResNet50 is at the median of the heterogeneous committee. As shown in Table 5, the heterogeneous committee performs better than the homogeneous one, via either voting or the “mediator”. This study verifies that committee heterogeneity is helpful.
| Committee | Method | Pair number | Pair recall | Pair precision | Pairwise recall | Pairwise precision |
|---|---|---|---|---|---|---|
| Homogeneous | voting | 1.93M | 0.368 | 0.648 | 0.746 | 0.681 |
| Homogeneous | mediator | 2.46M | 0.508 | 0.853 | 0.798 | 0.831 |
| Heterogeneous | voting | 1.41M | 0.313 | 0.979 | 0.807 | 0.876 |
| Heterogeneous | mediator | 2.54M | 0.527 | 0.983 | 0.825 | 0.912 |
Inside the Mediator. To evaluate the contribution of each input, we visualize the weights of the first layer of the “mediator” in Fig. 5, where the numbers of input and output channels are 53 and 50, so each column represents the weights of one input. Values shown in green are close to 0, blue values are below 0, and yellow values are above 0; both yellow and blue indicate a high response to the corresponding input. We conclude that the committee’s “affinity vector” and the mean vector of the “neighbor distribution” contribute more to the response than the “relationship vector” and the variance vector of the “neighbor distribution”. This result is reasonable, since similarities carry more information than voting results, and the mean of the neighbor distribution directly reflects the local density.
Incorporating Advanced Loss Functions. Our CDP framework is compatible with various forms of loss functions. Apart from the softmax loss, we also equip CDP with an advanced loss function, ArcFace [7], the current top entry on the MegaFace benchmark. For the parameters related to ArcFace, we set the margin as in [7] and adopt the output setting “E”, i.e., “BN-Dropout-FC-BN”. We also use a cleaner training set, aiming for a higher baseline. As shown in Table 6, CDP still brings large improvements over this much stronger baseline.
| | Softmax | ArcFace [7] |
|---|---|---|
| baseline | 61.78% | 76.93% |
| CDP (ratio = 2) | 70.51% | 83.68% |
Efficiency and Scalability. The step-by-step runtime of CDP is as follows: for million-level data, graph construction (kNN search) takes minutes on a multi-processor CPU, the “committee” + “mediator” network inference takes minutes on eight GPUs, and the propagation takes another few minutes on a single CPU. Since our approach constructs graphs in a bottom-up manner and the “committee” + “mediator” operate only on local structures, the runtime of CDP grows linearly with the amount of unlabeled data. Therefore, CDP is both efficient and scalable.
5 Conclusion
We have proposed a novel approach, Consensus-Driven Propagation (CDP), to exploit massive unlabeled data for improving large-scale face recognition. We achieve highly competitive results against the fully-supervised counterpart by using only 9% of the labels. Extensive analysis covers different aspects of CDP, including the influence of the number of committee members, the inputs to the mediator, the base architecture, and committee heterogeneity. Considering the practical and non-trivial challenges it poses, this problem is well addressed for the first time in the literature.
Acknowledgement: This work is partially supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626), the General Research Fund (GRF) of Hong Kong (No. 14236516, 14241716).
References

[1]
ArgamonEngelson, S., Dagan, I.: Committeebased sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research
11(335) (1999) 
[2]
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with cotraining. In: Proceedings of the eleventh annual conference on Computational learning theory (1998)
 [3] Cao, K., Rong, Y., Li, C., Tang, X., Loy, C.C.: Poserobust face recognition via deep residual equivariant mapping. In: CVPR (2018)
 [4] Chapelle, O., Scholkopf, B., Zien, A.: Semisupervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks 20(3) (2009)
 [5] Chen, L., Wang, F., Li, C., Huang, S., Chen, Y., Qian, C., Loy, C.C.: The devil of face recognition is in the noise. In: ECCV (2018)
 [6] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological) (1977)
 [7] Deng, J., Guo, J., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698 (2018)
 [8] Gao, Y., Ma, J., Yuille, A.L.: Semisupervised sparse representation based classification for face recognition with insufficient labeled samples. TIP 26(5) (2017)
 [9] Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: Msceleb1m: A dataset and benchmark for largescale face recognition. In: ECCV (2016)
 [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
 [11] Huang, C., Li, Y., Loy, C.C., Tang, X.: Deep imbalanced learning for face recognition and attribute prediction. arXiv preprint arXiv:1806.00194 (2018)
 [12] Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., Keutzer, K.: Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869 (2014)
 [13] KemelmacherShlizerman, I., Seitz, S.M., Miller, D., Brossard, E.: The megaface benchmark: 1 million faces for recognition at scale. In: CVPR (2016)

[14]
Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: ICCV (2015)
 [15] Loy, C.C., Hospedales, T.M., Xiang, T., Gong, S.: Streambased joint explorationexploitation active learning. In: CVPR (2012)
 [16] Mitchell, T.M.: The role of unlabeled data in supervised learning. In: Language, Knowledge, and Representation (2004)
 [17] Ng, H.W., Winkler, S.: A data-driven approach to cleaning large face datasets. In: ICIP (2014)
 [18] Roli, F., Marcialis, G.L.: Semi-supervised PCA-based face recognition using self-training. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR) (2006)
 [19] Rosenberg, C., Hebert, M., Schneiderman, H.: Semi-supervised self-training of object detection models (2005)
 [20] de Sa, V.R.: Learning classification with unlabeled data. In: NIPS (1994)
 [21] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: CVPR (2015)
 [22] Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Proceedings of the fifth annual workshop on Computational learning theory (1992)
 [23] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
 [24] Sohn, K., Liu, S., Zhong, G., Yu, X., Yang, M.H., Chandraker, M.: Unsupervised domain adaptation for face recognition in unlabeled videos. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. pp. 3210–3218 (2017)
 [25] Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: NIPS (2014)
 [26] Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. In: CVPR (2014)
 [27] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI. vol. 4 (2017)
 [28] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
 [29] Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: ECCV (2016)
 [30] Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: ACL (1995)
 [31] Zhan, X., Liu, Z., Luo, P., Tang, X., Loy, C.C.: Mix-and-match tuning for self-supervised semantic segmentation. In: AAAI (2018)
 [32] Zhang, X., Yang, L., Yan, J., Lin, D.: Accelerated training for massive classification via dynamic class selection. In: AAAI (2018)
 [33] Zhao, X., Evans, N., Dugelay, J.L.: Semi-supervised face recognition with LDA self-training. In: ICIP (2011)
 [34] Zhu, X.: Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison (2006)
 [35] Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation (2002)
 [36] Zhu, X., Lafferty, J., Rosenfeld, R.: Semi-supervised learning with graphs. Ph.D. thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science (2005)
 [37] Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012 (2017)
Appendix 0.A Detailed Implementation
We use PyTorch (http://pytorch.org/) to implement our CNN models and the “mediator”.
0.A.1 Supervised Initialization
For the “tiny NASNet-A”, we use the implementation of “NASNet-A-Large” (https://github.com/Cadene/pretrainedmodels.pytorch), and keep “x_conv0”, “x_stem_01”, “x_cell_01”, “x_reduction_cell_0”, “x_cell_67”, while removing the other cells.
For all backbone architectures of the base model and the committee members, we replace the last “average pooling” layer with a convolution layer followed by a “fully-connected” layer, to embed each image into a dimensional feature vector. The feature vector is then fed into a linear layer to produce a score for each category. The models are trained from scratch with the “Xavier” initialization strategy. The batch size ranges from to , and the initial learning rate ranges from to , w.r.t. the different architectures. Each batch is scattered across GPUs. The learning rate is decayed by 10 times at epoch and , where is the maximal number of epochs, ranging from to w.r.t. the different architectures.
0.A.2 Consensus-Driven Propagation
We use NMSLIB (https://github.com/searchivarius/nmslib) for cosine similarity based kNN search, and k is set to in our main comparison.
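As a minimal, brute-force illustration of the cosine-similarity kNN step (the actual pipeline uses NMSLIB for fast approximate search at scale; the function name and list-of-vectors interface here are our own, assumed for the sketch):

```python
import math

def cosine_knn(features, k):
    """Brute-force cosine-similarity k-NN over a list of equal-length
    vectors. Returns, for each sample, the indices of its k most
    similar other samples (a stand-in for the NMSLIB search)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0

    # Normalize once so that the dot product equals cosine similarity.
    unit = [[x / norm(v) for x in v] for v in features]
    neighbors = []
    for i, u in enumerate(unit):
        sims = [(sum(a * b for a, b in zip(u, w)), j)
                for j, w in enumerate(unit) if j != i]
        sims.sort(reverse=True)
        neighbors.append([j for _, j in sims[:k]])
    return neighbors
```

At real scale an exact scan like this is quadratic in the number of samples, which is why an approximate index is used instead.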
The “mediator” is an MLP classifier with hidden layers, each containing nodes. It uses ReLU as the activation function and the “cross-entropy” loss for binary classification. Note that the configuration of the “mediator” makes little difference to the final results, as long as it is not so complicated that it overfits the training pairs. We train the “mediator” on pairs extracted from the kNN graph of the base model on the labeled data, for epochs until convergence. The learning rate is initially , and is decayed by 10 times when epoch finishes.
At testing time, we feed pairs from the unlabeled data into the trained “mediator” and obtain a probability for each pair. We set a probability threshold of to select high-confidence pairs for the construction of the “consensus-based graph”.
Our label propagation algorithm is shown in Algorithm 1.
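Since Algorithm 1 itself is not reproduced in this text, the following is a minimal sketch of the core idea, under the assumption that each connected component of the consensus-based graph receives one pseudo identity (the full algorithm may include further steps, e.g. for handling over-large components):

```python
from collections import defaultdict, deque

def propagate_labels(num_nodes, edges):
    """Assign one pseudo label per connected component of the
    consensus-based graph, using BFS. `edges` holds the
    high-confidence pairs selected by the mediator."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    labels = [-1] * num_nodes  # -1 means "not yet labeled"
    current = 0
    for start in range(num_nodes):
        if labels[start] != -1:
            continue
        # Flood-fill the component containing `start`.
        queue = deque([start])
        labels[start] = current
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if labels[nxt] == -1:
                    labels[nxt] = current
                    queue.append(nxt)
        current += 1
    return labels
```

Isolated nodes each end up in a singleton component, so every unlabeled sample still receives some pseudo identity under this sketch.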
0.A.3 Joint Training
In this stage we collect the assigned labels for the unlabeled data and retrain the base model from scratch with both the labeled and unlabeled data in a multi-task manner. The loss weights are equal to the proportion of total images in each part. The model is trained for to epochs w.r.t. different ratios of unlabeled data ( for and for ), and the learning rate schedule is the same as in the supervised initialization.
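The per-part loss weights described above are simple proportions; a small sketch (the function name is ours):

```python
def loss_weights(num_labeled, num_unlabeled):
    """Loss weights equal to each part's share of the total images,
    used to balance the labeled and unlabeled losses in joint training."""
    total = num_labeled + num_unlabeled
    return num_labeled / total, num_unlabeled / total
```

The two weights always sum to one, so the overall loss scale is unchanged as the ratio of unlabeled data grows.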
Appendix 0.B More Analysis
0.B.1 One-hot Labels vs. Soft Labels
The label propagation procedure in CDP is flexible and can be adapted to other label modalities. For example, it can also propagate soft labels, i.e., the vector of probabilities that a node belongs to each identity. The propagation of soft labels follows an initial propagation of one-hot labels. The label vectors are then diffused from each node to its neighbors in a breadth-first manner, with two hyper-parameters and , standing for the maximal diffusion depth and the decay ratio of the values at each diffusion step. Finally, on each node, the values over identities are normalized to form a probability vector. In this experiment, we adopt the “Cross-Entropy Loss” to utilize the soft labels in multi-task training. As shown in Table 7, with an appropriate combination of and , soft labels help to improve the performance on MegaFace by point, very close to the fully-supervised counterpart.
Table 7: Performance with soft labels.
method           parameters            MegaFace   IJB-A
CDP (Ratio=2)    depth=0               70.51%     85.6%
                 depth=3, decay=0.2    71.21%     86.70%
                 depth=3, decay=0.5    70.58%     85.78%
                 depth=5, decay=0.2    70.66%     86.11%
                 depth=5, decay=0.5    71.48%     85.52%
                 depth=5, decay=0.8    70.59%     85.50%
                 depth=10, decay=0.2   70.32%     86.69%
supervised       -                     71.5%      84.07%
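The soft-label diffusion described above can be sketched as follows, assuming each node initially holds weight 1.0 on its one-hot label and that contributions decay multiplicatively per hop; the exact accumulation and tie-breaking rules are our assumptions:

```python
from collections import defaultdict, deque

def diffuse_soft_labels(labels, edges, depth, decay):
    """Breadth-first soft-label diffusion: each node's one-hot label is
    spread to neighbors up to `depth` hops away, scaled by `decay` per
    hop, and the accumulated per-label scores are normalized into a
    probability vector for every node."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    n = len(labels)
    scores = [defaultdict(float) for _ in range(n)]
    for src in range(n):
        # BFS from src, depositing a decayed contribution of src's label.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            scores[node][labels[src]] += decay ** dist[node]
            if dist[node] == depth:
                continue  # do not expand beyond the maximal depth
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)

    # Normalize the accumulated scores into probability vectors.
    result = []
    for s in scores:
        total = sum(s.values())
        result.append({label: v / total for label, v in s.items()})
    return result
```

With depth=0 this degenerates to the plain one-hot assignment, matching the first table row's role as the baseline.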
Appendix 0.C Visual Results
Fig. 7 shows a partial view of the “consensus-based graph” in CDP. It clearly shows that CDP produces dense connections among samples of the same category and weak connections between samples of different categories. Such a graph facilitates the subsequent label propagation.
Fig. 8 shows 5 groups of faces and the assigned labels. For most examples in the unlabeled data, CDP is able to group faces belonging to the same person together and assign them the same label.
Fig. 9 shows 4 groups of faces where CDP is able to automatically discard noisy samples.
Fig. 10 shows typical failure cases of our method. In some cases, CDP cannot identify heavily occluded faces and atypical faces that even humans cannot easily discriminate. This is due to the lack of extreme training examples in the labeled data; hence neither the base model nor the committee trained on the labeled data can handle those cases well. However, these failure cases can be handled, and the performance of CDP continuously improved, as the base model and the committee grow stronger.