Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms

P. Jonathon Phillips (a,1), Amy N. Yates (a), Ying Hu (b), Carina A. Hahn (b), Eilidh Noyes (b), Kelsey Jackson (b), Jacqueline G. Cavazos (b), Géraldine Jeckeln (b), Rajeev Ranjan (c), Swami Sankaranarayanan (c), Jun-Cheng Chen (d), Carlos D. Castillo (d), Rama Chellappa (c), David White (e), and Alice J. O'Toole (b)

(a) Information Access Division, National Institute of Standards and Technology, Gaithersburg, MD 20899; (b) School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, TX 75080; (c) Department of Electrical and Computer Engineering, University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20854; (d) University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20854; and (e) School of Psychology, The University of New South Wales, Sydney, NSW 2052, Australia

Edited by Thomas D. Albright, The Salk Institute for Biological Studies, La Jolla, CA, and approved April 30, 2018 (received for review December 13, 2017)

Achieving the upper limits of face identification accuracy in forensic applications can minimize errors that have profound social and personal consequences. Although forensic examiners identify faces in these applications, systematic tests of their accuracy are rare. How can we achieve the most accurate face identification: using people and/or machines working alone or in collaboration? In a comprehensive comparison of face identification by humans and computers, we found that forensic facial examiners, facial reviewers, and superrecognizers were more accurate than fingerprint examiners and students on a challenging face identification test. Individual performance on the test varied widely. On the same test, four deep convolutional neural networks (DCNNs), developed between 2015 and 2017, identified faces within the range of human accuracy. Accuracy of the algorithms increased steadily over time, with the most recent DCNN scoring above the median of the forensic facial examiners. Using crowd-sourcing methods, we fused the judgments of multiple forensic facial examiners by averaging their rating-based identity judgments. Accuracy was substantially better for fused judgments than for individuals working alone. Fusion also served to stabilize performance, boosting the scores of lower-performing individuals and decreasing variability. Single forensic facial examiners fused with the best algorithm were more accurate than the combination of two examiners. Therefore, collaboration among humans and between humans and machines offers tangible benefits to face identification accuracy in important applications. These results offer an evidence-based roadmap for achieving the most accurate face identification possible.

face identification | forensic science | face recognition algorithm | wisdom-of-crowds | machine learning technology

Significance

This study measures face identification accuracy for an international group of professional forensic facial examiners working under circumstances that apply in real-world casework. Examiners and other human face "specialists," including forensically trained facial reviewers and untrained superrecognizers, were more accurate than the control groups on a challenging test of face identification. Therefore, specialists are the best available human solution to the problem of face identification. We present data comparing state-of-the-art face recognition technology with the best human face identifiers. The best machine performed in the range of the best humans: professional facial examiners. However, optimal face identification was achieved only when humans and machines worked in collaboration.

Author contributions: P.J.P., A.N.Y., D.W., and A.J.O. designed research; R.R., S.S., J.-C.C., C.D.C., and R.C. contributed new reagents/analytic tools; P.J.P., A.N.Y., Y.H., C.A.H., E.N., K.J., J.G.C., G.J., and A.J.O. analyzed data; R.R., S.S., J.-C.C., C.D.C., and R.C. implemented and ran the face recognition algorithms; and P.J.P. and A.J.O. wrote the paper.

Conflict of interest statement: The University of Maryland is filing a US patent application that will cover portions of algorithms A2017a and A2017b. R.R., C.D.C., and R.C. are coinventors on this patent.

This article is a PNAS Direct Submission. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

1 To whom correspondence should be addressed. Email: jonathon@nist.gov. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1721355115/-/DCSupplemental.

Societies rely on the expertise and training of professional forensic facial examiners, because decisions by professionals are thought to assure the highest possible level of face identification accuracy. If accuracy is the goal, however, the scientific literature in psychology and computer vision points to three additional approaches that merit consideration. First, untrained "superrecognizers" from the general public perform surprisingly well on laboratory-based face recognition studies (1). Second, wisdom-of-crowds effects for face recognition, implemented by averaging individuals' judgments, can boost performance substantially over the performance of a person working alone (2–5). Third, computer-based face recognition algorithms over the last decade have steadily closed the gap between human and machine performance on increasingly challenging face recognition tasks (6, 7).

Beginning with forensic facial examiners, remarkably little is known about their face identification accuracy relative to people without training, and nothing is known about their accuracy relative to computer-based face recognition systems. Independent and objective scientific research on the accuracy of forensic facial practitioners began in response to the National Research Council report Strengthening Forensic Science in the United States: A Path Forward (8; cf. ref. 9). In the most comprehensive study to date (3), forensic facial examiners were superior to motivated control participants and to students on six tests of face identity matching. However, image pairs in these tests appeared for a maximum of 30 s. Identification decisions in a forensic laboratory typically require days or weeks to complete and are made with the assistance of image measurement and manipulation tools (10). Accordingly, the performance of forensic facial examiners in ref. 3 represents a lower-bound estimate of the accuracy of examiners in practice.

Superrecognizers are untrained people with strong skills in face recognition. Multiple laboratory-based face recognition tests of these individuals indicate that highly accurate face identification can be achieved by people with no professional training (1). Superrecognizers contribute to face recognition decisions made in law enforcement (11, 12) but have not been compared with forensic examiners or machines.

The term wisdom-of-crowds refers to accuracy improvements achieved by combining the judgments of multiple individuals to make a decision. Face recognition accuracy by humans can be boosted substantially by crowd-sourcing responses (2–5),
including for forensic examiners in a time-restricted laboratory experiment (3). Combining human and machine face identification judgments also improves accuracy over either one operating alone (5). The effect of fusing the judgments of professionals and algorithms has not been explored.

Computer-based face recognition systems now assist forensic face examiners by searching databases of images to generate potential identity matches for human review (13). Direct comparisons between human and machine accuracy have been based on algorithms developed before 2013. At that time, algorithms performed well with high-quality frontal images of faces with minimal changes in illumination and expression. Since then, deep learning and deep convolutional neural networks (DCNNs) have become the state of the art for face recognition (14–18). DCNNs can recognize faces from highly variable, low-quality images. These algorithms are often trained with millions of face images of thousands of people.

Our goal was to achieve the most accurate face identification using people and/or machines working alone or in collaboration. The task was to determine whether pairs of face images showed the same person or different people. Image pairs were prescreened to be highly challenging based on data from humans and computer algorithms. Images were taken with limited control of illumination, expression, and appearance. Fig. 1 shows two example pairs (all pairs are shown in SI Appendix, Figs. S8–S14).

Fig. 1. Examples highlighting the face region in the images used in this study (all image pairs are shown in SI Appendix, Figs. S8–S14). (Left) This pair shows the same identity; (Right) this pair shows different identities.

To provide a comprehensive assessment of human accuracy, we tested three face specialist groups (forensic facial examiners, forensic facial reviewers, and superrecognizers) and two control groups (fingerprint examiners and undergraduate students). Humans responded on a 7-point scale that varied from high confidence that the pair showed the same person (+3) to high confidence that the pair showed different people (−3). We also tested four face recognition algorithms based on DCNNs developed between 2015 and 2017. Algorithm responses were real-valued similarity scores indicating the likelihood that the images showed the same person. The five subject groups and four algorithms were tested on the same image pairs. Facial examiners, reviewers, superrecognizers, and fingerprint examiners had 3 mo to complete the test. Students took the test in a single session.

Forensic facial experts are professionals trained to identify faces in images and videos using a set of tools and procedures (10) that vary across forensic laboratories (19). We tested two classes of forensic facial professionals. Examiners (n = 57, 28 females, from five continents) have extensive training, and their identity comparisons involve a rigorous and time-consuming process. Their identification decisions can be presented in written documents that can be used to support legal actions, prosecutions, and expert testimony in court. Reviewers (n = 30, 17 females, from two continents) are trained to perform faster and less rigorous identifications that may be used in law enforcement and can assist in generating leads in criminal cases.
We also tested superrecognizers (n = 13, 8 females, from two continents) (20), defined here as people who had taken a standard face recognition test that qualified them as superrecognizers (1) or who are used professionally as superrecognizers (e.g., by the London Metropolitan Police) (SI Appendix, SI Text).

Professional fingerprint examiners and undergraduate students served as control groups. Fingerprint examiners (n = 53, 41 females, from two continents) are trained forensic professionals who perform fingerprint comparisons. They provide a baseline for forensic ability and training that excludes expertise in facial forensics. Fingerprint examiners complete extensive training for professional certification. Undergraduate students (n = 31, 24 females, from one continent) were tested as a proxy for the general population.

To compare humans with face recognition algorithms, four DCNNs were tested on the same stimuli judged by humans. We refer to the algorithms as A2015 (14), A2016 (15), A2017a (16), and A2017b (17). The inclusion of multiple algorithms provides a robust sample of the state of the art for automatic face recognition. To make the test comparable with humans as an "unfamiliar" face matching test, we verified that none of the algorithms had been trained on images from the dataset used for the human test. Note that A2015 can be downloaded from the web and, therefore, provides a public benchmark algorithm.

Results

Accuracy. Fig. 2 shows performance of the subject groups and algorithms using the area under the receiver operating characteristic curve (AUC) as a measure of accuracy. The groups are ordered by median AUC from most to least accurate: facial examiners (0.93), facial reviewers (0.87), superrecognizers (0.83), fingerprint examiners (0.76), and students (0.68). Algorithm performance increased monotonically from the oldest algorithm (A2015) to the newest algorithm (A2017b). Comparing the algorithms with the human groups, the publicly available algorithm (A2015) performed at a level similar to the students (0.68). Algorithm A2016 performed at the level of fingerprint examiners (0.76). Algorithm A2017a performed at a level (0.85) comparable with the superrecognizers (0.83) and reviewers (0.87). The performance of A2017b (0.96) was slightly higher than the median of the facial examiners (0.93).

More formally, all face specialist groups surpassed fingerprint examiners (facial examiners, P = 2.14 × 10⁻⁶; facial reviewers, P = 0.004; superrecognizers, P = 0.017). The face specialist groups also surpassed students (facial examiners, P = 2.53 × 10⁻⁸; facial reviewers, P = 4.01 × 10⁻⁶; superrecognizers, P = 0.0005) (SI Appendix, SI Text). Performance across the face specialist groups did not differ statistically. Summary statistics for accuracy, however, should be interpreted in the context of the full performance distributions within each group.

Performance Distributions. Individual accuracy varied widely in all groups. All face specialist groups (facial examiners, reviewers, and superrecognizers) had at least one participant with an AUC below the median of the students. At the top of the distribution, all but the student group had at least one participant with no errors.
To examine specialist groups in the context of the general population (students), we fit a Gaussian distribution to the student AUCs (SI Appendix, SI Text). Next, we computed the fraction of participants in each group who scored above the 95th percentile of this distribution (Fig. 2, dashed line). For the facial examiner group, 53% were above the 95th percentile of students; for the facial reviewers, this proportion was 36%. For superrecognizers, it was 46%, and for fingerprint examiners, it was 17%.

For the algorithms, the accuracy of A2017b was higher than the majority (73%) of participants in the face specialist groups. Conversely, 35% of examiners, 13% of reviewers, and 23% of superrecognizers were more accurate than A2017b. Compared with students, the accuracy of A2017b was equivalent to a student at the 98th percentile (z score = 2.090), A2017a was at the 91st percentile (z score = 1.346), A2016 was at the 76th percentile (z score = 0.676), and A2015 was at the 53rd percentile (z score = 0.082). These results show a steady increase in algorithm accuracy from a level comparable with students in 2015 to a level comparable with the forensic facial examiners in 2017.

Fig. 2. Human and machine accuracy. Black dots indicate AUCs of individual participants; red dots are group medians. In the algorithms column, red dots indicate algorithm accuracy. Face specialists (facial examiners, facial reviewers, and superrecognizers) surpassed fingerprint examiners, who surpassed the students. The violin plot outlines are estimates of the density of the AUC distribution for the subject groups. The dashed horizontal line marks the accuracy of a 95th percentile student. All algorithms perform in the range of human performance. The best algorithm places slightly above the forensic examiners' median.
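The percentile analysis above reduces to fitting a normal distribution to the student AUCs and reading off z scores. The following is a minimal sketch of that computation; the AUC arrays are synthetic stand-ins for the study's per-participant data (available in Datasets S1 and S2), so the printed numbers are illustrative only.

```python
# Sketch of the student-referenced percentile analysis, with synthetic AUCs
# standing in for the real per-participant data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
student_aucs = rng.normal(0.68, 0.10, size=31)    # hypothetical student AUCs

mu, sigma = student_aucs.mean(), student_aucs.std(ddof=1)   # Gaussian fit

def student_percentile(auc):
    """z score and percentile of an AUC relative to the fitted student Gaussian."""
    z = (auc - mu) / sigma
    return z, 100 * norm.cdf(z)

print(student_percentile(0.96))            # e.g., the AUC of the best algorithm

# Fraction of a group scoring above the student 95th percentile (the dashed
# line in Fig. 2), again with hypothetical examiner AUCs.
threshold_95 = norm.ppf(0.95, mu, sigma)
examiner_aucs = rng.normal(0.90, 0.08, size=57)
print(np.mean(examiner_aucs > threshold_95))
```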
Fusing Human Judgments. In forensic practice, it is common for multiple examiners to review an identity comparison to assure consistency and consensus (3, 5). To examine the effects of fusion on accuracy, we combined individual participants' judgments in each group. We began with one participant and increased the number of participants' judgments fused from 2 to 10. To fuse n participants, we selected n participants randomly and averaged their rating-based judgments for each image pair. For fusing judgments, averaging is generally the most effective fusion strategy (21). An AUC was then computed from these average judgments. The sampling procedure was repeated 100 times for each value of n.

Median accuracy peaked at 1.0 (no errors) with the fusion of four examiners or three superrecognizers (Fig. 3). The performance of all of the groups increased with fusion (SI Appendix, SI Text). For reviewers, the median peaked at 0.98 with 10 participants fused. Fingerprint examiners peaked at a median of 0.97 for 10 participants. For superrecognizers, the median increased from 0.83 to 0.98 when two superrecognizers were fused and to 1.0 when three or more superrecognizers were fused. Using a fusion perspective in comparing accuracy across participant groups, the data indicate that the median examiner (0.93) performs at a level roughly equal to two facial reviewers (median = 0.93) and seven fingerprint examiners (median = 0.94). Notably, the median of individual judgments by examiners is superior to the combination of 10 students (median = 0.88).

Fig. 3. Plots illustrate the effectiveness of fusing multiple participants within groups. For all groups, combining judgments by simple averaging is effective. The violin plots in Upper show the distribution of AUCs for fusing examiners; red circles indicate median AUCs. In Lower, the medians of the AUC distributions for the examiners, reviewers, superrecognizers, fingerprint examiners, and students appear. The median AUC reaches 1.0 for fusing four examiners or fusing three superrecognizers. The median AUC of fusing 10 students was 0.88, substantially below the median AUC for individual examiner accuracy.

Fusing Humans and Machines. We examined the effectiveness of combining examiners, reviewers, and superrecognizers with algorithms. Human judgments were fused with each of the four algorithms as follows. For each face image pair, an algorithm returned a similarity score that is an estimate of how likely it is that the images show the same person. Because the similarity score scales differ across algorithms, we rescaled the scores to the range of human ratings (SI Appendix, SI Text). For each face pair, the human rating and scaled algorithm score were averaged, and the AUC was computed for each participant–algorithm fusion.

Fig. 4 shows the results of fusing humans and algorithms. The most effective fusion was the fusion of individual facial examiners with algorithm A2017b, which yielded a median AUC score of 1.0. This score was superior to the combination of two facial examiners (Mann–Whitney U test = 2.82 × 10⁴, n1 = 1,596, n2 = 57, P = 8.37 × 10⁻⁷). Fusing individual examiners with A2017a and A2016 yielded performance equivalent to the fusion of two examiners (Mann–Whitney U test = 4.53 × 10⁴, n1 = 1,596, n2 = 57, P = 0.956; Mann–Whitney U test = 4.33 × 10⁴, n1 = 1,596, n2 = 57, P = 0.526, respectively). Fusing one examiner with A2015 did not improve accuracy over a single examiner (Mann–Whitney U test = 1,592, n1 = 57, n2 = 57, P = 0.86). Fusing one examiner with A2017b proved more accurate than fusing one examiner with either A2017a or A2016 (Mann–Whitney U test = 1,054, n1 = 57, n2 = 57, P = 7.92 × 10⁻⁴; Mann–Whitney U test = 942, n1 = 57, n2 = 57, P = 7.28 × 10⁻⁵, respectively). Finally, fusing one examiner with both A2017b and A2017a did not improve accuracy over fusing one examiner with A2017b (Mann–Whitney U test = 1,414, n1 = 57, n2 = 57, P = 0.21). This analysis was repeated for fusing algorithms and facial reviewers and for fusing algorithms and superrecognizers. Similar results were found for both groups (SI Appendix, SI Text).

Fig. 4. Fusion of examiners and algorithms. Violin plots show the distribution of AUCs for each fusion test. Red dots indicate median AUCs. The distribution of individual examiners and the fusion of two examiners appear in columns 1 and 2; algorithm performance appears in column 7. In between, plots show the forensic facial examiners fused with each of the four algorithms. Fusing one examiner and A2017b is more accurate than fusing two examiners, fusing examiners and A2017a or A2016 is equivalent to fusing two examiners, and fusing examiners with A2015 does not improve accuracy over a single examiner.
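Both fusion procedures are computationally simple: average the ratings (human–human fusion), or average one rating with a rescaled similarity score (human–machine fusion), then compute an AUC. The sketch below illustrates both with synthetic data; the linear min–max rescaling and all arrays are stand-ins of ours (the paper's exact rescaling procedure is described in SI Appendix, SI Text).

```python
# Sketch of rating-based fusion with synthetic data. Shapes follow the test:
# 20 image pairs, ratings on a -3..+3 scale, labels 1 = same identity, 0 = different.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n_subj, n_pairs = 57, 20
labels = np.array([1] * 12 + [0] * 8)                    # 12 same, 8 different pairs
ratings = rng.integers(-3, 4, (n_subj, n_pairs)).astype(float)  # hypothetical ratings

def fused_auc(rows):
    """AUC of the simple average of several participants' ratings."""
    return roc_auc_score(labels, rows.mean(axis=0))

def sampled_aucs(n, reps=100):
    """AUCs for `reps` random samples of n participants, as in the paper's procedure."""
    return [fused_auc(ratings[rng.choice(n_subj, n, replace=False)]) for _ in range(reps)]

# Human-machine fusion: rescale the algorithm's similarity scores to the rating
# range (a simple linear min-max mapping here, as an illustrative choice), then
# average with one examiner's ratings.
scores = rng.normal(size=n_pairs)                        # hypothetical similarity scores
scaled = -3 + 6 * (scores - scores.min()) / (scores.max() - scores.min())
auc_one_plus_machine = roc_auc_score(labels, (ratings[0] + scaled) / 2)

# Distributions of fused AUCs can then be compared with a Mann-Whitney U test.
u, p = mannwhitneyu(sampled_aucs(2), sampled_aucs(1))
```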
Error Rates for Highly Confident Decisions

In legal proceedings, the conclusions of greatest impact are identification errors made with high confidence. These can lead to miscarriages of justice with profound societal implications. In this study, the two responses that expressed high confidence were "the observations strongly support that it is the same person" (+3) and "the observations strongly support that it is not the same person" (−3). To examine the error rates associated with judgments of +3 and −3, we computed the fraction of high-confidence same-person (+3) ratings made to different identity face pairs and estimated the error rate as a Bernoulli distribution. The Bernoulli parameter q̂ is the fraction of different identity pairs that were given a rating of +3. Fig. 5 shows the estimated parameter q̂ with 95% confidence intervals by participant group (SI Appendix, Table S2 shows the estimated Bernoulli parameters and confidence intervals). The analysis was also conducted on the probability of same identity pairs being assigned a −3 rating.

For facial examiners, the error rate for judging with high confidence that two different faces were the same was 0.009 (upper limit of the confidence interval, 0.022). The corresponding error rate for judging the same person as two different people was 0.018 (upper limit of the confidence interval, 0.030). For facial reviewers, the corresponding error rates and confidence intervals were similar to those for the facial examiners (SI Appendix, SI Text). For superrecognizers, although their error rate for the rating of +3 on two different faces was comparable with that of examiners and reviewers, their error rate for −3 ratings assigned to same face image pairs was higher. Student error rates for high-confidence decisions were substantially higher than those of the facial examiners, reviewers, and superrecognizers. Notably, we found that fusion reduced high-confidence errors for facial examiners, facial reviewers, and superrecognizers (SI Appendix, SI Text). Specifically, fusing one individual and A2017b was superior to fusing two individuals, and fusing two individuals was superior to one individual.

One possible explanation for these results is that forensic professionals avoid extreme ratings at both ends of the scale. To test this, we examined whether forensic professionals (facial examiners, facial reviewers, fingerprint examiners) overall made fewer high-confidence responses than nonprofessionals (superrecognizers, students). For each participant, the number of high-confidence responses was computed. Analysis showed that forensic professionals made fewer high-confidence decisions than nonforensic professionals (Mann–Whitney U test = 1,966.5, n1 = 140, n2 = 44, P = 2.83 × 10⁻⁴). This is consistent with a result obtained in a previous study by Norell et al. (22), which tested police detectives and students on face identity matching experiments. The result suggests that forensic training of any kind may affect the use of the response scale to avoid errors made with high confidence.

Fig. 5. Estimated probability of highly confident same person ratings (+3 judgment, strong evidence of the same person) when the identities are different, and estimated probability of highly confident different person ratings (−3 judgment, strong evidence of different people) when the identity is the same. The 95% confidence intervals are shown.
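The error rate estimate is a binomial proportion: the count of +3 ratings on different identity pairs out of all different identity judgments. A minimal sketch follows; it uses an exact (Clopper–Pearson) binomial interval as one standard way to form 95% bounds, since the paper's exact interval construction is given in SI Appendix, and the counts shown are hypothetical.

```python
# Sketch of a Bernoulli error rate estimate with an exact binomial CI.
# k = number of different identity pairs rated +3 (high-confidence error),
# n = total number of different identity pair judgments in the group.
from scipy.stats import beta

def high_confidence_error_rate(k, n, level=0.95):
    """Point estimate q_hat = k/n with a Clopper-Pearson confidence interval."""
    alpha = 1 - level
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return k / n, (lower, upper)

# Hypothetical counts: 4 high-confidence errors in 456 different identity judgments.
print(high_confidence_error_rate(4, 456))
```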
Discussion

The results of the study point to tangible ways to maximize face identification accuracy by exploiting the strengths of humans and machines working collaboratively. First, to optimize the accuracy of face identification, the best approach is to combine human and machine expertise. Fusing the most accurate machine with individual forensic facial examiners produced decisions that were more accurate than those arrived at by any pair of human and/or machine judges. This human–machine combination yielded higher accuracy than the fusion of two individual forensic facial examiners. Computational theory indicates that fusing systems works best when their decision strategies differ (21, 23). Therefore, the superiority of human–machine fusion over human–human fusion suggests that humans and machines have different strengths and weaknesses that can be exploited and mitigated by cross-fusion.

Second, for human decisions, the highest possible accuracy is obtained when human judgments are combined by simple averaging. The power of fusing human decisions to improve accuracy is well-known in the face recognition literature (3, 4). Our results speak to the tangible benefits of building fusion formally into the forensic decision-making process. Collaborative peer review of decisions is a common strategy in facial forensics. This study suggests that, in addition to social collaboration, computationally combining multiple independent decisions made in isolation also produces solid gains in accuracy (24). Although fusing student judgments improves accuracy, we show that there are limits to the gains possible from fusion. A fusion of student judgments will not approach the accuracy of fusing facial examiners or reviewers. This suggests that a strategy for achieving optimal accuracy is to fuse people in the most accurate group of humans.
Third, systematic differences were found for the performance of the human groups on average. Professional forensic facial examiners, professional facial reviewers, and superrecognizers were the most accurate groups. Fingerprint examiners were less accurate than the face specialists but more accurate than students. Notably, the group medians ranged from highly accurate for facial examiners (AUC = 0.93) to moderately above chance for students (AUC = 0.68). This suggests that our face matching test tapped into the entire operating range of normal human accuracy.

Fourth, the distribution of individual performance in this test was perhaps as informative as the summary data on central tendency. In particular, although the median accuracy measures strongly prescribe the use of professional facial examiners for cases where face identification accuracy is important, some individuals in this group performed poorly. Mitigating this concern to some extent, confident incorrect judgments by facial examiners were extremely rare. At the other end of the spectrum, some individuals in other groups performed with high accuracy that was well within the range of the best face specialists. Remarkably, in all but the student group, at least one individual performed the test with no errors. The range of accuracy of individuals in each group suggests the possibility of prescreening the general population for people with natural ability at face identification. The superrecognizers in our study were not trained formally in face recognition, yet they performed at levels comparable with those of the facial professionals. This suggests that both talent and training may underlie the high accuracy seen in the two groups of facial professionals.

Turning to the performance of the algorithms, the results indicate the potential for machines to contribute beneficially to the forensic process. Accuracy of the publicly available algorithm that we tested (A2015) was at the level of median accuracy of the students, modestly above chance. The other algorithms follow a rapid upward performance trajectory: from parity with a median fingerprint examiner (A2016) to parity with a median superrecognizer (A2017a) and, finally, to parity with median forensic facial examiners (A2017b).

There is now a decade-long effort to compare the accuracy of face recognition algorithms with humans (6). In the earliest tests (25), the face matching tasks presented relatively controlled images. As these tests progressed, algorithms and humans were compared on progressively more challenging image pairs. In this study, image pairs were selected to be extremely challenging based on both human and algorithm performance. The difficulty of these items for humans was supported by the accuracy of students, who represent a general population of untrained humans. Students performed poorly on these challenging image pairs. All four of the algorithms performed at or above median student performance. Two algorithms performed in the range of the facial specialists, and one algorithm matched the performance of forensic facial examiners.

In summary, this is the most comprehensive examination to date of face identification performance across groups of humans with variable levels of training, experience, talent, and motivation. We compared the accuracy of state-of-the-art face recognition algorithms with humans and show the benefits of a collaborative effort that combines the judgments of humans and machines. The work draws on previous cornerstone findings on human expertise and talent with faces, strategies for fusing human judgments, and computational advances in face recognition. The study provides an evidence-based roadmap for achieving highly accurate face identification. These methods should be extended in future work to test humans and machines on a wider range of face recognition tasks, including recognition across viewpoint and with low-quality images and video as well as recognition of faces from diverse demographic categories.
Materials and Methods

Test Protocol for Human Participants. To allow examiners access to their tools and methods while comparing face images, participants in all conditions, except the untrained student control group, downloaded the pairs of face images and were allowed 3 mo to complete the comparisons. For facial examiners and reviewers, comparisons were completed in their laboratory using their tools and methods. For superrecognizers and fingerprint examiners, the comparisons were done on a computer using tools available on the computer (e.g., image software tools). Students viewed the face pairs presented on a computer monitor one at a time. The size of the images was preset, and it was the same for all images. Pairs remained visible until a response was entered on the keyboard.

For each pair of face images, the participants in all subject groups were required to respond on a 7-point scale: +3, the observations strongly support that it is the same person; +2, the observations support that it is the same person; +1, the observations support to some extent that it is the same person; 0, the observations support neither that it is the same person nor that it is different persons; −1, the observations support to some extent that it is not the same person; −2, the observations support that it is not the same person; −3, the observations strongly support that it is not the same person. The wording was chosen to reflect scales used by forensic examiners in their daily work. A receiver operating characteristic curve and the AUC were computed from the ratings for each subject. The experimental design was approved by the National Institute of Standards and Technology (NIST) IRB. Data collection procedures for students were approved by the IRB at the University of Texas at Dallas, and all subjects provided consent.
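Computing an ROC curve from the 7-point scale amounts to sweeping a decision threshold across the rating values and tracing hit rate against false alarm rate. A minimal sketch follows, using a hypothetical participant's ratings; the paper's per-subject data are in Datasets S1 and S2.

```python
# Sketch: ROC curve and AUC from 7-point ratings for one participant.
import numpy as np

def auc_from_ratings(ratings, same_identity):
    """Sweep a threshold over the rating scale; ratings >= threshold count as
    a "same person" decision. Integrate hit rate over false alarm rate."""
    r = np.asarray(ratings, float)
    same = np.asarray(same_identity, bool)
    hits, fas = [1.0], [1.0]               # lowest threshold: everything "same"
    for t in range(-3, 4):                 # thresholds -3 .. +3
        hits.append(float(np.mean(r[same] >= t)))
        fas.append(float(np.mean(r[~same] >= t)))
    hits.append(0.0)
    fas.append(0.0)
    # Trapezoid rule; the ROC points run from (1, 1) down to (0, 0).
    return sum((fas[i] - fas[i + 1]) * (hits[i] + hits[i + 1]) / 2
               for i in range(len(fas) - 1))

# Hypothetical ratings for 12 same and 8 different identity pairs, as in the test.
ratings = [3, 2, 2, 1, 0, 3, 2, 1, 2, 3, -1, 1, -3, -2, -1, -2, 0, -3, -2, 1]
same = [True] * 12 + [False] * 8
print(auc_from_ratings(ratings, same))
```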
Test Protocol for Algorithms. Algorithms first encoded each face as a compact vector of feature values by processing the image with the trained DCNN. DCNNs consist of multiple layers of simulated neurons that convolve and pool the input (face images), feeding the data forward to one or more fully connected layers at the top of the network. The output is a compressed feature vector that represents a face (algorithm A2015 uses 4,096 features, A2016 uses 320 features, and A2017a and A2017b use 512 features). For each image pair in the test, a similarity score was computed between the representations of the two faces. The similarity score is the algorithm's estimate of whether the images show the same person. To avoid response bias, performance was measured by computing an AUC directly from the similarity score distributions for same and different identity pairs, eliminating the need for a threshold. SI Appendix, SI Text has details on the algorithms.
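In code, this protocol is a pairwise comparison of embedding vectors followed by a threshold-free AUC. The sketch below uses cosine similarity as an illustrative similarity function and random vectors as stand-ins for DCNN features; the algorithms' actual similarity measures are described in SI Appendix.

```python
# Sketch of the algorithm protocol: embedding pairs -> similarity scores -> AUC.
# Random 512-dimensional vectors stand in for DCNN feature vectors here.
import numpy as np
from sklearn.metrics import roc_auc_score

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (one common choice)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
pairs = [(rng.normal(size=512), rng.normal(size=512)) for _ in range(20)]
labels = np.array([1] * 12 + [0] * 8)      # 1 = same identity, 0 = different

scores = [cosine_similarity(u, v) for u, v in pairs]
# AUC computed directly from the score distributions; no decision threshold needed.
print(roc_auc_score(labels, scores))
```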
Stimuli. Image pairs were chosen carefully in three screening steps based on human and algorithm performance (details follow). The goal of the screening process was to select highly challenging image pairs that would test the upper limits of the participants' skills while avoiding floor effects for the students. The starting point for pair selection was a set of 9,307 images of 507 individuals taken with a Nikon D70 6-megapixel single-lens reflex camera. Images were acquired during a single academic year in indoor and outdoor settings at the University of Notre Dame. Faces were in approximately frontal pose (Fig. 1 shows example pairs).

We screened for identity matching difficulty with a fusion of three top-performing algorithms from an international competition of algorithms [Face Recognition Vendor Test 2006 (FRVT 2006)] (26). Based on the results of the fusion algorithm, the images were stratified into three difficulty levels (27). Image pairs were further pruned using human experimental data. We began with the accuracy of undergraduate students on the two most difficult levels for the algorithm (28, 29). We selected the highest performing 25% of participants and chose the 84 same identity and 84 different identity image pairs that elicited the highest proportion of errors in this group. These pairs formed a stimulus pool of image pairs that were challenging for humans and previous generation face recognition algorithms.

A second stimulus pool was created in a similar way but with the goal of finding image pairs on which previous generation algorithms failed systematically. We sampled the stimuli from those used in a recent study that compared human and computer algorithm performance on a special set of image pairs for which machine performance in the FRVT 2006 (26) was 100% incorrect (29). Specifically, similarity scores computed between same identity faces were uniformly lower than those computed for the different identity image pairs.

Finally, we implemented a third level of stimulus screening for both stimulus pools. We used performance on an identity matching task with very short (30 s) stimulus presentation times (3) and sorted these stimuli according to difficulty for the forensic examiners from that test.

Discussions with facial examiners before the study indicated that they were willing to compare 20 pairs of images over a 3-mo period. This allowed them to spend the time that they would normally spend on a forensic comparison. Using the screening described, we chose 12 image pairs from the first stimulus pool and 8 pairs from the second. There were same (n = 12) and different identity (n = 8) pairs. The slight imbalance eliminated the use of a process of elimination strategy (SI Appendix, SI Text).

Data Availability. Deidentified data for facial examiners and reviewers, superrecognizers, and fingerprint examiners can be obtained by signing a data transfer agreement with the NIST. The images are available by license from the University of Notre Dame. Data for the students and algorithms are in Datasets S1 and S2.

ACKNOWLEDGMENTS. Work was funded in part by the Federal Bureau of Investigation (FBI) to the NIST; the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract 2014-14071600012 (to R.C.); Australian Research Council Linkage Projects LP160101523 (to D.W.) and LP130100702 (to D.W.); and National Institute of Justice Grant 2015-IJ-CX-K014 (to A.J.O.). The views and conclusions contained herein should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, the IARPA, or the FBI. The US Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. The identification of any commercial product or trade name does not imply endorsement or recommendation by the NIST.

1. Noyes E, Phillips PJ, O'Toole AJ (2017) What is a super-recogniser? Face Processing: Systems, Disorders, and Cultural Differences, eds Bindemann M, Megreya AM (Nova, New York), pp 173–201.
2. White D, Burton AM, Kemp RI, Jenkins R (2013) Crowd effects in unfamiliar face matching. Appl Cognit Psychol 27:769–777.
3. White D, Phillips PJ, Hahn CA, Hill MQ, O'Toole AJ (2015) Perceptual expertise in forensic facial image comparison. Proc R Soc B 282:20151292.
4. Dowsett AJ, Burton AM (2015) Unfamiliar face matching: Pairs out-perform individuals and provide a route to training. Br J Psychol 106:433–445.
5. O'Toole A, Abdi H, Jiang F, Phillips PJ (2007) Fusing face recognition algorithms and humans. IEEE Trans Syst Man Cybern B 37:1149–1155.
6. Phillips PJ, O'Toole AJ (2014) Comparison of human and computer performance across face recognition experiments. Image Vis Comput 32:74–85.
7. Phillips PJ (2017) A cross benchmark assessment of deep convolutional neural networks for face recognition. Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition, pp 705–710. Available at https://ieeexplore.ieee.org/document/7961810/. Accessed May 14, 2018.
8. National Research Council (2009) Strengthening Forensic Science in the United States: A Path Forward (National Academies Press, Washington, DC).
9. White D, Norell K, Phillips PJ, O'Toole AJ (2017) Human factors in forensic face identification. Handbook of Biometrics for Forensic Science, eds Tistarelli M, Champod C (Springer, Cham, Switzerland), pp 195–218.
10. Facial Identification Scientific Working Group (2012) Guidelines for facial comparison methods, Version 1.0. Available at https://www.fiswg.org/FISWG_GuidelinesforFacialComparisonMethods_v1.0_2012_02_02.pdf. Accessed May 14, 2018.
11. Davis JP, Lander K, Evans R, Jansari A (2016) Investigating predictors of superior face recognition ability in police super-recognisers. Appl Cognit Psychol 30:827–840.
12. Robertson DJ, Noyes E, Dowsett A, Jenkins R, Burton AM (2016) Face recognition by Metropolitan Police super-recognisers. PLoS One 11:e0150036.
13. White D, Dunn JD, Schmid AC, Kemp RI (2015) Error rates in users of automatic face recognition software. PLoS One 10:e0139827.
14. Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition. Proceedings of the British Machine Vision Conference, eds Xie X, Jones MW, Tam GKL, pp 41.1–41.12. Available at www.bmva.org/bmvc/2015/index.html. Accessed May 14, 2018.
15. Chen JC, Patel VM, Chellappa R (2016) Unconstrained face verification using deep CNN features. Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1–9. Available at https://ieeexplore.ieee.org/document/7477557/. Accessed May 14, 2018.
16. Ranjan R, Sankaranarayanan S, Castillo CD, Chellappa R (2017) An all-in-one convolutional neural network for face analysis. Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition, pp 17–24. Available at https://ieeexplore.ieee.org/document/7961718/. Accessed May 14, 2018.
17. Ranjan R, Castillo CD, Chellappa R (2017) L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507.
18. Taigman Y, Yang M, Ranzato M, Wolf L (2014) DeepFace: Closing the gap to human-level performance in face verification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, Washington, DC), pp 1701–1708.
19. Prince J (2012) To examine emerging police use of facial recognition systems and facial image comparison procedures: Israel, Netherlands, UK, USA, Canada. The Winston Churchill Memorial Trust of Australia. Available at https://www.churchilltrust.com.au/media/fellows/2012_Prince_Jason.pdf. Accessed May 14, 2018.
20. Russell R, Duchaine B, Nakayama K (2009) Super-recognizers: People with extraordinary face recognition ability. Psychon Bull Rev 16:252–257.
21. Kittler J, Hatef M, Duin RPW, Matas J (1998) On combining classifiers. IEEE Trans Pattern Anal Mach Intell 20:226–239.
22. Norell K, et al. (2015) The effect of image quality and forensic expertise in facial image comparisons. J Forensic Sci 60:331–340.
23. Hu Y, et al. (2017) Person recognition: Qualitative differences in how forensic face examiners and untrained people rely on the face versus the body for identification. Vis Cognit 25:492–506.
24. Jeckeln G, Hahn CA, Noyes E, Cavazos JG, O'Toole AJ (March 5, 2018) Wisdom of the social versus non-social crowd in face identification. Br J Psychol, 10.1111/bjop.12291.
25. O'Toole AJ, et al. (2007) Face recognition algorithms surpass humans matching faces across changes in illumination. IEEE Trans Pattern Anal Mach Intell 29:1642–1646.
26. Phillips PJ, et al. (2010) FRVT 2006 and ICE 2006 large-scale results. IEEE Trans Pattern Anal Mach Intell 32:831–846.
27. Phillips PJ, et al. (2011) An introduction to the good, the bad, and the ugly face recognition challenge problem. Proceedings of the Ninth IEEE International Conference on Automatic Face & Gesture Recognition, pp 346–353. Available at https://ieeexplore.ieee.org/document/5771424/. Accessed May 14, 2018.
28. O'Toole AJ, An X, Dunlop J, Natu V, Phillips PJ (2012) Comparing face recognition algorithms to humans on challenging tasks. ACM Trans Appl Perception 9:1–13.
29. Rice A, Phillips PJ, Natu V, An X, O'Toole AJ (2013) Unaware person recognition from the body when face identification fails. Psychol Sci 24:2235–2243.