REPORT TO THE PRESIDENT
Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods
Executive Office of the President
President’s Council of Advisors on Science and Technology
September 2016

About the President’s Council of Advisors on Science and Technology

The President’s Council of Advisors on Science and Technology (PCAST) is an advisory group of the Nation’s leading scientists and engineers, appointed by the President to augment the science and technology advice available to him from inside the White House and from cabinet departments and other Federal agencies. PCAST is consulted about, and often makes policy recommendations concerning, the full range of issues where understandings from the domains of science, technology, and innovation bear potentially on the policy choices before the President. For more information about PCAST, see www.whitehouse.gov/ostp/pcast.

The President’s Council of Advisors on Science and Technology

Co-Chairs
John P. Holdren, Assistant to the President for Science and Technology; Director, Office of Science and Technology Policy
Eric S. Lander, President, Broad Institute of Harvard and MIT

Vice Chairs
William Press, Raymer Professor in Computer Science and Integrative Biology, University of Texas at Austin
Maxine Savitz, Honeywell (ret.)

Members
Wanda M. Austin, President and CEO, The Aerospace Corporation
Christopher Chyba, Professor, Astrophysical Sciences and International Affairs, Princeton University
Rosina Bierbaum, Professor, School of Natural Resources and Environment, University of Michigan; Roy F. Westin Chair in Natural Economics, School of Public Policy, University of Maryland
S. James Gates, Jr., John S.
Toll Professor of Physics; Director, Center for String and Particle Theory, University of Maryland, College Park
Christine Cassel, Planning Dean, Kaiser Permanente School of Medicine
Mark Gorenberg, Managing Member, Zetta Venture Partners
Susan L. Graham, Pehong Chen Distinguished Professor Emerita in Electrical Engineering and Computer Science, University of California, Berkeley
Ed Penhoet, Director, Alta Partners; Professor Emeritus, Biochemistry and Public Health, University of California, Berkeley
Michael McQuade, Senior Vice President for Science and Technology, United Technologies Corporation
Barbara Schaal, Dean of the Faculty of Arts and Sciences; Mary-Dell Chilton Distinguished Professor of Biology, Washington University of St. Louis
Chad Mirkin, George B. Rathmann Professor of Chemistry; Director, International Institute for Nanotechnology, Northwestern University
Eric Schmidt, Executive Chairman, Alphabet, Inc.
Mario Molina, Distinguished Professor, Chemistry and Biochemistry, University of California, San Diego; Professor, Center for Atmospheric Sciences, Scripps Institution of Oceanography
Daniel Schrag, Sturgis Hooper Professor of Geology; Professor, Environmental Science and Engineering; Director, Harvard University Center for Environment, Harvard University
Craig Mundie, President, Mundie Associates

Staff
Ashley Predith, Executive Director
Diana E. Pankevich, AAAS Science & Technology Policy Fellow
Jennifer L. Michael, Program Support Specialist

PCAST Working Group

Working Group members participated in the preparation of this report. The full membership of PCAST reviewed and approved it.

Working Group
Eric S. Lander (Working Group Chair), President, Broad Institute of Harvard and MIT
Michael McQuade, Senior Vice President for Science and Technology, United Technologies Corporation
S. James Gates, Jr., John S.
Toll Professor of Physics; Director, Center for String and Particle Theory, University of Maryland, College Park
William Press, Raymer Professor in Computer Science and Integrative Biology, University of Texas at Austin
Susan L. Graham, Pehong Chen Distinguished Professor Emerita in Electrical Engineering and Computer Science, University of California, Berkeley
Daniel Schrag, Sturgis Hooper Professor of Geology; Professor, Environmental Science and Engineering; Director, Harvard University Center for Environment, Harvard University

Staff
Diana E. Pankevich, AAAS Science & Technology Policy Fellow
Kristen Zarrelli, Advisor, Public Policy & Special Projects, Broad Institute of Harvard and MIT

Writer
Tania Simoncelli, Senior Advisor to the Director, Broad Institute of Harvard and MIT

Senior Advisors

PCAST consulted with a panel of legal experts to provide guidance on factual matters relating to the interaction between science and the law. PCAST also sought guidance and input from two statisticians, who have expertise in this domain. Senior advisors were given an opportunity to review early drafts to ensure factual accuracy. PCAST expresses its gratitude to those listed here. Their willingness to engage with PCAST on specific points does not imply endorsement of the views expressed in this report. Responsibility for the opinions, findings, and recommendations in this report and for any errors of fact or interpretation rests solely with PCAST.

Senior Advisor Co-Chairs
The Honorable Harry T. Edwards, Judge, United States Court of Appeals, District of Columbia Circuit
Jennifer L. Mnookin, Dean, David G. Price and Dallas P. Price Professor of Law, University of California Los Angeles Law

Senior Advisors
The Honorable James E. Boasberg, District Judge, United States District Court, District of Columbia
The Honorable Pamela Harris, Judge, United States Court of Appeals, Fourth Circuit
The Honorable Andre M.
Davis, Senior Judge, United States Court of Appeals, Fourth Circuit
Karen Kafadar, Commonwealth Professor and Chair, Department of Statistics, University of Virginia
David L. Faigman, Acting Chancellor & Dean, University of California Hastings College of the Law
The Honorable Alex Kozinski, Judge, United States Court of Appeals, Ninth Circuit
Stephen Fienberg, Maurice Falk University Professor of Statistics and Social Science (Emeritus), Carnegie Mellon University
The Honorable Cornelia T.L. Pillard, Judge, United States Court of Appeals, District of Columbia Circuit
The Honorable Charles Fried, Beneficial Professor of Law, Harvard Law School, Harvard University
The Honorable Jed S. Rakoff, District Judge, United States District Court, Southern District of New York
The Honorable Nancy Gertner, Senior Lecturer on Law, Harvard Law School, Harvard University
The Honorable Patti B. Saris, Chief Judge, United States District Court, District of Massachusetts

EXECUTIVE OFFICE OF THE PRESIDENT
PRESIDENT’S COUNCIL OF ADVISORS ON SCIENCE AND TECHNOLOGY
WASHINGTON, D.C. 20502

President Barack Obama
The White House
Washington, DC 20502

Dear Mr. President:

We are pleased to send you this PCAST report on Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods. The study that led to the report was a response to your question to PCAST, in 2015, as to whether there are additional steps on the scientific side, beyond those already taken by the Administration in the aftermath of the highly critical 2009 National Research Council report on the state of the forensic sciences, that could help ensure the validity of forensic evidence used in the Nation’s legal system.

PCAST concluded that there are two important gaps: (1) the need for clarity about the scientific standards for the validity and reliability of forensic methods and (2) the need to evaluate specific forensic methods to determine whether they have been scientifically established to be valid and reliable.
Our study aimed to help close these gaps for a number of forensic “feature-comparison” methods—specifically, methods for comparing DNA samples, bitemarks, latent fingerprints, firearm marks, footwear, and hair.

Our study, which included an extensive literature review, was also informed by inputs from forensic researchers at the Federal Bureau of Investigation Laboratory and the National Institute of Standards and Technology as well as from many other forensic scientists and practitioners, judges, prosecutors, defense attorneys, academic researchers, criminal-justice-reform advocates, and representatives of Federal agencies. The findings and recommendations conveyed in this report, of course, are PCAST’s alone.

Our report reviews previous studies relating to forensic practice and Federal actions currently underway to strengthen forensic science; discusses the role of scientific validity within the legal system; explains the criteria by which the scientific validity of feature-comparison forensic methods can be judged; and applies those criteria to the selected feature-comparison methods.

Based on our findings concerning the “foundational validity” of the indicated methods as well as their “validity as applied” in practice in the courts, we offer recommendations on actions that could be taken by the National Institute of Standards and Technology, the Office of Science and Technology Policy, and the Federal Bureau of Investigation Laboratory to strengthen the scientific underpinnings of the forensic disciplines, as well as on actions that could be taken by the Attorney General and the judiciary to promote the more rigorous use of these disciplines in the courtroom.

Sincerely,

John P. Holdren, Co-Chair
Eric S.
Lander, Co-Chair

Table of Contents

The President’s Council of Advisors on Science and Technology
PCAST Working Group
Senior Advisors
Table of Contents
Executive Summary
1. Introduction
2. Previous Work on Validity of Forensic-Science Methods
  2.1 DNA Evidence and Wrongful Convictions
  2.2 Studies of Specific Forensic-Science Methods and Laboratory Practices
  2.3 Testimony Concerning Forensic Evidence
  2.4 Cognitive Bias
  2.5 State of Forensic Science
  2.6 State of Forensic Practice
  2.7 National Research Council Report
  2.8 Recent Progress
3. The Role of Scientific Validity in the Courts
  3.1 Evolution of Admissibility Standards
  3.2 Foundational Validity and Validity as Applied
4. Scientific Criteria for Validity and Reliability of Forensic Feature-Comparison Methods
  4.1 Feature-Comparison Methods: Objective and Subjective Methods
  4.2 Foundational Validity: Requirement for Empirical Studies
  4.3 Foundational Validity: Requirement for Scientifically Valid Testimony
  4.4 Neither Experience nor Professional Practices Can Substitute for Foundational Validity
  4.5 Validity as Applied: Key Elements
  4.6 Validity as Applied: Proficiency Testing
  4.7 Non-Empirical Views in the Forensic Community
  4.8 Empirical Views in the Forensic Community
  4.9 Summary of Scientific Findings
5. Evaluation of Scientific Validity for Seven Feature-Comparison Methods
  5.1 DNA Analysis of Single-source and Simple-mixture samples
  5.2 DNA Analysis of Complex-mixture Samples
  5.3 Bitemark Analysis
  5.4 Latent Fingerprint Analysis
  5.5 Firearms Analysis
  5.6 Footwear Analysis: Identifying Characteristics
  5.7 Hair Analysis
  5.8 Application to Additional Methods
  5.9 Conclusion
6. Recommendations to NIST and OSTP
  6.1 Role for NIST in Ongoing Evaluation of Foundational Validity
  6.2 Accelerating the Development of Objective Methods
  6.3 Improving the Organization for Scientific Area Committees
  6.4 Need for an R&D Strategy for Forensic Science
  6.5 Recommendations
7. Recommendations to the FBI Laboratory
  7.1 Role for FBI Laboratory
  7.2 Recommendations
8. Recommendations to the Attorney General
  8.1 Ensuring the Use of Scientifically Valid Methods in Prosecutions
  8.2 Revision of DOJ Recently Proposed Guidelines on Expert Testimony
  8.3 Recommendations
9. Recommendations to the Judiciary
  9.1 Scientific Validity as a Foundation for Expert Testimony
  9.2 Role of Past Precedent
  9.3 Resources for Judges
  9.4 Recommendations
10. Scientific Findings
Appendix A: Statistical Issues
  Sensitivity and False Positive Rate
  Confidence Intervals
  Calculating Results for Conclusive Tests
  Bayesian Analysis
Appendix B. Additional Experts Providing Input

Executive Summary

“Forensic science” has been defined as the application of scientific or technical practices to the recognition, collection, analysis, and interpretation of evidence for criminal and civil law or regulatory issues. Developments over the past two decades—including the exoneration of defendants who had been wrongfully convicted based in part on forensic-science evidence, a variety of studies of the scientific underpinnings of the forensic disciplines, reviews of expert testimony based on forensic findings, and scandals in state crime laboratories— have called increasing attention to the question of the validity and reliability of some important forms of forensic evidence and of testimony based upon them.[1]

A multi-year, Congressionally-mandated study of this issue released in 2009 by the National Research Council[2] (Strengthening Forensic Science in the United States: A Path Forward) was particularly critical of weaknesses in the scientific underpinnings of a number of the forensic disciplines routinely used in the criminal justice system. That report led to extensive discussion, inside and outside the Federal government, of a path forward, and ultimately to the establishment of two groups: the National Commission on Forensic Science hosted by the Department of Justice and the Organization for Scientific Area Committees for Forensic Science at the National Institute of Standards and Technology.

When President Obama asked the President’s Council of Advisors on Science and Technology (PCAST) in 2015 to consider whether there are additional steps that could usefully be taken on the scientific side to strengthen the forensic-science disciplines and ensure the validity of forensic evidence used in the Nation’s legal system, PCAST concluded that there are two important gaps: (1) the need for clarity about the scientific standards for the validity and reliability of forensic methods and (2) the need to evaluate specific forensic methods to determine whether they have been scientifically established to be valid and reliable.

This report aims to help close these gaps for the case of forensic “feature-comparison” methods—that is, methods that attempt to determine whether an evidentiary sample (e.g., from a crime scene) is or is not associated with a potential “source” sample (e.g., from a suspect), based on the presence of similar patterns, impressions, or other features in the sample and the source. Examples of such methods include the analysis of DNA, hair, latent fingerprints, firearms and spent ammunition, toolmarks and bitemarks, shoeprints and tire tracks, and handwriting.

[1] Citations to literature in support of points made in the Executive Summary are found in the main body of the report.
[2] The National Research Council is the study-conducting arm of the National Academies of Science, Engineering, and Medicine.

In the course of its study, PCAST compiled and reviewed a set of more than 2,000 papers from various sources— including bibliographies prepared by the Subcommittee on Forensic Science of the National Science and Technology Council and the relevant Working Groups organized by the National Institute of Standards and Technology (NIST); submissions in response to PCAST’s request for information from the forensic-science stakeholder community; and PCAST’s own literature searches.
To educate itself on factual matters relating to the interaction between science and the law, PCAST consulted with a panel of Senior Advisors comprising nine current or former Federal judges, a former U.S. Solicitor General, a former state Supreme Court justice, two law-school deans, and two distinguished statisticians who have expertise in this domain. Additional input was obtained from the Federal Bureau of Investigation (FBI) Laboratory and individual scientists at NIST, as well as from many other forensic scientists and practitioners, judges, prosecutors, defense attorneys, academic researchers, criminal-justice-reform advocates, and representatives of Federal agencies. The willingness of these groups and individuals to engage with PCAST does not imply endorsement of the views expressed in the report. The findings and recommendations conveyed in this report are the responsibility of PCAST alone. The resulting report—summarized here without the extensive technical elaborations and dense citations in the main text that follows—begins with a review of previous studies relating to forensic practice and Federal actions currently underway to strengthen forensic science; discusses the role of scientific validity within the legal system; explains the criteria by which the scientific validity of forensic feature-comparison methods can be judged; applies those criteria to six such methods in detail and reviews an evaluation by others of a seventh method; and offers recommendations on Federal actions that could be taken to strengthen forensic science and promote its more rigorous use in the courtroom. We believe the findings and recommendations will be of use both to the judiciary and to those working to strengthen forensic science. 

Previous Work on Scientific Validity of Forensic-Science Disciplines

Ironically, it was the emergence and maturation of a new forensic science, DNA analysis, in the 1990s that first led to serious questioning of the validity of many of the traditional forensic disciplines. When DNA evidence was first introduced in the courts, beginning in the late 1980s, it was initially hailed as infallible; but the methods used in early cases turned out to be unreliable: testing labs lacked validated and consistently-applied procedures for defining DNA patterns from samples, for declaring whether two patterns matched within a given tolerance, and for determining the probability of such matches arising by chance in the population. When, as a result, DNA evidence was declared inadmissible in a 1989 case in New York, scientists engaged in DNA analysis in both forensic and non-forensic applications came together to promote the development of reliable principles and methods that have enabled DNA analysis of single-source samples to become the “gold standard” of forensic science for both investigation and prosecution.

Once DNA analysis became a reliable methodology, the power of the technology—including its ability to analyze small samples and to distinguish between individuals—made it possible not only to identify and convict true perpetrators but also to clear wrongly accused suspects before prosecution and to re-examine a number of past convictions. Reviews by the National Institute of Justice and others have found that DNA testing during the course of investigations has cleared tens of thousands of suspects and that DNA-based re-examination of past cases has led so far to the exonerations of 342 defendants.
Independent reviews of these cases have revealed that many relied in part on faulty expert testimony from forensic scientists who had told juries incorrectly that similar features in a pair of samples taken from a suspect and from a crime scene (hair, bullets, bitemarks, tire or shoe treads, or other items) implicated defendants in a crime with a high degree of certainty.

The questions that DNA analysis had raised about the scientific validity of traditional forensic disciplines and testimony based on them led, naturally, to increased efforts to test empirically the reliability of the methods that those disciplines employed. Relevant studies that followed included:

• a 2002 FBI re-examination of microscopic hair comparisons the agency’s scientists had performed in criminal cases, in which DNA testing revealed that 11 percent of hair samples found to match microscopically actually came from different individuals;
• a 2004 National Research Council report, commissioned by the FBI, on bullet-lead evidence, which found that there was insufficient research and data to support drawing a definitive connection between two bullets based on compositional similarity of the lead they contain;
• a 2005 report of an international committee established by the FBI to review the use of latent fingerprint evidence in the case of a terrorist bombing in Spain, in which the committee found that “confirmation bias”—the inclination to confirm a suspicion based on other grounds—contributed to a misidentification and improper detention; and
• studies reported in 2009 and 2010 on bitemark evidence, which found that current procedures for comparing bitemarks are unable to reliably exclude or include a suspect as a potential biter.

Beyond these kinds of shortfalls with respect to “reliable methods” in forensic feature-comparison disciplines, reviews have found that expert witnesses have often overstated the probative value of their evidence, going far beyond what the relevant science can justify. Examiners have sometimes testified, for example, that their conclusions are “100 percent certain” or have a “zero,” “essentially zero,” or “negligible” error rate. As many reviews—including the highly regarded 2009 National Research Council study—have noted, however, such statements are not scientifically defensible: all laboratory tests and feature-comparison analyses have non-zero error rates.

Starting in 2012, the Department of Justice (DOJ) and FBI undertook an unprecedented review of testimony in more than 3,000 criminal cases involving microscopic hair analysis. Their initial results, released in 2015, showed that FBI examiners had provided scientifically invalid testimony in more than 95 percent of cases where that testimony was used to inculpate a defendant at trial. In March 2016, the Department of Justice announced its intention to expand to additional forensic-science methods its review of forensic testimony by the FBI Laboratory in closed criminal cases. This review will help assess the extent to which similar testimonial overstatement has occurred in other forensic disciplines.

The 2009 National Research Council report was the most comprehensive review to date of the forensic sciences in this country.
The report made clear that some types of problems, irregularities, and miscarriages of justice cannot simply be attributed to a handful of rogue analysts or underperforming laboratories, but are systemic and pervasive—the result of factors including a high degree of fragmentation (including disparate and often inadequate training and educational requirements, resources, and capacities of laboratories), a lack of standardization of the disciplines, insufficient high-quality research and education, and a dearth of peer-reviewed studies establishing the scientific basis and validity of many routinely used forensic methods.

The 2009 report found that shortcomings in the forensic sciences were especially prevalent among the feature-comparison disciplines, many of which, the report said, lacked well-defined systems for determining error rates and had not done studies to establish the uniqueness or relative rarity or commonality of the particular marks or features examined. In addition, proficiency testing, where it had been conducted, showed instances of poor performance by specific examiners. In short, the report concluded that “much forensic evidence—including, for example, bitemarks and firearm and toolmark identifications—is introduced in criminal trials without any meaningful scientific validation, determination of error rates, or reliability testing to explain the limits of the discipline.”

The Legal Context

Historically, forensic science has been used primarily in two phases of the criminal-justice process: (1) investigation, which seeks to identify the likely perpetrator of a crime, and (2) prosecution, which seeks to prove the guilt of a defendant beyond a reasonable doubt. In recent years, forensic science—particularly DNA analysis—has also come into wide use for challenging past convictions.

Importantly, the investigative and prosecutorial phases involve different standards for the use of forensic science and other investigative tools.
In investigations, insights and information may come from both well-established science and exploratory approaches. In the prosecution phase, forensic science must satisfy a higher standard. Specifically, the Federal Rules of Evidence (Rule 702(c,d)) require that expert testimony be based, among other things, on “reliable principles and methods” that have been “reliably applied” to the facts of the case. And, the Supreme Court has stated that judges must determine “whether the reasoning or methodology underlying the testimony is scientifically valid.”

This is where legal standards and scientific standards intersect. Judges’ decisions about the admissibility of scientific evidence rest solely on legal standards; they are exclusively the province of the courts and PCAST does not opine on them. But, these decisions require making determinations about scientific validity. It is the proper province of the scientific community to provide guidance concerning scientific standards for scientific validity, and it is on those scientific standards that PCAST focuses here.

We distinguish here between two types of scientific validity: foundational validity and validity as applied.

(1) Foundational validity for a forensic-science method requires that it be shown, based on empirical studies, to be repeatable, reproducible, and accurate, at levels that have been measured and are appropriate to the intended application. Foundational validity, then, means that a method can, in principle, be reliable. It is the scientific concept we mean to correspond to the legal requirement, in Rule 702(c), of “reliable principles and methods.”

(2) Validity as applied means that the method has been reliably applied in practice.
It is the scientific concept we mean to correspond to the legal requirement, in Rule 702(d), that an expert “has reliably applied the principles and methods to the facts of the case.” Scientific Criteria for Validity and Reliability of Forensic Feature-Comparison Methods Chapter 4 of the main report provides a detailed description of the scientific criteria for establishing the foundationally validity and reliability of forensic feature-comparison methods, including both objective and subjective methods. 3 Subjective methods require particularly careful scrutiny because their heavy reliance on human judgment means they are especially vulnerable to human error, inconsistency across examiners, and cognitive bias. In the forensic feature-comparison disciplines, cognitive bias includes the phenomena that, in certain settings, humans may tend naturally to focus on similarities between samples and discount differences and may also be influenced by extraneous information and external pressures about a case. The essential points of foundational validity include the following: (1) Foundational validity requires that a method has been subjected to empirical testing by multiple groups, under conditions appropriate to its intended use. The studies must (a) demonstrate that the method is repeatable and reproducible and (b) provide valid estimates of the method’s accuracy (that is, how often the method reaches an incorrect conclusion) that indicate the method is appropriate to the intended application. (2) For objective methods, the foundational validity of the method can be established by studying measuring the accuracy, reproducibility, and consistency of each of its individual steps. (3) For subjective feature-comparison methods, because the individual steps are not objectively specified, the method must be evaluated as if it were a “black box” in the examiner’s head. 
Evaluations of validity and reliability must therefore be based on “black-box studies,” in which many examiners render decisions about many independent tests (typically, involving “questioned” samples and one or more “known” samples) and the error rates are determined.

(4) Without appropriate estimates of accuracy, an examiner’s statement that two samples are similar, or even indistinguishable, is scientifically meaningless: it has no probative value, and considerable potential for prejudicial impact.

[3] Feature-comparison methods may be classified as either objective or subjective. By objective feature-comparison methods, we mean methods consisting of procedures that are each defined with enough standardized and quantifiable detail that they can be performed by either an automated system or human examiners exercising little or no judgment. By subjective methods, we mean methods including key procedures that involve significant human judgment, for example, about which features to select within a pattern or how to determine whether the features are sufficiently similar to be called a probable match.

Once a method has been established as foundationally valid based on appropriate empirical studies, claims about the method’s accuracy and the probative value of proposed identifications, in order to be valid, must be based on such empirical studies. Statements claiming or implying greater certainty than demonstrated by empirical evidence are scientifically invalid. Forensic examiners should therefore report findings of a proposed identification with clarity and restraint, explaining in each case that the fact that two samples satisfy a method’s criteria for a proposed match does not mean that the samples are from the same source. For example, if the false positive rate of a method has been found to be 1 in 50, experts should not imply that the method is able to produce results at a higher accuracy.
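As an illustration of how the error rates from a black-box study might be tabulated, the following is a minimal sketch. The class name, field names, and counts are hypothetical and are not drawn from any study discussed in this report; the counts are chosen only so that the false-positive rate works out to the 1-in-50 figure used in the example above. The sketch also assumes that only conclusive decisions are tallied, which is one of several possible conventions.

```python
from dataclasses import dataclass


@dataclass
class BlackBoxTally:
    """Conclusive examiner decisions in a hypothetical black-box study."""
    same_source_identified: int   # true positives
    same_source_missed: int       # false negatives
    diff_source_identified: int   # false positives
    diff_source_excluded: int     # true negatives

    def false_positive_rate(self) -> float:
        # Fraction of different-source comparisons wrongly called a match.
        total = self.diff_source_identified + self.diff_source_excluded
        return self.diff_source_identified / total

    def sensitivity(self) -> float:
        # Fraction of same-source comparisons correctly called a match.
        total = self.same_source_identified + self.same_source_missed
        return self.same_source_identified / total


# Hypothetical counts, for illustration only.
tally = BlackBoxTally(
    same_source_identified=950,
    same_source_missed=50,
    diff_source_identified=20,
    diff_source_excluded=980,
)

fpr = tally.false_positive_rate()
print(f"false-positive rate: {fpr:.3f} (about 1 in {round(1 / fpr)})")
print(f"sensitivity: {tally.sensitivity():.3f}")
```

Both quantities come directly from counts of examiner decisions on samples whose true source is known, which is why they cannot be replaced by experience or professional consensus.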
To meet the scientific criteria for validity as applied, two tests must be met: (1) The forensic examiner must have been shown to be capable of reliably applying the method and must actually have done so. Demonstrating that an expert is capable of reliably applying the method is crucial—especially for subjective methods, in which human judgment plays a central role. From a scientific standpoint, the ability to apply a method reliably can be demonstrated only through empirical testing that measures how often the expert reaches the correct answer. Determining whether an examiner has actually reliably applied the method requires that the procedures actually used in the case, the results obtained, and the laboratory notes be made available for scientific review by others. (2) The practitioner’s assertions about the probative value of proposed identifications must be scientifically valid. The expert should report the overall false-positive rate and sensitivity for the method established in the studies of foundational validity and should demonstrate that the samples used in the foundational studies are relevant to the facts of the case. Where applicable, the expert should report the probative value of the observed match based on the specific features observed in the case. And the expert should not make claims or implications that go beyond the empirical evidence and the applications of valid statistical principles to that evidence. We note, finally, that neither experience, nor judgment, nor good professional practices (such as certification programs and accreditation programs, standardized protocols, proficiency testing, and codes of ethics) can substitute for actual evidence of foundational validity and reliability. 
The frequency with which a particular pattern or set of features will be observed in different samples, which is an essential element in drawing conclusions, is not a matter of “judgment.” It is an empirical matter for which only empirical evidence is relevant. Similarly, an expert’s expression of confidence based on personal professional experience or expressions of consensus among practitioners about the accuracy of their field is no substitute for error rates estimated from relevant studies. For forensic feature-comparison methods, establishing foundational validity based on empirical evidence is thus a sine qua non. Nothing can substitute for it.

Evaluation of Scientific Validity for Seven Feature-Comparison Methods

For this study, PCAST applied the criteria discussed above to six forensic feature-comparison methods: (1) DNA analysis of single-source and simple-mixture samples, (2) DNA analysis of complex-mixture samples, (3) bitemarks, (4) latent fingerprints, (5) firearms identification, and (6) footwear analysis. For each method, Chapter 5 of the main report provides a brief overview of the methodology, discusses background information and studies, provides an evaluation of scientific validity, and offers suggestions on a path forward. For a seventh feature-comparison method, hair analysis, we do not undertake a full evaluation of scientific validity, but review supporting material recently released for comment by the Department of Justice. This Executive Summary provides only a brief summary of some key findings concerning these seven methods.

DNA Analysis of Single-Source and Simple-Mixture Samples

The vast majority of DNA analysis currently involves samples from a single individual or from a simple mixture of two individuals (such as from a rape kit). DNA analysis in such cases is an objective method in which the laboratory protocols are precisely defined and the interpretation involves little or no human judgment.
To evaluate the foundational validity of an objective method, one can examine the reliability of each of the individual steps rather than having to rely on black-box studies. In the case of DNA analysis of single-source and simple-mixture samples, each of the steps has been found to be “repeatable, reproducible, and accurate” with levels that have been measured and are “appropriate to the intended application” (to quote the requirement for foundational validity as stated above), and the probability of a match arising by chance in the population can be estimated directly from appropriate genetic databases and is extremely low.

Concerning validity as applied, DNA analysis, like all forensic analyses, is not infallible in practice. Errors can and do occur. Although the probability that two samples from different sources have the same DNA profile is tiny, the chance of human error is much higher. Such errors may stem from sample mix-ups, contamination, incorrect interpretation, and errors in reporting.

To minimize human error, the FBI requires, as a condition of participating in the National DNA Index System, that laboratories follow the FBI’s Quality Assurance Standards. These require that the examiner run a series of controls to check for possible contamination and ensure that the PCR process ran properly. The Standards also require semi-annual proficiency testing of all analysts who perform DNA testing for criminal cases. We find, though, that there is a need to improve proficiency testing.

DNA Analysis of Complex-Mixture Samples

Some investigations involve DNA analysis of complex mixtures of biological samples from multiple unknown individuals in unknown proportions. (Such samples arise, for example, from mixed blood stains, and increasingly from multiple individuals touching a surface.)
The fundamental difference between DNA analysis of complex-mixture samples and DNA analysis of single-source and simple mixtures lies not in the laboratory processing, but in the interpretation of the resulting DNA profile.

DNA analysis of complex mixtures is inherently difficult. Such samples result in a DNA profile that superimposes multiple individual DNA profiles. Interpreting a mixed profile is different from and more challenging than interpreting a simple profile, for many reasons. It is often impossible to tell with certainty which genetic variants are present in the mixture or how many separate individuals contributed to the mixture, let alone accurately to infer the DNA profile of each one.

The questions an examiner must ask, then, are, “Could a suspect’s DNA profile be present within the mixture profile? And, what is the probability that such an observation might occur by chance?” Because many different DNA profiles may fit within some mixture profiles, the probability that a suspect “cannot be excluded” as a possible contributor to a complex mixture may be much higher (in some cases, millions of times higher) than the probabilities encountered for single-source DNA profiles.

Initial approaches to the interpretation of complex mixtures relied on subjective judgment by examiners and simplified calculations. This approach is problematic because subjective choices made by examiners can dramatically affect the answer and the estimated probative value, introducing significant risk of both analytical error and confirmation bias. PCAST finds that subjective analysis of complex DNA mixtures has not been established to be foundationally valid and is not a reliable methodology.

Given the problems with subjective interpretation of complex DNA mixtures, a number of groups launched efforts to develop computer programs that apply various algorithms to interpret complex mixtures in an objective manner.
The programs clearly represent a major improvement over purely subjective interpretation. They still require scientific scrutiny, however, to determine (1) whether the methods are scientifically valid, including defining the limitations on their reliability (that is, the circumstances in which they may yield unreliable results) and (2) whether the software correctly implements the methods. PCAST finds that, at present, studies have established the foundational validity of some objective methods under limited circumstances (specifically, a three-person mixture in which the minor contributor constitutes at least 20 percent of the intact DNA in the mixture) but that substantially more evidence is needed to establish foundational validity across broader settings.

Bitemark Analysis

Bitemark analysis typically involves examining marks left on a victim or an object at the crime scene and comparing those marks with dental impressions taken from a suspect. Bitemark comparison is based on the premises that (1) dental characteristics, particularly the arrangement of the front teeth, differ substantially among people and (2) skin (or some other marked surface at a crime scene) can reliably capture these distinctive features.

Bitemark analysis begins with an examiner deciding whether an injury is a mark caused by human teeth. If so, the examiner creates photographs or impressions of the questioned bitemark and of the suspect’s dentition; compares the bitemark and the dentition; and determines if the dentition (1) cannot be excluded as having made the bitemark, (2) can be excluded as having made the bitemark, or (3) is inconclusive.

Bitemark analysis is a subjective method. Current protocols do not provide well-defined standards concerning the identification of features or the degree of similarity that must be identified to support a reliable conclusion that the mark could have or could not have been created by the dentition in question.
Conclusions about all these matters are left to the examiner’s judgment.

As noted above, the foundational validity of a subjective method can only be established through multiple, appropriately designed black-box studies. Few studies (and no appropriate black-box studies) have been undertaken to study the ability of examiners to accurately identify the source of a bitemark. In these studies, the observed false-positive rates were very high: typically above ten percent and sometimes far above. Moreover, several of these studies employed inappropriate closed-set designs that are likely to underestimate the true false positive rate. Indeed, available scientific evidence strongly suggests that examiners not only cannot identify the source of a bitemark with reasonable accuracy, they cannot even consistently agree on whether an injury is a human bitemark. For these reasons, PCAST finds that bitemark analysis is far from meeting the scientific standards for foundational validity.

We note that some practitioners have expressed concern that the exclusion of bitemarks in court could hamper efforts to convict defendants in some cases. If so, the correct solution, from a scientific perspective, would not be to admit expert testimony based on invalid and unreliable methods but rather to attempt to develop scientifically valid methods. But PCAST considers the prospects of developing bitemark analysis into a scientifically valid method to be low. We advise against devoting significant resources to such efforts.

Latent Fingerprint Analysis

Latent fingerprint analysis typically involves comparing (1) a “latent print” (a complete or partial friction-ridge impression from an unknown subject) that has been developed or observed on an item with (2) one or more “known prints” (fingerprints deliberately collected under a controlled setting from known subjects; also referred to as “ten prints”), to assess whether the two may have originated from the same source.
It may also involve comparing latent prints with one another. An examiner might be called upon to (1) compare a latent print to the fingerprints of a known suspect who has been identified by other means (“identified suspect”) or (2) search a large database of fingerprints to identify a suspect (“database search”).

Latent fingerprint analysis was first proposed for use in criminal identification in the 1800s and has been used for more than a century. The method was long hailed as infallible, despite the lack of appropriate empirical studies to assess its error rate. In response to criticism on this point in the 2009 National Research Council report, those working in the field of latent fingerprint analysis recognized the need to perform empirical studies to assess foundational validity and measure reliability and have made progress in doing so. Much credit goes to the FBI Laboratory, which has led the way in performing black-box studies to assess validity and estimate reliability, as well as so-called “white-box” studies to understand the factors that affect examiners’ decisions. PCAST applauds the FBI Laboratory’s efforts. There are also nascent efforts to begin to move the field from a purely subjective method toward an objective method, although there is still a considerable way to go to achieve this important goal.

PCAST finds that latent fingerprint analysis is a foundationally valid subjective methodology, albeit with a false positive rate that is substantial and is likely to be higher than expected by many jurors based on longstanding claims about the infallibility of fingerprint analysis. The false-positive rate could be as high as 1 error in 306 cases based on the FBI study and 1 error in 18 cases based on a study by another crime laboratory.
[4] In reporting results of latent-fingerprint examination, it is important to state the false-positive rates based on properly designed validation studies.

With respect to validity as applied, there are, however, a number of open issues, notably:

(1) Confirmation bias. Work by FBI scientists has shown that examiners often alter the features that they initially mark in a latent print based on comparison with an apparently matching exemplar. Such circular reasoning introduces a serious risk of confirmation bias. Examiners should be required to complete and document their analysis of a latent fingerprint before looking at any known fingerprint and should separately document any additional data used during their comparison and evaluation.

(2) Contextual bias. Work by academic scholars has shown that examiners’ judgments can be influenced by irrelevant information about the facts of a case. Efforts should be made to ensure that examiners are not exposed to potentially biasing information.

(3) Proficiency testing. Proficiency testing is essential for assessing an examiner’s capability and performance in making accurate judgments. As discussed elsewhere in this report, proficiency testing needs to be improved by making it more rigorous, by incorporating it systematically within the flow of casework, and by disclosing tests for evaluation by the scientific community.
Scientific validity as applied, then, requires that an expert: (1) has undergone relevant proficiency testing to test his or her accuracy and reports the results of the proficiency testing; (2) discloses whether he or she documented the features in the latent print in writing before comparing it to the known print; (3) provides a written analysis explaining the selection and comparison of the features; (4) discloses whether, when performing the examination, he or she was aware of any other facts of the case that might influence the conclusion; and (5) verifies that the latent print in the case at hand is similar in quality to the range of latent prints considered in the foundational studies.

[4] The main report discusses the appropriate calculations of error rates, including best estimates (which are 1 in 604 and 1 in 24, respectively, for the two studies cited) and confidence bounds (stated above). It also discusses issues with specific studies, including problems with studies that may contribute to differences in rates (as in the two studies cited).

Concerning the path forward, continuing efforts are needed to improve the state of latent-print analysis, and these efforts will pay clear dividends for the criminal justice system. One direction is to continue to improve latent print analysis as a subjective method. There is a need for additional empirical studies to estimate error rates for latent prints of varying quality and completeness, using well-defined measures.

A second, and more important, direction is to convert latent-print analysis from a subjective method to an objective method. The past decade has seen extraordinary advances in automated image analysis based on machine learning and other approaches, leading to dramatic improvements in such tasks as face recognition and the interpretation of medical images. This progress holds promise of making fully automated latent fingerprint analysis possible in the near future.
There have already been initial steps in this direction, both in academia and industry.

The most important resource to propel the development of objective methods would be the creation of huge databases containing known prints, each with many corresponding “simulated” latent prints of varying qualities and completeness, which would be made available to scientifically-trained researchers in academia and industry. The simulated latent prints could be created by “morphing” the known prints, based on transformations derived from collections of actual latent print-record print pairs.

Firearms Analysis

In firearms analysis, examiners attempt to determine whether ammunition is or is not associated with a specific firearm based on “toolmarks” produced by guns on the ammunition. The discipline is based on the idea that the toolmarks produced by different firearms vary substantially enough (owing to variations in manufacture and use) to allow components of fired cartridges to be identified with particular firearms. For example, examiners may compare “questioned” cartridge cases from a gun recovered from a crime scene to test fires from a suspect gun.

Examination begins with an evaluation of class characteristics of the bullets and casings, which are features that are permanent and predetermined before manufacture. If these class characteristics are different, an elimination conclusion is rendered. If the class characteristics are similar, the examination proceeds to identify and compare individual characteristics, such as the markings that arise during firing from a particular gun.

Firearms analysts have long stated that their discipline has near-perfect accuracy; however, the 2009 National Research Council study of all the forensic disciplines concluded about firearms analysis that “sufficient studies have not been done to understand the reliability and reproducibility of the methods”; that is, the foundational validity of the field had not been established.
Our own extensive review of the relevant literature prior to 2009 is consistent with the National Research Council’s conclusion. We find that many of these earlier studies were inappropriately designed to assess foundational validity and estimate reliability. Indeed, there is internal evidence among the studies themselves indicating that many previous studies underestimated the false positive rate by at least 100-fold.

We identified one notable advance since 2009: the completion of the first appropriately designed black-box study of firearms. The work was commissioned and funded by the Defense Department’s Forensic Science Center and was conducted by an independent testing lab (the Ames Laboratory, a Department of Energy national laboratory affiliated with Iowa State University). The false-positive rate was estimated at 1 in 66, with a confidence bound indicating that the rate could be as high as 1 in 46. While the study is available as a report to the Federal government, it has not been published in a scientific journal.

The scientific criteria for foundational validity require that there be more than one such study, to demonstrate reproducibility, and that studies should ideally be published in the peer-reviewed scientific literature. Accordingly, the current evidence still falls short of the scientific criteria for foundational validity.

Whether firearms analysis should be deemed admissible based on current evidence is a decision that belongs to the courts. If firearms analysis is allowed in court, the scientific criteria for validity as applied should be understood to require clearly reporting the error rates seen in the one appropriately designed black-box study. Claims of higher accuracy are not scientifically justified at present.
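To make concrete how a best estimate such as 1 in 66 relates to an upper confidence bound such as 1 in 46, the following sketch computes a one-sided upper bound on an error rate from study counts. It assumes an exact binomial (Clopper-Pearson) bound at 95 percent confidence, which may differ in detail from the calculation in the main report. The counts are hypothetical, chosen only so that the point estimate equals 1 in 66; the function names are illustrative.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def upper_bound(errors: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided exact (Clopper-Pearson) upper bound on the true error rate.

    Finds, by bisection, the rate p at which observing `errors` or fewer
    errors in `trials` trials would have probability 1 - confidence.
    """
    if errors >= trials:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = errors / trials, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(errors, trials, mid) > alpha:
            lo = mid  # data still too plausible at this rate; bound is higher
        else:
            hi = mid
    return hi


# Hypothetical counts (assumption): 22 false positives in 1452 conclusive
# different-source comparisons, giving a point estimate of exactly 1 in 66.
errors, trials = 22, 1452
point = errors / trials
bound = upper_bound(errors, trials)
print(f"best estimate: 1 in {1 / point:.0f}")
print(f"upper 95% bound: 1 in {1 / bound:.0f}")
```

The bound is always higher than the best estimate, and the gap widens as the number of trials shrinks, which is one reason a single study of limited size cannot support claims of near-perfect accuracy.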
Validity as applied would also require, from a scientific standpoint, that an expert testifying on firearms analysis (1) has undergone rigorous proficiency testing on a large number of test problems to measure his or her accuracy and discloses the results of the proficiency testing and (2) discloses whether, when performing the examination, he or she was aware of any other facts of the case that might influence the conclusion.

Concerning the path forward, with firearms analysis as with latent fingerprint analysis, two directions are available for strengthening the scientific underpinnings of the discipline. The first is to improve firearms analysis as a subjective method, which would require additional black-box studies to assess scientific validity and reliability and more rigorous proficiency testing of examiners, using problems that are appropriately challenging and publicly disclosed after the test.

The second direction, as with latent print analysis, is to convert firearms analysis from a subjective method to an objective method. This would involve developing and testing image-analysis algorithms for comparing the similarity of toolmarks on bullets. There have already been encouraging steps toward this goal. The same tremendous progress over the past decade in image analysis that gives us reason to expect early achievement of fully automated latent print analysis is cause for optimism that fully automated firearms analysis may be possible in the near future. Efforts in this direction are currently hampered, however, by lack of access to realistically large and complex databases that can be used to continue development of these methods and validate initial proposals.

NIST, in coordination with the FBI Laboratory, should play a leadership role in propelling the needed transformation by creating and disseminating appropriate large datasets. These agencies should also provide grants and contracts to support work, and systematic processes to evaluate methods.
In particular, we believe that “prize” competitions (based on large, publicly available collections of images) could attract significant interest from academia and industry.

Footwear Analysis

Footwear analysis is a process that typically involves comparing a known object, such as a shoe, to a complete or partial impression found at a crime scene, to assess whether the object is likely to be the source of the impression. The process proceeds in a stepwise manner, beginning with a comparison of “class characteristics” (such as design, physical size, and general wear) and then moving to “identifying characteristics” or “randomly acquired characteristics” (such as marks on a shoe caused by cuts, nicks, and gouges in the course of use).

PCAST has not addressed the question of whether examiners can reliably determine class characteristics, for example, whether a particular shoeprint was made by a size 12 shoe of a particular make. While it is important that studies be undertaken to estimate the reliability of footwear analysis aimed at determining class characteristics, PCAST chose not to focus on this aspect of footwear examination because it is not inherently a challenging measurement problem to determine class characteristics, to estimate the frequency of shoes having a particular class characteristic, or (for jurors) to understand the nature of the features in question.

Instead, PCAST focused on the reliability of conclusions that an impression was likely to have come from a specific piece of footwear.
This is a much harder problem because it requires knowing how accurately examiners can identify specific features shared between a shoe and an impression, how often they fail to identify features that would distinguish them, and what probative value should be ascribed to a particular “randomly acquired characteristic.”

PCAST finds that there are no appropriate black-box studies to support the foundational validity of footwear analysis to associate shoeprints with particular shoes based on specific identifying marks. Such associations are unsupported by any meaningful evidence or estimates of their accuracy and thus are not scientifically valid.

Hair Analysis

Forensic hair analysis is a process by which examiners compare microscopic features of hair to determine whether a particular person may be the source of a questioned hair. As PCAST was completing this report, the Department of Justice released for comment proposed guidelines concerning testimony on hair examination, including a supporting document addressing the validity and reliability of the discipline. While PCAST has not performed the sort of in-depth evaluation for the hair-analysis discipline that we did for other feature-comparison disciplines discussed here, we undertook a review of the DOJ’s supporting document in order to shed further light on the standards for conducting a scientific evaluation of a forensic feature-comparison discipline.

The document states that “microscopic hair comparison has been demonstrated to be a valid and reliable scientific methodology,” while noting that “microscopic hair comparisons alone cannot lead to personal identification and it is crucial that this limitation be conveyed both in the written report and in testimony.”

In support of its conclusion that hair examination is valid and reliable, however, the document discusses only a handful of studies of human hair comparison, from the 1970s and 1980s.
The supporting document fails to note that subsequent studies found substantial flaws in the methodology and results of the key papers. PCAST’s own review of the cited papers finds that these studies do not establish the foundational validity and reliability of hair analysis.

The DOJ’s supporting document also cites a 2002 FBI study that used mitochondrial DNA analysis to re-examine 170 samples from previous cases in which the FBI Laboratory had performed microscopic hair examination. But that study’s key conclusion does not support the conclusion that hair analysis is a “valid and reliable scientific methodology.” The FBI authors actually found that, in 9 of the 80 cases (11 percent) in which the FBI Laboratory had found the hairs to be microscopically indistinguishable, the DNA analysis showed that the hairs actually came from different individuals.

These shortcomings illustrate both the difficulty of these scientific evaluations and the reason they are best carried out by a science-based agency that is not itself involved in the application of forensic science within the legal system. They also underscore why it is important that quantitative information about the reliability of methods (e.g., the frequency of false associations in hair analysis) be stated clearly in expert testimony.

Closing Observations on the Seven Evaluations

Although we have undertaken detailed evaluations of only six specific methods (and a review of an evaluation by others of a seventh), our approach could be applied to assess the foundational validity and validity as applied of any forensic feature-comparison method, including traditional forensic disciplines as well as methods yet to be developed (such as microbiome analysis or internet-browsing patterns).

We note, finally, that the evaluation of scientific validity is necessarily based on the available scientific evidence at a point in time.
Some methods that have not been shown to be foundationally valid may ultimately be found to be reliable, although significant modifications to the methods may be required to achieve this goal. Other methods may not be salvageable, as was the case with compositional bullet lead analysis and is likely the case with bitemarks. Still others may be subsumed by different but more reliable methods, much as DNA analysis has replaced other methods in some instances.

Recommendations to NIST and OSTP

Recommendation 1. Assessment of foundational validity

It is important that scientific evaluations be conducted, on an ongoing basis, to assess the foundational validity of current and newly developed forensic feature-comparison technologies. To ensure the scientific judgments are unbiased and independent, such evaluations should be conducted by an agency that has no stake in the outcome.

(A) The National Institute of Standards and Technology (NIST) should perform such evaluations and should issue an annual public report evaluating the foundational validity of key forensic feature-comparison methods.

(i) The evaluations should (a) assess whether each method reviewed has been adequately defined and whether its foundational validity has been adequately established and its level of accuracy estimated based on empirical evidence; (b) be based on studies published in the scientific literature by the laboratories and agencies in the U.S. and in other countries, as well as any work conducted by NIST’s own staff and grantees; (c) as a minimum, produce assessments along the lines of those in this report, updated as appropriate; and (d) be conducted under the auspices of NIST, with additional expertise as deemed necessary from experts outside forensic science.
(ii) NIST should establish an advisory committee of experimental and statistical scientists from outside the forensic science community to provide advice concerning the evaluations and to ensure that they are rigorous and independent. The members of the advisory committee should be selected jointly by NIST and the Office of Science and Technology Policy.

(iii) NIST should prioritize forensic feature-comparison methods that are most in need of evaluation, including those currently in use and in late-stage development, based on input from the Department of Justice and the scientific community.

(iv) Where NIST assesses that a method has been established as foundationally valid, it should (a) indicate appropriate estimates of error rates based on foundational studies and (b) identify any issues relevant to validity as applied.

(v) Where NIST assesses that a method has not been established as foundationally valid, it should suggest what steps, if any, could be taken to establish the method’s validity.

(vi) NIST should not have regulatory responsibilities with respect to forensic science.

(vii) NIST should encourage one or more leading scientific journals outside the forensic community to develop mechanisms to promote the rigorous peer review and publication of papers addressing the foundational validity of forensic feature-comparison methods.

(B) The President should request and Congress should provide increased appropriations to NIST of (a) $4 million to support the evaluation activities described above and (b) $10 million to support increased research activities in forensic science, including on complex DNA mixtures, latent fingerprints, voice/speaker recognition, and face/iris biometrics.

Recommendation 2.
Development of objective methods for DNA analysis of complex mixture samples, latent fingerprint analysis, and firearms analysis The National Institute of Standards and Technology (NIST) should take a leadership role in transforming three important feature-comparison methods that are currently subjective—latent fingerprint analysis, firearms analysis, and, under some circumstances, DNA analysis of complex mixtures—into objective methods. (A) NIST should coordinate these efforts with the Federal Bureau of Investigation Laboratory, the Defense Forensic Science Center, the National Institute of Justice, and other relevant agencies. (B) These efforts should include (i) the creation and dissemination of large datasets and test materials to support the development and testing of methods by both companies and academic researchers, (ii) grant and contract support, and (iii) sponsoring processes, such as prize competitions, to evaluate methods. Recommendation 3. Improving the Organization for Scientific Area Committees Process (A) The National Institute of Standards and Technology (NIST) should improve the Organization for Scientific Area Committees (OSAC), which was established to develop and promulgate standards and guidelines to improve best practices in the forensic science community. (i) NIST should establish a Metrology Resource Committee, composed of metrologists, statisticians, and other scientists from outside the forensic-science community. A representative of the Metrology Resource 15 Committee should serve on each of the Scientific Area Committees (SACs) to provide direct guidance on the application of measurement and statistical principles to the developing documentary standards. (ii) The Metrology Resource Committee, as a whole, should review and publically approve or disapprove all standards proposed by the Scientific Area Committees before they are transmitted to the Forensic Science Standards Board. 
(B) NIST should ensure that the content of OSAC-registered standards and guidelines is freely available to any party that may desire it in connection with a legal case or for evaluation and research, including by aligning with the policies related to reasonable availability of standards in the Office of Management and Budget Circular A-119, Federal Participation in the Development and Use of Voluntary Consensus Standards and Conformity Assessment Activities and the Office of the Federal Register, IBR (incorporation by reference) Handbook.

Recommendation 4. R&D strategy for forensic science

(A) The Office of Science and Technology Policy (OSTP) should coordinate the creation of a national forensic science research and development strategy. The strategy should address plans and funding needs for: (i) major expansion and strengthening of the academic research community working on forensic sciences, including substantially increased funding for both research and training; (ii) studies of foundational validity of forensic feature-comparison methods; (iii) improvement of current forensic methods, including converting subjective methods into objective methods, and development of new forensic methods; (iv) development of forensic feature databases, with adequate privacy protections, that can be used in research; (v) bridging the gap between research scientists and forensic practitioners; and (vi) oversight and regular review of forensic-science research.

(B) In preparing the strategy, OSTP should seek input from appropriate Federal agencies, including especially the Department of Justice, Department of Defense, National Science Foundation, and National Institute of Standards and Technology; Federal and State forensic science practitioners; forensic science and non-forensic science researchers; and other stakeholders.

Recommendation to the FBI Laboratory

Recommendation 5. Expanded forensic-science agenda at the Federal Bureau of Investigation Laboratory

(A) Research programs. The Federal Bureau of Investigation (FBI) Laboratory should undertake a vigorous research program to improve forensic science, building on its recent important work on latent fingerprint analysis. The program should include: (i) conducting studies on the reliability of feature-comparison methods, in conjunction with independent third parties without a stake in the outcome; (ii) developing new approaches to improve reliability of feature-comparison methods; (iii) expanding collaborative programs with external scientists; and (iv) ensuring that external scientists have appropriate access to datasets and sample collections, so that they can carry out independent studies.

(B) Black-box studies. Drawing on its expertise in forensic science research, the FBI Laboratory should assist in the design and execution of additional empirical ‘black-box’ studies for subjective methods, including for latent fingerprint analysis and firearms analysis. These studies should be conducted by or in conjunction with independent third parties with no stake in the outcome.

(C) Development of objective methods. The FBI Laboratory should work with the National Institute of Standards and Technology to transform three important feature-comparison methods that are currently subjective—latent fingerprint analysis, firearm analysis, and, under some circumstances, DNA analysis of complex mixtures—into objective methods. These efforts should include (i) the creation and dissemination of large datasets to support the development and testing of methods by both companies and academic researchers, (ii) grant and contract support, and (iii) sponsoring prize competitions to evaluate methods.

(D) Proficiency testing. The FBI Laboratory should promote increased rigor in proficiency testing by (i) within the next four years, instituting routine blind proficiency testing within the flow of casework in its own laboratory, (ii) assisting other Federal, State, and local laboratories in doing so as well, and (iii) encouraging routine access to and evaluation of the tests used in commercial proficiency testing.

(E) Latent fingerprint analysis. The FBI Laboratory should vigorously promote the adoption, by all laboratories that perform latent fingerprint analysis, of rules requiring a “linear Analysis, Comparison, Evaluation” process—whereby examiners must complete and document their analysis of a latent fingerprint before looking at any known fingerprint and should separately document any additional data used during comparison and evaluation.

(F) Transparency concerning quality issues in casework. The FBI Laboratory, as well as other Federal forensic laboratories, should regularly and publicly report quality issues in casework (in a manner similar to the practices employed by the Netherlands Forensic Institute, described in Chapter 5), as a means to improve quality and promote transparency.

(G) Budget. The President should request and Congress should provide increased appropriations to the FBI to restore the FBI Laboratory’s budget for forensic science research activities from its current level to $30 million and should evaluate the need for increased funding for other forensic-science research activities in the Department of Justice.

Recommendations to the Attorney General

Recommendation 6. Use of feature-comparison methods in Federal prosecutions

(A) The Attorney General should direct attorneys appearing on behalf of the Department of Justice (DOJ) to ensure expert testimony in court about forensic feature-comparison methods meets the scientific standards for scientific validity.
While pretrial investigations may draw on a wider range of methods, expert testimony in court about forensic feature-comparison methods in criminal cases—which can be highly influential and has led to many wrongful convictions—must meet a higher standard. In particular, attorneys appearing on behalf of the DOJ should ensure that:

(i) the forensic feature-comparison methods upon which testimony is based have been established to be foundationally valid with a level of accuracy suitable to their intended application, as shown by appropriate empirical studies and consistency with evaluations by the National Institute of Standards and Technology (NIST), where available; and

(ii) the testimony is scientifically valid, with the expert’s statements concerning the accuracy of methods and the probative value of proposed identifications being constrained by the empirically supported evidence and not implying a higher degree of certainty.

(B) DOJ should undertake an initial review, with assistance from NIST, of subjective feature-comparison methods used by DOJ to identify which methods (beyond those reviewed in this report) lack appropriate black-box studies necessary to assess foundational validity. Because such subjective methods are presumptively not established to be foundationally valid, DOJ should evaluate whether it is appropriate to present in court conclusions based on such methods.

(C) Where relevant methods have not yet been established to be foundationally valid, DOJ should encourage and provide support for appropriate black-box studies to assess foundational validity and measure reliability. The design and execution of these studies should be conducted by or in conjunction with independent third parties with no stake in the outcome.

Recommendation 7. Department of Justice guidelines on expert testimony

(A) The Attorney General should revise and reissue for public comment the Department of Justice’s (DOJ) proposed “Uniform Language for Testimony and Reports” and supporting documents to bring them into alignment with scientific standards for scientific validity.

(B) The Attorney General should issue instructions directing that:

(i) Where empirical studies and/or statistical models exist to shed light on the accuracy of a forensic feature-comparison method, an examiner should provide quantitative information about error rates, in accordance with guidelines to be established by DOJ and the National Institute of Standards and Technology, based on advice from the scientific community.

(ii) Where there are not adequate empirical studies and/or statistical models to provide meaningful information about the accuracy of a forensic feature-comparison method, DOJ attorneys and examiners should not offer testimony based on the method. If it is necessary to provide testimony concerning the method, they should clearly acknowledge to courts the lack of such evidence.

(iii) In testimony, examiners should always state clearly that errors can and do occur, due both to similarities between features and to human mistakes in the laboratory.

Recommendation to the Judiciary

Recommendation 8. Scientific validity as a foundation for expert testimony

(A) When deciding the admissibility of expert testimony, Federal judges should take into account the appropriate scientific criteria for assessing scientific validity including: (i) foundational validity, with respect to the requirement under Rule 702(c) that testimony is the product of reliable principles and methods; and (ii) validity as applied, with respect to the requirement under Rule 702(d) that an expert has reliably applied the principles and methods to the facts of the case. These scientific criteria are described in Finding 1.
(B) Federal judges, when permitting an expert to testify about a foundationally valid feature-comparison method, should ensure that testimony about the accuracy of the method and the probative value of proposed identifications is scientifically valid in that it is limited to what the empirical evidence supports. Statements suggesting or implying greater certainty are not scientifically valid and should not be permitted. In particular, courts should never permit scientifically indefensible claims such as: “zero,” “vanishingly small,” “essentially zero,” “negligible,” “minimal,” or “microscopic” error rates; “100 percent certainty” or proof “to a reasonable degree of scientific certainty;” identification “to the exclusion of all other sources;” or a chance of error so remote as to be a “practical impossibility.”

(C) To assist judges, the Judicial Conference of the United States, through its Standing Advisory Committee on the Federal Rules of Evidence, should prepare, with advice from the scientific community, a best practices manual and an Advisory Committee note, providing guidance to Federal judges concerning the admissibility under Rule 702 of expert testimony based on forensic feature-comparison methods.

(D) To assist judges, the Federal Judicial Center should develop programs concerning the scientific criteria for scientific validity of forensic feature-comparison methods.

1. Introduction

“Forensic science” has been defined as the application of scientific or technical practices to the recognition, collection, analysis, and interpretation of evidence for criminal and civil law or regulatory issues. 5 The forensic sciences encompass a broad range of disciplines, each with its own set of technologies and practices.
The National Institute of Justice (NIJ) divides those disciplines into twelve categories: general toxicology; firearms and toolmarks; questioned documents; trace evidence (such as hair and fiber analysis); controlled substances; biological/serology screening (including DNA analysis); fire debris/arson analysis; impression evidence; blood pattern evidence; crime scene investigation; medicolegal death investigation; and digital evidence. 6 In the years ahead, science and technology will likely offer additional powerful tools for the forensic domain—perhaps the ability to compare populations of bacteria in the gut or patterns of search on the Internet.

Historically, forensic science has been used primarily in two phases of the criminal-justice process: (1) investigation, which seeks to identify the likely perpetrator of a crime, and (2) prosecution, which seeks to prove the guilt of a defendant beyond a reasonable doubt. (In recent years, forensic science—particularly DNA analysis—has also come into wide use for challenging past convictions.)

Importantly, the investigative and prosecutorial phases involve different standards for the use of forensic science and other investigative tools. In investigations, insights and information may come from both well-established science and exploratory approaches. 7 In the prosecution phase, forensic science must satisfy a higher standard. Specifically, the Federal Rules of Evidence require that expert testimony be based, among other things, on “reliable principles and methods” that have been “reliably applied” to the facts of the case. 8 And the Supreme Court has stated that judges must determine “whether the reasoning or methodology underlying the testimony is scientifically valid.” 9

This is where legal standards and scientific standards intersect. Judges’ decisions about the admissibility of scientific evidence rest solely on legal standards; they are exclusively the province of the courts. But the overarching subject of the judges’ inquiry is scientific validity. 10 It is the proper province of the scientific community to provide guidance concerning scientific standards for scientific validity. 11

5 Definition of “forensic science” as provided by the National Commission on Forensic Science in its Views Document, “Defining forensic science and related terms.” Adopted April 30-May 1, 2015. www.justice.gov/ncfs/file/786571/download.
6 See: National Institute of Justice. Status and Needs of Forensic Science Service Providers: A Report to Congress. 2006. www.ojp.usdoj.gov/nij/pubs-sum/213420.htm.
7 While investigative methods need not meet the standards of reliability required under the Federal Rules of Evidence, they should be based in sound scientific principles and practices so as to avoid false accusations.
8 Fed. R. Evid. 702.
9 Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993) at 592.
10 Daubert, at 594.
11 In this report, PCAST addresses solely the scientific standards for scientific validity and reliability. We do not offer opinions concerning legal standards.

A focus on the scientific side of this intersection is timely because it has become increasingly clear in recent years that lack of rigor in the assessment of the scientific validity of forensic evidence is not just a hypothetical problem but a real and significant weakness in the judicial system. As recounted in Chapter 2, reviews by competent bodies of the scientific underpinnings of forensic disciplines and the use in courtrooms of evidence based on those disciplines have revealed a dismaying frequency of instances of use of forensic evidence that do not pass an objective test of scientific validity.

The most comprehensive such review to date was conducted by a National Research Council (NRC) committee co-chaired by Judge Harry Edwards of the U.S.
Court of Appeals for the District of Columbia Circuit and Constantine Gatsonis, Director of the Center for Statistical Sciences at Brown University. Mandated by Congress in an appropriations bill signed into law in late 2005, the study launched in the fall of 2006 and the committee released its report in February 2009. 12 The 2009 NRC report described a disturbing pattern of deficiencies common to many of the forensic methods routinely used in the criminal justice system, most importantly a lack of rigorous and appropriate studies establishing their scientific validity, concluding that “much forensic evidence—including, for example, bitemarks and firearm and toolmark identifications—is introduced in criminal trials without any meaningful scientific validation, determination of error rates, or reliability testing to explain the limits of the discipline.” 13 In 2013, after prolonged discussion of the NRC report’s findings and recommendations inside and outside the Federal government, the Department of Justice (DOJ)—in collaboration with the National Institute of Standards and Technology (NIST)—established the National Commission on Forensic Science (NCFS) as a Federal advisory body charged with providing forensic-science guidance and policy recommendations to the Attorney General. Co-chaired by the Deputy Attorney General and the Director of NIST, the NCFS’s 32 members include eight academic scientists and five other science Ph.D.s; the other members include judges, attorneys, and forensic practitioners. 
To strengthen forensic science more generally, in 2014 NIST established the Organization for Scientific Area Committees for Forensic Science (OSAC) to “coordinate development of standards and guidelines…to improve quality and consistency of work in the forensic science community.” 14

In September 2015, President Obama asked his Council of Advisors on Science and Technology (PCAST) to explore, in light of the work being done by the NCFS and OSAC, what additional efforts could contribute to strengthening the forensic-science disciplines and ensuring the scientific reliability of forensic evidence used in the Nation’s legal system. After review of the ongoing activities and the relevant scientific and legal literatures—including particularly the scientific and legal assessments in the 2009 NRC report—PCAST concluded that there are two important gaps: (1) the need for clarity on the scientific meaning of “reliable principles and methods” and “scientific validity” in the context of certain forensic disciplines, and (2) the need to evaluate specific forensic methods to determine whether they have been scientifically established to be valid and reliable.

12 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009).
13 Ibid., 107-8.
14 See: www.nist.gov/forensics/organization-scientific-area-committees-forensic-science.

Within the broad span of forensic disciplines, we chose to narrow our focus to techniques that we refer to here as forensic “feature-comparison” methods (see Box 1). 15 While one motivation for this narrowing was to make our task tractable within the limits of available time and resources, we chose this particular class of methods because: (1) they are commonly used in criminal cases; (2) they have attracted a high degree of concern with respect to validity (e.g., the 2009 NRC report); and (3) they all belong to the same broad scientific discipline, metrology, which is “the science of measurement and its application,” in this case to measuring and comparing features. 16

BOX 1. Forensic feature-comparison methods
PCAST uses the term “forensic feature-comparison methods” to refer to the wide variety of methods that aim to determine whether an evidentiary sample (e.g., from a crime scene) is or is not associated with a potential source sample (e.g., from a suspect) based on the presence of similar patterns, impressions, features, or characteristics in the sample and the source. Examples include the analyses of DNA, hair, latent fingerprints, firearms and spent ammunition, tool and toolmarks, shoeprints and tire tracks, bitemarks, and handwriting.

PCAST began this study by forming a working group of six of its members to gather information for consideration. 17 To educate itself about factual matters relating to the interaction between science and law, PCAST consulted with a panel of Senior Advisors (listed in the front matter) comprising nine current or former Federal judges, one former U.S. Solicitor General and State supreme court justice, two law school deans, and two statisticians, who have expertise in this domain. PCAST also sought input from a diverse group of additional experts and stakeholders, including forensic scientists and practitioners, judges, prosecutors, defense attorneys, criminal justice reform advocates, statisticians, academic researchers, and Federal agency representatives (see Appendix B).
Input was gathered through multiple in-person meetings and conference calls, including a session at a meeting of PCAST on January 15, 2016. PCAST also took the unusual step of initiating an online, open solicitation to broaden input, in particular from the forensic-science practitioner community; more than 70 responses were received. 18 PCAST also shared a draft of this report with NIST and DOJ, which provided detailed and helpful comments that were carefully considered in revising the report. PCAST expresses its gratitude to all those who shared their views. Their willingness to engage with PCAST does not imply endorsement of the views expressed in the report. Responsibility for the opinions, findings and recommendations expressed in this report and for any errors of fact or interpretation rests solely with PCAST.

15 PCAST notes that there are issues related to the scientific validity of other types of forensic evidence that are beyond the scope of this report but require urgent attention—including notably arson science and abusive head trauma commonly referred to as “Shaken Baby Syndrome.” In addition, a major area not addressed in this report is scientific methods for assessing causation—for example, whether exposure to a substance was likely to have caused harm to an individual.
16 International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (VIM 3rd edition) JCGM 200 (2012).
17 Two of the members have been involved with forensic science. PCAST Co-chair Eric Lander has served in various scientific roles (expert witness in People v. Castro 545 N.Y.S.2d 985 (Sup. Ct. 1989), a seminal case on the quality of DNA analysis discussed on p. 25; court’s witness in U.S. v. Yee, 134 F.R.D. 161 in 1991; member of the NRC panel on forensic DNA analysis in 1992; scientific co-author with a forensic scientist from the FBI Laboratory in 1994; and a member of the Board of Directors of the Innocence Project from 2004 to the present). All of these roles have been unremunerated. PCAST member S. James Gates, Jr. has been a member, since its inception, of the National Commission on Forensic Science.
18 See: www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensics_request_for_information.pdf.

The remainder of our report is organized as follows.
• Chapter 2 provides a brief overview of the findings of other studies relating to forensic practice and testimony based on it, and it reviews, as well, Federal actions currently underway to strengthen forensic science.
• Chapter 3 briefly reviews the role of scientific validity within the legal system. It describes the important distinction between legal standards and scientific standards.
• Chapter 4 then describes the scientific standards for “reliable principles and methods” and “scientific validity” as they apply to forensic feature-comparison methods and offers clear criteria that could be readily applied by courts.
• Chapter 5 illustrates the application of the indicated criteria by using them to evaluate the scientific validity of six important “feature-comparison” methods: DNA analysis of single-source and simple-mixture samples, DNA analysis of complex mixtures, bitemark analysis, latent fingerprint analysis, firearms analysis, and footwear analysis. We also discuss an evaluation by others of a seventh method, hair analysis.
• In Chapters 6–9, we offer recommendations, based on the findings of Chapters 4–5, concerning Federal actions that could be taken to strengthen forensic science and promote its more rigorous use in the courtroom.

2. Previous Work on Validity of Forensic-Science Methods

Developments over the past two decades—including the exoneration of defendants who had been wrongfully convicted based in part on forensic-science evidence, a variety of studies of the scientific underpinnings of the forensic disciplines, reviews of expert testimony based on forensic findings, and scandals in state crime laboratories—have called increasing attention to the question of the validity and reliability of some important forensic methods and of the evidence and testimony based upon them. (For definitions of key terms such as scientific validity and reliability, see Box 1 on page 47-8.) In this chapter, we briefly review this history to inform our assessment of the current state of forensic science methods and their validity and the path forward. 19

2.1 DNA Evidence and Wrongful Convictions

Ironically, it was the emergence and maturation of a new forensic science, DNA analysis, that first led to serious questioning of the validity of many of the traditional forensic disciplines. When defendants convicted with the help of forensic evidence from those traditional disciplines began to be exonerated on the basis of persuasive DNA comparisons, deeper inquiry into scientific validity began. How this came to pass provides useful context for our inquiry here.

When DNA evidence was first introduced in the courts, beginning in the late 1980s, it was initially hailed as infallible. But the methods used in early cases turned out to be unreliable: testing labs lacked validated and consistently-applied procedures for defining DNA patterns from samples, for declaring whether two patterns matched within a given tolerance, and for determining the probability of such matches arising by chance in the population. 20 When DNA evidence was declared inadmissible in People v. Castro, a New York case in 1989, scientists—including at the U.S.
National Academy of Sciences and the Federal Bureau of Investigation (FBI)—came together to promote the development of reliable principles and methods that have enabled DNA analysis of single-source samples to become the “gold standard” of forensic science for both investigation and prosecution. 21 Both the initial recognition of serious problems and the subsequent development of reliable procedures were aided by the existence of a robust community of molecular biologists who used DNA analysis in non-forensic applications, such as in biomedical and agricultural sciences. They were also aided by judges who recognized that this powerful forensic method should only be admitted as courtroom evidence once its reliability was properly established.

19 In producing this summary we relied particularly on the National Research Council 2009 report, Strengthening Forensic Science in the United States: A Path Forward and the National Academies of Sciences, Engineering, and Medicine 2015 report, Support for Forensic Science Research: Improving the Scientific Role of the National Institute of Justice.
20 See: Lander, E.S. “DNA fingerprinting on trial.” Nature, Vol. 339 (1989): 501-5; Lander, E.S., and B. Budowle. “DNA fingerprinting dispute laid to rest.” Nature, Vol. 371 (1994): 735-8; Kaye, D.H. “DNA Evidence: Probability, Population Genetics, and the Courts.” Harv. J. L. & Tech, Vol. 7 (1993): 101-72; Roberts, L. “Fight erupts over DNA fingerprinting.” Science, Vol. 254 (1991): 1721-3; Thompson, W.C., and S. Ford. “Is DNA fingerprinting ready for the courts?” New Scientist, Vol. 125 (1990): 38-43; Neufeld, P.J., and N. Colman. “When science takes the witness stand.” Scientific American, Vol. 262 (1991): 46-53.
Once DNA analysis became a reliable methodology, the power of the technology—including its ability to analyze small samples and to distinguish between individuals—made it possible not only to identify and convict true perpetrators but also to clear mistakenly accused suspects before prosecution and to re-examine a number of past convictions. Reviews by the National Institute of Justice (NIJ) 22 and others have found that DNA testing during the course of investigations has cleared tens of thousands of suspects. DNA-based re-examination of past cases, moreover, has led so far to the exonerations of 342 defendants, including 20 who had been sentenced to death, and to the identification of 147 real perpetrators. 23

Independent reviews of these cases have revealed that many relied in part on faulty expert testimony from forensic scientists who had told juries that similar features in a pair of samples taken from a suspect and from a crime scene (e.g., hair, bullets, bitemarks, tire or shoe treads, or other items) implicated defendants in a crime with a high degree of certainty. 24 According to the reviews, these errors were not simply a matter of individual examiners testifying to conclusions that turned out to be incorrect; rather, they reflected a systemic problem—the testimony was based on methods and included claims of accuracy that were cloaked in purported scientific respectability but actually had never been subjected to meaningful scientific scrutiny. 25

21 People v. Castro 545 N.Y.S.2d 985 (Sup. Ct. 1989). The case, in which a janitor was charged with the murder of a woman in the Bronx, was among the first criminal cases involving DNA analysis in the United States. The court held a 15-week-long pretrial hearing about the admissibility of the DNA evidence. By the end of the hearing, the independent experts for both the defense and prosecution unanimously agreed that the DNA evidence presented was not scientifically reliable—and the judge ruled the evidence inadmissible. See: Lander, E.S. “DNA fingerprinting on trial.” Nature, Vol. 339 (1989): 501-5. These events eventually led to two NRC reports on forensic DNA analysis, in 1992 and 1996, and to the founding of the Innocence Project (www.innocenceproject.org).
22 DNA testing has excluded 20-25 percent of initial suspects in sexual assault cases. U.S. Department of Justice, Office of Justice Programs, National Institute of Justice. Convicted by Juries, Exonerated by Science: Case Studies in the Use of DNA Evidence to Establish Innocence after Trial, (1996): xxviii.
23 Innocence Project, “DNA Exonerations in the United States.” See: www.innocenceproject.org/dna-exonerations-in-the-united-states.
24 For example, see: Gross, S.R., and M. Shaffer. “Exonerations in the United States, 1989-2012.” National Registry of Exonerations, (2012) available at: www.law.umich.edu/special/exoneration/Documents/exonerations_us_1989_2012_full_report.pdf. See also: Saks, M.J., and J.J. Koehler. “The coming paradigm shift in forensic identification science.” Science, Vol. 309, No. 5736 (2005): 892-5.
25 Garrett, B.L., and P.J. Neufeld. “Invalid forensic science testimony and wrongful convictions.” Virginia Law Review, Vol. 91, No. 1 (2009): 1-97; National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 42-3.

2.2 Studies of Specific Forensic-Science Methods and Laboratory Practices

The questions that DNA analysis had raised about the scientific validity of traditional forensic disciplines and testimony based on them led, naturally, to increased efforts to test empirically the reliability of the methods that those disciplines employed.
Scrutiny was directed, similarly, to the practices by which forensic evidence is collected, stored, and analyzed in crime laboratories around the country. The FBI Laboratory, widely regarded as one of the best in the country, played an important role in the latter investigations, re-assessing its own practices as well as those of others. In what follows we summarize some of the key findings of the studies of methods and practices that ensued in the case of the “comparison” disciplines that are the focus in this report.

Bullet Lead Examination

From the 1960s until 2005, the FBI used compositional analysis of bullet lead as a forensic tool to identify the source of bullets. Yet, an NRC report commissioned by the FBI and released in 2004 challenged the foundational validity of identifications based on the discipline. The technique involved comparing the quantity of various elements in bullets found at a crime scene with that of unused bullets to determine whether the bullets came from the same box of ammunition. The 2004 NRC report found that there is no scientific basis for making such a determination. 26 While the method for determining the concentrations of different elements within a bullet was found to be reliable, the report found there was insufficient research and data to support drawing a connection, based on compositional similarity, between a particular bullet and a given batch of ammunition, which is usually the relevant question in a criminal case. 27 In 2005, the FBI announced that it would discontinue the practice of bullet lead examinations, noting that while it “firmly supports the scientific foundation of bullet lead analysis,” the manufacturing and distribution of bullets was too variable to make the matching reliable. 28

26 National Research Council. Forensic Analysis: Weighing Bullet Lead Evidence. The National Academies Press. Washington DC. (2004). Lead bullet examination, also known as Compositional Analysis of Bullet Lead (CABL), involves comparing the elemental composition of bullets found at a crime scene with unused cartridges in the possession of a suspect. This technique assumes that (1) the molten source used to produce a single “lot” of bullets has a uniform composition throughout, (2) no two molten sources have the same composition, and (3) bullets with different compositions are not mixed during the manufacturing or shipping processes. However, in practice, this is not the case. The 2004 NRC report found that compositionally indistinguishable volumes of lead could produce small lots of bullets (on the order of 12,000 bullets) or large lots (with more than 35 million bullets). The report also found no assurance that indistinguishable volumes of lead could not occur at different times and places. Neither scientists nor bullet manufacturers are able to definitively attest to the significance of an association made between bullets in the course of a bullet lead examination. The most that one can say is that bullets that are indistinguishable by CABL could have come from the same source.
27 Faigman, D.L., Cheng, E.K., Mnookin, J.L., Murphy, E.E., Sander, J., and C. Slobogin (Eds.) Modern Scientific Evidence: The Law and Science of Expert Testimony, 2015-2016 ed. Thomson/West Publishing (2016).
28 Federal Bureau of Investigation. FBI Laboratory Announces Discontinuation of Bullet Lead Examinations. (September 1, 2005, press release). www.fbi.gov/news/pressrel/press-releases/fbi-laboratory-announces-discontinuation-of-bullet-leadexaminations (accessed May 6, 2016).

Latent Fingerprints

In 2005, an international committee established by the FBI released a report concerning flaws in the FBI’s practices for fingerprint identification that had led to a prominent misidentification.
Based almost entirely on a latent fingerprint recovered from the 2004 bombing of the Madrid commuter train system, the FBI erroneously detained an American in Portland, Oregon, and held him for two weeks as a material witness. 29 An FBI examiner concluded the fingerprints matched with “100 percent certainty,” although Spanish authorities were unable to confirm the match. 30 The review committee concluded that the FBI’s misidentification had occurred primarily as a result of “confirmation bias.” 31 Similarly, a report by the DOJ’s Office of the Inspector General highlighted “reverse reasoning” from the known print to the latent image that led to an exaggerated focus on apparent similarities and inadequate attention to differences between the images. 32

Hair Analysis

In 2002, FBI scientists used mitochondrial DNA sequencing to re-examine 170 microscopic hair comparisons that the agency’s scientists had performed in criminal cases. The DNA analysis showed that, in 11 percent of cases in which the FBI examiners had found the hair samples to match microscopically, the samples actually came from different individuals. 33 These false associations may not have been the result of a failure of the examiner to perform the analysis correctly; instead, the characteristics could simply have been shared by chance. The study showed that the power of microscopic hair comparison to distinguish between samples from different sources was much lower than previously assumed. (For example, earlier studies suggested that the false positive rate of hair analysis is in the range of 1 in 40,000. 34)

29 Stacey, R.B. “Report on the erroneous fingerprint individualization in the Madrid train bombing case.” Forensic Science Communications, Vol. 7, No. 1 (2005).
30 Application for Material Witness Order and Warrant Regarding Witness: Brandon Bieri Mayfield, In re Federal Grand Jury Proceedings 03-01, 337 F. Supp. 2d 1218 (D. Or. 2004) (No. 04-MC-9071).
31 Specifically, similarities between the two prints, combined with the inherent pressure of working on an extremely high-profile case, influenced the initial examiner’s judgment: ambiguous characteristics were interpreted as points of similarity and differences between the two prints were explained away. A second examiner, not shielded from the first examiner’s conclusions, simply confirmed the first examiner’s results. See: Stacey, R.B. “Report on the erroneous fingerprint individualization in the Madrid train bombing case.” Forensic Science Communications, Vol. 7, No. 1 (2005).
32 U.S. Department of Justice, Office of the Inspector General. “A review of the FBI’s handling of the Brandon Mayfield case.” (2006). oig.justice.special/s0601/final.pdf.
33 Houck, M.M., and B. Budowle. “Correlation of microscopic and mitochondrial DNA hair comparisons.” Journal of Forensic Sciences, Vol. 47, No. 5 (2002): 964-7.
34 Gaudette, B. D., and E.S. Keeping. “An attempt at determining probabilities in human scalp hair comparisons.” Journal of Forensic Sciences, Vol. 19 (1975): 599-606. This study was recently cited by DOJ to support the assertion that hair analysis is a valid and reliable scientific methodology. www.justice.gov/dag/file/877741/download. The topic of hair analysis is discussed in Chapter 5.

Bitemarks

A 2010 study of experimentally created bitemarks produced by known biters found that skin deformation distorts bitemarks so substantially and so variably that current procedures for comparing bitemarks are unable to reliably exclude or include a suspect as a potential biter. (“The data derived showed no correlation and was not reproducible, that is, the same dentition could not create a measurable impression that was consistent in all of the parameters in any of the test circumstances.”
35) A recent study by the American Board of Forensic Odontology also showed a disturbing lack of consistency in the way that forensic odontologists go about analyzing bitemarks, including even in deciding whether there was sufficient evidence to determine whether a photographed bitemark was a human bitemark. 36 In February 2016, following a six-month investigation, the Texas Forensic Science Commission unanimously recommended a moratorium on the use of bitemark identifications in criminal trials, concluding that the validity of the technique has not been scientifically established. 37

These examples illustrate how several forensic feature-comparison methods that have been in wide use have nonetheless not been subjected to meaningful tests of scientific validity or measures of reliability.

35 Bush, M.A., Cooper, H.I., and R.B. Dorion. “Inquiry into the scientific basis for bitemark profiling and arbitrary distortion compensation.” Journal of Forensic Sciences, Vol. 55, No. 4 (2010): 976-83. See also Bush, M.A., Miller, R.G., Bush, P.J., and R.B. Dorion. “Biomechanical factors in human dermal bitemarks in a cadaver model.” Journal of Forensic Sciences, Vol. 54, No. 1 (2009): 167-76.
36 Balko, R. “A bite mark matching advocacy group just conducted a study that discredits bite mark evidence.” Washington Post, April 8, 2015. www.washingtonpost.com/news/the-watch/wp/2015/04/08/a-bite-mark-matching-advocacy-groupjust-conducted-a-study-that-discredits-bite-mark-evidence; Adam J. Freeman & Iain A. Pretty, Construct Validity of Bitemark Assessments Using the ABO Bitemark Decision Tree, American Academy of Forensic Sciences, Annual Meeting, Odontology Section, G14, February 2015 (data made available by the authors upon request).
37 Texas Forensic Science Commission. “Forensic bitemark comparison complaint filed by National Innocence Project on behalf of Steven Mark Chaney – Final Report.” (2016). www.fsc.texas.gov/sites/default/files/FinalBiteMarkReport.pdf.
38 Thompson, W.C., Taroni, F., and C.G.G. Aitken. “How the Probability of a False Positive Affects the Value of DNA Evidence.” J Forensic Sci, Vol. 48, No. 1 (2003): 1-8; Thompson, W.C. “The Myth of Infallibility,” In Sheldon Krimsky & Jeremy Gruber (Eds.) Genetic Explanations: Sense and Nonsense, Harvard University Press (2013); Cole, S.A. “More than zero: Accounting for error in latent fingerprint identification.” Journal of Criminal Law and Criminology, Vol. 95, No. 3 (2005): 985-1078; and Koehler, J.J. “Forensics or fauxrensics? Ascertaining accuracy in the forensic sciences.” papers.ssrn.com/sol3/papers.cfm?abstract_id=2773255 (accessed June 28, 2016).

2.3 Testimony Concerning Forensic Evidence

Reviews of trial transcripts have found that expert witnesses have often overstated the probative value of their evidence, going far beyond what the relevant science can justify. For example, some examiners have testified:

• that their conclusions are “100 percent certain;” have “zero,” “essentially zero,” “vanishingly small,” “negligible,” “minimal,” or “microscopic” error rate; or have a chance of error so remote as to be a “practical impossibility.” 38 As many reviews have noted, however, such statements are not scientifically defensible. All laboratory tests and feature-comparison analyses have non-zero error rates, even if an examiner received a perfect score on a particular performance test involving a limited number of samples. 39 Even highly automated tests do not have a zero error rate. 40,41
• that they can “individualize” evidence (for example, using markings on a bullet to attribute it to a specific weapon “to the exclusion of every other firearm in the world”), an assertion that is not supportable by the relevant science.
42
• that a result is true “to a reasonable degree of scientific certainty.” This phrase has no generally accepted meaning in science and is open to widely differing interpretations by different scientists. 43 Moreover, the statement may be taken as implying certainty.

DOJ Review of Testimony on Hair Analysis

In 2012, the DOJ and FBI announced that they would initiate a formal review of testimony in more than 3,000 criminal cases involving microscopic hair analysis. Initial results of this unprecedented review, conducted in consultation with the Innocence Project and the National Association of Criminal Defense Lawyers, found that FBI examiners had provided scientifically invalid testimony in more than 95 percent of cases where examiner-provided testimony was used to inculpate a defendant at trial. These problems were systemic: 26 of the 28 FBI hair examiners who testified in the 328 cases provided scientifically invalid testimony. 44,45

39 Cole, S.A. “More than zero: Accounting for error in latent fingerprint identification.” Journal of Criminal Law and Criminology, Vol. 95, No. 3 (2005): 985-1078 and Koehler, J.J. “Forensics or fauxrensics? Ascertaining accuracy in the forensic sciences.” papers.ssrn.com/sol3/papers.cfm?abstract_id=2773255 (accessed June 28, 2016).
40 Thompson, W.C., Taroni, F., and C.G.G. Aitken. “How the probability of a false positive affects the value of DNA evidence.” Journal of Forensic Sciences, Vol. 48, No. 1 (2003): 1-8.
41 False positive results can arise from two sources: (1) similarity between two features that occurs by chance and (2) human/technical failures. See discussion in Chapter 4, p. 50-1.
42 See: National Research Council. Ballistic Imaging. The National Academies Press. Washington DC. 2008 and Saks, M. J., and J.J. Koehler. “The individualization fallacy in forensic science evidence.” Vanderbilt Law Review, Vol. 61, No. 1 (2008): 199-218.
43 National Commission on Forensic Science, “Recommendations to the Attorney General Regarding Use of the Term ‘Reasonable Scientific Certainty’,” Approved March 22, 2016, available at: www.justice.gov/ncfs/file/839726/download. The NCFS states that “forensic discipline conclusions are often testified to as being held ‘to a reasonable degree of scientific certainty’ or ‘to a reasonable degree of [discipline] certainty.’ These terms have no scientific meaning and may mislead factfinders about the level of objectivity involved in the analysis, its scientific reliability and limitations, and the ability of the analysis to reach a conclusion.”
44 Federal Bureau of Investigation. FBI Testimony on Microscopic Hair Analysis Contained Errors in at Least 90 Percent of Cases in Ongoing Review, (April 20, 2015, press release). www.fbi.gov/news/pressrel/press-releases/fbi-testimony-onmicroscopic-hair-analysis-contained-errors-in-at-least-90-percent-of-cases-in-ongoing-review.
45 The erroneous statements fell into three categories, in which the examiner: (1) stated or implied that evidentiary hair could be associated with a specific individual to the exclusion of all others; (2) assigned to the positive association a statistical weight or a probability that the evidentiary hair originated from a particular source; or (3) cited the number of cases worked in the lab and the number of successful matches to support a conclusion that an evidentiary hair belonged to a specific individual. Reimer, N.L. “The hair microscopy review project: An historic breakthrough for law enforcement and a daunting challenge for the defense bar.” The Champion, (July 2013): 16. www.nacdl.org/champion.aspx?id=29488.

The importance of the FBI’s hair analysis review was illustrated by the decision in January 2016 by Massachusetts Superior Court Judge Robert Kane to vacate the conviction of George Perrot, based in part on the FBI’s acknowledgment of errors in hair analysis.
46

Expanded DOJ Review

In March 2016, DOJ announced its intention to expand its review of forensic testimony by the FBI Laboratory in closed criminal cases to additional forensic science methods. The review will provide the opportunity to assess the extent to which similar testimonial overstatement has occurred in other disciplines. 47 DOJ plans to lay out a framework for auditing samples of testimony that came from FBI units handling additional kinds of feature-based evidence, such as tracing the impressions that guns leave on bullets, shoe treads, fibers, soil and other crime-scene evidence.

2.4 Cognitive Bias

In addition to the issues previously described, scientists have studied a subtler but equally important problem that affects the reliability of conclusions in many fields, including forensic science: cognitive bias. Cognitive bias refers to ways in which human perceptions and judgments can be shaped by factors other than those relevant to the decision at hand. It includes “contextual bias,” where individuals are influenced by irrelevant background information; “confirmation bias,” where individuals interpret information, or look for new evidence, in a way that conforms to their pre-existing beliefs or assumptions; and “avoidance of cognitive dissonance,” where individuals are reluctant to accept new information that is inconsistent with their tentative conclusion. The biomedical science community, for example, goes to great lengths to minimize cognitive bias by employing strict protocols, such as double-blinding in clinical trials.

Studies have demonstrated that cognitive bias may be a serious issue in forensic science. For example, a study by Itiel Dror and colleagues demonstrated that the judgment of latent fingerprint examiners can be influenced by knowledge about other forensic examiners’ decisions (a form of confirmation bias). 48 These studies are discussed in more detail in Section 5.4.
Similar studies have replicated these findings in other forensic domains, including DNA mixture interpretation, microscopic hair analysis, and fire investigation. 49,50

46 Commonwealth v. Perrot, No. 85-5415, 2016 WL 380123 (Mass. Super. Jan. 26, 2016).
47 See: www.justice.gov/dag/file/870671/download.
48 Dror, I.E., Charlton, D., and A.E. Peron. “Contextual information renders experts vulnerable to making erroneous identifications.” Forensic Science International, Vol. 156 (2006): 74-8.
49 See, for example: Dror, I.E., and G. Hampikian. “Subjectivity and bias in forensic DNA mixture interpretation.” Science & Justice, Vol. 51, No. 4 (2011): 204-8; Miller, L.S. “Procedural bias in forensic examinations of human hair.” Law and Human Behavior, Vol. 11 (1987): 157; and Bieber, P. “Fire investigation and cognitive bias.” Wiley Encyclopedia of Forensic Science, 2014, available through onlinelibrary.wiley.com/doi/10.1002/9780470061589.fsa1119/abstract.
50 See, generally, Dror, I.E. “A hierarchy of expert performance.” Journal of Applied Research in Memory and Cognition, Vol. 5 (2016): 121-127.

Several strategies have been proposed for mitigating cognitive bias in forensic laboratories, including managing the flow of information in a crime laboratory to minimize exposure of the forensic analyst to irrelevant contextual information (such as confessions or eyewitness identification) and ensuring that examiners work in a linear fashion, documenting their findings about evidence from the crime scene before performing comparisons with samples from a suspect. 51

2.5 State of Forensic Science

The 2009 NRC study concluded that many of these difficulties with forensic science may stem from the historical reality that many methods were devised as rough heuristics to aid criminal investigations and were not grounded in the validation practices of scientific research.
52
Although many forensic laboratories do now require newly-hired forensic science practitioners to have an undergraduate science degree, many practitioners in forensic laboratories do not have advanced degrees in a scientific discipline. 53 In addition, until 2015, there were no Ph.D. programs specific to forensic science in the United States (although such programs exist in Europe). 54 There has been very limited funding for forensic science research, especially to study the validity or reliability of these disciplines. Serious peer-reviewed forensic science journals focused on feature-comparison fields remain quite limited. As the 2009 NRC study and others have noted, fundamentally, the forensic sciences do not yet have a well-developed “research culture.” 55 Importantly, a research culture includes the principles that (1) methods must be presumed to be unreliable until their foundational validity has been established based on empirical evidence and (2) even then, scientific questioning and review of methods must continue on an ongoing basis. Notably, some forensic practitioners espouse the notion that extensive “experience” in casework can substitute for empirical studies of scientific validity. 56 Casework is not scientifically valid research, and experience alone cannot establish scientific validity.

51 Kassin, S.M., Dror, I.E., and J. Kukucka. “The forensic confirmation bias: Problems, perspectives, and proposed solutions.” Journal of Applied Research in Memory and Cognition, Vol. 2, No. 1 (2013): 42-52. See also: Krane, D.E., Ford, S., Gilder, J., Iman, K., Jamieson, A., Taylor, M.S., and W.C. Thompson. “Sequential unmasking: A means of minimizing observer effects in forensic DNA interpretation.” Journal of Forensic Sciences, Vol. 53, No. 4 (July 2008): 1006-7.
52 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 128.
53 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 223-230. See also: Cooney, L. “Latent Print Training to Competency: Is it Time for a Universal Training Program?” Journal of Forensic Identification, Vol. 60 (2010): 223–58. (“The areas where there was no consensus included degree requirements (almost a 50/50 split between agencies that required a four-year degree or higher versus those agencies that required less than a four-year degree or no degree at all.”)
54 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 223. While there are several Ph.D. programs in criminal justice, forensic psychology, forensic anthropology or programs in chemistry or related disciplines that offer a concentration in forensic science, only Sam Houston State University College of Criminal Justice offers a doctoral program in “forensic science.” See: www.shsu.edu/programs/doctorate-of-philosophy-in-forensic-science.
55 Mnookin, J.L., Cole, S.A., Dror, I.E., Fisher, B.A.J., Houck, M.M., Inman, K., Kaye, D.H., Koehler, J.J., Langenburg, G., Risinger, D.M., Rudin, N., Siegel, J., and D.A. Stoney. “The need for a research culture in the forensic sciences.” UCLA Law Review, Vol. 725 (2011): 754-8.
56 See Section 4.7.
In particular, one cannot reliably estimate error rates from casework because one typically does not have independent knowledge of the “ground truth” or “right answer.” 57 Beyond the foundational issue of scientific validity, most feature-comparison fields historically gave insufficient attention to the importance of blinding practitioners to potentially biasing information; developing objective measures of assessment and interpretation; paying careful attention to error rates and their measurement; and developing objective assessments of the meaning of an association between a sample and its potential source. 58 The 2009 NRC report stimulated some in the forensic science community to recognize these flaws. Some forensic scientists have embraced the need to place forensics on a solid scientific foundation and have undertaken initial efforts to do so. 59

2.6 State of Forensic Practice

Investigations of forensic practice have likewise unearthed problems stemming from the lack of a strong “quality culture.” Specifically, dozens of investigations of crime laboratories, primarily at the state and local level, have revealed repeated failures concerning the handling and processing of evidence and incorrect interpretation of forensic analysis results. 60 Various commentators have pointed out a fundamental issue that may underlie these serious problems: the fact that nearly all crime laboratories are closely tied to the prosecution in criminal cases. This structure undermines the greater objectivity typically found in testing laboratories in other fields and creates situations where personnel may make errors due to subtle cognitive bias or overt pressure. 61 The 2009 NRC report recommended that all public forensic laboratories and facilities be removed from the administrative control of law enforcement agencies or prosecutors’ offices. 62 For example, Houston, after disbanding its crime laboratory twice in three years, followed this recommendation and, despite significant political pushback, succeeded in transitioning the laboratory into an independent forensic science center. 63

57 See Section 4.7.
58 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 8, 124, 184-5, 188-91. See also Koppl, R., and D. Krane. “Minimizing and leveraging bias in forensic science.” In Robertson C.T., and A.S. Kesselheim (Eds.) Blinding as a solution to bias: Strengthening biomedical science, forensic science, and law. Atlanta, GA: Elsevier (2016).
59 See Section 4.8.
60 A few examples of such investigations include: (1) a 2-year independent investigation of the Houston Police Department’s crime lab that resulted in the review of 3,500 cases (Final Report of the Independent Investigator for the Houston Police Department Crime Laboratory and Property Room, prepared by Michael R. Bromwich, June 13, 2007, www.hpdlabinvestigation.org/reports/070613report.pdf); (2) the investigation and closure of the Detroit Police Crime Lab’s firearms unit following the discovery of evidence contamination and failure to properly maintain testing equipment (see Bunkley, N. “Detroit police lab is closed after audit finds serious errors in many cases.” New York Times, September 25, 2008, www.nytimes.com/2008/09/26/us/26detroit.html?_r=0); (3) a 2010 investigation of North Carolina’s State Bureau of Investigation crime laboratory that found that agents consistently withheld exculpatory evidence or distorted evidence in more than 230 cases over a 16-year period (see Swecker, C., and M. Wolf, “An Independent Review of the SBI Forensic Laboratory” images.bimedia.net/documents/SBI+Report.pdf); and (4) a 2013 review of the New York City medical examiner’s office handling of DNA evidence in more than 800 rape cases (see State of New York, Office of the Inspector General. December 2013, www.ig.ny.gov/sites/default/files/pdfs/OCMEFinalReport.pdf). One analysis estimated that at least fifty major laboratories reported fraud by analysts, evidence destruction, failed proficiency tests, misrepresenting findings in testimony, or tampering with drugs between 2005 and 2011. Twenty-eight of these labs were nationally accredited. Memorandum from Marvin Schechter to New York State Commission on Forensic Science (March 25, 2011): 243-4 (see www.americanbar.org/content/dam/aba/administrative/legal_aid_indigent_defendants/ls_sclaid_def_train_memo_schechter.authcheckdam.pdf).

2.7 National Research Council Report

The 2009 NRC report, Strengthening Forensic Science in the United States: A Path Forward, was the most comprehensive review to date of the forensic sciences in the United States. The report made clear that the types of problems, irregularities, and miscarriages of justice outlined in this report cannot simply be attributed to a handful of rogue analysts or underperforming laboratories. Instead, the report found the problems plaguing the forensic science community are systemic and pervasive, the result of factors including a high degree of fragmentation (including disparate and often inadequate training and educational requirements, resources, and capacities of laboratories); a lack of standardization of the disciplines; insufficient high-quality research and education; and a dearth of peer-reviewed studies establishing the scientific basis and validity of many routinely used forensic methods.

Shortcomings in the forensic sciences were especially prevalent among the feature-comparison disciplines. The 2009 NRC report found that many of these disciplines lacked well-defined systems for determining error rates and had not done studies to establish the uniqueness or relative rarity or commonality of the particular marks or features examined.
In addition, proficiency testing, where it had been conducted, showed instances of poor performance by specific examiners. In short, the report concluded that “much forensic evidence, including, for example, bitemarks and firearm and toolmark identifications, is introduced in criminal trials without any meaningful scientific validation, determination of error rates, or reliability testing to explain the limits of the discipline.” 64

61 The 2009 NRC Report (pp. 24-5) states, “The best science is conducted in a scientific setting as opposed to a law enforcement setting. Because forensic scientists often are driven in their work by a need to answer a particular question related to the issues of a particular case, they sometimes face pressure to sacrifice appropriate methodology for the sake of expediency.” See also: Giannelli, P.G. “Independent crime laboratories: The problem of motivational and cognitive bias.” Utah Law Review, (2010): 247-66 and Thompson, S.G. Cops in Lab Coats: Curbing Wrongful Convictions through Independent Forensic Laboratories. Carolina Academic Press (2015).
62 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): Recommendation 4, p. 24.
63 The Houston Forensic Science Center opened in April 2014, replacing the former Houston Police Department Crime Laboratory. The Center operates as a “local government corporation” with its own directors, officers, and employees. The structure was intentionally designed to insulate the Center from undue influence by police, prosecutors, elected officials, or special interest groups. See: Thompson, S.G. Cops in Lab Coats: Curbing Wrongful Convictions through Independent Forensic Laboratories. Carolina Academic Press (2015): 214.

The 2009 NRC report found that the problems plaguing the forensic sciences were so severe that they could only be addressed by “a national commitment to overhaul the current structure that supports the forensic science community in this country.” 65 Underlying the report’s 13 core recommendations was a call for leadership at the highest levels of both Federal and State governments and the promotion and adoption of a long-term agenda to pull the forensic science enterprise up from its current weaknesses.

The 2009 NRC report called for studies to test whether various forensic methods are foundationally valid, including performing empirical tests of the accuracy of the results. It also called for the creation of a new, independent Federal agency to provide needed oversight of the forensic science system; standardization of terminology used in reporting and testifying about the results of forensic sciences; the removal of public forensic laboratories from the administrative control of law enforcement agencies; implementation of mandatory certification requirements for practitioners and mandatory accreditation programs for laboratories; research on human observer bias and sources of human error in forensic examinations; the development of tools for advancing measurement, validation, reliability, and proficiency testing in forensic science; and the strengthening and development of graduate and continuous education and training programs.

2.8 Recent Progress

In response to the 2009 NRC report, the Obama Administration initiated a series of reform efforts aimed at strengthening the forensic sciences, beginning with the creation in 2009 of a Subcommittee on Forensic Science of the National Science and Technology Council’s Committee on Science that was charged with considering how best to achieve the goals of the NRC report.
The resulting activities are described in some detail below.

64 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 107-8.
65 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009).
66 See: www.justice.gov/ncfs.

National Commission on Forensic Science

In 2013, the DOJ and NIST, with support from the White House, signed a Memorandum of Understanding that outlined a framework for cooperation and collaboration between the two agencies in support of efforts to strengthen forensic science. In 2013, DOJ established a National Commission on Forensic Science (NCFS), a Federal advisory committee reporting to the Attorney General. Co-chaired by the Deputy Attorney General and the Director of NIST, the NCFS’s 32 members include seven academic scientists and five other science Ph.D.s; the other members include judges, attorneys and forensic practitioners. It is charged with providing policy recommendations to the Attorney General. 66 The NCFS issues formal recommendations to the Attorney General, as well as “views documents” that reflect a two-thirds majority view of the NCFS but do not request specific action by the Attorney General. To date, the NCFS has issued ten recommendations concerning, among other things, accreditation of forensic laboratories and certification of forensic practitioners, advancing the interoperability of fingerprint information systems, development of root cause analysis protocols for forensic service providers, and enhancing communications among medical-examiner and coroner offices. 67 The Attorney General has formally adopted the first set of recommendations on accreditation 68 and has directed the Department to begin to take steps toward addressing some of the other recommendations put forward to date.
69 In 2014, NIST established the Organization of Scientific Area Committees (OSAC), a collaborative body of more than 600 volunteer members largely drawn from the forensic science community. 70 OSAC was established to support the development of voluntary standards and guidelines for consideration by the forensic practitioner community. 71 The structure consists of five Scientific Area Committees (SACs) and 25 subcommittees that work to develop standards, guidelines, and codes of practice for each of the forensic science disciplines and methodologies. 72 Three overarching resource committees provide guidance on questions of law, human factors, and quality assurance. All documents developed by the SACs are approved by a Forensic Science Standards Board (FSSB), a component of the OSAC structure, for listing on the OSAC Registry of Approved Standards. OSAC is not a Federal advisory committee.

67 For a full list of documents approved by NCFS, see www.justice.gov/ncfs/work-products-adopted-commission.
68 Department of Justice. “Justice Department announces new accreditation policies to advance forensic science.” (December 7, 2015, press release). www.justice.gov/opa/pr/justice-department-announces-new-accreditation-policies-advance-forensic-science.
69 Memorandum from the Attorney General to Heads of Department Components Regarding Recommendations of the National Commission on Forensic Science, March 17, 2016. www.justice.gov/ncfs/file/841861/download.
70 Members include forensic science practitioners and other experts who represent local, State, and Federal agencies; academia; and industry.
71 For more information see: www.nist.gov/forensics/osac.cfm.
72 The five Scientific Area Committees under OSAC are: Biology/DNA, Chemistry/Instrumental Analysis, Crime Scene/Death Investigation, Digital/Multimedia, and Physics/Pattern Interpretation (www.nist.gov/forensics/upload/OSAC-Block-OrgChart-3-17-2015.pdf).

Federal Funding Of Research

The Federal government has also taken steps to address one factor contributing to the problems with forensic science—the lack of a robust and rigorous scientific research community in many disciplines in forensic science. While there are multiple reasons for the absence of such a research community, one reason is that, unlike most scientific disciplines, there has been too little funding to attract and sustain a substantial cadre of excellent scientists focused on fundamental research in forensic science.

The National Science Foundation (NSF) has recently begun efforts to help address this foundational shortcoming of forensic science. In 2013, NSF signaled its interest in this area and encouraged researchers to submit research proposals addressing fundamental questions that might advance knowledge and education in the forensic sciences. 73 As a result of an interagency process led by OSTP, NSF, in collaboration with the National Institute of Justice (NIJ), invited proposals for the creation of new, multi-disciplinary research centers for funding in 2014. 74 Based on our review of grant abstracts, PCAST estimates that NSF commits a total of approximately $4.5 million per year in support for extramural research projects on foundational forensic science.

NIST has also taken steps to address this issue by creating a new Forensic Science Center of Excellence, called the Center for Statistics and Applications in Forensic Evidence (CSAFE), that will focus its research efforts on improving the statistical foundation for latent prints, ballistics, tiremarks, handwriting, bloodstain patterns, toolmarks, pattern evidence analyses, and for computer and information systems, mobile devices, network traffic, social media, and GPS digital evidence analyses.
75 CSAFE is funded under a cooperative agreement with Iowa State University, to set up a center in partnership with investigators at Carnegie Mellon University, the University of Virginia, and the University of California, Irvine; the total support is $20 million over five years. PCAST estimates that NIST commits a total of approximately $5 million per year in support for extramural research projects on foundational forensic science, consisting of approximately $4 million to CSAFE and approximately $1 million to other projects.

NIJ has no budget allocated specifically for forensic science research. In order to support research activities, NIJ must draw from its base funding, funding from the Office of Justice Programs’ assistance programs for research and statistics, or from the DNA backlog reduction programs. 76 Most of its research support is directed to applied research. Although it is difficult to classify NIJ’s research projects, we estimate that NIJ commits a total of approximately $4 million per year to support extramural research projects on fundamental forensic science. 77

Even with the recent increases, the total extramural funding for fundamental research in forensic science across NSF, NIST, and NIJ is thus likely to be in the range of only $13.5 million per year.

73 See: Dear Colleague Letter: Forensic Science – Opportunity for Breakthroughs in Fundamental and Basic Research and Education. www.nsf.gov/pubs/2013/nsf13120/nsf13120.jsp.
74 The centers NSF is proposing to create are Industry/University Cooperative Research Centers (I/UCRCs). I/UCRCs are collaborative by design and could be effective in helping to bridge the scientific and cultural gap between academic researchers who work in forensics-relevant fields of science and forensic practitioners. www.nsf.gov/pubs/2014/nsf14066/nsf14066.pdf.
75 National Institute of Standards and Technology. “New NIST Center of Excellence to Improve Statistical Analysis of Forensic Evidence.” (2015). www.nist.gov/forensics/center-excellence-forensic052615.cfm.
76 National Academies of Sciences, Engineering, and Medicine. Support for Forensic Science Research: Improving the Scientific Role of the National Institute of Justice. The National Academies Press. Washington DC. (2015). According to the report, “Congressional appropriations to support NIJ’s research programs declined during the early to mid-2000s and remain insufficient, especially in light of the growing challenges facing the forensic science community…With limited base funding, NIJ funds research and development from the appropriations for DNA backlog reduction programs and other assistance programs. These carved-out funds are essentially supporting NIJ’s current forensic science portfolio, but there are pressures to limit the amount used for research from these programs. In the past 3 years, funding for these assistance programs has declined; therefore, funds available for research have also been reduced.”
77 U.S. Department of Justice, National Institute of Justice. “Report Forensic Science: Fiscal Year 2015 Funding for DNA Analysis, Capacity Enhancement and Other Forensic Activities.” 2016.

The 2009 NRC report found that

Forensic science research is [overall] not well supported. . . . Relative to other areas of science, the forensic science disciplines have extremely limited opportunities for research funding. Although the FBI and NIJ have supported some research in the forensic science disciplines, the level of support has been well short of what is necessary for the forensic science community to establish strong links with a broad base of research universities and the national research community. Moreover, funding for academic research is limited . . . , which can inhibit the pursuit of more fundamental scientific questions essential to establishing the foundation of forensic science.
Finally, the broader research community generally is not engaged in conducting research relevant to advancing the forensic science disciplines. 78

A 2015 NRC report, Support for Forensic Science Research: Improving the Scientific Role of the National Institute of Justice, found that the status of forensic science research funding has not improved much since the 2009 NRC report. 79

In addition, the Defense Forensic Science Center has recently begun to support extramural research spanning the forensic science disciplines as part of its mission to provide specialized forensic and biometric research capabilities and support to the Department of Defense. Redesignated as DFSC in 2013, the Center was formerly the U.S. Army Criminal Investigation Laboratory, originally charged with supporting criminal investigations within the military but additionally tasked in 2007 with providing an “enduring expeditionary forensics capability,” in response in part to the need to investigate and prosecute explosives attacks in Iraq and Afghanistan. While the bulk of DFSC support has traditionally supported research in DNA analysis and biochemistry, the Center has recently directed resources toward projects to address critical foundational gaps in other disciplines, including firearms and latent print analysis.

Notably, DFSC has helped stimulate research in the forensic science community. Discussions between DFSC and the American Society of Crime Lab Directors (ASCLD) led ASCLD to host a meeting in 2011 to identify research priorities for the forensic science community. DFSC agreed to fund two foundational studies to address the highest priority research needs identified by the Forensic Research Committee of ASCLD: the first independent “black-box” study on firearms analysis and a DNA mixture interpretation study (see Chapter 5). In FY 2015, DFSC allocated approximately $9.2 million to external forensic science research. Seventy-five percent of DFSC’s funding supported projects with regard to DNA/biochemistry; 9 percent digital evidence; 8 percent non-DNA pattern evidence; and 8 percent chemistry. 80 As is the case for NIJ, there is no line item in DFSC’s budget dedicated to forensic science research; DFSC instead must solicit funding from multiple sources within the Department of Defense to support this research.

78 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 78.
79 National Academies of Sciences, Engineering, and Medicine. Support for Forensic Science Research: Improving the Scientific Role of the National Institute of Justice. The National Academies Press. Washington DC. (2015): 15.
80 Defense Forensic Science Center, Office of the Chief Scientist, Annual Research Portfolio Report, January 5, 2016.

A Critical Gap: Scientific Validity

The Administration has taken important and much needed initial steps by creating mechanisms to discuss policy, develop best practices for practitioners of specific methods, and support scientific research. At the same time, work to date has not addressed the 2009 NRC report’s call to examine the fundamental scientific validity and reliability of many forensic methods used every day in courts. The remainder of our report focuses on that issue.

3. The Role of Scientific Validity in the Courts

The central focus of this report is the scientific validity of forensic-science evidence—more specifically, evidence from scientific methods for comparison of features (in, for example, DNA, latent fingerprints, bullet marks and other items). The reliability of methods for interpreting evidence is a fundamental consideration throughout science. Accordingly, every scientific field has a well-developed, domain-specific understanding of what scientific validity of methods entails.
The concept of scientific validity also plays an important role in the legal system. In particular, as noted in Chapter 1, the Federal Rules of Evidence require that expert testimony about forensic science must be the product of “reliable principles and methods” that have been “reliably applied . . . to the facts of the case.”

This report explicates the scientific criteria for scientific validity in the case of forensic feature-comparison methods, for use both within the legal system and by those working to strengthen the scientific underpinnings of those disciplines. Before delving into that scientific explication, we provide in this chapter a very brief summary, aimed principally at scientists and lay readers, of the relevant legal background and terms, as well as the nature of this intersection between law and science.

3.1 Evolution of Admissibility Standards

Over the course of the 20th century, the legal system’s approach for determining the admissibility of scientific evidence has evolved in response to advances in science. In 1923, in Frye v. United States, 81 the Court of Appeals for the District of Columbia considered the admissibility of testimony concerning results of a purported “lie detector,” a systolic-blood-pressure deception test that was a precursor to the polygraph machine. After describing the device and its operation, the Court rejected the testimony, stating:

[W]hile courts will go a long way in admitting expert testimony deduced from a well-recognized scientific principle or discovery, the thing from which the deduction is made must be sufficiently established to have gained general acceptance in the particular field in which it belongs. 82

The court found that the systolic test had “not yet gained such standing and scientific recognition among physiological and psychological authorities,” and was therefore inadmissible.
More than a half-century later, the Federal Rules of Evidence were enacted into law in 1975 to guide criminal and civil litigation in Federal courts. Rule 702, in its original form, stated that:

If scientific, technical, or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training, or education, may testify thereto in the form of an opinion or otherwise. 83

81 Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).
82 Ibid., 1014.

There was considerable debate among litigants, judges, and legal scholars as to whether the rule embraced the Frye standard or established a new standard. 84 In 1993, the United States Supreme Court sought to resolve these questions in its landmark ruling in Daubert v. Merrell Dow Pharmaceuticals. In interpreting Rule 702, the Daubert Court held that the Federal Rules of Evidence superseded Frye as the standard for admissibility of expert evidence in Federal courts. The Court rejected “general acceptance” as the standard for admissibility and instead held that the admissibility of scientific expert testimony depended on its scientific reliability. Where Frye told judges to defer to the judgment of the relevant expert community, Daubert assigned trial court judges the role of “gatekeepers” charged with ensuring that expert testimony “rests on reliable foundation.” 85

The Court stated that “the trial judge must determine . . . whether the reasoning or methodology underlying the testimony is scientifically valid.” 86 It identified five factors that a judge should, among others, ordinarily consider in evaluating the validity of an underlying methodology.
These factors are: (1) whether the theory or technique can be (and has been) tested; (2) whether the theory or technique has been subjected to peer review and publication; (3) the known or potential rate of error of a particular scientific technique; (4) the existence and maintenance of standards controlling the technique’s operation; and (5) a scientific technique’s degree of acceptance within a relevant scientific community. The Daubert court also noted that judges evaluating proffers of expert scientific testimony should be mindful of other applicable rules, including:

• Rule 403, which permits the exclusion of relevant evidence “if its probative value is substantially outweighed by the danger of unfair prejudice, confusion of the issues, or misleading the jury…” (noting that expert evidence can be “both powerful and quite misleading because of the difficulty in evaluating it.”); and
• Rule 706, which allows the court at its discretion to procure the assistance of an expert of its own choosing. 87

83 Act of January 2, 1975, Pub. Law No. 93-595, 88 Stat. 1926 (1975). See: federalevidence.com/pdf/FRE_Amendments/1975_Orig_Enact/1975-Pub.L._93-595_FRE.pdf.
84 See: Giannelli, P.C. “The admissibility of novel scientific evidence: Frye v. United States, a half-century later.” Columbia Law Review, Vol. 80, No. 6 (1980); McCabe, J. “DNA fingerprinting: The failings of Frye,” Northern Illinois University Law Review, Vol. 16 (1996): 455-82; and Page, M., Taylor, J., and M. Blenkin. “Forensic identification science evidence since Daubert: Part II—judicial reasoning in decisions to exclude forensic identification evidence on grounds of reliability.” Journal of Forensic Sciences, Vol. 56, No. 4 (2011): 913-7.
85 Daubert, at 597.
86 Daubert, at 580. See also, FN9 (“In a case involving scientific evidence, evidentiary reliability will be based on scientific validity.” [emphasis in original]).
87 Daubert, at 595, citing Weinstein, 138 F.R.D., at 632.

Congress amended Rule 702 in 2000 to make it more precise, and made further stylistic changes in 2011. In its current form, Rule 702 imposes four requirements:

A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if:
(a) the expert’s scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;
(b) the testimony is based on sufficient facts or data;
(c) the testimony is the product of reliable principles and methods; and
(d) the expert has reliably applied the principles and methods to the facts of the case.

An Advisory Committee’s Note to Rule 702 also specified a number of reliability factors that supplement the five factors enumerated in Daubert. Among those factors is “whether the field of expertise claimed by the expert is known to reach reliable results.” 88,89

Many states have adopted rules of evidence that track key aspects of these federal rules. Such rules are now the law in over half of the states, while other states continue to follow the Frye standard or variations of it. 90

3.2 Foundational Validity and Validity as Applied

As described in Daubert, the legal system envisions an important conversation between law and science:

“The [judge’s] inquiry envisioned by Rule 702 is, we emphasize, a flexible one. Its overarching subject is the scientific validity—and thus the evidentiary relevance and reliability—of the principles that underlie a proposed submission.” 91

88 See: Fed. R. Evid. 702 Advisory Committee note (2000). The following factors may be relevant under Rule 702: whether the underlying research was conducted independently of litigation; whether the expert unjustifiably extrapolated from an accepted premise to an unfounded conclusion; whether the expert has adequately accounted for obvious alternative explanations; whether the expert was as careful as she would be in her professional work outside of paid litigation; and whether the field of expertise claimed by the expert is known to reach reliable results [emphasis added].
89 This note has been pointed to as support for efforts to challenge entire fields of forensic science, including fingerprints and hair comparisons. See: Giannelli, P.C. “The Supreme Court’s ‘Criminal’ Daubert Cases.” Seton Hall Law Review, Vol. 33 (2003): 1096.
90 Even under the Frye formulation, the views of scientists about the meaning of reliability are relevant. Frye requires that a scientific technique or method must “have general acceptance” in the relevant scientific community to be admissible. As a scientific matter, the relevant scientific community for assessing the reliability of feature-comparison sciences includes metrologists (including statisticians) as well as other physical and life scientists from disciplines on which the specific methods are based. Importantly, the community is not limited to forensic scientists who practice the specific method. For example, the Frye court evaluated whether the proffered lie detector had gained “standing and scientific recognition among physiological and psychological authorities,” rather than among lie detector experts. Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).
91 Daubert, at 594.

Legal and scientific considerations thus both play important roles.

(1) The admissibility of expert testimony depends on a threshold test of, among other things, whether it meets certain legal standards embodied in Rule 702. These decisions about admissibility are exclusively the province of the courts.
(2) Yet, as noted above, the overarching subject of the judge’s inquiry under Rule 702 is “scientific validity.” It is the proper province of the scientific community to provide guidance concerning scientific standards for scientific validity. PCAST does not opine here on the legal standards, but seeks only to clarify the scientific standards that underlie them.

For complete clarity about our intent, we have adopted specific terms to refer to the scientific standards for two key types of scientific validity, which we mean to correspond, as scientific standards, to the legal standards in Rule 702(c) and (d): (1) by “foundational validity,” we mean the scientific standard corresponding to the legal standard of evidence being based on “reliable principles and methods,” and (2) by “validity as applied,” we mean the scientific standard corresponding to the legal standard of an expert having “reliably applied the principles and methods.”

In the next chapter, we turn to discussing the scientific standards for these concepts. We close this chapter by noting that answering the question of scientific validity in the forensic disciplines is important not just for the courts but also because it sets quality standards that ripple out throughout these disciplines—affecting practice and defining necessary research.

4. Scientific Criteria for Validity and Reliability of Forensic Feature-Comparison Methods

In this report, PCAST has chosen to focus on defining the validity and reliability of one specific area within forensic science: forensic feature-comparison methods. We have done so because it is both possible and important to do so for this particular class of methods.

• It is possible because feature comparison is a common scientific activity, and science has clear standards for determining whether such methods are reliable. In particular, feature-comparison methods belong squarely to the discipline of metrology—the science of measurement and its application.
92,93

• It is important because it has become apparent, over the past decade, that faulty forensic feature comparison has led to numerous miscarriages of justice. 94 It has also been revealed that the problems are not due simply to poor performance by a few practitioners, but rather to the fact that the reliability of many forensic feature-comparison methods has never been meaningfully evaluated. 95

92 International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (VIM 3rd edition) JCGM 200 (2012).
93 That forensic feature-comparison methods belong to the field of metrology is clear from the fact that NIST—whose mission is to assist the Nation by “advancing measurement science, standards and technology,” and which is the world’s leading metrological laboratory—is the home within the Federal government for research efforts on forensic science. NIST’s programs include internal research, extramural research funding, conferences, and preparation of reference materials and standards. See: www.nist.gov/public_affairs/mission.cfm and www.nist.gov/forensics/index.cfm. Forensic feature-comparison methods involve determining whether two sets of features agree within a given measurement tolerance.
94 DNA-based re-examination of past cases has led so far to the exonerations of 342 defendants, including 20 who had been sentenced to death, and to the identification of 147 real perpetrators. See: Innocence Project, “DNA Exonerations in the United States.” www.innocenceproject.org/dna-exonerations-in-the-united-states. Reviews of these cases have revealed that roughly half relied in part on expert testimony that was based on methods that had not been subjected to meaningful scientific scrutiny or that included scientifically invalid claims of accuracy. See: Gross, S.R., and M. Shaffer. “Exonerations in the United States, 1989-2012.” National Registry of Exonerations, (2012) available at: www.law.umich.edu/special/exoneration/Documents/exonerations_us_1989_2012_full_report.pdf; Garrett, B.L., and P.J. Neufeld. “Invalid forensic science testimony and wrongful convictions.” Virginia Law Review, Vol. 95, No. 1 (2009): 1-97; National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 42-3. The nature of the issues is illustrated by specific examples described in the materials cited: Levon Brooks and Kennedy Brewer, each convicted of separate child murders in the 1990s almost entirely on the basis of bitemark analysis testimony, spent more than 13 years in prison before DNA testing identified the actual perpetrator, who confessed to both crimes; Santae Tribble, convicted of murder after an FBI analyst testified that hair from a stocking mask linked Tribble to the crime and “matched in all microscopic characteristics,” spent more than 20 years in prison before DNA testing revealed that none of the 13 hairs belonged to Tribble and that one came from a dog; Jimmy Ray Bromgard of Montana served 15 years in prison for rape before DNA testing showed that hairs collected from the victim’s bed and reported as a match to Bromgard’s could not have come from him; Stephan Cowans, convicted of shooting a Boston police officer after two fingerprint experts testified that a thumbprint left by the perpetrator was “unique and identical,” spent more than 5 years in prison before DNA testing on multiple items of evidence excluded him as the perpetrator; and Steven Barnes of upstate New York served 20 years in prison for a rape and murder he did not commit after a criminalist testified that a photographic overlay of fabric from the victim’s jeans and an imprint on Barnes’ truck showed patterns that were “similar” and hairs collected from the truck were similar to the victim’s hairs.

Compared to many types of expert testimony, testimony based on forensic feature-comparison methods poses unique dangers of misleading jurors for two reasons:

• The vast majority of jurors have no independent ability to interpret the probative value of results based on the detection, comparison, and frequency of scientific evidence. If matching halves of a ransom note were found at a crime scene and at a defendant’s home, jurors could rely on their own experiences to assess how unlikely it is that two torn scraps would match if they were not in fact from a single original note. If a witness were to describe a perpetrator as “tall and bushy haired,” jurors could make a reasonable judgment of how many people might match the description. But, if an expert witness were to say that, in two DNA samples, the third exon of the DYNC1H1 gene is precisely 174 nucleotides in length, most jurors would have no way to know if they should be impressed by the coincidence; they would be completely dependent on expert statements garbed in the mantle of science. (As it happens, they should not be impressed by the preceding statement: At the DNA locus cited, more than 99.9 percent of people have a fragment of the indicated size. 96)

• The potential prejudicial impact is unusually high, because jurors are likely to overestimate the probative value of a “match” between samples. Indeed, the DOJ itself historically overestimated the probative value of matches in its longstanding contention, now acknowledged to be inappropriate, that latent fingerprint analysis was “infallible.” 97 Similarly, a former head of the FBI’s fingerprint unit testified that the FBI had “an error rate of one per every 11 million cases.” 98 In an online experiment, researchers asked mock jurors to estimate the frequency that a qualified, experienced forensic scientist would mistakenly conclude that two samples of specified types came from the same person when they actually came from two different people. The mock jurors believed such errors are likely to occur about 1 in 5.5 million for fingerprint analysis comparison; 1 in 1 million for bitemark comparison; 1 in 1 million for hair comparison; and 1 in 100 thousand for handwriting comparison. 99 While precise error rates are not known for most of these techniques, all indications point to the actual error rates being orders of magnitude higher. For example, the FBI’s own studies of latent fingerprint analysis point to error rates in the range of one in several hundred. 100 (Because the term “match” is likely to imply an inappropriately high probative value, a more neutral term should be used for an examiner’s belief that two samples come from the same source. We suggest the term “proposed identification” to appropriately convey the examiner’s conclusion, along with the possibility that it might be wrong. We will use this term throughout this report.)

95 See: Chapter 5.
96 See: ExAC database: exac.broadinstitute.org/gene/ENSG00000197102.
97 See: www.justice.gov/olp/file/861906/download.
98 U.S. v. Baines 573 F.3d 979 (2009) at 984.
99 Koehler, J.J. “Intuitive error rate estimates for the forensic sciences.” (August 2, 2016). Available at papers.ssrn.com/sol3/papers.cfm?abstract_id=2817443.
100 See: Section 5.4.

This chapter lays out PCAST’s conclusions concerning the scientific criteria for scientific validity. The conclusions are based on the fundamental principles of the “scientific method”—applicable throughout science—that valid scientific knowledge can only be gained through empirical testing of specific propositions. 101 PCAST’s conclusions in the chapter might be briefly summarized as follows:

Scientific validity and reliability require that a method has been subjected to empirical testing, under conditions appropriate to its intended use, that provides valid estimates of how often the method reaches an incorrect conclusion.
For subjective feature-comparison methods, appropriately designed black-box studies are required, in which many examiners render decisions about many independent tests (typically, involving “questioned” samples and one or more “known” samples) and the error rates are determined. Without appropriate estimates of accuracy, an examiner’s statement that two samples are similar—or even indistinguishable—is scientifically meaningless: it has no probative value, and considerable potential for prejudicial impact. Nothing—not training, personal experience nor professional practices—can substitute for adequate empirical demonstration of accuracy.

The chapter is organized as follows:

• The first section describes the distinction between two fundamentally different types of feature-comparison methods: objective methods and subjective methods.
• The next five sections discuss the scientific criteria for the two types of scientific validity: foundational validity and validity as applied.
• The final two sections discuss views held in the forensic community.

4.1 Feature-Comparison Methods: Objective and Subjective Methods

A forensic feature-comparison method is a procedure by which an examiner seeks to determine whether an evidentiary sample (e.g., from a crime scene) is or is not associated with a source sample (e.g., from a suspect) 102 based on similar features. The evidentiary sample might be DNA, hair, fingerprints, bitemarks, toolmarks, bullets, tire tracks, voiceprints, visual images, and so on. The source sample would be biological material or an item (tool, gun, shoe, or tire) associated with the suspect.

101 For example, the Oxford Online Dictionary defines the scientific method as “a method or procedure that has characterized the natural sciences since the 17th century, consisting in systematic observation, measurement, and experimentation, and the formulation, testing, and modification of hypotheses.” “Scientific method” Oxford Dictionaries Online. Oxford University Press (accessed on August 19, 2016).
102 A “source sample” refers to a specific individual or object (e.g., a tire or gun).

Feature-comparison methods may be classified as either objective or subjective. By objective feature-comparison methods, we mean methods consisting of procedures that are each defined with enough standardized and quantifiable detail that they can be performed by either an automated system or human examiners exercising little or no judgment. By subjective methods, we mean methods including key procedures that involve significant human judgment—for example, about which features to select or how to determine whether the features are sufficiently similar to be called a proposed identification.

Objective methods are, in general, preferable to subjective methods. Analyses that depend on human judgment (rather than a quantitative measure of similarity) are obviously more susceptible to human error, bias, and performance variability across examiners. 103 In contrast, objective, quantified methods tend to yield greater accuracy, repeatability and reliability, including reducing variation in results among examiners. Subjective methods can evolve into or be replaced by objective methods. 104

4.2 Foundational Validity: Requirement for Empirical Studies

For a metrological method to be scientifically valid and reliable, the procedures that comprise it must be shown, based on empirical studies, to be repeatable, reproducible, and accurate, at levels that have been measured and are appropriate to the intended application. 105,106

BOX 2. Definition of key terms

By “repeatable,” we mean that, with known probability, an examiner obtains the same result, when analyzing samples from the same sources.
By “reproducible,” we mean that, with known probability, different examiners obtain the same result, when analyzing the same samples.
By “accurate,” we mean that, with known probabilities, an examiner obtains correct results both (1) for samples from the same source (true positives) and (2) for samples from different sources (true negatives).
By “reliability,” we mean repeatability, reproducibility, and accuracy. 107

103 Dror, I.E. “A hierarchy of expert performance.” Journal of Applied Research in Memory and Cognition, Vol. 5 (2016): 121–127.
104 For example, before the development of objective tests for intoxication, courts had to rely exclusively on the testimony of police officers and others who in turn relied on behavioral indications of drunkenness and the presence of alcohol on the breath. The development of objective chemical tests drove a change from subjective to objective standards.
105 National Physical Laboratory. “A Beginner’s Guide to Measurement.” (2010) available at: www.npl.co.uk/upload/pdf/NPL-Beginners-Guide-to-Measurement.pdf; Pavese, F. “An Introduction to Data Modelling Principles in Metrology and Testing.” in Data Modeling for Metrology and Testing in Measurement Science, Pavese, F. and A.B. Forbes (Eds.) Birkhäuser (2009).
106 Feature-comparison methods that get the wrong answer too often have, by definition, low probative value. As discussed above, the prejudicial impact will thus be likely to outweigh the probative value.
107 We note that “reliability” also has a narrow meaning within the field of statistics referring to “consistency”—that is, the extent to which a method produces the same result, regardless of whether the result is accurate. This is not the sense in which “reliability” is used in this report, or in the law.

By “scientific validity,” we mean that a method has been shown, based on empirical studies, to be reliable with levels of repeatability, reproducibility, and accuracy that are appropriate to the intended application.
By an “empirical study,” we mean a test in which a method has been used to analyze a large number of independent sets of samples, similar in relevant aspects to those encountered in casework, in order to estimate the method’s repeatability, reproducibility, and accuracy.
By a “black-box study,” we mean an empirical study that assesses a subjective method by having examiners analyze samples and render opinions about the origin or similarity of samples.

The method need not be perfect, but it is clearly essential that its accuracy has been measured based on appropriate empirical testing and is high enough to be appropriate to the application. Without an appropriate estimate of its accuracy, a metrological method is useless—because one has no idea how to interpret its results. The importance of knowing a method’s accuracy was emphasized by the 2009 NRC report on forensic science and by a 2010 NRC report on biometric technologies. 108

To meet the scientific criteria of foundational validity, two key elements are required:
(1) a reproducible and consistent procedure for (a) identifying features within evidence samples; (b) comparing the features in two samples; and (c) determining, based on the similarity between the features in two samples, whether the samples should be declared to be a proposed identification (“matching rule”).
(2) empirical measurements, from multiple independent studies, of (a) the method’s false positive rate—that is, the probability it declares a proposed identification between samples that actually come from different sources and (b) the method’s sensitivity—that is, the probability that it declares a proposed identification between samples that actually come from the same source.

We discuss these elements in turn.

Reproducible and Consistent Procedures

For a method to be objective, each of the three steps (feature identification, feature comparison, and matching rule) should be precisely defined, reproducible and consistent.
Forensic examiners should identify relevant features in the same way and obtain the same result. They should compare features in the same quantitative manner. To declare a proposed identification, they should calculate whether the features in an evidentiary sample and the features in a sample from a suspected source lie within a pre-specified measurement tolerance (matching rule). 109 For an objective method, one can establish the foundational validity of each of the individual steps by measuring its accuracy, reproducibility, and consistency.

108 “Biometric recognition is an inherently probabilistic endeavor…Consequently, even when the technology and the system it is embedded in are behaving as designed, there is inevitable uncertainty and risk of error.” National Research Council, “Biometric Recognition: Challenges and Opportunities.” The National Academies Press. Washington DC. (2010): viii-ix.

For subjective methods, procedures must still be carefully defined—but they involve substantial human judgment. For example, different examiners may recognize or focus on different features, may attach different importance to the same features, and may have different criteria for declaring proposed identifications. Because the procedures for feature identification, the matching rule, and frequency determinations about features are not objectively specified, the overall procedure must be treated as a kind of “black box” inside the examiner’s head.

Subjective methods require careful scrutiny; more generally, their heavy reliance on human judgment means that they are especially vulnerable to human error, inconsistency across examiners, and cognitive bias. In the forensic feature-comparison disciplines, cognitive bias includes the phenomena that, in certain settings, humans (1) may tend naturally to focus on similarities between samples and discount differences and (2) may also be influenced by extraneous information and external pressures about a case.
110 (The latter issues are illustrated by the FBI’s misidentification of a latent fingerprint in the Madrid train bombing, discussed on p. 9.)

Since the black box in the examiner’s head cannot be examined directly for its foundational basis in science, the foundational validity of subjective methods can be established only through empirical studies of examiners’ performance to determine whether they can provide accurate answers; such studies are referred to as “black-box” studies (Box 2). In black-box studies, many examiners are presented with many independent comparison problems—typically, involving “questioned” samples and one or more “known” samples—and asked to declare whether the questioned samples came from the same source as one of the known samples. 111 The researchers then determine how often examiners reach erroneous conclusions.

109 If a source is declared not to share the same features, it is “excluded” by the test. The matching rule should be chosen carefully. If the “matching rule” is chosen to be too strict, samples that actually come from the same source will be declared a non-match (false negative). If it is too lax, then the method will not have much discriminatory power because the random match probability will be too high (false positive).
110 See, for example: Boroditsky, L. “Comparison and the development of knowledge.” Cognition, Vol. 102 (2007): 118–128; Hassin, R. “Making features similar: comparison processes affect perception.” Psychonomic Bulletin & Review, Vol. 8 (2001): 728–31; Medin, D.L., Goldstone, R.L., and D. Gentner. “Respects for similarity.” Psychological Review, Vol. 100 (1993): 254–78; Tversky, A. “Features of similarity.” Psychological Review, Vol. 84 (1977): 327–52; Kim, J., Novemsky, N., and R. Dhar. “Adding small differences can increase similarity and choice.” Psychological Science, Vol. 24 (2012): 225–9; Larkey, L.B., and A.B. Markman. “Processes of similarity judgment.” Cognitive Science, Vol. 29 (2005): 1061–76; Medin, D.L., Goldstone, R.L., and A.B. Markman. “Comparison and choice: Relations between similarity processes and decision processes.” Psychonomic Bulletin and Review, Vol. 2 (1995): 1–19; Goldstone, R. L. “The role of similarity in categorization: Providing a groundwork.” Cognition, Vol. 52 (1994): 125–57; Nosofsky, R. M. “Attention, similarity, and the identification-categorization relation.” Journal of Experimental Psychology, General, Vol. 115 (1986): 39–57.
111 Answers may be expressed in such terms as “match/no match/inconclusive” or “identification/exclusion/inconclusive.”

As an excellent example, the FBI recently conducted a black-box study of latent fingerprint analysis, involving 169 examiners and 744 fingerprint pairs, and published the results of the study in a leading scientific journal. 112 (Some forensic scientists have cautioned that too much attention to the subjective aspects of forensic methods—such as studies of cognitive bias and black-box studies—might distract from the goal of improving knowledge about the objective features of the forensic evidence and developing truly objective methods. 113 Others have noted that this is not currently a problem, because current efforts and funding to address the challenges associated with subjective forensic methods are very limited. 114)

Empirical Measurements of Accuracy

It is necessary to have appropriate empirical measurements of a method’s false positive rate and the method’s sensitivity. As explained in Appendix A, it is necessary to know these two measures to assess the probative value of a method. The false positive rate is the probability that the method declares a proposed identification between samples that actually come from different sources. For example, a false positive rate of 5 percent means that two samples from different sources will (due to limitations of the method) be incorrectly declared to come from the same source 5 percent of the time.
(The quantity equal to one minus the false positive rate—95 percent, in the example—is referred to as the specificity.)

The method’s sensitivity is the probability that the method declares a proposed identification between samples that actually come from the same source. For example, a sensitivity of 90 percent means two samples from the same source will be declared to come from the same source 90 percent of the time, and declared to come from different sources 10 percent of the time. (The latter quantity is referred to as the false negative rate.)

The false positive rate is especially important because false positive results can lead directly to wrongful convictions. 115 In some circumstances, it may be possible to estimate a false positive rate related to specific features of the evidence in the case. (For example, the random match probability calculated in DNA analysis depends in part on the specific genotype seen in an evidentiary sample. The false positive rate for latent fingerprint analysis may depend on the quality of the latent print.) For other feature-comparison methods, it may be only possible to make an overall estimate of the average false positive rate across samples.

For objective methods, the false positive rate is composed of two distinguishable sources—coincidental matches (where samples from different sources nonetheless have features that fall within the tolerance of the objective matching rule) and human/technical failures (where samples have features that fall outside the matching rule, but where a proposed identification was nonetheless declared due to a human or technical failure). For objective methods where the probability of coincidental match is very low (such as DNA analysis), the false positive rate in application in a given case will be dominated by the rate of human/technical failures—which may well be hundreds of times larger.

For subjective methods, both types of error—coincidental matches and human/technical failures—occur as well, but, without an objective “matching rule,” the two sources cannot be distinguished. In establishing foundational validity, it is thus essential to perform black-box studies that empirically measure the overall error rate across many examiners. (See Box 3 concerning the word “error.”)

112 Ulery, B.T., Hicklin, R.A., Buscaglia, J., and M.A. Roberts. “Accuracy and reliability of forensic latent fingerprint decisions.” Proceedings of the National Academy of Sciences, Vol. 108, No. 19 (2011): 7733-8.
113 Champod, C. “Research focused mainly on bias will paralyse forensic science.” Science & Justice, Vol. 54 (2014): 107–9.
114 Risinger, D.M., Thompson, W.C., Jamieson, A., Koppl, R., Kornfield, I., Krane, D., Mnookin, J.L., Rosenthal, R., Saks, M.J., and S.L. Zabell. “Regarding Champod, editorial: ‘Research focused mainly on bias will paralyse forensic science.’” Science and Justice, Vol. 54 (2014): 508-9.
115 See footnote 94, p. 44. Under some circumstances, false-negative results can contribute to wrongful convictions as well.

BOX 3. The meanings of “error”
The term “error” has differing meanings in science and law, which can lead to confusion. In legal settings, the term “error” often implies fault—e.g., that a person has made a mistake that could have been avoided if he or she had properly followed correct procedures or a machine has given an erroneous result that could have been avoided if it had been properly calibrated. In science, the term “error” also includes the situation in which the procedure itself, when properly applied, does not yield the correct answer owing to chance occurrence. When one applies a forensic feature-comparison method with the goal of assessing whether two samples did or did not come from the same source, coincidental matches and human/technical failures are both regarded, from a statistical point of view, as “errors” because both can lead to incorrect conclusions.
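The accuracy measures defined in this section (false positive rate, specificity, sensitivity, and false negative rate) can be illustrated with a short calculation. The following is a minimal sketch, not a procedure taken from the report: the class and function names are hypothetical, and the counts are invented solely so that the results reproduce the 5 percent false positive rate and 90 percent sensitivity used as examples in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparisonCounts:
    """Hypothetical tally of decisions from a study where ground truth is known."""
    true_positives: int   # same-source pairs declared a proposed identification
    false_negatives: int  # same-source pairs declared to come from different sources
    false_positives: int  # different-source pairs declared a proposed identification
    true_negatives: int   # different-source pairs declared to come from different sources


def sensitivity(c: ComparisonCounts) -> float:
    """Share of same-source pairs declared a proposed identification."""
    return c.true_positives / (c.true_positives + c.false_negatives)


def false_negative_rate(c: ComparisonCounts) -> float:
    """Share of same-source pairs declared to come from different sources."""
    return c.false_negatives / (c.true_positives + c.false_negatives)


def false_positive_rate(c: ComparisonCounts) -> float:
    """Share of different-source pairs declared a proposed identification."""
    return c.false_positives / (c.false_positives + c.true_negatives)


def specificity(c: ComparisonCounts) -> float:
    """One minus the false positive rate."""
    return c.true_negatives / (c.false_positives + c.true_negatives)


# Illustrative counts only; these are not data from any study.
example = ComparisonCounts(
    true_positives=90, false_negatives=10,
    false_positives=5, true_negatives=95,
)

if __name__ == "__main__":
    print(f"false positive rate: {false_positive_rate(example):.2%}")
    print(f"specificity:         {specificity(example):.2%}")
    print(f"sensitivity:         {sensitivity(example):.2%}")
    print(f"false negative rate: {false_negative_rate(example):.2%}")
```

The sketch keeps same-source and different-source comparisons in separate denominators, which is what distinguishes sensitivity from the false positive rate; it says nothing about how the counts should be collected, which is the subject of the validation criteria.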
Studies designed to estimate a method’s false positive rate and sensitivity are necessarily conducted using only a finite number of samples. As a consequence, they cannot provide “exact” values for these quantities (and should not claim to do so), but only “confidence intervals,” whose bounds reflect, respectively, the range of values that are reasonably compatible with the results. When reporting a false positive rate to a jury, it is scientifically important to state the “upper 95 percent one-sided confidence bound” to reflect the fact that the actual false positive rate could reasonably be as high as this value. 116 (For more information, see Appendix A.)

Studies often categorize their results as being conclusive (e.g., identification or exclusion) or inconclusive (no determination made). 117 When reporting a false positive rate to a jury, it is scientifically important to calculate the rate based on the proportion of conclusive examinations, rather than just the proportion of all examinations. This is appropriate because evidence used against a defendant will typically be based on conclusive, rather than inconclusive, examinations. To illustrate the point, consider an extreme case in which a method had been tested 1000 times and found to yield 990 inconclusive results, 10 false positives, and no correct results. It would be misleading to report that the false positive rate was 1 percent (10/1000 examinations). Rather, one should report that 100 percent of the conclusive results were false positives (10/10 examinations).

116 The upper confidence bound properly incorporates the precision of the estimate based on the sample size. For example, if a study found no errors in 100 tests, it would be misleading to tell a jury that the error rate was 0 percent. In fact, if the tests are independent, the upper 95 percent confidence bound for the true error rate is 3.0 percent. Accordingly a jury should be told that the error rate could be as high as 3.0 percent (that is, 1 in 33). The true error rate could be higher, but with rather small probability (less than 5 percent). If the study were much smaller, the upper 95 percent confidence limit would be higher. For a study that found no errors in 10 tests, the upper 95 percent confidence bound is 26 percent—that is, the actual false positive rate could be roughly 1 in 4 (see Appendix A).
117 See: Chapter 5.

Whereas exploratory scientific studies may take many forms, scientific validation studies—intended to assess the validity and reliability of a metrological method for a particular forensic feature-comparison application—must satisfy a number of criteria, which are described in Box 4.

BOX 4. Key criteria for validation studies to establish foundational validity
Scientific validation studies—intended to assess the validity and reliability of a metrological method for a particular forensic feature-comparison application—must satisfy a number of criteria.
(1) The studies must involve a sufficiently large number of examiners and must be based on sufficiently large collections of known and representative samples from relevant populations to reflect the range of features or combinations of features that will occur in the application. In particular, the sample collections should be:
(a) representative of the quality of evidentiary samples seen in real cases. (For example, if a method is to be used on distorted, partial, latent fingerprints, one must determine the random match probability—that is, the probability that the match occurred by chance—for distorted, partial, latent fingerprints; the random match probability for full scanned fingerprints, or even very high quality latent prints would not be relevant.)
(b) chosen from populations relevant to real cases.
For example, for features in biological samples, the false positive rate should be determined for the overall US population and for major ethnic groups, as is done with DNA analysis.
(c) large enough to provide appropriate estimates of the error rates.
(2) The empirical studies should be conducted so that neither the examiner nor those with whom the examiner interacts have any information about the correct answer.
(3) The study design and analysis framework should be specified in advance. In validation studies, it is inappropriate to modify the protocol afterwards based on the results. 118
(4) The empirical studies should be conducted or overseen by individuals or organizations that have no stake in the outcome of the studies. 119
(5) Data, software and results from validation studies should be available to allow other scientists to review the conclusions.
(6) To ensure that conclusions are reproducible and robust, there should be multiple studies by separate groups reaching similar conclusions.

118 The analogous situation in medicine is a clinical trial to test the safety and efficacy of a drug for a particular application. In the design of clinical trials, FDA requires that criteria for analysis must be pre-specified and notes that post hoc changes to the analysis compromise the validity of the study. See: FDA Guidance: “Adaptive Designs for Medical Device Clinical Studies” (2016) Available at: www.fda.gov/downloads/medicaldevices/deviceregulationandguidance/guidancedocuments/ucm446729.pdf; Alosh, M., Fritsch, K., Huque, M., Mahjoob, K., Pennello, G., Rothmann, M., Russek-Cohen, E., Smith, F., Wilson, S., and L. Yue. “Statistical considerations on subgroup analysis in clinical trials.” Statistics in Biopharmaceutical Research, Vol. 7 (2015): 286-303; FDA Guidance: “Design Considerations for Pivotal Clinical Investigations for Medical Devices” (2013) (available at: www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm373750.htm); FDA Guidance for Industry: E9 Statistical Principles for Clinical Trials (September 1998) (available at: www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm073137.pdf); Pocock, S.J. Clinical trials: a practical approach. Wiley, Chichester (1983).

An empirical measurement of error rates is not simply a desirable feature; it is essential for determining whether a method is foundationally valid. In science, a testing procedure—such as testing whether a person is pregnant or whether water is contaminated—is not considered valid until its reliability has been empirically measured. For example, we need to know how often the pregnancy test declares a pregnancy when there is none, and vice versa. The same scientific principles apply no less to forensic tests, which may contribute to a defendant losing his life or liberty.

Importantly, error rates cannot be inferred from casework, but rather must be determined based on samples where the correct answer is known. For example, the former head of the FBI’s fingerprint unit testified that the FBI had “an error rate of one per every 11 million cases” based on the fact that the agency was known to have made only one mistake over the past 11 years, during which time it had made 11 million identifications. 120 The fallacy is obvious: the expert simply assumed without evidence that every error in casework had come to light.

Why is it essential to know a method’s false positive rate and sensitivity? Because without appropriate empirical measurement of a method’s accuracy, the fact that two samples in a particular case show similar features has no probative value—and, as noted above, it may have considerable prejudicial impact because juries will likely incorrectly attach meaning to the observation. 121
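The two reporting rules described earlier in this section, basing the false positive rate on conclusive examinations and stating the upper 95 percent one-sided confidence bound, can be sketched numerically. The code below is a minimal sketch resting on one assumption: it uses the exact binomial (Clopper-Pearson) bound for independent tests. The text does not name the calculation it intends, and Appendix A of the report remains the authority on that point. All function names are hypothetical.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), assuming independent tests."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def upper_confidence_bound(errors: int, tests: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the true error rate.

    Assumption: exact binomial (Clopper-Pearson) method, solved by bisection.
    """
    if tests <= 0 or not 0 <= errors <= tests:
        raise ValueError("need tests > 0 and 0 <= errors <= tests")
    if errors == tests:
        return 1.0
    alpha = 1.0 - confidence
    low, high = 0.0, 1.0
    # The CDF falls as p rises, so bisect for the p at which it equals alpha.
    for _ in range(100):
        mid = (low + high) / 2.0
        if binomial_cdf(errors, tests, mid) > alpha:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def conclusive_false_positive_rate(false_positives: int,
                                   examinations: int,
                                   inconclusive: int) -> float:
    """False positives as a share of conclusive examinations only."""
    conclusive = examinations - inconclusive
    if conclusive <= 0:
        raise ValueError("no conclusive examinations to base a rate on")
    return false_positives / conclusive


if __name__ == "__main__":
    # Zero errors observed in 100 tests and in 10 tests (footnote 116).
    print(f"{upper_confidence_bound(0, 100):.4f}")
    print(f"{upper_confidence_bound(0, 10):.4f}")
    # Extreme case from the text: 1000 tests, 990 inconclusive, 10 false positives.
    print(conclusive_false_positive_rate(10, 1000, 990))
```

Under this assumption, zero errors in 100 independent tests gives a bound of about 3 percent and zero errors in 10 tests gives about 26 percent, in line with the figures in footnote 116. In the extreme case from the text, the rate over conclusive examinations is 10/10 rather than 10/1000.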
119 In the setting of clinical trials, the sponsor of the trial (a pharmaceutical, device or biotech company or, in some cases, an academic institution) funds and initiates the study, but the trial is conducted by individuals who are independent of the sponsor (often, academic physicians), in order to ensure the reliability of the data generated by the study and minimize the potential for bias. See, for example, 21 C.F.R. § 312.3 and 21 C.F.R. § 54.4(a).
120 U.S. v. Baines 573 F.3d 979 (2009) at 984.
121 Under Fed. R. Evid., Rule 403, evidence should be excluded “if its probative value is substantially outweighed by the danger of unfair prejudice.”

The absolute need, from a scientific perspective, for empirical data is elegantly expressed in an analogy by U.S. District Judge John Potter in his opinion in U.S. v. Yee (1991), an early case on the use of DNA analysis:

Without the probability assessment, the jury does not know what to make of the fact that the patterns match: the jury does not know whether the patterns are as common as pictures with two eyes, or as unique as the Mona Lisa. 122,123

4.3 Foundational Validity: Requirement for Scientifically Valid Testimony

It should be obvious—but it bears emphasizing—that once a method has been established as foundationally valid based on appropriate empirical studies, claims about the method’s accuracy and the probative value of proposed identifications, in order to be valid, must be based on such empirical studies. Statements claiming or implying greater certainty than demonstrated by empirical evidence are scientifically invalid. Forensic examiners should therefore report findings of a proposed identification with clarity and restraint, explaining in each case that the fact that two samples satisfy a method’s criteria for a proposed match does not necessarily imply that the samples come from a common source.
If the false positive rate of a method has been found to be 1 in 50, experts should not imply that the method is able to produce results at a higher accuracy.

Troublingly, expert witnesses sometimes go beyond the empirical evidence about the frequency of features—even to the extent of claiming or implying that a sample came from a specific source with near-certainty or even absolute certainty, despite having no scientific basis for such opinions. 124 From the standpoint of scientific validity, experts should never be permitted to state or imply in court that they can draw conclusions with certainty or near-certainty (such as “zero,” “vanishingly small,” “essentially zero,” “negligible,” “minimal,” or “microscopic” error rates; “100 percent certainty” or “to a reasonable degree of scientific certainty;” or identification “to the exclusion of all other sources”). 125

The scientific inappropriateness of such testimony is aptly captured by an analogy by District of Columbia Court of Appeals Judge Catharine Easterly in her concurring opinion in Williams v. United States, a case in which an examiner testified that markings on certain bullets were unique to a gun recovered from a defendant’s apartment:

As matters currently stand, a certainty statement regarding toolmark pattern matching has the same probative value as the vision of a psychic: it reflects nothing more than the individual’s foundationless faith in what he believes to be true. This is not evidence on which we can in good conscience rely, particularly in criminal cases, where we demand proof—real proof—beyond a reasonable doubt, precisely because the stakes are so high. 126

122 U.S. v. Yee, 134 F.R.D. 161 (N.D. Ohio 1991).
123 Some courts have ruled that there is no harm in admitting feature-comparison evidence on the grounds that jurors can see the features with their own eyes and decide for themselves about whether features are shared. U.S. v. Yee shows why this reasoning is fallacious: jurors have no way to know how often two different samples would share features, and to what level of specificity.
124 As noted above, the long history of exaggerated claims for the accuracy of forensic methods includes the DOJ’s own prior statement that latent fingerprint analysis was “infallible,” which the DOJ has judged to have been inappropriate. www.justice.gov/olp/file/861906/download.
125 Cole, S.A. “Grandfathering evidence: Fingerprint admissibility rulings from Jennings to Llera Plaza and back again.” 41 American Criminal Law Review, 1189 (2004). See also: National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (NRC Report, 2009): 87, 104, and 143.

In science, assertions that a metrological method is more accurate than has been empirically demonstrated are rightly regarded as mere speculation, not valid conclusions that merit credence.

4.4 Neither Experience nor Professional Practices Can Substitute for Foundational Validity

In some settings, an expert may be scientifically capable of rendering judgments based primarily on his or her “experience” and “judgment.” Based on experience, a surgeon might be scientifically qualified to offer a judgment about whether another doctor acted appropriately in the operating theater or a psychiatrist might be scientifically qualified to offer a judgment about whether a defendant is mentally competent to assist in his or her defense.

By contrast, “experience” or “judgment” cannot be used to establish the scientific validity and reliability of a metrological method, such as a forensic feature-comparison method. The frequency with which a particular pattern or set of features will be observed in different samples, which is an essential element in drawing conclusions, is not a matter of “judgment.” It is an empirical matter for which only empirical evidence is relevant.
Moreover, a forensic examiner’s “experience” from extensive casework is not informative—because the “right answers” are not typically known in casework and thus examiners cannot accurately know how often they erroneously declare matches and cannot readily hone their accuracy by learning from their mistakes in the course of casework.

Importantly, good professional practices—such as the existence of professional societies, certification programs, accreditation programs, peer-reviewed articles, standardized protocols, proficiency testing, and codes of ethics—cannot substitute for actual evidence of scientific validity and reliability. 127 Similarly, an expert’s expression of confidence based on personal professional experience or expressions of consensus among practitioners about the accuracy of their field is no substitute for error rates estimated from relevant studies. For a method to be reliable, empirical evidence of validity, as described above, is required.

Finally, the points above underscore that scientific validity of a method must be assessed within the framework of the broader scientific field of which it is a part (e.g., measurement science in the case of feature-comparison methods). The fact that bitemark examiners defend the validity of bitemark examination means little.

126 Williams v. United States, DC Court of Appeals, decided January 21, 2016, (Easterly, concurring).
127 For example, both scientific and pseudoscientific disciplines employ such practices.

4.5 Validity as Applied: Key Elements

Foundational validity means that a method can, in principle, be reliable. Validity as applied means that the method has been reliably applied in practice.
It is the scientific concept we mean to correspond to the legal requirement, in Rule 702(d), that an expert “has reliably applied the principles and methods to the facts of the case.”

From a scientific standpoint, certain criteria are essential to establish that a forensic practitioner has reliably applied a method to the facts of a case. These elements are described in Box 5.

BOX 5. Key criteria for validity as applied
(1) The forensic examiner must have been shown to be capable of reliably applying the method and must actually have done so. Demonstrating that an examiner is capable of reliably applying the method is crucial—especially for subjective methods, in which human judgment plays a central role. From a scientific standpoint, the ability to apply a method reliably can be demonstrated only through empirical testing that measures how often the expert reaches the correct answer. (Proficiency testing is discussed more extensively on p. 57-59.) Determining whether an examiner has actually reliably applied the method requires that the procedures actually used in the case, the results obtained, and the laboratory notes be made available for scientific review by others.
(2) Assertions about the probability of the observed features occurring by chance must be scientifically valid.
(a) The forensic examiner should report the overall false positive rate and sensitivity for the method established in the studies of foundational validity and should demonstrate that the samples used in the foundational studies are relevant to the facts of the case. 128
(b) Where applicable, the examiner should report the random match probability based on the specific features observed in the case.
(c) An expert should not make claims or implications that go beyond the empirical evidence and the applications of valid statistical principles to that evidence.
128 For example, for DNA analysis, the frequency of genetic variants is known to vary among ethnic groups; it is thus important that the sample collection reflect the ethnic groups relevant to the case at hand. For latent fingerprints, the risk of falsely declaring an identification may be higher when latent fingerprints are of lower quality; so, to be relevant, the sample collections used to estimate accuracy should be based on latent fingerprints comparable in quality and completeness to the case at hand.

4.6 Validity as Applied: Proficiency Testing

Even when a method is foundationally valid, there are many reasons why examiners may not always get the right result. 129 As discussed above, the only way to establish scientifically that an examiner is capable of applying a foundationally valid method is through appropriate empirical testing to measure how often the examiner gets the correct answer. Such empirical testing is often referred to as “proficiency testing.”

We note that the term “proficiency testing” is sometimes used to refer to many other types of testing—such as (1) tests to determine whether a practitioner reliably follows the steps laid out in a protocol, without assessing the accuracy of their conclusions, and (2) practice exercises that help practitioners improve their skills by highlighting their errors, without accurately reflecting the circumstances of actual casework.

In this report, we use the term proficiency testing to mean ongoing empirical tests to “evaluate the capability and performance of analysts.” 130, 131, 132

Proficiency testing should be performed under conditions that are representative of casework and on samples, for which the true answer is known, that are representative of the full range of sample types and quality likely to be encountered in casework in the intended application.
(For example, the fact that an examiner passes a proficiency test involving DNA analysis of simple, single-source samples does not demonstrate that they are capable of DNA analysis of complex mixtures of the sort encountered in casework; see p. 76-81.)

To ensure integrity, proficiency testing should be overseen by a disinterested third party that has no institutional or financial incentive to skew performance. We note that testing services have stated that the forensic community prefers that tests not be too challenging. 133

129 J.J. Koehler has enumerated a number of possible problems that could, in principle, occur: features may be mismeasured; samples may be interchanged, mislabeled, miscoded, altered, or contaminated; equipment may be miscalibrated; technical glitches and failures may occur without warning and without being noticed; and results may be misread, misinterpreted, misrecorded, mislabeled, mixed up, misplaced, or discarded. Koehler, J.J. “Forensics or fauxrensics? Ascertaining accuracy in the forensic sciences.” papers.ssrn.com/sol3/papers.cfm?abstract_id=2773255 (accessed June 28, 2016).
130 ASCLD/LAB Supplemental Requirements for Accreditation of Forensic Testing Laboratories. des.wa.gov/SiteCollectionDocuments/About/1063/RFP/Add7_Item4ASCLD.pdf.
131 We note that proficiency testing is not intended to estimate the inherent error rates of a method; these rates should be assessed from foundational validity studies.
132 Proficiency testing should also be distinguished from “competency testing,” which is “the evaluation of a person’s knowledge and ability prior to performing independent work in forensic casework.” des.wa.gov/SiteCollectionDocuments/About/1063/RFP/Add7_Item4ASCLD.pdf.
133 Christopher Czyryca, the president of Collaborative Testing Services, Inc., the leading proficiency testing firm in the U.S., has publicly stated that “Easy tests are favored by the community.” August 2015 meeting of the National Commission on Forensic Science, a presentation at the Accreditation and Proficiency Testing Subcommittee. www.justice.gov/ncfs/file/761061/download.

As noted previously, false positive rates consist of both coincidental match rates and technical/human failure rates. For some technologies (such as DNA analysis), the latter may be hundreds of times higher than the former.

Proficiency testing is especially critical for subjective methods: because the procedure is not based solely on objective criteria but relies on human judgment, it is inherently vulnerable to error and inter-examiner variability. Each examiner should be tested, because empirical studies have noted considerable differences in accuracy across examiners. 134,135

The test problems used in proficiency tests should be publicly released after the test is completed, to enable scientists to assess the appropriateness and adequacy of the test for its intended purpose.

Finally, proficiency testing should ideally be conducted in a ‘test-blind’ manner—that is, with samples inserted into the flow of casework such that examiners do not know that they are being tested. (For example, the Transportation Security Administration conducts blind tests by sending weapons and explosives inside luggage through screening checkpoints to see how often TSA screeners detect them.) It has been established in many fields (including latent fingerprint analysis) that, when individuals are aware that they are being tested, they perform differently than they do in the course of their daily work (referred to as the “Hawthorne Effect”). 136,137

While test-blind proficiency testing is ideal, there is disagreement in the forensic community about its feasibility in all settings.
On the one hand, laboratories vary considerably as to the type of cases they receive, how evidence is managed and processed, and what information is provided to an analyst about the evidence or the case in question. Accordingly, blinded, inter-laboratory proficiency tests may be difficult to design and orchestrate on a large scale. 138 On the other hand, test-blind proficiency tests have been used for DNA analysis, 139 and select labs have begun to implement this type of testing, in-house, as part of their quality assurance programs. 140 We note that test-blind proficiency testing is much easier to adopt in laboratories that have adopted “context management procedures” to reduce contextual bias. 141

134 For example, a 2011 study on latent fingerprint decisions observed that examiners frequently differed on whether fingerprints were suitable for reaching a conclusion. Ulery, B.T., Hicklin, R.A., Buscaglia, J., and M.A. Roberts. “Accuracy and reliability of forensic latent fingerprint decisions.” Proceedings of the National Academy of Sciences, Vol. 108, No. 19 (2011): 7733-8.
135 It is not sufficient to point to proficiency testing on volunteers in a laboratory, because better performing examiners are more likely to participate. Koehler, J.J. “Forensics or fauxrensics? Ascertaining accuracy in the forensic sciences.” papers.ssrn.com/sol3/papers.cfm?abstract_id=2773255 (accessed June 28, 2016).
136 Concerning the Hawthorne effect, see, for example: Bracht, G.H., and G.V. Glass. “The external validity of experiments.” American Educational Research Journal, Vol. 5, No. 4 (1968): 437-74; Weech, T.L. and H. Goldhor. “Obtrusive versus unobtrusive evaluation of reference service in five Illinois public libraries: A pilot study.” Library Quarterly: Information, Community, Policy, Vol. 52, No. 4 (1982): 305-24; Bouchet, C., Guillemin, F., and S. Braincon. “Nonspecific effects in longitudinal studies: impact on quality of life measures.” Journal of Clinical Epidemiology, Vol. 49, No. 1 (1996): 15-20; Mangione-Smith, R., Elliott, M.N., McDonald, L., and E.A. McGlynn. “An observational study of antibiotic prescribing behavior and the Hawthorne Effect.” Health Services Research, Vol. 37, No. 6 (2002): 1603-23; Mujis, D. “Measuring teacher effectiveness: Some methodological reflections.” Educational Research and Evaluation, Vol. 12, No. 1 (2006): 53–74; and McCarney, R., Warner, J., Iliffe, S., van Haselen, R., Griffin, M., and P. Fisher. “The Hawthorne Effect: a randomized, controlled trial.” BMC Medical Research Methodology, Vol. 7, No. 30 (2007).
137 For demonstrations that forensic examiners change their behavior when they know their performance is being monitored in particular ways, see Langenburg, G. “A performance study of the ACE-V process: A pilot study to measure the accuracy, precision, reproducibility, repeatability, and biasability of conclusions resulting from the ACE-V process.” Journal of Forensic Identification, Vol. 59, No. 2 (2009).

PCAST believes that test-blind proficiency testing of forensic examiners should be vigorously pursued, with the expectation that it should be in wide use, at least in large laboratories, within the next five years. However, PCAST believes that it is not yet realistic to require test-blind proficiency testing because the procedures for test-blind proficiency tests have not yet been designed and evaluated. While only non-test-blind proficiency tests are used to support validity as applied, it is scientifically important to report this limitation, including to juries—because, as noted above, non-blind proficiency tests are likely to overestimate accuracy, given that the examiners knew they were being tested.
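The empirical testing described in this section comes down to counting how often each examiner reaches the correct answer on samples whose true source is known. The following is a minimal sketch of that bookkeeping, not a description of any existing testing program. The record layout, the examiner labels, and the example data are all hypothetical, and inconclusive decisions are left out for simplicity.

```python
from collections import defaultdict


def score_examiners(results):
    """Tally per-examiner accuracy on test items with known ground truth.

    `results` is an iterable of (examiner, same_source, declared_match)
    tuples. This layout is hypothetical; inconclusive decisions are
    simply left out of this sketch.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for examiner, same_source, declared_match in results:
        if same_source:
            outcome = "tp" if declared_match else "fn"
        else:
            outcome = "fp" if declared_match else "tn"
        counts[examiner][outcome] += 1

    scores = {}
    for examiner, c in counts.items():
        same = c["tp"] + c["fn"]
        different = c["fp"] + c["tn"]
        scores[examiner] = {
            # Share of same-source items correctly declared a match.
            "sensitivity": c["tp"] / same if same else None,
            # Share of different-source items wrongly declared a match.
            "false_positive_rate": c["fp"] / different if different else None,
        }
    return scores


# Invented example data: two examiners, a handful of test items each.
example = [
    ("examiner_A", True, True),
    ("examiner_A", True, False),
    ("examiner_A", False, False),
    ("examiner_A", False, True),
    ("examiner_B", True, True),
    ("examiner_B", False, False),
]
for name, score in sorted(score_examiners(example).items()):
    print(name, score)
```

Scoring each examiner separately, rather than pooling results across a laboratory, reflects the point made above that accuracy can differ considerably from one examiner to another.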
4.7 Non-Empirical Views in the Forensic Community

While the scientific validity of metrological methods requires empirical demonstration of accuracy, there have historically been efforts in the forensic community to justify non-empirical approaches. This is of particular concern because such views are sometimes mistakenly codified in policies or practices. These heterodox views typically involve four recurrent themes, which we review below.

“Theories” of Identification

A common argument is that forensic practices should be regarded as valid because they rest on scientific “theories” akin to the fundamental laws of physics, that should be accepted because they have been tested and not “falsified.” 142 An example is the “Theory of Identification as it Relates to Toolmarks,” issued in 2011 by the Association of Firearm and Tool Mark Examiners. 143,144 It states in its entirety:

1. The theory of identification as it pertains to the comparison of toolmarks enables opinions of common origin to be made when the unique surface of two toolmarks are in “sufficient agreement.”

2. This “sufficient agreement” is related to the significant duplication of random toolmarks as evidenced by the correspondence of a pattern or combination of patterns of surface contours. Significance is determined by the comparative examination of two or more sets of surface contour patterns comprised of individual peaks, ridges and furrows. Specifically, the relative height or depth, width, curvature and spatial relationship of the individual peaks, ridges and furrows within one set of surface contours are defined and compared to the corresponding features in the second set of surface contours. Agreement is significant when the agreement in individual characteristics exceeds the best agreement demonstrated between toolmarks known to have been produced by different tools and is consistent with agreement demonstrated by toolmarks known to have been produced by the same tool. The statement that “sufficient agreement” exists between two toolmarks means that the agreement of individual characteristics is of a quantity and quality that the likelihood another tool could have made the mark is so remote as to be considered a practical impossibility.

3. Currently the interpretation of individualization/identification is subjective in nature, founded on scientific principles and based on the examiner’s training and experience.

138 Some of the challenges associated with designing blind inter-laboratory proficiency tests may be addressed if the forensic laboratories were to move toward a system where an examiner’s knowledge of a case were limited to domain-relevant information.
139 See: Peterson, J.L., Lin, G., Ho, M., Chen, Y., and R.E. Gaensslen. “The feasibility of external blind DNA proficiency testing. II. Experience with actual blind tests.” Journal of Forensic Science, Vol. 48, No. 1 (2003): 32-40.
140 For example, the Houston Forensic Science Center has implemented routine, blind proficiency testing for its firearms examiners and chemistry analysis unit, and is planning to carry out similar testing for its DNA and latent print examiners.
141 For background, see www.justice.gov/ncfs/file/888586/download.
142 See: www.swggun.org/index.php?option=com_content&view=article&id=66:the-foundations-of-firearm-and-toolmarkidentification&catid=13:other&Itemid=43 and www.justice.gov/ncfs/file/888586/download.
143 Association of Firearm and Tool Mark Examiners. “Theory of Identification as it Relates to Tool Marks: Revised.” AFTE Journal, Vol. 43, No. 4 (2011): 287.
144 Firearms analysis is considered in detail in Chapter 5.
The statement is clearly not a scientific theory, which the National Academy of Sciences has defined as “a comprehensive explanation of some aspect of nature that is supported by a vast body of evidence.” 145 Rather, it is a claim that examiners applying a subjective approach can accurately individualize the origin of a toolmark. Moreover, a “theory” is not what is needed. What is needed are empirical tests to see how well the method performs.

More importantly, the stated method is circular. It declares that an examiner may state that two toolmarks have a “common origin” when their features are in “sufficient agreement.” It then defines “sufficient agreement” as occurring when the examiner considers it a “practical impossibility” that the toolmarks have different origins. (In response to PCAST’s concern about this circularity, the FBI Laboratory replied that: “‘Practical impossibility’ is the certitude that exists when there is sufficient agreement in the quality and quantity of individual characteristics.” 146 This answer did not resolve the circularity.)

145 See: www.nas.edu/evolution/TheoryOrFact.html.
146 Communication from FBI Laboratory to PCAST (June 6, 2016).

Focus on ‘Training and Experience’ Rather Than Empirical Demonstration of Accuracy

Many practitioners hold an honest belief that they are able to make accurate judgments about identification based on their training and experience. This notion is explicit in the AFTE’s Theory of Identification, which notes that interpretation is subjective in nature, “based on an examiner’s training and experience.” Similarly, the leading textbook on footwear analysis states,

Positive identifications may be made with as few as one random identifying characteristic, but only if that characteristic is confirmable; has sufficient definition, clarity, and features; is in the same location and orientation on the shoe outsole; and in the opinion of an experienced examiner, would not occur again on another shoe. 147 [emphasis added]

In effect, it says, positive identification depends on the examiner being positive about the identification.

“Experience” is an inadequate foundation for drawing judgments about whether two sets of features could have been produced by (or found on) different sources. Even if examiners could recall in sufficient detail all the patterns or sets of features that they have seen, they would have no way of knowing accurately in which cases two patterns actually came from different sources, because the correct answers are rarely known in casework.

The fallacy of relying on “experience” was evident in testimony by a former head of the FBI’s fingerprint unit (discussed above) that the FBI had “an error rate of one per every 11 million cases,” based on the fact that the agency was only aware of one mistake. 148 By contrast, recent empirical studies by the FBI Laboratory (discussed in Chapter 5) indicate error rates of roughly one in several hundred.

“Training” is an even weaker foundation. The mere fact that an individual has been trained in a method does not mean that the method itself is scientifically valid nor that the individual is capable of producing reliable answers when applying the method.

Focus on ‘Uniqueness’ Rather Than Accuracy

Many forensic feature-comparison disciplines are based on the premise that various sets of features (for example, fingerprints, toolmarks on bullets, human dentition, and so on) are “unique.” 149

147 Bodziak, W. J. Footwear Impression Evidence: Detection, Recovery, and Examination. 2nd ed. CRC Press-Taylor & Francis, Boca Raton, Florida (2000).
148 U.S. v. Baines 573 F.3d 979 (2009) at 984.
149 For fingerprints, see, for example: Wertheim, Kasey. “Letter re: ACE-V: Is it scientifically reliable and accurate?” Journal of Forensic Identification, Vol. 52 (2002): 669 (“The law of biological uniqueness states that exact replication of any given organism cannot occur (nature never repeats itself), and, therefore, no biological entity will ever be exactly the same as another”) and Budowle, B., Buscaglia, J., and R.S. Perlman. “Review of the scientific basis for friction ridge comparisons as a means of identification: committee findings and recommendations.” Forensic Science Communications, Vol. 8 (2006) (“The use of friction ridge skin comparisons as a means of identification is based on the assumptions that the pattern of friction ridge skin is both unique and permanent”). For firearms, see, for example, Riva, F., and C. Christope. “Automatic comparison and evaluation of impressions left by a firearm on fired cartridge cases.” Journal of Forensic Sciences, Vol. 59, (2014): 637 (“The ability to identify a firearm as the source of a questioned cartridge case or bullet is based on two tenets constituting the scientific foundation of the discipline. The first assumes the uniqueness of impressions left by the firearms”) and SWGGUN Admissibility Resource Kit (ARK): Foundational Overview of Firearm/Toolmark Identification. Available at: afte.org/resources/swggun-ark (“The basis for identification in Toolmark Identification is founded on the principle of uniqueness . . . wherein, all objects are unique to themselves and thus can be differentiated from one another”). For bitemarks, see, for example, Kieser, J.A., Bernal, V., Neil Waddell, J., and S. Raju. “The uniqueness of the human anterior dentition: a geometric morphometric analysis.” Journal of Forensic Sciences, Vol. 52 (2007): 671-7 (“There are two postulates that underlie all bitemark analyses: first, that the characteristics of the anterior teeth involved in the bite are unique, and secondly, that this uniqueness is accurately recorded in the material bitten.”) and Pretty, I.A. “Resolving Issues in Bitemark Analysis” in Bitemark Evidence: A Color Atlas R.B.J Dorian, Ed. CRC Press. Chicago (2011) (“Bitemark analysis is based on two postulates: (a) the dental characteristics of anterior teeth involved in biting are unique among individuals, and (b) this asserted uniqueness is transferred and recorded in the injury.”).

The forensic science literature contains many “uniqueness” studies that go to great lengths to try to establish the correctness of this premise. 150 For example, a 2012 paper studied 39 Adidas Supernova Classic running shoes (size 12) worn by a single runner over 8 years, during which time he kept a running journal and ran over the same types of surfaces. 151 After applying black shoe polish to the soles of the shoes, the author asked the runner to carefully produce tread marks on sheets of legal paper on a hardwood floor. The author showed that it was possible to identify small identifying differences between the tread marks produced by different pairs of shoes.

Yet, uniqueness studies miss the fundamental point. The issue is not whether objects or features differ; they surely do if one looks at a fine enough level. The issue is how well and under what circumstances examiners applying a given metrological method can reliably detect relevant differences in features to reliably identify whether they share a common source. Uniqueness studies, which focus on the properties of features themselves, can therefore never establish whether a particular method for measuring and comparing features is foundationally valid. Only empirical studies can do so.

Moreover, it is not necessary for features to be unique in order for them to be useful in narrowing down the source of a feature. Rather, it is essential that there be empirical evidence about how often a method incorrectly attributes the source of a feature.

Decoupling Conclusions about Identification from Estimates of Accuracy

Finally, some hold the view that, when the application of a scientific method leads to a conclusion of an association or proposed identification, it is unnecessary to report in court the reliability of the method. 152 As a rationale, it is sometimes argued that it is impossible to measure error rates perfectly or that it is impossible to know the error rate in the specific case at hand. This notion is contrary to the fundamental principle of scientific validity in metrology—namely, that the claim that two objects have been compared and found to have the same property (length, weight, or fingerprint pattern) is meaningless without quantitative information about the reliability of the comparison process.

150 Some authors have criticized attempts to affirm the uniqueness proposition based on observations, noting that they rest on pure inductive reasoning, a method for scientific investigation that “fell out of favour during the epoch of Sir Francis Bacon in the 16th century.” Page, M., Taylor, J., and M. Blenkin. “Uniqueness in the forensic identification sciences—fact or fiction?” Forensic Science International, Vol. 206 (2011): 12-8.
151 Wilson, H.D. “Comparison of the individual characteristics in the outsoles of thirty-nine pairs of Adidas Supernova Classic shoes.” Journal of Forensic Identification, Vol. 62, No. 3 (2012): 194-204.
152 See: www.justice.gov/olp/file/861936/download.

It is standard practice to study and report error rates in medicine—both to establish the reliability of a method in principle and to assess its implementation in practice. No one argues that measuring or reporting clinical error rates is inappropriate because they might not perfectly reflect the situation for a specific patient. If transparency about error rates is appropriate for matching blood types before a transfusion, it is appropriate for matching forensic samples—where errors may have similar life-threatening consequences.
We return to this topic in Chapter 8, where we observe that the DOJ’s recent proposed guidelines on expert testimony are based, in part, on this scientifically inappropriate view.

4.8 Empirical Views in the Forensic Community

Although some in the forensic community continue to hold views such as those described in the previous section, a growing segment of the forensic science community has responded to the 2009 NRC report with an increased recognition of the need for empirical studies and with initial efforts to undertake them. Examples include published research studies by forensic scientists, assessments of research needs by Scientific Working Groups and OSAC committees, and statements from the NCFS. Below we highlight several examples from recent papers by forensic scientists:

● Researchers at the National Academy of Sciences and elsewhere (e.g., Saks & Koehler, 2005; Spinney, 2010) have argued that there is an urgent need to develop objective measures of accuracy in fingerprint identification. Here we present such data. 153
● Tool mark impression evidence, for example, has been successfully used in courts for decades, but its examination has lacked scientific, statistical proof that would independently corroborate conclusions based on morphology characteristics (2–7). In our study, we will apply methods of statistical pattern recognition (i.e., machine learning) to the analysis of toolmark impressions. 154
● The NAS report calls for further research in the area of bitemarks to demonstrate that there is a level of probative value and possibly restricting the use of analyses to the exclusion of individuals. This call to respond must be heard if bite-mark evidence is to be defensible as we move forward as a discipline. 155
● The National Research Council of the National Academies and the legal and forensic sciences communities have called for research to measure the accuracy and reliability of latent print examiners’ decisions, a challenging and complex problem in need of systematic analysis. Our research is focused on the development of empirical approaches to studying this problem. 156
● We believe this report should encourage the legal community to require that the emerging field of forensic neuroimaging, including fMRI based lie detection, have a proper scientific foundation before being admitted in courts. 157
● An empirical solution which treats the system [referring to voiceprints] as a black box and its output as point values is therefore preferred. 158

153 Tangen, J.M., Thompson, M.B., and D.J. McCarthy. “Identifying fingerprint expertise.” Psychological Science, Vol. 22, No. 8 (2011): 995-7.
154 Petraco, N.D., Shenkin, P., Speir, J., Diaczuk, P., Pizzola, P.A., Gambino, C., and N. Petraco. “Addressing the National Academy of Sciences’ Challenge: A Method for Statistical Pattern Comparison of Striated Tool Marks.” Journal of Forensic Sciences, Vol. 57 (2012): 900-11.
155 Pretty, I.A., and D. Sweet. “A paradigm shift in the analysis of bitemarks.” Forensic Science International, Vol. 201 (2010): 38-44.
156 Ulery, B.T., Hicklin, R.A., Buscaglia, J., and M.A., Roberts. “Accuracy and reliability of forensic latent fingerprint decisions.” PNAS, Vol. 108, No. 19 (2011): 7733-8.
157 Langleben, D.D., and J.C. Moriarty. “Using brain imaging for lie detection: Where science, law, and policy collide.” Psychology, Public Policy, and Law, Vol. 19, No. 2 (2013): 222–34.
158 Morrison, G.S., Zhang, C., and P. Rose. “An empirical estimate of the precision of likelihood ratios from a forensic-voicecomparison system.” Forensic Science International, Vol. 208, (2011): 59–65.

Similarly, the OSAC and other groups have acknowledged critical research gaps in the evidence supporting various forensic science disciplines and have begun to develop plans to close some of these gaps. We highlight several examples below:

● While validation studies of firearms and toolmark analysis schemes have been conducted, most have been relatively small data sets. If a large study were well designed and has sufficient participation, it is our anticipation that similar lessons could be learned for the firearms and toolmark discipline. 159
● We are unaware of any study that assesses the overall firearm and toolmark discipline’s ability to correctly/consistently categorize evidence by class characteristics, identify subclass marks, and eliminate items using individual characteristics. 160
● Currently there is not a reliable assessment of the discriminating strength of specific friction ridge feature types. 161
● To date there is little scientific data that quantifies the overall risk of close non-matches in AFIS databases. It is difficult to create standards regarding sufficiency for examination or AFIS search searching without this type of research. 162
● Research is needed that studies whether sequential unmasking reduces the negative effects of bias during latent print examination. 163
● The IAI has, for many years, sought support for research that would scientifically validate many of the comparative analyses conducted by its member practitioners. While there is a great deal of empirical evidence to support these exams, independent validation has been lacking. 164

159 OSAC Research Needs Assessment Form. “Study to Assess The Accuracy and Reliability of Firearm and Toolmark.” Issued October 2015 (Approved January 2016). Available at: www.nist.gov/forensics/osac/upload/FATM-Research-NeedsAssessment_Blackbox.pdf.
160 OSAC Research Needs Assessment Form. “Assessment of Examiners’ Toolmark Categorization Accuracy.” Issued October 2015 (Approved January 2016). Available at: www.nist.gov/forensics/osac/upload/FATM-Research-NeedsAssessment_Class-and-individual-marks.pdf.
161 OSAC Research Needs Assessment Form. “Assessing the Sufficiency and Strength of Friction Ridge Features.” Issued October 2015. Available at: www.nist.gov/forensics/osac/upload/FRS-Research-Need-Assessment-of-Features.pdf.
162 OSAC Research Needs Assessment Form. “Close Non-Match Assessment.” Issued October 2015. Available at: www.nist.gov/forensics/osac/upload/FRS-Research-Need-Close-Non-Match-Assessment.pdf.

The National Commission on Forensic Science has similarly recognized the need for rigorous empirical evaluation of forensic methods in a Views Document approved by the commission:

All forensic science methodologies should be evaluated by an independent scientific body to characterize their capabilities and limitations in order to accurately and reliably answer a specific and clearly defined forensic question. 165

PCAST applauds this growing focus on empirical evidence. We note that increased research funding will be needed to achieve these critical goals (see Chapter 6).

4.9 Summary of Scientific Findings

We summarize our scientific findings concerning the scientific criteria for foundational validity and validity as applied.

Finding 1: Scientific Criteria for Scientific Validity of a Forensic Feature-Comparison Method

(1) Foundational validity.
To establish foundational validity for a forensic feature-comparison method, the following elements are required:

(a) a reproducible and consistent procedure for (i) identifying features in evidence samples; (ii) comparing the features in two samples; and (iii) determining, based on the similarity between the features in two sets of features, whether the samples should be declared to be likely to come from the same source (“matching rule”); and
(b) empirical estimates, from appropriately designed studies from multiple groups, that establish (i) the method’s false positive rate—that is, the probability it declares a proposed identification between samples that actually come from different sources and (ii) the method’s sensitivity—that is, the probability it declares a proposed identification between samples that actually come from the same source.

163 OSAC Research Needs Assessment Form. “ACE-V Bias.” Issued October 2015. Available at: www.nist.gov/forensics/osac/upload/FRS-Research-Need-ACE-V-Bias.pdf.
164 International Association for Identification. Letter to Patrick J. Leahy, Chairman, Senate Committee on the Judiciary, March 18, 2009. Available at: www.theiai.org/current_affairs/nas_response_leahy_20090318.pdf.
165 National Commission on Forensic Science: “Views of the Commission Technical Merit Evaluation of Forensic Science Methods and Practices.” Available at: www.justice.gov/ncfs/file/881796/download.

As described in Box 4, scientific validation studies should satisfy a number of criteria: (a) they should be based on sufficiently large collections of known and representative samples from relevant populations; (b) they should be conducted so that the examinees have no information about the correct answer; (c) the study design and analysis plan should be specified in advance and not modified afterwards based on the results; (d) the study should be conducted or overseen by individuals or organizations with no stake in the outcome; (e) data, software and results should be available to allow other scientists to review the conclusions; and (f) to ensure that the results are robust and reproducible, there should be multiple independent studies by separate groups reaching similar conclusions.

Once a method has been established as foundationally valid based on adequate empirical studies, claims about the method’s accuracy and the probative value of proposed identifications, in order to be valid, must be based on such empirical studies.

For objective methods, foundational validity can be established by demonstrating the reliability of each of the individual steps (feature identification, feature comparison, matching rule, false match probability, and sensitivity).

For subjective methods, foundational validity can be established only through black-box studies that measure how often many examiners reach accurate conclusions across many feature-comparison problems involving samples representative of the intended use. In the absence of such studies, a subjective feature-comparison method cannot be considered scientifically valid. Foundational validity is a sine qua non, which can only be shown through empirical studies.
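To make the two required estimates concrete, the sketch below computes a false positive rate and a sensitivity from the counts that a black-box study would produce. It is an illustration under stated assumptions, not a prescribed procedure: the function names and the example counts are hypothetical, and the one-sided 95 percent upper confidence bound (computed here by the exact Clopper-Pearson method) is one common way to express the uncertainty that comes from a finite study size.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); adequate for modest n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_bound(k, n, alpha=0.05):
    """One-sided Clopper-Pearson upper bound on a proportion.

    Finds the largest rate p that is still consistent, at level alpha,
    with observing k or fewer events in n trials.
    """
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    for _ in range(100):  # bisection; the CDF decreases as p increases
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def summarize_study(false_pos, different_source, true_pos, same_source):
    """Point estimates plus an upper bound on the false positive rate."""
    return {
        "false_positive_rate": false_pos / different_source,
        "fpr_upper_95": upper_bound(false_pos, different_source),
        "sensitivity": true_pos / same_source,
    }


# Hypothetical counts, chosen only to show the calculation.
print(summarize_study(false_pos=6, different_source=3000,
                      true_pos=2700, same_source=3000))
```

Reporting the upper bound alongside the point estimate matters most when few or no errors are observed: zero false positives in 100 comparisons still leaves a false positive rate of nearly 3 percent plausible under this calculation.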
Importantly, good professional practices—such as the existence of professional societies, certification programs, accreditation programs, peer-reviewed articles, standardized protocols, proficiency testing, and codes of ethics—cannot substitute for empirical evidence of scientific validity and reliability. (2) Validity as applied. Once a forensic feature-comparison method has been established as foundationally valid, it is necessary to establish its validity as applied in a given case. As described in Box 5, validity as applied requires that: (a) the forensic examiner must have been shown to be capable of reliably applying the method, as shown by appropriate proficiency testing (see Section 4.6), and must actually have done so, as demonstrated by the procedures actually used in the case, the results obtained, and the laboratory notes, which should be made available for scientific review by others; and (b) assertions about the probative value of proposed identifications must be scientifically valid— including that examiners should report the overall false positive rate and sensitivity for the method established in the studies of foundational validity; demonstrate that the samples used in the foundational studies are relevant to the facts of the case; where applicable, report probative value of the observed match based on the specific features observed in the case; and not make claims or implications that go beyond the empirical evidence. 66 5. Evaluation of Scientific Validity for Seven Feature-Comparison Methods In the previous chapter, we described the scientific criteria that a forensic feature-comparison method must meet to be considered scientifically valid and reliable, and we underscored the need for empirical evidence of accuracy and reliability. 
In this chapter, we illustrate the meaning of these criteria by applying them to six specific forensic feature-comparison methods: (1) DNA analysis of single-source and simple-mixture samples, (2) DNA analysis of complex-mixture samples, (3) bitemarks, (4) latent fingerprints, (5) firearms identification, and (6) footwear analysis. 166 For a seventh forensic feature-comparison method, hair analysis, we do not undertake a full evaluation, but review a recent evaluation by the DOJ. We evaluate whether these methods have been established to be foundationally valid and reliable and, if so, what estimates of accuracy should accompany testimony concerning a proposed identification, based on current scientific studies. We also briefly discuss some issues related to validity as applied. PCAST compiled a list of 2,019 papers from various sources, including bibliographies prepared by the National Science and Technology Council’s Subcommittee on Forensic Science, the relevant Scientific Working Groups (predecessors to the current OSAC), 167 and the relevant OSAC committees; submissions in response to PCAST’s request for information from the forensic-science stakeholder community; and our own literature searches. 168 PCAST members and staff identified and reviewed those papers that were relevant to establishing scientific validity. After reaching a set of initial conclusions, we obtained input from the FBI Laboratory and individual scientists at NIST, as well as other experts, including by asking them to identify additional papers supporting scientific validity that we might have missed. For each of the methods, we provide a brief overview of the methodology, discuss background information and studies, and review evidence for scientific validity.
As discussed in Chapter 4, objective methods have well-defined procedures to (1) identify the features in samples, (2) measure the features, (3) determine whether the features in two samples match to within a stated measurement tolerance (matching rule), and (4) estimate the probability that samples from different sources would match (false match probability). It is possible to examine each of these separate steps for their validity and reliability.

166 The American Association for the Advancement of Science (AAAS) is conducting an analysis of the underlying scientific bases for the forensic tools and methods currently used in the criminal justice system. As of September 1, 2016 no reports have been issued. See: www.aaas.org/page/forensic-science-assessments-quality-and-gap-analysis.
167 See: www.nist.gov/forensics/workgroups.cfm.
168 See: www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensics_references.pdf.

Of the six methods considered in this chapter, only the first two methods (involving DNA analysis) employ objective methods. The remaining four methods are subjective. For subjective methods, the procedures are not precisely defined, but rather involve substantial expert human judgment. Examiners may focus on certain features while ignoring others, may compare them in different ways, and may have different standards for declaring a proposed identification between samples. As described in Chapter 4, the sole way to establish foundational validity for such methods is through multiple independent “black-box” studies that measure how often examiners reach accurate conclusions across many feature-comparison problems involving samples representative of the intended use. In the absence of such studies, a feature-comparison method cannot be considered scientifically valid. PCAST found few black-box studies appropriately designed to assess scientific validity of subjective methods.
Two notable exceptions, discussed in this chapter, were a study on latent fingerprints conducted by the FBI Laboratory and a study on firearms identification sponsored by the Department of Defense and conducted by the Department of Energy’s Ames Laboratory. We considered whether proficiency testing, which is conducted by commercial organizations for some disciplines, could be used to establish foundational validity. We concluded that it could not, at present, for several reasons. First, proficiency tests are not intended to establish foundational validity. Second, the test problems or test sets used in commercial proficiency tests are not at present routinely made public, making it impossible to ascertain whether the tests appropriately assess the method across the range of applications for which it is used. The publication and critical review of methods and data is an essential component in establishing scientific validity. Third, the dominant company in the market, Collaborative Testing Services, Inc. (CTS), explicitly states that its proficiency tests are not appropriate for estimating error rates of a discipline, because (a) the test results, which are open to anyone, may not reflect the skills of forensic practitioners and (b) “the reported results do not reflect ‘correct’ or ‘incorrect’ answers, but rather responses that agree or disagree with the consensus conclusions of the participant population.” 169 Fourth, the tests for forensic feature-comparison methods typically consist of only one or two problems each year. Fifth, “easy tests are favored by the community,” with the result that tests that are too challenging could jeopardize repeat business for a commercial vendor. 170

169 See: www.ctsforensics.com/assets/news/CTSErrorRateStatement.pdf.
170 PCAST thanks Collaborative Testing Services, Inc. (CTS) President Christopher Czyryca for helpful conversations concerning proficiency testing.
Czyryca explained that (1) CTS defines consensus as at least 80 percent agreement among respondents and (2) proficiency testing for latent fingerprints only occasionally involves a problem in which a questioned print matches none of the possible answers. Czyryca noted that the forensic community disfavors more challenging tests, and that testing companies are concerned that they could lose business if their tests are viewed as too challenging. An example of a “challenging” test is the very important scenario in which none of the questioned samples match any of the known samples: because examiners may expect they should find some matches, such scenarios provide an opportunity to assess how often examiners declare false-positive matches. (See also presentation to the National Commission on Forensic Science by CTS President Czyryca, noting that “Easy tests are favored by the community.” www.justice.gov/ncfs/file/761061/download.)

PCAST’s observations and findings below are largely consistent with the conclusions of earlier NRC reports. 171

5.1 DNA Analysis of Single-source and Simple-mixture samples

DNA analysis of single-source and simple-mixture samples includes excellent examples of objective methods whose foundational validity has been properly established. 172

Methodology

DNA analysis involves comparing DNA profiles from different samples to see if a known sample may have been the source of an evidentiary sample. To generate a DNA profile, DNA is first chemically extracted from a sample containing biological material, such as blood, semen, hair, or skin cells. Next, a predetermined set of DNA segments (“loci”) containing small repeated sequences 173 are amplified using the Polymerase Chain Reaction (PCR), an enzymatic process that replicates a targeted DNA segment over and over to yield millions of copies.
After amplification, the lengths of the resulting DNA fragments are measured using a technique called capillary electrophoresis, which is based on the fact that longer fragments move more slowly than shorter fragments through a polymer solution. The raw data collected from this process are analyzed by a software program to produce a graphical image (an electropherogram) and a list of numbers (the DNA profile) corresponding to the sizes of each of the fragments (by comparing them to known “molecular size standards”). As currently practiced, the method uses 13 specific loci and the amplification process is designed so that the DNA fragments corresponding to different loci occupy different size ranges, making it simple to recognize which fragments come from each locus. 174 At each locus, every human carries two variants (called “alleles”), one inherited from his or her mother and one from his or her father, that may be of different lengths or the same length. 175

171 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009). National Research Council, Ballistic Imaging. The National Academies Press. Washington DC. (2008).
172 Forensic DNA analysis belongs to two parent disciplines, metrology and human molecular genetics, and has benefited from the extensive application of DNA technology in biomedical research and medical application.
173 The repeats, called short tandem repeats (STRs), consist of consecutive repeated copies of a segment of 2-6 base pairs.
174 The current kit used by the FBI (Identifiler Plus) has 16 total loci: 15 STR loci and the amelogenin locus. A kit that will be implemented later this year has 24 loci.
175 The FBI announced in 2015 that it plans to expand the core loci by adding seven additional loci commonly used in databases in other countries.
(Population data have been published for the expanded set, including frequencies in 11 ethnic populations www.fbi.gov/about-us/lab/biometric-analysis/codis/expanded-fbi-str-2015-final-6-16-15.pdf.) Starting in 2017, these loci will be required for uploading and searching DNA profiles in the national system. The expanded data in each profile are expected to provide greater discrimination potential for identification, especially in matching samples with only partial DNA profiles, missing person inquiries, and international law enforcement and counterterrorism cases. 171 69 Analysis of single-source samples DNA analysis of a sample from a single individual is an objective method. In addition to the laboratory protocols being precisely defined, the interpretation also involves little or no human judgment. An examiner can assess if a sample came from a single source based on whether the DNA profile typically contains, for each locus, exactly one fragment from each chromosome containing the locus—which yields one or two distinct fragment lengths from each locus. 176 The DNA profile can then be compared with the DNA profile of a known suspect. It can also be entered into the FBI’s National DNA Index System (NDIS) and searched against a database of DNA profiles from convicted offenders (and arrestees in more than half of the states) or unsolved crimes. Two DNA profiles are declared to match if the lists of alleles are the same. 177 The probability that two DNA profiles from different sources would have the same DNA profile (the random match probability) is then calculated based on the empirically measured frequency of each allele and established principles of population genetics (see p. 53). 178 Analysis of simple mixtures Many sexual assault cases involve DNA mixtures of two individuals, where one individual (i.e., the victim) is known. DNA analysis of these simple mixtures is also relatively straightforward. 
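The single-source rules described above (one or two distinct fragment lengths at each locus, and a match when the lists of alleles are the same) can be sketched as follows. The representation is an assumption made for illustration: a profile is a mapping from locus name to the set of observed allele sizes. The locus names and allele values are invented, and the sketch ignores artifacts and partial profiles.

```python
def looks_single_source(profile):
    """Return True if every locus shows one or two distinct alleles."""
    return all(1 <= len(alleles) <= 2 for alleles in profile.values())


def profiles_match(profile_a, profile_b):
    """Declare a match only if both profiles list the same alleles at the same loci."""
    # Dict equality compares the locus names and the allele sets together.
    return profile_a == profile_b


# Hypothetical profiles: locus name -> set of allele sizes (invented values).
evidence = {"locus_1": {12, 14}, "locus_2": {9}, "locus_3": {17, 21}}
suspect = {"locus_1": {12, 14}, "locus_2": {9}, "locus_3": {17, 21}}
bystander = {"locus_1": {12, 15}, "locus_2": {9}, "locus_3": {17, 21}}
mixture = {"locus_1": {12, 14, 15}, "locus_2": {9, 11}, "locus_3": {17, 21}}

print(looks_single_source(evidence))
print(profiles_match(evidence, suspect))
print(profiles_match(evidence, bystander))
print(looks_single_source(mixture))
```

A locus with three alleles, as in the hypothetical `mixture`, is the simplest signal that more than one contributor may be present.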
Methods have been used for 30 years to differentially extract DNA from sperm cells vs. vaginal epithelial cells, making it possible to generate DNA profiles from the two sources. Where the two cell types are the same but one contributor is known, the alleles of the known individual can be subtracted from the set of alleles identified in the mixture. 179 Once the known source is removed, the analysis of the unknown sample then proceeds as above for single-source samples. Like the analysis of single-source samples, the analysis of simple mixtures is a largely objective method. The examiner reviews the electropherogram to determine whether each of the peaks is a true allelic peak or an artifact (e.g., background noise in the form of stutter, spikes, and other phenomena) and to determine whether more than one individual could have contributed to the profile. In rare cases, an individual may have two fragments at a locus due to rare copy-number variation in the human genome. 177 When only a partial profile could be generated from the evidence sample (for example, in cases with limited quantities of DNA, degradation of the sample, or the presence of PCR inhibitors), an examiner may also report an “inclusion” if the partial profile is consistent with the DNA profile obtained from a reference sample. An examiner may also report an inclusion when the DNA results from a reference sample are present in a mixture. These cases generally require significantly more human analysis and interpretation than single-source samples. 178 Random match probabilities can also be expressed in terms of a likelihood ratio (LR), which is the ratio of (1) the probability of observing the DNA profile if the individual in question is the source of the DNA sample and (2) the probability of observing the DNA profile if the individual in question is not the source of the DNA sample.
In the situation of a single-source sample, the LR should be simply the reciprocal of the random match probability (because the first probability in the LR is 1 and the second probability is the random match probability).
179 In many cases, DNA will be present in the mixture in sufficiently different quantities so that the peak heights in the electropherogram from the two sources will be distinct, allowing the examiner to more readily separate out the sources.

Foundational Validity

To evaluate the foundational validity of an objective method (such as single-source and simple-mixture analysis), one can examine the reliability of each of the individual steps rather than having to rely on black-box studies.

Single-source samples

Each step in the analysis is objective and involves little or no human judgment.

(1) Feature identification. In contrast to the other methods discussed in this report, the features used in DNA analysis (the fragment lengths of the loci) are defined in advance.

(2) Feature measurement and comparison. PCR amplification, invented in 1983, is widely used by tens of thousands of molecular biology laboratories, including for many medical applications in which it has been rigorously validated. Multiplex PCR kits designed by commercial vendors for use by forensic laboratories must be validated both externally (through developmental validation studies published in peer-reviewed publications) and internally (by each lab that wishes to use the kit) before they may be used. 180 Fragment sizes are measured by an automated procedure whose variability is well characterized and small; the standard deviation is approximately 0.05 base pairs, which provides highly reliable measurements. 181,182 Developmental validation studies were performed, including by the FBI, to verify the accuracy, precision, and reproducibility of the procedure.
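To make the measurement-precision argument concrete, the sketch below expresses a sizing error as a multiple of the standard deviation and then computes the two-sided tail probability of an error at least that large. The tail probability rests on an added assumption that sizing errors are normally distributed; that assumption is made here for illustration and is not a claim drawn from the validation studies. The 0.05 base-pair standard deviation is the figure quoted above, and 0.5 base pair is the error that would be needed to mis-size a fragment by one base pair.

```python
import math


def sigma_multiple(error_bp, sd_bp):
    """Express a sizing error (in base pairs) in units of the standard deviation."""
    return error_bp / sd_bp


def two_sided_normal_tail(z):
    """P(|X| >= z) for a standard normal X (normality is an assumption of this sketch)."""
    return math.erfc(z / math.sqrt(2))


sd_bp = 0.05            # approximate standard deviation reported for fragment sizing
miscall_error_bp = 0.5  # error needed to shift a call by one base pair

z = sigma_multiple(miscall_error_bp, sd_bp)
tail = two_sided_normal_tail(z)
print(f"required error = {z:.1f} standard deviations")
print(f"two-sided normal tail probability = {tail:.3e}")
```

Real sizing errors need not be exactly normal, so the tail figure should be read as an order-of-magnitude illustration rather than a measured error rate.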
183,184 Laboratories that conduct forensic DNA analysis are required to follow FBI’s Quality Assurance Standards for DNA Testing Laboratories as a condition of participating in the National DNA Index System (www.fbi.gov/about-us/lab/biometricanalysis/codis/qas-standards-for-forensic-dna-testing-laboratories-effective-9-1-2011). FBI’s Scientific Working Group on DNA Analysis Methods (SWGDAM) has published guidelines for laboratories in validating procedures consistent the FBI’s Quality Assurance Standards (QAS). SWGDAM Validation Guidelines for DNA Analysis Methods, December 2012. See: media.wix.com/ugd/4344b0_cbc27d16dcb64fd88cb36ab2a2a25e4c.pdf. 181 Forensic laboratories typically use genetic analyzer systems developed by the Applied Biosystems group of ThermoFisher Scientific (ABI 310, 3130, or 3500). 182 To incorrectly estimate a fragment length by 1 base pair (the minimum size difference) requires a measurement error of 0.5 base pair, which corresponds to 10 standard deviations. Moreover, alleles typically differ by at least 4 base pairs (although some STR loci have fairly common alleles that differ by 1 or 2 nucleotides). 183 For examples of these studies see: Budowle, B., Moretti, T.R., Keys, K.M., Koons, B.W., and J.B. Smerick. “Validation studies of the CTT STR multiplex system.” Journal of Forensic Sciences, Vol. 42, No. 4 (1997): 701-7; Kimpton, C.P., Oldroyd, N.J., Watson, S.K., Frazier, R.R., Johnson, P.E., Millican, E.S., Urguhart, A., Sparkes, B.L., and P. Gill. “Validation of highly discriminating multiplex short tandem repeat amplification systems for individual identification.” Electrophoresis, Vol. 17, No. 8 (1996): 1283-93; Lygo, J.E., Johnson, P.E., Holdaway, D.J., Woodroffe, S., Whitaker, J.P., Clayton, T.M., Kimpton, C.P., and P. Gill. “The validation of short tandem repeat (STR) loci for use in forensic casework.” International Journal of Legal Medicine, Vol. 107, No. 2 (1994): 77-89; and Fregeau, C.J., Bowen, K.L., and R.M. Fourney. 
“Validation of highly polymorphic fluorescent multiplex short tandem repeat systems using two generations of DNA sequencers.” Journal of Forensic Sciences, Vol. 44, No. 1 (1999): 133-66. 184 For example, a 2001 study that compared the performance characteristics of several commercially available STR testing kits tested the consistency and reproducibility of results using previously typed case samples, environmentally insulted samples, and body fluid samples deposited on various substrates. The study found that all of the kits could be used to amplify and type STR loci successfully and that the procedures used for each of the kits were robust and valid. No evidence 180 71 (3) Feature comparison. For single-source samples, there are clear and well-specified “matching rules” for declaring whether the DNA profiles match. When complete DNA profiles are searched against the NDIS at “high stringency,” a “match” is returned only when each allele in the unknown profile is found to match an allele of the known profile, and vice versa. When partial DNA profiles obtained from a partially degraded or contaminated sample are searched at “moderate stringency,” candidate profiles are returned if each of the alleles in the unknown profile is found to match an allele of the known profile. 185,186 (4) Estimation of random match probability. The process for calculating the random match probability (that is, the probability of a match occurring by chance) is based on well-established principles of population genetics and statistics. The frequencies of the individual alleles were obtained by the FBI based on DNA profiles from approximately 200 unrelated individuals from each of six population groups and were evaluated prior to use. 187 The frequency of an overall pattern of alleles—that is, the random match probability—is typically estimated by multiplying the frequencies of the individual loci, under the assumption that the alleles are independent of one another. 
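The multiplication described above can be sketched in a few lines. With `theta=0` the per-locus values are the simple product-rule expressions (p squared for two copies of the same allele, 2pq for two different alleles). The optional `theta` argument applies a population-substructure correction in the form commonly attributed to Balding and Nichols; it is included as an assumption about what such a correction can look like, not as a description of any laboratory's procedure. The allele frequencies are invented for illustration.

```python
from math import prod


def locus_frequency(p, q=None, theta=0.0):
    """Genotype frequency at one locus.

    p, q  : allele frequencies; q is None when both copies are the same allele
    theta : substructure correction (0.0 gives p**2 or 2*p*q)
    """
    denom = (1 + theta) * (1 + 2 * theta)
    if q is None:
        return (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p) / denom
    return 2 * (theta + (1 - theta) * p) * (theta + (1 - theta) * q) / denom


def random_match_probability(genotypes, theta=0.0):
    """Multiply per-locus frequencies, treating loci as independent."""
    return prod(locus_frequency(p, q, theta) for p, q in genotypes)


# Hypothetical three-locus profile (frequencies invented for this example).
profile = [(0.10, None), (0.10, 0.20), (0.05, 0.30)]

rmp = random_match_probability(profile)
rmp_theta = random_match_probability(profile, theta=0.01)
likelihood_ratio = 1 / rmp  # single-source case: LR is the reciprocal of the RMP

print(f"RMP (product rule): {rmp:.3e}")
print(f"RMP (theta = 0.01): {rmp_theta:.3e}")
print(f"LR  (product rule): {likelihood_ratio:.1f}")
```

A real profile uses many more loci, which is why the resulting probabilities become so small; the three loci here only keep the arithmetic easy to follow.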
188 The resulting probability is typically less than 1 in 10 billion, excluding the possibility of close relatives. 189 (Note: Multiplying the frequencies of alleles can overstate the rarity of a pattern because the alleles are not completely independent, owing

of false positive or false negative results and no substantial evidence of preferential amplification within a locus were found for any of the testing kits. Moretti, T.R., Baumstark, A.L., Defenbaugh, D.A., Keys, K.M., Smerick, J.B., and B. Budowle. “Validation of Short Tandem Repeats (STRs) for forensic usage: performance testing of fluorescent multiplex STR systems and analysis of authentic and simulated forensic samples.” Journal of Forensic Sciences, Vol. 46, No. 3 (2001): 647-60.
185 See: FBI’s Frequently Asked Questions (FAQs) on the CODIS Program and the National DNA Index System. www.fbi.gov/about-us/lab/biometric-analysis/codis/codis-and-ndis-fact-sheet.
186 Contaminated samples are not retained in NDIS.
187 The initial population data generated by FBI included data for 6 ethnic populations with database sizes of 200 individuals. See: Budowle, B., Moretti, T.R., Baumstark, A.L., Defenbaugh, D.A., and K.M. Keys. “Population data on the thirteen CODIS core short tandem repeat loci in African Americans, U.S. Caucasians, Hispanics, Bahamians, Jamaicans, and Trinidadians.” Journal of Forensic Sciences, Vol. 44, No. 6 (1999): 1277-86 and Budowle, B., Shea, B., Niezgoda, S., and R. Chakraborty. “CODIS STR loci data from 41 sample populations.” Journal of Forensic Sciences, Vol. 46, No. 3 (2001): 453-89. Errors in the original database were reported in July 2015 (Erratum, Journal of Forensic Sciences, Vol. 60, No. 4 (2015): 1114-6), the impact of these discrepancies on profile probability calculations was assessed (and found to be less than a factor of 2 in a full profile), and the allele frequency estimates were amended accordingly.
At the same time as amending the original datasets, the FBI Laboratory also published expanded datasets in which the original samples were retyped for additional loci. In addition, the population samples that were originally studied at other laboratories were typed for additional loci, so the full dataset includes 9 populations. These “expanded” datasets are in use at the FBI Laboratory and can be found at www.fbi.gov/about-us/lab/biometric-analysis/codis/expanded-fbi-str-final-6-16-15.pdf.
188 More precisely, the frequency at each locus is calculated first. If the locus has two copies of the same allele with frequency p, the frequency is calculated as p². If the locus has two different alleles with respective frequencies p and q, the frequency is calculated as 2pq. The frequency of the overall pattern is calculated by multiplying together the values for the individual loci.
189 The random match probability will be higher for close relatives. For identical twins, the DNA profiles are expected to match perfectly. For first-degree relatives, the random match probability may be on the order of 1 in 100,000 when examining the 13 CODIS core STR loci. See: Butler, J.M. “The future of forensic DNA analysis.” Philosophical Transactions of the Royal Society B, 370: 20140252 (2015).

to population substructure. A 1996 NRC report concluded that the effect of population substructure on the calculated value was likely to be within a factor of 10 (for example, for a random match probability estimate of 1 in 10 million, the true probability is highly likely to be between 1 in 1 million and 1 in 100 million). 190 However, a recent study by NIST scientists suggests that the variation may be substantially greater than 10-fold. 191 The random match probability should be calculated using an appropriate statistical formula that takes account of population substructure.
192)

Simple mixtures

The steps for analyzing simple mixtures are the same as for analyzing single-source samples, up until the point of interpretation. DNA profiles that contain a mixture of two contributors, where one contributor is known, can be interpreted in much the same way as single-source samples. This occurs frequently in sexual assault cases, where a DNA profile contains a mixture of DNA from the victim and the perpetrator. Methods that are used to differentially extract DNA from sperm cells vs. vaginal epithelial cells in sexual assault cases are well-established. 193 Where the two cell types are the same, one DNA source may be dominant, resulting in a distinct contrast in peak heights between the two contributors; in these cases, the alleles from both the major contributor (corresponding to the larger allelic peaks) and the minor contributor can usually be reliably interpreted, provided the proportion of the minor contributor is not too low. 194

Validity as Applied

While DNA analysis of single-source samples and simple mixtures is a foundationally valid and reliable method, it is not infallible in practice. Errors can and do occur in DNA testing. Although the probability that two samples from different sources have the same DNA profile is tiny, the chance of human error is much higher. Such errors may stem from sample mix-ups, contamination, incorrect interpretation, and errors in reporting. 195

190 National Research Council. The Evaluation of Forensic DNA Evidence. The National Academies Press. Washington DC. (1996). Goode, M. “Some observations on evidence of DNA frequency.” Adelaide Law Review, Vol. 23 (2002): 45-77.
191 Gittelson, S. and J. Buckleton. “Is the factor of 10 still applicable today?” Presentation at the 68th Annual American Academy of Forensic Sciences Scientific Meeting, 2016. See: www.cstl.nist.gov/strbase/pub_pres/Gittelson-AAFS2016Factor-of-10.pdf.
192 Balding, D.J., and R.A. Nichols.
“DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands.” Forensic Science International, Vol. 64 (1994): 125-140. 193 Gill, P., Jeffreys, A.J., and D.J. Werrett. “Forensic application of DNA ‘fingerprints.’” Nature, Vol. 318, No. 6046 (1985): 577-9. 194 Clayton, T.M., Whitaker, J.P., Sparkes, R., and P. Gill. “Analysis and interpretation of mixed forensic stains using DNA STR profiling.” Forensic Science International, Vol. 91, No. 1 (1998): 55-70. 195 Krimsky, S., and T. Simoncelli. Genetic Justice: DNA Data Banks, Criminal Investigations, and Civil Liberties. Columbia University Press, (2011). Perhaps the most spectacular human error to date involved the German government’s investigation of the “Phantom of Heilbronn,” a woman whose DNA appeared at the scenes of more than 40 crimes in three countries, including 6 murders, several muggings and dozens of break-ins over the course of more than a decade. After an effort that included analyzing DNA samples from more than 3,000 women from four countries and that cost $18 million, authorities discovered that the woman of interest was a worker in the Austrian factory that fabricated the swabs used in DNA collection. The woman had inadvertently contaminated a large number of swabs with her own DNA, which was thus found in many DNA tests. 190 73 To minimize human error, the FBI requires, as a condition of participating in NDIS, that laboratories follow the FBI’s Quality Assurance Standards (QAS). 196 Before the results of the DNA analysis can be compared, the examiner is required to run a series of controls to check for possible contamination and ensure that the PCR process ran properly. The QAS also requires semi-annual proficiency testing of all DNA analysts that perform DNA testing for criminal cases. 
The results of the tests do not have to be published, but the laboratory must retain the results of the tests, any discrepancies or errors made, and corrective actions taken. 197 Forensic practitioners in the U.S. do not typically report quality issues that arise in forensic DNA analysis. By contrast, error rates in medical DNA testing are commonly measured and reported. 198 Refreshingly, a 2014 paper from the Netherlands Forensic Institute (NFI), a government agency, reported a comprehensive analysis of all “quality issue notifications” encountered in casework, categorized by type, source and impact. 199,200 The authors call for greater “transparency” and “culture change,” writing that: Forensic DNA casework is conducted worldwide in a large number of laboratories, both private companies and in institutes owned by the government. Quality procedures are in place in all laboratories, but the nature of the quality system varies a lot between the different labs. In particular, there are many forensic DNA laboratories that operate without a quality issue notification system like the one described in this paper. In our experience, such a system is extremely important for the detection and proper handling of errors. This is crucial in forensic casework that can have a major impact on people’s lives. We therefore propose that the implementation of a quality issue notification system is necessary for any laboratory that is involved in forensic DNA casework. Such system can only work in an optimal way, however, when there is a blame-free culture in the laboratory that extends to the police and the legal justice system. People have a natural tendency to hide their mistakes, and it is essential to create an atmosphere where there are no adverse personal consequences when mistakes are reported. The management should take the lead in this culture change... As far as we know, the NFI is the first forensic DNA laboratory in the world to reveal such detailed data and reports. 
It shows that this is possible without any disasters or abuse happening, and there are no FBI. “Quality assurance standards for forensic DNA testing laboratories.” (2011). See: www.fbi.gov/aboutus/lab/biometric-analysis/codis/qas-standards-for-forensic-dna-testing-laboratories-effective-9-1-2011. 197 Ibid., Sections 12, 13, and 14. 198 See, for example: Plebani, M., and P. Carroro. “Mistakes in a stat laboratory: types and frequency.” Clinical Chemistry, Vol. 43 (1997): 1348-51; Stahl, M., Lund, E.D., and I. Brandslund. “Reasons for a laboratory’s inability to report results for requested analytical tests.” Clinical Chemistry, Vol. 44 (1998): 2195-7; Hofgartner, W.T., and J.F. Tait. “Frequency of problems during clinical molecular-genetic testing.” American Journal of Clinical Pathology, Vol. 112 (1999): 14-21; and Carroro, P., and M. Plebani. “Errors in a stat laboratory: types and frequencies 10 years later.” Clinical Chemistry, Vol. 53 (2007): 1338-42. 199 Kloosterman, A., Sjerps, M., and A. Quak. “Error rates in forensic DNA analysis: Definition, numbers, impact and communication.” Forensic Science International: Genetics, Vol. 12 (2014): 77-85 and J.M. Butler “DNA Error Rates” presentation at the International Forensics Symposium, Washington, D.C. (2015). www.cstl.nist.gov/strbase/pub_pres/Butler-ErrorManagement-DNA-Error.pdf. 200 The Netherlands uses an “inquisitorial” approach to method of criminal justice rather than the adversarial system used in the U.S. Concerns about having to explain quality issues in court may explain in part why U.S. laboratories do not routinely report quality issues. 196 74 reasons for nondisclosure. As mentioned in the introduction, in laboratory medicine publication of data on error rates has become standard practice. Quality failure rates in this domain are comparable to ours. Finally, we note that there is a need to improve proficiency testing. 
There are currently no requirements concerning how challenging the proficiency tests should be. The tests should be representative of the full range of situations likely to be encountered in casework.

Finding 2: DNA Analysis

Foundational validity. PCAST finds that DNA analysis of single-source samples or simple mixtures of two individuals, such as from many rape kits, is an objective method that has been established to be foundationally valid.

Validity as applied. Because errors due to human failures will dominate the chance of coincidental matches, the scientific criteria for validity as applied require that an expert (1) should have undergone rigorous and relevant proficiency testing to demonstrate their ability to reliably apply the method, (2) should routinely disclose in reports and testimony whether, when performing the examination, he or she was aware of any facts of the case that might influence the conclusion, and (3) should disclose, upon request, all information about quality testing and quality issues in his or her laboratory.

5.2 DNA Analysis of Complex-mixture Samples

Some investigations involve DNA analysis of complex mixtures of biological samples from multiple unknown individuals in unknown proportions. Such samples might arise, for example, from mixed blood stains. As DNA testing kits have become more sensitive, there has been growing interest in “touch DNA”—for example, tiny quantities of DNA left by multiple individuals on a steering wheel of a car.

Methodology

The fundamental difference between DNA analysis of complex-mixture samples and DNA analysis of single-source samples and simple mixtures lies not in the laboratory processing, but in the interpretation of the resulting DNA profile. DNA analysis of complex mixtures—defined as mixtures with more than two contributors—is inherently difficult, and even more so for small amounts of DNA. 201 Such samples result in a DNA profile that superimposes multiple individual DNA profiles.
Interpreting a mixed profile is different for multiple reasons: each individual may contribute two, one or zero alleles at each locus; the alleles may overlap with one another; the peak heights may differ considerably, owing to differences in the amount and state of preservation of the DNA from each source; and the “stutter peaks” that surround alleles (common artifacts of the DNA amplification process) can obscure alleles that are present or suggest alleles that are not present. 202 It is often impossible to tell with certainty which alleles are present in the mixture or how many separate individuals contributed to the mixture, let alone to infer accurately the DNA profile of each individual. 203

Instead, examiners must ask: “Could a suspect’s DNA profile be present within the mixture profile? And, what is the probability that such an observation might occur by chance?” The questions are challenging for the reasons given above. Because many different DNA profiles may fit within some mixture profiles, the probability that a suspect “cannot be excluded” as a possible contributor to a complex mixture may be much higher (in some cases, millions of times higher) than the probabilities encountered for matches to single-source DNA profiles. As a result, proper calculation of the statistical weight is critical for presenting accurate information in court.

201 See, for example, SWGDAM document on interpretation of DNA mixtures. www.swgdam.org/#!public-comments/c1t82.

Subjective Interpretation of Complex Mixtures

Initial approaches to the interpretation of complex mixtures relied on subjective judgment by examiners, together with the use of simplified statistical methods such as the “Combined Probability of Inclusion” (CPI). These approaches are problematic because subjective choices made by examiners, such as about which alleles to include in the calculation, can dramatically alter the result and lead to inaccurate answers.
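To illustrate why such choices matter, the following minimal Python sketch computes a CPI-style statistic for a mixture. It assumes the textbook definition of CPI: at each locus, the squared sum of the population frequencies of all alleles the examiner treats as present in the mixture, multiplied across loci. The locus names, allele labels, and frequencies are hypothetical and are not drawn from any case or database discussed in this report.

```python
from math import prod

# Hypothetical allele frequencies for three made-up loci (illustrative only).
ALLELE_FREQS = {
    "locus_A": {"10": 0.20, "11": 0.30, "12": 0.50},
    "locus_B": {"7": 0.10, "8": 0.30, "9": 0.60},
    "locus_C": {"14": 0.05, "15": 0.15, "16": 0.80},
}


def combined_probability_of_inclusion(mixture_alleles, allele_freqs):
    """Return CPI: product over loci of (sum of included allele frequencies) ** 2."""
    per_locus = []
    for locus, alleles in mixture_alleles.items():
        freq_sum = sum(allele_freqs[locus][allele] for allele in alleles)
        per_locus.append(freq_sum ** 2)
    return prod(per_locus)


# Hypothetical examiner 1 uses all three loci.
all_loci = {
    "locus_A": ["10", "11"],
    "locus_B": ["7", "8"],
    "locus_C": ["14", "15"],
}
# Hypothetical examiner 2 omits locus_C, judging that an allele may have
# dropped out there. This is the kind of subjective choice described above.
without_c = {
    "locus_A": ["10", "11"],
    "locus_B": ["7", "8"],
}

cpi_all = combined_probability_of_inclusion(all_loci, ALLELE_FREQS)
cpi_without_c = combined_probability_of_inclusion(without_c, ALLELE_FREQS)

print(f"All loci:        about 1 in {1 / cpi_all:,.0f}")
print(f"Locus C omitted: about 1 in {1 / cpi_without_c:,.0f}")
```

With these made-up numbers, omitting a single locus makes the reported statistic 25 times less discriminating. Across the many loci typed in real casework, decisions of this kind can compound.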
The problem with subjective analysis of complex-mixture samples is illustrated by a 2003 double-homicide case, Winston v. Commonwealth. 204 A prosecution expert reported that the defendant could not be excluded as a possible contributor to DNA on a discarded glove that contained a mixed DNA profile of at least three contributors; the defendant was convicted and sentenced to death. The prosecutor told the jury that the probability that the match occurred by chance was 1 in 1.1 billion. A 2009 paper, however, makes a reasonable scientific case that the chance is closer to 1 in 2—that is, 50 percent of the relevant population could not be excluded. 205 Such a large discrepancy is unacceptable, especially in cases where a defendant was sentenced to death.

Two papers clearly demonstrate that these commonly used approaches for DNA analysis of complex mixtures can be problematic. In a 2011 study, Dror and Hampikian tested whether irrelevant contextual information biased the conclusions of examiners, using DNA evidence from an actual adjudicated criminal case (a gang rape case in Georgia). 206 In this case, one of the suspects implicated another in connection with a plea bargain. The two experts who examined evidence from the crime scene were aware of this testimony against the suspect and knew that the plea bargain testimony could be used in court only with corroborating DNA evidence. Due to the complex nature of the DNA mixture collected from the crime scene, the analysis of this evidence required judgment and interpretation on the part of the examiners. The two experts both concluded that the suspect could not be excluded as a contributor. Dror and Hampikian presented the original DNA evidence from this crime to 17 expert DNA examiners, but without any of the irrelevant contextual information. They found that only 1 out of the 17 experts agreed with the original experts who were exposed to the biasing information (in fact, 12 of the examiners excluded the suspect as a possible contributor).

202 Challenges with “low-template” DNA are described in a recent paper, Butler, J.M. “The future of forensic DNA analysis.” Philosophical Transactions of the Royal Society B, 370: 20140252 (2015).
203 See: Buckleton, J.S., Curran, J.M., and P. Gill. “Towards understanding the effect of uncertainty in the number of contributors to DNA stains.” Forensic Science International: Genetics, Vol. 1, No. 1 (2007): 20-8; and Coble, M.D., Bright, J.A., Buckleton, J.S., and J.M. Curran. “Uncertainty in the number of contributors in the proposed new CODIS set.” Forensic Science International: Genetics, Vol. 19 (2015): 207-11.
204 Winston v. Commonwealth, 604 S.E.2d 21 (Va. 2004).
205 Thompson, W.C. “Painting the target around the matching profile: the Texas sharpshooter fallacy in forensic DNA interpretation.” Law, Probability and Risk, Vol. 8, No. 3 (2009): 257-76.
206 Dror, I.E., and G. Hampikian. “Subjectivity and bias in forensic DNA mixture interpretation.” Science & Justice, Vol. 51, No. 4 (2011): 204-8.

In another paper, de Keijser and colleagues presented 19 DNA experts with a mock case involving an alleged violent robbery outside a bar:

There is a male suspect, who denies any wrongdoing. The items that were sampled for DNA analysis are the shirt of the (alleged) female victim (who claims to have been grabbed by her assailant), a cigarette butt that was picked up by the police and that was allegedly smoked by the victim and/or the suspect, and nail clippings from the victim, who claims to have scratched the perpetrator. 207

Although all the experts were provided the same DNA profiles (prepared from the three samples above and the two people), their conclusions varied wildly.
One examiner excluded the suspect as a possible contributor, while another examiner declared a match between the suspect’s profile and a few minor peaks in the mixed profile from the nails—reporting a random match probability of roughly 1 in 209 million. Still other examiners declared the evidence inconclusive.

In the summer of 2015, a remarkable chain of events in Texas revealed that the problems with subjective analysis of complex DNA mixtures were not limited to a few individual cases: they were systemic. 208 The Texas Department of Public Safety (TX-DPS) issued a public letter on June 30, 2015 to the Texas criminal justice community noting that (1) the FBI had recently reported that it had identified and corrected minor errors in its population databases used to calculate statistics in DNA cases, (2) the errors were not expected to have any significant effect on results, and (3) the TX-DPS Crime Laboratory System would, upon request, recalculate statistics previously reported in individual cases. When several prosecutors submitted requests for recalculation to TX-DPS and other laboratories, they were stunned to find that the statistics had changed dramatically—e.g., from 1 in 1.4 billion to 1 in 36 in one case, from 1 in 4000 to inconclusive in another. These prosecutors sought the assistance of the Texas Forensic Science Commission (TFSC) in understanding the reason for the change and the scope of potentially affected cases.

In consultation with forensic DNA experts, the TFSC determined that the large shifts observed in some cases were unrelated to the minor corrections in the FBI’s population database, but rather were due to the fact that forensic laboratories had changed the way in which they calculated the CPI statistic—especially how they dealt with phenomena such as “allelic dropout” at particular DNA loci. The TFSC launched a statewide DNA Mixture Notification Subcommittee, which included representatives of conviction integrity units, district and county attorneys, defense attorneys, innocence projects, the state attorney general, and the Texas governor. By September 2015, the TX-DPS had generated a county-by-county list of more than 24,000 DNA mixture cases analyzed from 1999-2015. Because TX-DPS is responsible for roughly half of the casework in the state, the total number of Texas DNA cases requiring review may exceed 50,000. (Although comparable efforts have not been undertaken in other states, the problem is likely to be national in scope, rather than specific to forensic laboratories in Texas.)

The TFSC also convened an international panel of scientific experts—from the Harvard Medical School, the University of North Texas Health Science Center, New Zealand’s forensic research unit, and NIST—to clarify the proper use of CPI. These scientists presented observations at a public meeting, where many attorneys learned for the first time the extent to which DNA-mixture analysis involved subjective interpretation. Many of the problems with the CPI statistic arose because existing guidelines did not clearly, adequately, or correctly specify the proper use or limitations of the approach.

In summary, the interpretation of complex DNA mixtures with the CPI statistic has been an inadequately specified—and thus inappropriately subjective—method. As such, the method is clearly not foundationally valid.

207 de Keijser, J.W., Malsch, M., Luining, E.T., Kranenbarg, M.W., and D.J.H.M. Lenssen. “Differential reporting of mixed DNA profiles and its impact on jurists’ evaluation of evidence: An international analysis.” Forensic Science International: Genetics, Vol. 23 (2016): 71-82.
208 Relevant documents and further details can be found at www.fsc.texas.gov/texas-dna-mixture-interpretation-casereview. Lynn Garcia, General Counsel for the Texas Forensic Science Commission, also provided a helpful summary to PCAST.
In an attempt to fill this gap, the experts convened by TFSC wrote a joint scientific paper, which was published online on August 31, 2016. 209 The paper underscores the “pressing need . . . for standardization of an approach, training and ongoing testing of DNA analysts.” The authors propose a set of specific rules for the use of the CPI statistic. The proposed rules are clearly necessary for a scientifically valid method for the application of CPI. Because the paper appeared just as this report was being finalized, PCAST has not had adequate time to assess whether the rules are also sufficient to define an objective and scientifically valid method for the application of CPI.

209 Bieber, F.R., Buckleton, J.S., Budowle, B., Butler, J.M., and M.D. Coble. “Evaluation of forensic DNA mixture evidence: protocol for evaluation, interpretation, and statistical calculations using the combined probability of inclusion.” BMC Genetics. bmcgenet.biomedcentral.com/articles/10.1186/s12863-016-0429-7.

Current Efforts to Develop Objective Methods

Given these problems, several groups have launched efforts to develop “probabilistic genotyping” computer programs that apply various algorithms to interpret complex mixtures. As of March 2014, at least 8 probabilistic genotyping software programs had been developed (called LRmix, Lab Retriever, likeLTD, FST, Armed Xpert, TrueAllele, STRmix, and DNA View Mixture Solution), with some being open source software and some being commercial products. 210 The FBI Laboratory began using the STRmix program less than a year ago, in December 2015, and is still in the process of publishing its own internal developmental validation. These probabilistic genotyping software programs clearly represent a major improvement over purely subjective interpretation.
However, they still require careful scrutiny to determine (1) whether the methods are scientifically valid, including defining the limitations on their reliability (that is, the circumstances in which they may yield unreliable results) and (2) whether the software correctly implements the methods. This is particularly important because the programs employ different mathematical algorithms and can yield different results for the same mixture profile. 211

Appropriate evaluation of the proposed methods should consist of studies by multiple groups, not associated with the software developers, that investigate the performance and define the limitations of programs by testing them on a wide range of mixtures with different properties. In particular, it is important to address the following issues:

(1) How well does the method perform as a function of the number of contributors to the mixture? How well does it perform when the number of contributors to the mixture is unknown?
(2) How does the method perform as a function of the number of alleles shared among individuals in the mixture? Relatedly, how does it perform when the mixtures include related individuals?
(3) How well does the method perform—and how does accuracy degrade—as a function of the absolute and relative amounts of DNA from the various contributors? For example, it can be difficult to determine whether a small peak in the mixture profile represents a true allele from a minor contributor or a stutter peak from a nearby allele from a different contributor. (Notably, this issue underlies a current case that has received considerable attention. 212)
(4) Under what circumstances—and why—does the method produce results (random inclusion probabilities) that differ substantially from those produced by other methods?

210 The topic is reviewed in Butler, J.M. “Chapter 13: Coping with Potential Missing Alleles.” Advanced Topics in Forensic DNA Typing: Interpretation. Waltham, MA: Elsevier/Academic, (2015): 333-48.
211 Some programs use discrete (semi-continuous) methods, which use only allele information in conjunction with probabilities of allelic dropout and dropin, while other programs use continuous methods, which also incorporate information about peak height and other information. Within these two classes, the programs differ with respect to how they use the information. Some of the methods involve making assumptions about the number of individuals contributing to the DNA profile, and use this information to clean up noise (such as “stutter” in DNA profiles).
212 In this case, examiners used two different DNA software programs (STRmix and TrueAllele) and obtained different conclusions concerning whether DNA from the defendant could be said to be included within the low-level DNA mixture profile obtained from a sample collected from one of the victim’s fingernails. The judge ruled that the DNA evidence implicating the defendant was inadmissible. McKinley, J. “Potsdam Boy’s Murder Case May Hinge on Minuscule DNA Sample From Fingernail.” New York Times. See: www.nytimes.com/2016/07/25/nyregion/potsdam-boys-murder-case-mayhinge-on-statistical-analysis.html (accessed August 22, 2016). Sommerstein, D. “DNA results will not be allowed in Hillary murder trial.” North Country Public Radio (accessed September 1, 2016). The decision can be found here: www.northcountrypublicradio.org/assets/files/08-26-16DecisionandOrder-DNAAnalysisAdmissibility.pdf.

A number of papers have been published that analyze known mixtures in order to address some of these issues. 213 Two points should be noted about these studies. First, most of the studies evaluating software packages have been undertaken by the software developers themselves.
While it is completely appropriate for method developers to evaluate their own methods, establishing scientific validity also requires scientific evaluation by other scientific groups that did not develop the method. Second, there have been few comparative studies across the methods to evaluate the differences among them—and, to our knowledge, no comparative studies conducted by independent groups. 214

Most importantly, current studies have adequately explored only a limited range of mixture types (with respect to number of contributors, ratio of minor contributors, and total amount of DNA). The two most widely used methods (STRmix and TrueAllele) appear to be reliable within a certain range, based on the available evidence and the inherent difficulty of the problem. 215 Specifically, these methods appear to be reliable for three-person mixtures in which the minor contributor constitutes at least 20 percent of the intact DNA in the mixture and in which the DNA amount exceeds the minimum level required for the method. 216

213 For example: Perlin, M.W., Hornyak, J.M., Sugimoto, G., and K.W.P. Miller. “TrueAllele genotype identification on DNA mixtures containing up to five unknown contributors.” Journal of Forensic Sciences, Vol. 60, No. 4 (2015): 857-868; Greenspoon S.A., Schiermeier-Wood L., and B.C. Jenkins. “Establishing the limits of TrueAllele® Casework: A validation study.” Journal of Forensic Sciences. Vol. 60, No. 5 (2015): 1263–76; Bright, J.A., Taylor, D., McGovern, C., Cooper, S., Russell, L., Abarno, D., and J.S. Buckleton. “Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.” Forensic Science International: Genetics. Vol. 23 (2016): 226-39; Bright, J-A., Taylor D., Curran, J.S., and J.S. Buckleton. “Searching mixed DNA profiles directly against profile databases.” Forensic Science International: Genetics. Vol. 9 (2014): 102-10; Taylor D., Buckleton J, and I. Evett. “Testing likelihood ratios produced from complex DNA profiles.” Forensic Science International: Genetics. Vol. 16 (2015): 165-171; Taylor D. and J.S. Buckleton. “Do low template DNA profiles have useful quantitative data?” Forensic Science International: Genetics, Vol. 16 (2015): 13-16.
214 Bille, T.W., Weitz, S.M., Coble, M.D., Buckleton, J., and J.A. Bright. “Comparison of the performance of different models for the interpretation of low level mixed DNA profiles.” Electrophoresis. Vol. 35 (2014): 3125–33.
215 The interpretation of DNA mixtures becomes increasingly challenging as the number of contributors increases. See, for example: Taylor D., Buckleton J, and I. Evett. “Testing likelihood ratios produced from complex DNA profiles.” Forensic Science International: Genetics. Vol. 16 (2015): 165-171; Bright, J.A., Taylor, D., McGovern, C., Cooper, S., Russell, L., Abarno, D., and J.S. Buckleton. “Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.” Forensic Science International: Genetics. Vol. 23 (2016): 226-39; Bright, J-A., Taylor D., Curran, J.S., and J.S. Buckleton. “Searching mixed DNA profiles directly against profile databases.” Forensic Science International: Genetics. Vol. 9 (2014): 102-10; Bieber, F.R., Buckleton, J.S., Budowle, B., Butler, J.M., and M.D. Coble. “Evaluation of forensic DNA mixture evidence: protocol for evaluation, interpretation, and statistical calculations using the combined probability of inclusion.” BMC Genetics. bmcgenet.biomedcentral.com/articles/10.1186/s12863-016-0429-7.
216 Such three-person samples involving similar proportions are more straightforward to interpret owing to the limited number of alleles and relatively similar peak height. The methods can also be reliably applied to single-source and simple-mixture samples, provided that, in cases where the two contributions cannot be separated by differential extraction, the proportion of the minor contributor is not too low (e.g., at least 10 percent).

For more complex mixtures (e.g., more contributors or lower proportions), there is relatively little published evidence. 217 In human molecular genetics, an experimental validation of an important diagnostic method would typically involve hundreds of distinct samples. 218 One forensic scientist told PCAST that many more distinct samples have, in fact, been analyzed, but that the data have not yet been collated and published. 219 Because empirical evidence is essential for establishing the foundational validity of a method, PCAST urges forensic scientists to submit, and leading scientific journals to publish, high-quality validation studies that properly establish the range of reliability of methods for the analysis of complex DNA mixtures.

When further studies are published, it will likely be possible to extend the range in which scientific validity has been established to include more challenging samples. As noted above, such studies should be performed by or should include independent research groups not connected with the developers of the methods and with no stake in the outcome.

Conclusion

Based on its evaluation of the published literature to date, PCAST reached several conclusions concerning the foundational validity of methods for the analysis of complex DNA mixtures. We note that foundational validity must be established with respect to a specified method applied to a specified range. In addition to forming its own judgment, PCAST also consulted with John Butler, Special Assistant to the Director for Forensic Science at NIST and Vice Chair of the NCFS. 220 Butler concurred with PCAST’s finding.
217 For four-person mixtures, for example, papers describing experimental validations with known mixtures using TrueAllele involve 7 and 17 distinct mixtures, respectively, with relatively large amounts of DNA (at least 200 pg), while those using STRmix involve 2 and 3 distinct mixtures, respectively, but use much lower amounts of DNA (in the range of 10 pg). Greenspoon S.A., Schiermeier-Wood L., and B.C. Jenkins. “Establishing the limits of TrueAllele® Casework: A validation study.” Journal of Forensic Sciences. Vol. 60, No. 5 (2015): 1263–76; Perlin, M.W., Hornyak, J.M., Sugimoto, G., and K.W.P. Miller. “TrueAllele genotype identification on DNA mixtures containing up to five unknown contributors.” Journal of Forensic Sciences, Vol. 60, No. 4 (2015): 857-868; Taylor, D. “Using continuous DNA interpretation methods to revisit likelihood ratio behavior.” Forensic Science International: Genetics, Vol. 11 (2014): 144-153; Taylor D., Buckleton J, and I. Evett. “Testing likelihood ratios produced from complex DNA profiles.” Forensic Science International: Genetics. Vol. 16 (2015): 165-171; Taylor D. and J.S. Buckleton. “Do low template DNA profiles have useful quantitative data?” Forensic Science International: Genetics, Vol. 16 (2015): 13-16; Bright, J.A., Taylor, D., McGovern, C., Cooper, S., Russell, L., Abarno, D., and J.S. Buckleton. “Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.” Forensic Science International: Genetics. Vol. 23 (2016): 226-39.
218 Preparing and performing PCR amplification on hundreds of DNA mixtures is straightforward; it can be accomplished within a few weeks or less.
219 PCAST interview with John Buckleton, Principal Scientist at New Zealand’s Institute of Environmental Science and Research and a co-developer of STRmix.
220 Butler is a world authority on forensic DNA analysis, whose Ph.D. research, conducted at the FBI Laboratory, pioneered techniques of modern forensic DNA analysis and who has written five widely acclaimed textbooks on forensic DNA typing. See: Butler, J.M. Forensic DNA Typing: Biology and Technology behind STR Markers. Academic Press, London (2001); Butler, J.M. Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers (2nd Edition). Elsevier Academic Press, New York (2005); Butler, J.M. Fundamentals of Forensic DNA Typing. Elsevier Academic Press, San Diego (2010); Butler, J.M. Advanced Topics in Forensic DNA Typing: Methodology. Elsevier Academic Press, San Diego (2012); Butler, J.M. Advanced Topics in Forensic DNA Typing: Interpretation. Elsevier Academic Press, San Diego (2015).

Finding 3: DNA analysis of complex-mixture samples

Foundational validity. PCAST finds that:

(1) Combined-Probability-of-Inclusion (CPI)-based methods. DNA analysis of complex mixtures based on CPI-based approaches has been an inadequately specified, subjective method that has the potential to lead to erroneous results. As such, it is not foundationally valid. A very recent paper has proposed specific rules that address a number of problems in the use of CPI. These rules are clearly necessary. However, PCAST has not had adequate time to assess whether they are also sufficient to define an objective and scientifically valid method. If, for a limited time, courts choose to admit results based on the application of CPI, validity as applied would require that, at a minimum, they be consistent with the rules specified in the paper. DNA analysis of complex mixtures should move rapidly to more appropriate methods based on probabilistic genotyping.

(2) Probabilistic genotyping. Objective analysis of complex DNA mixtures with probabilistic genotyping software is a relatively new and promising approach. Empirical evidence is required to establish the foundational validity of each such method within specified ranges.
At present, published evidence supports the foundational validity of analysis, with some programs, of DNA mixtures of 3 individuals in which the minor contributor constitutes at least 20 percent of the intact DNA in the mixture and in which the DNA amount exceeds the minimum required level for the method. The range in which foundational validity has been established is likely to grow as adequate evidence for more complex mixtures is obtained and published.

Validity as applied. For methods that are foundationally valid, validity as applied involves similar considerations as for DNA analysis of single-source and simple-mixture samples, with a special emphasis on ensuring that the method was applied correctly and within its empirically established range.

The Path Forward

There is a clear path for extending the range over which objective methods have been established to be foundationally valid—specifically, through the publication of appropriate scientific studies. Such efforts will be aided by the creation and dissemination (under appropriate data-use and data-privacy restrictions) of large collections of hundreds of DNA profiles created from known mixtures—representing widely varying complexity with respect to (1) the number of contributors, (2) the relationships among contributors, (3) the absolute and relative amounts of materials, and (4) the state of preservation of materials—that can be used by independent groups to evaluate and compare the methods. Notably, the PROVEDIt Initiative (Project Research Openness for Validation with Experimental Data) at Boston University has made available a resource of 25,000 profiles from DNA mixtures. 221,222

In addition to scientific studies on common sets of samples for the purpose of evaluating foundational validity, individual forensic laboratories will want to conduct their own internal developmental validation studies to assess the validity of the method in their own hands. 223

NIST should play a leadership role in this process, by ensuring the creation and dissemination of materials and stimulating studies by independent groups through grants, contracts, and prizes; and by evaluating the results of these studies.

5.3 Bitemark Analysis

Methodology

Bitemark analysis is a subjective method. It typically involves examining marks left on a victim or an object at the crime scene, and comparing those marks with dental impressions taken from a suspect. 224 Bitemark comparison is based on the premises that (1) dental characteristics, particularly the arrangement of the front teeth, differ substantially among people and (2) skin (or some other marked surface at a crime scene) can reliably capture these distinctive features.

Bitemark analysis begins with an examiner deciding whether an injury is a mark caused by human teeth. 225 If so, the examiner creates photographs or impressions of the questioned bitemark and of the suspect’s dentition; compares the bitemark and the dentition; and determines if the dentition (1) cannot be excluded as having made the bitemark, (2) can be excluded as having made the bitemark, or (3) is inconclusive. The bitemark standards do not provide well-defined criteria concerning the degree of similarity that must be identified to support a reliable conclusion that the mark could have or could not have been created by the dentition in question. Conclusions about all these matters are left to the examiner’s judgment.

221 See: www.bu.edu/dnamixtures.
222 The collection contains DNA samples with 1- to 5-person DNA mixtures, amplified with targets ranging from 1 to 0.007 ng. In the multi-person mixtures, the ratios of contributors range from 1:1 to 1:19. Additionally, the profiles were generated using a variety of laboratory conditions from samples containing pristine DNA; UV damaged DNA; enzymatically or sonically degraded DNA; and inhibited DNA.
223 The FBI Laboratory has recently completed a developmental validation study and is preparing it for publication.
224 Less frequently, marks are found on a suspected perpetrator that may have come from a victim.
225 ABFO Bitemark Methodology Standards and Guidelines, abfo.org/wp-content/uploads/2016/03/ABFO-BitemarkStandards-03162016.pdf (accessed July 2, 2016).

Background Studies

Before turning to the question of foundational validity, we discuss some background studies (concerning such topics as uniqueness and consistency) that shed some light on the field. These studies cast serious doubt on the fundamental premises of the field.

A widely cited 1984 paper claimed that “human dentition was unique beyond any reasonable doubt.” 226 The study examined 397 bitemarks carefully made in a wax wafer, measured 12 parameters from each, and—assuming, without any evidence, that the parameters were uncorrelated with each other—suggested that the chance of two bitemarks having the same parameters is less than one in six trillion. The paper was theoretical rather than empirical: it did not attempt to actually compare the bitemarks to one another.

A 2010 paper debunked these claims. 227 By empirically studying 344 human dental casts and measuring them by three-dimensional laser scanning, these authors showed that matches occurred vastly more often than expected under the theoretical model. For example, the theoretical model predicted that the probability of finding even a single five-tooth match among the collection of bitemarks is less than one in one million; yet, the empirical comparison revealed 32 such matches.

Notably, these studies examined human dentition patterns measured under idealized conditions. By contrast, skin has been shown to be an unreliable medium for recording the precise pattern of teeth. Studies that have involved inflicting bitemarks either on living pigs 228 (used as a model of human skin) or human cadavers 229 have demonstrated significant distortion in all directions.
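The product-rule reasoning behind the 1984 estimate discussed above can be made concrete with a minimal Python sketch. Under independence, the probability that two dentitions agree on every parameter is the product of the per-parameter probabilities. If the parameters are correlated, the joint probability can be as large as the smallest single-parameter probability. The per-parameter value of 0.1 is hypothetical and chosen only for illustration. The counts of 12 parameters and 344 casts come from the text, and the all-pairs comparison is an assumption about how matches might be tallied, not a description of either study's method.

```python
from math import comb, prod


def joint_probability_if_independent(per_parameter_probs):
    """Product rule: valid only if the parameters are statistically independent."""
    return prod(per_parameter_probs)


def joint_probability_upper_bound(per_parameter_probs):
    """With arbitrary correlation, the joint probability can reach the minimum."""
    return min(per_parameter_probs)


# Hypothetical: each of 12 parameters agrees by chance with probability 0.1.
probs = [0.1] * 12
independent = joint_probability_if_independent(probs)
upper_bound = joint_probability_upper_bound(probs)

# Distinct pairs if each of 344 casts is compared with every other (assumed design).
pairs = comb(344, 2)

print(f"Assuming independence:     {independent:.1e}")
print(f"Bound under correlation:   {upper_bound:.1e}")
print(f"Pairs among 344 casts:     {pairs}")
```

The gap between the two probabilities shows why an untested independence assumption can understate the chance of a coincidental match by many orders of magnitude.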
A 2010 study of experimentally created bitemarks produced by known biters concluded that skin deformation distorts bitemarks so substantially and so variably that current procedures for comparing bitemarks are unable to reliably exclude or include a suspect as a potential biter (“The data derived showed no correlation and was not reproducible, that is, the same dentition could not create a measurable impression that was consistent in all of the parameters in any of the test circumstances.”). 230 Such distortion is further complicated in the context of criminal cases, where biting often occurs during struggles, in which skin may be stretched and contorted at the time a bitemark is created. Empirical research suggests that forensic odontologists do not consistently agree even on whether an injury is a human bitemark at all. A study by the American Board of Forensic Odontology (ABFO) 231 involved showing photos of 100 patterned injuries to ABFO board-certified bitemark analysts, and asking them to answer three basic questions concerning (1) whether there was sufficient evidence to render an opinion as to whether the patterned injury is a human bitemark; (2) whether the mark is a human bitemark, suggestive of a human 226 Rawson, R.D., Ommen, R.K., Kinard, G., Johnson, J., and A. Yfantis. “Statistical evidence for the individuality of the human dentition.” Journal of Forensic Sciences, Vol. 29, No. 1 (1984): 245-53. 227 Bush, M.A., Bush, P.J., and H.D. Sheets. “Statistical evidence for the similarity of the human dentition.” Journal of Forensic Sciences, Vol. 56, No. 1 (2011): 118-23. 228 Dorion, R.B.J., ed. Bitemark Evidence: A Color Atlas and Text. 2nd ed. CRC Press-Taylor & Francis, Boca Raton, Florida (2011). 229 Sheets, H.D., Bush, P.J., and M.A. Bush. “Bitemarks: distortion and covariation of the maxillary and mandibular dentition as impressed in human skin.” Forensic Science International, Vol. 223, No. 1-3 (2012): 202-7.
Bush, M.A., Miller, R.G., Bush, P.J., and R.B. Dorion. “Biomechanical factors in human dermal bitemarks in a cadaver model.” Journal of Forensic Sciences, Vol. 54, No. 1 (2009): 167-76. 230 Bush, M.A., Cooper, H.I., and R.B. Dorion. “Inquiry into the scientific basis for bitemark profiling and arbitrary distortion compensation.” Journal of Forensic Sciences, Vol. 55, No. 4 (2010): 976-83. 231 Adam Freeman and Iain Pretty “Construct validity of bitemark assessments using the ABFO decision tree,” presentation at the 2016 Annual Meeting of the American Academy of Forensic Sciences. See: online.wsj.com/public/resources/documents/ConstructValidBMdecisiontreePRETTYFREEMAN.pdf. 84 bitemark, or not a human bitemark; and (3) whether distinct features (arches and toothmarks) were identifiable. 232 Among the 38 examiners who completed the study, it was reported that there was unanimous agreement on the first question in only 4 of the 100 cases and agreement of at least 90 percent in only 20 of the 100 cases. Across all three questions, there was agreement of at least 90 percent in only 8 of the 100 cases. In a similar study in Australia, 15 odontologists were shown a series of six bitemarks from contemporary cases, five of which were marks confirmed by living victims to have been caused by teeth, and were asked to explain, in narrative form, whether the injuries were, in fact, bitemarks. 233 The study found wide variability among the practitioners in their conclusions about the origin, circumstance, and characteristics of the patterned injury for all six images. Surprisingly, those with the most experience (21 or more years) tended to have the widest range of opinions as to whether a mark was of human dental origin or not. 234 Examiners’ opinions varied considerably as to whether they thought a given mark was suitable for analysis, and individual practitioners demonstrated little consistency in their approach in analyzing one bitemark to the next. 
The study concluded that this “inconsistency indicates a fundamental flaw in the methodology of bitemark analysis and should lead to concerns regarding the reliability of any conclusions reached about matching such a bitemark to a dentition.” 235

Studies of Scientific Validity and Reliability

As discussed above, the foundational validity of a subjective method can only be established through multiple independent black-box studies. The 2009 NRC report found that the scientific validity of bitemark analysis had not been established. 236 In its own review of the literature, PCAST found few empirical studies that attempted to study the validity and reliability of the methods to identify the source of a bitemark. In a 1975 paper, two examiners were asked to match photographs of bitemarks made by 24 volunteers in skin from freshly slaughtered pigs with dental models from these same volunteers. 237 The photographs were taken at 0, 1, and 24 hours after the bitemark was produced. Examiners’ performance was poor and deteriorated with The raw data are made available by the authors upon request. They were reviewed by Professor Karen Kafadar, a member of the panel of Senior Advisors for this study. 233 Page, M., Taylor, J., and M. Blenkin. “Expert interpretation of bitemark injuries – a contemporary qualitative study.” Journal of Forensic Sciences, Vol. 58, No. 3 (2013): 664-72. 234 For example, one examiner expressed certainty that one of the images was a bitemark, stating, “I know from experience that that’s teeth because I did a case at the beginning of the year, that when I first looked at the images I didn’t think they were teeth, because the injuries were so severe. But when I saw the models, and scratched them down my arm, they looked just like that.” Another expressed doubt that the same image was a bitemark, also based on his or her experience: “Honestly I don’t think it’s a bite mark… there could be any number of things that could have caused that.
Whether this is individual tooth marks here I doubt. I’ve never seen anything like that.” Ibid., 666. 235 Ibid., 670. 236 “There is continuing dispute over the value and scientific validity of comparing and identifying bite marks.” National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 151. 237 Whittaker, D.K. “Some laboratory studies on the accuracy of bitemark comparison.” International Dental Journal, Vol. 25, No. 3 (1975): 166–71. 232 85 time following the bite. The proportion of photographs incorrectly attributed was 28 percent, 65 percent, and 84 percent at the 0, 1, and 24 hour time points. In a 1999 paper, 29 forensic dental experts—as well as 80 others, including general dentists, dental students, and lay participants—were shown color prints of human bitemarks from 50 court cases and asked to decide whether each bitemark was made by an adult or a child. 238 The decisions were compared to the verdict from the cases. All groups performed poorly. 239 In a 2001 paper, 32 ABFO-certified diplomates were asked to report their certainty that 4 specific bitemarks might have come from each of 7 dental models, consisting of the four correct sources and three unrelated samples. 240,241 Such a “closed-set” design (where the correct source is present for each questioned sample) is inappropriate for assessing reliability, because it will tend to underestimate the false positive rate. 242 Even with this closed-set design, 11 percent of comparisons to the incorrect source were declared to be “probable,” “possible,” or “reasonable medical certainty” matches. In another 2001 paper, 10 ABFO-certified diplomates were given 10 independent tests, each consisting of bitemark evidence and two possible sources. The evidence was produced by clamping a dental model onto freshly slaughtered pigs, subjectively confirming that “sufficient detail was recorded,” and photographing the bitemark.
The correct source was present in all but two of the tests (mostly closed-set design). The mean false positive rate was 15.9 percent—that is, roughly 1 in 6. In a 2010 paper, 29 examiners with various levels of training (including 9 ABFO-certified diplomates) were provided with photographs of 18 human bitemarks and dentition from three human individuals (A, B, C) and were asked to decide whether the bitemarks came from A, B, C, or none of the above. The bitemarks had been produced in live pigs, using a biting machine with dentition from individuals A, B, and D (for which the dentition was not provided to the examiners). For bitemarks produced by D, the diplomates erroneously declared a match to A, B, or C in 17 percent of cases—again, roughly 1 in 6. Whittaker, D.K., Brickley, M.R., and L. Evans. “A comparison of the ability of experts and non-experts to differentiate between adult and child human bite marks using receiver operating characteristic (ROC) analysis.” Forensic Science International, Vol. 92, No. 1 (1998): 11-20. 239 The authors asked observers to indicate how certain they were a bitemark was made by an adult, using a 6 point scale. Receiver-Operator Characteristic (ROC) curves were derived from the data. The Area under the Curve (AUC) was calculated for each group (where AUC = 1 represents perfect classification and AUC = 0.5 is equivalent to random decision-making). The AUC was between 0.62 and 0.69, which is poor. 240 Arheart, K.L., and I.A. Pretty. “Results of the 4th ABFO Bitemark Workshop-1999.” Forensic Science International, Vol. 124, No. 2-3 (2001): 104-11. 241 The four bitemarks consisted of three from criminal cases and one produced by an individual deliberately biting into a block of cheese. The seven dental models corresponded to the three defendants convicted in the criminal cases (presumed to be the biters), the individual who bit the cheese, and three unrelated individuals.
242 In closed-set tests, examiners will perform well as long as they choose the closest matching dental model. In an open-set design in which none of the models may be correct, the opportunity for false positives is higher. The open-set design resembles the application in casework. See the extensive discussion of closed-set designs in firearms analysis (Section 5.5). 238 86

Conclusion

Few empirical studies have been undertaken to study the ability of examiners to accurately identify the source of a bitemark. Among those studies that have been undertaken, the observed false positive rates were so high that the method is clearly scientifically unreliable at present. (Moreover, several of these studies employ inappropriate closed-set designs that are likely to underestimate the false-positive rate.)

Finding 4: Bitemark analysis

Foundational validity. PCAST finds that bitemark analysis does not meet the scientific standards for foundational validity, and is far from meeting such standards. To the contrary, available scientific evidence strongly suggests that examiners cannot consistently agree on whether an injury is a human bitemark and cannot identify the source of a bitemark with reasonable accuracy.

The Path Forward

Some practitioners have expressed concern that the exclusion of bitemarks in court could hamper efforts to convict defendants in some cases. 243 If so, the correct solution, from a scientific perspective, would not be to admit expert testimony based on invalid and unreliable methods, but rather to attempt to develop scientifically valid methods. However, PCAST considers the prospects of developing bitemark analysis into a scientifically valid method to be low. We advise against devoting significant resources to such efforts.

5.4 Latent Fingerprint Analysis

Latent fingerprint analysis was first proposed for use in criminal identification in the 1800s and has been used for more than a century.
The method was long hailed as infallible, despite the lack of appropriate studies to assess its error rate. As discussed above, this dearth of empirical testing indicated a serious weakness in the scientific culture of forensic science—where validity was assumed rather than proven. Citing earlier guidelines now acknowledged to have been inappropriate, 244 the DOJ recently noted, Historically, it was common practice for an examiner to testify that when the … methodology was correctly applied, it would always produce the correct conclusion. Thus any error that occurred would be human error and the resulting error rate of the methodology would be zero. This view was described by the Department of Justice in 1984 in the publication The Science of Fingerprints, where it states, “Of all the methods of identification, fingerprinting alone has proved to be both infallible and feasible.” 245 In response to the 2009 NRC report, the latent print analysis field has made progress in recognizing the need to perform empirical studies to assess foundational validity and measure reliability. Much credit goes to the FBI The precise proportion of cases in which bitemarks play a key role is unclear, but is clearly small. Federal Bureau of Investigation. The Science of Fingerprints. U.S. Government Printing Office. (1984): iv. 245 See: www.justice.gov/olp/file/861906/download. 243 244 87 Laboratory, which has led the way in performing both black-box studies, designed to measure reliability, and “white-box studies,” designed to understand the factors that affect examiners’ decisions. 246 PCAST applauds the FBI’s efforts. There are also nascent efforts to begin to move the field from a purely subjective method toward an objective method—although there is still a considerable way to go to achieve this important goal. 
Methodology

Latent fingerprint analysis typically involves comparing (1) a “latent print” (a complete or partial friction-ridge impression from an unknown subject that has been developed or observed on an item) with (2) one or more “known prints” (fingerprints deliberately collected under a controlled setting from known subjects; also referred to as “ten prints”), to assess whether the two may have originated from the same source. (It may also involve comparing latent prints with one another.) It is important to distinguish latent prints from known prints. A known print contains fingerprint images of up to ten fingers captured in a controlled setting, such as an arrest or a background check. 247 Because known prints tend to be of high quality, they can be searched automatically and reliably against large databases. By contrast, latent prints in criminal cases are often incomplete and of variable quality (smudged or otherwise distorted), with quality and clarity depending on such factors as the surface touched and the mechanics of touch. An examiner might be called upon to (1) compare a latent print to the fingerprints of a known suspect who has been identified by other means (“identified suspect”) or (2) search a large database of fingerprints to identify a suspect (“database search”). See: Hicklin, R.A., Buscaglia, J., Roberts, M.A., Meagher, S.B., Fellner, W., Burge, M.J., Monaco, M., Vera, D., Pantzer, L.R., Yeung, C.C., and N. Unnikumaran. “Latent fingerprint quality: a survey of examiners.” Journal of Forensic Identification. Vol. 61, No. 4 (2011): 385-419; Hicklin, R.A., Buscaglia, J., and M.A. Roberts. “Assessing the clarity of friction ridge impressions.” Forensic Science International, Vol. 226, No. 1 (2013): 106-17; Ulery, B.T., Hicklin, R.A., Kiebuzinski, G.I., Roberts, M.A., and J. Buscaglia. “Understanding the sufficiency of information for latent fingerprint value determinations.” Forensic Science International, Vol. 230, No.
1-3 (2013): 99-106; Ulery, B.T., Hicklin, R.A., and J. Buscaglia. “Repeatability and reproducibility of decisions by latent fingerprint examiners.” PLoS ONE, (2012); and Ulery, B.T., Hicklin, R.A., Roberts, M.A., and J. Buscaglia. “Changes in latent fingerprint examiners’ markup between analysis and comparison.” Forensic Science International, Vol. 247 (2015): 54-61. 247 See: Committee on Science, Subcommittee on Forensic Science of the National Science and Technology Council. “Achieving Interoperability for Latent Fingerprint Identification in the United States.” (2014). www.whitehouse.gov/sites/default/files/microsites/ostp/NSTC/afis_10-20-2014_draftforcomment.pdf. 246 88 Examiners typically follow an approach called “ACE” or “ACE-V,” for Analysis, Comparison, Evaluation, and Verification. 248,249 The approach calls on examiners to make a series of subjective assessments. An examiner uses subjective judgment to select particular regions of a latent print for analysis. If there are no identified persons of interest, the examiner will run the latent print against an Automated Fingerprint Identification System (AFIS), 250 containing large numbers of known prints, which uses non-public, proprietary image-recognition algorithms 251 to generate a list of potential candidates that share similar fingerprint features. 252 The examiner then manually compares the latent print to the fingerprints from the specific person of interest or from the closest candidate matches generated by the computer by studying selected features 253 and then comes to a subjective decision as to whether they are similar enough to declare a proposed identification. ACE-V adds a verification step. For the verification step, implementation varies widely.
254 In many laboratories, only identifications are verified, because it is considered too burdensome, in terms of time and cost, to conduct “A latent print examination using the ACE-V process proceeds as follows: Analysis refers to an initial information-gathering phase in which the examiner studies the unknown print to assess the quality and quantity of discriminating detail present. The examiner considers information such as substrate, development method, various levels of ridge detail, and pressure distortions. A separate analysis then occurs with the exemplar print. Comparison is the side-by-side observation of the friction ridge detail in the two prints to determine the agreement or disagreement in the details. In the Evaluation phase, the examiner assesses the agreement or disagreement of the information observed during Analysis and Comparison and forms a conclusion. Verification in some agencies is a review of an examiner’s conclusions with knowledge of those conclusions; in other agencies, it is an independent re-examination by a second examiner who does not know the outcome of the first examination.” National Institute of Standards and Technology. “Latent Print Examination and Human Factors: Improving the Practice through a Systems Approach.” (2012), available at: www.nist.gov/oles/upload/latent.pdf. 249 Reznicek, M., Ruth, R.M., and D.M. Schilens. “ACE-V and the scientific method.” Journal of Forensic Identification, Vol. 60, No. 1 (2010): 87-103. 250 State and local jurisdictions began purchasing AFIS systems in the 1970s and 1980s from private vendors, each with their own proprietary software and searching algorithms. In 1999, the FBI launched the Integrated Automated Fingerprint Identification System (IAFIS), a national fingerprint database that houses fingerprints and criminal histories on more than 70 million subjects submitted by state, local and federal law enforcement agencies (recently replaced by the Next Generation Identification (NGI) System).
Some criminal justice agencies have the ability to search latent prints not only against their own fingerprint database but also against a hierarchy of local, state, and federal databases. System-wide interoperability, however, has yet to be achieved. See: Committee on Science, Subcommittee on Forensic Science of the National Science and Technology Council. “Achieving Interoperability for Latent Fingerprint Identification in the United States.” (2014). www.whitehouse.gov/sites/default/files/microsites/ostp/NSTC/afis_10-20-2014_draftforcomment.pdf. 251 The algorithms used in generating candidate matches are proprietary and have not been made publicly available. 252 The FBI Laboratory requires examiners to complete and document their analysis of the latent fingerprint before reviewing any known fingerprints or moving to the comparison and evaluation phase, but this requirement is not shared by all labs. 253 Fingerprint features are compared at three levels of detail—level 1 (“ridge flow”), level 2 (“ridge path”), and level 3 (“ridge features” or “shapes”). “Ridge flow” refers to classes of pattern types shared by many individuals, such as loop or whorl formations; this level is only sufficient for exclusions, not for declaring identifications. “Ridge path” refers to minutiae that can be used for declaring identifications, such as bifurcations or dots. “Ridge shapes” include the edges of ridges and location of pores. See: National Institute of Standards and Technology. “Latent Print Examination and Human Factors: Improving the Practice through a Systems Approach.” (2012), available at: www.nist.gov/oles/upload/latent.pdf. 254 Black, J.P. “Is there a need for 100% verification (review) of latent print examination conclusions?” Journal of Forensic Identification, Vol. 62, No.1 (2012): 80-100. 248 89 independent examinations in all cases (for example, exclusions).
This procedure is problematic because it is not blind: the second examiner knows the first examiner reached a conclusion of proposed identification, which creates the potential for confirmation bias. In the aftermath of the Madrid train bombing case misidentification (see below), the FBI Laboratory adopted requirements to conduct, in certain cases, “independent application of ACE to a friction ridge print by another qualified examiner, who does not know the conclusion of the primary examiner.” 255 In particular, the FBI Laboratory uses blind verification in cases considered to present the greatest risk of error, such as where a single fingerprint is identified, excluded, or deemed inconclusive. 256 As noted in Chapter 2, earlier concerns 257 about the reliability of latent fingerprint analysis increased substantially following a prominent misidentification of a latent fingerprint recovered from the 2004 bombing of the Madrid commuter train system. An FBI examiner concluded with “100 percent certainty” that the fingerprint matched Brandon Mayfield, an American in Portland, Oregon, even though Spanish authorities were unable to confirm the identification. Reviewers believe the misidentification resulted in part from “confirmation bias” and “reverse reasoning”—that is, going from the known print to the latent image in a way that led to overreliance on apparent similarities and inadequate attention to differences. 
258 As described in a recent paper by scientists at the FBI Laboratory, A notable example of the problem of bias from the exemplar resulting in circular reasoning occurred in the Madrid misidentification, in which the initial examiner reinterpreted five of the original seven analysis points to be more consistent with the (incorrect) exemplar: ‘‘Having found as many as 10 points of unusual similarity, the FBI examiners began to ‘find’ additional features in LFP 17 [the latent print] that were not really there, but rather suggested to the examiners by features in the Mayfield prints.’’ 259 In contrast to DNA analysis, the rules for declaring an identification that were historically used in fingerprint analysis were not set in advance nor uniform among examiners. As described by a February 2012 report from an Expert Working Group commissioned by NIST and NIJ: U.S. Department of Justice, Office of the Inspector General. “A Review of the FBI’s Progress in Responding to the Recommendations in the Office of the Inspector General Report on the Fingerprint Misidentification in the Brandon Mayfield Case.” (2011). www.oig.justice.gov/special/s1105.pdf. See also: Federal Bureau of Investigation. Laboratory Division. Latent Print Operations Manual: Standard Operating Procedures for Examining Friction Ridge Prints. FBI Laboratory, Quantico, Virginia, 2007 (updated May 24, 2011). 256 Federal Bureau of Investigation. Laboratory Division. Latent Print Operations Manual: Standard Operating Procedures for Examining Friction Ridge Prints. FBI Laboratory, Quantico, Virginia, 2007 (updated May 24, 2011). 257 Faigman, D.L., Kaye, D.H., Saks, M.J., and J. Sanders (Eds). Modern Scientific Evidence: The Law and Science of Expert Testimony, 2015-2016 ed. Thomson/West Publishing (2016). Saks, M.J. “Implications of Daubert for forensic identification science.” Shepard’s Expert and Science Evidence Quarterly 427, (1994). 258 A Review of the FBI’s handling of the Brandon Mayfield Case. U.S. 
Department of Justice, Office of the Inspector General (2006). oig.justice.gov/special/s0601/final.pdf. 259 Ulery, B.T., Hicklin, R.A., Roberts, M.A., and J. Buscaglia. “Changes in latent fingerprint examiners’ markup between analysis and comparison.” Forensic Science International, Vol. 247 (2015): 54-61. The internal quotation is from U.S. Department of Justice Office of the Inspector General: A review of the FBI's handling of the Brandon Mayfield case (March 2006), www.justice.gov/oig/special/s0601/PDF_list.htm. US Department of Justice Office of the Inspector General: A review of the FBI's handling of the Brandon Mayfield case (March 2006), www.justice.gov/oig/special/s0601/PDF_list.htm. 255 90 The thresholds for these decisions can vary among examiners and among forensic service providers. Some examiners state that they report identification if they find a particular number of relatively rare concurring features, for instance, eight or twelve. Others do not use any fixed numerical standard. Some examiners discount seemingly different details as long as there are enough similarities between the two prints. Other examiners practice the one-dissimilarity rule, excluding a print if a single dissimilarity not attributable to perceptible distortion exists. If the examiner decides that the degree of similarity falls short of satisfying the standard, the examiner can report an inconclusive outcome. If the conclusion is that the degree of similarity satisfies the standard, the examiner reports an identification. 260 In September 2011, the Scientific Working Group on Friction Ridge Analysis, Study and Technology (SWGFAST) issued “Standards for Examining Friction Ridge Impressions and Resulting Conclusions (Latent/Tenprint)” that begins to move latent print analysis in the direction of an objective framework. 
In particular, it suggests criteria concerning what combination of image quality and feature quantity (for example, the number of “minutiae” shared between two fingerprints) would be sufficient to declare an identification. The criteria are not yet fully objective, but they are a step in the right direction. The Friction Ridge Subcommittee of the OSAC has recognized the need for objective criteria in its identification of “Research Needs.” 261 We note that the black-box studies described below did not set out to test these specific criteria, and so they have not yet been scientifically validated.

Studies of Scientific Validity and Reliability

As discussed above, the foundational validity of a subjective method can only be established through multiple independent black-box studies appropriately designed to assess validity and reliability. Below, we discuss various studies of latent fingerprint analysis. The first five studies were not intended as validation studies, although they provide some incidental information about performance. Remarkably, there have been only two black-box studies that were intentionally and appropriately designed to assess validity and reliability—the first published by the FBI Laboratory in 2011; the second completed in 2014 but not yet published. Conclusions about foundational validity thus must rest on these two recent studies. In summarizing these studies, we apply the guidelines described earlier in this report (see Chapter 4 and Appendix A). First, while we note (1) both the estimated false positive rates and (2) the upper 95 percent confidence bound on the false positive rate, we focus on the latter as, from a scientific perspective, the appropriate rate to report to a jury—because the primary concern should be about underestimating the false positive rate and the true rate could reasonably be as high as this value.
262 Second, while we note that the false positive rates among conclusive examinations (identifications or exclusions) and among all examinations (including inconclusives) are both relevant, we focus primarily on the former as being, from a scientific perspective, the See: NIST. “Latent Print Examination and Human Factors: Improving the Practice through a Systems Approach.” (2012), available at: www.nist.gov/oles/upload/latent.pdf. 261 See: workspace.forensicosac.org/kws/groups/fric_ridge/documents. 262 By convention, the 95 percent confidence bound is most widely used in statistics as reflecting the range of plausible values (see Appendix A). 260 91 appropriate rate to report to a jury—because fingerprint evidence used against a defendant in court will typically be the result of a conclusive examination.

Evett and Williams (1996)

This paper is a discursive historical review essay that contains a brief description of a small “collaborative study” relevant to the accuracy of fingerprint analysis. 263 In this study, 130 highly experienced examiners in England and Wales, each with at least ten years of experience in forensic fingerprint analysis, were presented with ten latent print-known pairs. Nine of the pairs came from past casework at New Scotland Yard and were presumed to be ‘mated pairs’ (that is, from the same source). The tenth pair was a ‘non-mated pair’ (from different sources), involving a latent print deliberately produced on a “dimpled beer mug.” For the single non-mated pair, the 130 experts made no false identifications. Because the paper does not distinguish between exclusions and inconclusive examinations (and the authors no longer have the data), 264 it is impossible to infer the upper 95 percent confidence bound. 265

Langenburg (2009a)

In a small pilot study, the author examined the performance of six examiners on 60 tests each. 266 There were only 15 conclusive examinations involving non-mated pairs (see Table 1 of the paper).
There was one false positive, which the author excluded because it appeared to be a clerical error and was not repeated on subsequent retest. Even if this error is excluded, the tiny sample size results in a huge confidence interval (upper 95 percent confidence bound of 19 percent), with this upper bound corresponding to 1 error in 5 cases.

Langenburg (2009b)

In this small pilot study for the following paper, the author tested examiners in a conference room at a convention of forensic identification specialists. 267 The examiners were divided into three groups: high-bias (n=16), low-bias (n=12), and control (n=15). Each group was presented with 6 latent-known pairs, consisting of 3 mated and 3 non-mated pairs. The first two groups received information designed to bias their judgment by heightening their attention, while the control group received a generic description. For the non-mated pairs, the control group had 1 false positive among 43 conclusive examinations. The false positive rate was 2.3 Evett, I.W., and R.L. Williams. “Review of the 16 point fingerprint standard in England and Wales.” Forensic Science International, Vol. 46, No. 1 (1996): 49-73. 264 I.W. Evett, personal communication. 265 For example, the upper 95 percent confidence bound would be 1 in 44 if all 130 examinations were conclusive and 1 in 22 if half of the examinations were conclusive. 266 Langenburg, G. “A performance study of the ACE-V Process: A pilot study to measure the accuracy, precision, reproducibility, repeatability, and biasability of conclusions resulting from the ACE-V process.” Journal of Forensic Identification, Vol. 59, No. 2 (2009): 219–57. 267 Langenburg, G., Champod, C., and P. Wertheim. “Testing for potential contextual bias effects during the verification stage of the ACE-V methodology when conducting fingerprint comparisons.” Journal of Forensic Sciences, Vol. 54, No. 3 (2009): 571-82.
263 92 percent (upper 95 percent confidence bound of 11 percent), with the upper bound corresponding to 1 error in 9 cases. 268,269 Langenburg, Champod, and Genessay (2012) This study was not designed to assess the accuracy of latent fingerprint analysis, but rather to explore how fingerprint analysts would incorporate information from newly developed tools (such as a quality tool to aid in the assessment of the clarity of the friction ridge details; a statistical tool to provide likelihood ratios representing the strength of the corresponding features between compared fingerprints; and consensus information from a group of trained fingerprint experts) into their decision making processes. 270 Nonetheless, the study provided some information on the accuracy of latent print analysis. Briefly, 158 experts (as well as some trainees) were asked to analyze 12 latent print-exemplar pairs, consisting of 7 mated and 5 non-mated pairs. For the non-mated pairs, there were 17 false positive matches among 711 conclusive examinations by the experts. 271 The false positive rate was 2.4 percent (upper 95 percent confidence bound of 3.5 percent). The estimated error rate corresponds to 1 error in 42 cases, with an upper bound corresponding to 1 error in 28 cases. 272 Tangen et al. (2011) This Australian study was designed to study the reliability of latent fingerprint analysis by fingerprint experts. 273 The authors asked 37 fingerprint experts, as well as 37 novices, to examine 36 latent print-known pairs— consisting of 12 mated pairs, 12 non-mated pairs chosen to be “similar” (the most highly ranked exemplar from a different source in the Australian National Automated Fingerprint Identification System), and 12 “non-similar” non-mated pairs (chosen at random from the other prints). Examiners were asked to rate the likelihood they came from the same source on a scale from 1 to 12. The authors chose to define scores of 1-6 as identifications and scores of 7-12 as exclusions. 
274 This approach does not correspond to the procedures used in conventional fingerprint examination. For the “similar” non-mated pairs, the experts made 3 errors among 444 comparisons; the false positive rate was 0.68 percent (upper 95 percent confidence bound of 1.7 percent), with the upper bound corresponding to 1 error in 58 cases. For the “non-similar” non-mated pairs, the examiners made no errors in 444 comparisons; the false positive rate was thus 0 percent (upper 95 percent confidence bound of 0.67 percent), with the upper bound corresponding to 1 error in 148 cases. The experts substantially outperformed the novices. Although interesting, the study does not constitute a black-box validation study of latent fingerprint analysis because its design did not resemble the procedures used in forensic practice (in particular, the process of assigning ratings on a 12-point scale that the authors subsequently converted into identifications and exclusions).

268 If the two inconclusive examinations are included, the values are only slightly different: 2.2 percent (upper 95 percent confidence bound of 10.1 percent), with the odds being 1 in 10.
269 The biased groups made no errors among 69 conclusive examinations.
270 Langenburg, G., Champod, C., and T. Genessay. “Informing the judgments of fingerprint analysts using quality metric and statistical assessment tools.” Forensic Science International, Vol. 219, No. 1-3 (2012): 183-98.
271 We thank G. Langenburg for providing the data for the experts alone.
272 If the 79 inconclusive examinations are included, the false positive rate was 2.15 percent (upper 95 percent confidence bound of 3.2 percent). The estimated false positive rate corresponds to 1 error in 47 cases, with the upper bound corresponding to 1 in 31.
273 Tangen, J.M., Thompson, M.B., and D.J. McCarthy. “Identifying fingerprint expertise.” Psychological Science, Vol. 22, No. 8 (2011): 995-7.
274 There were thus no inconclusive results in this study.
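The point estimates and upper 95 percent confidence bounds quoted for the studies above can be approximately reproduced with a short calculation. The sketch below assumes that the bound is a one-sided exact binomial (Clopper-Pearson) bound; the report defers the method to Appendix A, so this choice is an assumption, and the function names (binom_cdf, upper_bound, one_in) are illustrative only. The input counts are the false positives and conclusive non-mated examinations reported above.

```python
import math


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(x + 1))


def upper_bound(x: int, n: int, alpha: float = 0.05) -> float:
    """One-sided exact upper confidence bound on the error rate.

    Assumption: Clopper-Pearson, i.e. the rate p at which observing
    x or fewer errors in n trials has probability alpha.
    """
    if x >= n:
        return 1.0
    lo, hi = x / n, 1.0
    for _ in range(100):  # bisection; the CDF decreases as p grows
        mid = (lo + hi) / 2.0
        if binom_cdf(x, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def one_in(rate: float) -> str:
    """Express a rate as '1 in N' odds (rounded to the nearest integer)."""
    return "1 in infinity" if rate == 0 else f"1 in {round(1 / rate)}"


# (false positives, conclusive non-mated examinations) as reported above
studies = {
    "Langenburg (2009b)": (1, 43),
    "Langenburg et al. (2012)": (17, 711),
    "Tangen et al. (2011), similar": (3, 444),
    "Tangen et al. (2011), non-similar": (0, 444),
}

for name, (fp, n) in studies.items():
    rate, bound = fp / n, upper_bound(fp, n)
    print(f"{name}: {rate:.2%} (bound {bound:.2%}), {one_in(rate)}, {one_in(bound)}")
```

Under this assumption the bounds land close to the percentages quoted in the text; the "1 in N" figures may differ by a unit from the report's, depending on the rounding convention used.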
FBI studies
The first study designed to test foundational validity and measure reliability of latent fingerprint analysis was a major black-box study conducted by FBI scientists and collaborators. Undertaken in response to the 2009 NRC report, the study was published in 2011 in a leading international science journal, Proceedings of the National Academy of Sciences. 275 The authors assembled a collection of 744 latent-known pairs, consisting of 520 mated pairs and 224 non-mated pairs. To attempt to ensure that the non-mated pairs were representative of the kinds of matches that might arise when police identify a suspect by searching fingerprint databases, the known prints were selected by searching the latent prints against the 58 million fingerprints in the AFIS database and selecting one of the closest matching hits. Each of 169 fingerprint examiners was shown 100 pairs and asked to classify them as an identification, an exclusion, or inconclusive. The study reported 6 false positive identifications among 3628 non-mated pairs that examiners judged to have “value for identification.” The false positive rate was thus 0.17 percent (upper 95 percent confidence bound of 0.33 percent). The estimated rate corresponds to 1 error in 604 cases, with the upper bound indicating that the rate could be as high as 1 error in 306 cases. 276,277

In 2012, the same authors reported a follow-up study testing repeatability and reproducibility. After a period of about seven months, 75 of the examiners from the previous study re-examined a subset of the latent-known comparisons from the previous study. Among 476 non-mated pairs leading to conclusive examinations (including 4 of the pairs that led to false positives in the initial study and were reassigned to the examiner who had made the erroneous decision), there were no false positives.
These results (upper 95 percent confidence bound of 0.63 percent, corresponding to 1 error in 160) are broadly consistent with the false positive rate measured in the previous study. 278

275 Ulery, B.T., Hicklin, R.A., Buscaglia, J., and M.A. Roberts. “Accuracy and reliability of forensic latent fingerprint decisions.” Proceedings of the National Academy of Sciences, Vol. 108, No. 19 (2011): 7733-8.
276 If one includes the 455 inconclusive results for latent prints judged to have “value for identification,” the false positive rate is 0.15 percent (upper 95 percent confidence bound of 0.29 percent). The estimated false positive rate corresponds to 1 error in 681 cases, with the upper bound corresponding to 1 in 344.
277 The sensitivity (proportion of mated samples that were correctly declared to match) was 92.5 percent.
278 Overall, 85-90 percent of the conclusive results were unchanged, with roughly 30 percent of false exclusions being repeated.

Miami-Dade study (Pacheco et al. (2014))
The Miami-Dade Police Department Forensic Services Bureau, with funding from the NIJ, conducted a black-box study designed to assess foundational validity and measure reliability; the results were reported to the sponsor and posted on the internet, but they have not yet been published in a peer-reviewed scientific journal. 279 The study differed significantly from the 2011 FBI black-box study in important respects, including that the known prints were not selected by means of a large database search to be similar to the latent prints (which should, in principle, have made it easier to declare exclusions for the non-mated pairs). The study found 42 false positives among 995 conclusive examinations. The false positive rate was 4.2 percent (upper 95 percent confidence bound of 5.4 percent). The estimated rate corresponds to 1 error in 24 cases, with the upper bound indicating that the rate could be as high as 1 error in 18 cases.
280 (Note: The paper observes that “in 35 of the erroneous identifications the participants appeared to have made a clerical error, but the authors could not determine this with certainty.” In validation studies, it is inappropriate to exclude errors in a post hoc manner (see Box 4). However, if these 35 errors were to be excluded, the false positive rate would be 0.7 percent (upper 95 percent confidence bound of 1.4 percent), with the upper bound corresponding to 1 error in 73 cases.)

Conclusions from the studies
While it is distressing that meaningful studies to assess foundational validity and reliability did not begin until recently, we are encouraged that serious efforts are now being made to try to put the field on a solid scientific foundation, including by measuring accuracy, defining quality of latent prints, studying the reason for errors, and so on. Much credit belongs to the FBI Laboratory, as well as to academic researchers who had been pressing the need for research. Importantly, the FBI Laboratory is responsible for the only black-box study to date that has been published in a peer-reviewed journal.

The studies above cannot be directly compared for many reasons, including differences in experimental design, selection and difficulty level of latent-known pairs, and degree to which they represent the circumstances, procedures and pressures found in casework. Nonetheless, certain conclusions can be drawn from the results of the studies (summarized in Table 1 below):
(1) The studies collectively demonstrate that many examiners can, under some circumstances, produce correct answers at some level of accuracy.
(2) The empirically estimated false positive rates are much higher than the general public (and, by extension, most jurors) would likely believe based on longstanding claims about the accuracy of fingerprint analysis. 281,282
(3) Of the two appropriately designed black-box studies, the larger study (FBI 2011 study) yielded a false positive rate that is unlikely to exceed 1 in 306 conclusive examinations while the other (Miami-Dade 2014 study) yielded a considerably higher false positive rate of 1 in 18. 283 (The earlier studies, which were not designed as validation studies, also yielded high false positive rates.)

279 Pacheco, I., Cerchiai, B., and S. Stoiloff. “Miami-Dade research study for the reliability of the ACE-V process: Accuracy & precision in latent fingerprint examinations.” (2014). www.ncjrs.gov/pdffiles1/nij/grants/248534.pdf.
280 If the 403 inconclusive examinations are included, the false positive rate was 3.0 percent (upper 95 percent confidence bound of 3.9 percent). The estimated false positive rate corresponds to 1 error in 33 cases, with the upper bound corresponding to 1 in 26.
281 The conclusion holds regardless of whether the rates are based on the point estimates or the 95 percent confidence bound, and on conclusive examinations or all examinations.
282 These claims include the DOJ’s own longstanding previous assertion that fingerprint analysis is “infallible” (www.justice.gov/olp/file/861906/download); testimony by a former head of the FBI’s fingerprint unit that the FBI had “an error rate of one per every 11 million cases” (see p. 53); and a study finding that mock jurors estimated that the false positive rate for latent fingerprint analysis is 1 in 5.5 million (see p. 45). Koehler, J.J. “Intuitive error rate estimates for the forensic sciences.” (August 2, 2016). Available at: papers.ssrn.com/sol3/papers.cfm?abstract_id=2817443.

Overall, it would be appropriate to inform jurors that (1) only two properly designed studies of the accuracy of latent fingerprint analysis have been conducted and (2) these studies found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study.
This would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.

It is likely that a properly designed program of systematic, blind verification would decrease the false-positive rate, because examiners in the studies tend to make different mistakes. 284 However, there has not been empirical testing to obtain a quantitative estimate of the false positive rate that might be achieved through such a program. 285 And, it would not be appropriate simply to infer the impact of independent verification based on the theoretical assumption that examiners’ errors are uncorrelated. 286

It is important to note that, for a verification program to be truly blind and thereby avoid cognitive bias, examiners cannot only verify individualizations. As the authors of the FBI black-box study propose, “this can be ensured by performing verifications on a mix of conclusion types, not merely individualizations”; that is, a mix that ensures that verifiers cannot make inferences about the conclusions being verified. 287 We are not aware of any blind verification programs that currently follow this practice.

At present, testimony asserting any specific level of increased accuracy (beyond that measured in the studies) due to blind independent verification would be scientifically inappropriate, as speculation unsupported by empirical evidence.

283 As noted above, the rate is 1 in 73 if one ignores the presumed clerical errors, although such post hoc adjustment is not appropriate in validation studies.
284 The authors of the FBI black-box study note that five of the false positives occurred on test problems where a large majority of examiners correctly declared an exclusion, while one occurred on a test problem where the majority of examiners made inconclusive decisions. They state that “this suggests that these erroneous individualizations would have been detected if blind verification were routinely performed.” Ulery, B.T., Hicklin, R.A., Buscaglia, J., and M.A. Roberts. “Accuracy and reliability of forensic latent fingerprint decisions.” Proceedings of the National Academy of Sciences, Vol. 108, No. 19 (2011): 7733-8.
285 The Miami-Dade study involved a small test of the verification step, involving verification of 15 of the 42 false positives. In these 15 cases, the second examiner declared 13 cases to be exclusions and 2 to be inconclusive. The sample size is too small to draw a meaningful conclusion. And, the paper does not report verification results for the other 27 false positives.
286 The DOJ has proposed to PCAST that “basic probability states that given an error rate for one examiner, the likelihood of a second examiner making the exact same error (verification/blind verification), would dictate that the rates should be multiplied.” However, such a theoretical model would assume that errors by different examiners will be uncorrelated; yet they may depend on the difficulty of the problem and thus be correlated. Empirical studies are necessary to estimate error rates under blind verification.
287 Ulery, B.T., Hicklin, R.A., Buscaglia, J., and M.A. Roberts. “Accuracy and reliability of forensic latent fingerprint decisions.” Proceedings of the National Academy of Sciences, Vol. 108, No. 19 (2011): 7733-8.

We note that the DOJ believes that the high false positive rate observed in the Miami-Dade study (1 in 24, with upper confidence limit of 1 in 18) is unlikely to apply to casework at the FBI Laboratory, because it believes such a high rate would have been detected by the Laboratory’s verification procedures. An independent evaluation of the verification protocols could shed light on the extent to which such inferences could be drawn based on the current Laboratory’s verification procedures.
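The objection to multiplying error rates (footnote 286) can be made concrete with a small numerical sketch. All numbers below are hypothetical and chosen only for illustration; none come from the studies. The sketch assumes a toy model in which a small share of comparisons is "hard" and false positives occur only on those, with the two examiners erring independently once difficulty is fixed.

```python
def joint_error_rates(p_hard: float, err_hard: float, err_easy: float):
    """Compare the naive 'multiply the rates' estimate with the joint
    false positive rate when errors depend on problem difficulty.

    Toy model (an assumption, not taken from the report): a fraction
    p_hard of comparisons is hard; each examiner errs with probability
    err_hard on hard pairs and err_easy on easy pairs, independently
    of the other examiner once the difficulty is fixed.
    """
    # Single-examiner false positive rate, averaged over difficulty.
    marginal = p_hard * err_hard + (1 - p_hard) * err_easy
    # Naive estimate: treat the two examiners' errors as uncorrelated.
    naive_joint = marginal**2
    # Joint rate under the toy model: both examiners err on the same pair.
    actual_joint = p_hard * err_hard**2 + (1 - p_hard) * err_easy**2
    return marginal, naive_joint, actual_joint


# Hypothetical inputs: 1% of pairs are hard, 17% error rate on those, 0% otherwise.
marginal, naive, actual = joint_error_rates(p_hard=0.01, err_hard=0.17, err_easy=0.0)
print(f"single-examiner rate: {marginal:.4%}")
print(f"naive product:        {naive:.6%}")
print(f"rate under toy model: {actual:.6%}")
print(f"ratio actual/naive:   {actual / naive:.0f}x")
```

With these illustrative inputs, the joint rate is far larger than the product of the individual rates, because both examiners tend to fail on the same difficult comparisons. The size of the gap depends entirely on the assumed inputs, which is why empirical testing of verification is needed.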
We also note it is conceivable that the false-positive rate in real casework could be higher than that observed in the experimental studies, due to exposure to potentially biasing information in the course of casework. Introducing test samples blindly into the flow of casework could provide valuable insight about the actual error rates in casework.

In conclusion, the FBI Laboratory black-box study has significantly advanced the field. There is a need for ongoing studies of the reliability of latent print analysis, building on its study design. Studies should ideally estimate error rates for latent prints of varying “quality” levels, using well defined measures (ideally, objective measures implemented by automated software 288). As noted above, studies should be designed and conducted in conjunction with third parties with no stake in the outcome. This important feature was not present in the FBI study.

288 An example is the Latent Quality Assessment (LQAS), which is designed as a proof-of-concept tool to evaluate the clarity of prints. Studies have found that error rates are correlated with the quality of the print. The software provides manual and automated definitions of clarity maps, functions to process clarity maps, and annotation of corresponding points providing a method for overlapping of impression areas. Hicklin, R.A., Buscaglia, J., and M.A. Roberts. “Assessing the clarity of friction ridge impressions.” Forensic Science International, Vol. 226, No. 1 (2013): 106-17. Another example is the Picture Annotation System (PiAnoS), developed by the University of Lausanne, which is being tested as a quality metric and statistical assessment tool for analysts. This platform uses tools that (1) assess the clarity of the friction ridge details, (2) provide likelihood ratios representing the strength of corresponding features between fingerprints, and (3) give consensus information from a group of trained fingerprint experts. PiAnoS is an open-source software package available at: ipslabs.unil.ch/pianos.

Table 1: Error Rates in Studies of Latent Print Analysis*

Study | False Positives: Raw Data | False Positives: Freq. (Confidence Bound) | Estimated Rate | Bound on Rate
Early studies
Langenburg (2009a) | 0/14 | 0% (19%) | 1 in ∞ | 1 in 5
Langenburg (2009b) | 1/43 | 2.3% (11%) | 1 in 43 | 1 in 9
Langenburg et al. (2012) | 17/711 | 2.4% (3.5%) | 1 in 42 | 1 in 28
Tangen et al. (2011) (“similar pairs”) | 3/444 | 0.68% (1.7%) | 1 in 148 | 1 in 58
Tangen et al. (2011) (“dissimilar pairs”) | 0/444 | 0% (0.67%) | 1 in ∞ | 1 in 148
Black-box studies
Ulery et al. 2011 (FBI)** | 6/3628 | 0.17% (0.33%) | 1 in 604 | 1 in 306
Pacheco et al. 2014 (Miami-Dade) | 42/995 | 4.2% (5.4%) | 1 in 24 | 1 in 18
Pacheco et al. 2014 (Miami-Dade) (excluding clerical errors) | 7/960 | 0.7% (1.4%) | 1 in 137 | 1 in 73

* “Raw Data”: Number of false positives divided by number of conclusive examinations involving non-mated pairs. “Freq. (Confidence Bound)”: Point estimate of false positive frequency, and upper 95 percent confidence bound. “Estimated Rate”: The odds of a false positive occurring, based on the observed proportion of false positives. “Bound on Rate”: The odds of a false positive occurring, based on the upper 95 percent confidence bound; that is, the rate could reasonably be as high as this value.
** If inconclusive examinations are included for the FBI study, the rates are 1 in 681 and 1 in 344, respectively.

Scientific Studies of How Latent-print Examiners Reach Conclusions
Complementing the black-box studies, various studies have shed important light on how latent fingerprint examiners reach conclusions and how these conclusions may be influenced by extraneous factors. These studies underscore the serious risks that may arise in subjective methods.

Cognitive-bias studies
Itiel Dror and colleagues have done pioneering work on the potential role of cognitive bias in latent fingerprint analysis.
289 In an exploratory study in 2006, they demonstrated that examiners’ judgments can be influenced by knowledge about other forensic examiners’ decisions (a form of “confirmation bias”). 290 Five fingerprint examiners were given fingerprint pairs that they had studied five years earlier in real cases and had judged to “match.” They were asked to re-examine the prints, but were led to believe that they were the pair of prints that had been erroneously matched by the FBI in a high-profile case. Although they were instructed to ignore this information, four out of five examiners no longer judged the prints to “match.” Although these studies are too small to provide precise estimates of the impact of cognitive bias, they have been instrumental in calling attention to the issue.

289 Dror, I.E., Charlton, D., and A.E. Peron. “Contextual information renders experts vulnerable to making erroneous identifications.” Forensic Science International, Vol. 156 (2006): 74-78. Dror, I.E., and D. Charlton. “Why experts make errors.” Journal of Forensic Identification, Vol. 56, No. 4 (2006): 600-16.
290 Dror, I.E., Charlton, D., and A.E. Peron. “Contextual information renders experts vulnerable to making erroneous identifications.” Forensic Science International, Vol. 156 (2006): 74-78.

Several strategies have been proposed for mitigating cognitive bias in forensic laboratories, including managing the flow of information in a crime laboratory to minimize exposure of the forensic analyst to irrelevant contextual information (such as confessions or eyewitness identification) and ensuring that examiners work in a linear fashion, documenting their findings about evidence from the crime scene before performing comparisons with samples from a suspect. 291,292

FBI white-box studies
In the past few years, FBI scientists and their collaborators have also undertaken a series of “white-box” studies to understand the factors underlying the process of latent fingerprint analysis.
These studies include analyses of fingerprint quality, 293,294 examiners’ processes to determine the value of a latent print for identification or exclusion, 295 the sufficiency of information for identifications, 296 and how examiners’ assessments of a latent print change when they compare it with a possible match. 297 Among work on subjective feature-comparison methods, this series of papers is unique in its breadth, rigor and willingness to explore challenging issues. We could find no similarly self-reflective analyses for other subjective disciplines.

The two most recent papers are particularly notable because they involve the serious issue of confirmation bias. In a 2014 paper, the FBI scientists wrote:

“ACE distinguishes between the Comparison phase (assessment of features) and Evaluation phase (determination), implying that determinations are based on the assessment of features. However, our results suggest that this is not a simple causal relation: examiners’ markups are also influenced by their determinations. How this reverse influence occurs is not obvious. Examiners may subconsciously reach a preliminary determination quickly and this influences their behavior during Comparison (e.g., level of effort expended, how to treat ambiguous features). After making a decision, examiners may then revise their annotations to help document that decision, and examiners may be more motivated to provide thorough and careful markup in support of individualizations than other determinations. As evidence in support of our conjecture, we note in particular the distributions of minutia counts, which show a step increase associated with decision thresholds: this step occurred at about seven minutiae for most examiners, but at 12 for those examiners following a 12-point standard.” 298

291 Kassin, S.M., Dror, I.E., and J. Kukucka. “The forensic confirmation bias: Problems, perspectives, and proposed solutions.” Journal of Applied Research in Memory and Cognition, Vol. 2, No. 1 (2013): 42-52. See also: Krane, D.E., Ford, S., Gilder, J., Iman, K., Jamieson, A., Taylor, M.S., and W.C. Thompson. “Sequential unmasking: A means of minimizing observer effects in forensic DNA interpretation.” Journal of Forensic Sciences, Vol. 53, No. 4 (July 2008): 1006-7.
292 Irrelevant contextual information could, depending on its nature, bias an examiner toward an incorrect identification or an incorrect exclusion. Either outcome is undesirable.
293 Hicklin, R.A., Buscaglia, J., Roberts, M.A., Meagher, S.B., Fellner, W., Burge, M.J., Monaco, M., Vera, D., Pantzer, L.R., Yeung, C.C., and N. Unnikumaran. “Latent fingerprint quality: a survey of examiners.” Journal of Forensic Identification, Vol. 61, No. 4 (2011): 385-419.
294 Hicklin, R.A., Buscaglia, J., and M.A. Roberts. “Assessing the clarity of friction ridge impressions.” Forensic Science International, Vol. 226, No. 1 (2013): 106-17.
295 Ulery, B.T., Hicklin, R.A., Kiebuzinski, G.I., Roberts, M.A., and J. Buscaglia. “Understanding the sufficiency of information for latent fingerprint value determinations.” Forensic Science International, Vol. 230, No. 1-3 (2013): 99-106.
296 Ulery, B.T., Hicklin, R.A., and J. Buscaglia. “Repeatability and reproducibility of decisions by latent fingerprint examiners.” PLoS ONE, (2012).
297 Ulery, B.T., Hicklin, R.A., Roberts, M.A., and J. Buscaglia. “Changes in latent fingerprint examiners’ markup between analysis and comparison.” Forensic Science International, Vol. 247 (2015): 54-61.

Similar observations had been made by Dror et al., who noted that the number of minutiae marked in a latent print was greater when a matching exemplar was present.
299 In addition, Evett and Williams described how British examiners, who used a 16-point standard for declaring identifications, used an exemplar to ‘‘tease the points out’’ of the latent print after they had reached an ‘‘inner conviction’’ that the prints matched. 300

In a follow-up paper in 2015, the FBI scientists carefully studied how examiners analyzed prints and confirmed that, in the vast majority (>90 percent) of identification decisions, examiners modified the features marked in the latent fingerprint in response to an apparently matching known fingerprint (more often adding than subtracting features). 301 (The sole false positive in their study was an extreme case in which the conclusion was based almost entirely on subsequent marking of minutiae that had not been initially found and deletion of features that had been initially marked.) The authors concluded that “there is a need for examiners to have some means of unambiguously documenting what they see during analysis and comparison (in the ACE-V process)” and that “rigorously defined and consistently applied methods of performing and documenting ACE-V would improve the transparency of the latent print examination process.”

PCAST compliments the FBI scientists for calling attention to the risk of confirmation bias arising from circular reasoning. As a matter of scientific validity, examiners must be required to “complete and document their analysis of a latent fingerprint before looking at any known fingerprint” and “must separately document any data relied upon during comparison or evaluation that differs from the information relied upon during analysis.” 302 The FBI adopted these rules following the Madrid train bombing case misidentification; they need to be universally adopted by all laboratories.

298 Ulery, B.T., Hicklin, R.A., Roberts, M.A., and J. Buscaglia. “Measuring what latent fingerprint examiners consider sufficient information for individualization determinations.” PLoS ONE, (2014).
299 Dror, I.E., Champod, C., Langenburg, G., Charlton, D., Hunt, H., and R. Rosenthal. “Cognitive issues in fingerprint analysis: Inter- and intra-expert consistency and the effect of a ‘target’ comparison.” Forensic Science International, Vol. 208, No. 1-3 (2011): 10-7.
300 Evett, I.W., and R.L. Williams. “Review of the 16 point fingerprint standard in England and Wales.” Forensic Science International, Vol. 46, No. 1 (1996): 49–73.
301 Ulery, B.T., Hicklin, R.A., Roberts, M.A., and J. Buscaglia. “Changes in latent fingerprint examiners’ markup between analysis and comparison.” Forensic Science International, Vol. 247 (2015): 54-61.
302 U.S. Department of Justice, Office of the Inspector General. “A Review of the FBI’s Progress in Responding to the Recommendations in the Office of the Inspector General Report on the Fingerprint Misidentification in the Brandon Mayfield Case.” (2011): 5, 27. www.oig.justice.gov/special/s1105.pdf.

Validity as Applied
Foundational validity means that a large group of examiners analyzing a specific type of sample can, under test conditions, produce correct answers at a known and useful frequency. It does not mean that a particular examiner has the ability to reliably apply the method; that the samples in the foundational studies are representative of the actual evidence of the case; or that the circumstances of the foundational study represent a reasonable approximation of the circumstances of casework. To address these matters, courts should take into account several key considerations.
(1) Because latent print analysis, as currently practiced, depends on subjective judgment, it is scientifically unjustified to conclude that a particular examiner is capable of reliably applying the method unless the examiner has undergone regular and rigorous proficiency testing. Unfortunately, it is not possible to assess the appropriateness of current proficiency testing because the test problems are not publicly released.
(As emphasized previously, training and experience are no substitute, because neither provides any assurance that the examiner can apply the method reliably.)
(2) In any given case, it must be established that the latent print(s) are of the quality and completeness represented in the foundational validity studies.
(3) Because contextual bias may have an impact on experts’ decisions, courts should assess the measures taken to mitigate bias during casework, for example, ensuring that examiners are not exposed to potentially biasing information and ensuring that analysts document ridge features of an unknown print before referring to the known print (a procedure known as “linear ACE-V” 303).

303 U.S. Department of Justice, Office of the Inspector General. “A Review of the FBI’s Progress in Responding to the Recommendations in the Office of the Inspector General Report on the Fingerprint Misidentification in the Brandon Mayfield Case.” (2011): 27. www.oig.justice.gov/special/s1105.pdf.

Finding 5: Latent fingerprint analysis
Foundational validity. Based largely on two recent appropriately designed black-box studies, PCAST finds that latent fingerprint analysis is a foundationally valid subjective methodology, albeit with a false positive rate that is substantial and is likely to be higher than expected by many jurors based on longstanding claims about the infallibility of fingerprint analysis. Conclusions of a proposed identification may be scientifically valid, provided that they are accompanied by accurate information about limitations on the reliability of the conclusion: specifically, that (1) only two properly designed studies of the foundational validity and accuracy of latent fingerprint analysis have been conducted, (2) these studies found false positive rates that could be as high as 1 error in 306 cases in one study and 1 error in 18 cases in the other, and (3) because the examiners were aware they were being tested, the actual false positive rate in casework may be higher. At present, claims of higher accuracy are not warranted or scientifically justified. Additional black-box studies are needed to clarify the reliability of the method.

Validity as applied. Although we conclude that the method is foundationally valid, there are a number of important issues related to its validity as applied.
(1) Confirmation bias. Work by FBI scientists has shown that examiners typically alter the features that they initially mark in a latent print based on comparison with an apparently matching exemplar. Such circular reasoning introduces a serious risk of confirmation bias. Examiners should be required to complete and document their analysis of a latent fingerprint before looking at any known fingerprint and should separately document any additional data used during their comparison and evaluation.
(2) Contextual bias. Work by academic scholars has shown that examiners’ judgments can be influenced by irrelevant information about the facts of a case. Efforts should be made to ensure that examiners are not exposed to potentially biasing information.
(3) Proficiency testing. Proficiency testing is essential for assessing an examiner’s capability and performance in making accurate judgments. As discussed elsewhere in this report, proficiency testing needs to be improved by making it more rigorous, by incorporating it within the flow of casework, and by disclosing tests for evaluation by the scientific community.
From a scientific standpoint, validity as applied requires that an expert: (1) has undergone appropriate proficiency testing to ensure that he or she is capable of analyzing the full range of latent fingerprints encountered in casework and reports the results of the proficiency testing; (2) discloses whether he or she documented the features in the latent print in writing before comparing it to the known print; (3) provides a written analysis explaining the selection and comparison of the features; (4) discloses whether, when performing the examination, he or she was aware of any other facts of the case that might influence the conclusion; and (5) verifies that the latent print in the case at hand is similar in quality to the range of latent prints considered in the foundational studies.

The Path Forward

Continuing efforts are needed to improve the state of latent print analysis—and these efforts will pay clear dividends for the criminal justice system.

One direction is to continue to improve latent print analysis as a subjective method. With only two black-box studies so far (with very different error rates), there is a need for additional black-box studies building on the study design of the FBI black-box study. Studies should estimate error rates for latent prints of varying quality and completeness, using well-defined measures. As noted above, the studies should be designed and conducted in conjunction with third parties with no stake in the outcome.

A second—and more important—direction is to convert latent print analysis from a subjective method to an objective method. The past decade has seen extraordinary advances in automated image analysis based on machine learning and other approaches—leading to dramatic improvements in such tasks as face recognition.
304,305 In medicine, for example, it is expected that automated image analysis will become the gold standard for many applications involving interpretation of X-rays, MRIs, fundoscopy, and dermatological images. 306 Objective methods based on automated image analysis could yield major benefits—including greater efficiency and lower error rates; they could also enable estimation of error rates from millions of pairwise comparisons. Initial efforts to develop automated systems could not outperform humans. 307 However, given the pace of progress in image analysis and machine learning, we believe that fully automated latent print analysis is likely to be possible in the near future. There have already been initial steps in this direction, both in academia and industry. 308

The most important resource to propel the development of objective methods would be the creation of huge databases containing known prints, each with many corresponding “simulated” latent prints of varying qualities and completeness, which would be made available to scientifically-trained researchers in academia and industry. The simulated latent prints could be created by “morphing” the known prints, based on transformations derived from collections of actual latent print-record print pairs. 309

304, 305 See: cs.stanford.edu/people/karpathy/cvpr2015.pdf. Lu, C., and X. Tang. “Surpassing human-level face verification performance on LFW with GaussianFace.” arxiv.org/abs/1404.3840 (accessed July 2, 2016). Taigman, Y., Yang, M., Ranzato, M., and L. Wolf. “Deepface: Closing the gap to human-level performance in face verification.” www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf (accessed July 2, 2016) and Schroff, F., Kalenichenko, D., and J. Philbin. “FaceNet: A unified embedding for face recognition and clustering.” arxiv.org/abs/1503.03832 (accessed July 2, 2016).
306 Doi, K. “Computer-aided diagnosis in medical imaging: historical review, current status and future potential.” Computerized Medical Imaging and Graphics, Vol. 31, No. 4-5 (2007): 198-211 and Shiraishi, J., Li, Q., Appelbaum, D., and K. Doi. “Computer-aided diagnosis and artificial intelligence in clinical imaging.” Seminars in Nuclear Medicine, Vol. 41, No. 6 (2011): 449-62.
307 For example, a study in 2010 reported that humans outperformed an automated program for toolmark comparisons. See: Chumbley, L.S., Morris, M.D., Kreiser, M.J., Fisher, C., Craft J., Genalo, L.J., Davis, S., Faden, D., and J. Kidd. “Validation of Tool Mark Comparisons Obtained Using a Quantitative, Comparative, Statistical Algorithm.” Journal of Forensic Sciences, Vol. 55, No. 4 (2010): 953-961.
308 Arunalatha, J.A., Tejaswi, V., Shaila, K., Anvekar, D., Venugopal, K.R., Iyengar, S.S., and L.M. Patnaik. “FIVDL: Fingerprint Image Verification using Dictionary Learning.” Procedia Computer Science, Vol. 54 (2015): 482-490 and Srihari, S.N. “Quantitative Measures in Support of Latent Print Comparison: Final Technical Report.” NIJ Award Number: 2009-DN-BXK208, University at Buffalo, SUNY, 2013. www.crime-sceneinvestigator.net/QuantitativeMeasuresinSupportofLatentPrint.pdf. In addition, Christophe Champod’s group at Université de Lausanne has an active program in this area.
309 For privacy, fingerprints from deceased individuals could be used.

5.5 Firearms Analysis

Methodology

In firearms analysis, examiners attempt to determine whether ammunition is or is not associated with a specific firearm based on toolmarks produced by guns on the ammunition. 310,311 (Briefly, gun barrels are typically rifled to improve accuracy, meaning that spiral grooves are cut into the barrel’s interior to impart spin on the bullet. Random individual imperfections produced during the tool-cutting process and through “wear and tear” of the firearm leave toolmarks on bullets or casings as they exit the firearm.
Parts of the firearm that come into contact with the cartridge case are machined by other methods.) The discipline is based on the idea that the toolmarks produced by different firearms vary substantially enough (owing to variations in manufacture and use) to allow components of fired cartridges to be identified with particular firearms. For example, examiners may compare “questioned” cartridge cases from a gun recovered from a crime scene to test fires from a suspect gun. Briefly, examination begins with an evaluation of class characteristics of the bullets and casings, which are features that are permanent and predetermined before manufacture. If these class characteristics are different, an elimination conclusion is rendered. If the class characteristics are similar, the examination proceeds to identify and compare individual characteristics, such as the striae that arise during firing from a particular gun. According to the Association of Firearm and Tool Mark Examiners (AFTE), the “most widely accepted method used in conducting a toolmark examination is a side-by-side, microscopic comparison of the markings on a questioned material item to known source marks imparted by a tool.” 312

Background

In the previous section, PCAST expressed concerns about certain foundational documents underlying the scientific discipline of firearm and tool mark examination. In particular, we observed that AFTE’s “Theory of Identification as it Relates to Toolmarks”—which defines the criteria for making an identification—is circular. 313 The “theory” states that an examiner may conclude that two items have a common origin if their marks are in “sufficient agreement,” where “sufficient agreement” is defined as the examiner being convinced that the items are extremely unlikely to have a different origin. In addition, the “theory” explicitly states that conclusions are subjective.
310 Examiners can also undertake other kinds of analysis, such as distance determinations, operability of firearms, and serial number restorations, as well as the analysis of primer residue to determine whether someone recently handled a weapon.
311 For more complete descriptions, see, for example, National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009), and archives.fbi.gov/archives/aboutus/lab/forensic-science-communications/fsc/july2009/review/2009_07_review01.htm.
312 See: Foundational Overview of Firearm/Toolmark Identification tab on afte.org/resources/swggun-ark (accessed May 12, 2016).
313 Association of Firearm and Tool Mark Examiners. “Theory of Identification as it Relates to Tool Marks: Revised,” AFTE Journal, Vol. 43, No. 4 (2011): 287.

Much attention in this scientific discipline has focused on trying to prove the notion that every gun produces “unique” toolmarks. In 2004, the NIJ asked the NRC to study the feasibility, accuracy, reliability, and advisability of developing a comprehensive national ballistics database of images from bullets fired from all, or nearly all, newly manufactured or imported guns for the purpose of matching ballistics from a crime scene to a gun and information on its initial owner. In its 2008 report, an NRC committee, responding to NIJ’s request, found that “the validity of the fundamental assumptions of uniqueness and reproducibility of firearms-related toolmarks” had not yet been demonstrated and that, given current comparison methods, a database search would likely “return too large a subset of candidate matches to be practically useful for investigative purposes.” 314 Of course, it is not necessary that toolmarks be unique for them to provide useful information about whether a bullet may have been fired from a particular gun.
However, it is essential that the accuracy of the method for comparing them be known based on empirical studies.

Firearms analysts have long stated that their discipline has near-perfect accuracy. In a 2009 article, the chief of the Firearms-Toolmarks Unit of the FBI Laboratory stated that “a qualified examiner will rarely if ever commit a false-positive error (misidentification),” citing his review, in an affidavit, of empirical studies that showed virtually no errors. 315

With respect to firearms analysis, the 2009 NRC report concluded that “sufficient studies have not been done to understand the reliability and reproducibility of the methods”—that is, that the foundational validity of the field had not been established. 316

The Scientific Working Group on Firearms Analysis (SWGGUN) responded to the criticisms in the 2009 NRC report by stating that:

The SWGGUN has been aware of the scientific and systemic issues identified in this report for some time and has been working diligently to address them. . . . [the NRC report] identifies the areas where we must fundamentally improve our procedures to enhance the quality and reliability of our scientific results, as well as better articulate the basis of our science. 317

314 National Research Council. Ballistic Imaging. The National Academies Press. Washington DC. (2008): 3-4.
315 See: www.fbi.gov/about-us/lab/forensic-science-communications/fsc/july2009/review/2009_07_review01.htm.
316 The report states that “Toolmark and firearms analysis suffers from the same limitations discussed above for impression evidence. Because not enough is known about the variabilities among individual tools and guns, we are not able to specify how many points of similarity are necessary for a given level of confidence in the result. Sufficient studies have not been done to understand the reliability and repeatability of the methods. The committee agrees that class characteristics are helpful in narrowing the pool of tools that may have left a distinctive mark.” National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 154.
317 See: www.swggun.org/index.php?option=com_content&view=article&id=37&Itemid=22.

Non-black-box studies of firearms analysis: Set-based analyses

Because firearms analysis is at present a subjective feature-comparison method, its foundational validity can only be established through multiple independent black-box studies, as discussed above.

Although firearms analysis has been used for many decades, only relatively recently has its validity been subjected to meaningful empirical testing. Over the past 15 years, the field has undertaken a number of studies that have sought to estimate the accuracy of examiners’ conclusions. While the results demonstrate that examiners can under some circumstances identify the source of fired ammunition, many of the studies were not appropriate for assessing scientific validity and estimating the reliability because they employed artificial designs that differ in important ways from the problems faced in casework.

Specifically, many of the studies employ “set-based” analyses, in which examiners are asked to perform all pairwise comparisons within or between small sample sets. For example, a “within-set” analysis involving n objects asks examiners to fill out an n x n matrix indicating which of the n(n-1)/2 possible pairs match. Some forensic scientists have favored set-based designs because a small number of objects gives rise to a large number of comparisons. The study design has a serious flaw, however: the comparisons are not independent of one another. Rather, they entail internal dependencies that (1) constrain and thereby inform examiners’ answers and (2) in some cases, allow examiners to make inferences about the study design.
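The lack of independence among pairwise answers can be made concrete with a small counting sketch. It assumes, for illustration only, that "fired from the same firearm" is a transitive relation, so that a self-consistent answer sheet for n items corresponds to a partition of the items into groups. Under that assumption the number of consistent answer sheets is the Bell number B(n), which is far smaller than the 2^(n(n-1)/2) sheets that would exist if every pair could be answered independently. The function names are hypothetical.

```python
def pair_count(n: int) -> int:
    """Number of distinct pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


def bell_number(n: int) -> int:
    """Number of ways to partition n items into groups (Bell number),
    computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


if __name__ == "__main__":
    for n in (3, 5, 10):
        pairs = pair_count(n)
        unconstrained = 2 ** pairs      # every pair answered independently
        consistent = bell_number(n)     # answers that respect transitivity
        print(f"n={n}: pairs={pairs}, unconstrained={unconstrained}, "
              f"consistent={consistent}")
```

For three items, the 8 unconstrained answer sheets shrink to 5 consistent ones; for ten items, 2^45 possibilities shrink to 115,975. Each answer an examiner gives therefore narrows the remaining answers considerably.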
(The first point is illustrated by the observation that if A and B are judged to match, then every additional item C must match either both or neither of them—cutting the space of possible answers in half. If A and B match one another but do not match C, this creates additional dependencies. And so on. The second point is illustrated by “closed-set” designs, described below.)

Because of the complex dependencies among the answers, set-based studies are not appropriately-designed black-box studies from which one can obtain proper estimates of accuracy. Moreover, analysis of the empirical results from at least some set-based studies (“closed-set” designs) suggests that they may substantially underestimate the false positive rate.

The Director of the Defense Forensic Science Center analogized set-based studies to solving a “Sudoku” puzzle, where initial answers can be used to help fill in subsequent answers. 318 As discussed below, DFSC’s discomfort with set-based studies led it to fund the first (and, to date, only) appropriately designed black-box study for firearms analysis.

We discuss the most widely cited of the set-based studies below. We adopt the same framework as for latent prints, focusing primarily on (1) the 95 percent upper confidence limit of the false positive rate and (2) false positive rates based on the proportion of conclusive examinations, as the appropriate measures to report (see p. 91).

318 PCAST interview with Jeff Salyards, Director, DFSC.

Within-set comparison

Some studies have involved within-set comparisons, in which examiners are presented, for example, with a collection of samples and asked to determine which samples were fired from the same firearm. We reviewed two often-cited studies with this design. 319,320 In these studies, most of the samples were from distinct sources, with only 2 or 3 samples being from the same source. Across the two studies, examiners identified 55 of 61 matches and made no false positives.
In the first study, the vast majority of different-source samples (97 percent) were declared inconclusive; there were only 18 conclusive examinations for different-source cartridge cases and no conclusive examinations for different-source bullets. 321 In the second study, the results are only described in a brief paragraph and the number of conclusive examinations for different-source pairs was not reported. It is thus impossible to estimate the false positive rate among conclusive examinations, which is the key measure for consideration (as discussed above).

319 Smith, E. “Cartridge case and bullet comparison validation study with firearms submitted in casework.” AFTE Journal, Vol. 37, No. 2 (2005): 130-5. In this study from the FBI, cartridges and bullets were fired from nine Ruger P89 pistols from casework. Examiners were given packets (of cartridge cases or bullets) containing samples fired from each of the 9 guns and one additional sample fired from one of the guns; they were asked to determine which samples were fired from the same gun. Among the 16 same-source comparisons, there were 13 identifications and 3 inconclusives. Among the 704 different-source comparisons, 97 percent were declared inconclusives, 2.5 percent were declared exclusions and 0 percent were false positives.
320 DeFrance, C.S., and M.D. Van Arsdale. “Validation study of electrochemical rifling.” AFTE Journal, Vol. 35, No. 1 (2003): 35-7. In this study from the FBI, bullets were fired from 5 consecutively manufactured Smith & Wesson .357 Magnum caliber rifle barrels. Each of 9 examiners received two test packets, each containing a bullet from each of the 5 guns and two additional bullets (from the different guns in one packet, from the same gun in the other); they were asked to perform all 42 possible pairwise comparisons, which included 37 different-source comparisons. Of the 45 total same-source comparisons, there were 42 identifications and 3 inconclusives. For the 333 total different-source comparisons, the paper states that there were no false positives, but does not report the number of inconclusive examinations.
321 Some laboratory policies mandate a very high bar for declaring exclusions.

Set-to-set comparison/closed set

Another common design has been between-set comparisons involving a “closed set.” In this case, examiners are given a set of questioned samples and asked to compare them to a set of known standards, representing the possible guns from which the questioned ammunition had been fired. In a “closed-set” design, the source gun is always present. We analyzed four such studies in detail. 322,323,324,325 In these studies, examiners were given a collection of questioned bullets and/or cartridge cases fired from a small number of consecutively manufactured firearms of the same make (3, 10, 10, and 10 guns, respectively) and a collection of bullets (or casings) known to have been fired from these same guns. They were then asked to perform a matching exercise—assigning the bullets (or casings) in one set to the bullets (or casings) in the other set.

This “closed-set” design is simpler than the problem encountered in casework, because the correct answer is always present in the collection. In such studies, examiners can perform perfectly if they simply match each bullet to the standard that is closest. By contrast, in an open-set study (as in casework), there is no guarantee that the correct source is present—and thus no guarantee that the closest match is correct. Closed-set comparisons would thus be expected to underestimate the false positive rate.

Importantly, it is not necessary that examiners be told explicitly that the study design involves a closed set. As one of the studies noted:

The participants were not told whether the questioned casings constituted an open or closed set.
However, from the questionnaire/answer sheet, participants could have assumed it was a closed set and that every questioned casing should be associated with one of the ten slides. 326

322 Stroman, A. “Empirically determined frequency of error in cartridge case examinations using a declared double-blind format.” AFTE Journal, Vol. 46, No. 2 (2014): 157-175. In this study, bullets were fired from three Smith & Wesson guns. Each of 25 examiners received a test set containing three questioned cartridge cases and three known cartridge cases from each gun. Of the 75 answers returned, there were 74 correct assignments and one inconclusive examination.
323 Brundage, D.J. “The identification of consecutively rifled gun barrels.” AFTE Journal, Vol. 30, No. 3 (1998): 438-44. In this study, bullets were fired from 10 consecutively manufactured 9 millimeter Ruger P-85 semi-automatic pistol barrels. Each of 30 examiners received a test set containing 20 questioned bullets to compare to a set of 15 standards, containing at least one bullet fired from each of the 10 guns. Of the 300 answers returned, there were no incorrect assignments and one inconclusive examination.
324 Fadul, T.G., Hernandez, G.A., Stoiloff, S., and S. Gulati. “An empirical study to improve the scientific foundation of forensic firearm and tool mark identification utilizing 10 consecutively manufactured slides.” AFTE Journal, Vol. 45, No. 4 (2013): 376-93. In this study, bullets were fired from 10 consecutively manufactured semiautomatic 9mm Ruger pistol slides. Each of 217 examiners received a test set consisting of 15 questioned casings and two known cartridge cases from each of the 10 guns. Of the 3255 answers returned, there were 3239 correct assignments, 14 inconclusive examinations and two false positives.
325 Hamby, J.E., Brundage, D.J., and J.W. Thorpe. “The identification of bullets fired from 10 consecutively rifled 9mm Ruger pistol barrels: a research project involving 507 participants from 20 countries.” AFTE Journal, Vol. 41, No. 2 (2009): 99-110. In this study, bullets were fired from 10 consecutively rifled Ruger P-85 barrels. Each of 440 examiners received a test set consisting of 15 questioned bullets and two known standards from each of the 10 guns. Of the 6600 answers returned, there were 6593 correct assignments, seven inconclusive examinations and no false positives.
326 Fadul, T.G., Hernandez, G.A., Stoiloff, S., and S. Gulati. “An empirical study to improve the scientific foundation of forensic firearm and tool mark identification utilizing 10 consecutively manufactured slides.” AFTE Journal, Vol. 45, No. 4 (2013): 376-93.

Moreover, as participants find that many of the questioned casings have strong similarities to the known casings, their surmise that matching knowns are always present will tend to be confirmed.

The issue with this study design is not just a theoretical possibility: it is evident in the results themselves. Specifically, the closed-set studies have inconclusive and false-positive rates that are dramatically lower (by more than 100-fold) than those for the partly open design (Miami-Dade study) or the fully open, black-box design (Ames Laboratory study) described below (Table 2). 327

In short, the closed-set design is problematic in principle and appears to underestimate the false positive rate in practice. 328 The design is not appropriate for assessing scientific validity and measuring reliability.

Set-to-set comparison/partly open set (‘Miami Dade study’)

One study involved a set-to-set comparison in which a few of the questioned samples lacked a matching known standard.
329 The 165 examiners in the study were asked to assign a collection of 15 questioned samples, fired from 10 pistols, to a collection of known standards; two of the 15 questioned samples came from a gun for which known standards were not provided. For these two samples, there were 188 eliminations, 138 inconclusives and 4 false positives. The inconclusive rate was 41.8 percent and the false positive rate among conclusive examinations was 2.1 percent (confidence interval 0.6-5.25 percent). The false positive rate corresponds to an estimated rate of 1 error in 48 cases, with upper bound being 1 in 19.

As noted above, the results from the Miami-Dade study are sharply different than those from the closed-set studies: (1) the proportion of inconclusive results was 200-fold higher and (2) the false positive rate was roughly 100-fold higher.

327 Of the 10,230 answers returned across the four studies, there were 10,205 correct assignments, 23 inconclusive examinations and 2 false positives.
328 Stroman (2014) acknowledges that, although the test instructions did not explicitly indicate whether the study was closed, their study could be improved if “additional firearms were used and knowns from only a portion of those firearms were used in the test kits, thus presenting an open set of unknowns to the participants. While this could increase the chances of inconclusive results, it would be a more accurate reflection of the types of evidence received in real casework.”
329 Fadul, T.G., Hernandez, G.A., Stoiloff, S., and S. Gulati. “An empirical study to improve the scientific foundation of forensic firearm and tool mark identification utilizing consecutively manufactured Glock EBIS barrels with the same EBIS pattern.” National Institute of Justice Grant #2010-DN-BX-K269, December 2013. www.ncjrs.gov/pdffiles1/nij/grants/244232.pdf.

Recent black-box study of firearms analysis

In 2011, the Forensic Research Committee of the American Society of Crime Lab Directors identified, among the highest ranked needs in forensic science, the importance of undertaking a black-box study in firearms analysis analogous to the FBI’s black-box study of latent fingerprints. DFSC, dissatisfied with the design of previous studies of firearms analysis, concluded that a black-box study was needed and should be conducted by an independent testing laboratory unaffiliated with law enforcement that would engage forensic examiners as participants in the study. DFSC and Defense Forensics and Biometrics Agency jointly funded a study by the Ames Laboratory, a Department of Energy national laboratory affiliated with Iowa State University. 330

Independent tests/open (‘Ames Laboratory study’)

The study employed a similar design to the FBI’s black-box study of latent fingerprints, with many examiners making a series of independent comparison decisions between a questioned sample and one or more known samples that may or may not contain the source. The samples all came from 25 newly purchased 9mm Ruger pistols. 331 Each of 218 examiners 332 was presented with 15 separate comparison problems—each consisting of one questioned sample and three known test fires from the same known gun, which might or might not have been the source. 333 Unbeknownst to the examiners, there were five same-source and ten different-source comparisons. (In an ideal design, the proportion of same- and different-source comparisons would differ among examiners.)

Among the 2178 different-source comparisons, there were 1421 eliminations, 735 inconclusives and 22 false positives. The inconclusive rate was 33.7 percent and the false positive rate among conclusive examinations was 1.5 percent (upper 95 percent confidence bound 2.2 percent). The false positive rate corresponds to an estimated rate of 1 error in 66 cases, with upper bound being 1 in 46.
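The Ames Laboratory figures quoted above can be checked with a short calculation. The sketch below assumes that the upper bound is a one-sided 95 percent exact (Clopper-Pearson) binomial bound; that choice of method is an assumption of this illustration, and the function names are hypothetical. The bound is found by bisection on the binomial cumulative probability, using only the Python standard library.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def upper_bound(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided exact upper bound: the p at which P(X <= k) drops to alpha."""
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid   # bound lies above mid
        else:
            hi = mid   # bound lies at or below mid
    return hi


# Different-source comparisons reported for the Ames Laboratory study.
eliminations, inconclusives, false_positives = 1421, 735, 22
total = eliminations + inconclusives + false_positives   # 2178
conclusive = eliminations + false_positives              # 1443

inconclusive_rate = inconclusives / total
fp_rate = false_positives / conclusive
fp_upper = upper_bound(false_positives, conclusive)

if __name__ == "__main__":
    print(f"inconclusive rate: {inconclusive_rate:.1%}")
    print(f"false positive rate among conclusive exams: {fp_rate:.1%}")
    print(f"one-sided 95% upper bound: {fp_upper:.1%}")
    print(f"as odds: 1 in {1 / fp_rate:.0f}, bound 1 in {1 / fp_upper:.0f}")
```

The false positive rate is computed over conclusive examinations only (eliminations plus false positives), which is the denominator the text uses; dividing by all 2178 comparisons would give a lower figure.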
(It should be noted that 20 of the 22 false positives were made by just 5 of the 218 examiners—strongly suggesting that the false positive rate is highly heterogeneous across the examiners.)

The results for the various studies are shown in Table 2. The table shows a striking difference between the closed-set studies (where a matching standard is always present by design) and the non-closed studies (where there is no guarantee that any of the known standards match). Specifically, the closed-set studies show a dramatically lower rate of inconclusive examinations and of false positives. With this unusual design, examiners succeed in answering all questions and achieve essentially perfect scores. In the more realistic open designs, these rates are much higher.

330 Baldwin, D.P., Bajic, S.J., Morris, M., and D. Zamzow. “A study of false-positive and false-negative error rates in cartridge case comparisons.” Ames Laboratory, USDOE, Technical Report #IS-5207 (2014) afte.org/uploads/documents/swggun-falsepostive-false-negative-usdoe.pdf.
331 One criticism, raised by a forensic scientist, is that the study did not involve consecutively manufactured guns.
332 Participants were members of AFTE who were practicing examiners employed by or retired from a national or international law enforcement agency, with suitable training.
333 Actual casework may involve more complex situations (for example, many different bullets from a crime scene). But, a proper assessment of foundational validity must start with the question of how often an examiner can determine whether a questioned bullet comes from a specific known source.

Table 2: Results From Firearms Studies*

Study Type | Results for different-source comparisons: Raw Data (Exclusions/Inconclusives/False positives) | Results for different-source comparisons: Inconclusives | False positives among conclusive exams 334: Freq. (Confidence Bound) | False positives among conclusive exams: Estimated Rate | False positives among conclusive exams: Bound on Rate
Set-to-set/closed (four studies) | 10,205/23/2 | 0.2% | 0.02% (0.06%) | 1 in 5103 | 1 in 1612
Set-to-set/partly open (Miami-Dade study) | 188/138/4 | 41.8% | 2.0% (4.7%) | 1 in 49 | 1 in 21
Black-box study (Ames Laboratory study) | 1421/735/22 | 33.7% | 1.5% (2.2%) | 1 in 66 | 1 in 46

* “Inconclusives”: Proportion of total examinations that were called inconclusive. “Raw Data”: Number of false positives divided by number of conclusive examinations involving questioned items without a corresponding known (for set-to-set/partly open) or non-mated pairs (for independent/open). “Freq. (Confidence Bound)”: Point estimate of false positive frequency, with the upper 95 percent confidence bounds. “Estimated”: The odds of a false positive occurring, based on the observed proportion of false positives. “Bound”: The odds of a false positive occurring, based on the upper bound of the confidence interval—that is, the rate could reasonably be as high as this value.

Conclusions

The early studies indicate that examiners can, under some circumstances, associate ammunition with the gun from which it was fired. However, as described above, most of these studies involved designs that are not appropriate for assessing the scientific validity or estimating the reliability of the method as practiced. Indeed, comparison of the studies suggests that, because of their design, many frequently cited studies seriously underestimate the false positive rate.

At present, there is only a single study that was appropriately designed to test foundational validity and estimate reliability (Ames Laboratory study). Importantly, the study was conducted by an independent group, unaffiliated with a crime laboratory. Although the report is available on the web, it has not yet been subjected to peer review and publication. The scientific criteria for foundational validity require appropriately designed studies by more than one group to ensure reproducibility.
Because there has been only a single appropriately designed study, the current evidence falls short of the scientific criteria for foundational validity. 335 There is thus a need for additional, appropriately designed black-box studies to provide estimates of reliability. The rates for all examinations are, reading across rows: 1 in 5115; 1 in 1416; 1 in 83; 1 in 33; 1 in 99; and 1 in 66. The DOJ asked PCAST to review a recent paper, published in July 2016, and judge whether it constitutes an additional appropriately designed black-box study of firearms analysis (that is, the ability to associate ammunition with a particular gun). PCAST carefully reviewed the paper, including interviewing the three authors about the study design. Smith, T.P., 334 335 111 Finding 6: Firearms analysis Foundational validity. PCAST finds that firearms analysis currently falls short of the criteria for foundational validity, because there is only a single appropriately designed study to measure validity and estimate reliability. The scientific criteria for foundational validity require more than one such study, to demonstrate reproducibility. Whether firearms analysis should be deemed admissible based on current evidence is a decision that belongs to the courts. If firearms analysis is allowed in court, the scientific criteria for validity as applied should be understood to require clearly reporting the error rates seen in appropriately designed black-box studies (estimated at 1 in 66, with a 95 percent confidence limit of 1 in 46, in the one such study to date). Smith, G.A., and J.B. Snipes. "A validation study of bullet and cartridge case comparisons using samples representative of actual casework." Journal of forensic sciences Vol. 61, No. 4 (2016): 939-946. The paper involves a novel and complex design that is unlike any previous study. 
Briefly, the study design was as follows: (1) six different types of ammunition were fired from eight 40 caliber pistols from four manufacturers (two Taurus, two Sig Sauer, two Smith and Wesson, and two Glock) that had been in use in the general population and obtained by the San Francisco Police Department; (2) test kits were created by randomly selecting 12 samples (bullets or cartridge cases); (3) 31 examiners were told that the ammunition was all recovered from a single crime scene and were asked to prepare notes describing their conclusions about which sets of samples had been fired from the same gun; and (4) based on each examiner’s notes, the authors sought to re-create the logical path of comparisons followed by each examiner and calculate statistics based on the inferred number of comparisons performed by each examiner. While interesting, the paper clearly is not a black-box study to assess the reliability of firearms analysis to associate ammunition with a particular gun, and its results cannot be compared to previous studies. Specifically: (1) The study employs a within-set comparison design (interdependent comparisons within a set) rather than a black-box design (many independent comparisons); (2) The study involves only a small number of examiners; (3) The central question with respect to firearms analysis is whether examiners can associate spent ammunition with a particular gun, not simply with a particular make of gun. To answer this question, studies must assess examiners’ performance on ammunition fired from different guns of the same make (“within-class” comparisons) rather than from guns of different makes (“between-class” comparisons); the latter comparison is much simpler because guns of different makes produce marks with distinctive “class” characteristics (due to the design of the gun), whereas guns of the same make must be distinguished based on “randomly acquired” features of each gun (acquired during rifling or in use).
Accordingly, previous studies have employed only within-class comparisons. In contrast, the recent study consists of a mixture of within- and between-class comparisons, with the substantial majority being the simpler between-class comparisons. To estimate the false-positive rate for within-class comparisons (the relevant quantity), one would need to know the number of independent tests involving different-source within-class comparisons resulting in conclusive examinations (identification or elimination). The paper does not distinguish between within- and between-class comparisons, and the authors noted that they did not perform such analysis. PCAST’s comments are not intended as a criticism of the recent paper, which is a novel and valuable research project. They simply respond to DOJ’s specific question: the recent paper does not represent a black-box study suitable for assessing scientific validity or estimating the accuracy with which examiners associate ammunition with a particular gun.

Validity as applied. If firearms analysis is allowed in court, validity as applied would, from a scientific standpoint, require that the expert: (1) has undergone rigorous proficiency testing on a large number of test problems to evaluate his or her capability and performance, and discloses the results of the proficiency testing; and (2) discloses whether, when performing the examination, he or she was aware of any other facts of the case that might influence the conclusion.

The Path Forward

Continuing efforts are needed to improve the state of firearms analysis—and these efforts will pay clear dividends for the criminal justice system. One direction is to continue to improve firearms analysis as a subjective method. With only one black-box study so far, there is a need for additional black-box studies based on the study design of the Ames Laboratory black-box study.
As noted above, the studies should be designed and conducted in conjunction with third parties with no stake in the outcome (such as the Ames Laboratory or research centers such as the Center for Statistics and Applications in Forensic Evidence (CSAFE)). There is also a need for more rigorous proficiency testing of examiners, using problems that are appropriately challenging and publicly disclosed after the test.

A second—and more important—direction is (as with latent print analysis) to convert firearms analysis from a subjective method to an objective method. This would involve developing and testing image-analysis algorithms for comparing the similarity of tool marks on bullets. There have already been encouraging steps toward this goal. 336 Recent efforts to characterize 3D images of bullets have used statistical and machine learning methods to construct a quantitative “signature” for each bullet that can be used for comparisons across samples. A recent review discusses the potential for surface topographic methods in ballistics and suggests approaches to use these methods in firearms examination. 337 The authors note that the development of optical methods has improved the speed and accuracy of capturing surface topography, leading to improved quantification of the degree of similarity.

336 For example, a recent study used data from three-dimensional confocal microscopy of ammunition to develop a similarity metric to compare images. By performing all pairwise comparisons among a total of 90 cartridge cases fired from 10 pistol slides, the authors found that the distribution of the metric for same-gun pairs did not overlap the distribution of the metric for different-gun pairs. Although a small study, it is encouraging. Weller, T.J., Zheng, X.A., Thompson, R.M., and F. Tulleners. “Confocal microscopy analysis of breech face marks on fired cartridge cases from 10 consecutively manufactured pistol slides.” Journal of Forensic Sciences, Vol. 57, No. 4 (2012): 912-17.

337 Vorburger, T.V., Song, J., and N. Petraco. “Topography measurements and applications in ballistics and tool mark identification.” Surface Topography: Metrology and Properties, Vol. 4 (2016) 013002.

In a recent study, researchers used images from an earlier study to develop a computer-assisted approach to match bullets that minimizes human input. 338 The group’s algorithm extracts a quantitative signature from a 3D image of a bullet, compares the signature across two or more samples, and produces a “matching score,” reflecting the strength of the match. On the small test data set, the algorithm had a very low error rate.

There are additional efforts in the private sector focused on development of accurate high-resolution cartridge casing representations to improve accuracy and allow for higher quality scoring functions to improve and assign match confidence during database searches. The current NIBIN database uses older (non-3D) technology and does not provide a scoring function or confidence assignment to each candidate match. It has been suggested that a scoring function could be used for blind verification for human examiners.

Given the tremendous progress over the past decade in other fields of image analysis, we believe that fully automated firearms analysis is likely to be possible in the near future. However, efforts are currently hampered by lack of access to realistically large and complex databases that can be used to continue development of these methods and validate initial proposals. NIST, in coordination with the FBI Laboratory, should play a leadership role in propelling this transformation by creating and disseminating appropriate large datasets. These agencies should also provide grants and contracts to support work—and systematic processes to evaluate methods.
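The signature-and-score idea described above, together with the non-overlap check reported for same-gun and different-gun pairs, can be illustrated with a toy sketch. It is not the method of any cited study: lag-tolerant Pearson correlation is swapped in here as a generic stand-in similarity metric, and the names `pearson`, `match_score`, and `distributions_separate`, as well as the `max_lag` parameter, are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def match_score(sig_a, sig_b, max_lag=10):
    """Best correlation between two 1-D signatures over small relative shifts."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = sig_a[lag:], sig_b[:len(sig_b) - lag]
        else:
            x, y = sig_a[:len(sig_a) + lag], sig_b[-lag:]
        m = min(len(x), len(y))
        if m < 2:
            continue
        best = max(best, pearson(x[:m], y[:m]))
    return best


def distributions_separate(same_source_scores, different_source_scores):
    """True when every same-source score exceeds every different-source score."""
    return min(same_source_scores) > max(different_source_scores)
```

A real system would need validated signatures extracted from measured surface topography and, as the report stresses, error rates estimated on large, realistic datasets; the synthetic sequences one might feed into this sketch say nothing about casework performance.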
In particular, we believe that “prize” competitions—based on large, publicly available collections of images 339—could attract significant interest from academia and industry.

338 Hare, E., Hofmann, H., and A. Carriquiry. “Automatic matching of bullet lands.” Unpublished paper, available at: arxiv.org/pdf/1601.05788v2.pdf.

339 On July 7, 2016, NIST released the NIST Ballistics Toolmark Research Database (NBTRD) as an open-access research database of bullet and cartridge case toolmark data (tsapps.nist.gov/NRBTD). The database contains reflectance microscopy images and three-dimensional surface topography data acquired by NIST or submitted by users.

5.6 Footwear Analysis: Identifying Characteristics

Methodology

Footwear analysis is a process that typically involves comparing a known object, such as a shoe, to a complete or partial impression found at a crime scene, to assess whether the object is likely to be the source of the impression. The process proceeds in a stepwise manner, beginning with a comparison of “class characteristics” (such as design, physical size, and general wear) and then moving to “identifying characteristics” or “randomly acquired characteristics (RACs)” (such as marks on a shoe caused by cuts, nicks, and gouges in the course of use). 340

In this report, we do not address the question of whether examiners can reliably determine class characteristics—for example, whether a particular shoeprint was made by a size 12 shoe of a particular make. While it is important that studies be undertaken to estimate the reliability of footwear analysis aimed at determining class characteristics, PCAST chose not to focus on this aspect of footwear examination because it is not inherently a challenging measurement problem to determine class characteristics, to estimate the frequency of shoes having a particular class characteristic, or (for jurors) to understand the nature of the features in question. Instead, PCAST focused on the reliability of conclusions, based on RACs, that an impression was likely to have come from a specific piece of footwear. This is a much harder problem, because it requires knowing how accurately examiners identify specific features shared between a shoe and an impression, how often they fail to identify features that would distinguish them, and what probative value should be ascribed to a particular RAC.

340 See: SWGTREAD Range of Conclusions Standards for Footwear and Tire Impression Examinations (2013); SWGTREAD Guide for the Examination of Footwear and Tire Impression Evidence (2006); and Bodziak, W.J. Footwear Impression Evidence: Detection, Recovery, and Examination. 2nd ed. CRC Press-Taylor & Francis, Boca Raton, Florida (2000): p. 347.

Despite the absence of empirical studies that measure examiners’ accuracy, authorities in the footwear field express confidence that they can identify the source of an impression based on a single RAC. As described in a 2009 article by an FBI forensic examiner published in the FBI’s Forensic Science Communications:

An examiner first determines whether a correspondence of class characteristics exists between the questioned footwear impression and the known shoe. If the examiner deems that there are no inconsistencies in class characteristics, then the examination progresses to any identifying characteristics in the questioned impression. The examiner compares these characteristics with any identifying characteristics observed on the known shoe. Although unpredictable in their occurrence, the size, shape, and position of these characteristics have a low probability of recurrence in the same manner on a different shoe. Thus, combined with class characteristics, even one identifying characteristic is extremely powerful evidence to support a conclusion of identification. 341

In support, the article cites a leading textbook on footwear identification: According to William J.
Bodziak (2000), “Positive identifications may be made with as few as one random identifying characteristic, but only if that characteristic is confirmable; has sufficient definition, clarity, and features; is in the same location and orientation on the shoe outsole; and in the opinion of an experienced examiner, would not occur again on another shoe.” 342 The article points to a mathematical model by Stone that claims that the chance is 1 in 16,000 that two shoes would share one identifying characteristic and 1 in 683 billion that they would share three characteristics. 343

341 Smith, M.B. The Forensic Analysis of Footwear Impression Evidence. www.fbi.gov/about-us/lab/forensic-science-communications/fsc/july2009/review/2009_07_review02.htm

342 Bodziak, W.J. Footwear Impression Evidence: Detection, Recovery, and Examination. 2nd ed. CRC Press-Taylor & Francis, Boca Raton, Florida (2000).

343 Stone, R.S. “Footwear examinations: Mathematical probabilities of theoretical individual characteristics.” Journal of Forensic Identification, Vol. 56, No. 4 (2006): 577-99.

Such claims for “identification” based on footwear analysis are breathtaking—but lack scientific foundation. The statement by Bodziak has two components: (1) that the examiner consistently observes a demonstrable RAC in a set of impressions and (2) that the examiner is positive that the RAC would not occur on another shoe. The first part is not unreasonable, but the second part is deeply problematic: It requires the examiner to rely on recollections and guesses about the frequency of features. The model by Stone is entirely theoretical: it makes many unsupported assumptions (about the frequency and statistical independence of marks) that it does not test in any way. The entire process—from the choice of features to include (and ignore) to the determination of rarity—relies entirely on an examiner’s subjective judgment.
Under such circumstances, it is essential that the scientific validity of the method and estimates of its reliability be established by multiple, appropriate black-box studies. 344

Background

The 2009 NRC report cited some papers that cast doubt on whether footwear examiners reach consistent conclusions when presented with the same evidence. For example, the report contained a detailed discussion of a 1996 European paper that presented examiners with six mock cases—two involving worn shoes from crime scenes, four with new shoes in which specific identifying characteristics had been deliberately added; the paper reported considerable variation in their answers. 345 PCAST also notes a 1999 Israeli study involving two cases from crime scenes that reached similar conclusions. 346

In response to the 2009 NRC report, a 2013 paper claimed to demonstrate that American and Canadian footwear analysts exhibit greater consistency than seen in the 1996 European study. 347 However, this study differed substantially because the examiners in this study did not conduct their own examinations. For example, the photographs were pre-annotated to call out all relevant features for comparison—that is, the examiners were not asked to identify the features. 348 Thus, the study, by virtue of its design, cannot address the consistency of the examination process.

Moreover, the fundamental issue is not one of consistency (whether examiners give the same answer) but rather of accuracy (whether they give the right answer). Accuracy can be evaluated only from large, appropriately designed black-box studies.

344 In addition to black-box studies, white-box studies are also valuable to identify the sources of errors.

345 Majamma, H., and A. Ytti. “Survey of the conclusions drawn of similar footwear cases in various crime laboratories.” Forensic Science International, Vol. 82, No. 1 (1996): 109-20.

346 Shor, Y., and S. Weisner. “Survey on the conclusions drawn on the same footwear marks obtained in actual cases by several experts throughout the world.” Journal of Forensic Sciences, Vol. 44, No. 2 (1999): 380-384.

347 Hammer, L., Duffy, K., Fraser, J., and N.N. Daeid. “A study of the variability in footwear impression comparison conclusions.” Journal of Forensic Identification, Vol. 63, No. 2 (2013): 205-18.

348 The paper states that “All characteristics and observations that were to be considered by the examiners during the comparisons were clearly identified and labeled for each impression.”

Studies of Scientific Validity and Reliability

PCAST could find no black-box studies appropriately designed to establish the foundational validity of identifications based on footwear analysis. Consistent with our conclusion, the OSAC Footwear and Tire subcommittee recently identified the need for both black-box and white-box examiner reliability studies—citing it as a “major gap in current knowledge” in which there is “no or limited current research being conducted.” 349

Finding 7: Footwear analysis

Foundational validity. PCAST finds there are no appropriate empirical studies to support the foundational validity of footwear analysis to associate shoeprints with particular shoes based on specific identifying marks (sometimes called “randomly acquired characteristics”). Such conclusions are unsupported by any meaningful evidence or estimates of their accuracy and thus are not scientifically valid. PCAST has not evaluated the foundational validity of footwear analysis to identify class characteristics (for example, shoe size or make).

The Path Forward

In contrast to latent fingerprint analysis and firearms analysis, there is little research on which to build with respect to conclusions that seek to associate a shoeprint with a particular shoe (identification conclusions). New approaches will be needed to develop paradigms.
As an initial step, the FBI Laboratory is engaging in a study examining a set of 700 similar boots that were worn by FBI Special Agent cadets during their 16-week training program. The study aims to assess whether RACs are observed on footwear from different individuals. While such “uniqueness” studies (i.e., demonstrations that many objects have distinct features) cannot establish foundational validity (see p. 42), the impressions generated from the footwear could provide an initial dataset for (1) a pilot black-box study and (2) a pilot database of feature frequencies. Importantly, NIST is beginning a study to see if it is possible to quantify the footwear examination process, or at minimum aspects of the process, in an effort to increase the objectivity of footwear analysis. Separately, evaluations should be undertaken concerning the accuracy and reliability of determinations about class characteristics, a topic that is not addressed in this report.

349 See: www.nist.gov/forensics/osac/upload/SAC-Phy-Footwear-Tire-Sub-R-D-001-Examiner-ReliabilityStudy_Revision_Feb_2016.pdf (accessed on May 12, 2016).

5.7 Hair Analysis

Forensic hair examination is a process by which examiners compare microscopic features of hair to determine whether a particular person may be the source of a questioned hair. As PCAST was completing this report, the DOJ released for comment guidelines concerning testimony on hair examination that included supporting documents addressing the validity and reliability of the discipline. 350 While PCAST has not undertaken a comprehensive review of the discipline, we undertook a review of the supporting document in order to shed further light on the standards for conducting a scientific evaluation of a forensic feature-comparison discipline.
The supporting document states that “microscopic hair comparison has been demonstrated to be a valid and reliable scientific methodology,” while noting that “microscopic hair comparisons alone cannot lead to personal identification and it is crucial that this limitation be conveyed both in the written report and in testimony.”

Foundational Studies of Microscopic Hair Examination

In support of its conclusion that hair examination is valid and reliable, the DOJ supporting document discusses five studies of human hair comparison. The primary support is a series of three studies by Gaudette in 1974, 1976 and 1978. 351 The 1974 and 1976 studies focus, respectively, on head hair and pubic hair. Because the designs and results are similar, we focus on the head hair study.

The DOJ supporting document states that “In the head hair studies, a total of 370,230 intercomparisons were conducted, with only nine pairs of hairs that could not be distinguished”—corresponding to a false positive rate of less than 1 in 40,000. More specifically, the design of this 1974 study was as follows: a single examiner (1) scored between 6 and 11 head hairs from each of 100 individuals (a total of 861 hairs) with respect to 23 distinct categories (with a total of 96 possible values); (2) compared the hairs from different individuals, to identify those pairs of hairs with fewer than four differences; and (3) compared these pairs of hairs microscopically to see if they could be distinguished.

The DOJ supporting document fails to note that these studies were strongly criticized by other scientists for flawed methodology. 352 The most serious criticism was that Gaudette compared only hairs from different individuals, but did not look at hairs from the same individual.
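The two figures quoted for the 1974 study can be checked with a short calculation. The sketch below only restates arithmetic on numbers given in the text; the helper names `pair_count` and `one_in` are hypothetical. It shows that the quoted total of 370,230 coincides with the number of unordered pairs that can be formed from 861 hairs, and that nine indistinguishable pairs out of that total is a little under 1 in 40,000.

```python
from math import comb


def pair_count(n_items: int) -> int:
    """Number of unordered pairs among n_items objects."""
    return comb(n_items, 2)


def one_in(events: int, trials: int) -> float:
    """Express events/trials as '1 in N' and return N."""
    return trials / events


if __name__ == "__main__":
    total_pairs = pair_count(861)           # 861 * 860 / 2
    print(total_pairs)                       # 370230
    print(round(one_in(9, total_pairs)))     # 41137
```

This reproduces the headline number only. It does not address the design criticism discussed in the text, namely that no same-source pairs were examined.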
350 See: Department of Justice Proposed Uniform Language for Testimony and Reports for the Forensic Hair Examination Discipline, available at: www.justice.gov/dag/file/877736/download and Supporting Documentation for Department of Justice Proposed Uniform Language for Testimony and Reports for the Forensic Hair Examination Discipline, available at: www.justice.gov/dag/file/877741/download.

351 Gaudette, B.D., and E.S. Keeping. “An attempt at determining probabilities in human scalp hair comparisons.” Journal of Forensic Sciences, Vol. 19 (1974): 599-606; Gaudette, B.D. “Probabilities and Human Pubic Hair Comparisons.” Journal of Forensic Sciences, Vol. 21 (1976): 514-517; Gaudette, B.D. “Some further thoughts on probabilities and human hair comparisons.” Journal of Forensic Sciences, Vol. 23 (1978): 758–763.

352 Wickenheiser, R.A., and D.G. Hepworth. “Further evaluation of probabilities in human scalp hair comparisons.” Journal of Forensic Sciences, Vol. 35 (1990): 1323-29. See also Barnett, P.D. and R.R. Ogle. “Probabilities and human hair comparison.” Journal of Forensic Sciences, Vol. 27 (1982): 272–278 and Gaudette, B.D. “A Supplementary Discussion of Probabilities and Human Hair Comparisons.” Journal of Forensic Sciences, Vol. 27, No. 2 (1982): 279-89.

As pointed out by a 1990 paper by two authors at the Hair and Fibre Unit of the Royal Canadian Mounted Police Forensic Laboratory (as well as in other papers), the apparently low false positive rate could have resulted from examiner bias—that is, that the examiner explicitly knew that all hairs being examined came from different individuals and thus could be inclined, consciously or unconsciously, to search for differences. 353 In short, one cannot appropriately assess a method’s false-positive rate without simultaneously assessing its true-positive rate (sensitivity). In the 1990 paper, the authors used a similar study design, but employed two examiners who examined all pairs of hairs.
They found non-repeatability for the individual examiners (“each examiner had considerable day-to-day variation in hair feature classification”) and non-reproducibility between the examiners (“in many cases, the examiners classified the same hairs differently”). Most notably, they found that, while the examiners found no matches between hairs from different individuals, they also found almost no consistent matches among hairs from the same person. Of 15 pairs of same-source hairs that the authors determined should have been declared to match, only two were correctly called by both examiners. In Gaudette’s 1978 study, the author gave a different hair to each of three examiner trainees, who had completed one year of training, and asked them to identify any matching samples among a reference set of 100 hairs (which, unbeknownst to the examiners, came from 100 different people, including the sources of the hairs). The three examiners reported 1, 1 and 4 matches, consisting of 3 correct and 3 incorrect answers. Of the declared matches, 50 percent were thus false positive associations. Among the 300 total comparisons, the overall false positive rate was 1 percent, which notably is 400-fold higher than the rate estimated in the 1974 study. Interestingly, we noted that the DOJ supporting document wrongly reports the results of the study—claiming that the third examiner trainee made only 1 error, rather than 3 errors. The explanation for this discrepancy is found in a remarkably frank passage of the text, which illustrates the need for employing rigorous protocols in evaluating the results of experiments: “Two trainees correctly identified one hair and only one hair as being similar to the standard. The third trainee first concluded that there were four hairs similar to the standard. Upon closer examination and consultation with the other examiners, he was easily able to identify one of his choices as being incorrect. 
However, he was still convinced that there were three hairs similar to the standard, the correct one and two others. Examination by the author brought the opinion that one of these two others could be eliminated but that the remaining one was indistinguishable from hairs in the standard. Another experienced examiner then studied the hairs and also concluded that one of the two others could be eliminated. This time, however, it was the opposite to the one picked by the author!” 354

Ex post facto reclassification of errors is generally not advisable in studies pertaining to validity and reliability.

353 In addition, inconsistency in scoring features would add random noise to any structure in the data (e.g., feature correlations) and thereby decrease the frequency of matches occurring by chance.

354 Gaudette, B.D. “Some further thoughts on probabilities and human hair comparisons.” Journal of Forensic Sciences, Vol. 23 (1978): 758–763.

The two other human-hair studies discussed in the DOJ supporting document are also problematic. A 1983 paper involved hair samples from 100 individuals, classified into three racial groups. 355 After the author had extensively studied the hairs, she asked a neutral party to set up seven “blind” challenge problems for her—by selecting 10 questioned hairs and 10 known hairs (across groups in three cases, within a group in four cases). 356 The results consist of a single sentence in which the author simply states that she performed with “100 percent accuracy.” Self-reported performance on a test is not generally regarded as appropriate scientific methodology. A 1984 paper studied hairs from 17 pairs of twins (9 fraternal, 6 identical and 2 unknown zygosity) and one set of identical triplets. 357 Interestingly, the hairs from identical twins showed no greater similarity than the hairs from fraternal twins.
In the sole test designed to simulate forensic casework, two examiners were given seven challenge problems, each consisting of comparing a questioned hair to between 5 and 10 known hairs. The false positive rate was 1 in 12, which is roughly 3300-fold higher than in Gaudette’s 1974 study of hair from unrelated individuals. 358 PCAST finds that, based on their methodology and results, the papers described in the DOJ supporting document do not provide a scientific basis for concluding that microscopic hair examination is a valid and reliable process. After describing the scientific papers, the DOJ document goes on to discuss the conclusions that can be drawn from hair comparison: These studies have also shown that microscopic hair comparison alone cannot lead to personal identification and it is crucial that this limitation be conveyed both in the written report and in testimony. The science of microscopic hair comparison acknowledges that the microscopic characteristics exhibited by a questioned hair may be encompassed by the range of characteristics exhibited by known hair samples of more than one person. If a questioned hair is associated with a known hair sample that is truly not the source, it does not mean that the microscopic hair association is in error. Rather, it highlights the limitation of the science in that there is an unknown pool of people who could have contributed the questioned hair. However, studies have not determined the number of individuals who share hairs with the same or similar characteristics. The passage violates fundamental scientific principles in two important ways. The first problem is that it uses the fact that the method’s accuracy is not perfect to dismiss the need to know the method’s accuracy at all. According to the supporting document, it is not an “error” but simply a “limitation of the science” when an examiner associates a hair with an individual who was not actually the source of the hair. This is disingenuous. 
When an expert witness tells a jury that a hair found at the scene of a crime is microscopically indistinguishable from a defendant’s hair, the expert and the prosecution intend the statement to carry weight. Yet, the document goes on to say that no information is available about the proportion of individuals with similar characteristics. As Chapter 4 makes clear, this is scientifically unacceptable. Without appropriate estimates of accuracy, an examiner’s statement that two samples are similar—or even indistinguishable—is scientifically meaningless: it has no probative value, and considerable potential for prejudicial impact. In short, if scientific hair analysis is to mean something, there must be actual empirical evidence about its meaning.

The second problem with the passage is its implication that there is no relevant empirical evidence about the accuracy of hair analysis. In fact, such evidence was generated by the FBI Laboratory. We turn to this point next.

355 Strauss, M.T. “Forensic characterization of human hair.” The Microscope, Vol. 31 (1983): 15-29.

356 The DOJ supporting document mistakenly reports that the comparison-microscopy test involved comparing 100 questioned hairs with 100 known hairs.

357 Bisbing, R.E. and M.F. Wolner. “Microscopical Discrimination of Twins’ Head Hair.” Journal of Forensic Sciences, Vol. 29 (1984): 780-786.

358 The DOJ supporting document describes the results in positive terms: “In the seven tests, one examiners correctly excluded 47 of 52 samples, and a second examiner correctly excluded 49 of 52 samples.” It does not specify whether the remaining results are inconclusive results or false positives.

FBI Study Comparing Microscopic Hair Examination and DNA Analysis

A particularly concerning aspect of the DOJ supporting document is its treatment of the FBI study on hair examination discussed in Chapter 2.
In that 2002 study, FBI personnel used mitochondrial DNA analysis to reexamine 170 samples from previous cases in which the FBI Laboratory had performed microscopic hair examination. The authors found that, in 9 of 80 cases (11 percent) in which the FBI Laboratory had found the hairs to be microscopically indistinguishable, the DNA analysis showed that the hairs actually came from different individuals.

The 2002 FBI study is a landmark in forensic science because it was the first study to systematically and comprehensively analyze a large collection of previous casework to measure the frequency of false-positive associations. Its conclusion is of enormous importance to forensic science, to police, to courts and to juries: When hair examiners conclude in casework that two hair samples are microscopically indistinguishable, the hairs often (1 in 9 times) come from different sources.

Surprisingly, the DOJ document completely ignores this key finding. Instead, it references the FBI study only to support the proposition that DNA analysis “can be used in conjunction with microscopic hair comparison,” citing “a 2002 study, which indicated that out of 80 microscopic associations, approximately 88 percent were also included by additional mtDNA testing.” The document fails to acknowledge that the remaining cases were found to be false associations—that is, results that, if presented as evidence against a defendant, would mislead a jury about the origins of the hairs. 359

Conclusion

Our brief review is intended simply to illustrate potential pitfalls in evaluations of the foundational validity and reliability of a method.
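The error figures quoted across the hair studies in this section can be placed side by side with a short calculation. This is a sketch that only restates arithmetic on numbers appearing in the text; the variable and function names are hypothetical. The fold comparisons use the rounded 1-in-40,000 figure quoted for the 1974 study as the baseline, which is an assumption about how the report's multiples were obtained. Note also that the 2002 casework figure is a share of reported associations, not a false positive rate, so it is listed separately rather than compared with the baseline.

```python
from fractions import Fraction

# Rounded figure quoted for the 1974 head hair study (assumed baseline).
BASELINE_1974 = Fraction(1, 40000)


def fold_over_baseline(rate: Fraction) -> float:
    """How many times larger a rate is than the 1974 baseline."""
    return float(rate / BASELINE_1974)


# 1978 trainee study: 3 incorrect matches among 3 x 100 comparisons,
# and 3 of the 1 + 1 + 4 declared matches were wrong.
rate_1978 = Fraction(3, 300)
false_share_of_declared_1978 = Fraction(3, 1 + 1 + 4)

# 1984 twin study: false positive rate quoted as 1 in 12.
rate_twins_1984 = Fraction(1, 12)

# 2002 FBI casework review: 9 of 80 reported associations were false.
false_associations_2002 = Fraction(9, 80)

if __name__ == "__main__":
    print(fold_over_baseline(rate_1978))               # 400.0
    print(round(fold_over_baseline(rate_twins_1984)))  # 3333
    print(float(false_share_of_declared_1978))         # 0.5
    print(float(false_associations_2002))              # 0.1125
```

The 400-fold and roughly 3300-fold multiples in the text follow directly from the rounded baseline, and 9 of 80 is 11.25 percent, or about 1 in 9.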
PCAST is mindful of the constraints that DOJ faces in undertaking scientific evaluations of the validity and reliability of forensic methods, because critical evaluations by DOJ might be taken as admissions that could be used to challenge past convictions or current prosecutions. These issues highlight why it is important for evaluations of scientific validity and reliability to be carried out by a science-based agency that is not itself involved in the application of forensic science within the legal system (see Section 6.1). They also underscore why it is important that quantitative information about the reliability of methods (e.g., the frequency of false associations in hair analysis) be stated clearly in expert testimony. We return to this point in Chapter 8, where we consider the DOJ’s proposed guidelines, which would bar examiners from providing information about the statistical weight or probability of a conclusion that a questioned hair comes from a particular source.

359 In a footnote, the document also takes pains to note that the paper cannot be taken to provide an estimate of the false-positive rate for microscopic hair comparison, because it contains no data about the number of different-source comparisons that examiners correctly excluded. While this statement is correct, it is misleading, because the paper provides an estimate of a far more important quantity: namely, the frequency of false associations that occurred in actual casework.

5.8 Application to Additional Methods
Although we have undertaken detailed evaluations of only six specific methods and included a discussion of a seventh method, the basic analysis can be applied to assess the foundational validity of any forensic feature-comparison method, including traditional forensic disciplines (such as document examination) as well as methods yet to be developed (such as microbiome analysis or internet-browsing patterns).
We note that the evaluation of scientific validity is based on the available scientific evidence at a point in time. Some methods that have not been shown to be foundationally valid may ultimately be found to be reliable, although significant modifications to the methods may be required to achieve this goal. Other methods may not be salvageable, as was the case with compositional bullet lead analysis and is likely the case with bitemarks. Still others may be subsumed by different but more reliable methods, much as DNA analysis has replaced other methods in many instances.

5.9 Conclusion
As the chapter above makes clear, many forensic feature-comparison methods have historically been assumed rather than established to be foundationally valid based on appropriate empirical evidence. Only within the past decade has the forensic science community begun to recognize the need to empirically test whether specific methods meet the criteria for scientific validity. Only in the past five years, for example, have there been appropriate studies that establish the foundational validity and measure the reliability of latent fingerprint analysis. For most subjective methods, there are no appropriate black-box studies, with the result that there is no appropriate evidence of foundational validity or estimates of reliability.

The scientific analysis and findings in Chapters 4 and 5 are intended to help focus the relevant actors on how to ensure scientific validity, both for existing technologies and for technologies still to be developed. PCAST expects that some forensic feature-comparison methods may be rejected by courts as inadmissible because they lack adequate evidence of scientific validity.
We note that decisions to exclude unreliable methods have historically helped propel major improvements in forensic science, as happened in the early days of DNA evidence, with the result that some methods become established (possibly in revised form) as scientifically valid, while others are discarded.

In the remaining chapters, we offer recommendations on specific actions that could be taken by the Federal Government, including science-based agencies (NIST and OSTP), the FBI Laboratory, the Attorney General, and the Federal judiciary, to ensure the scientific validity and reliability of forensic feature-comparison methods and promote their more rigorous use in the courtroom.

6. Actions to Ensure Scientific Validity in Forensic Science: Recommendations to NIST and OSTP
Based on the scientific findings in Chapters 4 and 5, PCAST has identified actions that we believe should be taken by science-based Federal agencies (specifically, NIST and OSTP) to ensure the scientific validity of forensic feature-comparison methods.

6.1 Role for NIST in Ongoing Evaluation of Foundational Validity
There is an urgent need for ongoing evaluation of the foundational validity of important methods, to provide guidance to the courts, the DOJ, and the forensic science community. Evaluations should be undertaken of both existing methodologies that have not yet met the scientific standards for foundational validity and new methodologies that are being and will be developed in the years ahead. To ensure that the scientific judgments are unbiased and independent, such evaluations must clearly be conducted by a science agency with no stake in the outcome. 360 This responsibility should be lodged with NIST.

NIST is the world’s leading metrological laboratory, with a long and distinguished history in the science and technology of measurement.
It has tremendous experience in designing and carrying out validation studies, as well as assessing the foundational validity and reliability of laboratory techniques and practices. NIST’s mission of advancing measurement science, technology, and standards has expanded from traditional physical measurement standards to respond to many other important societal needs, including those of forensic science, in which NIST has vigorous programs. 361 As described above, NIST has begun to lead a number of important efforts to strengthen the forensic sciences, including its roles with respect to NCFS and OSAC.

PCAST recommends that NIST be tasked with responsibility for preparing an annual report evaluating the foundational validity of key forensic feature-comparison methods, based on available, published empirical studies. These evaluations should be conducted under the auspices of NIST, with additional expertise, as deemed necessary, from experts outside forensic science, and overseen by an appropriate review panel. The reports should, at a minimum, produce assessments along the lines of those in this report, updated as appropriate. Our intention is not that NIST have a formal regulatory role with respect to forensic science, but rather that NIST’s evaluations help inform courts, the DOJ, and the forensic science community.

360 For example, agencies that apply forensic feature-comparison methods within the legal system have a clear stake in the outcome of such evaluations.
361 See: www.nist.gov/forensics.

We do not expect NIST to take responsibility for conducting the necessary validation studies. However, NIST should advise on the design and execution of such studies. NIST could carry out some studies through its own intramural research program and through CSAFE. However, the majority of studies will likely be conducted by other groups, such as NSF’s planned Industry/University Cooperative Research Centers; the FBI Laboratory; the U.S.
national laboratories; other Federal agencies; state laboratories; and academic researchers.

We note that the NCFS has recently endorsed the need for independent scientific review of forensic science methods. A Views Document overwhelmingly approved by the commission in June 2016 stated that, “All forensic science methodologies should be evaluated by an independent scientific body to characterize their capabilities and limitations in order to accurately and reliably answer a specific and clearly defined forensic question” and that “The National Institute of Standards and Technology (NIST) should assume the role of independent scientific evaluator within the justice system for this purpose.” 362

Finally, we believe that the state of forensic science would be improved if papers on the foundational validity of forensic feature-comparison methods were published in leading scientific journals rather than in forensic-science journals, where, owing to weaknesses in the research culture of the forensic science community discussed in this report, the standards for peer review are less rigorous. Commendably, FBI scientists published their black-box study of latent fingerprints in the Proceedings of the National Academy of Sciences. We suggest that NIST explore with one or more leading scientific journals the possibility of creating a process for rigorous review and online publication of important studies of foundational validity in forensic science. Appropriate journals could include Metrologia, a leading international journal in pure and applied metrology, and the Proceedings of the National Academy of Sciences.

6.2 Accelerating the Development of Objective Methods
As described throughout the report, objective methods are generally preferable to subjective methods. The reasons include greater accuracy, greater efficiency, lower risk of human error, lower risk of cognitive bias, and greater ease of establishing foundational validity and estimating reliability.
Where possible, vigorous efforts should be undertaken to transform subjective methods into objective methods. Two forensic feature-comparison methods (latent fingerprint analysis and firearms analysis) are ripe for such transformation. As discussed in the previous chapter, there are strong reasons to believe that both methods can be made objective through automated image analysis. In addition, DNA analysis of complex mixtures has recently been converted into a foundationally valid objective method for a limited range of mixtures, but additional work will be needed to expand the limits of the range. NIST, in conjunction with the FBI Laboratory, should play a leadership role in propelling this transformation by (1) creating and disseminating large datasets to support the development and testing of methods by both companies and academic researchers, (2) providing grant and contract support, and (3) sponsoring processes, such as prize competitions, to evaluate methods.

362 Views of the Commission: Technical Merit Evaluation of Forensic Science Methods and Practices. www.justice.gov/ncfs/file/881796/download.

6.3 Improving the Organization for Scientific Area Committees
The creation by NIST of OSAC was an important step in strengthening forensic science practice. The organizational design, which houses all of the subject area communities under one structure and encourages cross-disciplinary communication and coordination, is a significant improvement over the previous Scientific Working Groups (SWGs), which functioned less formally as stand-alone committees. However, initial lessons from its first years of operation have revealed some important shortcomings. OSAC’s membership includes relatively few independent scientists: it is dominated by forensic professionals, who make up more than two-thirds of its members.
Similarly, it has few independent statisticians: while virtually all of the standards and guidelines evaluated by this body need consideration of statistical principles, OSAC’s 600 members include only 14 statisticians spread across all four Science Area Committees and 23 subcommittees.

Restructuring
PCAST concludes that OSAC lacks sufficient independent scientific expertise and oversight to overcome the serious flaws in forensic science. Some restructuring is necessary to ensure that independent scientists and statisticians have a greater voice in the standards development process, a requirement for meaningful scientific validity. Most importantly, OSAC should have a formal committee, a Metrology Resource Committee, at the level of the other three Resource Committees (the Legal Resource Committee, the Human Factors Committee, and the Quality Infrastructure Committee). This Committee should be composed of laboratory scientists and statisticians from outside the forensic science community and charged with reviewing each standard and guideline that is recommended for registry approval by the Science Area Committees before it is sent for final review by the Forensic Science Standards Board (FSSB).

Availability of OSAC Standards
OSAC is not a formal standard-setting body. It reviews and evaluates standards relevant to forensic science developed by standards developing organizations such as ASTM International, the National Fire Protection Association (NFPA) and the International Organization for Standardization (ISO) for inclusion on the OSAC Registries of Standards and Guidelines. The OSAC evaluation process includes a public comment period. OSAC, working with the standards developers, has arranged for the content of standards under consideration to be accessible to the public during the public comment period. Once approved by OSAC, a standard is listed, by title, on a public registry maintained by NIST.
It is customary for some standards developing organizations, including ASTM International, to charge a fee for a licensed copy of each copyrighted standard and to restrict users from distributing these standards. 363,364 NIST recently negotiated a licensing agreement with ASTM International that, for a fee, allows federal, state and local government employees online access to ASTM Committee E30 standards. 365 However, this list does not include indigent defendants, private defense attorneys, or large swaths of the academic research community. At present, contracts have been negotiated with the other SDOs that have standards currently under review by the OSAC.

PCAST believes it is important that standards intended for use in the criminal justice system are widely available to all who may need access. It is important that the standards be readily available to defendants and to external observers, who have an important role to play in ensuring quality in criminal justice. 366

NIST should ensure that the content of OSAC-registered standards and guidelines is freely available to any party that may desire it in connection with a legal case or for evaluation and research, including by aligning with the policies related to reasonable availability of standards in the Office of Management and Budget Circular A-119, Federal Participation in the Development and Use of Voluntary Consensus Standards and Conformity Assessment Activities and the Office of the Federal Register, IBR (incorporation by reference) Handbook.

363 For a list of ASTM’s forensic science standards, see: www.astm.org/DIGITAL_LIBRARY/COMMIT/PAGES/E30.htm.
364 The American Academy of Forensic Sciences (AAFS) will also become an accredited Standards Developing Organization (SDO) and could, in the future, develop standards for review and listing by OSAC.
6.4 Need for an R&D Strategy for Forensic Science
The 2009 NRC report found that there is an urgent need to strengthen forensic science, noting that, “Forensic science research is not well supported, and there is no unified strategy for developing a forensic science research plan across federal agencies.” 367

It is especially important to create and support a vibrant academic research community rooted in the scientific culture of universities. This will require significant funding to support academic research groups, but will pay big dividends in driving quality and innovation in both existing and entirely new methods.

Both NIST and NSF have recently taken initial steps to help bridge the significant gaps between the forensic practitioner and academic research communities through multi-disciplinary research centers. These centers promise to engage the broader research community in advancing forensic science and to create needed links between the forensic science community and a broad base of research universities, and they could help drive forward critical foundational research.

Nonetheless, as noted in Chapter 2, the total level of Federal funding by NIJ, NIST, and NSF to the academic community for fundamental research in forensic science is extremely small. Substantially larger funding will be needed to develop a robust research community and to support the development and evaluation of promising new technologies.

Federal R&D efforts in forensic science, both intramural and extramural, need to be better coordinated. No one agency has lead responsibility for ensuring that the forensic sciences are adequately supported. Greater coordination is needed across the relevant Federal agencies and laboratories to ensure that funding is directed to the highest priorities and that work is of high quality.

OSTP should convene relevant Federal agencies, laboratories, and stakeholders to develop a national research strategy and 5-year plan to ensure that foundational research in support of the forensic sciences is well coordinated, solidify Federal agency commitments made to date, and galvanize further action and funding that could be taken to encourage additional foundational research, improve current forensic methods, support the creation of new research databases, and oversee the regular review and prioritization of research.

365 According to the revised contract, ASTM will provide unlimited web-based access for all ASTM committee E30 Forensic Science Standards to: OSAC members and affiliates; NIST and Federal/State/Local Crime Laboratories; Public Defenders Offices; Law Enforcement Agencies; Prosecutor Offices; and Medical Examiner and Coroners Offices.
366 PCAST expresses no opinion about the appropriateness of paywalls for standards in areas other than criminal justice.
367 National Research Council. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press. Washington DC. (2009): 78.

6.5 Recommendations
Based on its scientific findings, PCAST makes the following recommendations.

Recommendation 1. Assessment of foundational validity
It is important that scientific evaluations be conducted, on an ongoing basis, to assess the foundational validity of current and newly developed forensic feature-comparison technologies. To ensure the scientific judgments are unbiased and independent, such evaluations must be conducted by a science agency that has no stake in the outcome.
(A) The National Institute of Standards and Technology (NIST) should perform such evaluations and should issue an annual public report evaluating the foundational validity of key forensic feature-comparison methods.
(i) The evaluations should (a) assess whether each method reviewed has been adequately defined and whether its foundational validity has been adequately established and its level of accuracy estimated based on empirical evidence; (b) be based on studies published in the scientific literature by the laboratories and agencies in the U.S. and in other countries, as well as any work conducted by NIST’s own staff and grantees; (c) as a minimum, produce assessments along the lines of those in this report, updated as appropriate; and (d) be conducted under the auspices of NIST, with additional expertise as deemed necessary from experts outside forensic science.
(ii) NIST should establish an advisory committee of experimental and statistical scientists from outside the forensic science community to provide advice concerning the evaluations and to ensure that they are rigorous and independent. The members of the advisory committee should be selected jointly by NIST and the Office of Science and Technology Policy.
(iii) NIST should prioritize forensic feature-comparison methods that are most in need of evaluation, including those currently in use and in late-stage development, based on input from the Department of Justice and the scientific community.
(iv) Where NIST assesses that a method has been established as foundationally valid, it should (a) indicate appropriate estimates of error rates based on foundational studies and (b) identify any issues relevant to validity as applied.
(v) Where NIST assesses that a method has not been established as foundationally valid, it should suggest what steps, if any, could be taken to establish the method’s validity.
(vi) NIST should not have regulatory responsibilities with respect to forensic science.
(vii) NIST should encourage one or more leading scientific journals outside the forensic community to develop mechanisms to promote the rigorous peer review and publication of papers addressing the foundational validity of forensic feature-comparison methods.
(B) The President should request and Congress should provide increased appropriations to NIST of (a) $4 million to support the evaluation activities described above and (b) $10 million to support increased research activities in forensic science, including on complex DNA mixtures, latent fingerprints, voice/speaker recognition, and face/iris biometrics.

Recommendation 2. Development of objective methods for DNA analysis of complex mixture samples, latent fingerprint analysis, and firearms analysis
The National Institute of Standards and Technology (NIST) should take a leadership role in transforming three important feature-comparison methods that are currently subjective (latent fingerprint analysis, firearms analysis, and, under some circumstances, DNA analysis of complex mixtures) into objective methods.
(A) NIST should coordinate these efforts with the Federal Bureau of Investigation Laboratory, the Defense Forensic Science Center, the National Institute of Justice, and other relevant agencies.
(B) These efforts should include (i) the creation and dissemination of large datasets and test materials (such as complex DNA mixtures) to support the development and testing of methods by both companies and academic researchers, (ii) grant and contract support, and (iii) sponsoring processes, such as prize competitions, to evaluate methods.

Recommendation 3. Improving the Organization for Scientific Area Committees process
(A) The National Institute of Standards and Technology (NIST) should improve the Organization for Scientific Area Committees (OSAC), which was established to develop and promulgate standards and guidelines to improve best practices in the forensic science community.
(i) NIST should establish a Metrology Resource Committee, composed of metrologists, statisticians, and other scientists from outside the forensic science community. A representative of the Metrology Resource Committee should serve on each of the Scientific Area Committees (SACs) to provide direct guidance on the application of measurement and statistical principles to the developing documentary standards.
(ii) The Metrology Resource Committee, as a whole, should review and publicly approve or disapprove all standards proposed by the Scientific Area Committees before they are transmitted to the Forensic Science Standards Board.
(B) NIST should ensure that the content of OSAC-registered standards and guidelines is freely available to any party that may desire it in connection with a legal case or for evaluation and research, including by aligning with the policies related to reasonable availability of standards in the Office of Management and Budget Circular A-119, Federal Participation in the Development and Use of Voluntary Consensus Standards and Conformity Assessment Activities and the Office of the Federal Register, IBR (incorporation by reference) Handbook.

Recommendation 4. R&D strategy for forensic science
(A) The Office of Science and Technology Policy (OSTP) should coordinate the creation of a national forensic science research and development strategy.
The strategy should address plans and funding needs for:
(i) major expansion and strengthening of the academic research community working on forensic sciences, including substantially increased funding for both research and training;
(ii) studies of foundational validity of forensic feature-comparison methods;
(iii) improvement of current forensic methods, including converting subjective methods into objective methods, and development of new forensic methods;
(iv) development of forensic feature databases, with adequate privacy protections, that can be used in research;
(v) bridging the gap between research scientists and forensic practitioners; and
(vi) oversight and regular review of forensic science research.
(B) In preparing the strategy, OSTP should seek input from appropriate Federal agencies, including especially the Department of Justice, Department of Defense, National Science Foundation, and National Institute of Standards and Technology; Federal and State forensic science practitioners; forensic science and non-forensic science researchers; and other stakeholders.

7. Actions to Ensure Scientific Validity in Forensic Science: Recommendation to the FBI Laboratory
Based on the scientific findings in Chapters 4 and 5, PCAST has identified actions that we believe should be taken by the FBI Laboratory to ensure the scientific validity of forensic feature-comparison methods. We note that the FBI Laboratory has played an important role in recent years in undertaking high-quality scientific studies of latent fingerprint analysis. PCAST applauds these efforts and urges the FBI Laboratory to expand them.

7.1 Role for FBI Laboratory
The FBI Laboratory is a full-service, state-of-the-art facility that works to apply cutting-edge science to solve cases and prevent crime.
Its mission is to apply scientific capabilities and technical services to the collection, processing, and exploitation of evidence for the Laboratory and other duly constituted law enforcement and intelligence agencies in support of investigative and intelligence priorities. Currently, the Laboratory employs approximately 750 employees and over 300 contractors to meet the broad scope of this mission.

Laboratory Capabilities and Services
The FBI has specialized capabilities and personnel to respond to incidents, collect evidence in the field, carry out forensic analyses, and provide expert witness testimony. The FBI Laboratory supports Evidence Response Teams in all 56 FBI field offices and has personnel who specialize in hazardous evidence and crime scene documentation and data collection. The Laboratory is responsible for training and supplying these response activities for FBI personnel across the U.S. 368 The Laboratory also manages the Terrorist Explosive Device Analytical Center (TEDAC), which received nearly 1,000 evidence submissions in FY 2015 and disseminated over 2,000 intelligence products.

The FBI Laboratory employs forensic examiners to carry out analyses in a range of disciplines, including chemistry, cryptanalysis, DNA, firearms and toolmarks, latent prints, questioned documents, and trace evidence. The FBI Laboratory received over 3,875 evidence submissions and authored over 4,850 laboratory reports in FY 2015. In addition to carrying out casework for federal cases, the Laboratory provides support to state and local laboratories and carries out testing in state and local cases for some disciplines.

368 The FBI Laboratory supported 162 deployments and 168 response exercises, as well as delivering 239 training courses in FY 2015.

Research and Development Activities
In addition to its services, the FBI Laboratory carries out important research and development activities.
The activities are critical for providing the Laboratory with the most advanced tools for advancing its mission. A strong research program and culture are also important to the Laboratory’s ability to maintain excellence and to attract and retain highly qualified personnel. Due to the expansive scope of its operations and the many requirements on them, only about five percent of the FBI Laboratory’s annual $100 million budget is available for research and development activities. 369 The R&D budget is stretched across a number of applied research activities, including validation studies (for new methods or commercial products, such as new DNA analyzers). For its internal research activities, the Laboratory relies heavily on its Visiting Scientist Program, which brings approximately 25 post docs, master’s students, and bachelor’s degree students into the laboratory each year. The Laboratory has worked to partner with other government agencies to provide more resources to its research priorities as a composite initiative, and has also been able to stretch available budgets by performing critical research studies incrementally over several years.

The FBI Laboratory’s series of studies in latent print examination is an example of important foundational research that it was able to carry out incrementally over a five-year period. The work includes “black box” studies that evaluate the accuracy and reliability of latent print examiners’ conclusions, as well as “white box” studies to evaluate how the quality and quantity of features relate to latent print examiners’ decisions. These studies have resulted in a series of important publications that have helped to quantify error rates for the community of practice and assess the repeatability and reproducibility of latent fingerprint examiners’ decisions. Indeed, PCAST’s judgment that latent fingerprint analysis is foundationally valid rests heavily on the FBI black-box study.
Similar lines of research are being pursued in some other disciplines, including firearms examination and questioned documents.

Unfortunately, the limited funding available for these studies, and for the intramural research program more generally, has hampered progress in testing the foundational validity of forensic science methods and in strengthening the forensic sciences. PCAST believes that the budget for the FBI Laboratory should be significantly increased, and targeted so as to allow the R&D budget to be increased to a total of $20 million.

369 In 2014, the FBI Laboratory spent $10.9 million on forensic science research and development, with roughly half from its own budget and half from grants from NIST and the Department of Homeland Security. See: National Academies of Sciences, Engineering, and Medicine. Support for Forensic Science Research: Improving the Scientific Role of the National Institute of Justice. The National Academies Press. Washington DC. (2015): p. 31.

Access to databases
The FBI also has an important role to play in encouraging research by external scientists, by facilitating access, under appropriate conditions, to large forensic databases. Most of the databases routinely used in forensic analysis are not accessible for use by researchers, and the lack of access hampers progress in improving forensic science. For example, ballistic database systems such as the Bureau of Alcohol, Tobacco, Firearms and Explosives’ National Integrated Ballistic Information System (NIBIN), which is searched by firearms examiners seeking to identify a firearm or cartridge case, cannot be assessed to study its completeness, relevance or quality, and the search algorithm that is used to identify potential matches cannot be evaluated.
The NGI (formerly IAFIS) 370 system, which currently houses more than 70 million fingerprint entries, would dramatically expand the data available for study; at present, there exists only one publicly available fingerprint database, consisting of 258 latent print-10 print pairs. 371 Likewise, the FBI’s NDIS system currently houses more than 14 million offender and arrestee DNA profiles. NIST has developed an inventory of all of the forensic databases that are heavily used by law enforcement and forensic scientists, with information as to their accessibility.

Substantial efforts are needed to make existing forensic databases more accessible to the research community, subject to appropriate protection of privacy, such as removal of personally identifiable information and data-use restrictions. For some disciplines, such as firearms analysis and treadmarks, there are no significant privacy concerns. For latent prints, privacy concerns might be ameliorated in a variety of ways. For example, one might avoid the issue by (1) generating large collections of known-latent print pairs with varying quality and quantity of information through the touching and handling of natural items in a wide variety of circumstances (surfaces, pressure, distortion, etc.), (2) using software to automatically generate the “morphing transformations” from the known prints and the latent prints, and (3) applying these transformations to prints from deceased individuals to create millions of latent-known print pairs. 372

For DNA, protocols have been developed in human genomic research, which poses similar or greater privacy concerns, to allow access to bona fide researchers. 373 Such policies should be feasible for forensic DNA databases as well. We note that the law that authorizes the FBI to maintain a national forensic DNA database explicitly contemplates allowing access to DNA samples and DNA analyses “if personally identifiable information is removed . . .
for identification research and protocol development purposes.” 374 Although the law does not contain an explicit statement on this point, DOJ interprets the law as allowing use for this purpose only by criminal justice agencies. It is reluctant, in the absence of statutory clarification, to provide even controlled access to other researchers. This topic deserves attention.

PCAST believes that the availability of data will speed the development of methods, tools, and software that will improve forensic science. For databases under its control, the FBI Laboratory should develop programs to make forensic databases (or subsets of those databases) accessible to researchers under conditions that protect privacy. For databases owned by others, the FBI Laboratory and NIST should each work with other agencies and companies that control the databases to develop programs providing appropriate access.

370 NGI stands for “Next Generation Identification” and combines multiple biometric information systems, including IAFIS, iris and face recognition systems, and others.
371 NIST Special Database 27A, available at: www.nist.gov/itl/iad/image-group/nist-special-database-27a-sd-27a.
372 Medical examiners’ offices routinely collect fingerprints from deceased individuals as part of the autopsy process; these fingerprints could be collected and used to create a large database for research purposes.
373 A number of models developed in the biomedical research context, which allow for tiered access to sensitive data while providing adequate privacy protection, could be employed here. Researchers could be required to sign Non-Disclosure Agreements (NDAs) or enter into limited use agreements. Researchers could be required to access the data on site, so that data cannot be downloaded or shared, or could be permitted to download only aggregated or summary data.
374 Federal DNA Identification Act, 42 U.S.C. §14132(b)(3)(D).
7.2 Recommendation

Based on its scientific findings, PCAST makes the following recommendation.

Recommendation 5. Expanded forensic-science agenda at the Federal Bureau of Investigation Laboratory

(A) Research programs. The Federal Bureau of Investigation (FBI) Laboratory should undertake a vigorous research program to improve forensic science, building on its recent important work on latent fingerprint analysis. The program should include: (i) conducting studies on the reliability of feature-comparison methods, in conjunction with independent third parties without a stake in the outcome; (ii) developing new approaches to improve reliability of feature-comparison methods; (iii) expanding collaborative programs with external scientists; and (iv) ensuring that external scientists have appropriate access to datasets and sample collections, so that they can carry out independent studies.

(B) Black-box studies. Drawing on its expertise in forensic science research, the FBI Laboratory should assist in the design and execution of additional black-box studies for subjective methods, including for latent fingerprint analysis and firearms analysis. These studies should be conducted by or in conjunction with independent third parties with no stake in the outcome.

(C) Development of objective methods. The FBI Laboratory should work with the National Institute of Standards and Technology to transform three important feature-comparison methods that are currently subjective—latent fingerprint analysis, firearm analysis, and, under some circumstances, DNA analysis of complex mixtures—into objective methods. These efforts should include (i) the creation and dissemination of large datasets to support the development and testing of methods by both companies and academic researchers, (ii) grant and contract support, and (iii) sponsoring prize competitions to evaluate methods.

(D) Proficiency testing. The FBI Laboratory should promote increased rigor in proficiency testing by (i) within the next four years, instituting routine blind proficiency testing within the flow of casework in its own laboratory, (ii) assisting other Federal, State, and local laboratories in doing so as well, and (iii) encouraging routine access to and evaluation of the tests used in commercial proficiency testing.

(E) Latent fingerprint analysis. The FBI Laboratory should vigorously promote the adoption, by all laboratories that perform latent fingerprint analysis, of rules requiring a “linear Analysis, Comparison, Evaluation” process—whereby examiners must complete and document their analysis of a latent fingerprint before looking at any known fingerprint and should separately document any additional data used during comparison and evaluation.

(F) Transparency concerning quality issues in casework. The FBI Laboratory, as well as other Federal forensic laboratories, should regularly and publicly report quality issues in casework (in a manner similar to the practices employed by the Netherlands Forensic Institute, described in Chapter 5), as a means to improve quality and promote transparency.

(G) Budget. The President should request and Congress should provide increased appropriations to the FBI to restore the FBI Laboratory’s budget for forensic science research activities from its current level to $30 million and should evaluate the need for increased funding for other forensic-science research activities in the Department of Justice.

8. Actions to Ensure Scientific Validity in Forensic Science: Recommendations to the Attorney General

Based on the scientific findings in Chapters 4 and 5, PCAST has identified actions that we believe should be taken by the Attorney General to ensure the scientific validity of forensic feature-comparison methods and promote their more rigorous use in the courtroom.
8.1 Ensuring the Use of Scientifically Valid Methods in Prosecutions

The Federal Government has a deep commitment to ensuring that criminal prosecutions are not only fair in their process, but correct in their outcome—that is, that guilty individuals are convicted, while innocent individuals are not. Toward this end, the DOJ should ensure that testimony about forensic evidence presented in court is scientifically valid. This report provides guidance to DOJ concerning the scientific criteria for both foundational validity and validity as applied, as well as evaluations of six specific forensic methods and a discussion of a seventh. Over the long term, DOJ should look to ongoing evaluations of forensic methods that should be performed by NIST (as described in Chapter 6).

In the interim, DOJ should undertake a review of forensic feature-comparison methods (beyond those reviewed in this report) to identify which methods used by DOJ lack appropriate black-box studies necessary to assess foundational validity. Because such subjective methods are presumptively not established to be foundationally valid, DOJ should evaluate (1) whether DOJ should present in court conclusions based on such methods and (2) whether black-box studies should be launched to evaluate those methods.

8.2 Revision of DOJ Recently Proposed Guidelines on Expert Testimony

On June 3, 2016, the DOJ released for comment a first set of proposed guidelines, together with supporting documents, on “Proposed Uniform Language for Testimony and Reports” on several forensic sciences, including latent fingerprint analysis and forensic footwear and tire impression analysis. 375 On July 21, 2016, the DOJ released for comment a second set of proposed guidelines and supporting documents for several additional forensic sciences, including microscopic hair analysis, certain types of DNA analysis, and other fields.

375 See: www.justice.gov/dag/proposed-language-regarding-expert-testimony-and-lab-reports-forensic-science. A second set of proposed guidelines was released on July 21, 2016 including hair analysis and mitochondrial DNA and Y chromosome typing (www.justice.gov/dag/proposed-uniform-language-documents-anthropology-explosive-chemistry-explosive-devicesgeology).

The guidelines represent an important step forward, because they instruct DOJ examiners not to make sweeping claims that they can identify the source of a fingerprint or footprint to the exclusion of all other possible sources. PCAST applauds DOJ’s intention and efforts to bring uniformity and to prevent inaccurate testimony concerning feature comparisons.

Some aspects of the guidelines, however, are not scientifically appropriate and embody heterodox views of the kind discussed in Section 4.7. As an illustration, we focus on the guidelines for footwear and tire impression analysis and the guidelines for hair analysis.

Footwear and Tire Impression Analysis

Relevant portions of the guidelines for testimony and reports about forensic footwear and tire impression are shown in Box 6.

BOX 6. Excerpt from DOJ Proposed uniform language for testimony and reports for the forensic footwear and tire impression discipline 376

Statements Approved for Use in Laboratory Reports and Expert Witness Testimony Regarding Forensic Examination of Footwear and Tire Impression Evidence

Identification
1. The examiner may state that it is his/her opinion that the shoe/tire is the source of the impression because there is sufficient quality and quantity of corresponding features such that the examiner would not expect to find that same combination of features repeated in another source. This is the highest degree of association between a questioned impression and a known source. This opinion requires that the questioned impression and the known source correspond in class characteristics and also share one or more randomly acquired characteristics.
This opinion acknowledges that an identification to the exclusion of all others can never be empirically proven.

Statements Not Approved for Use in Laboratory Reports and Expert Witness Testimony Regarding Forensic Examination of Footwear and Tire Impression Evidence

Exclusion of All Others
1. The examiner may not state that a shoe/tire is the source of a questioned impression to the exclusion of all other shoes/tires because all other shoes/tires have not been examined. Examining all of the shoes/tires in the world is a practical impossibility.

Error Rate
2. The examiner may not state a numerical value or percentage regarding the error rate associated with either the methodology used to conduct the examinations or the examiner who conducted the analyses.

Statistical Weight
3. The examiner may not state a numerical value or probability associated with his/her opinion. Accurate and reliable data and/or statistical models do not currently exist for making quantitative determinations regarding the forensic examination of footwear/tire impression evidence.

376 See: www.justice.gov/olp/file/861936/download.

These proposed guidelines have serious problems. An examiner may opine that a shoe is the source of an impression, but not that the shoe is the source of the impression to the exclusion of all other possible shoes. But, as a matter of logic, there is no difference between these two statements. If an examiner believes that X is the source of Y, then he or she necessarily believes that nothing else is the source of Y. Any sensible juror should understand this equivalence.

What then is the goal of the guidelines? It appears to be to acknowledge the possibility of error. In effect, examiners should say, “I believe X is the source of Y, although I could be wrong about that.” This is appropriate. But, the critical question is then: How likely is it that the examiner is wrong?
There’s the rub: the guidelines bar the examiner from discussing the likelihood of error, because there is no accurate or reliable information about accuracy. In effect, examiners are instructed to say, “I believe X is the source of Y, although I could be wrong about that. But, I have no idea how often I’m wrong because we have no reliable information about that.” Such a statement does not meet any plausible test of scientific validity. As Judge Easterly wrote in Williams v. United States, a claim of identification under such circumstances:

has the same probative value as the vision of a psychic: it reflects nothing more than the individual’s foundationless faith in what he believes to be true. This is not evidence on which we can in good conscience rely, particularly in criminal cases, where we demand proof—real proof—beyond a reasonable doubt, precisely because the stakes are so high. 377

377 Williams v. United States, DC Court of Appeals, Decided January 21, 2016, (Easterly, concurring). We cite the analogy for its expositional value concerning the scientific point; we express no position on the role of the case as legal authority.

Hair Analysis

Relevant portions of the guidelines for testimony and reports on forensic hair examination are shown in Box 7.

BOX 7. Excerpt from DOJ Proposed uniform language for testimony and reports for the forensic hair examination discipline 378

Statements Approved for Use in Forensic Hair Examination Testimony and/or Laboratory Reports

Human Hair Comparisons
1. The examiner may state or imply that the questioned human hair is microscopically consistent with the known hair sample and accordingly, the source of the known hair sample can be included as a possible source of the questioned hair.

Statements Not Approved for Use in Forensic Hair Examination Testimony and/or Laboratory Reports

Individualization
1. The examiner may not state or imply that a hair came from a particular source to the exclusion of all others.
Statistical Weight
2. The examiner may not state or imply a statistical weight or probability to a conclusion or provide a likelihood that the questioned hair originated from a particular source.

Zero Error Rate
3. The examiner may not state or imply that the method used in performing microscopic hair examinations has a zero error rate or is infallible.

378 Department of Justice Proposed Uniform Language for Testimony and Reports for the Forensic Hair Examination Discipline, available at: www.justice.gov/dag/file/877736/download.

The guidelines appropriately state that examiners may not claim that they can individualize the source of a hair or that they have a zero error rate. However, while examiners may “state or imply that the questioned human hair is microscopically consistent with the known hair sample and accordingly, the source of the known hair sample can be included as a possible source of the questioned hair,” they are barred from providing accurate information about the reliability of such conclusions. This is contrary to the scientific requirement that forensic feature-comparison methods must be supported by and accompanied by appropriate empirical estimates of reliability.

In particular, as discussed in Section 5.7, a landmark study in 2002 by scientists at the FBI Laboratory showed that, among 80 instances in actual casework where examiners concluded that a questioned hair was microscopically consistent with the known hair sample, the hairs were found by DNA analysis to have come from a different source in 11 percent of cases. The fact that such a significant proportion of conclusions were false associations is of tremendous importance in interpreting conclusions of hair examiners. In cases of hair examination unaccompanied by DNA analysis, examiners should be required to disclose the high frequency of false associations seen in the FBI study so that juries can appropriately weigh conclusions.
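To show how much statistical uncertainty surrounds a rate estimated from 80 cases, the sketch below computes an approximate 95 percent confidence interval for a proportion using the Wilson score method. It is an illustration only: the count of 9 false associations is an assumption chosen to be consistent with the stated 11 percent of 80 (this section does not give the exact count), the function name `wilson_interval` is hypothetical, and the Wilson method is chosen here for illustration rather than taken from the FBI study or the guidelines.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for a binomial proportion (Wilson score).

    z = 1.96 corresponds to a two-sided interval of roughly 95 percent.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    )
    return center - half_width, center + half_width


# Assumed count: 9 of 80 is about 11 percent (illustrative, not stated in the text).
low, high = wilson_interval(9, 80)
print(f"observed rate: {9 / 80:.1%}")
print(f"approximate 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 6 percent to roughly 20 percent, which suggests why a point estimate from a single study of this size is best reported together with its uncertainty.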
Conclusion

The DOJ should revise the proposed guidelines to bring them into alignment with scientific standards for scientific validity. The supporting documentation should also be revised, as discussed in Section 5.7.

8.3 Recommendations

Based on its scientific findings, PCAST makes the following recommendations.

Recommendation 6. Use of feature-comparison methods in Federal prosecutions

(A) The Attorney General should direct attorneys appearing on behalf of the Department of Justice (DOJ) to ensure expert testimony in court about forensic feature-comparison methods meets the scientific standards for scientific validity. While pretrial investigations may draw on a wider range of methods, expert testimony in court about forensic feature-comparison methods in criminal cases—which can be highly influential and has led to many wrongful convictions—must meet a higher standard. In particular, attorneys appearing on behalf of the DOJ should ensure that:
(i) the forensic feature-comparison methods upon which testimony is based have been established to be foundationally valid, as shown by appropriate empirical studies and consistency with evaluations by the National Institute of Standards and Technology (NIST), where available; and
(ii) the testimony is scientifically valid, with the expert’s statements concerning the accuracy of methods and the probative value of proposed identifications being constrained by the empirically supported evidence and not implying a higher degree of certainty.

(B) DOJ should undertake an initial review, with assistance from NIST, of subjective feature-comparison methods used by DOJ to identify which methods (beyond those reviewed in this report) lack appropriate black-box studies necessary to assess foundational validity. Because such subjective methods are presumptively not established to be foundationally valid, DOJ should evaluate whether it is appropriate to present in court conclusions based on such methods.
(C) Where relevant methods have not yet been established to be foundationally valid, DOJ should encourage and provide support for appropriate black-box studies to assess foundational validity and measure reliability. The design and execution of these studies should be conducted by or in conjunction with independent third parties with no stake in the outcome.

Recommendation 7. Department of Justice guidelines on expert testimony

(A) The Attorney General should revise and reissue for public comment the Department of Justice’s (DOJ) proposed “Uniform Language for Testimony and Reports” and supporting documents to bring them into alignment with scientific standards for scientific validity.

(B) The Attorney General should issue instructions directing that:
(i) Where empirical studies and/or statistical models exist to shed light on the accuracy of a forensic feature-comparison method, an examiner should provide quantitative information about error rates, in accordance with guidelines to be established by DOJ and the National Institute of Standards and Technology, based on advice from the scientific community.
(ii) Where there are not adequate empirical studies and/or statistical models to provide meaningful information about the accuracy of a forensic feature-comparison method, DOJ attorneys and examiners should not offer testimony based on the method. If it is necessary to provide testimony concerning the method, they should clearly acknowledge to courts the lack of such evidence.
(iii) In testimony, examiners should always state clearly that errors can and do occur, due both to similarities between features and to human mistakes in the laboratory.

9. Actions to Ensure Scientific Validity in Forensic Science: Recommendations to the Judiciary

Based on the scientific findings in Chapters 4 and 5, PCAST has identified actions that we believe should be taken by the judiciary to ensure the scientific validity of evidence based on forensic feature-comparison methods and promote their more rigorous use in the courtroom.

9.1 Scientific Validity as a Foundation for Expert Testimony

In Federal courts, judges are assigned the critical role of “gatekeepers” charged with ensuring that expert testimony “rests on a reliable foundation.” 379 Specifically, Rule 702 (c,d) of the Federal Rules of Evidence requires that (1) expert testimony must be the product of “reliable principles and methods” and (2) experts must have “reliably applied” the methods to the facts of the case. 380 The Supreme Court has stated that judges must determine “whether the reasoning or methodology underlying the testimony is scientifically valid.” 381

As discussed in Chapter 3, this framework establishes an important conversation between the judiciary and the scientific community. The admissibility of expert testimony depends on a threshold test of whether it meets certain legal standards for evidentiary reliability, which are exclusively the province of the judiciary. Yet, in cases involving scientific evidence, these legal standards are to be “based upon scientific validity.” 382 PCAST does not opine on the legal standards, but aims in this report to clarify the scientific standards that underlie them.

To ensure that the distinction between scientific and legal concepts is clear, we have adopted specific terms to refer to scientific concepts (foundational validity and validity as applied) intended to parallel legal concepts expressed in Rule 702 (c,d).

As the Supreme Court has noted, the judge’s inquiry under Rule 702 is a flexible one: there is no simple one-size-fits-all test that can be applied uniformly to all scientific disciplines. 383 Rather, the evaluation of scientific validity should be based on the appropriate scientific criteria for the scientific field. Moreover, the appropriate scientific field should be the larger scientific discipline to which the method belongs. 384

In this report, PCAST has focused on forensic feature-comparison methods—which belong to the field of metrology, the science of measurement and its application. 385 We have sought—in a form usable by courts, as well as by scientists and others who seek to improve forensic science—to lay out the scientific criteria for foundational validity and validity as applied (Chapter 4) and to illustrate their application to specific forensic feature-comparison methods (Chapter 5).

The scientific criteria are described in Finding 1. PCAST’s conclusions can be summarized as follows:

Scientific validity and reliability require that a method has been subjected to empirical testing, under conditions appropriate to its intended use, that provides valid estimates of how often the method reaches an incorrect conclusion.

For subjective feature-comparison methods, appropriately designed black-box studies are required, in which many examiners render decisions about many independent tests (typically, involving “questioned” samples and one or more “known” samples) and the error rates are determined.

Without appropriate estimates of accuracy, an examiner’s statement that two samples are similar—or even indistinguishable—is scientifically meaningless: it has no probative value, and considerable potential for prejudicial impact.

Nothing—not personal experience nor professional practices—can substitute for adequate empirical demonstration of accuracy.

The applications to specific feature-comparison methods are described in Findings 2-7. The full set of scientific findings is collected in Chapter 10.

Finally, we note that the Supreme Court in Daubert suggested that judges should be mindful of Rule 706, which allows a court at its discretion to procure the assistance of an expert of its own choosing. 386 Such experts can provide independent assessments concerning, among other things, the validity of scientific methods and their applications.

9.2 Role of Past Precedent

One important issue that arose throughout our deliberations was the role of past precedents.

As discussed in Chapter 5, our scientific review found that most forensic feature-comparison methods (with the notable exception of DNA analysis of single-source and simple-mixture samples) have historically been assumed rather than established to be foundationally valid. Only after it became clear in recent years (based on DNA and other analysis) that there are fundamental problems with the reliability of some of these methods has the forensic science community begun to recognize the need to empirically test whether specific methods meet the scientific criteria for scientific validity.

This creates an obvious tension, because many courts admit forensic feature-comparison methods based on longstanding precedents that were set before these fundamental problems were discovered.

379 Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993) at 597.
380 See: www.uscourts.gov/file/rules-evidence.
381 Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993) at 592.
382 Daubert, at FN9 (“in a case involving scientific evidence, evidentiary reliability will be based on scientific validity.” [emphasis in original]).
383 Daubert, at 594.
384 For example, in Frye, the court evaluated whether a proffered lie detector had gained “standing and scientific recognition among physiological and psychological authorities,” rather than among lie detector experts. Frye v. United States, 293 F. 1013 (D.C. Cir. 1923). Similarly, the fact that bitemark examiners believe that bitemark examination is valid carries little weight.
385 See footnote 93 on p.44.
386 Daubert, at 595.
From a purely scientific standpoint, the resolution is clear. When new facts falsify old assumptions, courts should not be obliged to defer to past precedents: they should look afresh at the scientific issues. How are such tensions resolved from a legal standpoint? The Supreme Court has made clear that a court may overrule precedent if it finds that an earlier case was “erroneously decided and that subsequent events have undermined its continuing validity.” 387

PCAST expresses no view on the legal question of whether any past cases were “erroneously decided.” However, PCAST notes that, from a scientific standpoint, subsequent events have indeed undermined the continuing validity of conclusions that were not based on appropriate empirical evidence. These events include (1) the recognition of systemic problems with some forensic feature-comparison methods, including through study of the causes of hundreds of wrongful convictions revealed through DNA and other analysis; (2) the 2009 NRC report from the National Academy of Sciences, the leading scientific advisory body established by the Legislative Branch, 388 that found that some forensic feature-comparison methods lack a scientific foundation; and (3) the scientific review in this report by PCAST, the leading scientific advisory body established by the Executive Branch, 389 finding that some forensic feature-comparison methods lack foundational validity.

9.3 Resources for Judges

Another important issue that arose frequently in our conversations with experts was the need for better resources for judges related to evaluation of forensic feature-comparison methods for use in the courts.

The most appropriate bodies to provide such resources are the Judicial Conference of the United States and the Federal Judicial Center.

The Judicial Conference of the United States is the national policy-making body for the federal courts. 390 Its statutory responsibility includes studying the operation and effect of the general rules of practice and procedure in the federal courts. The Judicial Conference develops best practices manuals and issues Advisory Committee notes to assist judges with respect to specific topics, including through its Standing Advisory Committee on the Federal Rules of Evidence.

The Federal Judicial Center is the research and education agency of the federal judicial system. 391 Its statutory duties include (1) conducting and promoting research on federal judicial procedures and court operations and (2) conducting and promoting orientation and continuing education and training for federal judges, court employees, and others.

387 Boys Markets, Inc. v. Retail Clerks Union, 398 U.S. 235, 238 (1970). See also: Patterson v. McLean Credit Union, 485 U.S. 617, 618 (1988) (noting that the Court has “overruled statutory precedents in a host of cases”). PCAST sought advice on this matter from its panel of Senior Advisors.
388 The National Academy of Sciences was chartered by Congress in 1863 to advise the Federal government on matters of science (U.S. Code, Section 36, Title 1503).
389 The President formally established a standing scientific advisory council soon after the launch of Sputnik in 1957. It is currently titled the President’s Council of Advisors on Science and Technology (operating under Executive Order 13539, as amended by Executive Order 13596).
390 Created in 1922 under the name the Conference of Senior Circuit Judges, the Judicial Conference of the United States is currently established under 28 U.S.C. § 331.
391 The Federal Judicial Center was established by Congress in 1967 (28 U.S.C. §§ 620-629), on the recommendation of the Judicial Conference of the United States.
PCAST recommends that the Judicial Conference of the United States, through its Subcommittee on the Federal Rules of Evidence, develop best practices manuals and an Advisory Committee note and the Federal Judicial Center develop educational programs related to procedures for evaluating the scientific validity of forensic feature-comparison methods.

9.4 Recommendation

Based on its scientific findings, PCAST makes the following recommendation.

Recommendation 8. Scientific validity as a foundation for expert testimony

(A) When deciding the admissibility of expert testimony, Federal judges should take into account the appropriate scientific criteria for assessing scientific validity including:
(i) foundational validity, with respect to the requirement under Rule 702(c) that testimony is the product of reliable principles and methods; and
(ii) validity as applied, with respect to the requirement under Rule 702(d) that an expert has reliably applied the principles and methods to the facts of the case.
These scientific criteria are described in Finding 1.

(B) Federal judges, when permitting an expert to testify about a foundationally valid feature-comparison method, should ensure that testimony about the accuracy of the method and the probative value of proposed identifications is scientifically valid in that it is limited to what the empirical evidence supports. Statements suggesting or implying greater certainty are not scientifically valid and should not be permitted.
In particular, courts should never permit scientifically indefensible claims such as: “zero,” “vanishingly small,” “essentially zero,” “negligible,” “minimal,” or “microscopic” error rates; “100 percent certainty” or proof “to a reasonable degree of scientific certainty;” identification “to the exclusion of all other sources;” or a chance of error so remote as to be a “practical impossibility.”

(C) To assist judges, the Judicial Conference of the United States, through its Standing Advisory Committee on the Federal Rules of Evidence, should prepare, with advice from the scientific community, a best practices manual and an Advisory Committee note, providing guidance to Federal judges concerning the admissibility under Rule 702 of expert testimony based on forensic feature-comparison methods.

(D) To assist judges, the Federal Judicial Center should develop programs concerning the scientific criteria for scientific validity of forensic feature-comparison methods.

10. Scientific Findings

PCAST’s scientific findings in this report are collected below. Finding 1, concerning the scientific criteria for scientific validity, is based on the discussion in Chapter 4. Findings 2–7, concerning foundational validity of six forensic feature-comparison methods, are based on the evaluations in Chapter 5.

Finding 1: Scientific Criteria for Scientific Validity of a Forensic Feature-Comparison Method

(1) Foundational validity.
To establish foundational validity for a forensic feature-comparison method, the following elements are required:
(a) a reproducible and consistent procedure for (i) identifying features within evidence samples, (ii) comparing the features in two samples, and (iii) determining, based on the similarity between the features in two samples, whether the samples should be declared to be likely to come from the same source (“matching rule”); and
(b) empirical estimates, from appropriately designed studies from multiple groups, that establish (i) the method’s false positive rate—that is, the probability it declares a proposed identification between samples that actually come from different sources, and (ii) the method’s sensitivity—that is, the probability it declares a proposed identification between samples that actually come from the same source.

As described in Box 4, scientific validation studies should satisfy a number of criteria: (a) they should be based on sufficiently large collections of known and representative samples from relevant populations; (b) they should be conducted so that examiners have no information about the correct answer; (c) the study design and analysis plan should be specified in advance and not modified afterwards based on the results; (d) the study should be conducted or overseen by individuals or organizations with no stake in the outcome; (e) data, software and results should be available to allow other scientists to review the conclusions; and (f) to ensure that the results are robust and reproducible, there should be multiple independent studies by separate groups reaching similar conclusions.

Once a method has been established as foundationally valid based on adequate empirical studies, claims about the method’s accuracy and the probative value of proposed identifications, in order to be valid, must be based on such empirical studies.
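The two empirical estimates named in element (b) can be illustrated with a small calculation. The sketch below is a minimal illustration, not a prescribed procedure: the class name `BlackBoxCounts`, its field names, and the tallies are all hypothetical, and it assumes that every comparison in the study has a known ground truth and was scored either as a proposed identification or not. How inconclusive results are counted is a study-design choice that the sketch does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlackBoxCounts:
    """Tallies from a hypothetical black-box study (illustrative values only)."""
    same_source_identified: int       # same source, identification declared
    same_source_not_identified: int   # same source, no identification declared
    diff_source_identified: int       # different sources, identification declared
    diff_source_not_identified: int   # different sources, no identification declared


def false_positive_rate(c: BlackBoxCounts) -> float:
    """Share of different-source comparisons declared a proposed identification."""
    total = c.diff_source_identified + c.diff_source_not_identified
    if total == 0:
        raise ValueError("study contains no different-source comparisons")
    return c.diff_source_identified / total


def sensitivity(c: BlackBoxCounts) -> float:
    """Share of same-source comparisons declared a proposed identification."""
    total = c.same_source_identified + c.same_source_not_identified
    if total == 0:
        raise ValueError("study contains no same-source comparisons")
    return c.same_source_identified / total


# Made-up tallies: 1,000 same-source and 1,000 different-source comparisons.
counts = BlackBoxCounts(
    same_source_identified=900,
    same_source_not_identified=100,
    diff_source_identified=5,
    diff_source_not_identified=995,
)
print(f"false positive rate: {false_positive_rate(counts):.3f}")
print(f"sensitivity: {sensitivity(counts):.3f}")
```

With these made-up tallies, the false positive rate is 5 in 1,000 different-source comparisons and the sensitivity is 900 in 1,000 same-source comparisons. Keeping the two denominators separate is the point of the sketch: each rate is conditioned on the true source relationship, not on the examiner's decision.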
For objective methods, foundational validity can be established by demonstrating the reliability of each of the individual steps (feature identification, feature comparison, matching rule, false match probability, and sensitivity).

For subjective methods, foundational validity can be established only through black-box studies that measure how often many examiners reach accurate conclusions across many feature-comparison problems involving samples representative of the intended use. In the absence of such studies, a subjective feature-comparison method cannot be considered scientifically valid.

Foundational validity is a sine qua non, which can only be shown through empirical studies. Importantly, good professional practices (such as the existence of professional societies, certification programs, accreditation programs, peer-reviewed articles, standardized protocols, proficiency testing, and codes of ethics) cannot substitute for empirical evidence of scientific validity and reliability.

(2) Validity as applied. Once a forensic feature-comparison method has been established as foundationally valid, it is necessary to establish its validity as applied in a given case.
As described in Box 5, validity as applied requires that: (a) the forensic examiner must have been shown to be capable of reliably applying the method, as shown by appropriate proficiency testing (see Section 4.6), and must actually have done so, as demonstrated by the procedures actually used in the case, the results obtained, and the laboratory notes, which should be made available for scientific review by others; and (b) the forensic examiner’s assertions about the probative value of proposed identifications must be scientifically valid, including that the expert should report the overall false positive rate and sensitivity for the method established in the studies of foundational validity; demonstrate that the samples used in the foundational studies are relevant to the facts of the case; where applicable, report probative value of the observed match based on the specific features observed in the case; and not make claims or implications that go beyond the empirical evidence.

Finding 2: DNA Analysis

Foundational validity. PCAST finds that DNA analysis of single-source samples or simple mixtures of two individuals, such as from many rape kits, is an objective method that has been established to be foundationally valid.

Validity as applied. Because errors due to human failures will dominate the chance of coincidental matches, the scientific criteria for validity as applied require that an expert (1) should have undergone rigorous and relevant proficiency testing to demonstrate their ability to reliably apply the method, (2) should routinely disclose in reports and testimony whether, when performing the examination, he or she was aware of any facts of the case that might influence the conclusion, and (3) should disclose, upon request, all information about quality testing and quality issues in his or her laboratory.

Finding 3: DNA analysis of complex-mixture samples

Foundational validity. PCAST finds that:

(1) Combined Probability of Inclusion-based methods.
DNA analysis of complex mixtures based on CPI-based approaches has been an inadequately specified, subjective method that has the potential to lead to erroneous results. As such, it is not foundationally valid. A very recent paper has proposed specific rules that address a number of problems in the use of CPI. These rules are clearly necessary. However, PCAST has not had adequate time to assess whether they are also sufficient to define an objective and scientifically valid method. If, for a limited time, courts choose to admit results based on the application of CPI, validity as applied would require that, at a minimum, they be consistent with the rules specified in the paper. DNA analysis of complex mixtures should move rapidly to more appropriate methods based on probabilistic genotyping.

(2) Probabilistic genotyping. Objective analysis of complex DNA mixtures with probabilistic genotyping software is a relatively new and promising approach. Empirical evidence is required to establish the foundational validity of each such method within specified ranges. At present, published evidence supports the foundational validity of analysis, with some programs, of DNA mixtures of three individuals in which the minor contributor constitutes at least 20 percent of the intact DNA in the mixture and in which the DNA amount exceeds the minimum required level for the method. The range in which foundational validity has been established is likely to grow as adequate evidence for more complex mixtures is obtained and published.

Validity as applied. For methods that are foundationally valid, validity as applied involves similar considerations as for DNA analysis of single-source and simple-mixture samples, with a special emphasis on ensuring that the method was applied correctly and within its empirically established range.

Finding 4: Bitemark analysis

Foundational validity.
PCAST finds that bitemark analysis does not meet the scientific standards for foundational validity, and is far from meeting such standards. To the contrary, available scientific evidence strongly suggests that examiners cannot consistently agree on whether an injury is a human bitemark and cannot identify the source of a bitemark with reasonable accuracy.

Finding 5: Latent fingerprint analysis

Foundational validity. Based largely on two recent appropriately designed black-box studies, PCAST finds that latent fingerprint analysis is a foundationally valid subjective methodology, albeit with a false positive rate that is substantial and is likely to be higher than expected by many jurors based on longstanding claims about the infallibility of fingerprint analysis.

Conclusions of a proposed identification may be scientifically valid, provided that they are accompanied by accurate information about limitations on the reliability of the conclusion: specifically, that (1) only two properly designed studies of the foundational validity and accuracy of latent fingerprint analysis have been conducted, (2) these studies found false positive rates that could be as high as 1 error in 306 cases in one study and 1 error in 18 cases in the other, and (3) because the examiners were aware they were being tested, the actual false positive rate in casework may be higher. At present, claims of higher accuracy are not warranted or scientifically justified. Additional black-box studies are needed to clarify the reliability of the method.

Validity as applied. Although we conclude that the method is foundationally valid, there are a number of important issues related to its validity as applied.

(1) Confirmation bias. Work by FBI scientists has shown that examiners typically alter the features that they initially mark in a latent print based on comparison with an apparently matching exemplar. Such circular reasoning introduces a serious risk of confirmation bias.
Examiners should be required to complete and document their analysis of a latent fingerprint before looking at any known fingerprint and should separately document any additional data used during their comparison and evaluation.

(2) Contextual bias. Work by academic scholars has shown that examiners’ judgments can be influenced by irrelevant information about the facts of a case. Efforts should be made to ensure that examiners are not exposed to potentially biasing information.

(3) Proficiency testing. Proficiency testing is essential for assessing an examiner’s capability and performance in making accurate judgments. As discussed elsewhere in this report, there is a need to improve proficiency testing, including making it more rigorous, incorporating it within the flow of casework, and disclosing test problems following a test so that they can be evaluated for appropriateness by the scientific community.

From a scientific standpoint, validity as applied requires that an expert: (1) has undergone appropriate proficiency testing to ensure that he or she is capable of analyzing the full range of latent fingerprints encountered in casework and reports the results of the proficiency testing; (2) discloses whether he or she documented the features in the latent print in writing before comparing it to the known print; (3) provides a written analysis explaining the selection and comparison of the features; (4) discloses whether, when performing the examination, he or she was aware of any other facts of the case that might influence the conclusion; and (5) verifies that the latent print in the case at hand is similar in quality to the range of latent prints considered in the foundational studies.

Finding 6: Firearms analysis

Foundational validity. PCAST finds that firearms analysis currently falls short of the criteria for foundational validity, because there is only a single appropriately designed study to measure validity and estimate reliability.
The scientific criteria for foundational validity require more than one such study, to demonstrate reproducibility. Whether firearms analysis should be deemed admissible based on current evidence is a decision that belongs to the courts. If firearms analysis is allowed in court, the scientific criteria for validity as applied should be understood to require clearly reporting the error rates seen in appropriately designed black-box studies (estimated at 1 in 66, with a 95 percent confidence limit of 1 in 46, in the one such study to date).

Validity as applied. If firearms analysis is allowed in court, validity as applied would, from a scientific standpoint, require that the expert: (1) has undergone rigorous proficiency testing on a large number of test problems to measure his or her accuracy and discloses the results of the proficiency testing; and (2) discloses whether, when performing the examination, he or she was aware of any other facts of the case that might influence the conclusion.

Finding 7: Footwear analysis

Foundational validity. PCAST finds there are no appropriate empirical studies to support the foundational validity of footwear analysis to associate shoeprints with particular shoes based on specific identifying marks (sometimes called “randomly acquired characteristics”). Such conclusions are unsupported by any meaningful evidence or estimates of their accuracy and thus are not scientifically valid. PCAST has not evaluated the foundational validity of footwear analysis to identify class characteristics (for example, shoe size or make).

Appendix A: Statistical Issues

To enhance its accessibility to a broad audience, the main text of this report avoids, where possible, the use of mathematical and statistical terminology. However, for the actual implementation of some of the principles stated in the report, somewhat more precise descriptions are necessary. This Appendix summarizes the relevant concepts from elementary statistics.
[392]

Sensitivity and False Positive Rate

Forensic feature-comparison methods typically aim to determine how likely it is that two samples came from the same source, given the result of a forensic test on the samples. Two possibilities are considered: the null hypothesis (H0) that the two samples are from different sources and the alternative hypothesis (H1) that the two samples are from the same source. The forensic test result may be summarized as match declared (M) or no match declared (O). There are two necessary characterizations of a method’s accuracy: Sensitivity (abbreviated SEN) and False Positive Rate (FPR).

Sensitivity is defined as the probability that the method declares a match between two samples when they are known to be from the same source (drawn from an appropriate population), that is, SEN = P(M | H1). For example, a value SEN = 0.95 would indicate that two samples from the same source will be declared as a match 95 percent of the time. In the statistics literature, SEN is sometimes also called the “true positive rate,” “TPR,” or “recall rate.” [393]

False positive rate (abbreviated FPR) is defined as the probability that the method declares a match between two samples that are from different sources (again in an appropriate population), that is, FPR = P(M | H0). For example, a value FPR = 0.01 would indicate that two samples from different sources will be (mistakenly) called as a match 1 percent of the time. [394] Methods with a high FPR are scientifically unreliable for making important judgments in court about the source of a sample. To be considered reliable, the FPR should certainly be less than 5 percent and it may be appropriate that it be considerably lower, depending on the intended application.

[392] See, e.g.: Peter Armitage, G. Berry, JNS Matthews: Statistical Methods in Medical Research, 4th ed., Blackwell Science, 2002; George Snedecor, William G Cochran: Statistical Methods, 8th ed., Iowa State University Press, 1989; Gerald van Belle, Lloyd D Fisher, Patrick Heagerty, Thomas Lumley: Biostatistics: A Methodology for the Health Sciences, Wiley, 2004; Alan Agresti, Brent A. Coull: Approximate Is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician 52(2), 119-126, 1998; Robert V Hogg, Elliot Tanis, Dale Zimmerman: Probability and Statistical Inference, 9th ed., Pearson, 2015; David Freedman, Robert Pisani, Roger Purves: Statistics. Norton, 2007; Lincoln E Moses: Think and Explain with Statistics, Addison-Wesley, 1986; David S Moore, George P McCabe, Bruce A Craig: Introduction to the Practice of Statistics. W.H. Freeman, 2009.
[393] The term false negative rate is sometimes used for the complement of SEN, that is, FNR = 1 - SEN.
[394] Statisticians may refer to a method’s specificity (SPC) instead of its false positive rate (FPR). The two are related by the formula FPR = 1 - SPC. In the example given, FPR = 0.01 (1 percent) and SPC = 0.99 (99 percent).

The results of a given empirical study can be summarized by four values: the number of occurrences in the study of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). (The matrix of these values is, perhaps oddly, referred to as the “confusion matrix.”)

                                      Test Result
                                      Match     No Match
H1: Truly from same source            TP        FN
H0: Truly from different sources      FP        TN

In this standard-but-confusing terminology, “true” and “false” refer to agreement or disagreement with the ground truth (either H0 or H1), while “positive” and “negative” refer to the test results (that is, results M and O, respectively).

A widely-used estimate, called the maximum likelihood estimate, of SEN is given by TP/(TP+FN), the fraction of events with ground truth H1 (same source) that are correctly declared as M (match). The maximum likelihood estimate of FPR is correspondingly FP/(FP+TN), the fraction of events with ground truth H0 (different source) that are mistakenly declared as M (match).
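As a minimal sketch of the two estimates just described, the following Python computes the maximum likelihood estimates of SEN and FPR from the four confusion-matrix counts. The function name and the example counts are hypothetical, chosen only to reproduce the illustrative values SEN = 0.95 and FPR = 0.01 used above; they are not data from any study discussed in this report.

```python
def estimate_rates(tp, fn, fp, tn):
    """Return (SEN, FPR) maximum likelihood estimates from study counts.

    tp, fn: same-source (H1) comparisons declared match / no match.
    fp, tn: different-source (H0) comparisons declared match / no match.
    """
    same_source = tp + fn
    different_source = fp + tn
    if same_source == 0 or different_source == 0:
        raise ValueError("need at least one H1 and one H0 comparison")
    sen = tp / same_source        # SEN = TP / (TP + FN)
    fpr = fp / different_source   # FPR = FP / (FP + TN)
    return sen, fpr


# Hypothetical study: 100 same-source and 100 different-source comparisons.
sen, fpr = estimate_rates(tp=95, fn=5, fp=1, tn=99)
print(f"SEN = {sen:.2f}, FPR = {fpr:.2f}, SPC = {1 - fpr:.2f}")
```

These point estimates say nothing about their own uncertainty, which depends on how many comparisons the study contained.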
Since the false positive rate will often be the mathematically determining factor in the method’s probative value in a particular case (see the discussion below), it is particularly important that FPR be well measured empirically. In addition, tests with very low sensitivity should be viewed with suspicion because rare positive test results may be matched or outweighed by the occurrence of false positive results. [395]

[395] The argument in favor of a test that “this test succeeds only occasionally, but in this case it did succeed” is thus a fallacious one.

Confidence Intervals

As discussed in the main text, to be valid, empirical measurements of SEN and FPR must be based on large collections of known and representative samples from each relevant population, so as to reflect how often a given feature or combination of features occurs. (Other requirements for validity are also discussed in the main text.)

Since empirical measurements are based on a limited number of samples, SEN and FPR cannot be measured exactly, but only estimated. Because of the finite sample sizes, the maximum likelihood estimates thus do not tell the whole story. Rather, it is necessary and appropriate to quote confidence bounds within which SEN and FPR are highly likely to lie.

Because one should be primarily concerned about overestimating SEN or underestimating FPR, it is appropriate to use a one-sided confidence bound. By convention, a confidence level of 95 percent is most widely used, meaning that there is a 5 percent chance the true value exceeds the bound. Upper 95 percent one-sided confidence bounds should thus be used for assessing the error rates and the associated quantities that characterize forensic feature matching methods. (The use of lower values may rightly be viewed with suspicion as an attempt at obfuscation.)

The confidence bound for proportions depends on the sample size in the empirical study. When the sample size is small, the estimates may be far from the true value.
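To make the dependence on sample size concrete, the following is a minimal Python sketch of a one-sided upper 95 percent confidence bound for a proportion. It assumes the Clopper-Pearson (exact binomial) definition, solved by bisection using only the standard library; the function names `binom_cdf` and `upper_bound` are hypothetical, and this is an illustration rather than the procedure used to produce any figure in this report.

```python
from math import comb


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


def upper_bound(x, n, confidence=0.95, tol=1e-10):
    """One-sided upper exact-binomial bound on the true error rate.

    x: number of errors observed (e.g., false positives).
    n: number of trials.
    Returns the rate p at which seeing x or fewer errors in n trials
    has probability 1 - confidence.
    """
    if not 0 <= x <= n:
        raise ValueError("x must lie between 0 and n")
    if x == n:
        return 1.0
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The CDF falls as p rises, so move up while the data stay plausible.
        if binom_cdf(x, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Zero observed false positives at several study sizes.
for n in (25, 100, 1000):
    print(f"0 errors in {n:>4} trials: upper bound = {upper_bound(0, n):.3f}")
```

When no errors are observed, the bound reduces to the closed form 1 - 0.05^(1/n), which shrinks only slowly as the number of trials grows.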
For example, if an empirical study found no false positives in 25 individual tests, there is still a reasonable chance (at least 5 percent) that the true error rate might be as high as roughly 1 in 9.

For technical reasons, there is no single, universally agreed method for calculating these confidence intervals (a problem known as the “binomial proportion confidence interval”). However, the several widely used methods give very similar results, and should all be considered acceptable: the Clopper-Pearson/Exact Binomial method, the Wilson Score interval, the Agresti-Coull (adjusted Wald) interval, and the Jeffreys interval. [396] Web-based calculators are available for all of these methods. [397] For example, if a study finds zero false positives in 100 tries, the four methods mentioned give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context “the false positive rate might be as high as.” (In this report, we used the Clopper-Pearson/Exact Binomial method.)

Calculating Results for Conclusive Tests

For many forensic tests, examiners may reach a conclusion (e.g., match or no match) or declare that the test is inconclusive. SEN and FPR can thus be calculated based on the conclusive examinations or on all examinations. While both rates are of interest, from a scientific standpoint, the former rate should be used for reporting FPR to a jury. This is appropriate because evidence used against a defendant will typically be based on conclusive, rather than inconclusive, examinations. To illustrate the point, consider an extreme case in which a method had been tested 1000 times and found to yield 990 inconclusive results, 10 false positives, and no correct results. It would be misleading to report that the false positive rate was 1 percent (10/1000 examinations).
Rather, one should report that 100 percent of the conclusive results were false positives (10/10 examinations).

[396] Brown, L.D., Cai, T.T., and A. DasGupta. “Interval estimation for a binomial proportion.” Statistical Science, Vol. 16, No. 2 (2001): 101-33.
[397] For example, see: epitools.ausvet.com.au/content.php?page=CIProportion.

Bayesian Analysis

In this appendix, we have focused on the Sensitivity and False Positive Rate (SEN = P(M | H1) and FPR = P(M | H0)). The quantity of most interest in a criminal trial is P(H1 | M), that is, “the probability that the samples are from the same source given that a match has been declared.” This quantity is often termed the positive predictive value (PPV) of the test.

The calculation of PPV depends on two quantities: the “Bayes factor” BF = SEN/FPR and a second quantity called the “prior odds ratio” (POR). This latter quantity is defined mathematically as POR = P(H0)/P(H1), where P(H0) and P(H1) are the prior (i.e., before doing the test) probabilities of the hypotheses H0 and H1. [398] The formula for PPV in terms of BF and POR is: PPV = BF / (BF + POR), a formula that follows from the statistical principle known as Bayes’ Theorem. [399]

Bayes’ Theorem offers a mathematical way to combine the test result with independent information, such as (1) one’s prior probability that two samples came from the same source and (2) the number of samples searched. Some Bayesian statisticians would choose POR = 1 in the case of a match to a single sample (implying that it is equally likely a priori that the samples came from the same source as from different sources) and POR = 100,000 for a match identified by comparing a sample to a database containing 100,000 samples. Others would set POR = (1-p)/p, where p is the a priori probability of same-source identity in the relevant population, given the other facts of the case.

The Bayesian approach is mathematically elegant.
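The formula PPV = BF / (BF + POR) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, SEN = 0.95 and FPR = 0.01 are the illustrative values used earlier in this appendix, POR = 1 and POR = 100,000 are the two choices mentioned above, and POR = 100 is an added hypothetical intermediate value.

```python
def positive_predictive_value(sen, fpr, prior_odds_ratio):
    """PPV = BF / (BF + POR), with BF = SEN / FPR and POR = P(H0) / P(H1)."""
    if fpr <= 0:
        raise ValueError("FPR must be positive to form the Bayes factor")
    bayes_factor = sen / fpr
    return bayes_factor / (bayes_factor + prior_odds_ratio)


# Same test characteristics, different prior odds ratios.
for por in (1, 100, 100_000):
    ppv = positive_predictive_value(sen=0.95, fpr=0.01, prior_odds_ratio=por)
    print(f"POR = {por:>7}: PPV = {ppv:.4f}")
```

With the test characteristics held fixed, the resulting PPV ranges from close to 1 down to well under 1 percent, so the choice of POR largely determines the answer.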
However, this approach poses challenges for use in courts: (1) different people may hold very different beliefs about POR and (2) many jurors may not understand how beliefs about POR affect the mathematical calculation of PPV. (Moreover, as noted previously, the empirical estimates of SEN and FPR have uncertainty, so the estimated BF = SEN/FPR also has uncertainty.)

Some commentators therefore favor simply reporting the empirically measured quantities (the sensitivity, the false positive rate of the test, and the probability of a false positive match given the number of samples searched against) and allowing a jury to incorporate them into their own intuitive Bayesian judgments. (For example, “Yes, the test has a false positive rate of only 1 in 100, but two witnesses place the defendant 1000 miles from the crime scene, so the test result was probably one of those 1 in 100 false positives.”)

[398] That is, if p is the a priori probability of same-source identity in the population under examination then POR = (1-p)/p.
[399] In the main text, the phrase “appropriately correct for the size of the pool that was searched in identifying a suspect” refers to the use of this formula with an appropriate value for POR.

Appendix B. Additional Experts Providing Input

PCAST sought input from a diverse group of additional experts and stakeholders. PCAST expresses its gratitude to those listed here who shared their expertise. They did not have the opportunity to review drafts of the report, and their willingness to engage with PCAST on specific points does not imply endorsement of the views expressed therein. Responsibility for the opinions, findings, and recommendations in this report and for any errors of fact or interpretation rests solely with PCAST.
Richard Alpert, Assistant Criminal District Attorney, Tarrant County Criminal District Attorney’s Office
Kareem Belt, Forensic Policy Analyst, Innocence Project
William Bodziak, Consultant, Bodziak Forensics
John Buckleton, Principal Scientist, Institute of Environment and Scientific Research, New Zealand
Bruce Budowle, Professor, Executive Director of Institute of Applied Genetics, University of North Texas Health Science Center
Mary A. Bush, Associate Professor, Department of Restorative Dentistry, University at Buffalo School of Dental Medicine
Peter Bush, Research Instructor, Director of the South Campus Instrument Center, University at Buffalo School of Dental Medicine
John Butler, Special Assistant to the Director for Forensic Science, Special Programs Office, National Institute of Standards and Technology
Arturo Casadevall, Professor, Department of Microbiology & Immunology and Department of Medicine, Albert Einstein College of Medicine
Alicia Carriquiry, Distinguished Professor at Iowa State and Director, Center for Statistics and Applications in Forensic Evidence, Iowa State University
Richard Cavanagh, Director, Special Programs Office, National Institute of Standards and Technology
Eleanor Celeste, Policy Analyst, Medical and Forensic Sciences, Office of Science and Technology Policy
Christophe Champod, Professor of Law, Criminal Science and Public Administration, University of Lausanne
Sarah Chu, Senior Forensic Policy Advocate, Innocence Project
Simon A. Cole, Professor of Criminology, Law and Society, School of Social Ecology, University of California Irvine
Kelsey Cook, Program Director, Chemical Measurement and Imaging, National Science Foundation
Patricia Cummings, Special Fields Bureau Chief, Dallas County District Attorney’s Office
Christopher Czyryca, President, Collaborative Testing Services
Dana Delger, Staff Attorney, Innocence Project
Shari Diamond, Howard J. Trienens Professor of Law, Professor of Psychology, Pritzker School of Law, Northwestern University
Itiel Dror, Senior Cognitive Neuroscience Researcher, University College London
Meredith Drosback, Assistant Director, Education and Physical Sciences, Office of Science and Technology Policy
Kimberly Edwards, Physical Scientist Forensic Examiner, Federal Bureau of Investigation Laboratory
Ian Evett, Forensic Statistician, Principal Forensic Services
Chris Fabricant, Director, Strategic Litigation, Innocence Project
Kenneth Feinberg, Steven and Maureen Klinsky Visiting Professor of Practice for Leadership and Progress, Harvard Law School
Rebecca Ferrell, Program Director, Biological Anthropology, National Science Foundation
Jennifer Friedman, Forensic Science Coordinator, Los Angeles County Public Defender
Lynn Garcia, General Counsel, Texas Forensic Science Commission
Daniel Garner, Chief Executive Officer and President, Houston Forensic Science Center
Constantine A. Gatsonis, Henry Ledyard Goddard University Professor of Biostatistics, Chair of Biostatistics, Director of Center for Statistical Sciences, Brown University
Eric Gilkerson, Forensic Examiner, Federal Bureau of Investigation Laboratory
Brandon Giroux, President, Giroux Forensics, L.L.C., and President, Forensic Assurance
Catherine Grgicak, Assistant Professor, Anatomy and Neurobiology, Boston University School of Medicine
Austin Hicklin, Fellow, Noblis
Cindy Homer, Forensic Scientist, Maine State Police Crime Lab
Alice Isenberg, Deputy Assistant Director, Federal Bureau of Investigation Laboratory
Matt Johnson, Senior Forensic Specialist, Orange County Sheriff’s Department
Jonathan Koehler, Beatrice Kuhn Professor of Law, Pritzker School of Law, Northwestern University
Glenn Langenburg, Forensic Science Supervisor, Minnesota Bureau of Criminal Apprehension
Gerald LaPorte, Director, Office of Investigative and Forensic Sciences, National Institute of Justice
Julia Leighton, General Counsel, Public Defender Service, District of Columbia
Alan I. Leshner, Chief Executive Officer, Emeritus, American Association for the Advancement of Science and Executive Publisher of the journal Science
Ryan Lilien, Chief Science Officer, Cadre Research Labs
Elizabeth Mansfield, Deputy Office Director, Personalized Medicine, Food and Drug Administration
Anne-Marie Mazza, Director, Committee on Science, Technology, and Law, The National Academies of Science, Engineering and Medicine
Willie E. May, Director, National Institute of Standards and Technology
Daniel MacArthur, Assistant Professor, Harvard Medical School, Co-Director of Medical and Population Genetics, Broad Institute of Harvard and MIT
Brian McVicker, Forensic Examiner, Federal Bureau of Investigation Laboratory
Stephen Mercer, Director, Litigation Support Group, Office of the Public Defender, State of Maryland
Melissa Mourges, Chief, Forensic Sciences/Cold Case Unit, New York County District Attorney's Office
Peter Neufeld, Co-Director and Co-Founder, Innocence Project
Steven O’Dell, Director, Forensic Services Division, Baltimore Police Department
Lynn Overmann, Senior Policy Advisor, Office of Science and Technology Policy
Skip Palenik, Founder, Microtrace
Matthew Redle, County and Prosecuting Attorney, Sheridan County Prosecutor’s Office
Maria Antonia Roberts, Research Program Manager, Latent Print Support Unit, Federal Bureau of Investigation Laboratory
Walter F. Rowe, Professor of Forensic Sciences, George Washington University
Norah Rudin, President and CEO, Scientific Collaboration, Innovation & Education Group
Jeff Salyards, Director, Defense Forensic Science Center, The Defense Forensics and Biometrics Agency
Rodney Schenck, Defense Forensic Science Center, The Defense Forensics and Biometric Agency
David Senn, Director, Center for Education and Research in Forensics and the Southwest Symposium on Forensic Dentistry, University of Texas Health Science Center at San Antonio
Stephen Shaw, Trace Examiner, Federal Bureau of Investigation Laboratory
Andrew Smith, Supervisor, Firearm/Toolmark Unit, San Francisco Police Department
Erich Smith, Physical Scientist, Firearms-Toolmarks Unit, Federal Bureau of Investigation Laboratory
Tasha Smith, Firearm and Tool Mark Unit, Criminalistics Laboratory, San Francisco Police Department
Jeffrey Snipes, Associate Professor, Criminal Justice Studies, San Francisco State University
Jill Spriggs, Laboratory Director, Sacramento County District Attorney’s Office
Harry Swofford, Chief, Latent Print Branch, Defense Forensics Science Center, The Defense Forensics and Biometric Agency
Robert Thompson, Program Manager, Forensic Data Systems, Law Enforcement Standards Office, National Institute of Standards and Technology
William Thompson, Professor of Criminology, Law, and Society and Psychology & Social Behavior Law, School of Social Ecology, University of California, Irvine
Rick Tontarski, Chief Scientist, Defense Forensic Science Center
Jeremy Triplett, Laboratory Supervisor, Kentucky State Police Central Forensic Laboratory
Richard Vorder Bruegge, Senior Photographic Technologist, Federal Bureau of Investigation
Victor Weedn, Chair of Forensic Sciences, Department of Forensic Sciences, George Washington University
Robert Wood, Associate Professor and Head, Department of Dental Oncology Dentistry, Ocular and Maxillofacial Prosthetics, Princess Margaret Cancer Centre, University of Toronto
Xiaoyu Alan Zheng, Mechanical Engineer, National Institute of Standards and Technology