CRIMINOLOGY & CRIMINAL JUSTICE
Central Connecticut State University

September 7, 2017

To: Chief John C. Gavallas, President, Connecticut Police Chiefs Association
From: Stephen M. Cox, Professor, Department of Criminology and Criminal Justice, Institute for the Study of Crime and Justice
Re: Review of the "State of Connecticut Traffic Stop Data Analysis and Findings, 2013-14" and the "State of Connecticut Traffic Stop Data Analysis and Findings, 2014-2015" reports.

Under the direction of the State of Connecticut Office of Policy and Management, the Institute for Municipal and Regional Policy (IMRP) at Central Connecticut State University has been analyzing police traffic stop data and has presented its findings in three reports (released in April 2015, May 2016, and July 2017). These reports have used a variety of statistical techniques to draw conclusions about specific police departments and individual police officers regarding possible racial profiling activities. They employed numerous statistical analyses that non-statisticians have found challenging to understand and interpret. At the request of the CPCA, and to better understand the methodologies used and the findings presented in these reports, three national experts in policing research and statistical analysis were asked to conduct peer reviews of the 2013-2014 and 2014-2015 reports. They have provided feedback regarding the validity of the reports' methodologies and the statistical techniques utilized. The individual reviews are attached, along with a description of the process used to conduct them. The reviewers' biographies and resumes are also included.

While each review provides its own independent assessment, there are several commonalities across them. First, the use of population-based benchmarks and descriptive statistics to approximate towns' driving populations has weaknesses, and the reviewers recommend that these approaches not be used or presented in these reports.
Second, while the Veil of Darkness and the Control statistical analyses were the most powerful of the IMRP analyses, both techniques contain several flaws that limit the validity of the conclusions drawn from them. Third, two of the reviewers raised significant concerns with the implementation of the post-stop Hit Rate analysis. As such, any conclusions drawn from these analyses should be approached with caution. Fourth, all three reviewers indicated that both reports left out very important details and discussion regarding why certain procedures were used and how some conclusions were reached. This lack of information made it difficult for them to understand and assess various issues in each report.

Additionally, each reviewer expressed a concern with one or both reports that is worth noting. First, the use of such a high number of statistical tests (over 2,000) in the 2014-2015 report increases the likelihood of false positives (i.e., finding disparities in traffic stops where they do not actually exist). Second, the reports did not use the racial/ethnic composition of traffic accidents as a benchmark for traffic stops, although there is a growing body of research suggesting that this is an appropriate technique. Third, one reviewer noted that the IMRP states several times throughout both reports that "racial and ethnic disparities do not, by themselves, provide conclusive evidence of racial profiling," yet it concludes in the first report that "the statistical disparity provides evidence in support of the claim that certain officers in the state are engaged in racial profiling."

Process

In an effort to maintain objectivity in the review process, three national experts were independently contacted and asked if they would review these reports to assess the validity of the underlying assumptions, the statistical methodologies used, and the reports' conclusions.
The experts were approached due to their research expertise and knowledge of police research and practice. To my knowledge, none of the reviewers had personal or professional interests that would have affected their ability to provide objective reviews. In addition, the reviewers worked independently and were unaware of who else was reviewing these reports.

Reviewers' Biographies

Jeffrey T. Grogger, Ph.D. is the Irving Harris Professor in Urban Policy at the University of Chicago Harris School of Public Policy. He has authored dozens of scholarly articles, including a seminal piece on the Veil of Darkness research methodology that was used by the IMRP in their racial profiling reports. An applied microeconomist, he has worked on issues including crime, education, migration, and various aspects of racial inequality. He is a leading authority on social insurance programs and on US welfare reform. Dr. Grogger received a Ph.D. in economics from the University of California, San Diego. Before joining Harris, he taught at the University of California, Los Angeles, and the University of California, Santa Barbara. He has served as a coeditor of the Journal of Human Resources, as Chair of the National Longitudinal Surveys Technical Review Committee, as a research associate for the National Bureau of Economic Research, and as a research fellow with the Institute for the Study of Labor.

Michael R. Smith, J.D., Ph.D. is Professor and Chair of the Department of Criminal Justice at the University of Texas at San Antonio. He holds a J.D. from the University of South Carolina School of Law and a Ph.D. in Justice Studies from Arizona State University. He has served as a principal investigator on many extramural grants and research contracts over his 20-year career as a police scholar and criminal justice researcher. With funding from the National Institute of Justice (NIJ), he led the most comprehensive investigation to date on the use of force by police and injuries to officers and citizens.
He is a nationally recognized expert on racial profiling and use of force and led or contributed to large-scale traffic or pedestrian stop data analysis efforts in San Jose, San Francisco, Los Angeles, Miami-Dade County, FL, Richmond, VA, and with state highway patrol agencies in Washington and Arizona. He is currently a co-principal investigator on an NIJ-funded, randomized controlled evaluation of a police training initiative to reduce conflict and the use of force between police and citizens. In 2016, he served as the senior research lead to the U.S. Department of Justice COPS Office-led police collaborative reform initiative in San Francisco and previously served as a statistical and methodological consultant to the Special Litigation Section of the USDOJ, where he pioneered methodologies to help inform courts, communities, and law enforcement agencies about disparities in police stop practices. He has written extensively about these and other critical issues at the intersection of law, public policy, and policing. His most recent publications have appeared in Justice Quarterly, Criminology & Public Policy, and Policing: An International Journal of Police Strategies & Management.

Edward R. Maguire, Ph.D., is a Professor of Criminology and Criminal Justice at Arizona State University and is an expert in policing and violence, both in the U.S. and abroad. Dr. Maguire served as a professor at George Mason University and the University of Nebraska. He worked as a social science analyst at the U.S. Department of Justice and the United Nations and as an associate social affairs officer at the United Nations Crime Prevention and Criminal Justice Branch in Vienna, Austria. He serves as a member of the Academy of Criminal Justice Sciences, the American Society of Criminology, and the International Association of Chiefs of Police. He has received awards from the Emerald Literati Club and the University at Albany. Dr.
Maguire has published more than 60 chapters and articles in scholarly journals such as Police Quarterly, Family Relations, and Punishment & Society. He is the author of the book Organizational Structure in American Police Agencies: Context, Complexity, and Control. He received his Ph.D. and M.A. in criminal justice from the University at Albany, SUNY, and a B.S. in criminal justice from the University of Lowell.

Jeffrey Grogger
Irving Harris Professor
The University of Chicago Harris Public Policy
jgrogger@uchicago.edu
1155 E. 60th Street, Chicago, IL 60637

TO: Stephen M. Cox, Professor, Department of Criminology and Criminal Justice, Institute for the Study of Crime and Justice, Central Connecticut State University, New Britain, CT 06053
RE: Review of "Traffic Stop Data and Analysis: 2013-2014"
DATE: June 28, 2017

I am happy to provide my review of the report "Traffic Stop Data and Analysis: 2013-2014," as you requested. The report is divided into two main sections, one that analyzes racial profiling in traffic stops and one that conducts an analysis of post-stop outcomes. I follow the lead of the report in devoting most of my discussion to the first part.

This report is an ambitious undertaking with a worthy goal: detecting and understanding racial profiling in traffic stops in the State of Connecticut. Racial disparities in policing are a long-standing problem, and numerous recent events have demonstrated the social polarization and upheaval that the practice can provoke. In order to gauge and remedy one aspect of the problem, many states and localities have recently mandated the collection and reporting of traffic stop data. To the best of my knowledge, Connecticut goes the furthest not only in requiring such data to be analyzed and reported to the public, but in utilizing state-of-the-art statistical techniques to do so.

Racial profiling in traffic stops

The principal technique used by the authors to assess racial profiling in traffic stops is the so-called Veil of Darkness method.
In the main, the authors' analysis and the conclusions based on this approach seem sound. In the discussion below I provide suggestions for improvements and extensions.

Before critiquing individual aspects of the study, it is important to understand the fundamental problem that any analysis of racial profiling must solve. What we seek is to compare the race distribution of traffic stops to the race distribution of persons at risk of a traffic stop. If minorities are overrepresented among traffic stops relative to their representation among the population at risk of being stopped, racial profiling may be to blame.

With a reporting program such as that in Connecticut, it is straightforward to measure the race distribution of traffic stops. One simply tallies the share of minority drivers among all drivers stopped by police. This part of a racial profiling study is uncontroversial. The challenge lies in estimating the race distribution of persons at risk of being stopped, sometimes referred to as the "benchmark," the "counterfactual," or the "risk set." The problem is that the counterfactual is unobserved and depends on a number of complicated factors that make it difficult to estimate. First, to be at risk, one must be driving a car. Presumably, the further one drives, the greater the risk, all else equal. Next, one must be driving in such a manner as to draw the attention of police, legitimately or otherwise. This in turn implies that drivers in areas with a larger police presence will be at greater risk of being stopped, all else equal.

Explicit benchmarking

The complexity of these factors means that almost any attempt to estimate an explicit benchmark will fail to produce a valid test of racial profiling in traffic stops. The report addresses some of these issues explicitly in discussing population benchmarks. The simplest population benchmark is just the race distribution of residents within a particular jurisdiction.
Thanks to Census data, this can be measured fairly accurately on a regular basis, but that is really its only virtue. The population race distribution makes no account of who drives or how much. One can limit the data to persons of driving age, but that fails to account for differences in car ownership, mode of travel, and distance to work. It also fails to account for the fact that the people who drive within a jurisdiction are not necessarily the same as the people who live there. People who drive into the jurisdiction from elsewhere, be it to work, to shop, or for entertainment, may differ from the residents of the jurisdiction. Likewise, residents who drive outside the jurisdiction may not represent a random sample of residents. For all of these reasons, population benchmarks fail to provide a credible counterfactual against which to compare the race distribution of traffic stops.

The report discusses and implements a number of approaches that attempt to produce a more credible explicit benchmark. One is to estimate the race distribution of the "driving population." This involves combining conventional Census population data with recently developed Census data on commuting patterns. This may provide a better estimate of the race distribution of people who are likely to drive within a jurisdiction. However, it fails to account for differences in how much people drive within the jurisdiction, in exposure to that jurisdiction's police, and in driving behavior. Even if it solves one problem, it leaves many unaddressed.

In a similar vein, the authors compare the race distribution of traffic stops of jurisdiction residents to the race distribution of the jurisdiction's population. This may seem to solve the problem of out-of-area drivers, but again does nothing to reflect differences in car ownership, mode of travel, proximity to police, or driving behavior. Furthermore, it would miss any racial profiling that involved out-of-area drivers.
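To make the mechanics concrete: the simplest population-benchmark comparison amounts to a one-sample proportion test. The sketch below uses made-up numbers, not figures from the report.

```python
from math import sqrt

def benchmark_z(minority_stops, total_stops, benchmark_share):
    """One-sample proportion test: minority share of stops vs. a
    population benchmark. Positive z means overrepresentation
    relative to the benchmark."""
    p_hat = minority_stops / total_stops
    se = sqrt(benchmark_share * (1 - benchmark_share) / total_stops)
    return (p_hat - benchmark_share) / se

# Hypothetical town: 1,200 of 4,000 stops involve minority drivers
# (30%), against a Census benchmark of 25% minority residents.
z = benchmark_z(1200, 4000, 0.25)  # roughly 7.3
```

The arithmetic is trivial; the problem identified in the reviews is the benchmark itself. If the 25 percent figure misstates the race distribution of drivers actually at risk of being stopped, the z statistic is precisely computed nonsense, which is why a large value here cannot by itself justify further scrutiny.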
In another approach, the authors compare the race distribution of stops within one jurisdiction to that of so-called peer jurisdictions, where peers are selected based on their similarity in terms of an index constructed from a number of characteristics of the jurisdiction. The weak link here is that the index is necessarily constructed only from observable characteristics. In contrast, many of the characteristics that are important in estimating a valid benchmark are unobservable, including car ownership, miles travelled, proximity to police, driving behavior, and so on. If it were possible to construct peer groups based on these characteristics, there might be some utility to the approach. Since it is not, however, these comparisons too yield invalid tests of racial profiling.

The same critique applies to the authors' comparisons between individual jurisdictions and statewide averages. One can highlight jurisdictions with substantially higher rates of minority stops than one sees in the state as a whole, and one can then compare their minority residential populations with that of the entire state. However, one cannot compare the race distribution of the jurisdiction's risk set with that of the state, and it is the distribution of the risk set that matters.

What does one learn from these various analyses based on explicit benchmarks? To the authors' credit, they refer to these analyses as "descriptive," which seems to suggest that they are not going to draw conclusions on the basis of them. However, at the same time, they highlight the jurisdictions that stand out on the basis of each of these comparisons, suggesting that they warrant further scrutiny. I find it hard to justify this last point. The authors point out that none of these descriptive techniques provide valid tests of racial profiling in traffic stops.
Why then should jurisdictions be scrutinized on the basis of their results? I readily understand why the authors would present simple population benchmarks: it is the natural comparison for lay readers, who will not have thought much about all the problems that they present. However, at that point I would move to the main results, rather than conducting additional invalid tests. Conducting multiple invalid tests gives them more weight than they deserve, and scrutinizing police departments on the basis of even multiple invalid tests has the effect of directing remedial resources toward departments where they are unlikely to be warranted. The focus of the analysis should be on the best tests the authors can conduct.

Implicit benchmarking via the Veil of Darkness

For this they turn to the Veil of Darkness (VOD) test. This test, originally devised by Greg Ridgeway and me, has a number of strengths. It also has some weaknesses, some of which can be remedied. I discuss these in turn.

The VOD test involves what might be called an implicit benchmark, which is based on two ideas. The first is this: in order to engage in racial profiling, the police officer must be able to observe the race of the driver before she pulls him over. This suggests using the race distribution of stops conducted at night, when race is harder to observe, as the benchmark for the race distribution of stops conducted during the day, when it is easier to observe. The second is to restrict attention to stops carried out during the evening intertwilight period, that is, between the hours of roughly 5 and 9 pm. Between 5 and 9 pm, it is daylight during the summer and dark during the winter. Constructing the benchmark from the race distribution of stops during the winter between the hours of 5 and 9 pm solves many of the problems with more conventional benchmarks.
It accounts for the race distribution of people actually driving, implicitly accounting for racial differences in mode of travel and distance travelled. Restricting attention to evening traffic also provides implicit controls for average driving behavior and exposure to police. Since most traffic on the road at that time stems from evening rush hour, it also provides a general control for the composition of drivers, subject to a caveat below.

Comparing the race distribution of drivers stopped during the day to the race distribution of drivers stopped at night, restricting attention to stops carried out during the intertwilight period, provides evidence of racial profiling if a relatively higher share of minority drivers is stopped during the day. A similar test could be constructed from stops carried out during the morning intertwilight period, as the authors do, but there tend to be fewer stops during that time.

One caveat involves changes in the seasonal composition of drivers. If the composition of people who drive between 5 and 9 during the summer is different from that of people who drive between 5 and 9 during the winter, then the test may be invalid. Fortunately, there are ways to check this condition. One could look for changes in the proportion of out-of-state drivers stopped during the summer, particularly in resort areas. One could also look for changes in the share of out-of-town drivers stopped during the summer near seasonal entertainment venues such as baseball stadiums. If necessary, one could exclude such areas from the analysis.

A more general caveat involves the assumption of differential visibility between night and day. Although it is uncontroversial that humans have better daytime than nighttime vision, nighttime visibility may vary as a function of artificial lighting.
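Stripped to its essentials, the VOD comparison is a two-proportion test on intertwilight stops. The published analyses use a logistic-regression version with additional controls, but a minimal sketch with hypothetical counts (not Connecticut figures) conveys the logic:

```python
from math import sqrt

def vod_z(day_minority, day_total, dark_minority, dark_total):
    """Two-proportion z test comparing the minority share of daylight
    stops with that of darkness stops, both drawn from the
    intertwilight window. Positive z: minorities are relatively more
    likely to be stopped when drivers' race is visible."""
    p_day = day_minority / day_total
    p_dark = dark_minority / dark_total
    pooled = (day_minority + dark_minority) / (day_total + dark_total)
    se = sqrt(pooled * (1 - pooled) * (1 / day_total + 1 / dark_total))
    return (p_day - p_dark) / se

# Hypothetical intertwilight stops: 450 of 1,500 daylight stops are
# minority drivers (30%) vs. 300 of 1,200 darkness stops (25%).
z = vod_z(450, 1500, 300, 1200)  # roughly 2.9
```

Because both proportions come from stopped drivers, the darkness stops serve as the implicit benchmark; no external estimate of the driving population is required.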
The National Oceanic and Atmospheric Administration has recently made available nighttime satellite imagery that one could potentially use to better differentiate nighttime visibility and construct more powerful tests.

Finally, in analyzing jurisdiction-specific tests for racial profiling, the authors would be well-served to adjust for multiple hypothesis testing. The typical significance criterion employed for statistical testing assumes that the analyst is carrying out a single test and wishes to keep the probability of Type I error (i.e., the probability of declaring a jurisdiction to be engaging in racial profiling when in fact it is not) below some threshold, typically 5 percent. If the same criterion is used for testing multiple hypotheses (e.g., multiple jurisdictions), the probability of committing at least one Type I error can rise well above 5 percent. Although there are different methods for adjusting for multiple testing, a particularly useful approach in settings such as this is known as false discovery rate (fdr) control. This approach ranks test statistics from largest to smallest, then applies significance criteria that are designed to balance the desire to control Type I error against the desire to discover jurisdictions where racial profiling is actually taking place. Some versions of this procedure also produce an estimate of the probability that a seemingly significant test is truly a false discovery, that is, a Type I error. This type of approach would lower the likelihood of Type I error and provide a sounder basis upon which to recommend further action.

Post-stop analysis

Beyond the question of whom the police are stopping, there is the question of what happens to the drivers once they are stopped. Racial disparities in post-stop treatment, particularly in the rate at which drivers are searched, have been a source of serious policy concern.
The leading test for racial disparities in searches is known as the hit-rate test, or sometimes the Knowles-Persico-Todd test after its original authors. The hit-rate test compares the share of stops in which contraband is found ("hits") across races. Under certain conditions, disparities in hit rates by race are indicative of racial prejudice. Specifically, if the minority hit rate is lower than the white hit rate, it suggests that police are racially profiling minorities. Put differently, since searches of minorities are less productive than searches of whites, it suggests that police are prejudiced against minorities. Put differently still, it means that police use a lower bar in deciding to search minorities than in deciding to search whites.

The hit-rate test has problems both conceptual and operational. The operational problem is that with low search rates (less than 3 percent of stops in the State as a whole), the samples available to conduct the hit-rate tests by jurisdiction are very small. As a result, the tests have low power even where they are feasible. The conceptual problem goes by the term "inframarginality." Although the concept is as complicated as its label is ponderous, what it means in a nutshell is this: if the distribution of "guilt," that is, the distribution of contraband carrying, is sufficiently different between two groups, the hit rate may be higher for one group even though the bar for searching that group is lower. Recent work suggests the conditions underlying inframarginality may occur frequently in practice. As a result, the authors' decision to place less weight on their post-stop analysis seems well warranted.

Jeffrey Grogger
Irving Harris Professor
The University of Chicago Harris Public Policy
jgrogger@uchicago.edu
773.542.3533
1155 E. 60th Street, Chicago, IL 60637

TO: Stephen M.
Cox, Professor, Department of Criminology and Criminal Justice, Institute for the Study of Crime and Justice, Central Connecticut State University, New Britain, CT 06053
RE: Review of "Traffic Stop Data and Analysis: 2014-2015"
DATE: July 6, 2017

I am happy to provide my review of the report "Traffic Stop Data and Analysis: 2014-2015," as you requested. This report shares much in common with its predecessor based on the 2013-14 data. As a result, my main critiques from that report largely apply to this one as well. I start by reiterating the main points of my earlier review, discussing a statistical technique that is new to this report in the process. I then elaborate some issues that arise in the so-called enhanced analyses located in Part II of this report. Finally, I discuss the officer-level analysis, also contained in Part II.

1. My main critiques of the 2013-14 report fell into two general categories:
a. Not all of the statistical tests provide valid controls for the population at risk of being stopped in traffic;
b. The sheer number of statistical tests calls for adjustments to conventional thresholds used to judge statistical significance.

a. Validity of statistical tests

The fundamental problem in testing for racial profiling lies in estimating the race distribution of persons at risk of being stopped, sometimes referred to as the "benchmark," the "counterfactual," or the "risk set." The problem is that the counterfactual is unobserved and depends on a number of complicated factors that make it difficult to estimate. As explained in my earlier report, these factors involve who drives, how much they drive, where they drive, and their exposure to police while driving.

The authors conduct what they refer to as "descriptive" or "informal" as well as "formal" tests. The "informal"
tests generally involve some sort of explicit benchmarking, such as comparing the race distribution of stopped drivers to the race distribution of the local population, in some cases adjusting for the driving-age or out-of-area driving population. The problem with all of these methods is that they fail to account for differences in the amount of driving that drivers do and in their exposure to police. Comparisons between jurisdictions and state averages suffer from similar problems. The authors acknowledge these limitations, which have been widely discussed in the literature. However, they then use them to classify jurisdictions as problematic (my term), arguing that jurisdictions that violate more than one such test merit further scrutiny. Specifically, of the 11 jurisdictions that were classified as problematic on the basis of their 2013-14 results, and therefore subjected to "enhanced analysis" in Part II of the current report, six were so classified only on the basis of the informal tests. I don't follow the logic of how one arrives at a valid conclusion by combining invalid tests. I would limit attention to tests with high validity.

The authors' "formal" tests include the veil-of-darkness (VOD) test and the hit-rate test, which appeared in the 2013-14 report as well. New this year is what they refer to as a "control" test, which would be more accurately referred to as a matching approach. The idea is to match "treatment" stops, that is, those in the target jurisdiction, with "comparison" stops from other jurisdictions, in such a way that the comparison stops provide good controls for those in the target jurisdiction. Matching is done on the basis of both jurisdiction and stop-level characteristics, after which the race distribution of stops in the target jurisdiction is compared to the race distribution of comparison stops.
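A stripped-down, exact-matching analogue of this idea might look like the sketch below. The field names are hypothetical, and the report itself matches on richer jurisdiction- and stop-level characteristics via propensity scores rather than exact cells.

```python
from collections import defaultdict

def matched_minority_gap(target, comparison, keys):
    """For each combination of observable stop characteristics (`keys`)
    in the target jurisdiction, compare the minority share of target
    stops with that of comparison stops sharing the same
    characteristics, weighting cells by their target-stop counts."""
    comp_cells = defaultdict(list)
    for stop in comparison:
        comp_cells[tuple(stop[k] for k in keys)].append(stop)
    targ_cells = defaultdict(list)
    for stop in target:
        targ_cells[tuple(stop[k] for k in keys)].append(stop)

    weighted_gap, matched = 0.0, 0
    for cell, t_stops in targ_cells.items():
        c_stops = comp_cells.get(cell)
        if not c_stops:
            continue  # no comparison stops share these characteristics
        t_share = sum(s["minority"] for s in t_stops) / len(t_stops)
        c_share = sum(s["minority"] for s in c_stops) / len(c_stops)
        weighted_gap += (t_share - c_share) * len(t_stops)
        matched += len(t_stops)
    return weighted_gap / matched if matched else None

# Hypothetical stops; a positive gap means the target jurisdiction
# stops relatively more minority drivers than its matched comparisons.
target = [{"reason": "equipment", "minority": 1},
          {"reason": "speeding", "minority": 0}]
comparison = [{"reason": "equipment", "minority": 0},
              {"reason": "equipment", "minority": 1},
              {"reason": "speeding", "minority": 0}]
gap = matched_minority_gap(target, comparison, ["reason"])  # 0.25
```

Note that everything in this construction is conditioned on observables; nothing in it can account for unobserved differences such as drivers' exposure to police.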
In a way, this is a refined version of the matching based on Mahalanobis distance that the authors conducted in the 2013-14 report. It shares the weakness that matching is carried out only on the basis of observable characteristics. Thus it cannot match on the basis of exposure to police, which is a key weakness of the "informal" tests. At a technical level, the standard errors, which measure the precision of the estimates, should be clustered by jurisdiction. Each test involves stops from different jurisdictions, which likely involve a jurisdiction-specific component of error due to differences in training and departmental culture. Failing to account for this via clustering can greatly overstate the precision of the tests.

There is a subtler problem with this matching exercise as well, which is that matching on stop-level characteristics could actually mask racial profiling. Suppose jurisdiction A racially profiles by means of equipment violations, pulling over more minority drivers than white drivers on the pretext of minor equipment violations. Matching on the reason for the stop will disproportionately select equipment violations from other jurisdictions into the comparison group. If equipment violations in other jurisdictions disproportionately involve minority drivers, then the test based on matching will be biased against finding racial profiling in jurisdiction A. Note that this does not require that other jurisdictions racially profile by means of stops for equipment violations; it merely requires a racial disproportion in equipment violations, which could arise, for example, if higher-income drivers maintain their cars better and the average white driver has higher income than the average minority driver. Of course, if other jurisdictions do racially profile drivers by means of pretextual stops, the same bias will arise.

b.
Multiple testing

To understand how many statistical tests are involved here, consider the following. There are 92 police departments plus 11 State Police troops; six "tools" or statistical tests, three "informal" and three "formal"; and four classifications of what it means to be minority (Black, Hispanic, Black or Hispanic, and Non-Caucasian). Add them up and that's 2,472 tests. This exaggerates the true number a bit, since many jurisdictions lack sufficient data to carry out the hit-rate tests, but even with only five statistical tests, the number of tests conducted comes to 2,060.

My first recommendation, consistent with the comments above and in my first report, would be to drop the "informal" tests and restrict attention to tests with high validity. This would reduce the number of tests, although enough remain that the probability of false positives would still be high without adjustments for multiple testing. To reduce this probability, the analysis should employ false discovery rate (fdr) control, which is designed precisely to control the probability of false positives while providing sufficient statistical power to detect true positives, that is, jurisdictions that may be racially profiling drivers.

2. "Enhanced analyses" from Part II

The enhanced analyses in Part II focus on the 11 jurisdictions that were highlighted as problematic based on the 2013-14 report. For the nine towns in the group, the analysis involves mapping the traffic stops, to the extent that the data permit, and comparing the race distribution of stops by census tract. This approach highlights the problems of explicit benchmarks. For example, in many of the towns, fewer than half of the stopped motorists are residents of the town. It stands to reason that many drivers at risk of being stopped likewise reside elsewhere, invalidating the use of population benchmarks.
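Returning briefly to the multiple-testing recommendation above: one standard way to implement fdr control is the Benjamini-Hochberg step-up procedure, sketched here with made-up p-values rather than any figures from the report.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: given p-values from many
    jurisdiction-level tests, return the indices of the 'discoveries'
    while controlling the expected false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q,
    # then reject the k hypotheses with the smallest p-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values from five jurisdiction-level tests.
flagged = benjamini_hochberg([0.001, 0.011, 0.02, 0.04, 0.6], q=0.05)
```

With 2,000-plus tests at an unadjusted 5 percent level, dozens of false positives are expected by chance alone; a step-up rule of this kind instead caps the expected share of flagged jurisdictions that are false discoveries.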
Many of the maps show traffic stops to be concentrated in areas where one might expect greater traffic enforcement, such as the vicinity of major intersections. Again, one could readily imagine that the geographical distribution of traffic enforcement could cause the race distribution of drivers at risk of being stopped to differ substantially from that of population-based benchmarks.

For the two State Police troops, the enhanced analysis amounts to adding additional controls to the regressions used to calculate the VOD tests, restricting the sample, or both. None of these changes affects the basic result that minority drivers are more likely to be stopped by these officers during the daytime than at night.

3. Officer-level analysis

The final section of the report employs a matching analysis aimed at identifying individual police officers who disproportionately stop minority drivers. For each police officer in the pool, a comparison set of officers is selected on the basis of propensity score matching. Comparison officers are matched to the target officer on the basis of characteristics of the stop and the officer. The idea is to match the target officer to a set of officers whose stops have similar characteristics. The analysis then asks whether the race distribution of the target officer's stops is out of line with that of the comparison group. This technique was applied to policing data by Ridgeway and MacDonald (2009), who used it to construct internal benchmarks for New York City police officers on the basis of pedestrian stops.

The pool for this exercise consisted of officers from the 11 jurisdictions identified as problematic on the basis of the 2013-14 study. Of the 935 officers in those jurisdictions, the analysis was limited to the 370 officers who made at least 50 stops, a threshold applied by Ridgeway and MacDonald to reduce statistical noise.
The variables included in the propensity scoring model included a "cubic spline of clock-time, reason for stop controls, state and town resident controls, day of the week controls, and season controls" (p. 193). It is not clear to me whether the comparison pool for this exercise consisted only of the officers in the target officer's jurisdiction, or whether it included all officers in the 11 jurisdictions together. If it is the latter, the model should include jurisdictional characteristics in addition to those listed above. My concerns above about controls for reason-for-stop apply here as well. If the target officer racially profiles drivers by making many stops that are largely discretionary, his comparison pool will tend to consist disproportionately of other officers who similarly make many discretionary stops. Such a pool could bias the analysis against a finding of racial disproportionality on the part of the target officer. My concerns about multiple hypothesis testing also apply to this analysis. Of the 370 officers involved in the analysis, 38 were flagged as stopping a disproportionate share of minority drivers. The authors do not state what significance level they use in flagging those officers. Furthermore, the authors seem not to have controlled for the multiple comparison problem that this analysis presents. Making 370 tests based on standard significance levels yields a very high probability of false positives. Ridgeway and MacDonald recognized this problem and employed false discovery rate control to deal with it. The authors should do the same.

The University of Texas at San Antonio
Department of Criminal Justice

To: Dr. Stephen M. Cox
Department of Criminology and Criminal Justice
Institute for the Study of Crime and Justice
Central Connecticut State University

From: Michael R.
Smith, J.D., Professor and Chair
Date: July 15, 2017
Re: Review of State of Connecticut Traffic Stop Data Analysis and Findings, 2013-14

In April 2017, I was asked by Dr. Stephen Cox at Central Connecticut State University to review the State of Connecticut's report on Traffic Stop Data Analysis and Findings, 2013-14 and to provide an independent opinion on the methodology employed by the research team from the Institute for Municipal and Regional Policy and the Connecticut Economic Resource Center, which are the entities that performed the analysis and wrote the report. This memorandum serves as my review.

Background

This ambitious report analyzes 620,000 traffic stops conducted by Connecticut's 92 municipal police departments, 13 specialized agencies with traffic stop authority, and the 13 troops of the Connecticut State Police during a 12-month period from October 1, 2013 through September 30, 2014. The authors employed a variety of research strategies and analytic techniques to examine racial and ethnic disparities in traffic stops and selected traffic stop outcomes. They conducted these analyses statewide and at the individual agency and/or State Police troop level. In keeping with most other reported studies of this type, they found disparities in some of the analyses they conducted. These disparities were observed at the statewide level and in some of the agencies and State Police troops that were examined.

Population-Based Benchmarks

In order to determine whether certain racial or ethnic groups are stopped more often than possibly warranted, researchers must compare the proportions of those groups stopped by the police to an appropriate benchmark or estimate of the groups' representation among those at risk for being stopped. A number of benchmarks have been proposed and used by researchers in the reported literature, but not all have proven to be accurate representations of the driving and/or traffic-violating populations on the roadways.
In this report, the research team chose to use several population-based benchmarks against which to compare the racial and ethnic composition of drivers stopped by the police. They provided caveats to these analyses by acknowledging that they are less stringent than other types of analyses, but they ultimately chose to include them because "in the absence of better alternatives, it inevitably becomes the default method for making comparisons" (p. 16). There are indeed better alternatives to population-based benchmarks, and the research team made use of some of them in the report. Their decision to include scientifically questionable population-based benchmarks was a judgment call seemingly made because "others" (perhaps the media or members of the general public) will inevitably make such comparisons, and thus it was better to be over-inclusive than to leave such comparisons out of the report.

501 W. César E. Chavez Blvd. · San Antonio, Texas 78207 · (210) 458-2535 · (210) 458-2680 fax

While this decision was not unreasonable, the benchmark analyses that compared individual towns (or troops) to (1) statewide averages, (2) estimates of the residential or driving populations of towns or regions based on population (or driving-age) data, (3) adjusted (EDP) driving population estimates, and (4) peer-group estimates are not well supported in the scientific literature on racial profiling and police stops. Among scholars contributing to the contemporary, peer-reviewed literature on racial profiling, these methods have been discarded as unreliable and are no longer widely used. The authors of the report are correct to point out their weaknesses and to urge caution in drawing conclusions based on their results, even if an agency appears to be an outlier across multiple population-based benchmark indicators.
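For context, the population-based comparisons criticized above typically reduce to a simple disparity ratio: a group's share of stops divided by its share of some benchmark population. The sketch below (my own, with invented figures) shows why the calculation is only as good as its benchmark, since the ratio shifts entirely with the choice of denominator:

```python
def disparity_index(stop_share, benchmark_share):
    """Ratio of a group's share of stops to its share of a benchmark
    population; values above 1.0 are read as over-representation."""
    return stop_share / benchmark_share

# Invented figures: a group makes up 18% of stops, 12% of the town's
# residential driving-age population, but 17% of a (hypothetical)
# estimated driving population that includes non-resident commuters.
print(round(disparity_index(0.18, 0.12), 2))  # residential benchmark: 1.5
print(round(disparity_index(0.18, 0.17), 2))  # driving-population benchmark: 1.06
```

The same stop data yield a large apparent disparity against the residential benchmark and nearly none against the driving-population benchmark, which is the core of the reviewers' objection to residential population comparisons.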
Veil of Darkness Analyses

In addition to the descriptive analyses discussed above, the authors of the report also used the "Veil of Darkness" method developed by Grogger and Ridgeway to compare the percentage of minority drivers stopped during the daytime to the percentage of those drivers stopped at night. This method is theoretically sound and has been used in a number of reported studies on racial and/or ethnic disparities in police stops. Conceptually, the idea behind this method is that if police stop a higher percentage of minority drivers during daylight hours, when they are (theoretically) more likely than at night to be capable of observing the race of the driver before initiating a traffic stop, then this provides evidence that police are using race as a factor in making a stop. Grogger and Ridgeway cleverly used the natural variation in daytime/nighttime hours that occurs across the calendar year during the intertwilight period to control for the possibility that the minority driving population may vary markedly across the 24-hour clock. The authors of the report followed the Grogger and Ridgeway model for their analyses and arguably improved upon it by including some additional controls in their estimation equations. In addition, they controlled for the possibility of differential rates of equipment violations among minority groups (often associated with lower socioeconomic status and poorly maintained vehicles) by limiting their analyses to moving violations and eliminating investigative stops and equipment violations. The results of the VOD analysis revealed racial and ethnic disparities statewide and among certain towns and State Police troops consistent with the theoretical constructs of the analytic method. That is, non-Caucasians were more likely to be stopped during the daytime than at night, suggesting that they were being targeted for stops based on their race and/or ethnicity.
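The core VOD comparison can be reduced to a two-by-two illustration: among intertwilight stops, compare the odds that a stopped driver is a minority in daylight versus in darkness. The report's actual implementation uses regression models with additional controls; the sketch below is a deliberately stripped-down version of the underlying idea, with invented counts:

```python
def vod_odds_ratio(day_min, day_total, night_min, night_total):
    """Odds ratio comparing the odds that a stopped driver is a minority
    in daylight vs. darkness (intertwilight stops only). Values above 1
    are read, under the VOD logic, as consistent with profiling."""
    day_odds = day_min / (day_total - day_min)
    night_odds = night_min / (night_total - night_min)
    return day_odds / night_odds

# Invented counts: 300 of 1,000 daylight stops involve minority drivers,
# versus 240 of 1,000 stops made in darkness.
print(round(vod_odds_ratio(300, 1000, 240, 1000), 3))
```

A ratio above 1 here means minority drivers are over-represented in daylight stops relative to darkness stops, which is the disparity pattern the method is designed to detect.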
It is important to note some key limitations to the VOD method that are not discussed in the report. This method assumes that officers can more readily and accurately identify the race and/or ethnicity of a driver during the day as compared to at night. To my knowledge, this assumption has never been empirically tested or evaluated, and in urban areas with good ambient lighting from street lights, signs, or other sources, there are sound reasons to question this assumption. If the ability to observe the race and/or ethnicity of a driver is not significantly diminished at night, then variations in the percentage of minority drivers stopped during the daytime as compared to at night may reflect legitimate differences in the daytime/nighttime driving populations of minority drivers. In other words, there may, in fact, be more minority drivers on the roadways in some towns or cities during the day than at night. This possibility cannot be ruled out and should be acknowledged in the report as a limitation of the VOD method.

Post-Stop Disparities

Hit rate analysis

The authors followed Knowles, Persico, and Todd (2001) and performed a hit rate analysis of searches conducted by Connecticut police officers. According to the KPT theoretical construct, hit rates (the rate at which contraband is found following a search) should be approximately equal across racial groups in the absence of racial bias. If hit rates for minorities are lower than for Whites, this suggests that officers are applying a lower threshold of evidence for searching minorities as compared to Whites and therefore are mistaken more often, thus resulting in lower hit rates.
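The KPT comparison described above amounts to testing whether hit rates differ across groups. A minimal version is a pooled two-proportion z-test on invented search counts; the report's implementation may differ, and a sound analysis would first restrict the data to high discretion searches:

```python
import math

def hit_rate_z(hits_a, searches_a, hits_b, searches_b):
    """Two-proportion z-statistic comparing contraband hit rates
    between two groups of searches (KPT-style outcome test)."""
    p_a = hits_a / searches_a
    p_b = hits_b / searches_b
    # Pooled proportion under the null of equal hit rates.
    pooled = (hits_a + hits_b) / (searches_a + searches_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / searches_a + 1 / searches_b))
    return (p_a - p_b) / se

# Invented counts: 120 hits in 400 searches of White drivers (30%)
# versus 90 hits in 400 searches of minority drivers (22.5%).
print(round(hit_rate_z(120, 400, 90, 400), 2))
```

Under the KPT logic, a significantly lower hit rate for minority drivers (a large positive z here) would suggest officers apply a lower evidentiary threshold when searching them.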
As the authors point out, the KPT hit rate analysis has been criticized in the literature, but it has been widely used as well and is considered to be one of the better post-stop disparity analyses available to researchers. Thus, while the authors' use of the KPT hit rate approach is appropriate, their application of it may be flawed. A hit rate analysis is only suggestive of possible racial/ethnic bias when limited to a subset of high discretion searches. Some searches, like some arrests, are relatively low discretion events. For example, searches conducted incident to arrest or inventory searches conducted following the seizure of a motor vehicle are low discretion searches. Often departmental policies, and certainly officer safety concerns and police training, will mandate that officers conduct a search of all persons who are custodially arrested. Differences in hit rates across racial groups for these types of low discretion searches are unlikely to reveal racial/ethnic biases on the part of the arresting officers; rather, such differences would likely be indicative of differential contraband "carry" rates among racial groups. Thus, subjecting low discretion searches to a hit rate analysis is methodologically unsound. Low discretion searches should be identified and removed from the search dataset before a hit rate analysis is conducted. There is no indication in the text of the report that the authors parsed out low and high discretion searches and subjected only the high discretion searches to a hit rate analysis. On p. 45, the authors state that they aggregated all search data for Connecticut and performed a KPT hit rate analysis on these data. If the search data included low discretion searches such as searches incident to arrest or inventory searches, then the hit rate analysis is fatally flawed. Moreover, the presentation of the results is obtuse and difficult for the lay reader to understand.
Even if the hit rate analysis was conducted properly with only high discretion searches, I would recommend that the authors show the actual hit rate percentages by racial/ethnic category (including for Whites) and indicate which of the differences were statistically significant.

Solar-powered stops and searches analysis

In addition to the KPT hit rate analysis, the authors also performed a "solar-powered" stops and searches analysis following Ritter's (2013) approach in an apparently unpublished working paper. To my knowledge, this approach does not appear anywhere in the peer-reviewed literature on racial disparities in police stops. The theoretical explanation for this approach as outlined by the authors is difficult to follow. They state that if "one observes an increased rate of searches during darkness hours, a possible conclusion would be to assert that officers are pulling over less minority drivers because they cannot discern their demographics prior to the making a stop decision" (p. 47). This statement, on its face, is illogical. Even if it is true that officers cannot ascertain the race or ethnicity of a driver as readily at night as compared to in the daytime, it does not logically follow that one would therefore expect a higher rate of searches at night. The decision to stop and the decision to search are largely independent of one another. Once a driver is stopped and a search takes place, the officer can certainly, at that point, ascertain the race or ethnicity of the driver. One might expect higher search rates at night because there is less public visibility for the search, but not because officers are less able to ascertain race or ethnicity at night prior to initiating a stop. Likewise, the statement that "one would expect to observe a statistically significant and positive log odds ratio on the darkness indicator variable if officers have a lower threshold for stopping and searching minorities" is difficult to follow for the same reason.
Again, the decision to stop and the decision to subsequently search are often independent of one another. While it may be theoretically sound to equate higher search rates for minorities at night with possible racial/ethnic bias (again because of the lower level of public visibility and scrutiny), this conclusion does not logically follow from the statement above regarding the stopping of minorities. Given that the "solar-powered" analytic approach is not widely reported in the peer-reviewed literature and conflates the independent decisions to stop and search (at least as presented by the authors in the report), I would recommend dropping this type of analysis from future reports until this method gains broader acceptance in the scientific community.

Missing Analyses

In addition to the problems outlined above with the post-stop disparities analysis, there are several commonly reported post-stop analyses that are missing from the report. The stop data available to the research team seem to include indicators for whether a citation was issued or an arrest was made, yet these outcomes were not subjected to a disparity analysis as they commonly are in other reported studies of racial profiling. Examining whether minorities are issued a citation more or less frequently than Whites while controlling for the reason for the stop is a useful analysis as part of a broader look at post-stop outcome disparities. In addition, examining the influence of race/ethnicity on the decision to make an arrest is also useful and commonly reported in other studies. Like a hit rate analysis, though, an arrest disparity analysis should remove (or control for) low discretion arrests. Warrant-based arrests, DUI arrests, and on-view felony arrests are typically low discretion arrests that may vary by race.
These types of arrests should not be subjected to an arrest disparity analysis. Instead, such an analysis should focus on high discretion arrests, primarily misdemeanor and/or traffic-related arrests. Finally, a growing body of literature has used the racial/ethnic composition of drivers involved in traffic crashes as a benchmark for traffic stops. The idea is that not-at-fault drivers in two-vehicle collisions represent an unbiased and relatively random sample of persons driving on the highways. Likewise, at-fault drivers represent a proxy for risky drivers: those who may be violating the traffic laws and therefore at increased risk for being stopped by the police. If Connecticut's uniform traffic crash report captures the race and ethnicity of the drivers involved in collisions investigated by the police, then future analyses could utilize those data as a scientifically sound benchmarking technique for stops (see Alpert, Smith, & Dunham, 2004 for a review of the literature and application of this technique; see also Smith, Rojek, Tillyer, & Lloyd, 2017 and Withrow & Williams, 2015 for recent applications).

Summary

This report is well-written and appropriately organized. From a stylistic standpoint, it sometimes reads like an econometric journal article and may be difficult to follow and understand from a layperson's perspective. I would recommend placing all equations and associated discussions in an appendix and reporting and discussing odds ratios in the VOD tables rather than simply coefficients. From a methodological perspective, the report is uneven. Some of the analyses and techniques employed are respected and well-represented in the peer-reviewed literature on disparities in police stops (e.g., the Veil of Darkness method). However, the population-based benchmarking analyses are methodologically unsound, and I recommend removing them from future reports. The post-stop analyses are flawed.
The KPT hit rate analysis is only appropriate for high discretion searches, and there is no indication that the pool of searches was properly limited to these types of searches. The "solar-powered" analysis is confusing as presented and not widely reported in the literature. Again, I would recommend eliminating it from future reports. Finally, one of the best benchmarking techniques available (using traffic crash data) is not reported, which may be due to the unavailability of statewide traffic crash data on the race and ethnicity of drivers involved in collisions. If the State of Connecticut intends to examine its stop data on an annual basis, it should consider collecting these data (if it does not already) and making them available to the research team charged with conducting the statewide analysis.

References

Alpert, G.P., Smith, M.R., & Dunham, R. (2004). Toward a better benchmark: Assessing the utility of not-at-fault traffic crash data in racial profiling research. Justice Research and Policy, 6, 44-69.

Smith, M.R., Rojek, J., Tillyer, R., & Lloyd, C. (2017). San Jose Police Department Traffic and Pedestrian Stop Study. Available at: Traffic-Pedestrian Stop Study 2017.pdf.

Withrow, B.L., & Williams, H. (2015). Proposing a benchmark based on vehicle collision data in racial profiling research. Criminal Justice Review, 40(1), 449-469.

The University of Texas at San Antonio
Department of Criminal Justice

To: Dr. Stephen M. Cox
Department of Criminology and Criminal Justice
Institute for the Study of Crime and Justice
Central Connecticut State University

From: Michael R.
Smith, J.D., Professor and Chair
Date: July 19, 2017
Re: Review of State of Connecticut Traffic Stop Data Analysis and Findings, 2014-15

In April 2017, I was asked by Dr. Stephen Cox at Central Connecticut State University to review the State of Connecticut's report on Traffic Stop Data Analysis and Findings, 2014-15 and to provide an independent opinion on the methodology employed by the research team from the Institute for Municipal and Regional Policy and the Connecticut Economic Resource Center, which are the entities that performed the analysis and wrote the report. This memorandum serves as my review.

Background

This year's report analyzes 586,000 traffic stops conducted by Connecticut's 92 municipal police departments, 13 specialized agencies with traffic stop authority, and the 13 troops of the Connecticut State Police during a 12-month period from October 1, 2014 through September 30, 2015. The authors employed a variety of research strategies and analytic techniques to examine racial and ethnic disparities in traffic stops and selected traffic stop outcomes. They conducted these analyses statewide and at the individual agency and/or State Police troop level. In keeping with most other reported studies of this type, they found disparities in some of the analyses they conducted. These disparities were observed at the statewide level and in some of the agencies and State Police troops that were examined. Unlike the previous year's report, the 2014-15 report also details the results of in-depth analyses conducted in selected towns/Troops that showed disparities in the overall analysis. My review focuses on the primary analytic methodologies used by the research team in the overall analysis. To the extent that new or different methodologies are used in the in-depth analyses, I will comment on those as well.
Population-Based Benchmarks

In order to determine whether certain racial or ethnic groups are stopped more often than possibly warranted, researchers must compare the proportions of those groups stopped by the police to an appropriate benchmark or estimate of the groups' representation among those at risk for being stopped. A number of benchmarks have been proposed and used by researchers in the reported literature, but not all have proven to be accurate representations of the driving and/or traffic-violating populations on the roadways. In this report, the research team chose to use several population-based benchmarks against which to compare the racial and ethnic composition of drivers stopped by the police. They provided caveats to these analyses by acknowledging that they are less stringent than other types of analyses. Although they chose to include them in the report, they further limited this year's direct Census comparison to stops of town residents compared to the residential population age 16 and over. The decision to include scientifically questionable population-based benchmarks was a judgment call and perhaps not unreasonable given the acknowledged limitations of these approaches in the report. However, among scholars contributing to the contemporary, peer-reviewed literature on racial profiling, population-based comparisons, even those adjusted using commuting data and other "pushes" and "pulls," have been discarded as unreliable and are no longer widely used. The authors of the report are correct to point out their weaknesses and to urge caution in drawing conclusions based on their results, even if an agency appears to be an outlier across multiple population-based benchmark indicators.
Veil of Darkness Analyses

In addition to the descriptive analyses discussed above, the authors of the report also used the "Veil of Darkness" method developed by Grogger and Ridgeway to compare the percentage of minority drivers stopped during the daytime to the percentage of those drivers stopped at night. This method is theoretically sound and has been used in a number of reported studies on racial and/or ethnic disparities in police stops. Conceptually, the idea behind this method is that if police stop a higher percentage of minority drivers during daylight hours, when they are (theoretically) more likely than at night to be capable of observing the race of the driver before initiating a traffic stop, then this provides evidence that police are using race as a factor in making a stop. Grogger and Ridgeway cleverly used the natural variation in daytime/nighttime hours that occurs across the calendar year during the intertwilight period to control for the possibility that the minority driving population may vary markedly across the 24-hour clock. The authors of the report followed the Grogger and Ridgeway model for their analyses and arguably improved upon it by including some additional controls in their estimation equations. In addition, they controlled for the possibility of differential rates of equipment violations among minority groups (often associated with lower socioeconomic status and poorly maintained vehicles) by limiting their analyses to moving violations and eliminating investigative stops and equipment violations. The results of the VOD analysis revealed racial and ethnic disparities statewide and among certain towns and State Police troops consistent with the theoretical constructs of the analytic method. That is, non-Caucasians were more likely to be stopped during the daytime than at night, suggesting that they were being targeted for stops based on their race and/or ethnicity.
It is important to note some key limitations to the VOD method that are not discussed in the report. This method assumes that officers can more readily and accurately identify the race and/or ethnicity of a driver during the day as compared to at night. To my knowledge, this assumption has never been empirically tested or evaluated, and in urban areas with good ambient lighting from street lights, signs, or other sources, there are sound reasons to question this assumption. If the ability to observe the race and/or ethnicity of a driver is not significantly diminished at night, then variations in the percentage of minority drivers stopped during the daytime as compared to at night may reflect legitimate differences in the daytime/nighttime driving populations of minority drivers. In other words, there may, in fact, be more minority drivers on the roadways in some towns or cities during the day than at night. This possibility cannot be ruled out and should be acknowledged in the report as a limitation of the VOD method.

Control Analysis

As a final stops-based disparity analysis, the authors used propensity score weighting and boosted logistic regression to compare stops made by officers in the various towns in Connecticut to a propensity score weighted sample of stops made outside of their towns. The idea behind this approach is to use statistical weighting to identify a sample of stops from contiguous towns that "match" those made by officers in the town in question to determine whether there are statistically significant differences in the proportion of minority stops made by officers in the two groups. Like any regression modeling technique, this approach is susceptible to misspecification and/or omitted variable bias. In this case,
the variables used in constructing the models appear in Table 28a and include a variety of population-based indicators drawn from the U.S. Census. For the same reasons that the descriptive, population-based benchmarks are flawed (see discussion above), the results from the control analysis are also questionable. In their 2009 work cited by the authors of the Connecticut report, Ridgeway and MacDonald apply propensity score weighting to identify outlier NYPD officers who conducted a significantly higher percentage of minority pedestrian stops than the weighted comparison sample. The samples were matched along a variety of indicators for time, place, and nature of the stop. Crucially, Ridgeway and MacDonald compared stops made by officers working in the same patrol borough and precincts to one another rather than using population-based estimates to match stops made by officers in one borough to stops made by officers in another borough. In contrast, the control analysis in the Connecticut report relies on population-based variables to weight traffic stops made by officers outside of the town of interest (the control group) and compare them to stops made by officers in the town of interest (the experimental group). As discussed above, population-based benchmarks (or weighting variables) are inherently flawed as a traffic stop benchmark because they do not provide a sufficiently reliable estimate of the driving population of a particular town. The use of a mathematically sophisticated modeling technique does not change the underlying construct of the model, which is based on a comparison of stops made by officers exposed to potentially different driving populations. While the authors of the report did an admirable job of trying to make an analytic approach "fit" the data they had, in my opinion, these models do not adequately account for the differences in the driving populations of the towns being compared,
which can vary considerably from their residential populations. Adjusting (or weighting) traffic stops made by comparison town officers by using residential population parameters is methodologically questionable when the underlying analysis involves traffic stops made by Connecticut police officers. There are better techniques available (see VOD above and the discussion of traffic crash benchmarking below) without relying on this type of analysis.

Post-Stop Disparities

Hit rate analysis

The authors followed Knowles, Persico, and Todd (2001) and performed a hit rate analysis of searches conducted by Connecticut police officers. According to the KPT theoretical construct, hit rates (the rate at which contraband is found following a search) should be approximately equal across racial groups in the absence of racial bias. If hit rates for minorities are lower than for Whites, this suggests that officers are applying a lower threshold of evidence for searching minorities as compared to Whites and therefore are mistaken more often, thus resulting in lower hit rates. As the authors point out, the KPT hit rate analysis has been criticized in the literature, but it has been widely used as well and is considered to be one of the better post-stop disparity analyses available to researchers. Thus, while the authors' use of the KPT hit rate approach is appropriate, their application of it may be flawed. A hit rate analysis is only suggestive of possible racial/ethnic bias when limited to a subset of high discretion searches. Some searches, like some arrests, are relatively low discretion events. For example, searches conducted incident to arrest or inventory searches conducted following the seizure of a motor vehicle are low discretion searches. Often departmental policies, and certainly officer safety concerns and police training, will mandate that officers conduct a search of all persons who are custodially arrested.
Differences in hit rates across racial groups for these types of low discretion searches are unlikely to reveal racial/ethnic biases on the part of the arresting officers; rather, such differences would likely be indicative of differential contraband "carry" rates among racial groups. Thus, subjecting low discretion searches to a hit rate analysis is methodologically unsound. Low discretion searches should be identified and removed from the search dataset before a hit rate analysis is conducted. There is no indication in the text of the report that the authors parsed out low and high discretion searches and subjected only the high discretion searches to a hit rate analysis. On p. 52, the authors state that they aggregated all search data for Connecticut and performed a KPT hit rate analysis on these data. If the search data included low discretion searches such as searches incident to arrest or inventory searches, then the hit rate analysis is fatally flawed. Moreover, the presentation of the results is obtuse and difficult for the lay reader to understand. Even if the hit rate analysis was conducted properly with only high discretion searches, I would recommend that the authors show the actual hit rate percentages by racial/ethnic category (including for Whites) and indicate which of the differences were statistically significant.

Town-Specific Analyses

In this year's report, the authors drilled down and conducted further analyses in nine towns and two State Police Troops that showed significant disparities in the overall analysis. These town and/or troop-level analyses again compared stops to the residential populations of the towns and their contiguous areas.
They also mapped the stops and used the geographic distribution of the stops to help understand where most stops occurred and what the residential populations of those areas were. These additional analyses also examined racial/ethnic differences in stop outcomes (e.g., infractions, warnings) and in the reasons for the stops (e.g., registration violations, speed-related violations). These descriptive findings were helpful but stopped short of using multivariate modeling to control for stop, driver, officer, and location variables to determine whether observed racial or ethnic differences persisted after controlling for relevant covariates known to possibly affect stops and stop outcomes. Multivariate modeling at the town and/or troop levels would be the next logical improvement to the analytic strategy.

Officer-to-Officer Comparisons

The 2014-15 report added an important internal benchmarking component that was lacking in the 2013-14 report. The authors used propensity score weighting and doubly robust regression estimation to identify officers who stopped more minorities than a statistically matched sample of "peer" officers. This approach has been used in a number of peer-reviewed publications cited by the authors of the report and has generally been well received by the scholarly community. Once the models identified outlier officers, the authors further examined them "using a balancing test that directly compared the distribution of observable traffic stop characteristics with those of each officer's benchmark." This approach appeared to further reduce the number of outlier officers identified, but it was not fully explained. Thus, it is difficult for me to assess its validity or even to understand why the authors felt it necessary to add a step beyond the propensity score weighting/logistic regression analyses already performed.
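The internal benchmarking idea can be illustrated with a minimal sketch. This is not the authors' actual doubly robust procedure; it reweights hypothetical "peer" stops so that one observable stop characteristic (time of stop) matches a target officer's distribution, then compares minority stop shares. All officers, covariates, and counts below are invented:

```python
from collections import Counter

# Hypothetical stops: (officer, hour_band, minority_flag). A real internal
# benchmark would weight on many covariates via a fitted propensity model.
stops = (
    [("A", "day", 1)] * 20 + [("A", "day", 0)] * 30 +
    [("A", "night", 1)] * 10 + [("A", "night", 0)] * 40 +
    [("peer", "day", 1)] * 10 + [("peer", "day", 0)] * 40 +
    [("peer", "night", 1)] * 30 + [("peer", "night", 0)] * 120
)

target = [s for s in stops if s[0] == "A"]
peers = [s for s in stops if s[0] != "A"]

# Weight each peer stop so the peers' hour-band mix matches officer A's.
t_mix = Counter(h for _, h, _ in target)
p_mix = Counter(h for _, h, _ in peers)
n_t, n_p = len(target), len(peers)

def weight(hour):
    return (t_mix[hour] / n_t) / (p_mix[hour] / n_p)

officer_share = sum(m for _, _, m in target) / n_t
bench_share = (sum(weight(h) * m for _, h, m in peers) /
               sum(weight(h) for _, h, _ in peers))
print(f"officer A: {officer_share:.2f}, weighted peer benchmark: {bench_share:.2f}")
```

An officer whose share substantially exceeds the weighted peer benchmark would be flagged for the kind of follow-up balancing test the report describes.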
Missing Analyses

As with the 2013-14 report, several commonly reported post-stop analyses are missing for most of the agencies in the state, except for those subjected to the more in-depth agency-level analysis. Examining whether minorities are issued a citation more or less frequently than Whites, while controlling for the reason for the stop, is a useful analysis as part of a broader look at post-stop outcome disparities. In addition, examining the influence of race/ethnicity on the decision to make an arrest is also useful and commonly reported in other studies. Like a hit rate analysis, though, an arrest disparity analysis should remove (or control for) low-discretion arrests. Warrant-based arrests, DUI arrests, and on-view felony arrests are typically low-discretion arrests that may vary by race. These types of arrests should not be subjected to an arrest disparity analysis. Instead, such an analysis should focus on high-discretion arrests, primarily misdemeanor and/or traffic-related arrests. Finally, a growing body of literature has used the racial/ethnic composition of drivers involved in traffic crashes as a benchmark for traffic stops. The idea is that not-at-fault drivers in two-vehicle collisions represent an unbiased and relatively random sample of persons driving on the highways. Likewise, at-fault drivers serve as a proxy for risky drivers: those who may be violating the traffic laws and are therefore at increased risk of being stopped by the police.
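In its simplest form, the crash-based benchmark compares the minority share of an agency's stops with the minority share among not-at-fault drivers in police-investigated two-vehicle crashes. The counts below are hypothetical, purely to show the comparison:

```python
# Hypothetical counts; a real analysis would draw these from the statewide
# stop file and from uniform traffic crash reports that record driver race.
stops = {"white": 7000, "minority": 3000}
not_at_fault_drivers = {"white": 760, "minority": 240}  # two-vehicle crashes

def minority_share(counts):
    return counts["minority"] / sum(counts.values())

stop_share = minority_share(stops)
bench_share = minority_share(not_at_fault_drivers)
ratio = stop_share / bench_share  # > 1 means minorities are stopped
                                  # more than the benchmark predicts
print(f"stops {stop_share:.2f} vs crash benchmark {bench_share:.2f} "
      f"(ratio {ratio:.2f})")
```

The appeal of the method is that the benchmark is drawn from the actual driving population on the roads being policed, rather than from residential census figures.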
If Connecticut's uniform traffic crash report captures the race and ethnicity of the drivers involved in collisions investigated by the police, then future analyses could utilize those data as a scientifically sound benchmarking technique for stops (see Alpert, Smith, and Dunham, 2004 for a review of the literature and an application of this technique; see also Smith, Rojek, Tillyer, and Lloyd, 2017 and Withrow and Williams, 2015 for recent applications).

Summary

This report is well written and appropriately organized. From a stylistic standpoint, it sometimes reads like an econometric journal article and may be difficult to follow and understand from a layperson's perspective. I would recommend placing all equations and associated discussions in an appendix and reporting and discussing odds ratios in the VOD tables rather than simply coefficients. From a methodological perspective, the report is an improvement over the 2013-14 report. Some of the analyses and techniques employed are respected and well represented in the peer-reviewed literature on disparities in police stops (e.g., the Veil of Darkness method). However, the population-based benchmarking analyses are methodologically unsound, and I recommend removing them from future reports. The statewide post-stop analyses are now limited to a hit rate analysis only. The authors wisely eliminated the "solar-powered" search analysis from the 2013-14 report. However, the KPT hit rate analysis is only appropriate for high-discretion searches, and there is no indication that the pool of searches was properly limited to these types of searches. Future reports should appropriately limit the sample of searches in the hit rate analysis. Finally, one of the best benchmarking techniques available (using traffic crash data) is not reported, which may be due to the unavailability of statewide traffic crash data on the race and ethnicity of drivers involved in collisions.
If the State of Connecticut intends to examine its stop data on an annual basis, it should consider collecting these data (if it does not already) and making them available to the research team charged with conducting the statewide analysis.

References

Alpert, G.P., Smith, M.R., & Dunham, R. (2004). Toward a better benchmark: Assessing the utility of not-at-fault traffic crash data in racial profiling research. Justice Research and Policy, 6, 44-69.

Smith, M.R., Rojek, J., Tillyer, R., & Lloyd, C. (2017). San Jose Police Department Traffic and Pedestrian Stop Study. Available at: Traffic-Pedestrian Stop Study 2017.pdf.

Withrow, B.L., & Williams, H. (2015). Proposing a benchmark based on vehicle collision data in racial profiling research. Criminal Justice Review, 40(1), 449-469.

SCHOOL OF CRIMINOLOGY & CRIMINAL JUSTICE, ARIZONA STATE UNIVERSITY
Edward Maguire, PhD
Professor, School of Criminology & Criminal Justice
Associate Director, Center for Violence Prevention & Community Safety

Review of State of Connecticut Traffic Stop Data Analysis and Findings, 2013-14

The major findings from this report are presented in Parts 4, 5, and 6, and those are the sections on which I focus my review. I did not understand the logic for including the methodology and findings presented in Part 4 ("Descriptive Statistics and Intuitive Measures"). The methods used in this section are widely considered less defensible than the methods used in Parts 5 and 6. Moreover, the authors appear to agree with this assessment. In the introduction to Part 4 (on page 30), the authors write: "although these simple statistics present an intriguing story, conclusions should not be drawn from these measures."
If conclusions cannot be drawn from these measures, then the measures should not be included here. The methodology used in Part 5 ("Analysis of Traffic Stop Disparities") and Part 6 ("Analysis of Post-Stop Disparities") appears to be sound. These sections draw on the most appropriate methods used in the scientific literature on racial disparities in policing. I noted some potential errors in the discussion of the findings. For instance, in Table 19, the authors note that all but one of the coefficients is significant. My reading of the table suggests that all of the coefficients are significant. Similarly, in Table 25 the authors note that "only four of the five specifications find a disparity that indicates a bias towards searching minority groups." I assume they are referring to positive differentials in drawing this inference. However, my reading of the table shows that only three of the differentials are positive, not four. In the text following Tables 19, 20, and 21, the authors note in each instance that the coefficient for black drivers "regains" statistical significance. I did not understand what they meant by this terminology. Was the coefficient non-significant and then became significant after some modification to the models? On page 43, the authors conclude (based on the results from Tables 23 and 24) that "a large share of the disparity at the state level is being driven by these five departments." I did not see any basis in the findings presented by the authors for this inference. Moreover, it is unclear why five departments were chosen and what threshold was used to select them. The intended audience of the report is also somewhat unclear. It is written very densely, in a manner that readers without statistical training are unlikely to be able to wade through. For that reason, the intended audience appears to be scholars with advanced training in statistics. Yet the inclusion of Part 4, which contains descriptive and "intuitive"
measures, appears to be meant for readers who are untrained in social science methodology, even though readers are cautioned not to draw conclusions from these results. Reports like this should identify the intended audience clearly and be written consistently at a level that the proposed audience can understand. The report uses unnecessarily confusing and even contradictory language in drawing inferences about racial profiling. For instance, on p. 54, the authors note that "racial and ethnic disparities do not, by themselves, provide conclusive evidence of racial profiling." Yet, on page 51, they conclude that "the statistical disparity provides evidence in support of the claim that certain officers in the state are engaged in racial profiling during daylight hours when motorist race and ethnicity is visible." Given that the main purpose of this report is to draw inferences about racial profiling, it is imperative for the authors to use much clearer language around this vital issue. On page 51, the authors emphasize that "it is specific officers and departments that are driving these statewide trends." Arguably, the department-level analyses they present may be sufficient to warrant this conclusion with regard to departments. However, to my knowledge no officer-level analyses were carried out, and therefore the conclusion that "specific officers" are driving statewide trends is merely speculative at this point. In the summary of findings on page 53, the authors identify 5 agencies with disparities that were identified from the Veil of Darkness analyses and 12 agencies with disparities that were identified from the descriptive analyses. The decision to name specific agencies that emerged from the descriptive analyses seems unwarranted. Earlier, the authors cautioned readers not to draw conclusions from these analyses. Now they are using these same analyses to name agencies with evidence of racial or ethnic disparities in traffic stops.
This decision raises a number of important questions. If the descriptive analyses are appropriate, why carry out the Veil of Darkness analyses? If they are not appropriate, which would appear to be the case, then why carry them out at all? If the Veil of Darkness analysis is the most appropriate methodology, it seems inappropriate to list the twelve agencies emerging from the descriptive analyses as having the greatest disparities. If the authors are truly intent on using the results from the descriptive analyses to name agencies with disparities, it would be appropriate for them to provide some indication of the degree of overlap among the departments with the strongest evidence of disparities across the different methodologies. Do the methodologies produce the same findings? If not, it is essential for the authors to explain why. Absent these additional steps, the descriptive analyses would appear to have no valid place here.

Review of State of Connecticut Traffic Stop Data Analysis and Findings, 2014-15

This report is similar to the previous report in its basic structure. The authors once again include a section that presents "descriptive statistics and intuitive measures." The authors acknowledge that these methods have been heavily criticized but choose to use them anyway. On page 16, the authors note that "although any one of these benchmarks cannot provide by itself a rigorous enough analysis to draw conclusions regarding racial profiling, if taken together they highlight those jurisdictions where disparities are significant enough to justify further analysis." Later, on page 33, they provide a strong warning to readers about these same analyses: "although these simple statistics present an intriguing story, conclusions should not be drawn from these measures." As with the previous report, I find the inclusion of this section not to be well justified by the authors. The previous report used five minority classifications; this study uses only four. The "non-Caucasian or Hispanic"
category was dropped. Why was it dropped? Ordinarily, a major change in one of the key outcome variables would merit an explanation. The methodology used in the Veil of Darkness tests, both statewide and at the department level, appears to be sound. These analyses draw on the most appropriate methods used in the scientific literature on racial disparities in policing. The conclusions from these analyses appear to be appropriate. In the section that begins on page 46, the authors introduce a controls design. I am familiar with controls and propensity score matching, but the description of the study procedures is unclear. These methods are typically used to test the effects of a treatment on an outcome. Here, the authors' description of what constitutes the treatment is not written clearly, and as a result I do not really understand what they did. For that reason, I cannot comment on the quality or scientific merit of these analyses. In the section that begins on page 51, the authors present findings from their KPT hit rate analyses of vehicle searches. In the previous report, the authors acknowledged that this method has been criticized in the literature. As a result, the previous report included supplementary analyses using a "solar-powered" analysis recommended by Ritter. That additional method is not included here, and it is not clear why. Again, a major change like this would ordinarily merit an explanation from the authors. As with the previous report, my biggest concern is the decision by the authors to use the descriptive analysis as a basis for identifying agencies with a pattern of racial or ethnic disparities. Earlier, the authors cautioned readers not to draw conclusions from these measures, yet now these same measures are being used as a basis for identifying individual agencies. That seems problematic. Why did the agencies that emerged from the descriptive analyses not emerge from the more sophisticated econometric analyses?
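The core contrast behind the Veil of Darkness tests endorsed in both reviews can be sketched in a few lines. In this minimal form (the published method uses a logistic regression with clock-time and seasonal controls), stops are restricted to the inter-twilight window and the odds of a stopped driver being a minority are compared between daylight and darkness; all counts below are invented for illustration:

```python
import math

# Hypothetical stops restricted to the inter-twilight window, when the
# same clock times are sometimes light and sometimes dark across the year.
# Each pair is (minority_stops, total_stops).
daylight = (300, 1000)   # driver race visible before the stop
darkness = (240, 1000)   # driver race largely not visible

def odds(minority, total):
    p = minority / total
    return p / (1 - p)

# Simplest VOD statistic: the daylight/darkness odds ratio for stopping a
# minority driver. A ratio above 1 is consistent with race-based selection
# when race is visible.
or_vod = odds(*daylight) / odds(*darkness)
log_or = math.log(or_vod)  # what a logistic regression coefficient reports
print(f"daylight vs darkness odds ratio: {or_vod:.2f} (log OR {log_or:.2f})")
```

Exponentiating a VOD logistic-regression coefficient yields exactly this kind of odds ratio, which is far easier for a lay audience to interpret than the raw coefficient alone.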
The officer-level analyses shown on pages 190-196 appear capably done, but I am not clear on how this research would fare in front of a human subjects review board. Are the police officers represented in these analyses unwitting research subjects? Are they required to provide consent to participate in the research or to have their identities revealed? If they are being identified as potentially problematic officers, it is not difficult to imagine them facing some harm as a result. I am not an expert in the laws and policies governing research on human subjects, but it would appear sensible to ensure that the appropriate laws and policies were followed in this case.