MEASURING INTERJUDGE SENTENCING DISPARITY: BEFORE AND AFTER THE FEDERAL SENTENCING GUIDELINES*

JAMES M. ANDERSON, Defender Association of Philadelphia
JEFFREY R. KLING, Princeton University and National Bureau of Economic Research
and
KATE STITH, Yale Law School

Abstract

This paper evaluates the impact of the Federal Sentencing Guidelines on interjudge sentencing disparity, which is defined as the differences in average nominal prison sentence lengths for comparable caseloads assigned to different judges. This disparity is measured as the dispersion of a random effect in a zero-inflated negative binomial model. The results show that the expected difference between two typical judges in the average sentence length was about 17 percent (or 4.9 months) in 1986–87 prior to the Guidelines and fell to about 11 percent (or 3.9 months) in 1988–93 during the early years of the Guidelines. We have not sought to measure the effect of parole in the pre-Guidelines period, other sources of disparity such as prosecutorial discretion, or the proportionality of punishment under the Guidelines as compared with the pre-Guidelines era.

I. Introduction

One of the chief objectives of the Sentencing Reform Act of 1984 1 was to reduce sentencing disparity among similar offenders. The act described this purpose as “avoiding unwarranted disparities among defendants with similar records who have been found guilty of similar criminal conduct.” 2 Both the act and its legislative history demonstrate that Congress’s overriding concern was to reduce disparity thought to result from the exercise of judicial discretion in sentencing.

* Assistance in data production was generously provided by Ralph Mecham, Steve Schelsinger, and Cathy Whitaker at the Administrative Office of the U.S. Courts. Helpful comments were made by participants in the conference sponsored by the John M. Olin Program in Law and Economics at which an earlier version of this article was presented, and at various stages of the project by Daron Acemoglu, Josh Angrist, Jushan Bai, Katherine Brownlee, José Cabranes, Gary Chamberlain, Ken Chay, Peter Diamond, Hugh Eastwood, Dan Freed, Jerry Hausman, Bo Honoré, Larry Katz, Kara Kling, David Lee, Steve Levitt, Jeff Liebman, John Lott, Whitney Newey, Abigail Payne, Anne Peihl, Steve Pischke, Jack Porter, and Jim Poterba. Kling acknowledges financial support from a National Science Foundation Graduate Fellowship and an Alfred P. Sloan Doctoral Dissertation Fellowship.
1 Sentencing Reform Act of 1984, 28 U.S.C. § 991(b) (1988).

[Journal of Law and Economics, vol. XLII (April 1999)] © 1999 by The University of Chicago. All rights reserved. 0022-2186/99/4201-0011$01.50

This content downloaded from 104.247.035.210 on October 03, 2017 11:23:39 AM All use subject to University of Chicago Press Terms and Conditions (http://www.journals.uchicago.edu/t-and-c).

The act was sponsored and shepherded through Congress by an unusual coalition of liberals and conservatives. Liberals expressed particular concern that permitting the exercise of discretion compromised the ideal of equal treatment under the law, while conservatives were concerned as well with perceived undue leniency in sentencing.3 To accomplish these ends, the act created the United States Sentencing Commission to develop federal Sentencing Guidelines.
As the associate attorney general wrote to the first chairman of the commission, “Simply stated, unwarranted disparity caused by broad judicial discretion is the ill that the Sentencing Reform Act seeks to cure.” 4 The commission’s Sentencing Guidelines, which became effective on November 1, 1987, restrict the exercise of judicial discretion to a narrow sentencing range in each case and limit judicial departures from that range. The sentencing ranges of the Guidelines are a small percentage of the statutory ranges 5 that were available to judges in the pre-Guidelines era; it was presumably hoped that sentencing disparity would be reduced commensurately.

This paper examines the impact of the Guidelines on one particular type of disparity: interjudge disparity in the average length of prison sentences of criminal defendants in federal district courts. After a brief review of the Guidelines and the mechanisms through which they may affect disparity, we explore methods of measuring interjudge disparity. Using the fact that cases are randomly assigned to judges in many districts, we define interjudge disparity as relative differences in the average sentence length for otherwise comparable caseloads of defendants assigned to different judges in the same district. We compare estimates of interjudge disparity before and after implementation of the Guidelines.6

2 See also 28 U.S.C. § 994(f) (1988) (“The Commission, in promulgating guidelines pursuant to subsection (a)(1), shall promote the purposes set forth in section 991(b)(1), with particular attention to the requirements of subsection 991(b)(1)(B) for providing certainty and fairness in sentencing and reducing unwarranted disparities”).
3 See generally Kate Stith & José A. Cabranes, Fear of Judging: Sentencing Guidelines in the Federal Courts 38–48 (1998).
4 Stephen S. Trott, Letter to Hon. William W. Wilkins (April 7, 1987). The letter from Trott (who was associate attorney general writing on behalf of the Department of Justice) to Wilkins (who was the first chairman of the Sentencing Commission) urged adoption of sentencing guidelines that would permit only narrow judicial discretion; the letter is reprinted at 8 Fed. Sentencing Rep. 196 (1995). See also 8 Fed. Sentencing Rep. 199 (1995) (reprinting letter dated November 9, 1994, from Judge Stephen S. Trott to the new chairman of the Sentencing Commission, explaining that experience under the Guidelines had caused him to conclude that “the cure is worse than the disease”).
5 Before the passage of the Guidelines, these ranges were the only statutory upper bound on sentencing. Under the Guidelines, they continue to serve as an upper bound on the possible length of a sentence, even if the Guidelines would otherwise mandate a longer sentence.

In our preferred specification, the dispersion in the judge’s effect on sentence length is represented by the variance of a judge-specific random variable in a statistical model of prison sentence length. Existing econometric random effects models for count data (such as the number of months of prison sentence length) are extended in several ways. A zero-inflated negative binomial model is developed in order to account explicitly for the fact that many cases end in dismissal or acquittal or have a sentence that involves no prison time at all.
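To illustrate why zero inflation matters for sentence-length data, the sketch below codes a generic textbook zero-inflated negative binomial probability mass function. It is an assumption-laden illustration rather than the specification estimated in this paper: it omits covariates and the judge-specific random effect, and the function names and parameter values (`pi_zero`, `mu`, `alpha`) are hypothetical.

```python
import math

def nb_pmf(k, mu, alpha):
    """Negative binomial pmf with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha          # shape parameter implied by the dispersion alpha
    p = r / (r + mu)         # success probability implied by the mean mu
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def zinb_pmf(k, pi_zero, mu, alpha):
    """Mixture: with probability pi_zero the outcome is a structural zero
    (e.g., dismissal, acquittal, or no prison time); otherwise it is NB."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, alpha)
    return pi_zero + base if k == 0 else base

# Hypothetical values: 30% structural zeros, 30-month mean for the NB part.
pi_zero, mu, alpha = 0.3, 30.0, 0.5
total = sum(zinb_pmf(k, pi_zero, mu, alpha) for k in range(2000))
expected_months = sum(k * zinb_pmf(k, pi_zero, mu, alpha) for k in range(2000))
prob_zero_zinb = zinb_pmf(0, pi_zero, mu, alpha)
prob_zero_nb = nb_pmf(0, mu, alpha)
```

Under these assumed values the mixture places far more mass at zero months than the plain negative binomial does, which is the feature of the data that motivates the zero-inflated form.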
A parametric model of the random effect allows the judge effects to be correlated over time and allows a convenient interpretation of the dispersion in terms of a Gini coefficient, or half the expected absolute difference in sentence length between two judges in the same district office relative to the office mean. While noting that other kinds of disparity may have been exacerbated by the Guidelines and there may be unwarranted uniformity in sentencing under the Guidelines, we conclude that interjudge disparity in nominal sentencing is less pronounced in the Guidelines era than it was in the era of discretionary sentencing.

II. Defining Interjudge Disparity

The enabling legislation and legislative history of the Sentencing Reform Act refer to reducing unwarranted disparity as “the major premise of the sentencing guidelines.” 7 Neither in statute nor legislative history did Congress define or explain what constituted “unwarranted disparities among defendants with similar records, who have been found guilty of similar criminal conduct.” 8 As the Senate Report accompanying the Sentencing Reform Act noted, “The key word in discussing unwarranted sentence disparities is ‘unwarranted.’” 9

To avoid confusion, we distinguish among three distinct types of sentencing variation: proportionality, disproportionality, and disparity. Proportionality, under our definition, is sentencing variation among a set of decision makers in the criminal justice system that is justified by relevant differences among offenders and their crimes. Conventionally, these differences include various characteristics of the criminal offense (such as the amount of harm caused) and of the criminal offender (such as prior criminal record). Its converse, disproportionality, is any variation in sentencing outcomes for a given set of decision makers that is not attributable to relevant sentencing factors; in this sense, disproportional variation is akin to what many previous commentators have referred to as disparity.

6 The post-Guidelines era differs from the pre-Guidelines era in two significant respects: the Guidelines themselves (which apply to all crimes committed after November 1, 1987) and statutory minimum sentences (which Congress has enacted with regularity since the mid-1980s and which apply to a significant portion of federal prosecutions).
7 S. Rep. No. 225, 98th Cong., 1st Sess. 52 (1984), reprinted in 1984 U.S.C.C.A.N. 3182, at 3235.
8 28 U.S.C. § 991(b)(1)(B).
9 S. Rep. No. 225, supra note 7, at 161, reprinted in 1984 U.S.C.C.A.N. 3344.

In contrast, we define disparity as variation in sentencing between the sets of hypothetical decision makers that could potentially be involved in the disposition of an offender’s case. Under this approach, disparity can be thought of as the variation in sentence that would result if a single offender were processed through the criminal justice system by every possible combination of sentencing decision makers.

Our definition of disparity is centered on the sentencing decision maker rather than on characteristics of the offense or the offender. Thus, for example, the fact that sentences for equal amounts of crack cocaine and powdered cocaine are dissimilar is not disparity under our definition. The crack/powder difference is either proportional variation or disproportional variation, depending on one’s judgment as to whether the difference in the type of cocaine is relevant to the proper amount of criminal punishment.
Similarly, some have argued that the Sentencing Guidelines may have increased overall penalty variation among offenders because, before the passage of the Guidelines, judges could consider the reputational consequences and the loss of potential earnings suffered by white-collar offenders.10 Under our definition, this type of variation is not “disparity.” It is, rather, a form of disproportionality (or proportionality, depending upon one’s views about whether reputational and earnings consequences should be considered by sentencing judges). We have adopted a definition of “disparity” that considers solely that variation caused by the identity of the decision maker; the concern about “white-collar inequity,” on the other hand, refers to variation in total punishment among different classes of offenders.

The definitions of disparity, proportionality, and disproportionality that we have adopted permit us to distinguish variation that is attributable to the identity of the decision makers in the criminal justice system from variation in sentences based on differences between criminal cases (that may or may not be justified depending on one’s theory of the purposes of sentencing).

10 John R. Lott, Jr., Do We Punish High Income Criminals Too Heavily? 30 Econ. Inquiry 583 (1992); Jonathan M. Karpoff & John R. Lott, Jr., Why the Commission’s Corporate Guidelines May Create Disparity, 3 Fed. Sentencing Rep. 140 (1990). For example, Karpoff and Lott argue that the market imposes significant penalties on corporate and white-collar offenders that are not accounted for under the Guidelines. These critics argue that the Guidelines increase “disparity” because judges, hamstrung by the Guidelines, are unable to equalize total punishment (including reputational sanctions) among equally culpable offenders.

We measure only interjudge disparity and do not attempt to gauge potential disparity at other stages in the sentencing process or potential disproportionality. We limit our inquiry for several reasons. First, the difficulties are formidable in rigorously measuring disparity in other stages in the process, despite its likely presence.11 Second, in focusing on the variation attributable only to disparity between judges rather than on disproportionality we avoid the inevitably contentious issue of the purpose of sentencing itself. Finally, interjudge sentencing disparity has generated more concern than disparity at any other stage of the sentencing process.

Concern about interjudge sentencing disparity is not hard to understand. Attorney General Robert H. Jackson pithily expressed the intuitive unfairness of interjudge disparity: “It is obviously repugnant to one’s sense of justice that the judgment meted out to an offender should depend in large part on a purely fortuitous circumstance; namely the personality of the particular judge before whom the case happens to come for disposition.” 12 Prior to the promulgation of the Sentencing Guidelines, a federal judge’s sentencing discretion was enormous and virtually unreviewable. As the last actor in the determination of the offender’s formal sentence, the judge was in a position either to remedy or to exacerbate any disparity in the earlier stages of criminal prosecution. In addition, the judge has by far the most visible role in the sentencing process, formally announcing the polity’s exaction of punishment. Variation in law enforcement, charging, and probation practices is far less visible.
By the early 1970s, considerable intellectual enthusiasm for the idea of reducing sentencing disparity developed as part of a wave of general efforts to attack indeterminate sentencing.13 The theory that punishment was designed to rehabilitate the offender had become discredited, and both parole and interjudge disparity came under attack. Perhaps the most influential critic of judicial sentencing discretion was Marvin E. Frankel, himself a distinguished judge in the Southern District of New York.14 Frankel argued that the range of choice provided to the sentencing judge was “terrifying and intolerable for a society that professes devotion to the rule of law” and that it should be “unthinkable in a ‘government of law, not of men.’” 15 He wrote, “[I]ndividualized justice is prima facie at war with such concepts, at least as fundamental, as equality, objectivity, and consistency in the law.” 16

11 Under our definition, disparity may exist in the law enforcement, prosecution, probation, and parole stages of the criminal justice system. For example, one prosecutor may charge a defendant in such a way that he would face a mandatory minimum of 10 years, while another prosecutor would prosecute that same defendant in a way that would result in a sentence of 5 years. Law enforcement disparity may involve the manner of investigation, such as the amount of drugs offered in a reverse sting. Probation officers may take different approaches to presentence reports and may implement parole policies differently.
12 U.S. Attorney General, Annual Report 5–6 (1940).
13 See Andrew von Hirsch, Doing Justice (1976); Norval Morris, The Future of Imprisonment (1974); Kenneth Culp Davis, Discretionary Justice (1969).

Frankel’s anecdotal arguments appeared to be confirmed by an experimental study that he helped to organize in the district courts comprising the Second Circuit. The organizers of the study distributed identical presentence reports to 50 district court judges; each judge was asked to impose sentences for each case. Sentences varied widely, in one case from 20 years in prison and a $65,000 fine (the most severe judge) to 3 years in prison (the most lenient judge). Moreover, the disparity was not attributable to either a handful of judges or outliers in each case but appeared throughout the range of sentences.17

The Second Circuit study’s finding of substantial sentencing disparity was repeated in other studies.18 Commentators also cited prison unrest 19 and racial and class discrimination 20 as problems deriving from the exercise of judicial discretion in sentencing. Critics from the political Right expressed dissatisfaction with the perceived leniency of sentencing judges and parole officials.21 Both liberals and conservatives argued that sentencing disparity compromised the ideal of equal treatment under law. By the early 1980s, there had developed across the ideological spectrum a consensus that dramatic changes in the sentencing process were needed to reduce sentencing disparity stemming from the exercise of judicial discretion.

14 Marvin Frankel, Lawlessness in Sentencing, 41 U. Cin. L. Rev. 1 (1972); Marvin Frankel, Criminal Sentences: Law without Order (1973).
15 Frankel, Criminal Sentences, supra note 14, at 5.
16 Id. at 10.
17 Anthony Partridge & William B. Eldridge, The Second Circuit Sentencing Study: A Report to the Judges 1–3, 9 (1974). The study also noted that neither experience with the Eastern District of New York’s practice of sentencing councils (whereby a judge confers with two other judges before sentencing) nor time on the bench seemed to increase the likelihood that a particular judge’s sentence would be consistent with that of her colleagues.
18 See William Austin & Thomas A. Williams III, A Survey of Judges’ Responses to Simulated Legal Cases: Research Note on Sentencing Disparity, 68 J. Crim. L. & Criminology 306 (1977) (47 district court judges reviewed and sentenced five hypothetical cases; wide disparity in sentence lengths noted); Whitney North Seymour, 1972 Sentencing Study for the Southern District of New York, 45 N.Y. St. B. J. 163 (noting gross sentencing variations in actual cases, not controlling for particular case attributes); Beverly Blair Cook, Sentencing Behavior of Federal Judges: Draft Cases, 42 U. Cin. L. Rev. 597 (1973).
19 Sentence disparity was thought to increase prison unrest. James V. Bennett, a former director of the Federal Bureau of Prisons, explained: “The prisoner who must serve his excessively long sentence with other prisoners who receive relatively mild sentences under the same circumstances cannot be expected to accept his situation with equanimity. The more fortunate prisoners do not attribute their luck to a sense of fairness on the part of the law but to its whimsies. The existence of such disparities is among the major causes of prison riots, and it is one of the reasons why prisons so often fail to bring about an improvement in the social attitudes of their charges.” J. Bennett, Of Prisons and Justice, S. Doc. No. 70, 88th Cong., 2d Sess. 319 (1964).
20 Joseph C. Howard, Racial Discrimination in Sentencing, 59 Judicature 121 (1975–76); Tom Wicker, Judging the Judges, N.Y. Times, February 6, 1976, at A29; Alec Hopkins, Is There a Class Bias in Criminal Sentencing? 42 Am. Soc. Rev. 176, 176–77 (1977); Frankel, Criminal Sentences, supra note 14, at 23–24.
21 See Jonathan D. Casper, Determinate Sentencing and Prison Crowding in Illinois, 1984 U. Ill. L. Rev. 231, 236–37 (explaining that “[c]onservatives and law enforcement interests” desired determinate sentencing because “parole boards seemed often to release prisoners who continued to pose a danger to society” and judges seemed “reluctant to send ‘marginal defendants’ to prison”); J. Edgar Hoover, The Dire Consequences of the Premature Release of Dangerous Criminals through Probation and Parole, 27 F.B.I. L. Enforcement Bull. 1 (1958); see also Reform of the Federal Criminal Laws: Hearings on S. 1437 before the Subcommittee on Criminal Laws and Procedures of the Senate Committee on the Judiciary 95th Cong., 1st Sess., 8580, 8995 (1977) (statements of Sen. Lloyd Bentsen and Ronald L. Gainer).

III. Guidelines as a Mechanism for Reducing Interjudge Disparity

The result of this political consensus was the enactment of the Sentencing Reform Act of 1984. The primary sponsors of the legislation were Senator Edward M. Kennedy, the liberal Massachusetts Democrat, and Senator Strom Thurmond, the conservative South Carolina Republican; President Reagan hailed the bill when he signed it in October of 1984. The act created a Sentencing Commission charged with developing and implementing a system of binding Sentencing Guidelines. At the same time, beginning in the mid-1980s, Congress enacted a series of laws that mandated high minimum sentences for certain crimes, including nearly all narcotics offenses, which now constitute some 40 percent of all prosecutions in federal court. The Sentencing Reform Act itself also contained a variety of mandates, such as the requirement that repeat offenders receive sentences “at or near” the statutory maximum, which have also contributed to a substantial increase in the overall severity of federal criminal sentences.

The centerpiece of the Guidelines is a grid containing 258 boxes (termed the “Sentencing Table”). The grid’s horizontal axis (“Criminal History Category”) adjusts severity on the basis of the offender’s past conviction record. The vertical axis (“Offense Level”) reflects a base severity score for the crime committed, as further adjusted for those aspects of the crime that the Guidelines deem relevant to sentencing. The Guidelines, through a complex set of rules requiring significant expertise to apply, instruct the sentencing judge on how to calculate both “Criminal History Category” and “Offense Level.” The box at which these two factors intersect then determines the range within which the judge may sentence the defendant. As required by the Sentencing Reform Act,22 the sentencing range in each box is small, the highest point being 25 percent more than the bottom point. This 25 percent range represents one source of discretion retained by judges under the Guidelines. The only other form of lawful sentencing discretion is authority to “depart” from the Guidelines. This authority is formally limited, however, to two circumstances.
The first is that in which the defendant has provided substantial assistance in the prosecution of others,23 in which event the judge may pronounce a sentence that departs downward from the Guideline range, with the important caveat that the prosecutor must first agree, in the words of the Supreme Court, to “authoriz[e] the district court to depart.” 24 If the prosecutor does make the appropriate motion for departure, the court may depart not only below the Guidelines range but also below any applicable statutory minimum sentence.25

The second situation in which a judge may depart, up or down, from the Guideline range is where the judge is able to demonstrate on the record that there are factors or circumstances present in the case at hand that have not been “adequately” factored into the Guidelines’ sentencing rules by the commission and make the case “atypical.” The Sentencing Commission has admonished that it expects the exercise of this departure power to be “rare.” 26 In a 1996 survey, however, 73 percent of district judges indicated that they felt mandatory guidelines were not necessary to direct the sentencing process, and “they strongly prefer a system in which judges are accorded more discretion than they are under the current guidelines.” 27

In recent years, judges have departed downward on the basis of substantial assistance to authorities in nearly 20 percent of all cases sentenced under the Guidelines and have departed (upward or downward) due to “atypicality” in another 10–12 percent of cases.28 In many other cases, the defense and the prosecution stipulate to the “facts” that are relevant to sentencing under the Guidelines or stipulate even to a particular Guideline sentencing range; in these cases, there may be no departure as a formal matter, but it may be difficult to determine whether the Guidelines have been faithfully implemented.29 For these various reasons, the ultimate ability of the Guidelines to control interjudge disparity is an open question.

22 28 U.S.C. § 994(b)(2).
23 Substantial Assistance to Authorities (Policy Statement) U.S.S.G. § 5K1.1; see also 28 U.S.C. § 994(m).
24 Melendez v. United States, 116 S. Ct. 2057, 2061 (1996).
25 See Limited Authority to Impose a Sentence below a Statutory Minimum 18 U.S.C. § 3553(e).
26 Grounds for Departure (Policy Statement) U.S.S.G. § 5K2.0. The Sentencing Guidelines themselves also identify a few preferred bases for departure; for example, the instructions on calculation of the defendant’s criminal history “score” advise that the judge should depart up or down depending upon whether the defendant’s record of convictions overestimates or underestimates his criminal history. In Koon v. United States, 518 U.S. 81, 116 S. Ct. 2035 (1996), the U.S. Supreme Court held that federal courts of appeals should generally review the decision of the district judge to depart from the Guidelines under an abuse of discretion standard, rather than review the district court’s application of the sentencing guidelines de novo. The Court also noted, “We do not understand it to have been the congressional purpose [for the Guidelines] to withdraw all sentencing discretion from the United States District Judge. Discretion is reserved within the Sentencing Guidelines, and reflected by the standard of appellate review that we adopt.” Id. at 2053. It will be interesting to see whether this decision leads to an increase in departures and/or interjudge sentencing disparity in the future.

IV. Measuring Changes in Disparity

As we have noted, our definition of disparity is centered on the sentencing decision maker rather than on characteristics of the offense or the offender. In order to measure interjudge disparity as we have defined it, one must observe the sentencing outcomes of similar cases assigned to different judges. Previous researchers have attempted to do this in three main ways.

First, simulated cases with a common set of facts have been distributed to judges who then provide a sentence. The influential Second Circuit sentencing study, and a similar study conducted nearly a decade later by the U.S. Department of Justice (1981),30 used this approach. It is quite difficult, however, for a simulation to reconstruct the full depth of information available to a judge in a real case. Moreover, there is no assurance that judges approach simulation studies with the seriousness and deliberation that they would bring to a real case with a real defendant and real victims.31

27 Molly Treadway Johnson & Scott Gilbert, The U.S. Sentencing Guidelines: Results of the Federal Judicial Center’s 1996 Survey (1997). Earlier surveys of federal judges had indicated even less satisfaction with the Guidelines. See Don J. DeBenedictis, The Verdict Is In, 79 A.B.A. J. 78 (1993) (reporting that in poll conducted by ABA, nearly half of federal judges wanted to completely abolish the Guidelines). As one of the present authors has noted, over half of all active federal judges were appointed since the Guidelines went into effect, and these judges may be more satisfied with the Guidelines than judges who were appointed in the era of discretionary sentencing. See Stith & Cabranes, supra note 3, at 5–6, 143–44.
28 See U.S.S.C., Sourcebook of Federal Sentencing Statistics 39 (1997) (showing also that use of substantial assistance departures has increased markedly over time, from less than 5 percent in 1989 to nearly 20 percent in 1994, 1995, and 1996).
29 See Francesca D. Bowman, Probation Officers Advisory Group Survey, 8 Fed. Sentencing Rep. 303 (1996) (chief probation officers in two-thirds of federal districts responding to survey report that pleas of guilty are often accompanied by an agreement that includes Guideline stipulations or calculations).
30 John Bartolomeo et al., Sentence Decision-Making: The Logic of Sentence Decisions and the Extent and Sources of Sentence Disparity (1981).

Second, the variation among cases with common observable characteristics has been measured, with the residual variation among these observably similar cases attributed to the judges. A fundamental problem with this type of analysis is the difficulty in distinguishing between disparity, disproportionality, and proportionality.32 A 1991 study by the Sentencing Commission, for instance, compared similar cases from a pre-Guidelines year (1985) and a post-Guidelines period of 2 years (1989–90).33 The analysis conducted by the commission compared the range of sentences and mean sentence for each of these categories under the Guidelines to corresponding measures from the pre-Guidelines period. The evaluation categorized cases according to factors deemed relevant by the Sentencing Guidelines and hence concluded that “unwarranted disparity” existed if and only if there was deviation from the Guidelines. The commission’s finding of less disparity post-Guidelines was simply a confirmation that post-Guidelines sentences are more likely to be in accordance with the Guidelines.
This study illustrates the general problem with this measurement strategy: to compare “similar” cases, a study must rely upon an inevitably controversial theory of what constitutes proportional and disproportional variation. Moreover, variation because of unobserved differences in the cases (proportionality) cannot be readily distinguished from variation because of the decision makers in the system (disparity).

In the third approach, caseloads randomly assigned to judges have been deemed to be comparable, and the average sentencing outcomes for these caseloads compared, with differences attributed to the judges. We adopt this third approach and examine the average sentences of cases to which judges were randomly assigned within a particular federal district office to assess whether there was more disparity in these averages before or after implementation of the Federal Sentencing Guidelines.34

31 In the Second Circuit survey, the judges were mailed the case packets over the course of 6 weeks, responding with proposed “sentences” by return mail. In addition to lack of face-to-face contact with the defendant and lack of advocacy by the parties, there was no simulation of the presentencing procedural history of the cases, especially that relating to plea bargaining. Nor was there any discussion of parole in the study or whether judges should sentence in “real time” or compensate for the effect of parole. One scholar who led a reanalysis of the data of the Second Circuit study has reported that “much of the seeming difference in sentences . . . was simply a function of different understandings as to . . . how long the judge thought the defendant would actually serve.” Milton Heumann, Empirical Questions and Data Sources: Guidelines and Sentencing Research in the Federal System, 6 Fed. Sentencing Rep. 15 (1993) (summarizing research by Stanton Wheeler et al., Sitting in Judgment: The Sentencing of White-Collar Criminals (1988)). See also Jon O. Newman, A Better Way to Sentence Criminals, 63 A.B.A. J. 1562 (1977) (suggesting that disparity in sentencing in the Second Circuit study was increased because of different assumptions about actions of the parole board); Jon O. Newman, Foreword: Parole Decision-Making and the Sentencing Process, 84 Yale L. J. 810, 812 (1975); Shari S. Diamond & Hans Zeisel, Sentencing Councils: A Study of Sentencing Disparity and Its Reduction, 43 U. Chi. L. Rev. 109 (1975).
32 See text accompanying notes 9–11 supra.
33 U.S. Sentencing Commission, The Federal Sentencing Guidelines: A Report on the Operation of the Guidelines System and Short-Term Impacts on Disparity in Sentencing, Use of Incarceration, and Prosecutorial Discretion and Plea Bargaining 288, 292, 296, 299 (1991).

To make the following discussion of measurement methodology more concrete, consider a simple example of a district in which cases are randomly assigned to two judges, Judge Harsh and Judge Lenient. In measurement of interjudge disparity, we focus on the difference between judges within a time period ($D_t$) in the mean of prison sentences for each judge ($\theta$) relative to the mean level of prison sentence length in the district as shown in (1).

$$D_t = \frac{\theta_h - \theta_l}{E[\theta]}. \tag{1}$$

In evaluating the effect of the Guidelines on interjudge disparity, we examine the magnitude and statistical precision of the change in this disparity measure before and after the Guidelines (time periods 1 and 2), denoted as $\Delta D$ in (2).

$$\Delta D = D_2 - D_1. \tag{2}$$

This empirical strategy is the most straightforward, but it has several important implications.
First, the null hypothesis that interjudge disparity is the same in both periods ($\Delta D = 0$) can be tested directly.35 This is conceptually distinct from statistical tests that rely on a null hypothesis of no disparity. Assume, for example, that the judge means were the same in both periods and the hypothesis of no disparity was rejected in period 1. Suppose that the null hypothesis of no disparity were accepted in period 2, because of a smaller sample size or increased variance in the distribution of sentence length. Under these assumptions, a false inference that there was a change in interjudge disparity may be drawn when there was in fact no change.36

34 A methodology similar to ours was first used by Frederick Gaudet, George S. Harris, & Charles W. St. John in Individual Differences in the Sentencing Tendencies of Judges, 23 J. Crim. L. Criminology 811 (1933); and Frederick Gaudet, The Differences between Judges in the Granting of Sentences of Probation, 19 Temp. L. Q. 4 (1946). More recently, it has been used by Joel Waldfogel in Aggregate Inter-Judge Disparity in Sentencing: Evidence from Three Districts, 4 Fed. Sentencing Rep. 151 (1991); Joel Waldfogel, Inter-judge Disparity in Federal Sentencing: Evidence from Three Federal Districts 1984–90 (1992) (unpublished manuscript on file with authors); Abigail Payne, Does Inter-judge Disparity Really Matter? An Analysis of the Effects of Sentencing Reforms in Three Federal District Courts, 17 Int’l J. Law & Econ. 337 (1997); Paul Hofer, Kevin Blackwell, & Barry Ruback, The Effect of the Federal Sentencing Guidelines on Inter-judge Sentencing Disparity (1998) (unpublished manuscript, U.S. Sentencing Commission).

35 Note that the hypothesis is about the change in overall disparity and not about changes in behavior by particular judges over time. For example, if the two judges simply reversed roles between the two periods, then there would be a change in $\theta_h$ and $\theta_l$ but no change in $\Delta D$, the expected disparity from the point of view of the defendant. Waldfogel, Inter-judge Disparity in Federal Sentencing, supra note 34, tests for a change in disparity by comparing a model in which the judge means are restricted to be the same over time and one in which they are allowed to vary. This is implicitly a test of changes in the behavior of individual judges and not of overall disparity.

Second, the disparity measured by $\Delta D$ is variation relative to the mean of sentences in all cases. This has the desirable property of being an inequality measure that is scale invariant. For example, simply multiplying all sentences by a constant factor (say, to make sentences stiffer in the later period) will not affect the magnitude of $\Delta D$. Defendants who are acquitted or whose cases are dismissed are assigned a sentence length of zero, because cases (and not convictions) are randomly assigned to judges and because only complete caseloads are comparable between judges rather than just convictions.

Third, the judges compared in the two periods are the same. This allows isolation of changes in behavior of the same individuals and avoids convolution with the potentially different sentencing patterns of other judges who heard cases in only one of the periods.37

36 Analyses based on percentage of variation explained or on F-tests for equality of judge means to evaluate the impact of the Guidelines (as used by Payne and by Hofer, Blackwell, & Ruback, supra note 34) are vulnerable to these potentially confounding factors.

37 Because the behavior of the same judges is examined over time, the judges are necessarily older in the later periods. Experience may affect behavior, perhaps bringing judges closer together over time as they observe each other’s work. We do not expect these changes to be substantial over short periods of time, however, such as the consecutive 2-year periods that are used in the empirical analysis.

To move beyond the illustrative statistical model in (1) and (2), we now discuss features of models suitable for estimation with data on multiple judges. We begin with the model in (3) based on the Gini coefficient, where g is the expected absolute difference of J judge means relative to twice the overall mean.

\[ g = \frac{1}{J(J-1)} \sum_{j=1}^{J} \sum_{k=1}^{J} \frac{|\theta_j - \theta_k|}{2E[\theta]}. \tag{3} \]

If the true judge means θ were known, this would be an attractive measure. Unfortunately, the fact that θ is measured with sampling error results in an upward bias in the sample estimates of g, substantially complicating matters. One way to see this immediately is to examine the special case where the true judge means are all the same. In a finite sample of cases, the estimated means will not be exactly the same, so g will always be estimated to be some positive number when the true value is zero. This bias from sampling error is an inherent feature of this and many other simple estimation strategies that summarize the dispersion of estimated parameters.38 Another disadvantage of these methods is that standard errors on changes in dispersion of estimated means are analytically intractable (as with the Gini coefficient) or have poor finite sample properties (as with analysis of variance methods).39 These estimation concerns lead us to develop a parametric model to estimate interjudge disparity.
Instead of summarizing the distribution of imprecisely estimated judge means, we adopt a strategy that incorporates the estimation of the judge effects directly in a statistical model of the underlying distribution of sentence lengths. As a point of departure, we consider the negative binomial model that has been used previously in econometrics and that is part of a larger class of Generalized Linear Models well known in statistics.40 This type of count data model is attractive for this application because it accounts explicitly for the fact that the data are nonnegative integers. Introducing a random variable into this model that corresponds to the judge assigned to the case allows us to directly estimate parameters that capture the dispersion of this random variable or the variance of the ‘‘random effect’’ due to the judge. We extend the negative binomial random effects model in several ways. 38 The intuition behind this bias is that we do not actually know which judge is truly more harsh and which truly more lenient. By using the absolute difference, we always infer that the one with the higher average sentence length is harsher. In repeated sampling, the truly harsher judge may have a lower mean sentence length in some samples even though the mean sentence length is higher on average. These misclassified instances, where the true difference between the harsh and more lenient judge is negative but measured as positive, are the source of the bias. When the true means are close together and the variance of the underlying distribution is large so that the means are imprecisely estimated, this type of misclassification may occur frequently even though the estimates of the means themselves are unbiased. Similar intuition applies to other dispersion measures like the variance. 
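The upward bias in g described above can be illustrated with a small simulation. The Python sketch below is an illustration under stated assumptions rather than the procedure used in this paper: the function names, the sentence distribution (45 percent zeros, otherwise exponential with a mean of 50 months), the number of judges, and the caseload size are all hypothetical. It computes g as in equation (3), taking E[θ] to be the simple average of the judge means, and then averages the estimated g over repeated samples in which every judge has the same true mean.

```python
import random
from itertools import combinations


def gini_of_means(means):
    """g from equation (3): average absolute pairwise difference of the
    judge means, divided by twice the overall mean (assumed here to be
    the simple average of the judge means)."""
    j = len(means)
    overall = sum(means) / j
    # The double sum over ordered pairs (j, k) is twice the sum over
    # unordered pairs; pairs with j == k contribute zero.
    total = 2 * sum(abs(a - b) for a, b in combinations(means, 2))
    return total / (j * (j - 1)) / (2 * overall)


def simulate_g_under_no_disparity(n_judges=4, n_cases=100, reps=2000, seed=0):
    """Average estimated g when all judges share one true mean, so the
    true g is zero. The sentence distribution is hypothetical."""
    rng = random.Random(seed)

    def draw_sentence():
        # 45% of cases receive no prison time; the rest are exponential
        # with a mean of 50 months (illustrative values only).
        return 0.0 if rng.random() < 0.45 else rng.expovariate(1 / 50)

    estimates = []
    for _ in range(reps):
        means = [
            sum(draw_sentence() for _ in range(n_cases)) / n_cases
            for _ in range(n_judges)
        ]
        estimates.append(gini_of_means(means))
    return sum(estimates) / reps


if __name__ == "__main__":
    # Judge Harsh and Judge Lenient with known means of 37 and 33 months.
    print(round(gini_of_means([37, 33]), 3))  # 4 / 70, about 0.057
    # Strictly positive even though the true value of g is zero.
    print(simulate_g_under_no_disparity())
```

Because the absolute difference is never negative, the average of the estimates stays above zero for any finite caseload, which is the bias that motivates the parametric model.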
39 A related alternative to the Gini coefficient would be estimation of the variance of a random judge effect if the judge effects are cast in a variance components model, as discussed by Shayle Searle et al., Variance Components 168–226 (1992). When the data are not normal, the asymptotic distribution of the variance of the random effect depends on fourth moments. Our Monte Carlo experiments performed using data distributed like actual sentencing data found that the asymptotic approximation for the standard error on changes between periods in the between-judge component of the variance was far too dispersed to be useful, with 2 SDs covering the truth nearly 100 percent of the time, as opposed to the predicted 95 percent coverage.

40 The negative binomial model in particular is well suited for the “overdispersed” nature of the sentence length data relative to the more traditional Poisson model. See Jerry Hausman, Bronwyn Hall, & Zvi Griliches, Econometric Models for Count Data with an Application to the Patents-R&D Relationship, 52 Econometrica 909 (1984); A. Colin Cameron & Pravin Trivedi, Regression Analysis of Count Data 27–37 (1998).

First, we separate the likelihood into two parts, allowing an additional “zero-inflation” parameter for the probability of receiving no prison sentence.41 Second, the mean of the random effect is allowed to differ by covariates, so that we can measure dispersion relative to each district mean.
Third, a lognormal distribution is used for the random effect, which allows a convenient interpretation of the dispersion of the random effect as a Gini coefficient, the expected absolute difference in average sentence length between judges in the same district.42 Finally, the random effects are allowed to be correlated over time, and the extent of the correlation can be directly estimated.

41 This augmentation has been previously developed for the Poisson model. See John Mullahy, Specification and Testing of Some Modified Count Data Models, 33 J. Econometrics 341 (1986); Diane Lambert, Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing, 34 Technometrics 1 (1992); Shiferaw Gurmu, Paul Rilstone, & Steven Stern, Semiparametric Estimation of Count Regression Models, 88 J. Econometrics 123 (1999).

Let Y denote the number of months of a prison sentence. For judge j in period t, there are a total of $N_{tj}$ realizations of $y_{itj}$. To model the distribution of sentence lengths in cases heard by a judge, we use a negative binomial distribution with parameters m and p, augmented with an extra parameter d that affects the probability that $y_{itj} = 0$. The joint likelihood function $L_{Tj}$ for cases heard by a judge in T periods is given in equation (4), where $\Gamma(\cdot)$ is the gamma function.

\[ L_{Tj} \equiv \Pr(y_{11j}, \ldots, y_{N_{Tj}Tj} \mid J = j) = \prod_{t=1}^{T} \prod_{i=1}^{N_{tj}} w_t \, \frac{\Gamma(m_t + y_{itj})}{\Gamma(m_t)\,\Gamma(y_{itj} + 1)} \left[ d_t + p_{tj}^{m_t} \right]^{1(y_{itj} = 0)} \left[ p_{tj}^{m_t} (1 - p_{tj})^{y_{itj}} \right]^{1(y_{itj} > 0)}. \tag{4} \]

The density of the zero-inflated negative binomial is normalized to one by setting $w_t = (1 + d_t)^{-1}$. When p is defined as a particular function of m, d, and a judge-specific parameter θ, the mean sentence length for judge j in period t depends only on the judge effect θ as in (5).

\[ p_{tj} \equiv \left( 1 + \theta_{tj} \, \frac{1 + d_t}{m_t} \right)^{-1} \;\Rightarrow\; E(Y_{itj} \mid T = t, J = j) = \theta_{tj}. \tag{5} \]
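The two properties claimed for equations (4) and (5), that the zero-inflated density sums to one and that its mean equals the judge effect θ, can be checked numerically. The Python sketch below is an illustration only: the function name and the parameter values (θ = 30, m = 2, d = 0.5) are hypothetical and are not estimates from this paper. It evaluates the single-case probability implied by (4) with p defined as in (5).

```python
import math


def zinb_pmf(y, theta, m, d):
    """Probability of a sentence of y months under the zero-inflated
    negative binomial of equation (4), with p defined by equation (5)."""
    w = 1.0 / (1.0 + d)                      # normalizing constant w_t
    p = 1.0 / (1.0 + theta * (1.0 + d) / m)  # equation (5)
    if y == 0:
        # Zero outcomes receive the extra mass d on top of p**m.
        return w * (d + p ** m)
    # Negative binomial term, computed in logs for numerical stability.
    log_nb = (
        math.lgamma(m + y) - math.lgamma(m) - math.lgamma(y + 1)
        + m * math.log(p) + y * math.log(1.0 - p)
    )
    return w * math.exp(log_nb)


if __name__ == "__main__":
    theta, m, d = 30.0, 2.0, 0.5  # hypothetical parameter values
    support = range(5000)         # truncation point; the omitted tail is negligible here
    total = sum(zinb_pmf(y, theta, m, d) for y in support)
    mean = sum(y * zinb_pmf(y, theta, m, d) for y in support)
    print(total)  # should be very close to 1
    print(mean)   # should be very close to theta
```

The check works because the negative binomial mean m(1 − p)/p, scaled by w = 1/(1 + d), reduces to θ once p is substituted from (5).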
42 Hausman, Hall, & Griliches, supra note 40, parameterize the random effect with a beta distribution, which results in a likelihood function that has a particularly simple form for a single period but is not easily extended to allow correlation in the random effects across periods. Multivariate normal random effects for the Poisson model have been explored by Siddhartha Chib, Edward Greenberg, & Rainer Winkelmann, Posterior Simulation and Bayes Factors in Panel Count Data Models, 86 J. Econometrics 33 (1995).

Since we are interested in the distribution of the judge effects, we directly model the distribution of θ. For the complete data in any one period (T = 1), the likelihood $L_1$ is obtained by integrating out θ (the judge “random effect”) using a lognormal density function with mean $\mu_t$ and standard deviation $\sigma_t$, denoted as $f_1$ in equation (6).

\[ L_1 = \prod_{j=1}^{J} \int_0^{\infty} L_{1j}(\theta_{1j}) \, f_1(\theta_{1j} \mid \mu_1, \sigma_1) \, d\theta_{1j}. \tag{6} \]

For any single district office, the number of judges is relatively small (2–21), and there are a reasonably large number of cases per judge ($N_{tj} \ge 30$). Thus, estimates of σ correspond to the dispersion about the overall office mean in consistent estimates of average sentencing for an observed small sample of judges.43 Data from multiple district offices are pooled for efficiency of estimation. We allow the mean of the judge effects to differ by office, since cases are assigned randomly within district offices but not between offices. Denoting X as a set of indicators for each office, we let $\mu_t = X\beta_t$. In order to account for correlation of judge effects across two periods, we also formulate a model in which the likelihood $L_2$ for two periods uses a joint lognormal density function with correlation coefficient ρ, denoted as $f_2$ in equation (7).
\[ L_2 = \prod_{j=1}^{J} \int_0^{\infty} \int_0^{\infty} L_{2j}(\theta_{1j}, \theta_{2j}) \, f_2(\theta_{1j}, \theta_{2j} \mid \mu_1, \sigma_1, \mu_2, \sigma_2, \rho) \, d\theta_{2j} \, d\theta_{1j}. \tag{7} \]

A primary interest in this study is measures of interjudge disparity in sentencing. Denote γ as the Gini coefficient of concentration of the judge means derived from (6) or (7), as opposed to g in (3), which is computed using the estimated judge means. Using the properties of a lognormal parametric form for θ, the Gini coefficient measuring relative disparity in average sentence length between judges in two periods depends only on the two variance parameters of the random effect. A change in γ can be computed as in equation (8), where Φ is the cumulative normal distribution function.

\[ \gamma_2 - \gamma_1 = 2\Phi\left( \frac{\sigma_2}{\sqrt{2}} \right) - 2\Phi\left( \frac{\sigma_1}{\sqrt{2}} \right). \tag{8} \]

43 To see the implication of this, consider a simple example. Say that an office has a superpopulation of potential judges with $f_1(3.5, 0.1)$ and receives two judges h and l, and $\theta_h = 36$ and $\theta_l = 30$. The standard deviation of log(θ) in this sample of actual judges in the district is 0.091. This estimation strategy will identify the dispersion in the actual small sample, 0.091, and not the dispersion in the superpopulation of potential judges, 0.1. In general, the dispersion in a small sample will be less than the superpopulation from which it is drawn. In a simulation of 100,000 samples of judge effects from a lognormal distribution $f_1(3.5, 0.1)$, the average standard deviation of the log was 0.056 for a sample of two judges, 0.080 for four judges, and 0.096 for 22 judges. This implies it is important that the number of judges within a district office is the same when making comparisons between time periods using this methodology.

When data from multiple offices are pooled and µ includes indicators for each office, then γ measures the overall dispersion in judge means relative to their own district office mean. Furthermore, since γ is the expected absolute difference between two judges relative to twice the overall mean, it has the straightforward interpretation of interjudge disparity as a percentage difference relative to the overall level of sentence length.

One simple hypothesis to explain changes in interjudge disparity over time is that the types of offenses within a judge’s caseload are changing over time. We would like to distinguish between changes in the behavior of judges and changes in the types of cases to which they are assigned. What would trends in interjudge disparity look like if judges had been assigned caseloads with the same shares of offense types every year? One way to answer this question is to statistically adjust by reweighting the caseloads. For example, the adjusted average sentence length for a judge in 1982–83 and 1992–93 might take the average in each period for drug cases and for nondrug cases and compute a weighted average using the same weights in both periods. In this paper, we use a set of weights for each district office based on the shares of offense types within that office in 1986–87. Denote $N_{86\text{-}87}$ as the total number of cases in a district office during 1986–87. Let superscript z refer to a type of offense, so that $N^z_{jt}$ is the number of cases assigned to judge j in time period t for offense type z. Weights $w_{ijt}$ are then defined in equation (9).

\[ w_{ijt} \equiv \left( \frac{1}{1 + d_t} \right) \left( \frac{N^z_{86\text{-}87}}{N_{86\text{-}87}} \right) \left( \frac{N_{jt}}{N^z_{jt}} \right). \tag{9} \]

For results in the next section that use weighting for offense type comparability over time, we change the weights w in equation (4), using $w_{ijt}$ from (9) instead of $w_t = (1 + d_t)^{-1}$.
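The effect of the offense-type weights in equation (9) on a judge's average can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the estimation procedure itself: the function names, the two offense categories, the base-period shares, and the sentences are hypothetical, and the factor 1/(1 + d_t) is omitted because it belongs to the likelihood in (4) rather than to a simple average. The sketch also assumes that every offense type with a positive base-period share appears at least once in the judge's caseload.

```python
from collections import Counter


def offense_weights(case_types, base_shares):
    """Per-case weight (N^z_base / N_base) * (N_jt / N^z_jt) from
    equation (9), without the 1 / (1 + d_t) factor."""
    n_jt = len(case_types)        # cases assigned to the judge in the period
    counts = Counter(case_types)  # N^z_jt for each offense type z
    return [base_shares[z] * n_jt / counts[z] for z in case_types]


def adjusted_mean(sentences, case_types, base_shares):
    """Average sentence length after reweighting the judge's caseload
    to the base-period offense shares."""
    weights = offense_weights(case_types, base_shares)
    return sum(w * y for w, y in zip(weights, sentences)) / len(sentences)


if __name__ == "__main__":
    # Hypothetical caseload: three drug cases and one nondrug case.
    sentences = [60, 40, 50, 10]
    case_types = ["drug", "drug", "drug", "other"]
    # Hypothetical base-period (1986-87) office shares by offense type.
    base_shares = {"drug": 0.5, "other": 0.5}

    print(sum(sentences) / len(sentences))                      # unweighted mean: 40.0
    print(adjusted_mean(sentences, case_types, base_shares))    # about 30: 0.5 * 50 + 0.5 * 10
```

With these weights the adjusted average equals the base-period share of each offense type times the judge's mean sentence for that type, so a shift in caseload composition alone does not move the adjusted figure.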
In this paper, we focus on disparity in the overall average prison sentence length, so we note that disparity in the overall average is heavily influenced by cases with long sentence lengths. If there are differences in disparity by type of offense, the offenses with longer sentences will have a large effect on disparity in the overall average. To see this, say that Judge Harsh sentences a fraud offender to 11 months and a violent offender to 63 months, while Judge Lenient sentences a fraud offender to 9 months and a violent offender to 57 months. The Gini coefficient between these two judges for the fraud cases is 0.1 and for the violent cases is 0.05. The Gini coefficient for the overall average of 33 and 37 is 0.057, which is indicative of the weight given to the offense type with the larger average sentence length.44

44 To enhance the comparability between results for the overall mean and those for accounting for offense types, we use direct weighting in equation (9) so that the estimates are sensitive to the sentence length in the same way as the estimates for the unweighted data. Another way to account for offense types would be to include covariates directly in the model, such as including indicators for offense types in µ. If there were differences in disparity by offense type, this strategy would instead essentially weight by the number of cases for each offense type without regard to the average sentence length for that offense type, complicating the comparison between estimates adjusted to account for offense types and unadjusted estimates.

V. Data Description

In order to implement the measurement strategies outlined in the previous section, we minimally required data on the universe of cases filed within various districts and on the judge, disposition, and prison sentence length in the case. A special extract was prepared for this research by the Statistics Division of the Administrative Office of the U.S. Courts that included a previously unavailable nonidentifying code that was used to group together cases heard by the same judge.45

In order to create a dataset of cases randomly assigned to judges, we excluded judges who did not hear a full caseload (and therefore were unlikely to have fully participated in the randomization). This selection rule was based on the number of cases heard.46 Under random assignment, the caseload should be approximately balanced across judges. On the basis of this logic, we constructed a sample in which judges were deemed to be “active” in a particular year.47 In order to have a sufficient number of cases to consistently assess the sentencing patterns over time, cases were dropped from the sample if the assigned judge had fewer than 30 cases within a 2-year period. Since random assignment is usually done within each of several offices in any district, we restricted our data to offices that had at least two judges. When judges were assigned cases in more than one office, cases were included only for the office from which the judge was assigned the largest number of cases. Since judges are randomly assigned to cases, but the cases may have more than one defendant, we randomly selected one defendant from each case.

45 The data used in this paper are the same as the Cases Terminated files of the Administrative Office of the U.S. Courts (AOUSC), with documentation available from the Federal Justice Statistics Program data in the National Archive of Criminal Justice Data, except that the judge code is not suppressed as it is in the public-use file. By agreement with the AOUSC, we cannot provide any information that would result in the disclosure of data about specific judges. Therefore, we do not report the identity of specific district offices used in this analysis.

46 In order to be included in a sample for analysis, a judge had to have been assigned a proportion of cases within a certain range of the maximum assigned to any judge in that office in any 1-year period. A chi-square test statistic was then computed for the null hypothesis that the cases were distributed between judges with equal probability in the final sample of cases used for analysis. The p-values from this test should be uniformly distributed between zero and one. The size of the range used was 3.9 standard errors of the maximum proportion, which was calibrated so that the median p-value was 0.50 for the 304 test statistics for offices and periods used in the final sample. It appears that most judges excluded by this rule either were not in the district for the entire period or were likely on senior status and consistently receiving a reduced caseload. The identity of the judges is masked in our data, however, so senior status cannot be directly verified.

A central premise of this analysis is that cases are randomly assigned to judges. We marshal two types of evidence in support of our claim that random assignment was used in the districts included in our analysis. A primary source of evidence regarding randomization is the distribution of offense types among the caseloads of each judge. For example, the proportion of drug cases, embezzlement and fraud cases, violent and firearms cases, and other crimes should be the same for each judge in a district office except for sampling error. Differences in these proportions form the basis for the chi-square test of independence of the offense types and the judges.
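The chi-square test of independence between judges and offense types can be sketched as follows. This Python example is a minimal illustration, not the code used for this paper: the function name and the counts in the example table are hypothetical. It computes the standard Pearson statistic and its degrees of freedom from a table with one row per judge and one column per offense category, assuming every judge and every category has at least one case.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a table
    of case counts with one row per judge and one column per offense type."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for k, observed in enumerate(row):
            # Expected count if offense types are independent of the judge.
            expected = row_totals[i] * col_totals[k] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


if __name__ == "__main__":
    # Hypothetical office with two judges and four offense categories:
    # drug, violent and weapons, embezzlement and fraud, other.
    caseloads = [
        [30, 20, 25, 25],
        [25, 25, 20, 30],
    ]
    stat, dof = chi_square_independence(caseloads)
    print(round(stat, 2), dof)  # 2.02 3
```

A p-value would then come from the upper tail of the chi-square distribution with the returned degrees of freedom, for instance via `scipy.stats.chi2.sf(stat, dof)` if SciPy is available. Under random assignment, such p-values should be spread roughly uniformly between zero and one across offices and periods.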
We performed these chi-square tests for each district office for judges with at least eight cases during 25 6-month periods from July 1981 to December 1993. Under random assignment, we would expect not only that 95 percent of these tests would have chi-square test statistics less than the .95 critical value, but also that the p-values of the test statistics for the various time periods would be uniformly distributed over the (0,1) interval with an average p-value of 0.5. A statistical exclusion rule based on this logic was used to identify districts unlikely to have used random assignment throughout the 25 periods, where districts with a mean p-value below a threshold (the fifth percentile of the mean of 25 uniform random variables, .405) were excluded. Districts were also excluded if there were not at least two active judges in eight or more 6-month periods both before and after implementation of the Guidelines.

As a second source of information on random assignment, we drew upon interviews with the court clerks in the districts. We conducted 40 interviews ourselves and also utilized a similar investigation by researchers at the U.S. Sentencing Commission.48 The statistical exclusion rule based on the mean p-value of the chi-square test for independence of offense types identified all four offices that had consistently had at least two active judges but were reported to have used nonrandom assignment of cases according to the qualitative research, as well as 19 other offices. For the 26 offices included in the analysis, we present quantiles of the p-values from the chi-square tests of independence of judges and offense types in Table 1. The mean of the 606 p-values is 0.49, and they are distributed fairly uniformly from zero to one.

47 The exact time periods cover one calendar year, with the following exceptions. To increase the sample size for the two earliest periods, the 1982 period is actually from July 1981 to September 1982, and the 1983 period is from October 1982 to December 1983. The 1988 period includes November 1987 through January 1989, including cases from the initial promulgation of the Guidelines through the ruling on their constitutionality. (In Mistretta v. United States, 488 U.S. 361 (1989), the U.S. Supreme Court upheld the constitutionality of both the Sentencing Guidelines and the commission in the face of challenges under the nondelegation doctrine and the separation of powers.)

48 The particular methods used to assign cases to judges vary between districts. In some districts, judges’ names are drawn from a deck of cards, sometimes shuffled in the dark and/or sealed with wax to prevent tampering. In others, envelopes with the judges’ names are placed in a box or circular hopper that is spun before the clerk pulls an envelope out of the box. In the 1990s, many districts have adopted computer software to implement the random assignment. Also, in order to more evenly distribute potentially lengthy cases, random assignment is sometimes stratified by projected length of the case. Paul Hofer at the U.S. Sentencing Commission also provided information about case assignment practices based on interviews conducted during his own research and from interviews previously conducted by the Federal Judicial Center as part of the FJC Time Study (on file at the Federal Judicial Center). Letter from Paul Hofer to Jeffrey R. Kling, December 4, 1997.

TABLE 1
Quantiles of p-Values from Chi-Square Tests of Independence between Judges and Offense Types

Quantile     .10    .25    .50    .75    .90
p-value      .09    .23    .47    .76    .92

Note. Six hundred six chi-square statistics were computed for 26 offices and for up to 25 6-month periods. Quantiles of p-values are weighted by number of cases in that office and time period. The offense types were grouped into four categories: drug, violent and weapons, embezzlement and fraud, and other cases.

The resulting sample used for estimation includes 77,201 cases, 27 percent of the total universe of over 285,000 cases from July 1981 to December 1993. About half the cases are excluded because an office does not have at least two judges who consistently had cases assigned to them throughout the period. Roughly another one-quarter of cases are excluded because they were assigned in an office that did not appear consistently to use random assignment of cases to judges throughout the time period under study. Both the regional composition and the offense types are similar in the estimation sample and in the universe of all cases.

In our analysis, we focus on comparison of interjudge disparity for 2-year periods before and after promulgation of the Guidelines. Descriptive statistics for the 2-year periods from 1982 to 1993 are given in Table 2.

TABLE 2
Descriptive Statistics of Prison Sentences in Months

                              1982–83   1984–85   1986–87   1988–89   1990–91   1992–93
Percentages:
  % 0 months (acquittals)           3         2         2         2         2         1
  % 0 months (dismissals)          13        13        11        10        10        10
  % 0 months (convictions)         32        35        33        27        28        24
  % 1–24 months                    26        25        24        29        27        28
  % 25–48 months                   12        11        11        11         9        12
  % 49–96 months                    9         8        10        12        12        14
  % 97–540 months                   6         6         8         9        11        11
Summary statistics:
  Mean                             24        24        29        32        36        38
  SD                               50        53        57        60        66        67
Sample sizes:
  Offices                          23        26        26        25        26        26
  Judges                          139       148       154       157       161       159
  Cases                        11,997    12,007    12,575    13,689    13,604    13,329

Note. Data are from Cases Terminated files of the Administrative Office of the U.S. Courts. For sample creation details, see Section V.
The percentages with zero sentence length in our data include zeros for acquittals and dismissals, which have declined slightly as a share of all cases over time. The distribution of sentence length shifted toward higher sentences throughout the 12.5-year period covered by these data as mandatory minimums and Guidelines took effect over time.

VI. Results

On the basis of our interviews and statistical tests, we are fairly confident that the offices included in our sample used random assignment of cases to judges. Since the caseloads should therefore be comparable, differences in the average sentence length of these caseloads can be attributed to judges themselves. This section reports the results of the methods outlined in Section IV above.

Figure 1. Interjudge disparity for all judges

In Figure 1, we graph estimates of interjudge disparity for 2-year time periods from 1982 to 1993 using the data described in Table 2. The triangles in Figure 1 are Gini coefficients computed from the absolute difference of the judge means, based on g from equation (3) using the average of estimates for each district office weighted by the number of cases in that office. The circles in Figure 1 are from the dispersion in the random effect of the zero-inflated negative binomial model for each single period, based on estimates using equation (6) transformed into the Gini coefficient γ. The changes over time in g and γ are quite similar, with a peak in 1984–85, followed by a decline that accelerates from 1986–87 through 1988–89 and a leveling off thereafter.
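The transformation from the dispersion parameter σ of the lognormal random effect into the Gini coefficient γ can be sketched in Python. The sketch relies on the standard result that a lognormal distribution has Gini coefficient 2Φ(σ/√2) − 1, which is the level form of equation (8) and simplifies to erf(σ/2). The function names and the values of σ and µ below are hypothetical and are not estimates from this paper. A Monte Carlo check of the interpretation of γ as the expected absolute difference between two judges relative to twice the overall mean is included.

```python
import math
import random


def lognormal_gini(sigma):
    """Gini coefficient of a lognormal distribution with log-scale
    standard deviation sigma: 2 * Phi(sigma / sqrt(2)) - 1 = erf(sigma / 2)."""
    return math.erf(sigma / 2.0)


def change_in_gini(sigma_1, sigma_2):
    """Change in disparity between periods 1 and 2, as in equation (8)."""
    return lognormal_gini(sigma_2) - lognormal_gini(sigma_1)


def monte_carlo_gini(sigma, mu=3.5, draws=200_000, seed=0):
    """Simulated E|theta_a - theta_b| / (2 * E[theta]) for pairs of judge
    effects drawn independently from the same lognormal distribution."""
    rng = random.Random(seed)
    diff_sum = 0.0
    level_sum = 0.0
    for _ in range(draws):
        a = rng.lognormvariate(mu, sigma)
        b = rng.lognormvariate(mu, sigma)
        diff_sum += abs(a - b)
        level_sum += a + b
    # (diff_sum / draws) / (2 * level_sum / (2 * draws)) = diff_sum / level_sum
    return diff_sum / level_sum


if __name__ == "__main__":
    sigma = 0.3  # hypothetical dispersion parameter
    print(round(lognormal_gini(sigma), 3))  # erf(0.15), about 0.168
    print(monte_carlo_gini(sigma))          # should be close to the value above
    print(change_in_gini(0.3, 0.2))         # negative: disparity falls when sigma falls
```

Because the formula does not involve µ, the Gini coefficient is unaffected by the overall level of sentences, consistent with the scale invariance discussed in Section IV.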
As discussed in Section IV above, the estimates of g are biased upward by sampling error. Estimates of the magnitude of the bias depend on the assumptions used to model the judge means.⁴⁹ The estimates of γ account for sampling variability by explicitly modeling the underlying distribution of sentence lengths and the distribution of judge means, formalizing the intuition that larger deviations of a judge from the district office mean are increasingly likely to be caused by sampling error. The point estimates of g are approximately 0.07 greater than γ for each time period. Based on the model of γ, the sampling error bias in g appears to be fairly constant over time.⁵⁰

⁴⁹ One model is that the true judge means are identical within each office. To simulate the bias when the true g is zero, we created data where each judge was randomly matched with the sentencing outcomes of N_jt cases from the district office. An alternative model is that the estimated judge mean is an unbiased estimate of the true mean. Treating the observed distribution as the true distribution, an estimate of the bias can then be computed from the difference between the mean of the bootstrap estimates and the original estimate. We computed bootstrap estimates of the bias for this model by drawing bootstrap samples for each judge from the observed distribution of sentencing outcomes for that judge. In 1986–87, for example, the estimated g was .154. In 500 replications, the bias was estimated to be .122 from the true mean zero simulation and .038 from the bootstrap procedure. The middle ground between these two models would allow differences in the judge means while incorporating the intuition that an estimated judge mean particularly different from the district office mean is more likely to have come from an unusual sample of cases (rather than treating these extremes as unbiased estimates of the true mean for that judge).

⁵⁰ Note that the bias in absolute measures of disparity, such as the variance or the mean absolute deviation, is increasing over time as the underlying variance of the data is increasing. The mean of the data is also increasing over time, however, so that the ratio of the standard deviation to the mean in Table 2 has actually decreased slightly over time, implying slightly less bias for relative disparity measures in later periods. A counteracting factor, however, is that the true means appear to be closer together in later periods, which increases the bias as discussed in note 40 supra.

Our main interpretation of these results is that the same temporal dynamics of interjudge disparity are apparent in measurements of both g and γ. This increases our confidence that our results about changes in disparity over time are not highly sensitive to the modeling strategy. For the remainder of this section, we focus on estimates of γ from the random effects model so that we can account for sampling error bias, assess the statistical precision of changes between periods, and estimate the correlation of judicial sentencing patterns between time periods.

To first verify that the model defined in equations (4)–(7) is an appropriate model for these data, we compare observed cell probabilities for the pooled data from 1986–87 with predicted values from a simple two-part negative binomial model, assuming that δ, γ, and θ are constant across all cases. The actual distribution and predicted distribution are presented in Table 3. The model has been constructed to fit zero exactly and does a reasonable job of representing the rest of the distribution, even though the large amount of data results in a chi-square statistic (χ² = 66) that rejects the hypothesis that the model exactly fits the data. Of course, the model does not account for the fact that within the cells of Table 2 the data are clustered at particular months (6, 12, 18, . . . ) rather than distributed smoothly across all months, but the model fits the basic features of the data quite well.

TABLE 3
Observed versus Predicted Cell Probabilities of Three Parameter Model

| Months of Prison Term | 0 | 1–12 | 13–24 | 25–48 | 49–96 | 97–192 | 193–540 |
|---|---|---|---|---|---|---|---|
| Observed | .464 | .154 | .088 | .107 | .106 | .059 | .020 |
| Predicted | .464 | .135 | .084 | .110 | .113 | .072 | .020 |

Note: Estimates are based on equation (4), assuming that the three parameters d, m, and p are constant across all cases, using pooled data for 1986–87.

Allowing θ to vary across the judges when maximizing the likelihood in (7) requires evaluating J double integrals for every function evaluation in nonlinear optimization, but these can be computed efficiently after an appropriate transformation of variables using Gaussian quadrature based on Hermite polynomials.⁵¹ Standard errors for γ, the Gini coefficient in (8), are computed using a numerical approximation to the Hessian and the delta method.⁵²

⁵¹ The weights and abscissae used were from the 16-point Gauss-Hermite rule in Gwynne Evans, Practical Numerical Integration 308 (1993); 25-point and 40-point rules were also tried and resulted in almost identical point estimates. (Estimation programs in MATLAB are on file with the authors.)

⁵² In order to assess the usefulness of these asymptotic approximations for inference in finite samples, we conducted Monte Carlo simulations in which the true parameters of the model were known. The asymptotic standard errors reported in the paper appear to be slightly too small, with the confidence interval of 1.96 times the standard error including the truth about 90 percent of the time as opposed to the asymptotic prediction of 95 percent coverage.

The estimates of γ for Figure 1 were estimated separately for each period based on equation (6) and data for all judges available in the period. The Gini coefficient estimates peak in 1984–85 and fall by .019 in 1986–87 and again by .031 in 1988–89 before largely leveling off. While not statistically significant, these results also suggest that interjudge disparity may have been decreasing prior to 1986–87, an issue that we discuss further below. Taking the average over the three periods from 1988 to 1993 as our post-Guidelines measure, interjudge disparity fell from .085 to .054 from 1986–87 to 1988–93, a decrease of .031 with a standard error of .010. The changes are more pronounced than the mixed results of previous researchers.⁵³

On the basis of these results, the expected difference between two typical judges is twice the Gini coefficient: about 17 percent in 1986–87 and about 11 percent in 1988–93. Since overall sentence lengths are rising over time, a given percentage difference in interjudge disparity implies a larger absolute difference in months of prison sentence length. Multiplying the percentage difference by the overall average sentence length in each period from Table 2 expresses our measure of interjudge disparity in terms of months.
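The conversion just described can be sketched as a short calculation. This is only an illustration using the rounded values printed in the text: Gini coefficients of .085 and .054, and mean sentence lengths of 29 and 35 months. The function name is a hypothetical one introduced here. Because the inputs are rounded, the output need not match the published figures to the last digit.

```python
def expected_difference(gini, mean_sentence_months):
    """Return the expected difference between two typical judges, first as a
    share of the mean (twice the Gini coefficient) and then in months."""
    share = 2.0 * gini
    return share, share * mean_sentence_months


# Rounded inputs taken from the text; the function name is illustrative.
pre_share, pre_months = expected_difference(0.085, 29)    # 1986-87
post_share, post_months = expected_difference(0.054, 35)  # 1988-93

print(f"1986-87: {pre_share:.1%} of the mean, {pre_months:.2f} months")
print(f"1988-93: {post_share:.1%} of the mean, {post_months:.2f} months")
```

With these rounded inputs the sketch gives 4.93 months for 1986–87, which rounds to the reported 4.9, and 3.78 months for 1988–93, slightly below the reported 3.9. The gap presumably reflects rounding in the published Gini coefficient and mean, since 11 percent of 35 months is 3.85.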
For 1986–87 when the mean sentence length was 29, the expected interjudge difference was 4.9 months, which fell to 3.9 months in 1988–93 when the mean sentence length was 35.⁵⁴

⁵³ Waldfogel, Inter-judge Disparity in Federal Sentencing, supra note 34, analyses three districts (CT, SDNY, NDCA) from 1984–90 and finds an increase in interjudge disparity in two of the three districts during 1988–90, although no standard errors for these estimates are reported. Payne, supra note 34, analyses three districts (EDNY, SDNY, EDPA) from 1980–91. For property crimes, she finds that interjudge disparity declines for one of three districts. For drug crimes, she finds declines for all three districts. Payne emphasizes that the fraction of variation explained by mean judge effects is small relative to the total variation. This is true and tells us that there are many additional factors that drive differences in sentences, but it does not lead us to conclude that interjudge disparity itself is small or unimportant. Just as with other empirical relationships, such as the amount of variation in wages that is explained by differences in education levels, the fact that the percentage of explained variation is small is often less important than the magnitude of the coefficients associated with the variable of interest, such as differences in average wages between education levels or (as we believe in this case) with differences in average sentences between judges. Hofer, Blackwell, & Ruback, supra note 34, compare 1984–85 to 1994–95 using the same 42 judges in nine district offices in the part of their analysis most comparable to ours. They also focus on percentage of variation explained and report that the partial R² drops from 2.32 in 1984–85 to 1.08 in 1994–95. As pointed out in Section IV, there are several methodological differences between our analysis and those by Waldfogel, Payne, and Hofer, Blackwell, & Ruback. Most important, we measure the magnitude of interjudge disparity directly and provide a confidence interval to assess the statistical precision of the estimate of the change before and after the Guidelines, so the null hypothesis of no change can be tested. Our data are also more complete, covering 26 district offices for up to 12.5 years. In addition to the advantages of sample size and representativeness, our data have been constructed with criteria selecting districts using random assignment of cases to judges that appears to be more stringent. Of the 13 districts in these three studies, only three had procedures and caseloads that appeared to us to consistently use random assignment in the time periods under study here and were included in our analyses.

⁵⁴ The conversion of percentage differences into months depends in part upon the treatment of acquittals and dismissals. Cases (and not convictions) are randomly assigned to judges, and we use all cases assigned in computing our estimates. An argument can be made that acquittals and dismissals are independent of the judge assigned to the case, based on the logic that judges have much more discretion over sentencing for convicts than for other dispositions. Statistical tests for independence analogous to those for offense types described in Section V indicate that the combined fraction of acquittals and dismissals is roughly balanced across judges. Under true independence, the mean p-value of the 606 chi-square statistics (testing for independence of judge and acquittal/dismissal) should be .5, and the actual value is .46. Inclusion of acquittals and dismissals appears to have a negligible effect on the estimates of the Gini coefficient, since this is roughly equivalent to rescaling the judge means by a constant factor. If we were to use the mean sentence for convictions (33 in 1986–87 and 39 in 1988–93) instead of the mean for all cases, the results would be an expected difference of 5.7 months in 1986–87 and 4.4 in 1988–93.

Figure 2. Interjudge disparity for the same judges over two periods

TABLE 4
Share of Cases by Offense Type

| | 1982–83 | 1984–85 | 1986–87 | 1988–89 | 1990–91 | 1992–93 |
|---|---|---|---|---|---|---|
| Drug | .21 | .25 | .27 | .32 | .30 | .33 |
| Violent and weapons | .16 | .14 | .14 | .15 | .18 | .17 |
| Embezzlement and fraud | .26 | .28 | .31 | .28 | .29 | .26 |
| Other | .37 | .33 | .28 | .25 | .23 | .24 |

Note: Shares based on data described in Table 2.

Between 75 and 86 percent of judges in any single 2-year period were also in the sample during the following period. To ensure that results are not simply being driven by changes in the composition of judges, we also analyze changes based on the same judges in both periods. The lines in Figure 2 connect estimates of disparity between consecutive 2-year periods, based on estimates of γ from equation (7). The changes over time are very similar to those reported in Figure 1. The decrease in interjudge disparity before and after the promulgation of the Guidelines is sharper when comparing the same judges over time, as γ falls by more than half between 1986–87 and 1988–89 from .090 to .039, a decrease of .051 with a standard error of .013. The estimates from 1988–93 range from .055 to .059 to .046, and these differences are indistinguishable from sampling error.
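To make concrete how a reported change and its standard error bear on the null hypothesis of no change, the following sketch computes a z-statistic, a two-sided p-value under a normal approximation, and a 1.96-standard-error interval. The inputs are the two decreases reported in the text (.031 with standard error .010 for all judges, and .051 with standard error .013 for the same judges). The normal approximation and the function name are assumptions of this sketch, and the text cautions that the asymptotic standard errors appear somewhat too small in finite samples.

```python
import math


def change_test(change, std_error, critical=1.96):
    """z-statistic, two-sided normal p-value, and symmetric interval for a
    change in the Gini coefficient, given its standard error."""
    z = change / std_error
    # Two-sided tail probability of a standard normal beyond |z|.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    interval = (change - critical * std_error, change + critical * std_error)
    return z, p_two_sided, interval


# Changes reported in the text (negative values denote decreases).
all_judges = change_test(-0.031, 0.010)
same_judges = change_test(-0.051, 0.013)

for label, (z, p, (low, high)) in (("all judges", all_judges),
                                   ("same judges", same_judges)):
    print(f"{label}: z = {z:.2f}, p = {p:.4f}, interval = ({low:.3f}, {high:.3f})")
```

In both comparisons the interval lies entirely below zero, so the conclusion of a decrease would survive a modest widening of the intervals to allow for the undercoverage noted in the Monte Carlo simulations.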
The most conservative estimate, not shown in Figure 2, compares the same judges in 1986–87 to those in 1989–90, omitting 1988 to allow a transition period for adjustment to the new regime and because legal challenges to the Guidelines were not resolved until January 1989. This estimate shows that γ falls from .083 to .067, a decrease of .016 with a standard error of .013. This estimate is conservative in the sense that it is a smaller change than that from 1986–87 to 1988–89 and that our other estimates of disparity for years after the Guidelines other than 1989–90 are all lower than .067. We conclude from measures using the same judges that the expected difference between two judges (twice the Gini coefficient) decreased from 17–18 percent in 1986–87 to 8–13 percent in 1988–90. This range using the same judges in both periods brackets a decrease from 17 to 11 percent reported above for all judges based on Figure 1 for 1986–87 to 1988–93.

Another factor changing over time is the mix of offenses in the overall caseload. Table 4 shows that the overall share of drug offenses increased from .21 to .33 from 1982–83 to 1992–93, while the share of “other” offenses (such as forgery) fell from .37 to .24. If the disparity in sentencing drug cases was always lower than disparity for other cases, then we might observe a decrease over time in measured overall interjudge disparity that was caused by a change in the caseload coming before judges. In an attempt to separate out changes in judicial behavior from changes in the types of cases, we compute weighted results, replacing w_t in (4) with w_ijt defined in (9). These weights statistically adjust so that the shares of offense types in the overall distribution for each judge in each time period are equal to the share for their district office in 1986–87. The four offense types used are violent and firearms, drug, embezzlement and fraud, and other cases.
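The reweighting idea can be illustrated with a short sketch. Equation (9) is not reproduced in this section, so the weights below are an assumption: a standard post-stratification weight equal to the target share of an offense type divided by that type's share in the judge's own caseload. The target shares are the overall 1986–87 shares from Table 4, used here as a stand-in for one district office's shares, and the six cases assigned to the judge are hypothetical.

```python
from collections import Counter


def offense_mix_weights(case_offenses, target_shares):
    """Weight each case by target share / observed share of its offense type,
    so the weighted caseload reproduces the target offense mix."""
    counts = Counter(case_offenses)
    n_cases = len(case_offenses)
    return [target_shares[o] / (counts[o] / n_cases) for o in case_offenses]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Stand-in target mix: overall 1986-87 shares from Table 4.
target = {"drug": 0.27, "violent_weapons": 0.14, "fraud": 0.31, "other": 0.28}

# Hypothetical caseload for one judge in one period (sentences in months).
offenses = ["drug", "drug", "violent_weapons", "fraud", "other", "other"]
sentences = [60, 48, 36, 12, 0, 6]

weights = offense_mix_weights(offenses, target)
drug_share = sum(w for w, o in zip(weights, offenses) if o == "drug") / sum(weights)

print(f"unweighted mean sentence: {sum(sentences) / len(sentences):.2f}")
print(f"weighted mean sentence:   {weighted_mean(sentences, weights):.2f}")
print(f"weighted drug share:      {drug_share:.2f}")
```

A weight of this form is undefined when a judge has no cases of some offense type in a period, so a real implementation would need a rule for such cells. How the authors handle that case is not stated in this section.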
The choice of a base period does not affect trends over time but does affect the levels of the estimates; 1986–87 is chosen to address the counterfactual in which the Sentencing Guidelines were later adopted but the mix of offense types did not change.

TABLE 5
Changes over Time in Gini Coefficient Estimates of Interjudge Sentencing Disparity

| Periods compared | Unweighted: first period | Unweighted: second period | Unweighted: change | Weighted for offense type comparability: first period | Weighted: second period | Weighted: change | Judges |
|---|---|---|---|---|---|---|---|
| 1982–83 and 1984–85 | .079 (.009) | .107 (.008) | .028 (.011) | .092 (.008) | .102 (.009) | .011 (.012) | 118 |
| 1984–85 and 1986–87 | .102 (.009) | .092 (.009) | −.010 (.012) | .097 (.009) | .080 (.008) | −.017 (.011) | 126 |
| 1986–87 and 1988–89 | .090 (.008) | .039 (.010) | −.051 (.013) | .079 (.008) | .042 (.010) | −.038 (.012) | 133 |
| 1988–89 and 1990–91 | .055 (.010) | .057 (.008) | .002 (.013) | .057 (.008) | .053 (.009) | −.004 (.014) | 128 |
| 1990–91 and 1992–93 | .059 (.008) | .046 (.010) | −.012 (.012) | .045 (.008) | .041 (.011) | −.004 (.014) | 120 |

Note: Estimates are for Gini coefficients γ (with standard errors in parentheses) from equations (7) and (8) using data summarized in Table 2. Unweighted estimates are based on w_t in (4), while weighted estimates are based on w_ijt from equation (9). As described in the text, the weights statistically adjust each judge’s caseload over time to reflect the district office’s mix of offense types in 1986–87. The four offense types used are drug, violent and weapons, embezzlement and fraud, and other.

The unweighted results using w_t from Figure 2 are reproduced in the first three columns of Table 5, weighted results using w_ijt are shown in columns 4–6, and the number of judges active in the consecutive 2-year periods is shown in column 7.
In addition to making the shares of offense types comparable over time, weighting equalizes the shares of offense types within a period. The magnitudes of the weighted point estimates of interjudge disparity are slightly lower than the unweighted estimates, because of the correction for variability due to the fact that the shares are similar but not exactly equal when cases are assigned randomly.⁵⁵ The point estimates differ slightly, but the overall pattern of results is quite similar for the unweighted results and those weighted for comparability over time. For example, the weighted estimate of γ in 1986–87 is .079 and falls to .042 in 1988–89, implying that the expected difference between two judges fell from 16 percent to 8 percent. Our main interpretation of these results is that changes in interjudge disparity are not due to changes in the types of offense in the overall caseload.

⁵⁵ Since the data are weighted to reflect the 1986–87 case mix within each district office, the roughly 10 percent reduction in the magnitude of point estimates for the 1986–87 period is not due to changes in the offense mix over time. Instead, it reflects the fact that different offense types are weighted exactly equally in modeling the overall distribution for each judge, instead of the approximate equality of random assignment.

The aspect of the results that appears to be most sensitive to changes in specification is the change in disparity from 1984–85 to 1986–87. In all specifications, disparity appeared to be stable or increasing through 1986. For example, the weighted estimates using the same judges in 1983–84 and 1985–86 (not shown in Table 5) were .088 and .092.
Decreases in disparity appear to be concentrated during 1987–89, but there are not enough cases per judge to more precisely identify the timing of the changes.

Our preliminary research on disparity for particular offense types suggests that the decrease in interjudge disparity is concentrated within the violent, weapons, and drug crimes. Estimation for particular offense types substantially reduces the number of cases per judge in each period, however, and Monte Carlo simulations suggest that the methods used in this paper are substantially less reliable when there are fewer than 30 cases per judge used in the estimation. In future research we intend to pool additional years of data and model the dispersion in judge means as a parametric function that is changing over time, in order to obtain reliable estimates for various offense types.

Regarding the correlation of the judge effects, we find that the behavior of judges appears to be fairly consistent over time prior to the Guidelines. Table 6 reports the correlation of judge effects between time periods. Prior to 1986–87, this measure is greater than .70.⁵⁶ There is some evidence that the consistency of judicial sentencing patterns declined thereafter, but the results are mixed. It is increasingly difficult to reliably estimate the correlation between two periods when the variance of the random effect is small in both periods. In comparisons for 2-year periods subsequent to those reported in Table 6, the correlation varies from .44 to −.24. However, the standard errors are at least .21, and we cannot draw any credible conclusions from these very imprecise estimates in the later periods.

Finally, we return to the question of causality. Were these changes over time in interjudge sentencing disparity caused by the Federal Sentencing Guidelines?
Clearly, the largest change in disparity occurs between 1986–87 and 1988–89, which corresponds to the effective date of the Guidelines in November 1987. Disparity from 1988–93 has remained at levels lower than observed from 1982 to 1987. While this timing is suggestive, the Guidelines applied only to offenses committed after November 1, 1987. Because of the lag from commission of offense to case filing and because of constitutional challenges to the Guidelines, about half of the cases filed in 1988 and 1989 were not sentenced under the Guidelines.⁵⁷ We suspect that the 1986 enactment of mandatory minimum sentences for drug offenders (which applied only to crimes committed after October 1, 1986) may have substantially contributed to the decrease in disparity after 1986.⁵⁸

TABLE 6
Correlation of Judge Effects between Time Periods

| Periods | ρ (standard error) |
|---|---|
| 1982–83 and 1984–85 | .73 (.09) |
| 1983–84 and 1985–86 | .70 (.09) |
| 1984–85 and 1986–87 | .75 (.07) |
| 1985–86 and 1987–88 | .22 (.14) |
| 1986–87 and 1988–89 | .68 (.13) |
| 1988–89 and 1989–90 | .34 (.17) |

Note: Estimates of ρ (with standard errors in parentheses) are from maximum likelihood estimation of equation (7), using data summarized in Table 2.

⁵⁶ For context, note that the consistency in public behavior over time is even greater among representatives in the U.S. Congress, where the correlation in indices of voting patterns is 0.78 to 0.95 (depending on the index) when examining the voting by the same representative separated in time by two terms. See John Lott & Stephen Bronars, Time Series Evidence on Shirking in the U.S. House of Representatives, 76 Pub. Choice 125 (1993).

⁵⁷ Our data record only date of case filing and termination, and not date of offense or use of the Guidelines in sentencing.
As shares of cases filed in 1988–89, .3 were terminated in 1988, .43 in 1989, and .27 in 1990 or later. The U.S. Sentencing Commission, Annual Report 39 (1990), tabulates the fractions of terminated cases sentenced under the Guidelines to be .18 in 1988, .55 in 1989, and .70 in 1990. As a rough estimate of Guideline application, we use the shares of cases filed as weights to infer that the average fraction of Guidelines application was about 48 percent for cases filed in 1988–89 (.30 × .18 + .43 × .55 + .27 × .70 ≈ .48).

⁵⁸ Before the Sentencing Guidelines were even promulgated, Congress enacted the Anti-Drug Abuse Act of 1986, Pub. L. No. 99-570, 100 Stat. 3207, which provided an array of mandatory minimum penalties for narcotics offenses and violent crimes. Most significantly, the act “set up a new regime of non-parolable, mandatory minimum sentences for drug trafficking offenses that tied the minimum penalty to the amount of drugs involved in the offense.” U.S. Sentencing Commission, Special Report to Congress: Mandatory Minimum Penalties in the Federal Criminal Justice System 8 (1991). These mandatory minimum sentences were severe for all drugs, but especially for offenses involving crack cocaine, and they applied to nearly all drug offenses prosecuted in the federal courts. Because statutory changes in sentencing law are not applied retroactively to crimes before the enactment of the statutes and because there is a substantial delay between the commission of a crime and sentencing, the effect of mandatory minimums (like the Sentencing Guidelines) phases in gradually over several years.

VII. Discussion

By focusing our analysis on the nominal length of prison sentences, we have not considered interjudge disparity in the length of time actually served in prison.
It is possible that parole policies in the pre-Guidelines period reduced interjudge sentencing disparity in the time actually served by an offender. The Sentencing Reform Act eliminated parole, so parole cannot affect time served in cases sentenced under the Guidelines. For defendants who were sentenced to terms of imprisonment prior to the Sentencing Guidelines, parole authorities actually determined the date of release from prison and, thus, the time actually served by the offender. These determinations were based upon the Parole Guidelines, which were similar in function to the Sentencing Guidelines. Like the Sentencing Guidelines that were modeled after them, the heart of the Parole Guidelines was a grid of boxes that indicated actual sentence length based on the offender’s prior record and the seriousness of the offense. Since this determination was independent of, and subsequent to, the offender’s sentencing, the Parole Guidelines may have substantially mitigated interjudge disparity in nominal sentences. None of the influential studies that indicated widespread disparity in the pre-Guidelines era considered the effect of the Parole Guidelines in reducing interjudge time-served disparity. In future research, we intend to examine interjudge disparity in time served. Despite this caveat, we believe our focus on disparity in nominal sentences is appropriate. First, as we have related above, interjudge sentencing disparity was a critical impetus to the passage of the Guidelines, and reducing it was a central goal of the still-controversial Sentencing Guidelines. Second, interjudge sentencing disparity is an interesting phenomenon apart from its ultimate outcome on an offender’s sentence. The actual ceremony of sentencing has an expressive function that is important independent of the actual time served. The sentencing is the moment at which the community publicly expresses its disapproval of the offender’s action. 
The prosecutor and defense counsel offer a few words, offered more for the victim, the defendant, their friends and family, and any press than for the judge. The defendant is asked to speak, if she wishes, to accept responsibility, to ask for forgiveness, or to say nothing. Finally, the judge formally articulates a measure of the defendant’s offense against the community. That different judges publicly express different measures of justice for the same offenses is therefore both interesting and troublesome, even if the offenders ultimately serve the exact same sentence. Frankel cited the fact that some judges sentenced draft evaders to the maximum sentence while other judges imposed almost no prison time for defendants who broke the law in adherence to principle.⁵⁹ This disparity is notable because it shows that two judges, each purportedly expressing the will of the community, differ dramatically in their representation of the proper sentence for a particular crime. As Frankel noted, “It is not directly pertinent here whether either category of judge is right,”⁶⁰ but the fact of their disagreement is pertinent. The disagreement undermines the expressive function of sentencing by suggesting that a sentence is not so much a measure of the offense to the community as simply the personal judgment of the judge.

⁵⁹ Frankel, Criminal Sentences, supra note 14, at 29.

Despite the importance that the progenitors of the Guidelines placed on interjudge sentencing disparity and our focus on it in this paper, it would be a mistake to equate interjudge sentencing disparity with “unwarranted sentencing disparity” and consider our findings a simple vindication of the Sentencing Guidelines.
We note, first, that we have sought to measure only disparity among judicial participants in sentencing. There are, of course, other sources of sentencing disparity in the federal criminal justice system. As many commentators have noted, considerable disparity exists in charging policies at various U.S. Attorney’s offices,⁶¹ in the policies of law-enforcement personnel, and in the manner in which the probation officer conducts an independent investigation of the offense.⁶² The Guidelines did nothing to address these sources of disparity.

⁶⁰ Id.

⁶¹ Officially, offenders are to be charged with the most serious offense that can be proved at trial with certain limited exceptions. In practice, however, charging policies vary widely. See Ilene H. Nagel & Stephen J. Schulhofer, A Tale of Three Cities: An Empirical Study of Charging and Bargaining Practices under the Federal Sentencing Guidelines, 66 S. Cal. L. Rev. 501; Ahmed Taha, The Effect of the Federal Sentencing Guidelines on the Disposition of Criminal Cases (1998) (unpublished manuscript, U.S. Department of Justice) (provides evidence that prosecutors filed less serious charges against the average defendant, defendants pleaded guilty to charges that were closer to the charges prosecutors filed, more defendants initially pleaded guilty); Joe Brown, Quo Vadis? What Congress and the Department of Justice Should Do in Response to the Justice Department’s Analysis of Non-Violent Drug Offenders with Minimal Criminal Histories, 7 Fed. Sentencing Rep. 25, 26–27 (1994) (former U.S. Attorney criticizes disparity among federal prosecutors in use of 5K1.1 motions and in charging to avoid mandatory minimums: “popular way to avoid mandatory sentences entirely [is] by charging a telephone count under 21 U.S.C. § 843(b), which does not involve a mandatory sentence”); Robert H. Edmunds, Jr., Guidelines Sentencing and Department of Justice Policies under the Reagan-Bush Administrations, 6 Fed. Sentencing Rep. 306 (1994) (noting that in the district in which he was U.S. Attorney, the Middle District of North Carolina, an illegal alien who was convicted of reentry after deportation could expect a “term of years” while in a California border district, illegal immigrations were common and the aliens would receive far lighter sentences). There is an enormous range in policies among U.S. Attorney’s offices for making a motion for downward departure on the basis of substantial assistance (§ 5K1.1 motion), which allows a judge to sentence below the Guidelines range. In the District of Connecticut, any § 5K1.1 motion must be approved by a committee including the U.S. Attorney for the district. The § 5K1.1 departure rate in the district was 8.5 percent in 1994. In contrast, judges in the Eastern District of Pennsylvania departed downward on § 5K1.1 grounds in 49.3 percent of cases in 1994.

⁶² Gerald H. Heaney, The Reality of Guidelines Sentencing: No End to Disparity, 28 Am. Crim. L. Rev. 161, 200 (1991) (role of probation officers varies widely by district; one federal defender describes probation as more adversarial than prosecution).

Moreover, the reduction in the exercise of judicial discretion resulting from the Sentencing Guidelines and the imposition of mandatory minimum sentences increased the impact of any disparity at these earlier stages in the criminal process. Reduced discretion for judges at the end of the process magnifies the importance of decisions made by the prosecutor, probation office, and law enforcement officials. Since the sentence will be determined by what is proven by a preponderance of the evidence under the Guidelines, the prosecutor exerts far more influence over the sentence than she did pre-Guidelines.
Similarly, the offender’s sentence will directly reflect any disparity between probation officers, because they are the “Guidelines experts” who initially advise the judge of the facts of the case and how the Guidelines apply to these facts.

Under the Guidelines, law enforcement officials also wield more influence over final sentences. Many Guidelines sentences (including narcotics sentences and sentences for all crimes with monetary losses) depend upon the measurable quantities involved in the offense. In many investigations of such crimes, governmental authorities exert substantial control over the quantity that will be used to calculate the defendant’s sentence under the Guidelines. For instance, in narcotics investigations, the undercover agent often determines the amount of drugs either purchased from or sold to a putative defendant. Her decision as to the quantity to attempt to buy from or sell to the target offender will play a large role in determining the ultimate sentence under the Guidelines.63

By giving prior actors (law enforcement officials, probation officers, and prosecutors) more influence over the ultimate sentence, the Guidelines provide opportunities for these earlier actors to pursue their own agendas that did not exist pre-Guidelines. If these prior actors vary in their willingness to engage in manipulative tactics aimed at achieving a higher or lower Guidelines sentence for particular defendants, prosecutorial sentencing disparity will actually have been increased by the Guidelines. Pre-Guidelines, any disparity resulting from these practices was comparatively smaller because the judge and the parole board exercised overwhelming control over the sentence.

In addition, mean sentence length has substantially increased under the Guidelines.
Critics of this increase may argue that any reduction in judicial sentencing disparity was achieved primarily by reducing the frequency of sentences that, pre-Guidelines, were relatively lenient64 and that the result is a plethora of long sentences disproportionate to the crime and the offender. In this respect the Guidelines may be contrasted with a regime that successfully reduced disparity without dramatically increasing sentence length.

63 See, for instance, United States v. Giles, 768 F. Supp. 101 (S.D.N.Y.), aff’d, 953 F.2d 636 (2d Cir. 1991), cert. denied, 503 U.S. 949 (1992).

64 This result may reflect legislative intent. Sponsors of the Sentencing Reform Act included both liberals primarily concerned with disparity in sentencing and conservatives primarily concerned with leniency in sentencing.

Finally, we note that there are other consequences of the Guidelines besides their effect on sentencing disparity and sentence length. Judges have complained that the Guidelines rules themselves are as arbitrary as any exercise of judicial discretion may have been prior to the Guidelines and that their arbitrariness and their extraordinary complexity have made the sentencing process incomprehensible and inaccessible to victims, defendants, and the general public.65

VIII. Conclusion

Our study indicates that the Guidelines (and concomitant statutory minimum sentences) have been successful in reducing interjudge nominal sentencing disparity. To the extent that this was the central goal of the Sentencing Reform Act of 1984, Congress achieved it. The Guidelines have reduced the net variation in sentence attributable to the happenstance of the identity of the sentencing judge.
The expected difference in the sentence lengths of two judges receiving comparable caseloads was 16 to 18 percent in the pre-Guidelines period of 1986–87. Comparing interjudge disparity before and after the Guidelines, we find that this measure declined substantially, with estimates of the expected difference ranging between 8 and 13 percent during 1988–93.

Unfortunately, the very success of the Guidelines in reducing interjudge disparity by constraining judicial discretion may have exacerbated the impact and the degree of disparity at earlier stages of the criminal justice process, through the elimination of parole and the severe reduction in the judiciary’s ability to compensate for interactor disparity earlier in the criminal justice process. Also, although our empirical results show large changes in interjudge nominal sentencing disparity, we have not measured disparity in time served. Disparity in time served and disparity among decision makers earlier in the criminal justice process are both subjects in need of further research.

We conclude by noting that elimination of disparity is only one objective of a just sentencing system. Other commentators have argued that the present regime has purchased a reduction in interjudge sentencing disparity at the price of undue severity in sentences, undue uniformity of those sentenced,66 and unwarranted complexity.67 Even if disparity from all sources (judges, prosecutors, law enforcement agents, probation officers, and so on) could be eliminated, the result would not necessarily be a just or fair sentencing system. In particular, implementation of the Sentencing Guidelines and statutory minimum sentences has led to complaints of undue uniformity in sentencing. If a sentencing regime fails to take account of characteristics of the offense or of the offender that are believed to be relevant to a just sentence, then its results may be unwarranted even if there is no measurable disparity because of the identity of particular decision makers.

65 See Stith & Cabranes, supra note 4, at 78–103.

66 See Lott, supra note 10; Karpoff & Lott, supra note 10. These critics have argued that the Guidelines have wrongly eliminated sentencers’ ability to consider the reputational effects of a conviction for white-collar and corporate offenders. They argue that as a result of this inability to consider significant extralegal sanctions, variation in total penalties (incarceration, fines, reputational penalties, and lost earnings) has increased.

67 Albert Alschuler, The Failure of Sentencing Guidelines: A Plea for Less Aggregation, 58 U. Chi. L. Rev. 901 (1991); Stephen Schulhofer, Assessing the Federal Sentencing Process: The Problem Is Uniformity, Not Disparity, 29 Am. Crim. L. Rev. 833 (1992); Daniel Freed, Federal Sentencing in the Wake of the Guidelines: Unacceptable Limits on the Discretion of Sentencers, 101 Yale L. J. 1681 (1992).

Bibliography

Alschuler, Albert W. “The Failure of Sentencing Guidelines: A Plea for Less Aggregation.” University of Chicago Law Review 58 (1991): 901–51.
Austin, William, and Williams, Thomas A., III. “A Survey of Judges’ Responses to Simulated Legal Cases: Research Note on Sentencing Disparity.” Journal of Criminal Law and Criminology 68 (1977): 306–10.
Bartolomeo, John, et al. Sentence Decision Making: The Logic of Sentencing Decisions and the Extent and Sources of Sentence Disparity. Washington, D.C.: U.S. Department of Justice, 1981.
Bowman, Francesca D. “Probation Officers Advisory Group Survey.” Federal Sentencing Reporter 8 (1996): 303–13.
Brown, Joe. “Quo Vadis? What Congress and the Department of Justice Should Do in Response to the Justice Department’s Analysis of Non-violent Drug Offenders with Minimal Criminal Histories.” Federal Sentencing Reporter 7 (1994): 25–27.
Cameron, A. Colin, and Trivedi, Pravin. Regression Analysis of Count Data. New York: Cambridge University Press, 1998.
Casper, Jonathan D. “Determinate Sentencing and Prison Crowding in Illinois.” University of Illinois Law Review, 1984: 231–52.
Chib, Siddhartha, et al. “Posterior Simulation and Bayes Factors in Panel Count Data Models.” Journal of Econometrics 86 (1995): 33–54.
Cook, Beverly Blair. “Sentencing Behavior of Federal Judges: Draft Cases.” University of Cincinnati Law Review 42 (1973): 597–634.
Davis, Kenneth Culp. Discretionary Justice. Baton Rouge: Louisiana State University Press, 1969.
Diamond, Shari S., and Zeisel, Hans. “Sentencing Councils: A Study of Sentencing Disparity and Its Reduction.” University of Chicago Law Review 43 (1975): 109–49.
Edmunds, Robert H. “Guidelines Sentencing and Department of Justice Policies under the Reagan-Bush Administrations.” Federal Sentencing Reporter 6 (1994): 306–9.
Evans, Gwynne. Practical Numerical Integration. New York: John Wiley & Sons, 1993.
Frankel, Marvin E. “Lawlessness in Sentencing.” University of Cincinnati Law Review 41 (1972): 1–54.
Frankel, Marvin E. Criminal Sentences: Law without Order. New York: Hill & Wang, 1973.
Freed, Daniel J. “Federal Sentencing in the Wake of the Guidelines: Unacceptable Limits on the Discretion of Sentencers.” Yale Law Journal 101 (1992): 1681–1784.
Gaudet, Frederick J. “The Differences between Judges in the Granting of Sentences of Probation.” Temple Law Quarterly 19 (1933): 471–84.
Gaudet, Frederick J., et al. “Individual Differences in the Sentencing Tendencies of Judges.” Journal of Criminal Law and Criminology 23 (1933): 811–18.
Gurmu, Shiferaw, et al. “Semiparametric Estimation of Count Regression Models.” Journal of Econometrics 88 (1999): 123–50.
Hausman, Jerry, et al. “Econometric Models for Count Data with an Application to the Patents-R&D Relationship.” Econometrica 52 (1984): 909–38.
Heaney, Gerald H. “The Reality of Guidelines Sentencing: No End to Disparity.” American Criminal Law Review 28 (1991): 161–231.
Heumann, Milton. “Empirical Questions and Data Sources: Guideline and Sentencing Research in the Federal System.” Federal Sentencing Reporter 6 (1993): 15–18.
Hofer, Paul. Letter to the author. December 4, 1997.
Hofer, Paul, et al. “The Effect of the Federal Sentencing Guidelines on Inter-judge Sentencing Disparity.” Unpublished manuscript. U.S. Sentencing Commission, November 1998.
Hoover, J. Edgar. “The Dire Consequences of the Premature Release of Dangerous Criminals through Probation and Parole.” F.B.I. Law Enforcement Bulletin 27 (1958): 1–2.
Hopkins, Alec. “Is There a Class Bias in Criminal Sentencing?” American Sociological Review 42 (1977): 176–77.
Howard, Joseph C. “Racial Discrimination in Sentencing.” Judicature 59 (1975–76): 121–25.
Johnson, Molly Treadway, and Gilbert, Scott. The U.S. Sentencing Guidelines: Results of the Federal Judicial Center’s 1996 Survey. Washington, D.C.: Federal Judicial Center, 1997.
Karpoff, Jonathan M., and Lott, John R., Jr. “Why the Commission’s Corporate Guidelines May Create Disparity.” Federal Sentencing Reporter 3 (1990): 140–41.
Lambert, Diane. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics 34 (1992): 1–14.
Lott, John, and Bronars, Stephen. “Time Series Evidence on Shirking in the U.S. House of Representatives.” Public Choice 76 (1993): 125–49.
Lott, John R., Jr. “Do We Punish High Income Criminals Too Heavily?” Economic Inquiry 30 (1992): 583–608.
Morris, Norval. The Future of Imprisonment. Chicago: University of Chicago Press, 1974.
Mullahy, John. “Specification and Testing of Some Modified Count Data Models.” Journal of Econometrics 33 (1986): 341–65.
Nagel, Ilene H., and Schulhofer, Stephen J. “A Tale of Three Cities: An Empirical Study of Charging and Bargaining Practices under the Federal Sentencing Guidelines.” Southern California Law Review 66 (1992): 501–66.
Newman, Donald J. “Court Intervention in the Parole Process.” Albany Law Review 36 (1972): 257–304.
Newman, Jon O. “Foreword: Parole Decision-Making and the Sentencing Process.” Yale Law Journal 84 (1975): 812–13.
Newman, Jon O. “A Better Way to Sentence Criminals.” American Bar Association Journal 63 (1977): 1562.
Partridge, Anthony, and Eldridge, William B. The Second Circuit Sentencing Study: A Report to the Judges. Washington, D.C.: Federal Judicial Center, 1974.
Payne, Abigail. “Does Inter-judge Disparity Really Matter? An Analysis of the Effects of Sentencing Reforms in Three Federal District Courts.” International Review of Law and Economics 17 (1997): 337–66.
Schulhofer, Stephen. “Assessing the Federal Sentencing Process: The Problem Is Uniformity, Not Disparity.” American Criminal Law Review 29 (1992): 833–73.
Searle, Shayle, et al. Variance Components. New York: John Wiley & Sons, 1992.
Seymour, Whitney North. “1972 Sentencing Study for the Southern District of New York.” New York State Bar Journal 45 (1973): 163–71.
Stith, Kate, and Cabranes, José A. Fear of Judging: Sentencing Guidelines in the Federal Courts. Chicago: University of Chicago Press, 1998.
Taha, Ahmed. “The Effect of the Federal Sentencing Guidelines on the Disposition of Criminal Cases.” Unpublished manuscript. Washington, D.C.: U.S. Department of Justice, August 1998.
Trott, Stephen S. Letter to Hon. William W. Wilkins, Jr. (April 7, 1987), reprinted in Federal Sentencing Reporter 8 (1995): 196–98.
Trott, Stephen S. Letter to Chairman of U.S. Sentencing Commission (November 9, 1994), reprinted in Federal Sentencing Reporter 8 (1995): 199.
U.S. Attorney General. Annual Report. Washington, D.C.: U.S. Government Printing Office, 1940.
U.S. Board of Parole. Biennial Report. Washington, D.C.: U.S. Board of Parole, 1970–72.
U.S. House of Representatives. Hearings on Corrections, Federal and State Parole System before Subcommittee No. 3 of the House Committee on the Judiciary. 92d Cong., 2d Sess. (1973).
U.S. Senate. Of Prisons and Justice. U.S. Senate Document No. 70, 88th Cong., 2d Sess. (1964).
U.S. Senate. Report No. 225. 98th Cong., 1st Sess. (1984). Reprinted in U.S. Code Congressional and Administrative News 56 (1984).
U.S. Senate. Committee on the Judiciary. Reform of the Federal Criminal Laws: Hearings on Section 1347 before the Subcommittee on Criminal Laws and Procedures. 95th Cong., 1st Sess. (1977): 8580, 8995.
U.S. Sentencing Commission. Annual Report. Washington, D.C.: U.S. Sentencing Commission, 1990.
U.S. Sentencing Commission. The Federal Sentencing Guidelines: A Report on the Operation of the Guidelines System and Short-Term Impacts on Disparity in Sentencing, Use of Incarceration, and Prosecutorial Discretion and Plea Bargaining. Washington, D.C.: U.S. Sentencing Commission, 1991.
U.S. Sentencing Commission. Special Report to Congress: Mandatory Minimum Penalties in the Federal Criminal Justice System. Washington, D.C.: U.S. Sentencing Commission, 1991.
U.S. Sentencing Commission. Annual Report. Washington, D.C.: U.S. Sentencing Commission, 1995.
U.S. Sentencing Commission. Sourcebook of Federal Sentencing Statistics. Washington, D.C.: U.S. Sentencing Commission, 1997.
von Hirsch, Andrew. Doing Justice: The Choice of Punishments. New York: Hill & Wang, 1976.
Waldfogel, Joel. “Aggregate Inter-Judge Disparity in Sentencing: Evidence from Three Districts.” Federal Sentencing Reporter 4 (1991): 151–54.
Waldfogel, Joel. “Inter-judge Disparity in Federal Sentencing: Evidence from Three Federal Districts, 1984–90.” Unpublished manuscript. New Haven, Conn.: Yale University, Department of Economics, 1992.
Wheeler, Stanton, et al. Sitting in Judgment: The Sentencing of White-Collar Criminals. New Haven, Conn.: Yale University Press, 1988.
Wicker, Tom. “Judging the Judges.” New York Times. February 6, 1976, p. A29.