Report of the Expert Panel on the Impact of the ISTEP+ Scoring Issue
Edward Roeber, Derek Briggs and Wes Bruce
December 21, 2015

Summary

An expert panel's independent review of data provided by CTB finds no evidence that students were erroneously given a lower score on the Spring 2015 ISTEP+ Writing tests, the first part of the two-part E/LA assessments. The State Board of Education (SBOE) independent expert panel was composed of Derek Briggs, Professor, University of Colorado; Wes Bruce, Consultant; and Edward Roeber, Assessment Director, Michigan Assessment Consortium. The panel analyzed the anonymous allegation that a software glitch caused CTB to erroneously assign a lower score to the Writing assessments. In response to this allegation, CTB asserted that the software glitch in question was extremely difficult to reproduce, had no effect on student scores, was addressed through a procedural change when it was brought to the attention of scoring management, and that the software was updated to prevent the glitch from recurring eight days after the issue was brought to CTB's attention.

The panel's analyses included, among other things, a comparison of: (1) the percentage of students receiving high scores (5s and 6s) on the Writing portion of the ISTEP+ E/LA assessment before and after the software problem was fixed, and (2) the percentage of students receiving identical scores on the two scoring dimensions of the ISTEP+ Writing prompts before and after the software problem was fixed. These were the areas where a glitch, if it occurred, would have had the most pronounced impact. The panel found no evidence of changes in student scores on the Writing section of the ISTEP+. Based on these analyses, the expert panel also believes that this issue did not have an impact on the scores in the remaining parts of the ISTEP+ assessments.

Introduction

On Sunday, December 13, 2015, the Indianapolis Star newspaper ran a story based on an anonymous letter it received alleging that a glitch that occurred during the scoring of the 2015 ISTEP+ assessments had resulted in erroneous scores being given to students on both the ISTEP+ mathematics and English language arts assessments. This newspaper story resulted in additional efforts to investigate the situation to determine whether or not such a glitch did occur and, if it did, how many students' scores were affected. The purpose of this report is to summarize the steps taken to investigate this situation, the data examined, and the conclusions drawn.

The anonymous letter writer indicated that scorers were encouraged to enter the two scores using number keypads attached to the computers, since this would be more efficient than using pull-down menus. The scoring software was set so that the scorer would enter the first score, the computer would then move on to show where the second score was to be entered, and the scorer would enter the second score. The computer would then move on to the next student to be scored. The anonymous letter also indicated that the glitch was first reported on April 22, 2015, eight days after the scoring of students' responses began. According to the letter, CTB was apprised of the issue and scorers were told not to use the number keypad but instead to use the mouse and drop-down menus for score entry. The letter indicated that a meeting with scoring supervisors was held on April 30, 2015. CTB was asked, according to the anonymous letter writer, about re-scoring all of the student responses already scored during these eight days.
CTB staff indicated that the responses would not be rescored.

ISTEP+ Scoring

A number of written-response mathematics and English language arts ISTEP+ assessment items require scorers to score each item on two dimensions. For example, in ELA the writing prompt at each grade is scored first on a 6-point writing rubric and then on a 4-point grammar rubric, and in mathematics the problem-solving items at each grade are scored using a 2- or 3-point content rubric and a corresponding 2- or 3-point process rubric. Samples of these item types can be found at http://www.doe.in.gov/assessment/istep-grades-3-8.

During scoring, two checks were made on the reliability and validity of scoring, respectively. First, a random sample of 5% of student responses was read by a second reader. These so-called "read-behinds" are performed by scoring leaders and serve as a check that scorers are rating responses consistently with one another. The second check gives scorers expert-pre-scored responses at pre-determined intervals, such as after each scorer has scored 25 responses. These checksets are embedded among the responses being scored and appear like all other responses. Failure to score a checkset correctly creates an immediate flag for the scoring supervisor, who may review the rubric with the scorer, re-train the scorer if this recurs, or even dismiss the scorer if this occurs frequently. The checksets serve to assure that scorers are scoring responses validly, according to the criteria of the scoring rubrics and the sample student responses on which they were trained and certified. These are quality control steps typically taken during the scoring of student written responses.

It was alleged in the anonymous letter that it was possible to enter the two scores on a student's essay so quickly that the second score not only was entered into the second field on the computer but also replaced the first score. If true, this would result in the same score being recorded on both dimensions for each response on which the glitch occurred. And, because the range of possible scores on the second dimension of the Writing assessments was only 0-4, fewer students would receive high scores (5s and 6s) on the first dimension of the Writing assessments.

Steps Taken by CTB and IDOE to Investigate

The same anonymous letter was received by the Indiana Department of Education (IDOE) on November 25, 2015 and forwarded to CTB for response on November 30, 2015. By the time of the newspaper story, the IDOE had already begun an investigation of the allegation through the ISTEP+ program contractor, CTB. The director of the ISTEP+ program had sent CTB a number of questions and requests for data that might be used to investigate the situation. On behalf of CTB, Ellen Haley responded to the anonymous letter on December 8, 2015, indicating that CTB was first apprised of the issue on April 22, 2015, and that the issue was brought to the attention of scoring management the same day. Ms. Haley writes: "[S]coring management immediately looked into the keypad issue but had difficulty reproducing it. Only when the scoring director, after many tries, entered two scores in very rapid succession, before the second trait screen had fully loaded, did the issue occur. The director had to enter the two scores nearly simultaneously to cause the override."
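The behavior CTB describes is a last-write-wins race between the keystroke and the screen transition. The following toy model is purely illustrative; it is not CTB's PEMS code, and the class, method, and field names are invented solely to show how a score entered before the trait transition completes could overwrite the first score, producing the (4, 2) to (2, 2) pattern described later in the anonymous letter.

```python
# Toy model of the reported "last write wins" keypad overwrite.
# Illustrative only: this is NOT CTB's PEMS code; all names are invented.
class ScoreEntryForm:
    def __init__(self):
        self.scores = {"trait_1": None, "trait_2": None}
        self.active_field = "trait_1"
        self.transition_complete = True  # becomes False while the trait-2 screen loads

    def enter_score(self, value):
        if self.active_field == "trait_1" and self.scores["trait_1"] is None:
            self.scores["trait_1"] = value
            self.transition_complete = False  # UI begins moving to trait 2
        elif not self.transition_complete:
            # Keystroke arrives while trait 1 is still on screen:
            # the value is recorded for both traits (the alleged overwrite).
            self.scores["trait_1"] = value
            self.scores["trait_2"] = value
        else:
            self.scores[self.active_field] = value

    def finish_transition(self):
        self.active_field = "trait_2"
        self.transition_complete = True


# Careful entry: the scorer waits for the trait-2 screen to load.
form = ScoreEntryForm()
form.enter_score(4)
form.finish_transition()
form.enter_score(2)
print(form.scores)  # {'trait_1': 4, 'trait_2': 2}

# Rapid entry: the second score lands before the transition finishes.
form = ScoreEntryForm()
form.enter_score(4)
form.enter_score(2)
print(form.scores)  # {'trait_1': 2, 'trait_2': 2}
```

In a model of this kind the overwrite is visible to the scorer (both fields show the second value), which is consistent with CTB's statement that an evaluator who entered scores too quickly would see the scores change and could re-enter both.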
The letter gives a detailed chronology of events in the April 22-30, 2015 period, and indicates that a fix for the scoring issue was released on May 1 and began to be used on May 2, 2015. Ms. Haley concludes her letter with this statement: "In sum, my investigation confirmed the scoring director's opinion that the keypad occurrence did not impact the scoring. If an evaluator entered successive scores too quickly, he or she would see the overwrite/scores change and could go back and re-enter both scores. As soon as the issue was reported, evaluators were instructed not to use the keypad and to use the mouse for score entries. In addition, quality metrics – check sets, read behinds, inter-rater reliabilities for responses read twice, pattern of "split scores" versus same scores by trait – none of these indicated an issue for student scores. The issue was not a common occurrence, was actually difficult to create in live scoring, and was fixed quickly – both in on-the-floor instructions and then technically, in the PEMS software. Based on CTB's quality control tests, there was no need to rescore any tests as a result of the keypad issue."

Additional information about the actions taken by IDOE and CTB, including a chronology of the steps taken to investigate the scoring issue, is provided in the Appendix to this report.

Steps Taken by Expert Panel to Investigate

Upon learning of this issue on December 13, 2015, the expert panel suggested several ways to investigate it. Their questions and requests for data were forwarded to CTB by SBOE staff, and CTB responded promptly with answers to the questions, data sets as requested, and interpretations of the data for expert panel review. The panel started by assembling a chronology of the scoring incident, shown in Figure 1.

Figure 1. Timeline of Investigation of the Scoring Incident

April 8: ISTEP scoring begins
April 22: A "Writing" scorer reports the keypad issue to their supervisor; supervisors are instructed to direct scorers to stop using the numerical keypad
April 30: Regular meeting with all scoring supervisors (ISTEP and other programs); supervisors are reminded that the keypad is not to be used and told that a fix will soon be deployed
May 1: At about 9:30 pm, a software update that eliminated the use of the numerical keypad in the CTB scoring system is deployed
November 25: Anonymous letter received at IDOE
November 30: Anonymous letter sent by IDOE to CTB for response
December 8: CTB responds to IDOE
December 13: Indianapolis Star prints story about ISTEP+ scoring
December 13-21: Expert panel suggests methods to determine whether the glitch occurred systematically and, if so, what impacts it had

During the December 13-20, 2015 period, the expert panel was provided with the data sent to the IDOE as well as additional data requested by the SBOE on behalf of the expert panel. Two types of data were especially important for the expert panel to review: (1) differences in the proportion of student responses given high scores (5s and 6s) on the extended writing prompt before and after the glitch was fixed, and (2) whether the percentage of students given identical first and second scores went down after the glitch was corrected. These two sets of data are important in detecting whether a second score overriding a first score changed the score that a response should have received.
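To make the logic of these two checks concrete, the sketch below simulates, under invented score distributions (not ISTEP+ data), how an overwrite occurring on some fraction of responses would simultaneously depress the rate of trait-1 scores of 5 and 6 and inflate the rate of identical trait-1/trait-2 scores. These are exactly the two quantities the panel compared across time periods.

```python
# Illustrative simulation of the signature an overwrite glitch would leave.
# The score distributions below are invented; they are not ISTEP+ data.
import random

random.seed(0)

def simulate(n_responses, overwrite_rate):
    """Return (% of trait-1 scores that are 5 or 6, % exact trait-1/trait-2 agreement)."""
    high = exact = 0
    for _ in range(n_responses):
        trait1 = random.choices(range(7), weights=[2, 8, 20, 30, 25, 10, 5])[0]  # 0-6 writing rubric
        trait2 = random.choices(range(5), weights=[5, 15, 35, 30, 15])[0]        # 0-4 grammar rubric
        if random.random() < overwrite_rate:
            trait1 = trait2  # the alleged glitch: the second score replaces the first
        high += trait1 >= 5
        exact += trait1 == trait2
    return 100 * high / n_responses, 100 * exact / n_responses

for rate in (0.0, 0.25, 0.5):
    pct_high, pct_exact = simulate(100_000, rate)
    print(f"overwrite rate {rate:.0%}: {pct_high:.1f}% of trait-1 scores are 5/6, "
          f"{pct_exact:.1f}% exact trait-1/trait-2 agreement")
```

If the glitch had been operating widely in Period 1 but not in Period 3, the Period 1 figures would resemble the higher overwrite rates in such a simulation: noticeably fewer 5s and 6s, and noticeably more exact agreement, than in Period 3.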
Three time periods were examined in order to determine the impact of the glitch on scoring:

o Period 1 (beginning of scoring to April 22) – scoring that took place before the glitch was discovered
o Period 2 (April 23 to May 1) – scoring that occurred while scorers had been told not to use their number keypads for data entry but the glitch had not yet been fixed
o Period 3 (May 2 to end of scoring) – scoring that took place after the glitch was corrected by CTB

The expert panel looked at both the mathematics and ELA data supplied to it, but felt that, given the score scales used to score the Writing prompts, this content area was the most likely to show differences if the glitch occurred on a wide-scale basis during scoring. Writing quality is scored on a 0 to 6 point scale, and these scores would have been reduced by the second score, which is given on a 0 to 4 point scale. If the glitch resulted in the first score (which could range up to 6) being replaced by the second score (which could go only as high as 4), then there should be fewer 5s and 6s in Period 1, and perhaps in Period 2 as well (especially if not all scorers switched to scoring with the pull-down menus, as alleged by the anonymous letter writer), when compared to the number of 5s and 6s in Period 3 (when the number keypads were de-activated). Table 1 shows the percentage of high scores for all of the ISTEP+ Writing prompts.

Table 1. Scoring Trends Before, During and After Discovery of Error – All Writing Prompts
(Percent of students receiving the listed rubric score)

Grade  RIB #  Rubric Score  Period 1  Period 2  Period 3  % Period 3 – % Period 1
3      1      5             4.96      5.12      4.64      -0.32
3      1      6             1.08      1.06      0.70      -0.38
4      1      5             4.05      4.16      3.38      -0.67
4      1      6             0.91      0.87      0.30      -0.61
6      1      5             7.28      7.91      7.93       0.65
6      1      6             0.62      0.47      0.39      -0.23
7      1      5             5.96      8.38      3.29      -2.67
7      1      6             0.71      0.79      0.33      -0.38
7      2      5             5.42      4.88      4.08      -1.34
7      2      6             1.21      0.36      0.20      -1.01

Table 1 shows that, with one exception (the grade 6 score of 5), a slightly smaller proportion of 5s and 6s was given in Period 3 than in Period 1. Had the glitch been depressing scores, Period 1 would have shown fewer 5s and 6s than Period 3, not more. There appears to be little or no evidence that the glitch, if it did occur, caused scores to be lowered in Period 1.

The expert panel also used a second type of data to investigate whether the glitch impacted students' scores. Had the glitch been occurring on a wide-scale basis during Period 1, not been fully corrected in Period 2 (because some scorers were continuing to use the number keypads in spite of CTB directions not to do so, as alleged by the anonymous letter writer), and then been eliminated in Period 3, the data should also show significantly larger percentages of exact agreement between the two scores given to each Writing response in Period 1 than in Period 3. Table 2 shows the percentage of student responses given identical scores on the two dimensions used to score Writing responses during the three time periods (before and after the glitch was discovered and corrected). The comparisons shown are for assessment items in which significant numbers of student responses were scored during each of the three scoring periods.
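For reference, both summaries (the Table 1 high-score percentages and the Table 2 exact-agreement percentages) can be produced from a response-level extract of trait scores with a few grouped aggregations. The pandas sketch below is illustrative only; the column names (item, period, trait1, trait2) are assumptions made for the example, not the names used in CTB's data files.

```python
# Minimal sketch of the two summaries, assuming a response-level extract with
# hypothetical columns: item, period (1/2/3), trait1 (0-6), trait2 (0-4).
import pandas as pd

def summarize(scores: pd.DataFrame) -> pd.DataFrame:
    """Percent of trait-1 scores of 5/6 and percent of exact trait-1/trait-2
    agreement, by item and scoring period."""
    flags = scores.assign(
        pct_5s_6s=scores["trait1"] >= 5,                       # Table 1 quantity
        pct_exact_match=scores["trait1"] == scores["trait2"],  # Table 2 quantity
    )
    return (flags.groupby(["item", "period"])[["pct_5s_6s", "pct_exact_match"]]
                 .mean()
                 .mul(100)
                 .unstack("period"))

# Tiny made-up extract, just to show the shape of the output.
scores = pd.DataFrame({
    "item":   ["7 Writing RIB 1"] * 6,
    "period": [1, 1, 1, 3, 3, 3],
    "trait1": [5, 4, 3, 6, 4, 2],
    "trait2": [4, 4, 3, 4, 3, 2],
})
print(summarize(scores))
```

Differencing the Period 3 and Period 1 columns of such a summary reproduces the right-hand comparison columns of Tables 1 and 2.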
Table 2. Percent of Students Receiving Duplicate Scores Before, During and After Discovery of Error
(Percent exact agreement between first and second scores)

Writing Task     Time Period 1  Time Period 2  Time Period 3  Period 3 – Period 1
3 Writing RIB 1  47.35          43.29          48.72           1.37
4 Writing RIB 2  42.38          27.91          46.27           3.89
6 Writing RIB 2  38.94          50.00          50.94          12.00
7 Writing RIB 1  36.34          48.56          33.00          -3.34
7 Writing RIB 2  35.86          36.52          35.49          -0.37

As can be seen, there is virtually no pattern of higher exact agreement between the first and second dimension scores in Period 1 versus Period 3 across the five prompts for which such comparisons are possible. Three prompts showed more exact agreement between the first and second dimension scores in Period 3 than in Period 1, and two prompts showed less exact agreement in Period 3 than in Period 1. None of these differences is large, however, and they could be accounted for by differences in the students scored in each time period. Since no clear pattern emerges, there is no evidence here, either, of a glitch that changed students' scores.

Conclusions

The IDOE, CTB, SBOE, and the expert panel took steps to investigate this situation. Unfortunately, because scoring of the ISTEP+ tests concluded more than six months earlier, some of the checks that might have been made while the glitch was supposedly occurring are no longer possible. In addition, examining large data files cannot confirm that each individual student was scored accurately; it can only reveal whether large effects of the glitch are discernible. Finally, although the expert panel investigated data from three periods (before the glitch was discovered, while scorers were directed not to use the number keypads, and after the number keypads were disabled), the lack of comparability among the students scored during each time period hampers the cross-period analyses.

With these disclaimers aside, however, the expert panel did not see evidence, in the form of either reduced scores or higher exact agreement among scores for the same responses, that supports the allegations of the anonymous letter writer. There does not appear to be a discernible impact of the alleged scoring glitch on the overall scores given to students' responses.

APPENDIX

Chronology of Steps Taken to Investigate the ISTEP+ Scoring Issue

As summarized in our report, a number of steps were taken by IDOE, CTB, the SBOE, and the expert panel to investigate the allegations contained in the anonymous letter sent to both the Indianapolis Star newspaper and the Indiana Department of Education (IDOE). This chronology of the steps taken is provided to assure educators and the public that the allegations were carefully and thoroughly investigated, using all available data. Several pieces of information are cited here:

o Anonymous letter sent to the IDOE and the Indianapolis Star (attached)
o December 8 letter from CTB, which was a response to the anonymous letter sent to it on November 30 by IDOE (excerpted in the expert panel report and attached)
o December 15 CTB response to the second communication from IDOE, dated December 10
o December 17 CTB response to the expert panel questions sent on December 13
o December 18 CTB response transmitted with additional data provided to the expert panel
o CTB responses attached to the December 18 transmittal containing CTB interpretations of the additional data provided to the expert panel

Anonymous Letter

This letter is attached.

December 8 Letter from CTB to the IDOE

This letter is attached.
December 15 CTB Response to December 10 E-Mail from IDOE

Based on the December 8 letter from CTB, IDOE responded on December 10, 2015 via e-mail with a series of additional questions for CTB. CTB responded to these questions on December 15, 2015; each question is followed by CTB's answer:

· "The number of test items that were scored between April 13 and April 22 (all dates inclusive). Answer: 72 items had at least one student response scored. See "ISTEP Part 1" attachment.
· The number of students whose tests were scored between April 13 and April 22 (all dates inclusive). Answer: 227,373 students had at least one item scored in this time frame. See "ISTEP Part 1" attachment.
· The schools and school corporations that had their tests scored between April 13 and April 22 (all dates inclusive). Answer: 1,503 schools had at least one item scored in this time frame. See "ISTEP Part 1" attachment.
· The number of test items that were scored between April 23 and April 30 (all dates inclusive). Answer: 63 items had at least one student response scored. See "ISTEP Part 1" attachment.
· The number of students whose tests were scored between April 23 and April 30 (all dates inclusive). Answer: 354,059 students had at least one item scored in this time frame. See "ISTEP Part 1" attachment.
· The schools and school corporations that had their tests scored between April 23 and April 30 (all dates inclusive). Answer: 1,822 schools had at least one item scored in this time frame. See "ISTEP Part 1" attachment.
· Within the date ranges in question (April 13-22 and April 23-30), were the test items being scored narrowly focused to one item type (i.e., writing prompt)? Please identify the item type(s) being scored at that time. Answer: All types of constructed response items were scored during this time period (Math, ELA CR, ELA ER and Writing). See "ISTEP Part 1" attachment.
· What specific assurances can CTB provide that the tests scored between April 13 and April 30 were scored accurately? Are these assurances based on statistical analysis after the fact, real-time data generated during testing, current review of the actual tests themselves, or some other manner? Answer: See six quality reports attached. Based on these reports, generated daily during scoring, with three days highlighted here for comparison purposes, and a careful review of all quality data, we see no evidence of any impact from the keypad occurrence and can assure Indiana that student scores are correct.
  - Each day, scoring quality was monitored to ensure valid and reliable scoring was taking place. As part of the quality monitoring process, the following statistics were captured for each item every day that item was being scored:
    o Validity: pre-scored responses sent at random to readers to test adherence to scoring rubrics and guidelines
    o Inter-rater reliability: 5 percent of all documents were scored twice, allowing tracking of reader agreement rates
    o Score point distribution: percent of responses scored at each score point
  - As part of the validity process, readers were given immediate feedback if they mis-scored a response. This included showing the reader the scores they provided as well as the correct scores. Had there been an issue with the first trait score not being captured correctly, this would have been noted immediately by the reader. With hundreds of readers scoring, thousands of validity responses were being scored each day.
  - Validity statistics for items that were being scored prior to the issue being resolved were in the expected range, and were comparable to the validity statistics for items scored after the issue was corrected.
  - Inter-rater reliability statistics for the first trait of items do not indicate an issue with reader agreement, which we would see if first trait scores were being overwritten. IRR stats for items scored prior to the issue were comparable to similar items that were scored after the issue was corrected.
  - Score point distributions for multi-trait items do not indicate issues with the first trait being overwritten by the 2nd. While split scores are less common in the writing items, and thus the SPD of the 2 traits align (this is the case both for items scored before the issue was corrected as well as after), this is expected. For math, however, SPD of the 2 traits are relatively independent, and this is reflected in both the items that were scored prior to the issue being corrected as well as the items scored after.
  - Also as a note, when the keys were hit in such a way to make the defect occur, the score change is visible on screen. No reader noted this occurring prior to 4/22 despite hundreds of thousands of item reads that were completed to that point, indicating this was not a common occurrence."

December 17 CTB Response to Expert Panel Questions from December 13

Several questions were posed by the expert panel on December 13, 2015. Both the expert panel questions and the responses from CTB (which were received on December 17, 2015) are shown below:

Question: On the QA charts, which items are the items in question?
Answer: ISTEP items with 2 dimensions:
a. All Math items (score ranges of 0-2/0-2 or 0-3/0-3)
b. All Writing items (score range of 1-6/1-4)
c. All ELA ER - RIB 4 and RIB 8 for each grade (score range of 1-4/1-4)

Question: Which content area was the rater who reported the issue scoring?
Answer: The supervisor that reported the issue was overseeing scoring of Writing.

Question: Definitions of terms that are not defined, such as "red flags" or "yellow flags."
Answer: Red flags and yellow flags – a reader will, on average, take between 10 and 15 checksets per day. Each day, the reader is expected to maintain a certain level of exact agreement against the key scores for the checkset responses that they score. A reader falling below this standard receives a red flag, which results in corrective action being taken, up to and including removal from scoring for that item and resetting of responses. A yellow flag is given if the reader is above the required quality standard, but below the standard required for qualification on the item.

Question: Number of dimensions scored on each of the open-ended items.
Answer: All items on ISTEP are either single-dimension or 2-dimension items. The 2-dimension items are listed in #1 above.

Questions with answers to come later today, or Friday morning:

· The pre-disabling and post-disabling data on levels of exact agreement among the various dimensions on items where there are two or more dimensions, for all of the items.
· As I understand the issue, if the second score over-rides the first and shows up as both the first and the second score, there should be a higher level of exact agreement for the scores received by students who were scored prior to disabling the key pads versus those scored after the key pads were disabled. This could be a cumulative score report from scoring prior to disabling the key pads (what was the official date of this, May 3 or May 4?)
and a cumulative one for scoring done after the key pads were disabled (not cumulative through all scoring). I did not see this information, but perhaps I am not reading the reports correctly.
· The RIB reports seem to show exact agreement between scorers - either for the check sets or the read-behinds. This is different from the information that I requested and is only tangentially related to the intra-student score agreement that I am interested in.
· The two windows provided are from the time the issue was identified and before, and the time the issue was identified until the fix was put into place. I/we need to see how this compares with AFTER the fix was in place. Same numbers provided, but for AFTER the issue was corrected.
· The comparison (number and percentage) of 5s and 6s awarded in these windows and for all three windows.
· Scores (or score distribution) up to the keypad being disabled ("was released on May 1, 2015 at 9:30 p.m.") and after -- at a minimum from the beginning of scoring through the last shift on May 1 (Friday) and from May 4 to the end of scoring. Need it only for live items, even week by week.
· Read-behind data on items with identical scores (4/4, 3/3, 2/2, 1/1)."

December 18 CTB Response with Additional Data Provided to the Expert Panel

The following response from CTB was forwarded to the SBOE staff and the expert panel on December 18, 2015:

"Please find attached the data to answer your remaining questions, noted below. You requested GA summary data, but the data belongs to that customer, and they did not give me consent to share it with you. I believe the data here and in my previous two emails should answer your questions for Indiana. Our findings on the attached set of data are noted in 1 and 2 at the bottom of this email. We do not see any evidence of the keypad overwrites, and we see no impact on student scores in this or any of the data.

Scores (or score distribution) up to keypad being disabled ("was released on May 1, 2015 at 9:30 p.m.") and after -- at a minimum from beginning of scoring through last shift on May 1 (Friday) and from May 4 to the end of scoring. Need it only for live items, even week by week.

Read behind data on items with identical scores (4/4, 3/3, 2/2, 1/1).

The pre-disabling and post-disabling data on levels of exact agreement among the various dimensions on items where there are two or more dimensions for all of the items. And: As I understand the issue, if the second score over-rides the first and shows up as both the first and the second score, there should be a higher level of exact agreement for the scores students received who were scored prior to disabling the key pads versus those scored after the key pads were disabled. This could be a cumulative score report from scoring prior to disabling the key pads (what was the official date of this, May 3 or May 4?) and a cumulative one for scoring done after the key pads were disabled (not cumulative through all scoring). I did not see this information, but perhaps I am not reading the reports correctly.

The comparison (number and percentage) of 5s and 6s awarded in these windows and for all three windows.

There are 2 major indicators in this particular set of data:

1. Looking at the ISTEP Writing, a similar percentage of responses were given 5's and 6's for trait A in all 3 time periods for which we gathered data (the time before the defect was discovered, the time between discovery and the fix, and the time after the fix).
Since the 2nd trait has a max score of 4, we would see a lower percentage of 5's and 6's for trait 1 had the first trait been overwritten by the trait 2 scores.

2. Looking across Writing, ELA ER and Math, the percentage of responses receiving the same score for trait A and trait B likewise was comparable when measured across the three time periods. Since the effect of the defect would be to artificially increase the number of responses receiving the same score for both traits, we would see a larger percentage of these in the earlier time period had the defect impacted scoring, and we do not see this.

We do not see any evidence of the overwriting occurring. The data is very clean. We do not see any impact on student scores."

December 18 CTB Responses Attached to the Transmittal E-Mail Containing Additional Data Provided to the Expert Panel

Three additional responses to the questions listed above were attached to the December 18, 2015 CTB letter. These responses serve as explanations of the accompanying data files sent to the expert panel, as well as CTB's interpretation of what each data file shows. These explanations are:

"ISTEP_13_5s_6s_Comparison

There are 3 tabs:
"BOS to EOD 4-22" represents data from the start of scoring until the end of the day on 4-22. 4-22 was when the issue was discovered and readers were told to score using the mouse instead of the keyboard.
"BOD 4-23 to EOD 5-1" represents data from scoring from the beginning of the day on 4-23 to the end of the day on 5-1. The fix was put in place after scoring ended on 5-1.
"BOD 5-2 to EOS" represents data from the beginning of the day on 5-2 to the end of scoring. This is scoring that occurred after the issue had been fixed.

RIBNAME represents the name of the item (all items on this report are Writing items).
ITEM represents the item number and data point (all data points on this report are for trait A, the first of the two traits).
ITEMGUID represents the internal item number.
ITEMTRAITGUID represents the internal trait number.
"Score" represents the score given to the trait. This report shows the number of responses given either a 5 or a 6.
Count is the number of responses given the listed score point.
Total count is the number of responses scored in the given time frame.
Percent is the percent of responses scored that were given the score during the given time frame.

Interpreting this report: This report is intended to show how often students on the extended writing essay received scores of 5 or 6 in the three time periods in question. The reason why this is important is that the second trait for the extended writing has a maximum score of 4. So, if the first trait scores were being overwritten due to the defect, we would likely see a lower number of scores of 5 or 6 on the first trait, as a scorer that intended to score a 5-4, for example, would have instead had the score recorded as a 4-4. This would show in the statistics as fewer students receiving scores of 5 and 6 in the time period that the defect was present vs. the number of 5's and 6's given in the period when the defect had been corrected.

Observations on the data in the report: Looking at items which had significant amounts of scoring in more than one time period, we do not see any patterns which indicate fewer 5's and 6's were being given during the period when the defect was present. For example, 4 Writing RIB 2 had 4.05 percent of responses (out of 55597) receive a score of 5 and 0.91 percent of responses scored as a 6 during the first time period.
During the 2nd time period, this was 4.16 percent 5's (out of 20240) and 0.87 percent 6's. During the third time period, we see 3.38 percent 5's and 0.3 percent 6's (out of 1330). The first time period is when we would expect fewer 5's and 6's had the defect been impacting the score data, but we do not see this. The same holds for the other items which have significant amounts of scoring taking place in multiple time periods. There is no indication that fewer 5's and 6's were being given when the defect was present."

"ISTEP_09_Exact_Agreement

There are 3 tabs:
"BOS to EOD 4-22" represents data from the start of scoring until the end of the day on 4-22. 4-22 was when the issue was discovered and readers were told to score using the mouse instead of the keyboard.
"BOD 4-23 to EOD 5-1" represents data from scoring from the beginning of the day on 4-23 to the end of the day on 5-1. The fix was put in place after scoring ended on 5-1.
"BOD 5-2 to EOS" represents data from the beginning of the day on 5-2 to the end of scoring (there is a typo here; the spreadsheet says 5-22 instead of 5-2). This is scoring that occurred after the issue had been fixed.

RIBNAME is the name of the item.
TotalResponseCount is the number of responses scored during the given time period.
ExactResponseCount is the number of responses scored during the given time period where the score for the first trait and the score for the second trait were the same numerical value (0-0, 1-1, 2-2, 3-3 or 4-4).
ExactResponse percent is the percentage of responses scored during the given time period where the score for the first trait and the score for the second trait were the same numerical value.

Interpreting this report: This report shows how often the scores for the 2 traits/dimensions of an item matched. For example, on a Math CR item, the score range for trait A is 0-2 and the score range for trait B is 0-2, so the possible numerical score combinations are 0-0, 0-1, 0-2, 1-0, 1-1, 1-2, 2-0, 2-1 and 2-2. If the defect were impacting scores, we would tend to see more scores where the score for trait A and the score for trait B matched, so this percentage would be higher during the earlier time period when the defect was present.

Observations on the data: We do not see any trends where an item is showing a higher percentage of matching A/B scores in the earlier time periods, thus showing that the defect was not having an impact on the students' scores. To see this, we would look at items which had significant amounts of scoring in multiple time periods, and compare the percent of A/B matching scores. For example, 4 math RIB 2 had 26380 responses scored in the first time period, 33779 in the second time period, and 14116 in the third time period. The percent of matching A/B scores was 38.44 percent in the first time period, 40.90 percent in the second time period, and 38.29 percent in the third time period. Had the defect been impacting scores, we would see a higher percent of matching scores in the first time period of scoring that took place before the defect was noted. This holds true as you look at all of the items scored in the earlier time period. We do not see any trend of a higher percentage of matching A/B scores for scores applied before the defect was noticed or before it was corrected."

"ISTEP_02_ScoreFreq

There are 3 tabs:
"BOS to EOD 4-22" represents data from the start of scoring until the end of the day on 4-22. 4-22 was when the issue was discovered and readers were told to score using the mouse instead of the keyboard.
"BOD 4-23 to EOD 5-1" represents data from scoring on the beginning of the day on 4-23 to the end of the day on 5-1. The fix was put in place after scoring ended on 5-1. "BOD 5-2 to EOS" represents data from the beginning of the day on 5-2 to the end of scoring. This is scoring that occurred after the issue had been fixed. RIBNAME is the name of the item Item is the item number and trait ITEMGUID is the internal item number ITEMTRAIGUID is the internal trait number Zero-Six and A-E - This shows the number of responses scored at each numerical score point and each condition code during the given time frame. Interpreting this report: It is more difficult to use this report to make a claim about the impact of the defect, but the defect, if it were impacting students' scores, would show an impact in the score distributions for trait A (score distributions would be different during the earlier time period when the defect was present). Observations about the data: There is no indication of different score distributions for trait A in the earlier time period vs. the later time periods. 10 Anonymous Letter Sent to IDOE and the Indianapolis Star To Whom it May Concern: A new system was utilized this spring for scoring the.2015 spring test. A major flaw in the system was brought to management?s attention after 8 days of scoring. The fl aw was that the system changed students? scores when utilizing the numerical keypad. Specifically, on those items in which 2 scores were assigned, the second score entered on the numerical key pad would override the first score. For example, if a student was to receive a score of (4, 2) and a score of (4, 2) was entered too quickly when utilizing the numerical keypad, the system would assign a score of 2 to both parts since 2 was the last number entered. ihus the student would receive}; score of (2, 2) rather than (4, 2). This problem was common knowledge among evaluators, team leaders and supervisors, all of whom were very concerned._ A meeting was held a week later on April 30?" to discuss this issue. Mike Conarroe chaired the meeting. Mr. Conarroe made two decisions regarding what was happening. The first decision was that supervisors were to instruct their groups to no longer use the numerical keypad and instead use the mouse. Unfortunately, some evaluators continued to use the keypad anyway out of habit since that is how they were initially taught and had been scoring in this fashion for years. This in fact was always previously encouraged because the method was faster than using the mouse. This bene?tted the evaluators because having high production-numbers helped to ensure that they would be called in to work on other projects. The second decision was that all the students who had potentially been assigned incorrect scores would not be reuscored, because it would put the project behind. My only concern is that students be assigned correct scores. Because of this flaw in the system students scores were obviously compromised. Because the majority of evaluators have always used the keypad, it is safe to say this has had a major impact on the integrity of the results. i certainly hope that you can convince (3TB to rescore these student tests since the stakes on this test are so high. However, because of the decisions that were made, and because it is unlikely that you were ever informed about what occurred, i fear that Mr. Conarroe and CTB Management will not be forthcoming about this matter. if so, i hope that others present in the April 30?" meeting will be. 
Unfortunately, I must communicate this information to you anonymously. Even though I am no longer employed by CTB, I did sign a confidentiality agreement when I was hired.

Sincerely,
2015 CTB Employee

December 8, 2015 CTB Letter Sent to IDOE in Response to Anonymous Letter

Ellen Haley, Executive Vice President, McGraw-Hill Education

December 8, 2015

Michele Walker, Director of Student Assessment
Indiana Department of Education
South Tower, Suite 600
115 W. Washington Street
Indianapolis, IN 46204

Dear Dr. Walker,

I am writing to confirm our discussion from last week about the issues raised in the anonymous letter sent by someone identifying himself or herself as a former CTB employee, received by your office on November 25, 2015 and forwarded to CTB by e-mail on November 30, 2015, concerning CTB's Performance Evaluation Monitoring System (PEMS). At your request, we conducted a thorough review of the allegations raised in the anonymous letter. This letter summarizes our investigation.

During the investigation, I personally interviewed the employees who were involved in spring scoring processes and supporting scoring software. These included: Mike Conarroe, Director of Hand Scoring; Derek Adams, Director of PEMS Software Development; Christy Huggins, Director of Software QA; and Brenda Williams, Sr. Director of Operations. I reviewed relevant emails and other documents from the April 22, 2015 time period.

In sum, as a result of the investigation, I found that there was a very rare, anomalous, temporary keypad issue, which was resolved immediately in the scoring process and fixed quickly in the software. The issue did not affect student scores, as evidenced by our many quality metrics -- check sets, read behinds, inter-rater reliabilities for responses read twice, pattern of "split scores" versus same scores by trait -- all of which are monitored throughout each day of scoring.

While we do not know the identity of the author of the anonymous letter, we suspect that it could have been a temporary scoring supervisor, who was terminated recently. If the author was in fact a temporary scorer or scoring supervisor, he or she would not have had access to the above documents or information.

Below I will summarize the sequence of events and details related to the keypad occurrence:

On April 22, 2015, an evaluator notified his scoring supervisor that when using the numeric keypad to enter trait scores for a student's response, in certain instances, the assigned scores in PEMS were being changed after he entered them. On the same day, the supervisor escalated the issue to scoring management and suggested restricting keypad use until the issue was investigated. Scoring management immediately looked into the keypad issue but had difficulty reproducing it. Only when the scoring director, after many tries, entered two scores in very rapid succession, before the second trait screen had fully loaded, did the issue occur. The director had to enter the two scores nearly simultaneously to cause the override. The director found that when the User Interface was transitioning between the traits, if an evaluator were to enter the second trait score while the first trait was still on the screen, the system would enter the second score for both traits.
But if the evaluator waited for the second trait to finish loading, and the first trait was completely collapsed in the accordion, then the second score would be entered properly and would not change the first score. If the scores were entered with a pause, rather than nearly simultaneously, the scores were entered as expected. Management concluded that the override occurrence was very rare. Readers were trained to read the student response for the first trait and provide a score, then read the same response for the second trait and enter that score. Even if a reader assigned two traits after one reading, normal keyboard techniques and careful score entry by our trained readers would indicate that the very rapid entry was highly unusual.

Despite the infrequency of the override, on April 22, 2015, scoring supervisors and evaluators were immediately notified of the issue and were directed not to use the keypad but to instead use the mouse to enter the trait scores while CTB investigated the cause of the problem and developed a fix. Using the mouse was a straightforward approach; most scorers use the mouse in any case. Scoring management submitted the issue as a defect to our Tier 3 technology team, and it was entered as a defect in the tracking system by Tier 3 on April 24, 2015. On April 28, 2015, the Tier 3 team developed a fix, which disabled the keypad completely, so that the mouse was -- and still is -- the only way to enter a score. The fix for the keypad issue was tested on April 30, 2015 and was released on May 1, 2015 at 9:30 pm. Scoring had the new release for use when scoring resumed the next day.

I also wanted to specifically address your three questions:

1) Was there a change in the scoring system/tool evaluators used to score the Spring 2015 assessment? If yes, please provide specific details regarding this change (when was it implemented, what were the major differences between the new system and the prior system, etc.)

In Spring 2015, CTB used our PEMS (Performance Evaluation Monitoring System) hand scoring software. PEMS was developed over the last few years to replace our EHS (Electronic Handscoring System) and to support the complex content introduced by the new state and common core standards of recent years. PEMS is a significant upgrade to hand scoring capabilities and was first used by CTB in January 2014. EHS was a desktop, client-server system. PEMS is a web-based, distributed scoring system. EHS was best able to handle simple open-ended items and prompts. PEMS was designed to score technology-enhanced items, images, complex content, and to handle the volume of this content for consortia states as well as states like Indiana that adopted more rigorous standards and introduced tests with more involved content.

2) What (if any) issues or problems arose with the scoring of student responses using the scoring system (if the system used for Spring 2015 scoring was not new, please respond to this question based on the existing system as it relates to the allegations in the complaint letter)?

There were scoring interface and backend issues related to the new system, as is to be expected with any fairly new system. Daily meetings between Scoring and Technology senior staff were held, as they are every spring, to proactively monitor system usage and to discuss and resolve possible technical topics. We had a "war room" to monitor and check for issues and to resolve any issues detected.
This is the same process used every scoring season for all systems.

3) Was there a meeting that took place on April 30, 2015, (or any other date) to discuss issues with the scoring system used for the Spring 2015 assessment? If so, please provide a copy of the agenda and specific details regarding what was discussed/shared during this meeting as it relates to the allegations in the complaint letter.

The author of the anonymous letter referenced a meeting on April 30, 2015. I confirmed that there was a site meeting with the scoring supervisors on that date. The meeting was an informational, non-project-specific update for our supervisors, working on all projects, not just PEMS projects, at the site. The meeting included approximately 20-22 temporary supervisors and 1-3 CTB regular employees. The April 30, 2015 meeting was not specifically called to discuss the keypad issue, but an update on the keypad issue was one of several topics covered. Specifically, supervisors were informed that the fix was in progress and that evaluators were to continue to use the mouse. There was no written agenda prepared for the meeting. In addition to the meeting on April 30, 2015, there were daily quality meetings with all the supervisors, both the day and the evening supervisors. The day supervisors met in the morning and the evening supervisors met in the late afternoon.

In sum, my investigation confirmed the scoring director's opinion that the keypad occurrence did not impact the scoring. If an evaluator entered successive scores too quickly, he or she would see the overwrite/scores change and could go back and re-enter both scores. As soon as the issue was reported, evaluators were instructed not to use the keypad and to use the mouse for score entries. In addition, quality metrics -- check sets, read behinds, inter-rater reliabilities for responses read twice, pattern of "split scores" versus same scores by trait -- none of these indicated an issue for student scores. The issue was not a common occurrence, was actually difficult to create in live scoring, and was fixed quickly -- both in on-the-floor instructions and then technically, in the PEMS software. Based on quality control tests, there was no need to rescore any tests as a result of the keypad issue.

Sincerely,
Ellen Haley