Observations on the ESSA “Menu” Option for High School Statewide Assessment Programs
D. J. McRae, Ph.D.
04/20/18

When ESSA was approved by the feds 2+ years ago, it included statutory language allowing states to implement a “Menu” option for the design and execution of ESSA-required statewide assessment programs for grades 3 through 8 plus one grade level in high school. A “Menu” option means that multiple tests might be approved by state authorities for any given grade level, just as a restaurant menu offers multiple dinner entrees, provided certain requirements are met for tests approved as menu choices. The requirements address both test development characteristics [content specifications as well as test administration features] and test score utilization characteristics [for example, the ability to aggregate test results statewide (and disaggregate them for statewide subgroups) following good educational measurement practices].

With a few exceptions, states have not seriously considered implementing the ESSA “Menu” option for their statewide assessment programs. Those that have taken action to date have limited their considerations to optional use of college entry exams such as the SAT or ACT at the district (or school) level, replacing an existing statewide high school test focused on the individual state’s approved academic content standards.

A more general conceptual goal for a “Menu” option program would be to consider individual student choice of tests from an approved menu of tests. This consideration is attractive at the high school level simply because many of our high schools offer multiple instructional pathways for their students: college-bound pathways, including specialty emphases [for example, targeted STEM courses], and vocational career-ready pathways, including specialty emphases [for example, health-related career choices].
One underlying common-sense requirement for the design of any large-scale K-12 testing program is that test content should follow instruction. For our high schools, with their multiple instructional pathways, no one-size-fits-all test will ever satisfy all instructional pathways offered to students, regardless of whether a single test is designated at the state level or the district/school level. A “Menu” option testing program with choice at the individual student level makes a lot of common sense given the variety of instructional pathways offered by comprehensive high schools.

Test Development / Test Administration Observations

One obstacle for a true individual student “Menu” option is that authorizing legislation (at both state and federal levels) frequently includes the notion that one-size-fits-all high quality academic content standards must be the basis for testing all students. Given the variety of instructional pathways available to students at a comprehensive high school, any one-size-fits-all set of target content standards will not fit all instructional pathways equally. For example, some states have Career Technical Education (CTE) standards that can serve as a better basis for tests for CTE students than the so-called high quality academic content standards for college-bound students. Changes to statutory language, regulations, guidance documents, and/or assessment program peer review protocols may be needed for a true individual student “Menu” option for a high school statewide testing program.

Another obstacle for a true individual student “Menu” option is the practical problem of supplying even a modest number of different tests to any given group of students who need to be tested with one or more of the tests on an “approved” test list.
Here the considerable progress toward computer-administered tests in recent years provides real hope. Consider any group of high school students taking a statewide test in a specified location within a specified time; add the logistics of different paper/pencil tests and test directions to the mix, and one ends up with a nightmare of a test administration situation. But with computer-administered tests, each student’s log-in ID can be programmed to deliver a specified test that is matched to that student’s chosen instructional pathway. With test directions delivered by computer, and with roughly equal anticipated administration times for all tests on the menu, computer-administered testing allows individualized testing for all students with multiple tests, without the nightmare of paper/pencil logistics.

Test Score Utilization / Aggregate-Disaggregate Data Observations

Having multiple approved tests on the menu will add to the complexity of individual student test score interpretation, but personnel in our high schools for each of the different instructional pathways are available to address individual score interpretations for students and parents. The issue of statewide aggregation and disaggregation is more complex, but I believe it is addressable within the framework of good educational measurement practice.

To aggregate scores from approved tests on the menu to the statewide level, for potential use in a statewide accountability system, those scores need to be placed on a common scale of measurement. There are a number of different methods available to accomplish this technical task. A journal article by highly regarded educational measurement specialist Bob Linn about 25 years ago provides good guidance for what’s acceptable and what’s not acceptable along these lines [Linn, Robert L. (1993) Linking Results of Distinct Assessments.
Applied Measurement in Education, 6(1), 83-102]. Linn outlined five different methods for linking scores from distinct tests, along with properties for each of the five linking methods. In particular, Linn discusses:

• Equating methods: These methods require strict comparability of content on the different tests, strict alignment to a common set of target standards, strict conformity of test administration flexibilities such as accommodations and modifications, and strict mathematical rules for establishing scores on a common metric for the tests. These methods are needed for tests that are high stakes for individual students, such as college entry tests, and have most notably been used routinely for the multiple parallel forms of tests used at any given sitting of SAT and ACT test administrations.

• Equivalency methods: These methods do not require the same degree of strict comparability of content, alignment to a common set of content standards, or use of flexibility rules for special groups of students, but they do require good estimates of the statewide distribution of scores for each test potentially on the menu. These methods are useful for aggregate data (and disaggregations), such as district, school, or subgroup accountability use, as long as the specifications do not require individual student comparability of scores across tests. The widely used equi-percentile method for establishing equivalency of scores [the most frequently used method for K-12 national tests from the 1960s through the 1990s] falls in this category for potentially linking scores to a common scale of measurement for approved tests for the Menu option for statewide tests.

• Estimation methods: These methods involve using scores from one test to “estimate” scores on another test.
This method of linking scores is widely used in validity research to establish the credibility of any given test’s results, frequently using regression statistical methods to establish acceptable validity data. This method does not entirely account for the full distribution of scores for tests, focusing instead on averages and variances of scores. One of the disadvantages of this method, according to Linn, is that regression methods frequently do not provide stable estimates over time; new estimates are required each year to establish comparability of scores from year to year.

• Statistical Moderation Methods: These methods require human judgments of comparability across tests with the assistance of available statistical evidence. These methods have not been used widely in the US for comparability of K-12 test scores for data aggregations, but have been more widely used for European elementary-secondary testing programs. I do not think these methods would be sufficiently rigorous for aggregations of US statewide accountability system data.

• Social Moderation Methods: These methods are simply human judgments of comparability across tests without any statistical evidence at all. Again, these methods are sometimes used in other countries, but have not been widely used in the US. Again, I do not think these methods would be sufficiently rigorous for US statewide accountability system use.

Given these five methods, I’d comment that the current federal assessment program peer review requirements for aggregation of scores tend to be based on the Equating methods for comparability of scores. My opinion is that aggregation (and disaggregation) of scores for statewide accountability use can be accomplished using less strict Equivalency methods, and at times even less strict Estimation methods, when multiple tests are used under the ESSA Menu design option.
I would argue that this is a case of a policy trade-off: The potential for use of multiple tests as an ESSA “Menu” option statewide testing design at the high school level is sufficiently attractive to tolerate a less rigorous method for placing the approved tests on the menu on a common scale of measurement for statewide accountability aggregation (and disaggregation) purposes.

Conclusion

The above set of observations for a potential ESSA “Menu” option use of multiple tests for high school statewide testing needs considerable additional detail before it is ready for implementation. My guess is it will take 2-4 years of planning and preparation work. I would suggest one way to initiate such work would be via a formal Task Force appointed for 6-12 months to meet as a public body to discuss the additional details needed, delivering its recommendations in a written report to authorizing policy makers at least one year before operational implementation. This would be a responsible and realistic way to consider the ramifications of an ESSA “Menu” option design for the high school component of a statewide testing program.

One of the considerations for formal Task Force discussion and recommendation would be whether or not to include tests that might be considered “sub-menu” options, that is, approved tests with specialty emphases within the broad categories of college-ready and vocationally-ready tests. Sub-menu tests [such as Advanced Placement tests, International Baccalaureate program tests, or Cambridge end-of-course tests for the college-ready category, and individual career tests for the vocationally-ready category] may well be linked to a common scale of measurement via one or more of the alternate linking methods described above.
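
As a rough illustration of the equi-percentile idea mentioned above, the sketch below maps a score on one menu test to the score that sits at the same percentile rank on a reference test. It is a simplified, unsmoothed version written for this note, not a procedure taken from Linn's article or from any state program; the function names and the tiny score lists are hypothetical, and an operational linking would rest on representative statewide score distributions and usually some form of smoothing.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(scores, x):
    """Mid-percentile rank (0-100) of score x within a list of scores."""
    ordered = sorted(scores)
    below = bisect_left(ordered, x)           # number of scores strictly below x
    at = bisect_right(ordered, x) - below     # number of scores equal to x
    return 100.0 * (below + 0.5 * at) / len(ordered)


def score_at_percentile(scores, p):
    """Score at percentile p (0-100), interpolating between sorted scores."""
    ordered = sorted(scores)
    pos = (p / 100.0) * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def equipercentile_link(menu_scores, reference_scores, x):
    """Map score x on a menu test to the reference-test scale.

    Assumption: both lists approximate the statewide score distributions
    for their respective tests.
    """
    return score_at_percentile(reference_scores, percentile_rank(menu_scores, x))


# Hypothetical score distributions, for illustration only.
menu_scores = [10, 20, 30, 40, 50]
reference_scores = [100, 200, 300, 400, 500]
linked = equipercentile_link(menu_scores, reference_scores, 30)
```

Because the linked values depend only on the two score distributions, they are suited to aggregate summaries such as school or subgroup means rather than to claims that individual students' scores are interchangeable across tests.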
In addition, the existing high school content-based statewide test should also be on the menu (perhaps as the default test should a student not select one of the alternative tests). It is the logical test to provide the underlying common scale of measurement to which other tests would be linked, thus providing continuity for any already established statewide accountability system.

Finally, another topic area for Task Force discussion and recommendation might be how best to deal with special testing needs for Students with Disabilities (SWDs). ESSA and the states have already recognized this area via special tests for severely challenged SWDs, but there is a need for additional considerations for the remaining SWDs [both IEP and 504 Plan students] in the context of a multiple test “Menu” option for high school statewide assessment programs.

Q & A for the ESSA “Menu” Option for High School Statewide Assessment Programs

Q: Can the Menu option also apply to grade 3-8 required ESSA tests?
A: While ESSA statutory language permits this option, in my view it does not need to be pursued. For the most part, grade K-8 curriculum and instruction programs are easily focused on one set of high quality academic content standards for each state, and thus existing one-size-fits-all tests based on those academic content standards do not need a “menu” of optional tests to fit different instructional pathways at the individual student level. [Add on 4/24: An exception may be optional Math and Science end-of-course tests for middle school students taking accelerated high school Math and Science courses.]

Q: How would Menu tests be assigned to individual students?
A: I would advocate that assignments be left to student/parent choice, just as course selections for different instructional pathways are largely left to student/parent choice.

Q: Would individual students be allowed to take more than one test on the Menu?
A: The full answer to this question might be left for Task Force discussion and recommendation, but I would think an allowance for more than one test per student would be a reasonable feature for a high school Menu design. At least two issues would need to be discussed: (1) Who pays for the additional tests? (2) Which score would be used for the accountability system? There might be a variety of ways to resolve these two issues.

Q: How would tests be chosen and approved for the Menu?
A: I would think existing procurement systems could be used for RFI and RFQ processes, with test developers/owners required to provide data showing that their tests can generate acceptable representative statewide score distributions, the data needed for linking to an established common scale of measurement. Developers/owners of prospective tests for the menu may have to fund subsidized test administrations to generate acceptable data for this requirement, but this situation has been common within the K-12 testing industry for many years.