Validity is the central concept that defines quality in educational measurement. It is the extent to which an assessment actually measures what it is intended to measure and provides sound information supporting the purpose(s) for which it is used. Thus, benchmark assessments themselves are not valid or invalid. The validity of benchmark assessments resides in the evidence provided by an assessment and its specific use. Some benchmark assessments have a high degree of validity for one purpose but little validity for another. For example, a benchmark reading assessment may be valid for identifying students who may not reach the proficiency level on a state test. However, the same assessment could have little validity for diagnosing the cause of students’ reading challenges.
The evaluation of quality in any benchmark assessment system begins with a clear understanding of the purpose(s) an assessment is intended to serve and careful consideration of a range of issues that indicate how well a given assessment or assessment system serves that purpose. The dynamic between an assessment's purpose and the resulting data generated by the assessment is key to determining the validity of benchmark assessments. Benchmark assessments must:
These interrelated factors influence the validity of benchmark assessments and the role benchmark assessments play in a comprehensive assessment system.
Benchmark alignment describes how well what is assessed matches both what schools are teaching and the purpose for giving the assessment. For benchmark assessments to provide information for making valid inferences about student learning, the assessment must be aligned with the learning goals, standards, or success criteria from the beginning of the development or adoption process. One way to ensure alignment is “...to create benchmark assessments that enrich student learning opportunities, focus on the big ideas of a content area and counteract curriculum narrowing by designing benchmark assessments that allow students to apply their knowledge and skills in a variety of contexts and formats” (Herman and Baker, 1998, p. 56).
The essential issues for benchmark alignment focus on these questions:
The answers to these questions illuminate the importance of selecting and using a benchmark assessment system that aligns well with a school, district, or state curriculum. An analysis of student responses can further help to ensure that individual benchmark assessments are measuring the identified concepts at the appropriate time during the academic year.
Consider how items, such as the one displayed below, might serve as part of a benchmark assessment if appropriately aligned with learning goals. This task elicits information about a student’s understanding of part/whole relationships, a concept (or “big idea”) foundational to mathematical thinking. The item is also well matched to the district’s math curriculum, which emphasizes conceptual thinking and problem solving. Information generated by student responses to the task could be used to guide instruction as well.
Open-ended items, like the fraction/number-line example below, may also be used on multiple occasions to measure student progress toward a specific learning goal. Because the task is open-ended, it is possible to see different levels of student understanding.
Many assessment companies report that their assessments are “aligned” with specific state standards. This may not always be the case, however, so consumers of benchmark assessment systems may consider conducting their own analysis.
Benchmark alignment analysis begins with having the learning goals clearly in mind. Educators can then evaluate alignment with the following questions:
Example: Mathematics benchmark alignment and review
In this example, educators wanted to know how well their quarterly mathematics benchmark assessments aligned with the school learning goals and district and state standards. The table presents a summary of four benchmark assessments and identifies the number of items on each assessment by topic, mathematical "big idea," and intellectual demand. Notice the following (information found in the "Mathematics Concept" column):
Is this a reasonable distribution of content? Does the distribution reflect intended learning goals? Is the benchmark assessment aligned with the district's curriculum?
There is no right or wrong answer to these questions. This information provides an opportunity for educators to ask themselves if the distribution aligns with what matters most conceptually in 4th grade mathematics in their context. If the standards and conceptual analysis indicate that this distribution aligns with the intended learning goals, then they have reasonable assurance that the benchmark assessments are aligned conceptually to their 4th grade mathematics standards. If not, the school, district, or state will need to take steps to improve benchmark alignment with learning goals.
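The alignment review described above is, at its core, a tally: count the items each concept receives across the quarterly assessments and examine the resulting distribution. The sketch below uses a hypothetical item blueprint (the concept labels and counts are illustrative, not the district's actual data):

```python
# A sketch (hypothetical item blueprint) of the alignment tally behind the
# example: items grouped by mathematics concept, with the share of total
# items each concept receives across the four quarterly assessments.
from collections import Counter

# (assessment, mathematics concept) for each item on four quarterly benchmarks
blueprint = [
    ("Q1", "number & operations"), ("Q1", "number & operations"),
    ("Q1", "geometry"), ("Q2", "number & operations"),
    ("Q2", "algebraic thinking"), ("Q2", "geometry"),
    ("Q3", "number & operations"), ("Q3", "measurement & data"),
    ("Q4", "algebraic thinking"), ("Q4", "number & operations"),
]

by_concept = Counter(concept for _, concept in blueprint)
total = sum(by_concept.values())

for concept, count in by_concept.most_common():
    print(f"{concept:22s} {count:2d} items  {100 * count / total:4.0f}%")
```

Educators can then compare the printed distribution against the emphasis their standards intend; a concept that dominates the blueprint but is peripheral to the learning goals signals a misalignment worth correcting.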
Click here for a blank copy of the alignment review for your use in examining your own benchmark assessments.
Reliability is an indication of how consistently an assessment measures its intended target and the extent to which scores are (relatively) free of error. Low reliability means that scores should not be trusted for decision-making. Measurement experts have different ways of looking at reliability. For example, they look at how consistent scores are across different times or occasions. Do scores change if an assessment is given in the morning or in the afternoon, or if it is administered in September or May? For the assessment to provide reliable information, student scores should be basically the same on different days and times, unless there was instruction between assessments, a change in settings, and/or changes to the assessment.
Measurement experts also look for reliability in scoring. Results should be consistent regardless of who scores the assessment or when it is scored. Consistency in machine-scored benchmark multiple-choice items is rarely problematic. Reliable scoring is more complex with constructed or open-ended responses. A team of raters typically scores this type of assessment item. To ensure reliability, it is important that raters use anchor papers to develop a sound understanding of the scale used to code or score the benchmark items. Raters need to periodically recalibrate their scoring, that is, score the same work or items again to ensure their coding is consistent with other raters. Reliability is a necessary part of test validity, but high reliability does not ensure a valid test. For example, an assessment may be highly reliable (i.e., provide consistent results) but might not measure the knowledge and skills that the assessment was intended to measure. If not, the inferences made from the assessment are not valid. Alternately, an assessment may provide a highly reliable total score without providing reliable diagnostic information.
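Rater calibration can be checked numerically. Two commonly used indices are raw percent agreement and Cohen's kappa, which discounts the agreement two raters would reach by chance alone. The sketch below uses hypothetical scores from two raters on the same ten student papers:

```python
# A sketch (hypothetical ratings) of two agreement indices used when checking
# rater calibration on open-ended items: raw percent agreement and Cohen's
# kappa, which corrects percent agreement for chance.
from collections import Counter

rater1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]  # rater 1's scores on ten papers
rater2 = [3, 2, 4, 2, 1, 2, 3, 4, 3, 3]  # rater 2's scores on the same papers

def percent_agreement(a, b):
    """Proportion of papers on which the two raters gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(a)
    observed = percent_agreement(a, b)
    # Chance agreement from each rater's marginal score frequencies
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

print(f"agreement = {percent_agreement(rater1, rater2):.0%}")   # 80%
print(f"kappa     = {cohens_kappa(rater1, rater2):.2f}")        # 0.71
```

A kappa well below the raw agreement figure is a signal that much of the apparent consistency could be chance, and that raters should rescore anchor papers together before continuing.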
Test publishers typically provide reliability indices for benchmark assessments along with other technical information about item difficulty and discrimination. It is essential to review this technical information before purchasing or using benchmark assessments or item banks. Four reliability indices should be examined:
For schools and districts developing their own benchmark assessments, specific statistical guidelines should be used to evaluate the reliability of assessment items prior to their widespread use (Brown & Coughlin, 2007).
Fairness and bias comprise a critical feature of quality benchmark assessments. A fair test is accessible to all students and does not advantage some students over others. Bias emerges when features of the assessment itself limit some students’ ability to demonstrate their knowledge or skill. Bias is present when students from different subgroups (e.g., race, ethnicity, language, gender, disability) with the same level of knowledge and skill perform differently on an assessment.
There are two primary forms of test bias: offensiveness and unfair penalization (Popham, 1995). Offensiveness is an issue when the content of an assessment offends, upsets, or distresses a subgroup of students, thus negatively impacting their performance. Examples of offensiveness include assessment items that present unfavorable stereotypes of different cultures, genders, or other subgroups which could adversely affect these subgroups’ performance.
Unfair penalization occurs when aspects of an assessment make the test more challenging for some students than for others because of differences in language, culture, locale, or socio-economic status. Mathematics tests, for example, that use complex or technical language structures may reduce English learners’ ability to show their mathematics knowledge. Students may be unfairly penalized because they cannot understand the questions even if they know the content.
Technical information regarding fairness and bias should be provided by benchmark assessment developers. It should include demographics of the sample as well as scores and other technical evidence for various subgroups. Additionally, benchmark assessments should be examined prior to their use to ensure that the particular items will not be offensive to students.
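The definition of bias above, that students from different subgroups with the same level of knowledge and skill perform differently, has a common statistical screen: differential item functioning (DIF). The idea is to compare an item's percent correct across subgroups after matching students on total score. The minimal sketch below uses hypothetical data and a simple score-band comparison, not a full DIF procedure such as Mantel-Haenszel:

```python
# A minimal sketch (hypothetical data) of a differential item functioning
# screen: compare an item's percent correct for two subgroups among students
# who earned the SAME total score. Large gaps at matched ability flag
# possible bias for closer review.
from collections import defaultdict

# (subgroup, total_score, item_correct) per student, for one item under review
students = [
    ("A", 8, 1), ("A", 8, 1), ("A", 5, 0), ("A", 5, 1), ("A", 8, 1),
    ("B", 8, 1), ("B", 8, 0), ("B", 5, 0), ("B", 5, 0), ("B", 8, 1),
]

def dif_table(students):
    """Percent correct on the item, by subgroup, within each total-score band."""
    cells = defaultdict(list)  # (total_score, subgroup) -> list of 0/1
    for group, total, correct in students:
        cells[(total, group)].append(correct)
    return {key: sum(v) / len(v) for key, v in cells.items()}

table = dif_table(students)
for score in sorted({t for t, _ in table}):
    gap = table[(score, "A")] - table[(score, "B")]
    print(f"total={score}: A={table[(score, 'A')]:.0%}  "
          f"B={table[(score, 'B')]:.0%}  gap={gap:+.0%}")
```

A persistent gap within the same score band suggests the item is functioning differently for the two subgroups; the flagged item would then be reviewed for the sources of offensiveness or unfair penalization described above.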
Guidelines for developing benchmark assessments that are free from bias reflect the same steps outlined above:
Instructional sensitivity refers to the degree to which student performance on an assessment accurately reflects the quality of instruction students have received (Popham, CCSSO, 2009) and their resulting learning. If students have been well taught and have learned what is assessed, they should perform well on the test. This seems obvious.
This idea is often assumed, but it is not always the case for students and assessment results. Students may underperform on an assessment if some items are only distantly related to what teachers have taught. Poor performance may also occur if items are confusing, excessively difficult, or involve unfamiliar applications or item contexts.
If benchmark assessments are not sensitive to instruction, the scores they produce have little value for evaluating schools, teachers, or their instructional programs. Instead, the assessments provide faulty evidence for improving teaching and learning.
To avoid instructional sensitivity problems, educators should conduct thorough reviews of the alignment between assessment items and curriculum. Further, educators should insist that the assessment focus on concepts that are central, not tangential, to learning goals. Be on the lookout for item flaws that may confuse students or enable them to guess the right answer without having content knowledge. A list of potential flaws for multiple-choice questions can be found in the resource section of the CSE Technical Report #723, Appendix B.
Another important point to consider when selecting or developing benchmark assessments is utility. The overarching question that schools, districts, and states should ask to determine a benchmark assessment’s utility is: “Will this assessment be useful in helping us to accomplish our intended purposes?” To maximize utility, benchmark assessments must be user-friendly and feasible to administer, and their results must be scored and interpreted in a timely way. Evaluating the utility of a benchmark assessment means revisiting the purpose of the assessment, its intended users, and how teachers are expected to use the results to guide instruction.
Publishers typically showcase various ways that test data can be summarized and displayed. While these reporting features may be appealing on the surface, districts and schools should closely review the technical manuals that accompany each test for evidence supporting each intended use. Useful reports have the following characteristics: