Selecting and Using Benchmark Assessments

Criteria for benchmark assessment selection and use

Hundreds of benchmark assessments are currently available.  They range from expensive, high-tech web-based systems to inexpensive, locally developed assessments tied to a specific curriculum.  We describe important criteria and principles that schools, districts and states should consider when selecting or developing benchmark assessments.

What is validity?

Validity is the overarching concept that defines quality in educational measurement.  It is the extent to which an assessment actually measures what it is intended to measure and provides sound information supporting the purpose(s) for which it is used. Thus, benchmark assessments themselves are not valid or invalid; validity resides in the evidence supporting a specific use of an assessment. Some benchmark assessments have a high degree of validity for one purpose but little validity for another. For example, a benchmark reading assessment may be valid for identifying students who may not reach the proficiency level on a state test.  However, the same assessment could have little validity for diagnosing the cause of students’ reading challenges.

The evaluation of quality in any benchmark assessment system begins with a clear explanation of the purpose(s) of an assessment and serious consideration of a range of issues that indicate how well a given assessment or assessment system serves those purpose(s).  The relationship between an assessment's purpose and the data it generates is key to determining the validity of benchmark assessments.

Validity: What to look for

As noted above, evaluating the quality of any benchmark assessment system starts with a clear understanding of the purpose(s) an assessment is intended to serve and consideration of how well a given assessment or assessment system serves that purpose.  Benchmark assessments must:

  • Be aligned with district and school learning goals and intended purpose
  • Provide reliable information for intended score interpretations and uses
  • Be instructionally sensitive
  • Be fair and accessible
  • Have high utility
  • Provide useful reporting for intended users and purposes

These interrelated factors influence the validity of benchmark assessments and the role benchmark assessments play in a comprehensive assessment system.

What is alignment with learning goals?

Benchmark alignment describes how well what is assessed matches both what schools are teaching and the purpose for giving the assessment.  For benchmark assessments to provide information for making valid inferences about student learning, they must be aligned with the learning goals, standards, or success criteria from the beginning of the development or adoption process.  One way to ensure alignment is “...to create benchmark assessments that enrich student learning opportunities, focus on the big ideas of a content area and counteract curriculum narrowing by designing benchmark assessments that allow students to apply their knowledge and skills in a variety of contexts and formats” (Herman and Baker, 1998, p. 56).

The essential issues for benchmark alignment focus on these questions: 

  1. Do the assessments reflect what is most important for students to know and be able to do?
  2. Do the assessments capture the depth and breadth of learning goals? 
  3. Is the assessment framework consistent with the local curriculum framework?  
  4. Does the sequence of assessment content on successive tests match that of the curriculum?  
  5. Which curriculum goals should each assessment be aligned with—those of the prior instructional period, those of subsequent instructional periods, or both?  

The answers to these questions help to illuminate the importance of selecting and using a benchmark assessment system that aligns well with a school, district, or state curriculum.  An analysis of student responses can further help to ensure that individual benchmark assessments are measuring the identified concepts at the appropriate time during the academic year. 

Alignment Examples

Example 1

Consider how items, such as the one displayed below, might serve as part of a benchmark assessment if appropriately aligned with learning goals. This task elicits information about a student’s understanding of part/whole relationships, a concept (or “big idea”) foundational to mathematical thinking. The item is also well matched to the district’s math curriculum, which emphasizes conceptual thinking and problem solving.  Information generated by student responses to the task could be used to guide instruction as well. 

Student is asked to identify which image (A, B, C, or D) depicts one quarter of the area being shaded

Example 2

Open-ended items, like the fraction/number-line example below, may also be used on multiple occasions to measure student progress toward a specific learning goal, provided they are aligned with that goal.  Because the task is open-ended, it can reveal different levels of student understanding. 

Student concludes that there are 10 fractions between 0 and 1 on a number line.


Alignment: What to look for

Many assessment companies report that their assessments are “aligned” with specific state standards.  This may not always be the case, however, so consumers of benchmark assessment systems may consider conducting their own alignment analysis.

Benchmark alignment analysis begins with having the learning goals clearly in mind.  Educators can then evaluate alignment with the following questions:

  1. What framework was used to develop the benchmark assessment or items in the assessment?
  2. What is the distribution and range of items and content by grade level?
  3. What is the distribution and range of cognitive demands by grade level?
  4. What is the specific distribution of items on each assessment by content and cognitive demand?
  5. How many items are available for diagnosing specific learning strengths and weaknesses to guide instruction, and what range of content do they cover?

Example: Mathematics benchmark alignment and review

In this example, educators wanted to know how well their quarterly mathematics benchmark assessments aligned with the school learning goals and district and state standards.  The table presents a summary of four benchmark assessments and identifies the number of items on each assessment by topic, mathematical "big idea," and intellectual demand.  Notice the following (information found in the "Mathematics Concept" column):

  1. "Data Analysis" receives the most emphasis on the 4th grade benchmark math assessments (28/100, representing 28% of all items)
  2. "Patterns and Algebraic Thinking" represent 16%
  3. "Measurement" represents 20%
  4. "Geometry" represents 10%
  5. "Number Sense" represents 27%

Is this a reasonable distribution of content? Does the distribution reflect intended learning goals?  Is the benchmark assessment aligned with the district's curriculum?

There is no right or wrong answer to these questions.  This information provides an opportunity for educators to ask themselves if the distribution aligns with what matters most conceptually in 4th grade mathematics in their context.  If the standards and conceptual analysis indicate that this distribution aligns with the intended learning goals, then they have reasonable assurance that the benchmark assessments are aligned conceptually to their 4th grade mathematics standards.  If not, the school, district, or state will need to take steps to improve benchmark alignment with learning goals.
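For educators who keep their item blueprints in electronic form, a content-by-demand tally like the one above can be produced automatically. The following sketch is a minimal, hypothetical example written in Python; the item list and its topic and demand tags are placeholders, not the actual district data. It simply counts items by topic and by cognitive demand and reports each count as a percentage of the total.

  from collections import Counter

  # Hypothetical item blueprint: one (topic, cognitive_demand) tag per item.
  # Replace these placeholder tags with those from your own benchmark assessment.
  items = [
      ("Number Sense", "procedural"),
      ("Number Sense", "conceptual"),
      ("Data Analysis", "conceptual"),
      ("Data Analysis", "application"),
      ("Measurement", "application"),
      ("Geometry", "procedural"),
      ("Patterns and Algebraic Thinking", "conceptual"),
  ]

  def distribution(tags):
      """Return the percentage of items carrying each tag value."""
      counts = Counter(tags)
      total = sum(counts.values())
      return {tag: round(100 * n / total, 1) for tag, n in counts.items()}

  print("By topic:", distribution(topic for topic, _ in items))
  print("By cognitive demand:", distribution(demand for _, demand in items))

Comparing the printed percentages against the intended emphasis in the local standards is the same judgment call discussed above; the code only does the counting.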

Click here for a blank copy of the alignment review for your use in examining your own benchmark assessments.

What is reliability?

Reliability is an indication of how consistently an assessment measures its intended target and the extent to which scores are (relatively) free of error. Low reliability means that scores should not be trusted for decision-making.   Measurement experts have different ways of looking at reliability. For example, they look at how consistent scores are across different times or occasions.  Do scores change if an assessment is given in the morning or in the afternoon, or if it is administered in September or May? For the assessment to provide reliable information, student scores should be basically the same on different days and times, unless there was instruction between assessments, a change in settings, and/or changes to the assessment.

Measurement experts also look for reliability in scoring.  Results should be consistent regardless of who scores the assessment or when it is scored. Consistency in machine-scored benchmark multiple-choice items is rarely problematic.  Reliable scoring is more complex with constructed or open-ended responses, which are typically scored by a team of raters. To ensure reliability, raters should use anchor papers to develop a sound understanding of the scale used to code or score the benchmark items.  Raters also need to periodically recalibrate their scoring, that is, score the same work again to confirm that their coding is consistent with that of other raters.

Reliability is a necessary part of test validity, but high reliability does not ensure a valid test. For example, an assessment may be highly reliable (i.e., provide consistent results) but might not measure the knowledge and skills it was intended to measure.  If not, the inferences made from the assessment are not valid.  Alternately, an assessment may provide a highly reliable total score without providing reliable diagnostic information.
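As a concrete illustration of scoring consistency, the sketch below computes Cohen's kappa, one common index of agreement between two raters who scored the same set of papers; agreement is corrected for the level expected by chance. The rubric scores are made up for illustration, and scoring teams may reasonably prefer other agreement statistics.

  from collections import Counter

  def cohens_kappa(rater_a, rater_b):
      """Chance-corrected agreement between two raters scoring the same papers."""
      n = len(rater_a)
      observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
      freq_a, freq_b = Counter(rater_a), Counter(rater_b)
      expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
      return (observed - expected) / (1 - expected)

  # Hypothetical scores (0-4 rubric) that two raters gave the same ten papers.
  rater_1 = [3, 2, 4, 1, 3, 2, 2, 4, 0, 3]
  rater_2 = [3, 2, 3, 1, 3, 2, 1, 4, 0, 3]
  print(round(cohens_kappa(rater_1, rater_2), 2))  # values near 1.0 indicate strong agreement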

Reliability: What to look for

Test publishers typically provide reliability indices for benchmark assessments along with other technical information about item difficulty and discrimination. It is essential to review this technical information before purchasing or using benchmark assessments or item banks.  Four technical indices should be examined:

  1. Item difficulty:  the proportion of students answering the item correctly.  Appropriate difficulty generally falls in the range of .2 to .8, meaning that 20-80% of students answered the item correctly.
  2. Item discrimination:  this value compares results for high- and low-scoring students.  The discrimination index is scaled from -1.0 to +1.0.  A value of +0.3 or higher means an item can differentiate between high- and low-scoring students.
  3. Reliability coefficients: this value is calculated based on internal consistency (how students respond to the same kinds of items on the same test), test-retest reliability (how students perform on the same test administered at different times), or parallel forms reliability (how students perform on different forms of the same test).  Reliability coefficients of 0.8 and above indicate that benchmark assessments are providing reliable information.
  4. Reliability scores and sub-scores:  reliability indices can be calculated for all the items on the test, or for each subsection or sub-score of an assessment.  In social studies for example, the total score may consist of five sub-scores—behavioral science, economics, geography, history and political science.  Sub-scores are very important if benchmark data are to be used for planning instruction.  A minimum of five items is needed for a reliable sub-score—that is, to make an accurate diagnosis of student learning.

For schools and districts developing their own benchmark assessments, specific statistical guidelines should be used to evaluate the reliability of assessment items prior to their widespread use (Brown & Coughlin, 2007).
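For locally developed assessments, the first three indices above can be estimated directly from a scored-response matrix. The sketch below uses a small made-up set of 0/1 (incorrect/correct) responses; it computes item difficulty as the proportion correct, an upper-minus-lower discrimination index, and Cronbach's alpha as one internal-consistency coefficient. It is illustrative only and follows no particular publisher's procedure.

  import statistics

  # Hypothetical scored responses: rows are students, columns are items (1 = correct).
  responses = [
      [1, 1, 0, 1, 1],
      [1, 0, 0, 1, 0],
      [0, 1, 1, 1, 1],
      [1, 1, 1, 1, 1],
      [0, 0, 0, 1, 0],
      [1, 1, 0, 0, 1],
  ]

  def item_difficulty(rows, item):
      """Proportion of students answering the item correctly (p-value)."""
      return sum(row[item] for row in rows) / len(rows)

  def item_discrimination(rows, item):
      """Proportion correct in the top-scoring half minus the bottom-scoring half."""
      ranked = sorted(rows, key=sum)
      half = len(ranked) // 2
      return item_difficulty(ranked[-half:], item) - item_difficulty(ranked[:half], item)

  def cronbach_alpha(rows):
      """Internal-consistency reliability of the total score."""
      k = len(rows[0])
      item_vars = [statistics.pvariance([row[i] for row in rows]) for i in range(k)]
      total_var = statistics.pvariance([sum(row) for row in rows])
      return k / (k - 1) * (1 - sum(item_vars) / total_var)

  for i in range(len(responses[0])):
      print(f"item {i}: difficulty={item_difficulty(responses, i):.2f}, "
            f"discrimination={item_discrimination(responses, i):.2f}")
  print(f"alpha={cronbach_alpha(responses):.2f}")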

What are fairness and bias?

Fairness is a critical feature of quality benchmark assessments.  A fair test is accessible to all students and does not advantage some students over others. Bias emerges when features of the assessment itself limit some students’ ability to demonstrate their knowledge or skill. Bias is present when students from different subgroups (e.g., race, ethnicity, language, gender, disability) with the same level of knowledge and skill perform differently on an assessment.

There are two primary forms of test bias: offensiveness and unfair penalization (Popham, 1995).  Offensiveness is an issue when the content of an assessment offends, upsets, or distresses a subgroup of students, thus negatively impacting their performance.  Examples of offensiveness include assessment items that present unfavorable stereotypes of different cultures, genders, or other subgroups which could adversely affect these subgroups’ performance.  

Unfair penalization occurs when aspects of an assessment make the test more challenging for some students than for others because of differences in language, culture, locale, or socio-economic status. Mathematics tests, for example, that use complex or technical language structures may reduce English learners’ ability to show their mathematics knowledge.  Students may be unfairly penalized because they cannot understand the questions even though they know the content.

Fairness and Bias: What to look for

Technical information regarding fairness and bias should be provided by benchmark assessment developers.  It should include demographics of the sample as well as scores and other technical evidence for various subgroups. Additionally, benchmark assessments should be examined prior to their use to ensure that the particular items will not be offensive to students.  

Guidelines for developing benchmark assessments that are free from bias reflect the same steps outlined above:

  1. Developers should be sensitive to the demographic characteristics of the students being assessed.
  2. Documentation describing the steps taken to minimize bias in the assessment items should be provided.
  3. Organizations should examine how well each item functions for specific subgroups (a simple screening sketch follows this list).
  4. If particular subgroups perform differently, the validity of those items for the subgroups should be investigated.
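The sketch below illustrates a very rough first screen for step 3: comparing the proportion of students who answered an item correctly across subgroups. The subgroup labels and responses are hypothetical, and a raw gap by itself is not evidence of bias; formal differential item functioning methods (such as Mantel-Haenszel) compare students of similar overall ability before flagging items, so any gap found here is only a signal to investigate further.

  from collections import defaultdict

  # Hypothetical records for one item: (subgroup label, 1 = correct, 0 = incorrect).
  records = [
      ("group_a", 1), ("group_a", 0), ("group_a", 1), ("group_a", 1),
      ("group_b", 0), ("group_b", 1), ("group_b", 1), ("group_b", 0),
  ]

  def proportion_correct_by_subgroup(records):
      """Proportion of students in each subgroup answering the item correctly."""
      attempted = defaultdict(int)
      correct = defaultdict(int)
      for group, score in records:
          attempted[group] += 1
          correct[group] += score
      return {group: correct[group] / attempted[group] for group in attempted}

  print(proportion_correct_by_subgroup(records))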

What is instructional sensitivity?

Instructional sensitivity refers to the degree to which student performance on an assessment accurately reflects the quality of instruction students have received (Popham, CCSSO, 2009) and their learning.  If students have been well taught and have learned what is assessed, they should perform well on the test.  Pretty obvious, right?

This idea is often assumed, but it is not always the case for students and assessment results. Students may underperform on an assessment if some items are only distantly related to what teachers have taught.  Poor performance may also occur if items are confusing, excessively difficult, or involve unfamiliar applications or item contexts.

If benchmark assessments are not sensitive to instruction, the scores they produce have little value for evaluating schools, teachers, or their instructional programs.  Instead, the assessments provide faulty evidence for improving teaching and learning.

Instructional Sensitivity: What to look for

To avoid instructional sensitivity problems, educators should conduct thorough reviews of the alignment between assessment items and curriculum. Further, educators should insist that the assessment focus on concepts that are central, not tangential, to learning goals. Be on the lookout for item flaws that may confuse students or enable them to guess the right answer without having content knowledge.  A list of potential flaws for multiple-choice questions can be found in the resource section of CSE Technical Report #723, Appendix B.

What is utility?

Another important point to consider when selecting or developing benchmark assessments is utility.  The overarching question that schools, districts and states should ask to determine a benchmark assessment’s utility is:  “Will this assessment be useful in helping us to accomplish our intended purposes?”  To maximize utility, benchmark assessments must be user-friendly, feasible to administer, and should be scored and interpreted in a timely way.  Evaluating the utility of a benchmark assessment means revisiting the purpose of the assessment, its intended users, and how teachers are expected to use the results to guide instruction.

Utility: What to look for

Publishers typically showcase various ways that test data can be summarized and displayed. While these reporting features may be appealing on the surface, districts and schools should closely review the technical manuals that accompany each test for evidence supporting each intended use. Useful reports have the following characteristics:

  1. Different reporting levels: systems should provide reports for different users in the system (i.e., student, classroom, school and district levels), with the appropriate level of detail for each (a minimal aggregation sketch follows this list).
  2. Reporting formats consistent with use at each level: teachers need reports with enough information about individual students to plan appropriate instruction.  Schools may need information on the total number of students off target for meeting proficiency by year's end in order to allocate adequate time and resources.
  3. Scores and reporting categories consistent with local circumstances and proficiency levels: reports should match the kind of information typically used to gauge proficiency in a community. For example, some schools report proficiency as below basic, basic, proficient, and advanced, categories that parallel state proficiency levels.  In some cases, with an adequate sample size, it is possible to examine how benchmark results compare with performance on state assessment items; that comparison is possible because of the consistency of the categories and scales used.  
  4. Multiple representations of data: user-friendly reports, presented in a variety of formats, can help to effectively convey benchmark data to diverse audiences. 
  5. Flexibility: as districts become more experienced using benchmark assessments, they may want to create custom reports.  Choosing a system that can provide flexibility in reports is an important feature.
  6. Reliability of reported scores and inferences: look for reports that provide data on score reliability, especially student performance on specific topics or concepts.  If reports present information on mastery or proficiency, look for evidence to confirm these classifications.
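To make the idea of reporting levels concrete, the sketch below aggregates hypothetical student scale scores into proficiency bands at a chosen level (school, classroom, or student). The cut scores, school and classroom names, and records are all placeholders; an actual system would use locally adopted cut scores and real rosters.

  from collections import Counter, defaultdict

  # Hypothetical proficiency cut scores on a 0-100 benchmark scale.
  CUTS = [("below basic", 0), ("basic", 40), ("proficient", 60), ("advanced", 80)]

  def band(score):
      """Map a scale score to the highest proficiency category whose cut it meets."""
      label = CUTS[0][0]
      for name, cut in CUTS:
          if score >= cut:
              label = name
      return label

  # Hypothetical student records: (school, classroom, student id, scale score).
  records = [
      ("Adams", "4A", "s01", 72), ("Adams", "4A", "s02", 55),
      ("Adams", "4B", "s03", 38), ("Adams", "4B", "s04", 81),
  ]

  def summarize(records, level):
      """Count students in each proficiency band at the chosen reporting level."""
      key = {"school": 0, "classroom": 1, "student": 2}[level]
      summary = defaultdict(Counter)
      for record in records:
          summary[record[key]][band(record[3])] += 1
      return {unit: dict(counts) for unit, counts in summary.items()}

  print(summarize(records, "classroom"))
  print(summarize(records, "school"))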