Every certificate program accredited by ANAB has a summative (i.e., final) assessment following the course to determine whether learners have met the course learning objectives. ANSI/ASTM E2659-18 requires that each program conduct an item and form analysis of the assessment as part of their annual program evaluation.
The clause that contains this requirement is below; the pertinent statement is item (2).
“The program evaluation shall measure the quality and effectiveness of learner assessment methods/instruments.
(1) Assessment methods/instruments shall be reviewed to verify appropriate and accurate linkage to and measurement of the learning outcomes.
(2) Assessment question and form performance shall be reviewed.
(3) If applicable, evaluator performance shall be reviewed.”
This post presents the main statistics involved in this part of ANSI/ASTM E2659-18*.
Test Terminology – What Are Items and Forms?
Starting with some basic terminology, an assessment question is just that—a question on the test. A form is a set of questions that comprise the test (or assessment or exam). Usually, “form” is synonymous with “test,” but only if there is only one form in use. Most certificate programs have multiple forms to help boost test security or because they offer the assessment in more than one language. In all cases, a form is one version of the test. It is important to note that when there are multiple forms in use, all are subject to evaluation. In other words, question and form performance of each form should be reviewed. As an aside, most psychometricians refer to assessment questions as “items” and use the two terms interchangeably.
How Are Item and Form Performance Evaluated?
Item and form performance are both measured by statistics. Most certificate programs have one staff member who understands this to some extent, but without the requisite test development background, they often aren’t sure which statistics to use. The two main statistics used by certificate programs are item difficulty—how many people answer each question correctly—and pass rates—how many people pass the exam.
But these alone are not sufficient. Most staff members are not sure exactly how to interpret either statistic. So, they either do their best to guess what the statistics mean or avoid providing any interpretation for fear of saying the wrong thing. For professional test developers, evaluating item and form performance is rather straightforward. Here’s generally what assessors look for when evaluating tests:
High-performing items have at least two properties:
a) they are at the right level of difficulty for the learners; and
b) they are a good representation of the concept(s) that the test was designed to teach and measure
There is a statistic for each of these properties.
High-performing forms have evidence of both validity and reliability. Validity means that the form represents the concepts taught in the course and that the test result is used in a manner consistent with its purpose. Reliability means that the form(s) measure consistently. The basis for the validity of a test is established by all the steps involved in developing the course. Evidence that demonstrates this includes linking the course contents and learning objectives to the end-of-course assessment, as well as evidence of the process of item and form creation. Validity is a crucial element of any assessment and appears in the ANSI/ASTM E2659-18 standard as the first requirement of the clause quoted above: “(1) Assessment methods/instruments shall be reviewed to verify appropriate and accurate linkage to and measurement of the learning outcomes.”
There are two main statistics for conducting an item analysis, and one main statistic for conducting a form analysis.
Two main statistics for item analysis
- Item difficulty, measured by the p value, is the statistic that tells you whether the question is at the right level of difficulty for your learners. It is obtained by calculating the proportion of examinees who got the correct answer on a question. For most practical purposes, this value would ideally fall between .60 and .90 for each question (Haladyna, 2016). This range (.60 to .90) is considered “medium difficulty” and represents the range of most operational cut scores. When most of the item difficulties on a form fall within a range surrounding the cut score for the form, the reliability of the test is at its highest. This is because better decisions are made with an exam that has item difficulties of similar value to the point at which a pass/fail decision is made.
- Item discrimination, measured by the point-biserial correlation, helps you understand whether the item is a good representation of the concept(s) that the test was designed to measure. It is obtained by correlating the score on each item with the total score on the exam, which, in the case of multiple-choice tests, is the definition of a point-biserial. A correlation considers this relationship across all people in the data to produce a single statistic for each item. Correlations can range from -1.0 to +1.0, but ideally the correlation is above zero, meaning that there is a positive relationship between the item score and the total score on the exam. A negative relationship means that the item is detracting from the total score, which is a very undesirable situation. If items are considered as bits of information about a person’s knowledge, you want each item to add information about a person’s knowledge rather than confuse the matter. Items with negative discriminations are in that sense a waste of testing time and a drag on form reliability (more on that next). Most test developers would consider a good item to have an item discrimination above .15.
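As an illustrative sketch (not part of the standard), both item statistics can be computed directly from a scored response matrix. The data below are hypothetical, and a real analysis would use many more examinees and items:

```python
from statistics import mean, pstdev

# Hypothetical scored responses: rows are examinees, columns are items
# (1 = correct, 0 = incorrect). Real analyses need far more examinees.
responses = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
]

def item_difficulty(responses):
    """p value per item: proportion of examinees who answered correctly."""
    n = len(responses)
    return [sum(row[j] for row in responses) / n
            for j in range(len(responses[0]))]

def point_biserial(responses):
    """Correlation of each 0/1 item score with the examinee's total score."""
    totals = [sum(row) for row in responses]
    n = len(responses)
    m_tot, s_tot = mean(totals), pstdev(totals)
    result = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        m_it, s_it = mean(item), pstdev(item)
        if s_it == 0 or s_tot == 0:  # constant scores: correlation undefined
            result.append(float("nan"))
            continue
        cov = sum((x - m_it) * (y - m_tot)
                  for x, y in zip(item, totals)) / n
        result.append(cov / (s_it * s_tot))
    return result

p_values = item_difficulty(responses)   # p_values[0] is 4/6: 4 of 6 correct
r_pbs = point_biserial(responses)
```

Note that the point-biserial here correlates each item with the total score including that item; some test developers prefer the corrected version, which excludes the item from the total before correlating.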
When conducting an item analysis, psychometricians look for items that are out of bounds on either difficulty or discrimination and target those items for either removal from the item bank or revision. This is essential—an item analysis alone is not doing anything unless someone takes action on the results.
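The out-of-bounds check described above can be sketched as a simple filter. The thresholds follow the guidelines cited earlier (difficulty between .60 and .90, discrimination above .15); the item IDs and statistics are hypothetical:

```python
# Hypothetical item statistics keyed by item ID: (p value, point-biserial).
item_stats = {
    "Q1": (0.85, 0.32),   # in range on both: keep as-is
    "Q2": (0.95, 0.22),   # too easy (p > .90): flag for review
    "Q3": (0.70, -0.05),  # negative discrimination: flag for removal/revision
    "Q4": (0.45, 0.18),   # too hard (p < .60): flag for review
}

def flag_items(item_stats, p_lo=0.60, p_hi=0.90, min_rpb=0.15):
    """Return items out of bounds on difficulty or discrimination,
    with the reason(s) each was flagged."""
    flagged = {}
    for item, (p, rpb) in item_stats.items():
        reasons = []
        if not (p_lo <= p <= p_hi):
            reasons.append(f"difficulty {p:.2f} outside [{p_lo}, {p_hi}]")
        if rpb < min_rpb:
            reasons.append(f"discrimination {rpb:.2f} below {min_rpb}")
        if reasons:
            flagged[item] = reasons
    return flagged

flagged = flag_items(item_stats)  # Q2, Q3, and Q4 are flagged; Q1 is not
```

A report like this is only the starting point: each flagged item still needs a human decision—revise, retire, or retain with justification.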
One main statistic for form analysis
- Form reliability, measured by the Kuder-Richardson Formula 20 (KR-20) (for a multiple-choice test), helps you understand how well the items work together to measure the area of knowledge targeted by the test. This is also referred to as the “internal consistency” of the test. It represents how well the items fit together and is a combined function of the difficulty and the discrimination values of all of the items that are included in a single test form. KR-20 values can range from -1 to 1. Most multiple-choice tests will have KR-20 values greater than 0 but less than 1. The closer the value is to 1, the higher reliability the form has. For most certificate programs, a KR-20 that is above .70 is considered acceptable.
When conducting a form analysis, psychometricians look for a KR-20 above .70. If it is above that threshold, the form is considered to be performing well. If it is below that, the reliability of the form must be increased either by adding items (poor reliability can be caused by the form being too short) or by removing low-performing items, such as those with very high or low difficulties, scoring errors, or low discrimination. Poor items can have a detrimental effect on a form’s reliability.
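For readers who want to see the computation, KR-20 is defined as (k / (k − 1)) × (1 − Σ pⱼqⱼ / σ²ₓ), where k is the number of items, pⱼqⱼ is each item's variance (p value times its complement), and σ²ₓ is the variance of the total scores. A minimal sketch, using hypothetical response data and the population variance:

```python
from statistics import pvariance

# Hypothetical scored responses: rows are examinees, columns are items
# (1 = correct, 0 = incorrect). A real form would have far more of both.
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
]

def kr20(responses):
    """Kuder-Richardson Formula 20 for dichotomously scored items."""
    k = len(responses[0])                 # number of items on the form
    n = len(responses)                    # number of examinees
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)         # variance of total scores
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n   # item difficulty
        sum_pq += p * (1 - p)                      # item variance p*q
    return (k / (k - 1)) * (1 - sum_pq / var_total)

reliability = kr20(responses)
```

With this tiny sample the value lands well below .70, illustrating the point above: very short forms tend to have low reliability, and adding sound items is one remedy.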
It is fundamental that certificate programs get the test development right, as it is the primary source of information for determining whether learners have met the learning objectives. For a very helpful walkthrough of actually conducting these statistics, there is a highly recommended book specific to developing tests for certification and program evaluation: “Test Development: Fundamentals for Certification and Evaluation” by Melissa Fein. It’s a useful primer for anyone needing a quick crash course in the concepts and methods of test development. This book is very accessible to non-psychometricians, and the entire second part is devoted to tools for item and form analysis that can be utilized by anyone with a basic proficiency in Excel.
Haladyna, T. M. (2016). Item analysis for selected-response test items. In Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.), Handbook of test development. New York: Routledge.
*Most ANAB-accredited certificate programs use multiple-choice items, so the statistics mentioned here assume that the test is multiple-choice with right/wrong scoring on each item and not another format or scoring method, such as work samples and partial credit scoring models.
Contributing Author: Kathy Tuzinski
Kathy Tuzinski is a Contract Assessor for ANAB’s Certificate Accreditation Program. She has a scientist-practitioner’s background in the testing industry as a psychometrician, consultant, and industrial/organizational psychologist. She started Human Measures to provide the testing industry support in developing testing programs that are based on science – advising companies on job analysis and competency modeling, test development and test validation, legal requirements for employee selection, and relevant testing standards, including the AERA, APA, and NCME Standards for Educational and Psychological Testing and the SIOP Principles for the Validation and Use of Personnel Selection Procedures. She is a member of the American Psychological Association and the Society for Industrial/Organizational Psychology, has published several articles in peer-reviewed journals, and is co-editor of the book Simulations for Personnel Selection. She holds a Master of Arts degree in Industrial/Organizational Psychology from the University of Minnesota. You can reach her at https://www.human-measures.com/.