Transitioning from Manual to Automated Test Assembly: A Comparison of Equating Methods
Authorship
Kimberly M. Hudson, PhD, National Board of Osteopathic Medical Examiners
Yue Yin, PhD, University of Illinois at Chicago
Tsung-Hsun Tsai, PhD, National Board of Osteopathic Medical Examiners
Grant Number/ Funding Information
Not applicable
Corresponding Author
Kimberly Hudson, 8765 West Higgins Road, Suite 200, Chicago, Illinois 60631; 773-714-0622; Kimberly.shay86@gmail.com
Key Words
Equating, Automated Test Assembly, Optimal Test Assembly, IRT, Rasch Model, CINEG
Abstract
As early as the 1960s, testing organizations began implementing Automated Test Assembly (ATA) to simplify the laborious process of manually assembling test forms and to enhance the psychometric properties of the examinations (Wightman, 1998; van der Linden, 2005). But it is unclear what impact transitioning to ATA has on equating outcomes. The purpose of this research study was to evaluate outcomes from different IRT scale linking and equating methods when a testing organization transitioned from manual test assembly to ATA.
After crossing each scale linking procedure with each equating method, I calculated error and bias indices (e.g., RMSD, MAD, MSD) and evaluated the decision consistency of the equating outcomes.
The results showed that the mean/mean scale linking procedure paired with the IRT preequating method produced the lowest bias and error, and highest level of decision consistency.
The results of this study support the importance of aligning psychometric and test development procedures. The findings suggest that the equating outcomes were related to the similarity of the statistical test specifications. ATA resulted in more parallel test forms with better psychometric properties than forms assembled manually. Therefore, the modifications to assembly practices warrant establishing a new base form for scaling and standard setting.
Introduction
In high-stakes medical licensure testing programs, test developers and psychometricians work together to develop multiple test forms that can be administered simultaneously to examinees to enhance examination security. Although the volume of forms may differ between testing programs, it is crucial that all test forms are built according to the same test specifications (von Davier, 2010). Furthermore, scores on the test forms must be interchangeable and candidates should perceive no difference between the test forms administered (Kolen & Brennan, 2014). The test development processes and psychometric procedures are inherently connected and both must be considered when developing multiple test forms.
Traditionally, test developers have manually assembled multiple test forms according to a set of content requirements. Test developers typically evaluate statistical criteria such as mean proportion of correct responses (p-value) or mean point-biserial correlation upon completion and make adjustments to confirm that statistical specifications are met. Manual test assembly (MTA) is a time-intensive process, typically requiring the attention and work of multiple test developers. However with the widespread use of computers, testing organizations can improve the laborious manual process by developing and employing computer programs to automatically assemble tests. If staff members possess technical computer programming skills, they might create computer programs that can assemble multiple test forms simultaneously by balancing the content and statistical constraints.
When assembling tests manually, test developers use a variety of informational inputs, or constraints, to create multiple forms of an assessment that are balanced in terms of content, difficulty of items, item formats, contextual information of items (e.g., the patient’s life stage), item duration, word count, and exposure rate. Test developers first compile an item pool, which contains a selection of items that meet some basic requirements for inclusion on a test. Scorable items function as operational or anchor items and often have known item parameters based on a prior administration. Test developers iteratively select a group of items that meet the minimum proportions of each domain as specified by the test blueprint and evaluate the range of item statistics or average item statistics, such as p-values and point-biserial correlations. The number of parallel test forms and the number of constraints undoubtedly impacts the complexity of manually assembling forms. Moreover, many testing organizations implement this resource-intensive process across numerous testing programs on an annual or semi-annual basis.
Automated Test Assembly (ATA) is an efficient alternative to this laborious process, although it brings its own unique challenges (Wightman, 1998). Unlike MTA, ATA programs utilize the test information, the summation of item information across the ability continuum, in the creation of multiple parallel test forms. Thus, ATA improves on the manual procedure by not only saving time and resources, but also enhancing the psychometric quality of forms balanced according to a predetermined set of constraints and the maximization of a specified objective function. ATA may also improve reliability across examination forms due to the standardization of the test development process. Therefore, the impact of ATA is not just a question of “Can the computer do it,” but rather “Can the computer do it better?”
In medical licensure examinations there is a critical need for score comparability across test forms, not only to ensure that scores are an accurate, reliable representation of examinee ability, but also to make pass/fail distinctions based on the scores. Earning a passing score on a medical licensure examination allows examinees to enter into supervised medical practice. Therefore, psychometricians work to maintain decision consistency, regardless of the test assembly method and the form administered to examinees. Decision consistency refers to the agreement of an examinee’s pass/fail decisions on two (or more) independent administrations of unique forms, and decision accuracy refers to the agreement between an examinee’s pass/fail decision and the decision that would be made based on the examinee’s true ability (Livingston & Lewis, 1995). Both indices are necessary to evaluate in high-stakes medical licensure testing. In this research, I compare the decision consistency of equated results after implementing ATA.
The results of this research provide a psychometric framework to evaluate results from different equating methods upon the implementation of ATA. When testing organizations implement new test development processes, it is critical to examine the impact on examinee scores (AERA, APA, & NCME, 2014). Testing organizations monitor and evaluate scores and decision consistency of scores on examinations that ultimately license examinees to practice medicine in supervised or unsupervised settings. Neglecting to examine this may inadvertently lead to passing unqualified physicians, or failing qualified physicians.
In ATA, psychometricians and test developers often define linear and/or non-linear constraints in order to maximize a specific objective function, typically the test information function (TIF), at a given score point on the true-ability continuum (van der Linden, 2005). In a high-stakes licensure examination, the minimum passing standard (or cut-score) is commonly used for optimization because it maximizes test information near the cut-score and minimizes the standard error of measurement (SEM) at the cut-score. This leads to increased reliability of scores closest to the cut-score and better accuracy of pass/fail distinctions. Therefore, ATA is designed to enhance the psychometric qualities based on prior item information (i.e., higher reliability coefficients, and lower standard error of measurement near the cut-score), and the efficiency of assembling test forms. However, research has not yet addressed the impact of transitioning from MTA to ATA on results from equating methods. In this study, I investigate the differences in equated results between MTA and ATA forms.
Most ATA processes use an Item Response Theory (IRT) framework to construct forms, with computer programs integrating item-level information according to a set of predetermined constraints. The use of IRT typically goes hand-in-hand with the psychometric framework utilized by the testing program. In IRT, items have a set of unique characteristics; some items are more informative than others at different ability levels. Psychometricians investigate the individual contribution of an item to a test by reviewing the item information function (IIF). The TIF is the summation of IIFs across the ability continuum. The TIF used in ATA represents the characteristics and composition of all items on each test form. Moreover, in the context of medical licensure examinations, using the minimum passing standard as the value for optimization ensures that scores are precise ability estimates for minimally qualified examinees. Thus, when the TIF is optimized at the cut-score, it ultimately reduces the probability of Type I error (unqualified examinees passing the examination). Furthermore, Hambleton, Swaminathan, and Rogers (1991) suggest that the test characteristic curve (TCC) creates the foundation for establishing the equality of multiple test forms, which is certainly the case when optimizing the TIF. The TIF provides aggregate information from each item on the examination, whereas the TCC shows the expected raw score at a given ability level, θ. If we wish to create parallel test forms, then the TCC provides evidence that a given ability level relates to similar expected scores on two parallel forms of the same test. Furthermore, the use of content and statistical constraints in ATA computer programs provides evidence that all test forms are balanced in terms of statistical specifications.
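To make the optimization concrete, the sketch below frames single-form assembly as a small binary program that maximizes the TIF at the cut-score subject to a fixed test length and minimum blueprint representation. It is a minimal illustration only, assuming a tiny hypothetical Rasch-calibrated item pool, an arbitrary cut-score of θ = 0.5, and the open-source pulp solver; it is not the operational ATA program used in this study.

```python
import math
import pulp  # assumed available; open-source MILP wrapper around the CBC solver

# Hypothetical Rasch-calibrated item pool: difficulty (b) and blueprint domain.
pool = [
    {"id": i, "b": b, "domain": d}
    for i, (b, d) in enumerate([(-1.2, "A"), (-0.4, "A"), (0.6, "A"), (0.0, "A"),
                                (0.1, "B"), (0.3, "B"), (1.1, "B"),
                                (0.8, "C"), (1.5, "C"), (-0.2, "C")])
]

def rasch_info(b, theta):
    """Item information under the Rasch model: p(theta) * (1 - p(theta))."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

THETA_CUT = 0.5                              # hypothetical minimum passing standard
FORM_LENGTH = 6                              # hypothetical items per form
MIN_PER_DOMAIN = {"A": 2, "B": 2, "C": 2}    # hypothetical blueprint minimums

prob = pulp.LpProblem("ata_one_form", pulp.LpMaximize)
x = pulp.LpVariable.dicts("use", [item["id"] for item in pool], cat="Binary")

# Objective: maximize the test information function at the cut-score.
prob += pulp.lpSum(rasch_info(item["b"], THETA_CUT) * x[item["id"]] for item in pool)

# Constraints: fixed test length and minimum blueprint (domain) representation.
prob += pulp.lpSum(x.values()) == FORM_LENGTH
for domain, minimum in MIN_PER_DOMAIN.items():
    prob += pulp.lpSum(x[item["id"]] for item in pool if item["domain"] == domain) >= minimum

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("Selected items:", sorted(i for i in x if x[i].value() == 1))
```

In operational settings the same formulation extends to multiple simultaneous forms by adding overlap and exposure constraints, which is where ATA's advantage over MTA is most apparent.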
Once parallel forms are assembled, reviewed, published, and administered, the results must be analyzed and equated. Equating refers to the use of statistical methods to ensure that scores attained from different test forms can be used interchangeably. Equating can be conducted through a variety of designs, approaches, and methods (Kolen & Brennan, 2014). Although there are key differences between IRT and the Rasch model, this research focuses on the applicability of IRT equating methods to a testing program that utilizes the Rasch model as its psychometric framework. Within IRT equating methods, both preequating and postequating methods are widely implemented in K-12 educational settings to ensure scores can be used interchangeably (Tong, Wu, & Xu, 2008). Psychometricians may use IRT to preequate results prior to the start of examination administration, which alleviates the pressure of the tight turnaround between examination administration and score release. Alternatively, postequating methods use response data from the current examination administration (Kolen & Brennan, 2014).
In IRT preequating methods, item parameters are linked from prior calibration(s) to the base form of an examination. For the purpose of this research, item difficulties will be the only item parameter used, which is in alignment with the testing program’s psychometric framework (the Rasch model). The base form (denoted as Form Y) is the form in which the cut-score was established. In order to implement preequating methods, item difficulties for scorable items must be estimated prior to examination administration. Prior to ATA, scorable item difficulties must be known to calculate and maximize the TIF. The alignment of previously calibrated item statistics that are used both for assembling forms using ATA and for preequating may support the applicability of this equating method.
Measurement Models and ATA
IRT allows test developers to “design tests to different sets of specifications and delegate their actual assembly to computer algorithms” (van der Linden, 2005, p. 11). By setting constraints for computerized test assembly, including blueprint domain representation or reasonable ranges for item statistics, test developers can create multiple examination forms that are parallel in difficulty. ATA can incorporate item details regardless of the psychometric paradigm used to calibrate or score examinees and can be applied to polytomously or dichotomously scored examinations. As discussed previously, this study uses data previously calibrated using the Rasch model.
While CTT, IRT, and Rasch approaches to ATA can utilize population-dependent item statistics (i.e., p-values and discrimination indices) as constraints, CTT has no metric equivalent to the TIF. In order to construct parallel test forms in ATA, Armstrong, Jones, and Wang (1994) maximized score reliability through a network-flow model. The authors stated that the CTT approach was advantageous because it was computationally less expensive and produced results comparable to the IRT approach to ATA. When this research was published, computational power was indeed a challenge; however, advances in computer memory and processing power mean that this advantage has not stood the test of time. As such, IRT and Rasch approaches to ATA are better supported in the literature and are the focus of this study.
Prior to beginning ATA, test assemblers must calibrate response data to estimate item parameters from the sample population. Psychometricians often examine the goodness-of-fit of the data to determine the best IRT model (i.e., 1-PL, 2-PL) or confirm that the data fit the Rasch model. Once the examination is administered, psychometricians anchor item parameters based on prior calibrations to estimate examinee ability (van der Linden, 2005). In this study, I calibrate data using the Rasch model and will provide some evidence supporting the appropriateness of the model.
In ATA, test assemblers often optimize the TIF and evaluate the similarity of the forms by comparing the TIFs. However, even well-matched TIFs do not necessarily yield equitable score distributions (van der Linden, 2005). Thus, psychometricians must also continuously evaluate and monitor the score distributions once the examination forms are administered. The main question of this study is which IRT equating method (IRT observed score, IRT true score, or IRT preequating) yields the most comparable scores and decision consistencies when transitioning from MTA to ATA. In the following section, I provide a foundation of linking, equating, and scale linking as it pertains to this study.
Equating
Equating is the special case of linking in which psychometricians transform sets of scores from different assessment forms onto the same scale. By definition, equating methods are only applied to assessment forms that have the same psychometric and statistical properties and test specifications. The primary goal of equating is to allow scores to be used interchangeably, regardless of the form that an examinee was administered (Holland & Dorans, 2006; Kolen & Brennan, 2014).
Assessment programs can employ a variety of equating designs and methods, each with unique characteristics and assumptions. Assessment programs often administer examinations within and across years. For the purpose of this section, I notate an original form of an examination as Y and a new form of an examination as X, with the understanding that assessment programs may administer multiple new forms (X1, X2, ...) or multiple original forms (Y1, Y2, ...). The common-item nonequivalent groups (CINEG) design is commonly used; it requires that previously administered items from original forms be included on new forms as a set of common, or anchor, items. The CINEG design is considered more secure than the random groups design because only a set of common items from an original form is exposed, rather than an entire original form.
The CINEG design not only accounts for the difference in form difficulty, but also for the difference in the populations of test-takers. The statistical role of the common items is to control for differences in the populations, thereby removing bias from the equating function. In order to implement the common items design, the common items must meet several requirements (Dorans et al., 2010). First, the common items must follow the same content and statistical specifications as the full-length test. Second, there should be a strong positive correlation between scores on the full-length test form and scores on the common items, because the common items follow the same specifications as the full-length test. Third, measurement and administration conditions for the common items must be similar across new and original forms. Lastly, prior research recommends that common item sets include at least 20% of the full-length test, or consist of at least 30 items (Angoff, 1971; Kolen & Brennan, 2014). Satisfying these requirements ultimately ensures that the reported scores, and the decisions based on them, are accurate and reliable. The testing program used for this study meets the conditions described above; a small check of these requirements is sketched below.
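As a simple illustration of screening a candidate anchor set against these requirements, the sketch below checks the size rule (at least 20% of the test or 30 items) and the anchor-to-total score correlation. The score vectors are simulated and the 0.80 correlation threshold is purely illustrative; the literature cited above calls only for a strong positive correlation.

```python
import numpy as np

def check_anchor_set(anchor_scores, total_scores, test_length, n_anchor):
    """Screen a candidate anchor set against the size and correlation requirements."""
    r = float(np.corrcoef(anchor_scores, total_scores)[0, 1])
    return {
        "size_ok": n_anchor >= 30 or n_anchor / test_length >= 0.20,
        "anchor_total_correlation": round(r, 3),
        "correlation_ok": r >= 0.80,  # illustrative cutoff; source requires only a strong positive correlation
    }

# Simulated full-test and anchor scores for 500 examinees (hypothetical values).
rng = np.random.default_rng(0)
total = rng.integers(120, 190, size=500)                 # 200-item full-length test
anchor = np.round(0.2 * total + rng.normal(0, 3, 500))   # 40-item anchor set
print(check_anchor_set(anchor, total, test_length=200, n_anchor=40))
```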
IRT equating methods can be applied to data calibrated using the Rasch model and are the focus of this study. In this section, IRT equating methods are discussed in detail; however, first psychometricians must use scale linking procedures to examine the relationship between newly estimated item parameters and original estimations of item parameters from two independent calibrations. Due to the assumption of item invariance, if item parameters are known, no equating or scale linking is necessary and IRT preequating methods can be implemented prior to test administration (Hambleton et al., 1991). However, in practice it is important to implement scale linking procedures because there are often differences in item parameter estimates (Stocking, 1991).
Scale linking is the process by which independently calibrated item difficulties are linked onto a common scale. Several methods can be used to calculate scaling constants in order to place the item difficulties from form X on the same scale as Y (Hambleton et al., 1991). The mean/mean, mean/sigma, and TCC methods are discussed in their application to this study. Prior research supports the performance of TCC methods over other methods (i.e., mean/mean or mean/sigma) for scale linking due to the stability of the results and the precision, even when item parameters had modest standard errors (Kolen & Brennan, 2014; Li et al., 2012). Other research investigated the adequacy of different scale linking procedures within the Rasch model.
The mean/mean method sets the slope A from the ratio of the mean discrimination parameters of the common items and the intercept B from their mean difficulties; under the Rasch model, in which all discriminations are fixed at 1, A reduces to 1 and B is simply the difference between the mean difficulties of the common items on forms Y and X. The mean/sigma method calculates the scaling constants A and B from the means and standard deviations of the difficulty parameters of the common items on forms X and Y. There are two main TCC scale linking procedures, both iterative processes that utilize item parameter estimates; the focus of the current study is the Stocking and Lord (1983) procedure. This method exploits the scale indeterminacy property of IRT: an examinee with a given ability has the same probability of answering an item correctly regardless of the scale used to report scores. The Stocking and Lord TCC procedure compares, for each common item, the probability of a correct response computed on the original scale with the probability computed on the transformed new scale, taking differences in examinee ability into consideration. Equation 10 represents the difference between the TCCs of the common items as administered on form Y and on form X (after transformation); an iterative process then solves for A and B by minimizing the criterion in Equation 11 across examinees (or ability points).
F(\theta_i) = \sum_{j \in V} p_{j}\!\left(\theta_i; \hat{b}_{Yj}\right) - \sum_{j \in V} p_{j}\!\left(\theta_i; A\hat{b}_{Xj} + B\right) (10)

SLcrit = \sum_{i=1}^{N} \left[ F(\theta_i) \right]^{2} (11)

where V denotes the set of common items, \hat{b}_{Yj} and \hat{b}_{Xj} are the difficulty estimates of common item j from the form Y and form X calibrations, and A and B are the scaling constants.
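The sketch below illustrates how the three linking procedures could be computed for a handful of hypothetical common-item Rasch difficulties. The slope A is fixed at 1 for the mean/mean and Stocking-Lord calculations (a Rasch simplification), while mean/sigma estimates A from the standard deviations; scipy is assumed to be available. This is a simplified sketch, not the procedure used operationally in this study.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical Rasch difficulties for the common (anchor) items, estimated
# independently on the new form X and the base form Y.
b_x = np.array([-1.10, -0.35, 0.20, 0.75, 1.40])
b_y = np.array([-0.95, -0.20, 0.30, 0.90, 1.55])

# Mean/mean and mean/sigma linking constants (slope A, intercept B).
A_mm, B_mm = 1.0, b_y.mean() - b_x.mean()        # A fixed at 1 under the Rasch model
A_ms = b_y.std(ddof=1) / b_x.std(ddof=1)
B_ms = b_y.mean() - A_ms * b_x.mean()

def tcc(b, theta):
    """Rasch test characteristic curve over the common items at each theta."""
    return (1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))).sum(axis=1)

theta_grid = np.linspace(-4, 4, 81)

def sl_criterion(B, A=1.0):
    """Stocking-Lord criterion: squared TCC difference summed over the theta grid."""
    return float(np.sum((tcc(b_y, theta_grid) - tcc(A * b_x + B, theta_grid)) ** 2))

# With the slope fixed at 1 (Rasch simplification), only the intercept B is estimated.
B_sl = minimize_scalar(sl_criterion, bounds=(-3, 3), method="bounded").x
print(round(B_mm, 3), round(B_ms, 3), round(B_sl, 3))
```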
Once item parameters are on the same scale, IRT equating methods are employed. IRT true score equating is the most commonly used IRT equating method. In IRT true score equating, the true score, \tau, is the expected number-correct score for examinee i with ability \theta_i (Kolen & Brennan, 2014). True score equating also assumes that there are no omitted responses (von Davier & Wilson, 2007). In a simple example, psychometricians first identify a true score on form X and estimate the corresponding ability level (see Equation 12); the true score on form Y, \tau_Y, is then determined at that ability level (see Equation 13). In other words, the form Y equivalent is obtained by applying the form Y TCC to the ability obtained by inverting the form X TCC. This process is iterative and typically involves the Newton-Raphson method (Kolen & Brennan, 2014; Han et al., 1997).
\tau_X(\theta_i) = \sum_{j \in X} p_{j}(\theta_i) \quad \text{and} \quad \hat{\theta}_i = \tau_X^{-1}(\tau_X) (12)

\tau_Y(\hat{\theta}_i) = \sum_{j \in Y} p_{j}(\hat{\theta}_i) (13)
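A minimal sketch of the Newton-Raphson inversion described above is given below, assuming hypothetical Rasch difficulties for forms X and Y that are already on a common scale. It equates a single form X true score to its form Y equivalent, following equations 12 and 13.

```python
import numpy as np

# Hypothetical Rasch difficulties for two 40-item forms, already on a common scale.
b_x = np.linspace(-2.0, 2.0, 40)
b_y = np.linspace(-1.8, 2.2, 40)

def tau(theta, b):
    """Number-correct true score: sum of Rasch item response probabilities."""
    return float(np.sum(1.0 / (1.0 + np.exp(-(theta - b)))))

def tau_derivative(theta, b):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return float(np.sum(p * (1.0 - p)))

def theta_for_true_score(t, b, theta=0.0, tol=1e-8, max_iter=100):
    """Newton-Raphson inversion of the TCC: find theta such that tau(theta, b) = t."""
    for _ in range(max_iter):
        step = (tau(theta, b) - t) / tau_derivative(theta, b)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Equate a form X true score of 25 to its form Y equivalent (equations 12 and 13).
theta_hat = theta_for_true_score(25.0, b_x)
print(round(tau(theta_hat, b_y), 2))
```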
Unlike IRT true score equating methods, the IRT observed score equating method depends on the distribution of examinee abilities. The IRT observed score equating method is similar to equipercentile equating methods without the application of additional smoothing techniques, as previously discussed. It requires specifying the distributional characteristics of examinees prior to equating, using prior distributions (Kolen & Brennan, 2014).
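The conditional number-correct distributions needed for IRT observed score equating are typically obtained with the Lord-Wingersky recursion (Kolen & Brennan, 2014). The sketch below computes that recursion for hypothetical Rasch difficulties and marginalizes over an assumed normal ability distribution; equipercentile equating of the resulting marginal distributions for forms X and Y would follow.

```python
import numpy as np

# Hypothetical Rasch difficulties for a 40-item form on the base scale.
b_x = np.linspace(-2.0, 2.0, 40)

def lord_wingersky(p):
    """Conditional number-correct score distribution given item probabilities p_j(theta)."""
    f = np.array([1.0])
    for pj in p:
        f = np.concatenate((f, [0.0])) * (1.0 - pj) + np.concatenate(([0.0], f)) * pj
    return f

# Marginal observed-score distribution: weight the conditional distributions by a
# discrete ability distribution (a standard normal quadrature is assumed here).
theta_q = np.linspace(-4.0, 4.0, 41)
w_q = np.exp(-0.5 * theta_q ** 2)
w_q /= w_q.sum()

marginal_x = sum(
    w * lord_wingersky(1.0 / (1.0 + np.exp(-(t - b_x)))) for t, w in zip(theta_q, w_q)
)
print(marginal_x.round(4))  # P(X = 0), ..., P(X = 40); equipercentile equating follows
```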
All of the IRT equating methods previously discussed require data from the current test administration cycle. However, the IRT preequating method can be used when items are pretested prior to operational use. Once items are on the same scale, psychometricians generate raw-to-scale conversion tables prior to form administration, which ultimately decreases the workload for score release (Kolen & Brennan, 2014). Many testing organizations utilize IRT preequating in order to shorten the window for score release after examination administration. Testing organizations may also prefer IRT preequating methods due to their flexibility when equating scores for computer-based examinations that are administered intermittently over a long testing cycle.
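As a sketch of what preequating preparation can look like under the Rasch model, the code below builds a raw-score-to-theta conversion table from hypothetical precalibrated (anchored) item difficulties before any administration data exist. The form length and any subsequent reporting-scale transformation are assumptions of the example, not the testing program's actual conversion.

```python
import numpy as np

# Hypothetical precalibrated (anchored) Rasch difficulties for a 40-item form.
b = np.linspace(-2.0, 2.0, 40)

def theta_at(raw, theta=0.0, n_iter=100):
    """Ability corresponding to an interior raw score (Newton-Raphson on the TCC)."""
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        theta -= (p.sum() - raw) / np.sum(p * (1.0 - p))
    return theta

# Raw-to-theta conversion table built before any administration (interior scores only);
# a reporting-scale transformation could then be applied to the theta column.
conversion_table = {raw: round(theta_at(raw), 3) for raw in range(1, 40)}
print(conversion_table[20], conversion_table[30])
```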
Researchers have compared the results among equating designs, methodologies and procedures; yet no researchers have compared scale linking or equating outcomes among ATA and MTA forms. In the current study, RMSD, MSD and MAD were calculated to examine the error and bias associated with scores. Researchers have commonly used these indices to evaluate the comparability of equating methods (Antal, Proctor, & Melican, 2014; Gao, He, & Ruan, 2012; Kolen & Harris, 1990).
Decision Consistency
In the context of testing programs that aim to categorize examinees into one or more groups based on their scores, such as medical licensure examinations, classification (decision) consistency reflects the agreement of classifications across parallel forms or replications, whereas classification accuracy is a measure of whether examinees were accurately classified relative to their true ability (Lee, 2010).
Research Questions
The goal of this study is to compare equating results when a testing organization moved from MTA to ATA. The research questions address the comparability of outcomes from nine methodological combinations formed by crossing three IRT equating methods with three scale linking procedures.
- Which method of IRT equating (i.e., IRT observed score, IRT true score, or IRT preequating) minimizes the error and bias associated with equating between MTA- and ATA-developed forms?
- Which method of IRT equating (i.e., IRT observed score, IRT true score, or IRT preequating) yields the highest expected decision consistency of pass/fail distinctions between MTA- and ATA-developed forms?
METHOD
Data
In this study I used two years of response data from a large-scale medical licensure examination. From the 36 Y forms, I selected four forms (denoted Y1 through Y4). There was item overlap among the 36 Y forms, which made it possible to calibrate data from all 36 Y forms concurrently using the Rasch model.
First, I aggregated key information for each form; pretest items were not embedded within each Y form. The pretest design used a total of 12 unique pretest blocks, each consisting of 50 items, with overlap among blocks. The test administration vendor randomly assigned pretest blocks to examinees. Therefore, pretest items needed to be reviewed and selected for the form-by-form calibration required by the CINEG design. Figure 5 shows the design of an intact Y form (denoted form A) of operational items and a plausible assignment of six pretest blocks. In Figure 5, Form A consists only of operational items; the test vendor randomly assigned one pretest block from group A (PTA) and one pretest block from group B (PTB). The diagram is a simplified depiction of the true design, which can ultimately yield more than 5,100 different combinations. Therefore, I employed a threshold of 30 responses to determine which pretest items had sufficient exposure for inclusion in the form-by-form calibration. Despite anchoring the item difficulties, requiring at least 30 exposures ensured there were sufficient data to investigate data-model fit.
The concurrent calibration of the operational and pretest items on the Y forms resulted in item difficulties on the same scale of measurement. Additionally, I selected four X forms (denoted X1 through X4). I used three criteria to select the eight forms for this study: (a) X forms with the highest volume of administrations in the first several weeks following the examination launch, (b) X forms and Y forms with at least 20 percent overlap or at least 30 common items for scale linking purposes, and (c) a common item set that was representative of the test blueprint (Angoff, 1971; Kolen & Brennan, 2014). The data design is shown in Figure 6, with a separate common item set linking each form pair (X1 with Y1, X2 with Y2, X3 with Y3, and X4 with Y4).
Approximately 7,600 examinees took one of the Y forms in year 1, and 4,300 first-time examinees took one of the X forms in the first testing window of year 2. After selecting the four X forms, as previously described, I used data from the approximately 1,300 examinees who were administered two of the selected X forms and the approximately 1,200 examinees who were administered the other two. Table IV displays a summary of the data selected for this research study.
Response data from year 1 on all 36 Y forms were concurrently calibrated using WINSTEPS® (Linacre, 2017). The estimated item difficulties were then used as anchors for each separate calibration of the X forms. Y is considered the base form of the examination, and therefore no equating of original forms was conducted.
Data Analyses
All data management and analyses were conducted in RStudio (2016), unless otherwise specified. A single alpha criterion was used to evaluate the statistical significance of all tests, unless otherwise specified.
1. Research Question 1: Equating Methods and Error
Which method of IRT equating (i.e., IRT observed score, IRT true score, or IRT preequating) minimizes the error and bias associated with equating between MTA- and ATA-developed forms?
I employed three scale linking approaches (mean/mean, mean/sigma, and Stocking-Lord TCC) and three equating methods (IRT observed score, IRT true score, and IRT preequating). I utilized the PIE computer program to implement the IRT observed score and IRT true score equating methods (Hanson, Zheng, & Cui, 2004a). To assess the equating results, I compared the root mean squared difference (RMSD), mean absolute difference (MAD), and mean signed difference (MSD) between scores on form X equated to the Y scale and scores on form Y. I then evaluated which method minimized bias by identifying RMSD values closest to 0 and which method minimized error by identifying MSD and MAD values closest to 0; higher indices indicate an accumulation of error and are not preferred. Findings from prior research show that IRT preequating methods often have higher levels of error associated with examinee scores. However, because the same precalibrated item difficulties are used both for ATA and for preequating, I expected that the design of ATA might affect the equated results.
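For reference, the three indices reduce to simple functions of the differences between equated and criterion scores; the sketch below computes them for a small set of hypothetical score pairs.

```python
import numpy as np

# Hypothetical raw scores: form X scores equated to the Y scale and the criterion scores.
equated = np.array([61.2, 74.8, 55.0, 82.3, 67.9])
criterion = np.array([63.0, 75.5, 57.1, 81.0, 70.2])

diff = equated - criterion
rmsd = float(np.sqrt(np.mean(diff ** 2)))   # root mean squared difference
msd = float(np.mean(diff))                  # mean signed difference
mad = float(np.mean(np.abs(diff)))          # mean absolute difference
print(round(rmsd, 3), round(msd, 3), round(mad, 3))
```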
2. Research Question 2: Equating Methods and Decision Consistency
Which method of IRT equating (i.e., IRT observed score, IRT true score, or IRT preequating) yields the highest expected decision consistency of pass/fail distinctions between MTA- and ATA-developed forms?
Using the outcomes from research question 1, I estimated decision consistency indices using Huynh's (1990) methodology, which uses the probability density function, the item characteristic functions (ICFs), and the relative frequencies of raw scores from a single population to estimate two common decision consistency indices: a raw agreement index, p, and kappa, \kappa (see equations 18 and 19). The raw agreement index, p, is calculated using the conditional cumulative distribution functions of test scores and the relative frequencies of test scores. Kappa is calculated from the difference between the raw agreement index, p, and p_c, the expected proportion of consistent decisions if there is no relationship between test scores; kappa therefore indicates the decision consistency beyond what is expected by chance (Subkoviak, 1985).

p = \sum_{x=0}^{n} f(x) \sum_{k=1}^{K} \left[ \Delta_k(\theta_x) \right]^{2} (18)

\kappa = \frac{p - p_c}{1 - p_c} (19)

where

p_c = \sum_{k=1}^{K} \left[ \sum_{x=0}^{n} f(x)\, \Delta_k(\theta_x) \right]^{2} (20)

and

\Delta_k(\theta_x) = F(c_k \mid \theta_x) - F(c_{k-1} \mid \theta_x) (21)

where \theta_x represents the ability level at a given raw score, x; \Delta_k(\theta_x) represents the difference between the cumulative distribution functions of the raw score evaluated at the raw cut-scores bounding category k, at ability level \theta_x; f(x) represents the relative frequency of raw score x; and K represents the number of classifications.
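The sketch below illustrates this computation for a single pass/fail cut-score, following the structure of equations 18 through 21: a Lord-Wingersky recursion yields the conditional raw score distribution at the ability associated with each raw score, and the raw agreement index and kappa are then aggregated over the observed relative frequencies. The Rasch difficulties, cut-score, and score frequencies are all hypothetical, and this is an illustration rather than the operational implementation used in the study.

```python
import numpy as np

# Hypothetical inputs: anchored Rasch difficulties for a 40-item form, a raw
# cut-score, and stand-in relative frequencies f(x) for raw scores 0..40.
b = np.linspace(-2.0, 2.0, 40)
RAW_CUT = 24
rng = np.random.default_rng(1)
freq = rng.dirichlet(np.ones(41))

def theta_at(raw, theta=0.0):
    """Ability tied to an interior raw score (Newton-Raphson on the TCC)."""
    for _ in range(100):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        theta -= (p.sum() - raw) / np.sum(p * (1.0 - p))
    return theta

def score_dist(theta):
    """Lord-Wingersky recursion: P(raw score | theta) under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    f = np.array([1.0])
    for pj in p:
        f = np.concatenate((f, [0.0])) * (1.0 - pj) + np.concatenate(([0.0], f)) * pj
    return f

# Delta_k(theta_x): probability of each classification (fail, pass) at the ability
# associated with raw score x. Extreme raw scores are classified with certainty.
delta = np.zeros((41, 2))
for x in range(1, 40):
    d = score_dist(theta_at(x))
    delta[x] = [d[:RAW_CUT].sum(), d[RAW_CUT:].sum()]
delta[0], delta[40] = [1.0, 0.0], [0.0, 1.0]

p_raw = float(np.sum(freq[:, None] * delta ** 2))     # equation 18
marginals = freq @ delta                               # marginal category proportions
p_chance = float(np.sum(marginals ** 2))               # equation 20
kappa = (p_raw - p_chance) / (1.0 - p_chance)          # equation 19
print(round(p_raw, 3), round(kappa, 3))
```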
RESULTS
Research Question 1
To evaluate the adequacy of the results, I calculated the RMSD, MSD, and MAD. RMSD is a measure of bias, and MSD and MAD are measures of random error. Values closer to 0 indicate smaller raw score differences between MTA and ATA forms. Overall, there were large differences in the amount of bias and error across forms and equating methods; therefore, RMSD, MSD, and MAD are presented separately for each form (see Table XV). Across all forms, the combination of scale linking and equating with the least amount of error and bias was the mean/mean preequating method.
Table XV
BIAS AND ERROR INDICES BY EACH SCALE LINKING AND EQUATING METHOD
| Form   | Index | Observed MM | Observed MS | Observed SL | True MM | True MS | True SL | Preequating MM | Preequating MS | Preequating SL |
|--------|-------|-------------|-------------|-------------|---------|---------|---------|----------------|----------------|----------------|
| Form 1 | RMSD  | 19.35       | 20.46       | 21.85       | 19.51   | 20.55   | 22.10   | 8.02           | 8.41           | 8.278          |
| Form 1 | MSD   | -18.83      | -20.27      | -21.38      | -18.97  | -20.35  | -21.62  | -7.71          | -8.05          | -7.96          |
| Form 1 | MAD   | 18.83       | 20.27       | 21.38       | 18.97   | 20.35   | 21.62   | 7.71           | 8.05           | 7.96           |
| Form 2 | RMSD  | 10.37       | 5.47        | 9.40        | 10.35   | 5.70    | 9.40    | 4.74           | 5.24           | 4.91           |
| Form 2 | MSD   | -10.31      | -4.18       | -9.35       | -10.29  | -4.29   | -9.34   | -4.31          | -4.75          | -4.46          |
| Form 2 | MAD   | 10.31       | 4.34        | 9.35        | 10.29   | 4.49    | 9.34    | 4.32           | 4.75           | 4.47           |
| Form 3 | RMSD  | 2.53        | 4.57        | 1.53        | 2.51    | 4.53    | 1.53    | 2.51           | 2.96           | 2.68           |
| Form 3 | MSD   | -2.32       | -4.45       | -0.62       | -2.27   | -4.39   | -0.58   | -1.69          | -2.08          | -1.89          |
| Form 3 | MAD   | 2.38        | 4.48        | 1.29        | 2.35    | 4.41    | 1.28    | 1.99           | 2.34           | 2.15           |
| Form 4 | RMSD  | 12.42       | 12.46       | 15.91       | 12.92   | 13.03   | 16.84   | 2.55           | 3.01           | 2.82           |
| Form 4 | MSD   | -11.92      | -11.83      | -15.29      | -12.30  | -12.26  | -16.02  | -1.90          | -2.27          | -2.20          |
| Form 4 | MAD   | 11.92       | 11.83       | 15.29       | 12.30   | 12.26   | 16.02   | 2.07           | 2.43           | 2.31           |
Note. MM represents mean/mean scale linking, MS represents the mean/sigma scale linking, and SL represents the Stocking and Lord TCC scale linking procedure. Due to the disparate index values across forms, results are shown for each form separately.
For each index and form (by row), the most favorable result is the value closest to 0.
Preequating Method
Across the three equating methods paired with the three scale linking procedures, the results indicated that the mean/mean scale linking procedure with the preequating method performed the most favorably for three of the four forms (forms 1, 2, and 4). For form 1, the mean/mean preequating method resulted in lower bias and error than all other methods (RMSD = 8.02, MSD = -7.71, and MAD = 7.71), whereas the highest amount of bias was associated with the Stocking and Lord TCC procedure paired with the IRT true score equating method (RMSD = 22.10, MSD = -21.62, MAD = 21.62). For form 2, the mean/mean preequating method produced the most favorable results in comparison to all other methods (RMSD = 4.74, MSD = -4.31, and MAD = 4.32). However, the Stocking and Lord TCC scale linking procedure paired with the preequating method produced results only slightly higher than the mean/mean preequating method (within 0.5 raw score points). This small difference (within 0.5 raw score points) in RMSD, MSD, and MAD between the scale linking procedures within the preequating method was present across all forms. For form 3, the mean/mean preequating method produced a slightly higher RMSD than the Stocking and Lord true score equating method (RMSD = 2.51 and 1.53, respectively), a difference of about 1 raw score point. Therefore, the results from the mean/mean preequating method showed a slight improvement over the other scale linking procedures within the preequating method, although there was very little practical difference in the results across the scale linking procedures.
True Score and Observed Score Equating Methods
The results from the true score and observed score equating methods with each scale linking procedure were comparable across all forms. For form 1, the true and observed score methods yielded very consistent results. Specifically, the mean/mean observed score method and the mean/mean true score method resulted in similar levels of error (MSD = -18.83 and MSD = -18.97, respectively), and the maximum deviation between raw scores in terms of MSD for the mean/sigma true score and observed score methods was approximately 0.35 raw score points. Furthermore, for form 2, the Stocking and Lord TCC scale linking procedure paired with the observed score and true score methods produced similarly high amounts of error (RMSD = 9.400 and 9.399, respectively). Unique to form 3, the Stocking and Lord true score and observed score methods produced the lowest bias (RMSD = 1.53) and error (MAD = 1.279 and MAD = 1.292, respectively) across all conditions. The results from combining each scale linking procedure with the true score and observed score methods varied across forms; in some cases, the Stocking and Lord TCC procedure performed least favorably (forms 1 and 4), while in other cases, the mean/sigma scale linking procedure performed least favorably (form 3). Therefore, the findings are inconclusive in terms of the preferred scale linking procedure for the IRT observed score and true score methods, although the evidence suggests that the Stocking and Lord TCC procedure produced higher levels of error for two forms.
Research Question 2
Overall, the mean/mean preequating method and the Stocking and Lord TCC preequating method performed the most favorably. The true score and observed score methods produced similar levels of decision consistency, indicating little practical difference.
Figure 10. Mean decision consistency indices, p (blue) and κ (red), across all forms. MM represents mean/mean scale linking, MS represents the mean/sigma scale linking, and SL represents the Stocking and Lord TCC scale linking procedure.
The average decision consistency indices across all ATA forms improved in comparison to baseline estimates from MTA forms. For example, the raw agreement index was 1% to 3% greater for ATA forms than for MTA forms. For three of the four forms, the raw agreement index was highest for preequating methods. Similar to the findings from research question 1, the results from the decision consistency evaluation indicated that the Stocking and Lord TCC scale linking procedure paired with the IRT true and observed score methods performed the most favorably for form 3 (see Figure 11). Results for each form and equating method are displayed in Appendix B.
Discussion
In order to examine the differences in equating outcomes introduced by transitioning from MTA to ATA, I employed three scale linking procedures and three equating methods. I calculated the RMSD, MSD, and MAD to determine which combination of scale linking procedure and equating method resulted in the least amount of bias and error. The preequating method with the mean/mean scale linking procedure produced the most favorable results for three of the four forms, even when the number of common items did not meet recommended criteria. Results from the decision consistency analyses indicated that the preequating method outperformed the true score and observed score equating methods in terms of the raw agreement index, p, whereas the true score and observed score methods produced the most favorable decision consistency in terms of κ. The variation in the error of equated scores across forms suggests that MTA and ATA forms cannot be directly compared. If testing organizations begin to implement ATA for form assembly, they should give thoughtful consideration to the use of MTA forms as the base forms for equating purposes.
Research Question 1: Equating Methods and Error
Because IRT equating methods are implemented after scale linking, the quality of the equating results depends on the quality of the scale linking results. The common item sets were sufficiently sized only for form pair 1 and form pair 3; therefore, the generalizability of the equating results for forms 2 and 4 is limited. Note, however, that all previously known item statistics were used for preequating, not just those included in the common item set.
It is important to note that, for the purpose of this research, raw scores were used to evaluate the error and bias associated with the equated scores from each method. Overall, the optimal method for three of the four forms was the mean/mean scale linking procedure paired with the preequating method. The mean/mean preequating method produced the lowest amount of bias as measured by RMSD. While other methods produced slightly lower MSD or MAD, these values did not differ appreciably or practically from the others (typically by less than 0.2 raw score points). For the two forms with sufficient common item sets, the mean/mean preequating method and the Stocking and Lord TCC true score method produced the most favorable results. Similarly favorable findings for the mean/mean preequating method were observed across the remaining forms; that is, despite an insufficient number of items for scale linking purposes, the results still supported preequating methods. This may be due in part to the similarity between the Rasch equating model and the mean/mean preequating method.
There were large differences in the magnitude of RMSD among the true score, observed score, and preequating methods across the forms. Specifically, form 1 had the highest RMSD values across all equating methods, whereas form 3 had the lowest RMSD values across all equating methods. The differences in RMSD across the forms provide additional evidence that the new and original forms were not built to the same statistical specifications.
Results from prior research indicated that preequating methods have performed poorly in comparison to postequating methods. Kolen and Harris (1990) reported that the IRT preequating method resulted in the highest values of RMSD and MSD in comparison to IRT postequating methods. Tong and Kolen (2005) compared the adequacy of equated scores from the traditional equipercentile, IRT true score, and IRT observed score equating methods using three criteria and found that the IRT true score method performed least favorably in comparison to the IRT observed score method. In this respect, the results from the current study disagreed with previous literature. Yet the goal of the current study was to evaluate differences in equating outcomes between MTA and ATA forms. To date, no research studies have compared outcomes from different equating methods when testing organizations transitioned from MTA to ATA; therefore, the lack of consistency with prior literature may relate to the change in test development procedures, differences in psychometric framework (i.e., IRT 2-PL versus Rasch model), or differences in the nature and purpose of the testing program (i.e., K-12 versus medical licensure). For example, much of the equating literature uses K-12 assessment programs to investigate differences in equating methodologies. K-12 assessment programs are built to different test specifications, as the purpose of these examinations may be to evaluate and monitor student growth rather than to pass or fail examinees. These assessment programs often have different characteristics than medical licensure examinations, including shorter administration windows and different modes of delivery. The difference in results can also be explained by the utility of a purposeful equating design. Although the testing program used for the current study did not operationally implement a CINEG design, I employed this design by selecting data that conformed to its requirements. Two of the four forms had sufficiently sized common item sets, yet the favorable findings for the preequating method held for three of the four forms. This may be because the preequating method relied on a quality bank of linked items rather than on a small common item set.
The RMSD, MAD, and MSD are commonly used measures to gain an overall understanding of the differences between equated scores and those on the base form. Yet the standard error of equating is another commonly used approach to evaluate the adequacy of equating results and can be used to gain a better understanding of the error associated across the distribution of equated scores. The standard error of equating replicates hypothetical samples to approximate the standard deviation of each equated score (Kolen & Brennan, 2014). Future research can expand on this study by calculating the standard error of equating for the IRT true and observed score equating methods.
The results of this research question suggest that, prior to implementing ATA, the equating design should be thoroughly considered in light of the purpose of the assessment. In agreement with best practices in test assembly, any time new test development procedures are implemented, results processing should be carefully considered (AERA, APA, & NCME, 2014). The implementation of and variation in test assembly procedures necessitates reviewing and evaluating current psychometric procedures (e.g., standard setting, equating designs). A key recommendation for practitioners is to discontinue the use of MTA forms as the base form when an organization newly implements ATA procedures, as the findings from this research suggest that there is too much variation in the statistical specifications of MTA forms to support their continued use as the base form.
Research Question 2: Equating Methods and Decision Consistency
Although there are many ways to evaluate decision consistency, it was measured here using two estimates: the proportion of raw agreement, p, and the kappa index, κ, which corrects the raw agreement index for what is expected by chance. Both the mean/mean preequating method and the Stocking and Lord TCC preequating method produced the highest raw agreement indices across forms. This finding supports the findings from research question 1, in which the lowest error and bias were found for the same methods. The raw agreement indices for the other equating methods differed slightly within each form, typically within 1%. In comparison, the κ estimates were inconsistent across forms (see Appendix B). Moreover, the decision consistency index of ATA forms was 1% to 3% greater than that of the MTA forms. This finding provides additional evidence that ATA enhances the psychometric properties of examinations that make pass/fail distinctions.
Although decision consistency is an important aspect for psychometricians to explore, it does not fully explain the outcomes of examinations with pass/fail decisions. Specifically, without simulation studies in which true ability is known, one cannot know for certain that decisions are accurate. Although there are ways to explore and provide evidence of decision accuracy, doing so was beyond the scope of the current study. Future research is warranted on approaches such as Lee's (2010) to evaluate both decision consistency and decision accuracy when testing programs newly implement ATA.
Limitations
The testing program did not implement a CINEG design operationally; however, the data lent themselves to the implementation of a CINEG design because of the use of anchor blocks in ATA. Although I confirmed key equating requirements and controlled for others by carefully selecting the forms used in the study, the common item sets were not a perfect representation of the content or statistical specifications of the entire test. Moreover, the pretest item design changed between MTA and ATA forms, which may have influenced the findings. Specifically, pretest item blocks were assigned randomly to examinees on Y forms, whereas pretest item blocks were embedded within each X form. Although these limitations are understandable when using operational data, they constrain the findings. Lastly, the 0.3-logit criterion used to establish the common item set is the one employed operationally, although there are alternative methods for identifying outlying items in a common item set. Therefore, future research is warranted to address the replicability of this study when considering the purpose of the examination and equating designs in conjunction with test design decisions (i.e., ATA).
In addition, the assessment used in this study has unique and complex characteristics (e.g., test specifications, blueprint, and constraints) that may limit the generalizability of the results. For example, the forms assembled using ATA programs involved approximately 70 content and statistical constraints (i.e., domain representation, life stage of patient, clinical setting, mean item p-value, etc.) and maximized the TIF to create parallel forms. Although MTA and ATA forms were built according to the same domain representation, other variables were not as tightly controlled in MTA as they were in ATA. Furthermore, ATA enforces over 70 standardized constraints, whereas during MTA test developers loosened the constraints one by one, form by form. The degree of similarity between the constraints enforced in ATA and MTA procedures likely influenced the findings of this study. The results of this study shed light on the enhanced quality of ATA forms; however, this improvement necessitates reconsidering the continued use of MTA forms as the base examination. Future research may address the similarities and differences between MTA and ATA procedures by simulating ATA conditions and assessing the outcomes from different equating methods. Such research would provide insight into how the similarity (or difference) between assembly procedures may influence outcomes from different equating methodologies.
Prior researchers have developed a variety of models that can be used to implement optimal test design; a brief overview of the different models is provided by van der Linden (1998). It is well documented that forms developed via ATA have more favorable psychometric properties than forms developed via MTA, due to the overall test design and the defining attributes of ATA (Luecht, 1998; van der Linden, 2005). This research study provides some evidence that ATA creates more parallel test forms, not only in terms of content and statistical specifications, but also with respect to test information, data-model fit, and decision consistency. Yet very few studies have provided empirical evidence of the quality improvement of ATA over MTA. Given the growing popularity of ATA, more research is warranted on the replicability of this study (e.g., through simulation studies), on other psychometric advantages that result from implementing ATA, and on applications to assessment programs with different purposes and test designs.
Summary
The widespread implementation of ATA procedures has alleviated the workload of test developers by allowing computer programs to create multiple parallel test forms with relative ease. ATA procedures provide an efficient and cost-effective alternative for assembling parallel test forms simultaneously. The integral psychometric goal of ATA is the minimization of the SEM and the maximization of test score reliability (van der Linden, 2005). However, ATA is not only a question of whether computer programs ease the workload, but also whether they improve the psychometric quality of assembled test forms. The results presented in this research study provide empirical evidence of the improvement in the psychometric qualities of ATA forms. Whenever testing organizations implement new test development practices, it is important to evaluate the outcomes (AERA, APA, & NCME, 2014). Assessing the adequacy of score outcomes from various equating methods is one way to investigate the relationship between psychometric quality and the new implementation of ATA programs. In this research study, I evaluated the adequacy of different equating methods by estimating the bias, error, and decision consistency associated with score outcomes of newly developed ATA forms.
The context of evaluating equating methodologies with respect to test assembly procedures is important in today’s operational psychometric work as many testing organizations move towards ATA. Although testing organizations may utilize item parameters estimated using the Rasch model or IRT models for ATA, no previous research has connected differences in test assembly procedures to outcomes of equating methods. The results of this study further support the importance of planning and aligning the psychometric procedures to the test development procedures. The findings of this study suggest that the error and consistency of scores were related to the similarity in statistical test specifications. ATA led to the development of parallel forms that had better psychometric properties and less variation in content and statistical specifications than test forms assembled manually.
The results indicated that, despite the differences in statistical specifications, the mean/mean preequating method performed the most favorably. This finding may be explained by the alignment among the mean/mean preequating method, the psychometric framework of the Rasch model, and the fact that ATA utilizes the same item difficulties to build each form. Conceptually, the mean/mean preequating method is similar to the Rasch equating method, which anchors known item difficulties; the two approaches are therefore aligned. Furthermore, because ATA utilizes the same known item difficulties to build forms with similar TIFs that peak at the cut-score of the examination, all of these methods are complementary and work in tandem. Future research should expand on these findings by investigating the outcomes from similar equating methodologies when ATA forms are used as the base forms.
Cited Literature
Ali, U. S., & van Rijn, P. W. (2016). An evaluation of different statistical targets for assembling
parallel forms in item response theory. Applied Psychological Measurement, 40(3),
163-179.
Antal, J., Proctor, T.P., & Melican, G.J. (2014). The effect of anchor test construction on scale
drift. Applied Measurement in Education, 27(3), 159-172, doi: 10.1080/08957347.2014.905785
American Educational Research Association, American Psychological Association, National
Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (AERA, APA, & NCME). (2014). Standards for educational and psychological testing (pp. 75-94). Washington, DC: American Educational Research Association.
Armstrong, R. D., Jones, D. H., & Wang, Z. (1994). Automated parallel test construction using
classical test theory. Journal of Educational Statistics, 19(1), 73-90. doi:10.2307/1165178
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible
paradigms? In Smith, E. V., & Smith, R. M. (Ed). Introduction to Rasch Measurement: Theory, models and applications. Maple Grove, MN: JAM Press.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.),
Educational measurement (pp. 508–600). Washington, DC: American Council on Education.
Babcock, B., & Albano, A. D. (2012). Rasch scale stability in the presence of item parameter and
trait drift. Applied Psychological Measurement, 36(7), 565-580. doi:10.1177/0146621612455090
Bock, D. R. (1997). A brief history of item response theory. Educational Measurement: Issues
and Practice, 16(4), 21-33.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York:
Holt, Rinehart, and Winston.
Debeer, D., Ali, U. S., & van Rijn. P. W. (2017). Evaluating the statistical targets for assembling
parallel mixed-format test forms. Journal of Educational Measurement, 54(2), 218-242.
Dorans, N., Moses, T. & Eignor D. (2010) Principles and Practices of Test Score Equating. ETS
RR-10-29. ETS Research Report Series.
Eignor, D. R. & Stocking, M. L. (1986). An investigation of possible causes for the inadequacy
of IRT preequating (Report No 86-14). Princeton, NJ: ETS Research Report Series. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/j.2330-8516.1986.tb00169.x/epdf
Embretson, S. E., & Reise, S. P. (2009). Item response theory for psychologists. New York, NY:
Psychology Press.
Gao, R., He, W., & Ruan, C. (2012). Does preequating work? An investigation into a preequated
testlet-based college placement examination using post administration data (Report No 12-12). Princeton, NJ: ETS Research Report Series.
Hambleton, R., Swaminathan, H., & Rogers, H. (1991). Fundamentals of Item Response Theory.
Newbury Park, CA: Sage.
Hambleton, R., & Slater, S. (1997). Item response theory models and testing practices: Current international status and future directions. European Journal of Psychological Assessment, 13(1), 21-28. doi: 10.1027/1015-5759.13.1.21
Hanson, B.A., Zheng, L., & Cui, Z. (2004a). PIE: A computer program for IRT equating
[computer program]. Iowa City, IA: education.uiowa.edu/centers/casma
Hanson, B.A., Zheng, L., & Cui, Z. (2004b). ST: A computer program for IRT scale linking
[computer program]. Iowa City, IA: education.uiowa.edu/centers/casma
Han, T., Kolen, M., & Pohlmann, J. (1997). A comparison among IRT true- and observed-score
equatings and traditional equipercentile equating. Applied Measurement in Education, 10(2), 105-121. doi:10.1207/s15324818ame1002_1
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied
Psychological Measurement, 9(2), 139-164.
Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In Brennan, R. L. (Ed.),
Educational measurement (pp. 187-220). Westport, CT: Praeger.
Huynh, H. (1990). Computation and statistical inference for decision consistency indexes based
on the Rasch Model. Journal of Educational Statistics, 15(4), 353-368. doi:10.2307/1165093
Huynh, H., & Rawls, A. (2011). A comparison between robust z and 0.3-logit difference
procedures in assessing stability of linking items for the rasch model. Journal of Applied Measurement, 12(2), 96.
Karabatsos, G. (2017). Elements of psychometric theory lecture notes. Personal Collection of G.
Karabatsos, University of Illinois at Chicago, Chicago, Illinois.
Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating
tests. Journal of Educational Measurement, 18(1), 1-11.
Kolen, M. J., & Brennan, R. L. (2014). Test equating scaling and linking (3rd ed). New York,
NY: Springer.
Kolen, M. J., & Harris, D. J. (1990). Comparison of item preequating and random groups
equating using IRT and equipercentile methods. Journal of Educational Measurement, 27(1), pp. 27-30.
Lee, W. (2010). Classification consistency and accuracy for complex assessments using item
response theory. Journal of Educational Measurement, 47(1), pp. 1-17.
Li, D., Jiang, Y., & von Davier, A. A. (2012). The accuracy and consistency of a series of IRT
true score equatings. Journal of Educational Measurement, 49(2), 167-189. doi: 10.1111/j.1745-3984.2012.00167.x
Lin, C.-J. (2008). Comparisons between Classical Test Theory and Item Response Theory in
Automated Assembly of Parallel Test Forms. Journal of Technology, Learning, and Assessment, 6(8).
Linacre, J. M. (2017). Winsteps® Rasch measurement [computer program]. Beaverton, OR:
Winsteps.com.
Linacre, J. M. (2017). Fit diagnosis: infit outfit mean-square standardized. Retrieved from
http://www.winsteps.com/winman/misfitdiagnosis.htm
Livingston, S. L. (2004). Equating test scores (without IRT). Princeton, NJ: ETS. Retrieved
from: https://www.ets.org/Media/Research/pdf/LIVINGSTON.pdf
Lord, F. M. (1977). Practical applications of item characteristic curve theory. Journal of
Educational Measurement, 14(2), 117-138. doi:10.1111/j.1745-3984.1977.tb00032.x
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley Publishing Company.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile
observed-score "equatings." Applied Psychological Measurement, 8(4), 453-461. doi: 10.1177/014662168400800409
Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied
Psychological Measurement, 22(3), 224-236. doi: 10.1177/01466216980223003
Mead, R. (2008). A Rasch primer: The measurement theory of Georg Rasch. Psychometrics
services research memorandum 2008–001. Maple Grove, MN: Data Recognition Corporation.
O'Neill, T., Peabody, M., Tan, R. J. B., & Du, Y. (2013). How much item drift is too much?
Rasch Measurement Transactions, 27(3), 1423-1424. Retrieved from: https://www.rasch.org/rmt/rmt273a.htm
Penfield, R. D. (2005). Unique properties of Rasch model item information functions. Journal of
Applied Measurement, 6(4), 355-365.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen,
Denmark: Danish Institute for Educational Research. Chicago: The University of Chicago Press.
RStudio Team. (2016). RStudio: Integrated development for R [computer program]. Boston, MA:
RStudio, Inc. Retrieved from http://www.rstudio.com/
Smith, E. V. (2004). Evidence of the reliability of measures and validity of measure
interpretation: A Rasch measurement perspective. In Smith, E. V., & Smith, R. M. (Eds.),
Introduction to Rasch measurement: Theory, models and applications. Maple Grove, MN: JAM Press.
Smith, E. V. (2005). Effect of item redundancy on Rasch item and person estimates. Journal of
Applied Measurement, 6(2), 147-163.
Smith, E. V., Stearus, M., Sorenson, B., Huynh, H., & McNaughton, T. (2018). Rasch DC
[computer program]. Unpublished, Department of Educational Psychology, University of
Illinois at Chicago, Chicago, IL.
Smith, R. M. (1991). The distributional properties of Rasch item fit statistics. Educational
and Psychological Measurement, 51, 541-565.
Smith, R. M. (2003). Rasch measurement models: Interpreting Winsteps and FACETS output.
Maple Grove, MN: JAM Press.
Smith, R. M., & Kramer, G. (1992). A comparison of two methods of test equating in the Rasch
model. Educational and Psychological Measurement, 52, 835-846.
Stephens, M. A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of
the American Statistical Association, 69, 730-737.
Stocking, M. (1991, April). An experiment in the application of an automated item selection
method to real data. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Boston, MA.
Subkoviak, M. J. (1985). Tables of reliability coefficients for mastery tests. Paper presented at
the Annual Meeting of the American Educational Research Association, Chicago, IL.
Taylor, C. S., & Lee, Y. (2010). Stability of Rasch scales over time. Applied Measurement in
Education, 23(1), 87-113. doi: 10.1080/08957340903423701
Tong, Y., & Kolen, M. J. (2005). Assessing equating results on different equating criteria.
Applied Psychological Measurement, 29(6), 418-432. doi: 10.1177/0146621606280071
Tong, Y., Wu, S.-S., & Xu, M. (2008, March). A comparison of preequating and post-equating
using large-scale assessment data. Paper presented at the Annual Conference of the American Educational Research Association, New York City, NY.
van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22(3), 195-211. doi: 10.1177/01466216980223001
van der Linden, W. J. (2005). Linear models for optimal test design. New York, NY: Springer.
doi:10.1007/0-387-29054-0
van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory.
New York, NY: Springer.
von Davier, A. A. (2010). Equating and scaling. In Peterson, P., Baker, E., & McGaw, B. (Eds.), International encyclopedia of education (pp. 50-55). Amsterdam: Academic Press.
von Davier, A. A., & Wilson, C. (2007). IRT true score test equating: A guide through
assumptions and applications. Educational and Psychological Measurement, 67(6), 940-957. doi: 10.1177/0013164407301543
Wightman, L. F. (1998). Practical issues in computerized test assembly. Applied Psychological
Measurement, 22(3), 292-302. doi: 10.1177/01466216980223009
Wilson, M. (1988). Detecting and interpreting local item dependence using a family of Rasch
models. Applied Psychological Measurement, 12, 353-364.
Wright, B. D. (1977). Solving measurement problems with the Rasch Model. Journal of
Educational Measurement, 14(2), 97-115.
Wright, B. D., & Mok, M. C. (2004). An overview of the family of Rasch measurement
models. In Smith, E. V., & Smith, R. M. (Eds.), Introduction to Rasch measurement: Theory, models and applications. Maple Grove, MN: JAM Press.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit statistics. Rasch
Measurement Transactions, 8(3), 370. Retrieved from: https://www.rasch.org/rmt/rmt83b.htm
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In Brennan, R. L. (Ed.),
Educational measurement (4th ed.). Westport, CT: Praeger.
Yi, H. S., Kim, S., & Brennan, R. L. (2007). A method for estimating classification consistency
indices for two equated forms. Applied Psychological Measurement, 32(4), 275-291.
APPENDIX B
TABLE XVII
COMPLETE RESULTS FROM DECISION CONSISTENCY ANALYSES
Form | Equating            | Scale Linking |       | Error |       | Error
     | Baseline Comparison |               | 0.949 | 0.006 | 0.682 | 0.033
     | OS                  | MM            | 0.958 | 0.007 | 0.764 | 0.035
     | OS                  | MS            | 0.955 | 0.007 | 0.736 | 0.035
     | OS                  | SL            | 0.948 | 0.008 | 0.743 | 0.035
     | TS                  | MM            | 0.958 | 0.007 | 0.763 | 0.035
     | TS                  | MS            | 0.955 | 0.007 | 0.734 | 0.035
     | TS                  | SL            | 0.948 | 0.008 | 0.745 | 0.035
     | PE                  | MM            | 0.971 | 0.007 | 0.633 | 0.063
     | PE                  | MS            | 0.971 | 0.007 | 0.641 | 0.061
     | PE                  | SL            | 0.972 | 0.007 | 0.646 | 0.061
     | Baseline Comparison |               | 0.958 | 0.006 | 0.657 | 0.040
     | OS                  | MM            | 0.958 | 0.007 | 0.649 | 0.049
     | OS                  | MS            | 0.967 | 0.006 | 0.714 | 0.048
     | OS                  | SL            | 0.963 | 0.007 | 0.651 | 0.052
     | TS                  | MM            | 0.958 | 0.007 | 0.652 | 0.049
     | TS                  | MS            | 0.966 | 0.006 | 0.712 | 0.047
     | TS                  | SL            | 0.963 | 0.007 | 0.651 | 0.052
     | PE                  | MM            | 0.971 | 0.006 | 0.697 | 0.053
     | PE                  | MS            | 0.969 | 0.006 | 0.697 | 0.052
     | PE                  | SL            | 0.970 | 0.006 | 0.699 | 0.052
     | Baseline Comparison |               | 0.947 | 0.006 | 0.647 | 0.034
     | OS                  | MM            | 0.960 | 0.006 | 0.617 | 0.044
     | OS                  | MS            | 0.954 | 0.007 | 0.631 | 0.040
     | OS                  | SL            | 0.965 | 0.006 | 0.587 | 0.050
     | TS                  | MM            | 0.960 | 0.006 | 0.612 | 0.044
     | TS                  | MS            | 0.954 | 0.007 | 0.623 | 0.040
     | TS                  | SL            | 0.965 | 0.006 | 0.580 | 0.051
     | PE                  | MM            | 0.959 | 0.006 | 0.699 | 0.037
     | PE                  | MS            | 0.958 | 0.006 | 0.709 | 0.036
     | PE                  | SL            | 0.958 | 0.006 | 0.701 | 0.037
     | Baseline Comparison |               | 0.935 | 0.009 | 0.719 | 0.034
     | OS                  | MM            | 0.955 | 0.007 | 0.791 | 0.029
     | OS                  | MS            | 0.955 | 0.007 | 0.793 | 0.029
     | OS                  | SL            | 0.950 | 0.007 | 0.802 | 0.026
     | TS                  | MM            | 0.956 | 0.007 | 0.803 | 0.028
     | TS                  | MS            | 0.958 | 0.007 | 0.814 | 0.027
     | TS                  | SL            | 0.952 | 0.007 | 0.818 | 0.025
     | PE                  | MM            | 0.963 | 0.007 | 0.678 | 0.044
     | PE                  | MS            | 0.963 | 0.007 | 0.697 | 0.042
     | PE                  | SL            | 0.963 | 0.007 | 0.686 | 0.044
Note. Boldface signifies the maximum value. MM represents the mean/mean scale linking procedure, MS represents the mean/sigma scale linking procedure, and SL represents the Stocking and Lord TCC scale linking procedure. OS represents IRT observed score equating, PE represents IRT preequating, and TS represents IRT true score equating.
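For readers who wish to see how indices of this kind can be computed, the following is a minimal sketch in Python, assuming the two unlabeled index columns report a raw agreement proportion and a chance-corrected (kappa-type) consistency index with their standard errors. It is an illustration only, not the Rasch DC program (Smith et al., 2018) or the Huynh (1990) Rasch-based procedure used in the study; the function name and the counts are hypothetical.

# Minimal sketch: decision consistency for a 2 x 2 pass/fail classification
# table from two equated forms (hypothetical data and helper, for illustration).
import math

def decision_consistency(table):
    """table[i][j] = count of examinees classified i on form X and j on form Y
    (0 = fail, 1 = pass)."""
    n = sum(sum(row) for row in table)
    # Proportion of examinees classified the same way on both forms.
    p_obs = (table[0][0] + table[1][1]) / n
    # Agreement expected by chance, from the marginal fail/pass rates.
    row = [sum(table[i]) / n for i in range(2)]
    col = [sum(table[i][j] for i in range(2)) / n for j in range(2)]
    p_chance = row[0] * col[0] + row[1] * col[1]
    kappa = (p_obs - p_chance) / (1 - p_chance)
    # Binomial SE for the agreement proportion; simple approximation for kappa.
    se_p = math.sqrt(p_obs * (1 - p_obs) / n)
    se_kappa = se_p / (1 - p_chance)
    return {"agreement": p_obs, "se_agreement": se_p,
            "kappa": kappa, "se_kappa": se_kappa}

# Hypothetical counts: rows = fail/pass on form X, columns = fail/pass on form Y.
counts = [[120, 15],
          [10, 855]]
print(decision_consistency(counts))

In this sketch the agreement proportion plays the role of the first index column and the kappa-type index the role of the second, which is why the latter is typically smaller: it removes the agreement expected by chance from the raw proportion.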