Journal of Applied Measurement

P.O. Box 1283

Maple Grove, MN 55311

 

Volume 9, 2008 Article Abstracts

 

Vol. 9, No. 1 Spring 2008

Strategies for Controlling Item Exposure in Computerized Adaptive Testing with the Partial Credit Model

Laurie Laughlin Davis and Barbara G. Dodd

Abstract

Exposure control research with polytomous item pools has determined that randomization procedures can be very effective for controlling test security in computerized adaptive testing (CAT). The current study investigated the performance of four procedures for controlling item exposure in a CAT under the partial credit model. In addition to a no-exposure-control baseline condition, the Kingsbury-Zara, modified-within-.10-logits, Sympson-Hetter, and conditional Sympson-Hetter procedures were implemented to control exposure rates. The Kingsbury-Zara and the modified-within-.10-logits procedures were implemented with 3 and 6 item candidate conditions. The results show that the Kingsbury-Zara and modified-within-.10-logits procedures with 6 item candidates performed as well as the conditional Sympson-Hetter in terms of exposure rates, overlap rates, and pool utilization. These two procedures are strongly recommended for use with partial credit CATs due to their simplicity and the strength of their results.

****

A Multidimensional Rasch Analysis of Gender Differences in PISA Mathematics

Ou Lydia Liu, Mark Wilson, and Insu Paek

Abstract

Since the 1970s, much attention has been devoted to the male advantage on standardized mathematics tests in the United States. Although girls are found to perform as well as boys in math classes, they are consistently outperformed on standardized math tests. This study compared 15-year-old males and females in the United States on their performance on the PISA 2003 mathematics assessment. A multidimensional Rasch model was used for item calibration and ability estimation on the basis of four math domains: Space and Shape, Change and Relationships, Quantity, and Uncertainty. Results showed that the effect sizes of the performance differences are small, all below .20, but consistently favor boys. Space and Shape displayed the largest gender gap, which supports the findings of many previous studies. Quantity showed the least gender difference, which may be explained by the hypothesis that girls perform better on tasks they are familiar with through classroom practice.

****

An Exploration of Correctional Staff Members’ Views of Inmate Amenities: A Scaling Approach

Elizabeth Ehrhardt Mustaine, George E. Higgins, and Richard Tewksbury

Abstract

Today, the number of prisons and the prison population are rising. One of the key challenges accompanying these changes is how prisons and their staff can handle the increasing number of inmates. One of the issues involved is what products, goods, and services are deemed suitable for inmates. Research has addressed this issue but has yielded no consensus; methodological differences across samples are central to this disjuncture in beliefs. Using responses from 554 correctional staff, the Rasch model was used to assess whether perceptions of inmate amenities are part of a larger dimension. Results suggest that twenty items accurately represent correctional staff perceptions of inmate amenities, with boxing being the most difficult amenity to support and books the easiest.

****

Measuring Job Satisfaction in the Social Services Sector with the Rasch Model

Eugenio Brentari and Silvia Golia

Abstract

In the present paper, the Rasch measurement model is used in the validation and analysis of data from the satisfaction section of the first national survey of the social services sector carried out in Italy. A comparison between two Rasch models for polytomous data, the Rating Scale Model (RSM) and the Partial Credit Model (PCM), is discussed. The two models provide similar estimates of item difficulties and worker satisfaction, the response probabilities computed using the RSM and the PCM are very close for almost all items, and analysis of bootstrap confidence intervals shows that the RSM estimates are more stable than those obtained using the PCM. It can therefore be concluded that, for the present data, the RSM is more appropriate than the PCM.

****

Comparing Screening Approaches to Investigate Stability of Common Items in Rasch Equating

Alvaro J. Arce-Ferrer

Abstract

This paper used real and simulated data sets to compare three screening approaches often used in state-wide equating programs utilizing the Rasch model: Wright and Stone’s t-statistic, the robust z-statistic, and the displace measure. Analyses of the real data sets supported the superiority of the robust z-statistic and the displace measure relative to Wright and Stone’s t-statistic. The simulation component did not support the contention that indiscriminate use of the ±0.3 logits criterion inflates Type I error rates for the robust z-statistic and the displace measure, although this contention was supported for Wright and Stone’s t-statistic. However, Type II error rates were largest for the displace measure, followed by the robust z-statistic, then the t-statistic. The paper discusses the importance of a priori selection of a criterion for screening linking items and its effects on the stability and accuracy of the Rasch equating constant.

****

Estimation of the Accessibility of Items and the Confidence of Candidates: A Rasch-Based Approach

A. A. Korabinski, M. A. Youngson, and M. McAlpine

Abstract

In Scottish High School mathematics examinations partial credit is normally awarded for answers which are not totally correct but nevertheless contain some of the correct working. As a way of incorporating partial credit in the marking of ICT versions of these examinations, “steps” have been introduced. The use of “steps” also allows for a Rasch analysis that measures the inaccessibility of items and the confidence of candidates in addition to the traditional difficulty of items and ability of candidates. Two Rasch models can be fitted and jointly assessed for fit. The resulting measures can then be investigated for any relationship between ability and confidence and between difficulty and inaccessibility. A small data set has been used to illustrate these ideas.

****

Binary Items and Beyond: A Simulation of Computer Adaptive Testing Using the Rasch Partial Credit Model

Rense Lange

Abstract

Past research on Computer Adaptive Testing (CAT) has focused almost exclusively on the use of binary items and on minimizing the number of items to be administered. To address this situation, extensive computer simulations were performed using partial credit items with two, three, four, and five response categories. Other variables manipulated included the number of available items, the number of respondents used to calibrate the items, and various manipulations of respondents’ true locations. Three item selection strategies were compared: the theoretically optimal Maximum Information method, random item selection, and a Bayesian Maximum Falsification approach. The Rasch partial credit model proved to be quite robust to various imperfections, and systematic distortions occurred mainly in the absence of sufficient numbers of items located near the trait or performance levels of interest. The findings further indicate that having small numbers of items is more problematic in practice than having small numbers of respondents to calibrate these items. Most importantly, increasing the number of response categories consistently improved CAT’s efficiency as well as the general quality of the results. In fact, increasing the number of response categories proved to have a greater positive impact than did the choice of item selection method, as the Maximum Information approach performed only slightly better than the Maximum Falsification approach. Accordingly, issues related to the efficiency of item selection methods are far less important than is commonly suggested in the literature. However, being based on computer simulations only, the preceding presumes that actual respondents behave according to the Rasch model. CAT research could thus benefit from empirical studies aimed at determining whether, and if so how, selection strategies impact performance.

 

Vol. 9, No. 2 Summer 2008

Effects of Varying Magnitude and Patterns of Response Dependence in the Unidimensional Rasch Model

Ida Marais and David Andrich

Abstract

By adding items with responses identical to a selected item, Smith (2005) investigated the effect of response dependence on person and item parameter estimates in the dichotomous Rasch model. By varying the magnitude of response dependence among selected items, rather than their having perfect dependence, this paper provides additional insights into the effects of response dependence on the same estimates in the same model. Two sets of simulations are reported. In the first set, responses to all items except the first were dependent on either the first item or on the immediately preceding item; in the second set, subsets of items were formed first, and then within each of these subsets, responses to all items in a subset except the first were dependent on either the first item or on the immediately preceding item. The effects of dependence were noticeable in all of the statistics reported. In particular, the fit statistics and the parameter estimates showed increasing discrepancies from their theoretical values as a function of the magnitude of the dependence. In some cases, however, two related statistics gave the impression of improvement as a function of increased dependency: first, the standard deviation of person estimates showed an increase, and second, the index analogous to traditional reliability showed a relative increase. In addition to the estimates, and depending on the structure and magnitude of the dependence, the person distribution was affected systematically, ranging from becoming skewed to becoming bimodal. The effects on the distribution help explain some of the effects on the statistics reported. In the case of the second set of simulations, in which the dependence is within subsets of items, it is possible to take account of the response dependence. This is done by summing the responses of the items within each subset to form a polytomous item and then analyzing the data in terms of a smaller number of polytomous items. This way of accounting for dependence, in which the maximum score for the test as a whole remains the same, gives a more accurate value of the reliability and a more realistic distribution of the person estimates than when the dependence within subsets of items is not taken into account.
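The subset-summing step described here can be sketched as follows; the item names and response values below are illustrative, not taken from the study:

```python
# Sketch of forming polytomous items from response-dependent dichotomous items:
# responses within each subset are summed, so the maximum score for the test
# as a whole is unchanged while within-subset dependence is absorbed.
def sum_subsets(responses, subsets):
    """responses: dict mapping item name -> 0/1 score;
    subsets: list of item-name lists, one list per dependent subset.
    Returns one polytomous score per subset."""
    return [sum(responses[item] for item in subset) for subset in subsets]

resp = {"i1": 1, "i2": 1, "i3": 0, "i4": 0, "i5": 1, "i6": 1}
# Two subsets of three dichotomous items each become two polytomous items
# scored 0-3; six dichotomies (max 6) -> two polytomies (max 3 + 3 = 6).
print(sum_subsets(resp, [["i1", "i2", "i3"], ["i4", "i5", "i6"]]))  # [2, 2]
```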

****

Fisher’s Information Function and Rasch Measurement

Mark H. Stone

Abstract

Fisher’s information function is reviewed with respect to an example he used for explication. A contemporary example continues the discussion with an application to a rating scale instrument. The relationship of information to precision and measurement error is presented and discussed with respect to the analysis of fit. Targeting the instrument and the best test design for measuring a person, with respect to information and item-person fit, are discussed. The idealization of information and precision for making measures appears most effectively realized when computer-assisted testing can be employed to implement a best test design.

****

A Rasch Analysis for Classification of Systemic Lupus Erythematosus and Mixed Connective Tissue Disease

Kyle Perkins, Robert W. Hoffman, and Nikolaus Bezruczko

Abstract

The classification of rheumatic diseases is challenging because these diseases have protean and frequently overlapping clinical and laboratory manifestations. This problem is typified by the difficulty of classification and differentiation of two prototypic multi-system autoimmune diseases, Systemic Lupus Erythematosus (SLE) and Mixed Connective Tissue Disease (MCTD). The researchers submitted medical risk factor data represented by instrument or laboratory measures and physician judgments (12 key features for SLE) from 43 patients diagnosed with SLE and 12 key features for MCTD from 51 patients diagnosed with MCTD to the WINSTEPS Rasch analysis program. Using Rasch model parameterization, and fit and residuals analyses, the researchers identified separate dimensions for MCTD and SLE, thereby lending support to the position that MCTD is its own separate disease, distinct from SLE.

****

Magnitude Estimation and Categorical Rating Scaling in Social Sciences: A Theoretical and Psychometric Controversy

Svetlana A. Beltyukova, Gregory E. Stone, and Christine M. Fox

Abstract

This article revisits a half-century-long theoretical controversy associated with the use of magnitude estimation scaling (MES) and category rating scaling (CRS) procedures in measurement. The MES procedure in this study involved instructing participants to write a number that matched their impression of the difficulty of a test item. Participants were not restricted in the range of numbers they could choose for their scale. They also had the choice of disclosing their individual scale. After the MES task was completed, participants were given a blank copy of the test to rate the perceived difficulty of each item using a researcher-imposed categorical rating scale (CRS) from 1 (very easy) to 6 (very difficult). The MES and CRS data were both analyzed using the Rasch rating scale model. Additionally, the MES data were examined with the Rasch partial credit model. Results indicate that knowing each person’s scale is associated with smaller errors of measurement.

****

Impact of Altering Randomization Intervals on Precision of Measurement and Item Exposure

Timothy Muckle, Betty Bergstrom, Kirk Becker, and John Stahl

Abstract

This article reports on the use of simulation when a randomization procedure is used to control item exposure in a computerized adaptive test for certification. We present a method to determine the optimum width of the interval from which items are selected, and we report on the impact of relaxing the interval width on measurement precision and item exposure. Results indicate that, if the item bank is well targeted, it may be possible to widen the randomization interval, and thus reduce item exposure, without seriously impacting the measurement error for test takers whose ability estimate is near the pass point.

****

Rasch Measurement in Developing Faculty Ratings of Students Applying to Graduate School

Sooyeon Kim and Patrick C. Kyllonen

Abstract

The Standardized Letter of Recommendation (SLR), a 28-item form, was created by ETS to supplement the qualitative rating of graduate school applicants’ nonacademic qualities with a quantitative approach. The purpose of this study was to evaluate the following psychometric properties of the SLR using the Rasch rating scale model: dimensionality, reliability, item quality, and rating category effectiveness. Principal component and factor analyses were also conducted to examine the dimensionality of the SLR. Results revealed (a) two secondary factors underlay the data, along with a strong higher order factor, (b) item and person separation reliabilities were high, (c) noncognitive items tended to elicit higher endorsements than did cognitive items, and (d) a 5-point Likert scale functioned effectively. The psychometric properties of the SLR support the use of a composite score when reporting SLR scores and the utility of the SLR in higher education and in admissions.

****

Understanding Rasch Measurement: Using Rasch Scaled Stage Scores to Validate Orders of Hierarchical Complexity of Balance Beam Task Sequences

Michael Lamport Commons, Eric Andrew Goodheart, Alexander Pekker, Theo Linda Dawson, Karen Draney, and Kathryn Marie Adams

Abstract

These studies examine the relationship between the analytic basis underlying the hierarchies produced by the Model of Hierarchical Complexity and the probabilistic Rasch scales that place both participants and problems along a single hierarchically ordered dimension. A Rasch analysis was performed on data from the balance-beam task series, yielding a scaled stage of performance for each of the items. The items formed a series of clusters along this same dimension, according to their order of hierarchical complexity. We sought to ascertain whether there was a significant relationship between the order of hierarchical complexity of the tasks (a task property variable) and the corresponding Rasch scaled difficulty of those same items (a performance variable). The Model of Hierarchical Complexity was found to be highly accurate in predicting the Rasch stage scores of the performed tasks, thereby providing an analytic and developmental basis for the Rasch scaled stages.

 

Vol. 9, No. 3 Fall 2008

Formalizing Dimension and Response Violations of Local Independence in the Unidimensional Rasch Model

Ida Marais and David Andrich

Abstract

Local independence in the Rasch model can be violated in two generic ways that are generally not distinguished clearly in the literature. In this paper we distinguish between a violation of unidimensionality, which we call trait dependence, and a specific violation of statistical independence, which we call response dependence, both of which violate local independence. Distinct algebraic formulations for trait and response dependence are developed as violations of the dichotomous Rasch model, data are simulated with varying degrees of dependence according to these formulations, and then analysed according to the Rasch model assuming no violations. Relative to the case of no violation, it is shown that trait and response dependence result in opposite effects on the unit of scale as manifested in the range and standard deviation of the scale and the standard deviation of person locations. In the case of trait dependence the scale is reduced; in the case of response dependence it is increased. Again, relative to the case of no violation, the two violations also have opposite effects on the person separation index (analogous to Cronbach’s Alpha reliability index of traditional test theory in value and construction): it decreases for data with trait dependence; it increases for data with response dependence. A standard way of accounting for dependence is to combine the dependent items into a higher-order polytomous item. This typically results in a decreased person separation index and Cronbach’s Alpha, compared with analysing items as discrete, independent items. This occurs irrespective of the kind of dependence in the data, and so further contributes to the two violations not being distinguished clearly. In an attempt to begin to distinguish between them statistically, this paper articulates the opposite effects of these two violations in the dichotomous Rasch model.

****

Calibration of Multiple-Choice Questionnaires to Assess Quantitative Indicators

Paola Annoni and Pieralda Ferrari

Abstract

The joint use of two latent factor methods is proposed to assess a measurement instrument for an underlying phenomenon. For this purpose, Rasch analysis is initially used to properly calibrate questionnaires, discarding non-informative variables and redundant categories. As a second step, an optimal scaling technique, Nonlinear PCA, is applied to quantify variable categories and to compute a continuous indicator. Specifically, the paper deals with the state of decay of Italian buildings of great architectural and historical interest, which serve as a case study. The decay level of the buildings is quantified on the basis of a broad set of observed ordinal variables, and the final indicator may be used independently for buildings inventoried in the future. Overall, the similarities and the distinct potential of the techniques are analyzed and discussed with the purpose of exploring the synergic effect of their combined use.

****

The Impact of Data Collection Design, Linking Method, and Sample Size on Vertical Scaling Using the Rasch Model

Insu Paek, Michael J. Young, and Qing Yi

Abstract

Rasch model-based vertical scaling was evaluated in a simulation study with respect to recovery of item parameters, the linking constant, population means (grade-to-grade growth), and population standard deviations (grade-to-grade variability), and separation of grade distributions by effect size. The simulated vertical scale had five different grades with five different test levels. Controlled factors were data collection design, linking method, and sample size. For item parameters, the linking constant, and population means, the counter-balanced single group (CBSG) design with the mean/mean (or fixed item) method and concurrent calibration performed best. Recovery of the population standard deviation did not show systematic improvement across data collection designs and linking methods as sample size increased. For the separation of grade distributions, CBSG with the mean/mean (or fixed item) methods performed best. The average absolute differences from the true parameters were less than 0.1 logits across the different linking methods. In general, the differences between linking methods were smaller than those between sample sizes.

****

Understanding the Unit in the Rasch Model

Stephen M. Humphry and David Andrich

Abstract

The purpose of this paper is to explain the role of the unit implicit in the dichotomous Rasch model in determining the multiplicative factor of separation between measurements in a specified frame of reference. The explanation is provided at two complementary levels: first, in terms of the algebra of the model in which the role of an implicit, multiplicative constant is made explicit; and second, at a more fundamental level, in terms of the classical definition of measurement in the physical sciences. The Rasch model is characterized by statistical sufficiency, which arises from the requirement of invariant comparisons within a specified frame of reference. A frame of reference is defined by a class of persons responding to a class of items in a well-defined response context. The paper shows that two or more frames of reference may have different implicit units without destroying sufficiency. Understanding the role of the unit permits explication of the relationship between the Rasch model and the two parameter logistic model. The paper also summarises an approach that can be used in practice to express measurements across different frames of reference in the same unit.
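As a sketch of the algebra the abstract refers to (symbols assumed, not quoted from the paper: β for person location, δ for item location, α for the implicit unit of the frame of reference), the dichotomous Rasch model with its unit made explicit can be written as:

```latex
\Pr\{X = 1 \mid \beta, \delta\}
  = \frac{\exp\!\bigl(\alpha(\beta - \delta)\bigr)}
         {1 + \exp\!\bigl(\alpha(\beta - \delta)\bigr)}
```

Setting α = 1 absorbs the unit into the metric and recovers the familiar form; because α is constant within a frame of reference, the sufficiency of person and item total scores is preserved even when two frames of reference carry different implicit units.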

****

Factor Structure of the Developmental Behavior Checklist using Confirmatory Factor Analysis of Polytomous Items

Daniel E. Bontempo, Scott. M. Hofer, Andrew Mackinnon, Andrea M. Piccinin, Kylie Gray, Bruce Tonge, and Stewart Einfeld

Abstract

The Developmental Behavior Checklist (DBC; Einfeld and Tonge, 1995) is a 95-item clinical screening checklist designed to assess the extent of behavioral and emotional disturbance in populations with intellectual deficit (ID). The DBC provides five principal-component-derived subscales covering clinically relevant dimensions of psychopathology (i.e., Disruptive, Self-Absorbed, Communication Disturbance, Anxiety, and Social Relating). Validating these subscales for individual differences research requires examination of the stability of this structure. This study begins a program of psychometric study of the DBC by utilizing item-level data to investigate the DBC’s subscale structure with regard to simple-structure restrictions, as well as the implications of factorially complex items for inter-subscale correlations. To accomplish these goals, a polytomous confirmatory factor analysis (PCFA) of the DBC was performed, and the pattern of loadings and inter-factor correlations was examined with and without simple-structure restrictions. Our findings provide evidence that the two largest subscales (Disruptive/Antisocial, Self-Absorbed) are well behaved in PCFA models and should exhibit little bias under unit-weighted scoring procedures or in latent factor models. Findings for the three smaller subscales (Communication Disturbance, Social Relating, and Anxiety) do not invalidate their use in individual differences research, but they do highlight several issues that individual differences researchers should consider.

****

Overcoming Vertical Equating Complications in the Calibration of an Integer Ability Scale for Measuring Outcomes of a Teaching Experiment

Andreas Koukkoufis and Julian Williams

Abstract

This paper addresses the measurement complexities that emerge from vertical equating in an educational experiment aiming at an advance in the curriculum, in calibrating an ‘integer ability’ scale for year 5 students from Greater Manchester based on both primary (years 5 and 6) and high school (years 7 and 8) data. The need for such a calibration resulted from experimental teaching of ‘high school content’ in primary school. Substantial Rasch differential item functioning (DIF) arose in the vertical equating between primary and high school in our initial ‘all-on-all’ concurrent calibration. A second ‘primary anchored-and-extended’ calibration, which substantially overcame the DIF problems, is shown to be preferable for our teaching experiment. The relevant methodological challenges and the techniques adopted are discussed. The solution provided might be useful to researchers conducting educational experiments that target an advance in the curriculum.

****

Estimation of Decision Consistency Indices for Complex Assessments: Model Based Approaches

Matthew Stearns and Richard M. Smith

Abstract

With the implementation of the No Child Left Behind assessment program and the use of proficiency levels as a means of evaluating Adequate Yearly Progress, there is renewed interest in the consistency of classification decisions based on scale scores from achievement tests and state-wide proficiency standards. Many of the current methods described in the literature (Huynh, 1976; Hanson and Brennan, 1990; Livingston and Lewis, 1995) are based on assumptions about the distribution of the conditional errors. Although recent methods (Brennan and Wan, 2004) make no assumptions about the distribution, these methods have one compelling disadvantage: the decision consistency calculated is based on the entire set of data and is not conditional on the location of the cut scores, the student measure, and the conditional standard error of measurement for the student. The decision consistency for a student scoring right at the cut score will be much lower than that for a student with a score 5 points above or below the cut score. The standard error method described in this article is based solely on the asymptotic standard error of measurement derived from the appropriate Rasch measurement model and the location of the cut score used to make the classification decision. This method can easily be modified to accommodate multiple classification categories. It yields a conditional decision consistency statistic that can be applied to each person ability estimate (raw score) and provides information that can be used to calculate the likelihood that a person with that measure will receive the same classification if retested. The decision consistency for the entire sample can be calculated by simply summing the likelihood of the same classification over all of the examinees.
The results of retest simulations using data that fit the Rasch model suggest that the standard error method provides a better estimate of the resulting classification consistency than the true score methods or the bootstrap method.
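The conditional computation described above can be sketched as follows, assuming normally distributed measurement error around the person measure; the function name and the numbers are illustrative, not from the article:

```python
import math

def consistency(measure, cut, sem):
    """Conditional decision consistency: the probability that a person with
    the given measure and standard error of measurement (SEM) receives the
    same pass/fail classification on two independent administrations,
    assuming normally distributed measurement error."""
    # Probability of an observed score at or above the cut on one administration
    z = (cut - measure) / sem
    p_pass = 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # Same classification twice: pass both times or fail both times
    return p_pass ** 2 + (1.0 - p_pass) ** 2

# A person right at the cut has the lowest possible consistency (0.5);
# a person far from the cut approaches 1.0.
print(consistency(0.0, 0.0, 0.3))  # 0.5
print(consistency(1.5, 0.0, 0.3))  # close to 1.0
```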

 

Vol. 9, No. 4 Winter 2008

The Differential Impact of Resolution Methods on the Operational Scores of Gender and Ethnic Groups

Shiqi Hao, Robert L. Johnson, and James Penny

Abstract

In the scoring of performance assessments, when two raters assign different ratings, some method must be used to resolve the discrepant ratings to form an operational score for reporting. This study investigated the differential impact of various resolution methods on the operational scores of gender and ethnic groups. The mean operational scores and the passing rates for each group on two essay prompts were compared using three resolution methods (rater mean, tertium quid, and parity). The results indicated that for female and African American students, resolution typically resulted in greater reduction of mean operational ratings and passing rates than for their male or White counterparts. Differential item functioning (DIF) analyses were conducted using IRT-based logistic regression models. No apparent gender-related DIF was detected. Although uniform DIF was found for the ethnic groups, the effect was small, and there was not enough evidence to support the hypothesis that DIF could be associated with a resolution method.

****

A Rasch Measurement Analysis of the Use of Cohesive Devices in Writing English as a Foreign Language by Secondary Students in Hong Kong

Margaret Lai Fun Ho and Russell F. Waugh

Abstract

This paper investigated the use of three types of cohesive devices (reference, conjunction, and lexis) in English as a Foreign Language (EFL) essays written by students in secondary years 2, 4, and 6 in Hong Kong. Fifty students from each of the three forms (N = 150) provided narrative and descriptive essays for analysis, which were marked by two competent English teachers who counted the frequency of writing devices used per 100 words. Initially, 14 cohesive devices (items) were counted for analysis, but two devices (items) were deleted as not fitting a Rasch measurement model. The RUMM2020 computer program with the partial credit model was used to create a linear scale of Writing Devices Used with twelve items: two for reference, four for conjunction, two for lexis, three for cohesive ties, and one for quality. There was good overall fit to the measurement model (item-trait chi-square = 56.81, df = 48, p = 0.18), but the Person Separation Index was very low at 0.08, mainly due to the small range of essay quality in comparison to the difficulties of the writing devices (items). The three easiest writing devices were remote cohesive ties, immediate cohesive ties, and mediate cohesive ties. The three hardest were temporal conjunctions, causal conjunctions, and adversative conjunctions.

****

Linking Classical Test Theory and Two-level Hierarchical Linear Models

Yasuo Miyazaki and Gary Skaggs

Abstract

This paper considers the link between classical test theory (CTT) and two-level hierarchical linear models (HLM). Conceptualizing items as nested within subjects, we can reformulate the ANOVA classical test model as an HLM. In this HLM framework, item difficulty parameters are represented by the fixed effects, and subjects’ abilities are represented by the random effects. The population reliability of either the total or the mean score can be represented as a function of the random effects parameters and the number of items. For estimation, taking advantage of the balanced design of CTT, we can obtain explicit formulas for parameter estimates of both fixed and random effects in HLM. It turns out that the formula and the estimate derived from HLM exactly match those of CTT reliability, which are equivalent to Cronbach’s coefficient alpha under the assumption of essentially tau-equivalent measures. Moreover, we can obtain most of the important quantities in CTT, such as estimates of item difficulty, the standard error of measurement, true scores, and person abilities, in a single HLM model. Thus, the CTT model formulated in the HLM framework provides a systematic approach to measurement analysis with CTT. For illustrative purposes, a small data set was analyzed using the HLM software (Raudenbush, Bryk, Cheong, and Congdon, 2000). The results confirmed the theoretical link between CTT and HLM.
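The claimed equivalence can be illustrated numerically. The sketch below uses Hoyt's ANOVA formulation as a stand-in for the HLM variance-component estimates under the balanced design (persons random, items fixed); the data matrix is made up for illustration:

```python
def cronbach_alpha(data):
    """Classic coefficient alpha: k/(k-1) * (1 - sum(item vars) / var(total))."""
    n, k = len(data), len(data[0])
    def var(xs):  # sample variance with n-1 denominator
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[j] for row in data]) for j in range(k)]
    total_var = var([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def hoyt_alpha(data):
    """ANOVA/variance-component route: alpha = (MS_persons - MS_resid) / MS_persons."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    person_means = [sum(row) / k for row in data]
    item_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_person = k * sum((m - grand) ** 2 for m in person_means)
    ss_item = n * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_resid = ss_total - ss_person - ss_item
    ms_person = ss_person / (n - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    return (ms_person - ms_resid) / ms_person

# Five persons by four items (illustrative ratings)
data = [[3, 4, 3, 5], [2, 2, 3, 2], [4, 5, 5, 4], [1, 2, 1, 2], [3, 3, 4, 4]]
print(abs(cronbach_alpha(data) - hoyt_alpha(data)) < 1e-9)  # True
```

The two routes agree exactly (up to floating-point error) for any balanced data matrix, which is the algebraic identity the abstract refers to.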

****

Measuring Self-Complexity: A Critical Analysis of Linville’s H Statistic

Wenshu Luo, David Watkins, and Raymond Y. H. Lam

Abstract

The paper argues that the most commonly used measure of self-complexity, Linville’s H statistic, cannot measure this construct appropriately. It first examines the mathematical properties of H and its relationships with five related indices: the number of self-aspects, the overlap among self-aspects, the average inter-aspect correlation, the ratio of endorsement, and the HICLAS attribute class number. Then, a demonstration study using simulations is reported. Three conclusions are drawn. First, H and the HICLAS attribute class number are similar in the way they are calculated. Second, both indices are highly related to the number of self-aspects, while their relationship to overlap is not monotonic. Third, overlap is affected by the ratio of endorsement and the average inter-aspect correlation but cannot represent the notion of redundancy among traits, which directly determines Linville’s H statistic. These conclusions are employed to explain the inconsistent findings relating self-complexity to adaptation, and an alternative measurement approach is proposed.
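For reference, Linville's H is Scott's information statistic applied to the traits-by-self-aspects sort: traits are grouped by their pattern of membership across self-aspects, and H = log2(n) − (1/n)·Σ nᵢ·log2(nᵢ) over the pattern frequencies nᵢ. A minimal sketch with illustrative data:

```python
import math
from collections import Counter

def linville_h(matrix):
    """Linville's H from a traits-by-self-aspects binary matrix (one row per
    trait, one column per self-aspect). Traits are grouped by their membership
    pattern; H = log2(n) - (1/n) * sum(n_i * log2(n_i)) over pattern counts."""
    n = len(matrix)  # total number of traits
    counts = Counter(tuple(row) for row in matrix)
    return math.log2(n) - sum(c * math.log2(c) for c in counts.values()) / n

# Fully redundant trait patterns (all traits sorted identically): H = 0
print(linville_h([(1, 1), (1, 1), (1, 1), (1, 1)]))  # 0.0
# Four traits, each with a unique membership pattern: H = log2(4) = 2
print(linville_h([(1, 0), (0, 1), (1, 1), (0, 0)]))  # 2.0
```

The sketch makes the paper's point concrete: H depends only on the redundancy of membership patterns, not directly on the overlap or correlation indices it is often interpreted through.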

****

Measuring Cynicism Toward Organizational Change — One Dimension or Two?

Simon Albrecht

Abstract

Wanous, Reichers, and Austin’s (2000) measure of cynicism about organizational change (CAOC) was subjected to validation and cross-validation procedures using data collected from two Australian public sector organizations. More specifically, analyses were conducted to determine whether CAOC is best understood as a one-dimensional or a two-dimensional construct. The results of confirmatory factor analysis suggest that neither a one-dimensional nor a two-dimensional measurement model provided a satisfactory fit to the data. However, removal of two of the original eight items resulted in a two-dimensional model, with three items in each dimension, that provided a reasonable fit to the data. Practical implications and directions for future research are discussed.

****

Assessment of Differential Item Functioning

Wen-Chung Wang

Abstract

This study addresses several important issues in the assessment of differential item functioning (DIF). It starts with the definition of DIF, the effectiveness of using item fit statistics to detect DIF, and linear modeling of DIF in dichotomous items, polytomous items, facets, and testlet-based items. Because a common metric over groups of test-takers is a prerequisite for DIF assessment, this study reviews three methods of establishing a common metric: the equal-mean-difficulty method, the all-other-item method, and the constant-item (CI) method. A small simulation demonstrates the superiority of the CI method over the others. As the CI method relies on a correct specification of DIF-free items to serve as anchors, a method of identifying such items is recommended and its effectiveness is illustrated through a simulation. Finally, this study discusses how to assess the practical significance of DIF at both the item and test levels.
