Journal of Applied Measurement

P.O. Box 1283

Maple Grove, MN 55311

**Volume 9, 2008 Article Abstracts**

**Vol. 9, No. 1 Spring 2008**

Strategies for Controlling Item
Exposure in Computerized Adaptive Testing with the Partial Credit Model

*Laurie Laughlin Davis and Barbara G. Dodd*

Abstract

Exposure control research with polytomous item pools has determined that randomization procedures can be
very effective for controlling test security in computerized adaptive testing (CAT). The current study investigated
the performance of four procedures for controlling item exposure in a CAT under the partial credit
model. In addition to a no exposure control baseline condition, the Kingsbury-Zara, modified-within-.10-logits,
Sympson-Hetter, and conditional Sympson-Hetter procedures were implemented to control exposure rates. The
Kingsbury-Zara and the modified-within-.10-logits procedures were implemented with 3 and 6 item candidate
conditions. The results show that the Kingsbury-Zara and modified-within-.10-logits procedures with 6 item
candidates performed as well as the conditional Sympson-Hetter in terms of exposure rates, overlap rates, and
pool utilization. These two procedures are strongly recommended for use with partial credit CATs due to their
simplicity and the strength of their results.

****

A Multidimensional Rasch Analysis
of Gender Differences in PISA Mathematics

*Ou Lydia Liu, Mark Wilson, and Insu Paek*

Abstract

Since the 1970s, much attention has been devoted to the male advantage in standardized mathematics tests in
the United States. Although girls are found to perform as well as boys in math classes, they are consistently
outperformed on standardized math tests. This study compared 15-year-old males and females in the United
States by their performance on the PISA 2003 mathematics assessment. A multidimensional Rasch
model was used for item calibration and ability estimation on the basis of four math domains: Space and Shape,
Change and Relationships, Quantity, and Uncertainty. Results showed that the effect sizes of performance differences
are small, all below .20, but consistent, in favor of boys. Space and Shape displayed the largest gender
gap, which supports the findings from many previous studies. Quantity showed the least amount of gender
difference, which may be explained by the hypothesis that girls perform better on tasks that they are familiar
with through classroom practice.

****

An Exploration of
Correctional Staff Members’ Views of Inmate Amenities: A Scaling Approach

*Elizabeth Ehrhardt Mustaine, George E. Higgins, and Richard Tewksbury*

Abstract

Today, the number of prisons and the prison population are rising. One of the key challenges accompanying
these changes is how prisons and their staff can handle this increasing number of inmates. One of the issues involved
is what products, goods, and services are deemed suitable for inmates. Research has addressed this issue but
has yielded no consensus. Methodological differences are central to the disjuncture between samples and beliefs.
Using responses from 554 correctional staff, the Rasch model was used to assess whether perceptions of inmate
amenities are part of a larger dimension. Results suggest that twenty items accurately represent correctional
staff perceptions of inmate amenities, with boxing the most difficult amenity to support and books the easiest
to support.

****

Measuring Job
Satisfaction in the Social Services Sector with the Rasch Model

*Eugenio Brentari and Silvia Golia*

Abstract

In the present paper, the Rasch measurement model is used in the validation and analysis of data coming from
the satisfaction section of the first national survey of the social services sector carried out in Italy. A
comparison between two Rasch models for polytomous data, the Rating Scale Model (RSM) and the Partial
Credit Model (PCM), is discussed. The two models provide similar estimates of the item difficulties and of
worker satisfaction, and for almost all the items the response probabilities computed using the RSM and the PCM
are very close. Because the analysis of the bootstrap confidence intervals shows that the estimates obtained with
the RSM are more stable than those obtained with the PCM, it can be concluded that, for the present data, the
RSM is more appropriate than the PCM.

****

Comparing Screening
Approaches to Investigate Stability of Common Items in Rasch Equating

*Alvaro J. Arce-Ferrer*

Abstract

This paper used real and simulated data sets to compare three screening approaches often used in state-wide
equating programs utilizing the Rasch model: Wright and Stone’s t-statistic, the robust z-statistic, and the displace
measure. Analyses of real data sets supported the superiority of the robust z-statistic and the displace measure
relative to Wright and Stone’s t-statistic. The simulation component did not support the contention that indiscriminate
use of the ±0.3 logits criterion inflates Type I error rates for the robust z-statistic and the displace measure,
although this contention was supported for Wright and Stone’s t-statistic. However, Type II error rates were
largest for the displace measure, followed by the robust z-statistic, then the t-statistic. The paper discusses the
importance of a priori selection of a criterion for screening linking items and its effects on the stability and
accuracy of the Rasch equating constant.

****

Estimation of the
Accessibility of Items and the Confidence of Candidates: A Rasch-Based Approach

*A. A. Korabinski, M. A. Youngson, and M. McAlpine*

Abstract

In Scottish high school mathematics examinations, partial credit is normally awarded for answers that are not
totally correct but nevertheless contain some correct working. As a way of incorporating partial credit in
the marking of ICT versions of these examinations, “steps” have been introduced. The use of “steps” also allows
for a Rasch analysis that measures the inaccessibility of items and the confidence of candidates in addition to
the traditional difficulty of items and ability of candidates. Two Rasch models can be fitted and jointly assessed
for fit. The resulting measures can then be investigated for any relationship between ability and confidence and
between difficulty and inaccessibility. A small data set has been used to illustrate these ideas.

****

Binary Items and
Beyond: A Simulation of Computer Adaptive Testing Using the Rasch Partial Credit Model

*Rense Lange*

Abstract

Past research on Computer Adaptive Testing (CAT) has focused almost exclusively on the use of binary items and
on minimizing the number of items to be administered. To address this situation, extensive computer simulations
were performed using partial credit items with two, three, four, and five response categories. Other variables
manipulated include the number of available items, the number of respondents used to calibrate the items, and
various manipulations of respondents’ true locations. Three item selection strategies were used, and the theoretically
optimal Maximum Information method was compared to random item selection and Bayesian Maximum
Falsification approaches. The Rasch partial credit model proved to be quite robust to various imperfections,
and systematic distortions did occur mainly in the absence of sufficient numbers of items located near the trait
or performance levels of interest. The findings further indicate that having small numbers of items is more
problematic in practice than having small numbers of respondents to calibrate these items. Most importantly,
increasing the number of response categories consistently improved CAT’s efficiency as well as the general
quality of the results. In fact, increasing the number of response categories proved to have a greater positive
impact than did the choice of item selection method, as the Maximum Information approach performed only
slightly better than the Maximum Falsification approach. Accordingly, issues related to the efficiency of item
selection methods are far less important than is commonly suggested in the literature. However, being based
on computer simulations only, the preceding presumes that actual respondents behave according to the Rasch
model. CAT research could thus benefit from empirical studies aimed at determining whether, and if so, how,
selection strategies impact performance.

**Vol. 9, No. 2 Summer 2008**

Effects of Varying Magnitude and Patterns of
Response Dependence in the Unidimensional Rasch Model

*Ida Marais and David Andrich*

Abstract

By adding items with responses identical to a selected item, Smith (2005) investigated the effect of response
dependence on person and item parameter estimates in the dichotomous Rasch model. By varying the magnitude
of response dependence among selected items, rather than imposing perfect dependence, this paper provides
additional insights into the effects of response dependence on the same estimates in the same model. Two sets
of simulations are reported. In the first set, responses to all items except the first were dependent on either the
first item or on the immediately preceding item; in the second set, subsets of items were formed first, and then
within each of these subsets, responses to all items in a subset except the first were dependent on either the first
item or on the immediately preceding item. The effects of dependence were noticeable in all of the statistics
reported. In particular, the fit statistics and the parameter estimates showed increasing discrepancies from their
theoretical values as a function of the magnitude of the dependence. In some cases, however, two related statistics
gave the impression of improvement as a function of increased dependence: first, the standard deviation of
person estimates increased; and second, the index analogous to traditional reliability showed a relative
increase. In addition to the estimates, and depending on the structure and magnitude of the dependence, the
person distribution was affected systematically, ranging from becoming skewed to becoming bimodal. The effects
on the distribution help explain some of the effects on the statistics reported. In the case of the second set
of simulations in which the dependence is within subsets of items, it is possible to take account of the response
dependence. This is done by summing the responses of the items within each subset to form a polytomous
item and then analyzing the data in terms of a smaller number of polytomous items. This way of accounting
for dependence, in which the maximum score for the test as a whole remains the same, gives a more accurate
value of the reliability and a more realistic distribution of the person estimates than when the dependence within
subsets of items is not taken into account.
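The subset-summing remedy described above, in which dependent dichotomous items are collapsed into a single polytomous item while the test maximum stays the same, can be sketched as follows; the function names and data layout are illustrative assumptions, not the authors' code.

```python
def collapse_subsets(responses, subsets):
    """Sum dichotomous responses within each dependent subset to form
    polytomous 'testlet' items; the maximum total score is preserved."""
    return [sum(responses[i] for i in subset) for subset in subsets]

# Six dichotomous items in two dependent subsets of three.
responses = [1, 1, 0, 1, 0, 1]
polytomous = collapse_subsets(responses, [[0, 1, 2], [3, 4, 5]])
```

Each collapsed item is then scored 0 to m (m being the subset size), and the smaller set of polytomous items is analyzed with a polytomous Rasch model, which yields the more accurate reliability the abstract describes.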

****

Fisher’s Information Function and Rasch
Measurement

*Mark H. Stone*

Abstract

Fisher’s information function is reviewed with respect to an example he used for
explication. A contemporary example continues the discussion with application to a rating scale instrument.
The relationship of information to precision and measurement error is presented and discussed with respect
to the analysis of fit. Targeting the instrument and the best test design for measuring a person with respect
to information and item-person fit is discussed. The idealization of information and precision for making
measures appears most effectively realized when computerized assisted testing can be employed to implement
a best test design.
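For orientation, the information function for a dichotomous Rasch item is commonly written as follows (standard notation, not quoted from the article itself):

```latex
P_i(\theta) = \frac{e^{\theta - \delta_i}}{1 + e^{\theta - \delta_i}}, \qquad
I_i(\theta) = P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr), \qquad
SE(\hat\theta) = \frac{1}{\sqrt{\sum_i I_i(\theta)}}
```

Information peaks where \(\theta = \delta_i\), which is why targeting the instrument, i.e. placing items near the person's location, yields the most precise measures.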

****

A Rasch Analysis for Classification of
Systemic Lupus Erythematosus and Mixed Connective Tissue Disease

*Kyle Perkins, Robert W. Hoffman, and Nikolaus Bezruczko*

Abstract

The classification of rheumatic diseases is challenging because these diseases
have protean and frequently overlapping clinical and laboratory manifestations. This problem is typified
by the difficulty of classification and differentiation of two prototypic multi-system autoimmune diseases,
Systemic Lupus Erythematosus (SLE) and Mixed Connective Tissue Disease (MCTD). The researchers submitted
medical risk factor data represented by instrument or laboratory measures and physician judgments (12
key features for SLE) from 43 patients diagnosed with SLE and 12 key features for MCTD from 51 patients
diagnosed with MCTD to the WINSTEPS Rasch analysis program. Using Rasch model parameterization and fit
and residual analyses, the researchers identified separate dimensions for MCTD and SLE, thereby lending
support to the position that MCTD is its own separate disease, distinct from SLE.

****

Magnitude Estimation and Categorical
Rating Scaling in Social Sciences: A Theoretical and Psychometric Controversy

*Svetlana A. Beltyukova, Gregory E. Stone, and Christine M. Fox*

Abstract

This article revisits a half-century long theoretical controversy associated
with the use of magnitude estimation scaling (MES) and category rating scaling (CRS) procedures in
measurement. The MES procedure in this study involved instructing participants to write a number
that matched their impression of difficulty of a test item. Participants were not restricted in
the range of numbers they could choose for their scale. They also had the choice of disclosing
their individual scale. After the MES task was completed, participants were given a blank copy
of the test to rate the perceived difficulty of each item using a researcher-imposed categorical
rating scale (CRS) from 1 (very easy) to 6 (very difficult). The MES and CRS data were both analyzed
using the Rasch rating scale model. Additionally, the MES data were examined with the Rasch partial credit
model. Results indicate that knowing each person’s scale is associated with smaller errors of measurement.

****

Impact of Altering Randomization
Intervals on Precision of Measurement and Item Exposure

*Timothy Muckle, Betty Bergstrom, Kirk Becker, and John Stahl*

Abstract

This article reports on the use of simulation when a randomization procedure
is used to control item exposure in a computerized adaptive test for certification. We present a
method to determine the optimum width of the interval from which items are selected and we report
on the impact of relaxing the interval width on measurement precision and item exposure. Results
indicate that, if the item bank is well targeted, it may be possible to widen the randomization
interval and thus reduce item exposure without seriously impacting the error of measurement for test
takers whose ability estimates are near the pass point.

****

Rasch Measurement in Developing
Faculty Ratings of Students Applying to Graduate School

*Sooyeon Kim and Patrick C. Kyllonen*

Abstract

The Standardized Letter of Recommendation (SLR), a 28-item form, was created
by ETS to supplement the qualitative rating of graduate school applicants’ nonacademic qualities with
a quantitative approach. The purpose of this study was to evaluate the following psychometric properties
of the SLR using the Rasch rating scale model: dimensionality, reliability, item quality, and rating
category effectiveness. Principal component and factor analyses were also conducted to examine the
dimensionality of the SLR. Results revealed (a) two secondary factors underlay the data, along with a
strong higher order factor, (b) item and person separation reliabilities were high, (c) noncognitive
items tended to elicit higher endorsements than did cognitive items, and (d) a 5-point Likert scale
functioned effectively. The psychometric properties of the SLR support the use of a composite score
when reporting SLR scores and the utility of the SLR in higher education and in admissions.

****

Understanding Rasch Measurement:
Using Rasch Scaled Stage Scores to Validate Orders of Hierarchical Complexity of Balance Beam Task Sequences

*Michael Lamport Commons, Eric Andrew Goodheart, Alexander Pekker,
Theo Linda Dawson, Karen Draney, and Kathryn Marie Adams*

Abstract

These studies examine the relationship between the analytic basis underlying the
hierarchies produced by the Model of Hierarchical Complexity and the probabilistic Rasch scale that places
both participants and problems along a single hierarchically ordered dimension. A Rasch analysis was
performed on data from the balance-beam task series. This yielded a scaled stage of performance for each of
the items. The items formed a series of clusters along this same dimension, according to their order of
hierarchical complexity. We sought to ascertain whether there was a significant relationship between the
order of hierarchical complexity (a task property variable) of the tasks and the corresponding Rasch scaled
difficulty of those same items (a performance variable). The Model of Hierarchical
Complexity was found to be highly accurate in predicting the Rasch stage scores of the performed tasks, thereby
providing an analytic and developmental basis for the Rasch scaled stages.

**Vol. 9, No. 3 Fall 2008**

Formalizing Dimension
and Response Violations of Local Independence in the Unidimensional Rasch Model

*Ida Marais and David Andrich*

Abstract

Local independence in the Rasch model can be violated in two generic ways that are generally not distinguished clearly in the literature. In this paper we distinguish between a violation of unidimensionality, which we call trait dependence, and a specific violation of statistical independence, which we call response dependence, both of which violate local independence. Distinct algebraic formulations for trait and response dependence are developed as violations of the dichotomous Rasch model, data are simulated with varying degrees of dependence according to these formulations, and then analysed according to the Rasch model assuming no violations. Relative to the case of no violation it is shown that trait and response dependence result in opposite effects on the unit of scale as manifested in the range and standard deviation of the scale and the standard deviation of person locations. In the case of trait dependence the scale is reduced; in the case of response dependence it is increased. Again, relative to the case of no violation, the two violations also have opposite effects on the person separation index (analogous to Cronbach’s Alpha reliability index of traditional test theory in value and construction): it decreases for data with trait dependence; it increases for data with response dependence. A standard way of accounting for dependence is to combine the dependent items into a higher-order polytomous item. This typically results in a decreased person separation index and Cronbach’s Alpha, compared with analysing items as discrete, independent items. This occurs irrespective of the kind of dependence in the data, and so further contributes to the two violations not being distinguished clearly. In an attempt to begin to distinguish between them statistically this paper articulates the opposite effects of these two violations in the dichotomous Rasch model.
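One common way to formalize response dependence of this kind (illustrative notation; the paper's exact parameterization may differ) shifts the location of the dependent item by a magnitude \(d\) in the direction of the earlier response:

```latex
\Pr\{X_{ni'} = 1 \mid x_{ni}\}
  = \frac{\exp\!\bigl(\beta_n - \delta_{i'} + (2x_{ni} - 1)\,d\bigr)}
         {1 + \exp\!\bigl(\beta_n - \delta_{i'} + (2x_{ni} - 1)\,d\bigr)},
  \qquad d \ge 0
```

Setting \(d = 0\) recovers the standard dichotomous Rasch model; larger \(d\) makes the response to item \(i'\) increasingly track the response to item \(i\).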

****

Calibration of Multiple-Choice Questionnaires to Assess Quantitative Indicators

*Paola Annoni and Pieralda Ferrari*

Abstract

The joint use of two latent factor methods is proposed to assess a measurement instrument for an underlying
phenomenon. For this purpose, Rasch analysis is initially used to properly calibrate the questionnaires and to discard
non-informative variables and redundant categories. As a second step, an optimal scaling technique, Nonlinear
PCA, is applied to quantify variable categories and to compute a continuous indicator. Specifically, the paper
deals with the state of decay of Italian buildings of great architectural and historical interest, which serve
as a case study. The decay level of the buildings is quantified on the basis of a broad set of observed ordinal
variables, and the final indicator may be used independently for buildings inventoried in the future. Overall, the
similarities and distinct potential of the techniques are analyzed and discussed with the purpose of exploring the
synergistic effect of their combined use.

****

The Impact of Data Collection Design, Linking Method, and Sample Size on Vertical Scaling Using the Rasch Model

*Insu Paek, Michael J. Young, and Qing Yi*

Abstract

Rasch model-based vertical scaling was evaluated in a simulation study with respect to the recovery of item
parameters, the linking constant, the population mean (grade-to-grade growth), the population standard deviation
(grade-to-grade variability), and the separation of grade distributions by effect size. The simulated vertical scale had five
grades with five test levels. Controlled factors were data collection design, linking method,
and sample size. For item parameters, the linking constant, and the population mean, the counter-balanced single group
(CBSG) design with the mean/mean (or fixed item) method and concurrent calibration performed best. Recovery of the
population standard deviation did not show systematic improvement as sample size increased across the different
data collection designs and linking methods. For the separation of grade distributions, CBSG with the mean/mean (or fixed
item) methods performed best. The average absolute differences from the true parameters were less than 0.1
logits across the different linking methods. In general, the differences between linking methods were smaller
than those between sample sizes.

****

Understanding the Unit in the Rasch Model

*Stephen M. Humphry and David Andrich*

Abstract

The purpose of this paper is to explain the role of the unit implicit in the dichotomous Rasch model in determining
the multiplicative factor of separation between measurements in a specified frame of reference. The explanation
is provided at two complementary levels: first, in terms of the algebra of the model in which the role of an implicit,
multiplicative constant is made explicit; and second, at a more fundamental level, in terms of the classical
definition of measurement in the physical sciences. The Rasch model is characterized by statistical sufficiency,
which arises from the requirement of invariant comparisons within a specified frame of reference. A frame of
reference is defined by a class of persons responding to a class of items in a well-defined response context.
The paper shows that two or more frames of reference may have different implicit units without destroying
sufficiency. Understanding the role of the unit permits explication of the relationship between the Rasch model
and the two parameter logistic model. The paper also summarises an approach that can be used in practice to
express measurements across different frames of reference in the same unit.
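The role of the implicit unit can be made explicit by writing the dichotomous model with a multiplicative constant \(\rho\) for the frame of reference (standard notation; a sketch of the idea rather than the paper's exact algebra):

```latex
\Pr\{X_{ni} = 1\} = \frac{\exp\bigl(\rho\,(\beta_n - \delta_i)\bigr)}
                         {1 + \exp\bigl(\rho\,(\beta_n - \delta_i)\bigr)}
```

Within a single frame of reference \(\rho\) is common to all items and is conventionally absorbed into the unit by setting \(\rho = 1\); two frames of reference may carry different values of \(\rho\), and hence different units, without destroying sufficiency.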

****

Factor Structure of the Developmental Behavior Checklist using Confirmatory Factor Analysis of Polytomous Items

*Daniel E. Bontempo, Scott. M. Hofer, Andrew Mackinnon, Andrea M. Piccinin, Kylie Gray, Bruce Tonge, and Stewart Einfeld*

Abstract

The Developmental Behavior Checklist (DBC; Einfeld and Tonge, 1995) is a 95 item clinical screening checklist
designed to assess the extent of behavioral and emotional disturbance in populations with intellectual deficit
(ID). The DBC provides five principal-component derived subscales covering clinically relevant dimensions
of psychopathology (i.e., Disruptive, Self-Absorbed, Communication Disturbance, Anxiety, and Social Relating).
Validating these subscales for individual differences research requires examinations of the stability of
this structure. This study begins a program of psychometric study of the DBC, by utilizing item level data to
investigate the DBC’s subscale structure in regard to simple-structure restrictions, as well as the implications of
factorially complex items for inter-subscale correlations. To accomplish these goals a polytomous confirmatory
factor analysis (PCFA) of the DBC was performed, and the pattern of loadings and inter-factor correlations was
examined with and without simple-structure restrictions. Our findings provide evidence that the two largest
subscales (Disruptive/Antisocial, Self-Absorbed) are well behaved in PCFA models and should exhibit little
bias under unit-weighted scoring procedures or in latent factor models. Findings for the three smaller subscales
(Communication Disturbance, Social Relating, and Anxiety) do not invalidate their use in individual differences
research, but do highlight several issues that should be considered by individual differences researchers.

****

Overcoming Vertical Equating Complications in the Calibration of an Integer Ability Scale for Measuring Outcomes of a Teaching Experiment

*Andreas Koukkoufis and Julian Williams*

Abstract

The measurement complexities emerging from vertical equating in an educational experiment aiming at an advance
in the curriculum are addressed in calibrating an ‘integer ability’ scale for Year 5 students from Greater
Manchester, based on both primary (Years 5 and 6) and high school (Years 7 and 8) data. The need for such a
calibration resulted from experimental teaching of ‘high school content’ in primary school. Substantial Rasch
differential item functioning (DIF) arose in the vertical equating between primary and high school in our initial
‘all-on-all’ concurrent calibration. A second, ‘primary anchored-and-extended’ calibration, which substantially
overcame the DIF problems, is shown to be preferable for our teaching experiment. The relevant methodological
challenges and the techniques adopted are discussed. The solution provided might be useful to researchers in
educational experiments targeting an advance in the curriculum.

****

Estimation of Decision Consistency Indices for Complex Assessments: Model Based Approaches

*Matthew Stearns and Richard M. Smith*

Abstract

With the implementation of the No Child Left Behind assessment program and the use of proficiency levels
as a means of evaluating Adequate Yearly Progress, there is a renewed interest in the consistency of classification
decisions based on scale scores from achievement tests and state-wide proficiency standards. Many of the current
methods described in the literature (Huynh, 1976; Hanson and Brennan, 1990; Livingston and Lewis, 1995)
are based on assumptions about the distribution of the conditional errors. Although recent methods (Brennan
and Wan, 2004) make no assumptions about the distribution, these methods have one compelling disadvantage:
the decision consistency calculated is based on the entire set of data and is not conditional on the location of
the cut scores, the student measure, or the conditional standard errors of measurement for the students. The
decision consistency for a student scoring right at the cut score will be much lower than the decision consistency
for a student with a score 5 points above or below that cut score.
The standard error method described in this article is based solely on the asymptotic standard error of measurement
derived from the appropriate Rasch measurement model and the location of the cut score used to make
the classification decision. The method can easily be modified to accommodate multiple classification
categories. It yields a conditional decision consistency statistic that can be applied to each person ability estimate
(raw score) and provides information that can be used to calculate the likelihood that a person with that measure
will receive the same classification if retested. The decision consistency for the entire sample can be calculated
by simply summing the likelihood of the same classification over all of the examinees.
The results of retest simulations using data that fit the Rasch model suggest that the standard error method
provides a better estimate of the resulting classification consistency than the true score methods or the bootstrap
method.
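Under a normal approximation to measurement error, the conditional statistic described above can be sketched as below; this is an illustrative reconstruction under stated assumptions, not the authors' code or exact estimator.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def conditional_consistency(theta, sem, cut):
    """P(two independent administrations classify the person the same way),
    assuming errors distributed N(0, sem^2) around the measure theta."""
    p_below = norm_cdf((cut - theta) / sem)
    return p_below ** 2 + (1.0 - p_below) ** 2

def sample_consistency(thetas, sems, cut):
    """Average the conditional statistic over all examinees."""
    return sum(conditional_consistency(t, s, cut)
               for t, s in zip(thetas, sems)) / len(thetas)
```

A person exactly at the cut score has consistency 0.5 (a coin flip on retest), while a person several standard errors away approaches 1.0, matching the conditional behavior the abstract describes.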

**Vol. 9, No. 4 Winter 2008**

The Differential Impact of Resolution Methods on the Operational
Scores of Gender and Ethnic Groups

*Shiqi Hao, Robert L. Johnson, and James Penny*

Abstract

In the scoring of performance assessments, when two raters assign different ratings, some method must be used
to resolve the discrepant ratings to form an operational score for reporting. This study investigated the differential
impact of various resolution methods on the operational scores of gender and ethnic groups. The mean
operational scores and passing rates for each group on two essay prompts were compared using three resolution
methods (rater mean, tertium quid, and parity). The results indicated that for female and African American
students, resolution typically resulted in a greater reduction of mean operational ratings and passing rates than for
their male or White counterparts. Differential item functioning (DIF) analyses were conducted using
IRT-based logistic regression models. No apparent gender-related DIF was detected. Although uniform DIF was
found for the ethnic groups, the effect was small, and there was not enough evidence to support the hypothesis
that DIF could be associated with a resolution method.

****

A Rasch Measurement Analysis of the Use of
Cohesive Devices in Writing English as a Foreign Language by Secondary Students in Hong Kong

*Margaret Lai Fun Ho and Russell F. Waugh*

Abstract

This paper investigated the use of three types of cohesive devices (reference, conjunction, and lexis) in
English as a Foreign Language (EFL) essays written by students in secondary years 2, 4, and 6 in
Hong Kong. Fifty students from each of the three forms (N = 150) provided narrative and descriptive essays for
analysis, which were marked by two competent English teachers by counting the frequency of writing devices
used per 100 words. Initially, 14 cohesive devices (items) were included for analysis, but two devices
(items) were deleted as not fitting a Rasch measurement model. The RUMM2020 computer program with the
partial credit model was used to create a linear scale of Writing Devices Used with twelve items: two for references,
four for conjunctions, two for lexis, three for cohesive ties, and one for quality. There was good overall
fit to the measurement model (item-trait chi-square = 56.81, df = 48, p = 0.18), but the Person Separation Index
was very low at 0.08, mainly due to the small range of essay quality in comparison to the difficulties of
the writing devices (items). The three easiest writing devices were remote cohesive
ties, immediate cohesive ties, and mediate cohesive ties. The three hardest were temporal
conjunctions, causal conjunctions, and adversative conjunctions.

****

Linking Classical Test Theory and Two-level Hierarchical Linear Models

*Yasuo Miyazaki and Gary Skaggs*

Abstract

This paper considers the link between classical test theory (CTT) and two-level hierarchical linear models (HLM).
Conceptualizing items as nested within subjects, we can reformulate the ANOVA classical test model as
an HLM. In this HLM framework, item difficulty parameters are represented by the fixed effects, and subjects’
abilities are represented by the random effects. The population reliability of either the total or the mean score can
be represented as a function of the random effects parameters and the number of items. For estimation, taking
advantage of the balanced design of CTT, we can obtain explicit formulas for parameter estimates of
both fixed and random effects in the HLM. The formula and the estimate derived from the HLM exactly
match those of CTT reliability, which are equivalent to Cronbach’s coefficient alpha under the assumption of
essentially tau-equivalent measures. Moreover, most of the important quantities in CTT, such as estimates of
item difficulty, the standard error of measurement, true scores, and person abilities, can be obtained in a single HLM
model. Thus, the CTT model formulated in the HLM framework provides a systematic approach to measurement
analysis by CTT. For illustrative purposes, a small data set was analyzed using the HLM software (Raudenbush,
Bryk, Cheong, & Congdon, 2000). The results confirmed the theoretical link between CTT and HLM.
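In standard variance-components notation (not necessarily the paper's own symbols), with between-subject variance \(\tau^2\), residual variance \(\sigma^2\), and \(k\) items, the reliability referred to above is:

```latex
\rho_{XX'} \;=\; \frac{\tau^2}{\tau^2 + \sigma^2 / k}
           \;=\; \frac{k\,\tau^2}{k\,\tau^2 + \sigma^2}
```

This equals Cronbach's coefficient alpha under essentially tau-equivalent measures, and the same expression serves for both the mean score and the total score.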

****

Measuring Self-Complexity: A Critical Analysis of Linville’s H Statistic

*Wenshu Luo, David Watkins, and Raymond Y. H. Lam*

Abstract

The paper argues that the most commonly used measure of self-complexity, Linville’s H statistic, cannot measure
this construct appropriately. It first examines the mathematical properties of H and its relationships with five
related indices: the number of self-aspects, the overlap among self-aspects, the average inter-aspect correlation,
the ratio of endorsement, and the HICLAS attribute class number. Then, a demonstration study using simulations
is reported. Three conclusions are drawn. First, H and the HICLAS attribute class number are similar in the way they
are calculated. Second, both indices are highly related to the number of self-aspects, while their relationship to overlap
is not monotonic. Third, overlap is affected by the ratio of endorsement and the average inter-aspect correlation but
cannot represent the notion of redundancy among traits, which directly determines Linville’s H statistic. These
conclusions are employed to explain the inconsistent findings relating self-complexity to adaptation, and an
alternative measurement approach is proposed.
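For reference, Linville's H statistic is commonly given as follows (the standard presentation in the self-complexity literature, not quoted from this paper):

```latex
H \;=\; \log_2 n \;-\; \frac{\sum_i n_i \log_2 n_i}{n}
```

Here \(n\) is the total number of attributes available and \(n_i\) is the number of attributes appearing in a particular group combination \(i\) of the self-aspect sortings; larger \(H\) is taken to indicate greater self-complexity.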

****

Measuring Cynicism Toward Organizational
Change — One Dimension or Two?

*Simon Albrecht*

Abstract

Wanous, Reichers, and Austin’s (2000) measure of cynicism about organizational change (CAOC) was subjected
to validation and cross-validation procedures using data collected from two Australian public sector organizations.
More specifically, analyses were conducted to determine whether CAOC is best understood as a one-dimensional
or a two-dimensional construct. The results of confirmatory factor analysis suggest that neither a one-dimensional
nor a two-dimensional measurement model provided a satisfactory fit to the data. However, removal
of two of the original eight items resulted in a two-dimensional model, with three items in each dimension, that
provided a reasonable fit to the data. Practical implications and directions for future research are discussed.

****

Assessment of Differential Item Functioning

*Wen-Chung Wang*

Abstract

This study addresses several important issues in the assessment of differential item functioning (DIF). It starts
with the definition of DIF, the effectiveness of using item fit statistics to detect DIF, and the linear modeling of DIF in
dichotomous items, polytomous items, facets, and testlet-based items. Because a common metric over groups
of test-takers is a prerequisite for DIF assessment, this study reviews three methods of establishing a common
metric: the equal-mean-difficulty method, the all-other-item method, and the constant-item (CI) method.
A small simulation demonstrates the superiority of the CI method over the others. As the CI method relies on a
correct specification of DIF-free items to serve as anchors, a method of identifying such items is recommended
and its effectiveness is illustrated through a simulation. Finally, this study discusses how to assess the practical
significance of DIF at both the item and test levels.
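The anchoring logic behind the constant-item idea can be sketched with a simple mean-shift variant. This is an illustration of the principle, not the CI estimation procedure itself; the function names, toy difficulties, and the 0.5-logit threshold are all assumptions for demonstration.

```python
def flag_dif(delta_ref, delta_foc, anchors, threshold=0.5):
    """Place focal-group item difficulties on the reference metric using
    the mean shift over anchor (assumed DIF-free) items, then flag items
    whose aligned difficulty differs from the reference by > threshold."""
    shift = sum(delta_foc[i] - delta_ref[i] for i in anchors) / len(anchors)
    residuals = {i: (delta_foc[i] - shift) - delta_ref[i]
                 for i in range(len(delta_ref))}
    return {i: r for i, r in residuals.items() if abs(r) > threshold}

# Four items; items 0-2 serve as anchors, item 3 carries 0.7 logits of DIF.
flagged = flag_dif([0.0, 1.0, -1.0, 0.5], [0.2, 1.2, -0.8, 1.4],
                   anchors=[0, 1, 2])
```

Misspecified anchors contaminate the estimated shift, which is why a method for first identifying DIF-free items, as the abstract recommends, matters in practice.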