
Vol. 13, No. 1 Spring 2012

Formulating Latent Growth Using an Explanatory Item Response Model Approach

Mark Wilson, Xiaohui Zheng, and Leah McGuire

Abstract

In this paper, we present a way to extend the Hierarchical Generalized Linear Model (HGLM; Kamata, 2001; Raudenbush, 1995) to include the many forms of measurement models available under the formulation known as the Multidimensional Random Coefficients Multinomial Logit (MRCML) Model (Adams, Wilson and Wang, 1997), and apply that to growth modeling. First, we review two different traditions in modeling growth studies: the first is based in the hierarchical linear modeling (HLM) tradition, and the second, which is the topic of this paper, is rooted in the Rasch measurement tradition; this is the linear Latent Growth Item Response Model (LG-IRM). Going beyond the linear case, the LG-IRM approach allows us to considerably extend the range of models available in the HLM tradition to incorporate several of the extensions of IRT models that are used in creating explanatory item response models (EIRM; De Boeck and Wilson, 2004). We next present a number of extensions, including polynomial growth modeling, differential item functioning (DIF) effects, growth functions that can be approximated by polynomial expressions, provision for polytomous responses, person and item covariates (and time-varying covariates), and multiple dimensions of growth. We provide two empirical examples to illustrate several of the models, using the ConQuest software (Wu, Adams, Wilson and Haldane, 2008) to carry out the analyses. We also provide several simulations to investigate the success of the estimation procedures.

****

Using the Mixed Rasch Model to Analyze Data from the Beliefs and Attitudes About Memory Survey

Everett V. Smith, Jr., Yuping Ying, and Scott W. Brown

Abstract

In this study, we used the Mixed Rasch Model (MRM) to analyze data from the Beliefs and Attitudes About Memory Survey (BAMS; Brown, Garry, Silver, and Loftus, 1997). We used the original 5-point BAMS data to investigate the functioning of the “Neutral” category via threshold analysis under a 2-class MRM solution. The “Neutral” category was identified as not eliciting the model-expected responses, and observations in the “Neutral” category were subsequently treated as missing data. For the BAMS data without the “Neutral” category, exploratory MRM analyses specifying up to 5 latent classes were conducted to evaluate data-model fit using the consistent Akaike information criterion (CAIC). For each of three BAMS subscales, a two-latent-class solution was identified as fitting the mixed Rasch rating scale model the best. Results regarding threshold analysis, person parameters, and item fit based on the final models are presented and discussed, as well as the implications of this study.
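
The CAIC comparison across latent-class solutions described above can be sketched as follows. This is a minimal illustration, not the authors' analysis: it assumes the usual definition CAIC = -2 log L + k (ln N + 1), and the log-likelihoods, parameter counts, and sample size are hypothetical values invented for the example.

```python
import math


def caic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Consistent AIC: -2 * logL + k * (ln(N) + 1). Lower is better."""
    return -2.0 * log_likelihood + n_params * (math.log(n_obs) + 1.0)


# Hypothetical fit results, keyed by number of latent classes:
# (log-likelihood, number of estimated parameters). Not from the study.
candidates = {1: (-5200.0, 20), 2: (-5050.0, 41), 3: (-5030.0, 62)}
n_respondents = 500  # hypothetical sample size

scores = {c: caic(ll, k, n_respondents) for c, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # class count with the smallest CAIC
```

Because the penalty grows with both the parameter count and ln N, a solution with more classes is preferred only when its gain in log-likelihood outweighs the added parameters.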

****

An Examination of Personality Characteristics Related to Acquiescence

Christine DiStefano, Grant B. Morgan, and Robert W. Motl

Abstract

Acquiescence, the tendency to agree with statements regardless of content, is often a concern when administering self-report instruments. While there is evidence to support acquiescence as a response style, this reporting tendency may be related to personality factors of individuals. Using a sample of 757 adults, we investigated the Rosenberg Self-Esteem Scale for acquiescence response tendencies by applying the Rasch partial credit model. Results suggested that favorable (i.e., Agree or Strongly Agree) responses were more frequent for the positively worded items than for negatively worded items. Second, the relationship between acquiescence and seven additional personality measures was examined overall and by sex. Among females, acquiescence was correlated with personality measures measuring perceptions by others, whereas acquiescence among males was related to exhibition types of behaviors.

****

Construction and Validation of Two Parent-Report Scales for the Evaluation of Early Intervention Programs

William P. Fisher, Jr., Batya Elbaum, and W. Alan Coulter

Abstract

The State Performance Plan (SPP) developed under the 2004 reauthorization of the Individuals with Disabilities Education Act (IDEA 2004, Public Law 108-446) requires states to collect data and report on the impact of early intervention services on three key outcomes for participating families. The NCSEAM Impact on Family Scale (NIFS) and the NCSEAM Family Centered Services Scale (NFCSS) were developed to provide states with a means to address this new reporting requirement and to collect additional data that would inform program improvement efforts. Items suggested by stakeholder groups were piloted with a nationally representative sample of parents of children with developmental delays or disabilities ages birth to three participating in early intervention services in eight states. The 28-item NIFS had measurement reliabilities ranging from .93 to .96 in a sample of 1,750; measurement reliabilities for the 135-item NFCSS ranged from .94 to .97 in a sample of 1,755 respondents. A 29-item version of the NFCSS had measurement reliabilities ranging from .87 to .92. Using data from the pilot study, stakeholders established a recommended performance standard, set at a meaningful point in the NIFS item hierarchy, for each of the three established outcome areas.

****

Multi-Factor Scale Consolidation When Theory is Weak

Nikolaus Bezruczko and Kyle Perkins

Abstract

As a practical matter, Spirituality and Quality of Life in the health sciences are usually measured separately. Theoretical foundations for this distinction, however, are not strong. In this research, an empirical investigation was conducted into their joint calibration with a Rasch model. Functional Assessment of Cancer Therapy-General (28 items), a cancer health-related quality of life measure (HRQOL), and Functional Assessment of Chronic Illness - Spiritual Well-Being (12 items), a measure of religious and existential well-being (Spirituality), were co-calibrated with a Rasch model implemented with WINSTEPS software for ratings from 545 breast cancer patients. The results show a hierarchical integration of QOL and Spirituality items on a common variable, and both patient separation (2.66) and reliability (.88) improve after co-calibration. Principal Component Analysis of co-calibrated item residuals did not show major threats to dimensionality, and joint calibration explains item variance comparable to separate calibrations (51.9%). Although patient measures (logits) based on separate and co-calibration are within two standard errors, ethnic and racial group values shift after co-calibration.
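
The separation (2.66) and reliability (.88) figures reported above are linked by the standard Rasch relationship R = G² / (1 + G²). The sketch below assumes that relationship applies to the reported values; the function names are illustrative only.

```python
def reliability_from_separation(separation: float) -> float:
    """Convert a separation index G to reliability R = G^2 / (1 + G^2)."""
    g2 = separation * separation
    return g2 / (1.0 + g2)


def separation_from_reliability(reliability: float) -> float:
    """Inverse conversion: G = sqrt(R / (1 - R)), valid for 0 <= R < 1."""
    return (reliability / (1.0 - reliability)) ** 0.5


# Reported patient separation after co-calibration.
r = reliability_from_separation(2.66)  # about 0.876, which rounds to .88
```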

****

Developing an Emotional Distress Item Bank for Cancer Patients

Allen W. Heinemann, Rita K. Bode, Sarah Rosenbloom, and David Cella

Abstract

Emotional distress is common among cancer patients during and after treatment. Many instruments have been used to measure emotional distress; however, none of them has emerged as a standard. Although the diversity of instruments has some merit, the lack of a common measure limits our ability to compare studies. This paper describes how we constructed a 46-item emotional distress bank. Using expert judgment, we selected a pool of items with emotional content from this six-instrument set. Rasch rating scale analysis helped us identify a set of general distress items with good model fit and a measurement gap causing floor effects. Additional items were written to augment the measure where found deficient. The resulting set of items reflects a spectrum of positive and negative affect. The measure demonstrated excellent reliability (person separation reliability = .96) and a wide range of emotional distress and was able to distinguish among levels of disease severity.

****

A balance scale metaphor is offered as a tool for explaining the principles of measurement and for visualizing the internal structure of dichotomous and polytomous Rasch models. The balance scale metaphor is used to guide the derivation of a general polytomous Rasch model and to illustrate the additional assumptions subsequently required to derive the Andrich (1978) rating scale model (RSM) and the Masters (1982) partial credit model (PCM). The metaphor is used to present the argument that the RSM conforms to the rules of measurement, but the PCM has interactions implicit in its structure that violate specific objectivity and sufficiency of raw scores, which challenge its status as a Rasch model. Using the metaphor and a literal interpretation of the narrative description of the PCM by Masters (1982), a new version of the PCM is derived that does conform to the rules of measurement.
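
For readers who want the two models side by side, the sketch below computes category probabilities under the conventional Masters (1982) PCM parameterization and the Andrich (1978) RSM, where the RSM is obtained by constraining each step difficulty to an item location plus a shared threshold. It is an illustration of the standard forms only; it does not implement the revised PCM derived in the paper, and the function and argument names are assumptions.

```python
import math
from typing import List, Sequence


def pcm_probs(theta: float, step_difficulties: Sequence[float]) -> List[float]:
    """Category probabilities for one item under the partial credit model.

    Category x has log-numerator sum_{k<=x} (theta - delta_k), with 0 for x = 0.
    """
    logits = [0.0]
    for delta in step_difficulties:
        logits.append(logits[-1] + (theta - delta))
    peak = max(logits)  # subtract the maximum for numerical stability
    expd = [math.exp(v - peak) for v in logits]
    total = sum(expd)
    return [e / total for e in expd]


def rsm_probs(theta: float, item_location: float,
              thresholds: Sequence[float]) -> List[float]:
    """Rating scale model: PCM with delta_k = item_location + tau_k."""
    return pcm_probs(theta, [item_location + tau for tau in thresholds])


# With a single step, both models reduce to the dichotomous Rasch model.
p_dichotomous = pcm_probs(1.0, [0.0])
```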

Exploring the Alignment of Writing Self-Efficacy with Writing Achievement Using Rasch Measurement Theory and Qualitative Methods

Alignment of writing self-efficacy and writing achievement is defined as the congruence between student confidence regarding writing skills (writing self-efficacy) and the actual performance on these writing skills as reflected in teacher grades (achievement). One purpose of this study is to examine the relationship between these two variables. A second purpose is to demonstrate a mixed-methods approach to investigating relationships between affective variables using Rasch measurement and interviews. Participants were eighth grade students (N = 94) from an ethnically and socioeconomically diverse school in the southeastern United States. Our results suggest that students who struggle with the mechanics of writing yet appreciate the expressive capacity of writing may have higher senses of writing self-efficacy that are not predictive of performance.

Measuring Positive and Negative Affect in Older Adults Over 56 Days: Comparing Trait Level Scoring Methods Using the Partial Credit Model

Positive (PA) and negative affect (NA) are important constructs in health and well-being research. Good longitudinal measurement is crucial to conducting meaningful research on relationships between affect, health, and well-being across the lifespan. One common affect measure, the PANAS, has been evaluated thoroughly with factor analysis, but not with Rasch-based latent trait models (RLTMs) such as the partial credit model (PCM), and not longitudinally. Current longitudinal RLTMs can computationally handle few occasions of data. The present study compares four methods of anchoring PCMs across 56 occasions to longitudinally evaluate the psychometric properties of the PANAS plus additional items. Anchoring item parameters on mean parameter values across occasions produced more desirable results than using no anchor, using first occasion parameters as anchors, or allowing anchor values to vary across occasions. Results indicated problems with NA items, including poor category utilization, gaps in the item distribution, and a lack of easy-to-endorse items. PA items had much more desirable psychometric qualities.

The aim is to show that it is possible to parameterize discrimination for sets of items, rather than individual items, without destroying conditions for sufficiency in a form of the Rasch model. The form of the model is obtained by formalizing the relationship between discrimination and the unit of a metric. The raw score vector across item sets is the sufficient statistic for the person parameter. Simulation studies are used to show the implementation of conditional estimation solution equations based on the relevant form of the Rasch model. The model is also applied to two numeracy tests attempted by a group of common persons in a large-scale testing program. The results show improved fit compared with the Rasch model in its standard form. They also show the units of the scales were more accurately equated. The paper discusses implications for applied measurement using Rasch models and contrasts the approach with the application of the two parameter logistic (2PL) model.

Educational Achievement, Personality, and Behavior: Assessment, Factor Structure and Implications for Theory and Practice

The purposes of this research were to first examine the evidence regarding the factor structure of educational achievement tests in the context of two theoretical models of cognitive ability (psychometric g and mutualism) that have been proposed to explain this structure as well as the underlying processes that may be responsible for its emergence in dimensionality studies. Then, the factor structure underlying a sample of the standardized educational achievement tests used by California in its statewide school accountability program was compared to those emerging from a selection of behavioral and personality assessments. As expected, the educational achievement tests exhibited a strong and uniformly positive manifold resulting in greater unidimensionality, as evidenced by a dominant general factor in bi-factor analysis, than either the personality or behavioral assessments. The implications of these structural differences are discussed with respect to the two theoretical perspectives as well as in the context of formative and summative educational inferences in particular, and the school accountability and reform movement in general.

An External Validation Study of a Classification of Mixed Connective Tissue Disease and Systemic Lupus Erythematosus Patients

Mixed Connective Tissue Disease (MCTD) and Systemic Lupus Erythematosus (SLE) are autoimmune rheumatic diseases that are difficult for physicians to diagnose and to distinguish for a variety of reasons. The correct classification of these two diseases is a crucial issue for clinicians who treat autoimmune rheumatic diseases. In prior research, medical risk factors represented by instrument or laboratory measures and physician judgments (12 key features for MCTD and 12 key features for SLE) were parameterized with a one parameter logistic function in a Rasch model. Those results identified separate diagnostic dimensions for MCTD and SLE. This procedure was replicated in the present research with a sample of largely African American and Hispanic patients. Results verified separate dimensions for MCTD and SLE, which suggests MCTD is a separate disease from SLE.

The Development and Validation of the Core Competencies Scale (CCS) for the College and University Students

This article describes the development and validation of the Core Competencies Scale (CCS) using Bok’s (2006) competency framework for undergraduate education. The framework included: communication, critical thinking, character development, citizenship, diversity, global understanding, widening of interest, and career and vocational development. The sample comprised 70 college and university students. Results of analysis using Rasch rating scale modelling showed that there was strong empirical evidence on the validity of the measures in contents, structure, interpretation, generalizability, and response options of the CCS scale. The implication of having developed Rasch-based valid and dependable measures in this study for gauging the value added of college and university education to their students is that the feedback generated from CCS will enable evidence-based decision and policy making to be implemented and strategized. Further, program effectiveness can be measured and thus accountability on the achievement of the program objectives.

This simulation study compared the performance of two multilevel measurement testlet (MMMT) models: Beretvas and Walker’s (2012) two-level MMMT model and Jiao, Wang, and Kamata’s (2005) three-level model. Several conditions were manipulated (including testlet length, sample size, and the pattern of the testlet effects) to assess the impact on estimation of fixed and random effect parameters. The results of the present simulation study showed that MMMT-2r yielded the best parameter bias in estimation on fixed item effects, fixed testlet effects, and random testlet effects for conditions with nonzero equal pattern of random testlet effects’ variance even when the MMMT-2r was not the generating model. However, random effects estimation did not perform well when unequal random testlet effects’ variances were generated. Fit indices did not perform well either, as other studies have found. In addition, from the modeling perspective, MMMT-2r allows the greatest flexibility in terms of modeling testlet effects as fixed, random, or both.

A Study of Rasch, Partial Credit, and Rating Scale Model Parameter Recovery in WINSTEPS and jMetrik

jMetrik and WINSTEPS are two Rasch measurement software applications that implement joint maximum likelihood estimation of Rasch, partial credit, and rating scale model parameters via a proportional curve fitting algorithm. We describe this algorithm in this paper and explain the handling of missing data and extreme cases. Results from a simulation study that manipulated sample size and the number of test items indicate that both programs produce similar bias and root mean squared error values. In addition, root mean squared difference values indicate that estimates from each program are within 0.001 and 0.004 logits of each other depending on the model in question.
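
The three recovery criteria named above can be written out as a short sketch. The definitions are the conventional ones (mean signed error, root mean squared error against generating values, and root mean squared difference between two sets of estimates); the function names and the sample values are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def bias(estimates: Sequence[float], truth: Sequence[float]) -> float:
    """Mean signed difference between estimates and generating values."""
    return sum(e - t for e, t in zip(estimates, truth)) / len(truth)


def rmse(estimates: Sequence[float], truth: Sequence[float]) -> float:
    """Root mean squared error of estimates around generating values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))


def rmsd(estimates_a: Sequence[float], estimates_b: Sequence[float]) -> float:
    """Root mean squared difference between two programs' estimates."""
    return rmse(estimates_a, estimates_b)


# Hypothetical item difficulties in logits, for illustration only.
true_b = [-1.0, 0.0, 1.0]
program_a = [-1.05, 0.02, 1.03]
program_b = [-1.049, 0.021, 1.031]
agreement = rmsd(program_a, program_b)  # compares the programs to each other
```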

This study investigated the validation of comprehensive cognitive attributes of an eighth-grade mathematics test using the least squares distance method and compared performance on attributes by gender and region. A sample of 5,000 students was randomly selected from the data of the 2005 Turkish national mathematics assessment of eighth-grade students. Twenty-five math items were assessed for presence or absence of 20 cognitive attributes (content, cognitive processes, and skill). Four attributes were found to be misspecified or nonpredictive. However, results demonstrated the validity of cognitive attributes in terms of the revised set of 17 attributes. The girls had similar performance on the attributes as the boys. The students from the two eastern regions significantly underperformed on most attributes.

Using Rasch Measurement to Validate the Big Five Factor Marker Questionnaire for a Japanese University Population

In recent years, psychological studies have increasingly come to support the so-called “Big Five” or “Five-factor Model” (FFM) of human personality. However, the vast majority of research in this field has been undertaken in Western contexts, thus raising the question of how applicable the Big Five is to Asian populations. Moreover, nearly all research into the Big Five has relied on traditional techniques of statistical analysis (e.g., factor analysis, correlation) to validate their results, despite the limitations of such methods. This study examined instrument validation of a widely-used Big Five instrument (the Factor Markers questionnaire) given to a Japanese population (n = 283) by using the Rasch rating scale model (Andrich, 1978). Rasch principal components analysis of the item residuals indicated the possible existence of additional factors within the Intellect/Imagination and Agreeableness factors, as well as additional item fit problems within each hypothesized construct.

An Applied Examination of Different Weighting Methods for the Root Expected Mean Square Difference and Root Mean Square Difference Indices

It is possible that functions used to link tests are sensitive to subpopulations of test takers. The REMSD and RMSD(x) are weighted effect sizes of linking invariance, yet it is often unclear how the weights are most appropriately applied when subpopulation group sizes are heterogeneous. The objective of this research is to apply two different weighting methods to the REMSD and RMSD(x) functions while testing for population invariance in a linkage across subpopulations of disparate sample sizes, and to subsequently compare the results across these differentially weighted effect sizes. The findings demonstrate that utilizing proportional weights in the REMSD and RMSD(x) indices can underestimate differences in linking functions for small subpopulations.
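
The effect of the weighting choice can be seen in a small sketch of RMSD(x) at a single score point. It assumes the commonly cited form of the index, the square root of the weighted mean squared difference between each subpopulation's linked score and the total-population linked score, divided by the total-population standard deviation on the target scale. The group sizes, linked scores, and standard deviation are hypothetical.

```python
import math
from typing import List, Sequence


def proportional_weights(group_sizes: Sequence[int]) -> List[float]:
    """Weights proportional to subpopulation sample size."""
    total = sum(group_sizes)
    return [n / total for n in group_sizes]


def equal_weights(n_groups: int) -> List[float]:
    """The same weight for every subpopulation, regardless of size."""
    return [1.0 / n_groups] * n_groups


def rmsd_x(subgroup_links: Sequence[float], total_link: float,
           weights: Sequence[float], sd_total: float) -> float:
    """RMSD at one score point x, in total-population SD units."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    mean_sq = sum(w * (e_j - total_link) ** 2
                  for w, e_j in zip(weights, subgroup_links))
    return math.sqrt(mean_sq) / sd_total


# Hypothetical case: a large group that links almost identically to the
# total population and a small group that differs by two score points.
sizes = [900, 100]
links_at_x = [50.1, 52.0]
prop = rmsd_x(links_at_x, 50.0, proportional_weights(sizes), 10.0)
equal = rmsd_x(links_at_x, 50.0, equal_weights(len(sizes)), 10.0)
```

In this made-up case the proportionally weighted value is less than half the equally weighted one, because the small group's two-point discrepancy receives a weight of only 0.1.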

Development and Validation of a Questionnaire to Evaluate Attitudes toward Family Medicine

Francisco Escobar Rabadán, Jesús López-Torres Hidalgo, Julio Montoya Fernández, Juan M. Téllez Lapeira, Mª Aranzazu Romero Cebrián, and Juan M. Armero Simarro

A questionnaire that evaluates medical students’ knowledge of and attitudes towards primary care and family medicine was developed and validated in order to analyze changes during medical training. A 34-item questionnaire with 5 options on a Likert-type scale was designed. Based on this, a 21-item version was developed and validated. The internal consistency was estimated using Cronbach’s alpha, and the questionnaire was analyzed with the Rasch model. One hundred and fifty-nine students responded to the brief questionnaire (95.78% of those enrolled). Cronbach’s alpha was 0.72. Rasch analysis showed good fit of the items to the model, with an average reliability of 0.97 for the item estimates and 0.68 for the person estimates. Our questionnaire has shown acceptable internal consistency and good construct validity.
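
The internal-consistency statistic reported above can be computed as in the following sketch, which uses the standard formula alpha = k/(k-1) × (1 - sum of item variances / variance of total scores). The response matrix is hypothetical and far smaller than the study's data.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a persons-by-items matrix of scored responses."""
    n_items = len(responses[0])
    # Sample variance of each item across persons.
    item_vars = [variance([row[j] for row in responses]) for j in range(n_items)]
    # Sample variance of each person's total score.
    total_var = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1.0 - sum(item_vars) / total_var)


# Hypothetical 5-point Likert responses: 5 persons by 3 items.
likert = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 4, 5],
    [2, 2, 3],
    [1, 2, 1],
]
alpha = cronbach_alpha(likert)
```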

Conditional pairwise estimation of parameters in Rasch models separates inference on item parameters from inference on person parameters. Pairwise item parameter estimates are consistent when sample size approaches infinity, but add an extra random error to estimation compared to conditional maximum likelihood estimates. Pairwise estimates of person parameters are easily calculated, but can rarely be assumed to be consistent since the number of items is often small and the properties of the estimates generally unknown. This note gives results from a study of conditional pairwise estimation of person parameters and suggests a modification of the estimate that takes care of some of the error.

Examining Rating Quality in Writing Assessment: Rater Agreement, Error, and Accuracy

The use of performance assessments in which human raters evaluate student achievement has become increasingly prevalent in high-stakes assessment systems such as those associated with recent policy initiatives (e.g., Race to the Top). In this study, indices of rating quality are compared between two measurement perspectives. Within the context of a large-scale writing assessment, this study focuses on the alignment between indices of rater agreement, error, and accuracy based on traditional and Rasch measurement theory perspectives. Major empirical findings suggest that Rasch-based indices of model-data fit for ratings provide information about raters that is comparable to direct measures of accuracy. The use of easily obtained approximations of direct accuracy measures holds significant implications for monitoring rating quality in large-scale rater-mediated performance assessments.

Understanding language development is incomplete without recognizing children’s sociocultural environments, including adult beliefs about language development. Yet there is a need for data supporting valid inferences to assess these beliefs. The current study investigated the psychometric properties of data from a survey (MODeL) designed to explore beliefs in the popular culture, and their alignment with more formal theories. Support for the content, substantive, structural, generalizability, and external aspects of construct validity of the data was investigated. Subscales representing Behaviorist, Cognitive, Nativist, and Sociolinguistic models were identified as dimensions of beliefs. More than half of the items showed a high degree of consensus, suggesting culturally transmitted beliefs. Behaviorist ideas were most popular. Bilingualism and ethnicity were related to Cognitive and Sociolinguistic beliefs. Identifying these beliefs may clarify the nature of child-directed speech, and enable the design of language intervention programs that are congruent with family and cultural expectations.

Concurrent Validation of CHIRP, a New Instrument for Measuring Healthcare Student Attitudes towards Interdisciplinary Teamwork

Positive attitudes towards teamwork among health care professionals are critical to patient safety. The purpose of this study is to describe the development and concurrent validation of a new instrument to measure attitudes towards healthcare teamwork that is generalizable across various populations of healthcare students. The Collaborative Healthcare Interdisciplinary Planning (CHIRP) scale was validated against the Readiness for Inter-Professional Learning Scale (RIPLS). Analyses included student (n = 266) demographics, ANOVA, internal consistency, factor analysis, and Rasch analysis. The two instruments correlated at r = .582 (p < .01). The CHIRP showed a multifactorial structure having excellent internal consistency (alpha = .850), with 25 of the 36 scale items loading onto a single Teamwork Attitudes factor. The RIPLS likewise had strong internal consistency (alpha = .796) and a three-factor structure, supporting previous studies of the instrument. However, Rasch analyses showed 14 (38.9%) of the 36 CHIRP items, but only four (21.1%) of the 19 RIPLS items remaining within the satisfactory standardized OUTFIT zone of ±2.0 standard deviation units. We propose the 14 fitting items as a new, validated teamwork attitudes scale.

Using Extended Rasch Models to Assess Validity of Diagnostic Tests in the Presence of a Reference Standard

Vivian Viallon, Emmanuel Ecosse, Mounir Mesbah, Jacques Pouchot, and Joel Coste

We show that extended Rasch models, built to deal with continuous latent variables, may be useful for assessing validity of medical diagnostic tests in the presence of a reference standard, particularly for chronic diseases. We derive estimates for sensitivity and specificity under the Rasch model assumptions. Our estimations can be computed conditionally on the level of potential confounding covariates, making use of a variety of extended Rasch models, namely log-linear Rasch models, to examine the association between covariates and both disease intensity and response to the tests. Also, another variety of extended Rasch models—partial credit models—can determine appropriate thresholds for quantitative diagnostic tests. As an example, we study the validity of some diagnostic tests of heart failure.

Today, the amount of stress the correctional staff endures at work is an important issue. Research has addressed this issue, but has yielded no consensus as to a properly calibrated measure of perceptions of work stress for correctional staff. Using data from a non-random sample of correctional staff (n = 228), the Rasch model was used to assess whether a specific measure of work stress would fit the model. Results show that three items rather than six items accurately represented correctional staff perceptions of work stress.

A Rasch Measure of Teachers’ Views of Teacher-Student Relationships in the Primary School

This study investigated teacher-student relationships from the teachers’ point of view at Perth metropolitan schools in Western Australia. The study identified three key social and emotional aspects that affect teacher-student relationships, namely, Connectedness, Availability and Communication. Data were collected by questionnaire (N = 139) with stem-items answered in three perspectives: (1) Idealistic: this is what I would like to happen; (2) Capability: this is what I am capable of; and (3) Behaviour: this is what actually happens, using four ordered response categories: not at all (score 1), some of the time (score 2), most of the time (score 3), and almost always (score 4). Data were analysed with a Rasch measurement model and a uni-dimensional, linear scale with 24 items, ordered from easy to hard, was created. The data were shown to be highly reliable, so that valid inferences could be made from the scale. The Person Separation Index (akin to a reliability index) was 0.93; there was good global teacher and item fit to the measurement model; there was good item fit; the targeting of the item difficulties against the teacher measures was good, and the response categories were answered consistently and logically. Teachers said that the ideal items were all easier than their corresponding capability items which were in turn easier than the behaviour items (where the items fitted the model), as conceptualised. The easiest ideal items were ‘I like this child’ and ‘this child and I get along well together.’ The hardest ideal item (but still easy) was ‘I am available for this child.’ The easiest behaviour item (but still hard) was ‘This child and I get along well together.’ The hardest behaviour item (and very hard) was ‘I am interested to learn about this child’s personal thoughts, feelings and experiences.’ The difficulties of the items supported the conceptual structure of the variable.
