Journal of Applied Measurement

P.O. Box 1283

Maple Grove, MN 55311

 

Volume 12, 2011 Article Abstracts

Vol. 12, No. 1 Spring 2011

Using Adjusted GPA and Adjusted Course Difficulty Measures to Evaluate Differential Grading Practices in College

Dina Bassiri and E. Matthew Schulz

Abstract

In this study, the Rasch rating scale model (Andrich, 1978) was applied to college grades of four freshman cohorts from a large public university. After editing, the data represented approximately 34,000 students, 1,700 courses, and 119 departments. The rating scale model analysis yielded measures of student achievement and course difficulty. Indices of the difficulty of academic departments were derived through secondary analyses of course difficulty measures. Differences between rating scale model measures and simple grade averages were examined for students, courses, and academic departments. The differences were provocative and suggest that the rating scale model could be a useful tool in addressing a variety of issues that concern college administrators.
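The rating scale model applied here assigns each grade category a probability determined by the student measure, the course difficulty, and a set of category thresholds shared across courses. A minimal sketch of those category probabilities (function name and example values are illustrative, not from the study):

```python
import math

def rsm_probs(theta, delta, taus):
    """Andrich rating scale model: probability of each category
    0..m for a person with measure theta on an item (course) with
    difficulty delta and shared thresholds taus; taus[k] separates
    categories k and k+1."""
    # Cumulative sums of (theta - delta - tau_k); category 0 has sum 0.
    logits = [0.0]
    s = 0.0
    for tau in taus:
        s += theta - delta - tau
        logits.append(s)
    denom = sum(math.exp(l) for l in logits)
    return [math.exp(l) / denom for l in logits]

# A student 1 logit above a course's difficulty; five grade categories (F..A)
probs = rsm_probs(theta=1.0, delta=0.0, taus=[-1.5, -0.5, 0.5, 1.5])
```

With these illustrative thresholds the most probable grade is the second-highest category, and the probabilities sum to one, as the model requires.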

****

Optimizing the Compatibility between Rating Scales and Measures of Productive Second Language Competence

Christopher Weaver

Abstract

This study presents a systematic investigation concerning the performance of different rating scales used in the English section of a university entrance examination to assess 1,287 Japanese test takers’ ability to write a third-person introduction speech. Although the rating scales did not conform to all of the expectations of the Rasch model, they successfully defined a meaningful continuum of English communicative competence. In some cases, the expectations of the Rasch model needed to be weighed against the specific assessment needs of the university entrance examination. This investigation also found that the degree of compatibility between the number of points allotted to the different rating scales and the various requirements of an introduction speech played a considerable role in determining the extent to which the different rating scales conformed to the expectations of the Rasch model. Compatibility thus becomes an important factor to consider for optimal rating scale performance.

****

Developing a Domain Theory Defining and Exemplifying a Learning Theory of Progressive Attainments

C. Victor Bunderson

Abstract

This article defines the concept of Domain Theory or, when educational measurement is the goal, what one might call a “Learning Theory of Progressive Attainments in X Domain.” The concept of Domain Theory is first shown to be rooted in validity theory; it is then expanded to amplify its necessary but long-neglected connection to design research disciplines. The development of a local learning theory of progressive attainments in the domain of Fluent Oral Reading is presented as an illustration. Such a theory is local to a defined domain of application, having well-delineated boundaries. It depends on measures having a deep and valid connection to constructs, and on the constructs connecting back to the items or tasks at pertinent levels of the measurement scale. Thus instrument development and theory development, which occur in tandem, depend on establishing construct validity in a deep and thoroughgoing manner.

****

Bringing Human, Social, and Natural Capital to Life: Practical Consequences and Opportunities

William P. Fisher, Jr.

Abstract

Capital is defined mathematically as the abstract meaning brought to life in the two phases of the development of “transferable representations,” which are the legal, financial, and scientific instruments we take for granted in almost every aspect of our daily routines. The first, conceptual and gestational, and the second, parturitional and maturational, phases in the creation and development of capital are contrasted. Human, social, and natural forms of capital should be brought to life with at least the same amounts of energy and efficiency as have been invested in manufactured and liquid capital, and property. A mathematical law of living capital is stated. Two examples of well-measured human capital are offered. The paper concludes with suggestions for the ways that future research might best capitalize on the mathematical definition of capital.

****

Understanding Rasch Measurement: Distractors with Information in Multiple Choice Items: A Rationale Based on the Rasch Model

David Andrich and Irene Styles

Abstract

There is a substantial literature on attempts to obtain information on the proficiency of respondents from distractors in multiple choice items. Information in a distractor implies that a person who chooses that distractor has greater proficiency than if the person chose another distractor with no information. A further implication is that the distractor deserves partial credit. However, it immediately follows from the Rasch model that if a distractor deserves partial credit, then the response to that distractor and other distractors should not be pooled into a single response with a single probability of an incorrect response. Using the partial credit parameterization of the polytomous Rasch model, the paper shows how an hypothesis can be formed, and tested, regarding information in a distractor. The hypothesis is formed by studying the shape of the distractor response curves across the continuum, and the hypothesis is tested by scoring the correct response 2, the hypothesized distractor 1, and other distractors 0, and then applying the polytomous Rasch model. Multiple pieces of evidence, including fit of the responses at the two thresholds and the order of the two threshold estimates, are used in deciding if a distractor has information. An example illustrating the theory and its application is provided.
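The 2/1/0 rescoring the authors describe is mechanical and easy to sketch. The option letters below are hypothetical, not from any item in the paper:

```python
def rescore(responses, key, informative):
    """Score a multiple-choice item 2/1/0: 2 for the keyed answer,
    1 for the hypothesized informative distractor, 0 for any other
    distractor, so the item can be fit with the polytomous Rasch model."""
    scores = []
    for r in responses:
        if r == key:
            scores.append(2)
        elif r == informative:
            scores.append(1)
        else:
            scores.append(0)
    return scores

# Option 'C' keyed correct; distractor 'B' hypothesized to carry information
print(rescore(list("ABCBD"), key="C", informative="B"))  # [0, 1, 2, 1, 0]
```

The paper's test of the hypothesis then rests on the fit at the two thresholds of the rescored item and on whether the two threshold estimates are correctly ordered.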

****

 

Vol. 12, No. 2 Summer 2011

A Comparison between Robust z and 0.3-Logit Difference Procedures in Assessing Stability of Linking Items for the Rasch Model

Huynh Huynh and Anita Rawls

Abstract

There are at least two procedures for assessing item difficulty stability in the Rasch model: the robust z procedure and the “.3 Logit Difference” procedure. The robust z procedure is a variation of the z statistic that reduces dependency on outliers. The “.3 Logit Difference” procedure is based on experiences in Rasch linking for tests developed by Harcourt. Both methods were applied to archival data from two large-scale South Carolina assessment programs: HSEE 1986/1987 and PACT 2004/2005. The results showed that the “.3 Logit Difference” procedure identified slightly more items as stable, a difference of 2.6% of all items under study. In addition, approximately 93% of all items under consideration were identically classified as stable or unstable by both procedures. This very high level of agreement indicates that either procedure can be safely used to identify stable items for use in a common-item linking design. The advantage of the robust z procedure lies in its foundation of robust statistical inference: it takes into account well-accepted models for identifying outliers and permits critical values to be set at a specified Type I error rate.
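The paper's exact computations are not reproduced in this abstract, but a common formulation of the robust z statistic centers the between-administration difficulty shifts on their median and scales them by 0.74 times the interquartile range (an outlier-resistant stand-in for the standard deviation). A sketch under that assumption, with both flagging rules side by side:

```python
import statistics

def robust_z(diffs):
    """Robust z for item-difficulty shifts between administrations:
    (d - median) / (0.74 * IQR), so outlying items do not distort
    the center or the spread used for standardization."""
    med = statistics.median(diffs)
    qs = statistics.quantiles(diffs, n=4)   # three quartile cut points
    iqr = qs[2] - qs[0]
    return [(d - med) / (0.74 * iqr) for d in diffs]

def flag_unstable(diffs, z_crit=1.96, logit_crit=0.3):
    """Flag each item as unstable under the robust z rule and
    under the .3-logit-difference rule."""
    zs = robust_z(diffs)
    return [{"robust_z": abs(z) > z_crit, "logit": abs(d) > logit_crit}
            for d, z in zip(diffs, zs)]
```

An item whose difficulty drifts by 0.9 logits would be flagged by both rules here, while small shifts pass both, mirroring the high agreement the study reports.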

****

Assessment of English Language Development: A Validity Study of a District Initiative

Juan D. Sanchez

Abstract

The San Francisco Unified School District (SFUSD) uses the Language and Literacy Assessment Rubric (LALAR) as the secondary measurement required by the No Child Left Behind (NCLB) Act to measure the English proficiency of English language learners (ELLs). In this analysis, the Rasch model is used to identify whether the LALAR is a valid measurement instrument and scale for measuring the “English proficiency” of ELLs. The analysis investigates the relationship between student ability (θ) and the probability that the student will respond correctly to an item on the LALAR. Controlling for this relationship, the characteristics of each item, the ability of each student, and the measurement error associated with each score were mathematically derived. This allows validity and reliability tests to be conducted, which help determine whether the LALAR is a useful accountability measure for ELLs.
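The ability-probability relationship the analysis rests on is the dichotomous Rasch model, in which the log-odds of a correct response equal the difference between the person's ability and the item's difficulty. A one-function sketch:

```python
import math

def p_correct(theta, b):
    """Dichotomous Rasch model: probability of a correct response
    for person ability theta and item difficulty b, both in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

When ability equals difficulty the probability is exactly 0.5, and it rises monotonically as ability exceeds difficulty.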

****

Equating of Multi-Facet Tests Across Administrations

Mary Lunz and Surintorn Suanthong

Abstract

The desirability of test equating to maintain the same criterion standard from test administration to test administration has long been accepted for multiple choice tests. The same consistency of expectations is desirable for performance tests, especially if they are part of a licensure or certification process or used for other high-stakes decisions (e.g., graduation). Performance tests typically have three or more facets (e.g., examinees, raters, items, and tasks), all of which must be accounted for in the test-equating process. The application of the multi-facet Rasch model (Linacre, 2003a) is essential for equating performance tests because it provides calibrations of the elements of each facet. It also accounts for the differences in the tests taken by each examinee within a test administration. When multi-facet tests are equated across administrations, differences between the benchmark scale and the current test must be accounted for in each facet. Examinee measures are then adjusted for the differences between tests. The examples presented in this article were selected because of their difference in size and complexity of test design. Because they are different, they demonstrate how the same principles of common element test equating can be used regardless of the number of facets included in the test. Performance tests with more than two facets can be equated, as long as appropriate quality control methods are employed. First, use carefully selected common elements for each facet that represent the content and properties of the test. The common elements should be unaltered from their original use. Then, the most effective method is to initially anchor all common elements in each facet, and iteratively unanchor those elements that do not meet the criteria for displacement and fit. Strict criteria for displacement must be used consistently among facets. The suggested criterion for displacement is equal to or less than 0.5 logits. Unanchoring inconsistent and/or misfitting facet elements will improve the quality of the test equating.
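The anchor-pruning step described above can be sketched as a loop over displacement values. In a real equating run the displacements would be re-estimated after each unanchoring; this simplified, hypothetical version treats them as fixed:

```python
def prune_anchors(anchors, displacements, max_disp=0.5):
    """Iteratively drop the worst-displaced common element until every
    remaining anchor displaces by at most max_disp logits (the 0.5-logit
    criterion suggested in the article). displacements maps each
    element to its signed displacement, in logits, from the benchmark."""
    kept = set(anchors)
    while kept:
        worst = max(kept, key=lambda e: abs(displacements[e]))
        if abs(displacements[worst]) <= max_disp:
            break  # all remaining anchors satisfy the criterion
        kept.discard(worst)
    return kept
```

For example, anchors displacing by 0.1, 0.7, -0.2, and 0.45 logits would lose only the 0.7-logit element.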

****

Examining Student Rating of Teaching Effectiveness using FACETS

Nuraihan Mat Daud and Noor Lide Abu Kassim

Abstract

Students’ evaluations of teaching staff can be considered high stakes, as they are often used to determine promotion, reappointment, and merit pay for academics. The Facets program was used to analyse the reliability and validity of one student rating questionnaire. A total of 13,940 respondents from the Human Science Division of International Islamic University Malaysia were involved in the study. The analysis shows that the student rating questionnaire was valid and reliable, and that it allows identification both of staff deserving the institution’s prestigious teaching excellence awards and of those needing in-service training. No significant differences were found in relation to staff rank, age, or gender. The study also shows that the majority of staff had problems keeping the class interested and getting students to participate in class activities, and that faculty rarely discussed common errors in assignments and tests.

****

Exploring Differential Item Functioning (DIF) with the Rasch Model: A Comparison of Gender Differences on Eighth Grade Science Items in the United States and Spain

Tasha Calvert Babiar

Abstract

Traditionally, women and minorities have not been fully represented in science and engineering. Numerous studies have attributed these differences to gaps in science achievement as measured by various standardized tests. Rather than describe mean group differences in science achievement across multiple cultures, this study focused on an in-depth item-level analysis across two countries: Spain and the United States. This study investigated eighth-grade gender differences on science items across the two countries. A secondary purpose of the study was to explore the nature of gender differences using the many-faceted Rasch model as a way to estimate gender DIF. A secondary analysis of data from the Third International Mathematics and Science Study (TIMSS) was used to address three questions: 1) Does gender DIF in science achievement exist? 2) Is there a relationship between gender DIF and characteristics of the science items? 3) Do the relationships between item characteristics and gender DIF in science items replicate across countries? Participants included 7,087 eighth-grade students from the United States and 3,855 students from Spain who participated in TIMSS. The Facets program (Linacre and Wright, 1992) was used to estimate gender DIF. The results of the analysis indicate that the content of an item seemed to be related to gender DIF. The analysis also suggests that there is a relationship between gender DIF and item format. No pattern of gender DIF related to cognitive demand was found. The general pattern of gender DIF was similar across the two countries used in the analysis. The strength of item-level analysis, as opposed to group mean difference analysis, is that gender differences can be detected at the item level even when no mean differences can be detected at the group level.

****

Understanding Rasch Measurement: A Mapmark Method of Standard Setting as Implemented for the National Assessment Governing Board

E. Matthew Schulz and Howard C. Mitzel

Abstract

This article describes a Mapmark standard setting procedure, developed under contract with the National Assessment Governing Board (NAGB). The procedure enhances the bookmark method with spatially representative item maps, holistic feedback, and an emphasis on independent judgment. A rationale for these enhancements, and for the bookmark method, is presented, followed by a detailed description of the materials and procedures used in a meeting to set standards for the 2005 National Assessment of Educational Progress (NAEP) in Grade 12 mathematics. The use of difficulty-ordered content domains to provide holistic feedback is a particularly novel feature of the method. Process evaluation results comparing Mapmark to the Angoff-based methods previously used for NAEP standard setting are also presented.

****

 

Vol. 12, No. 3 Fall 2011

Diagnosing a Common Rater Halo Effect in the Polytomous Rasch Model

Ida Marais and David Andrich

Abstract

The ‘halo effect’ may be unique to different raters or common to all raters. When common to all raters, halo is not detectable through standard fit indices of the three-facet Rasch model used to account for differences in rater severities. Using a formulation of halo as a violation of local independence, a halo effect common to all raters is simulated and shown to be diagnosable through contrasts between two-facet stack and rack Rasch analyses. In the former, the thresholds are clustered and the distribution of persons is multimodal; in the latter, all thresholds are close together and the distribution of persons is unimodal. In the former, the scale is stretched, and the person separation inflated, relative to the latter.
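The stack and rack layouts being contrasted are simply two arrangements of the same repeated-measures responses; a sketch with invented rating values:

```python
def stack(time1, time2):
    """Stacked layout: time-2 records appended as new rows, so each
    person appears twice and each item keeps one set of thresholds."""
    return time1 + time2

def rack(time1, time2):
    """Racked layout: time-2 responses appended as new columns, so each
    person appears once and every item appears twice (pre and post)."""
    return [r1 + r2 for r1, r2 in zip(time1, time2)]

pre  = [[2, 1], [3, 2]]   # 2 persons x 2 items at time 1 (invented ratings)
post = [[3, 2], [3, 3]]   # the same persons and items at time 2

stacked = stack(pre, post)   # 4 rows x 2 columns
racked  = rack(pre, post)    # 2 rows x 4 columns
```

The paper's diagnostic contrasts the threshold estimates and person distributions obtained from two-facet analyses of these two matrices.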

****

A Comparison of Structural Equation and Multidimensional Rasch Modeling Approaches to Confirmatory Factor Analysis

Edward W. Wolfe and Kusum Singh

Abstract

This paper compares the results of applications of the Multidimensional Random Coefficients Multinomial Logit Model (MRCMLM) to comparable Structural Equation Model (SEM) applications for the purpose of conducting a Confirmatory Factor Analysis (CFA). We review SEM as it is applied to CFA, identify some parallels between the MRCMLM approach to CFA and that utilized in a standard SEM CFA, and illustrate the comparability of MRCMLM and SEM CFA results for three datasets. Results indicate that the two approaches tend to identify similar dimensional models as exhibiting best fit and provide comparable depictions of latent variable correlations, but the two procedures depict the reliability of measures differently.

****

The Rainbow Families Scale (RFS): A Measure of Experiences Among Individuals with Lesbian and Gay Parents

David J. Lick, Karen M. Schmidt, and Charlotte J. Patterson

Abstract

According to two decades of research, parental sexual orientation does not affect overall child development. Researchers have not found significant differences between offspring of heterosexual parents and those of lesbian and gay parents in terms of their cognitive, psychological, or emotional adjustment. Still, there are gaps in the literature regarding social experiences specific to offspring of lesbian and gay parents. This study’s objective was to construct a measure of those experiences. The Rainbow Families Scale (RFS) was created on the basis of focus group discussions (N = 9 participants), then piloted (N = 24) and retested with a new sample (N = 91) to examine its psychometric properties. Exploratory factor analyses uncovered secondary dimensions, and Rasch analytic procedures examined item fit, reliability, and category usage. Misfitting items were eliminated where necessary, yielding a psychometrically sound measurement tool to aid in the study of individuals with lesbian and gay parents.

****

Development of an Instrument for Measuring Self-Efficacy in Cell Biology

Suzanne Reeve, Elizabeth Kitchen, Richard R. Sudweeks, John D. Bell, and William S. Bradshaw

Abstract

This article describes the development of a ten-item scale to assess biology majors’ self-efficacy towards the critical thinking and data analysis skills taught in an upper-division cell biology course. The original seven-item scale was expanded to include three additional items based on the results of item analysis. Evidence of reliability and validity was collected and reported for the revised scale. In addition, the effect of varying the number of response categories presented with the items was empirically examined by administering different versions of the instrument containing 6, 11, 21, and 101 response categories to randomly selected samples of students in the course. Rasch scaling procedures were used to analyze the results. Contrary to Bandura’s recommendation for using the 101-point scale (0-100), the results indicated that most respondents used only a subset of the options in the 101-point scale and that the 6-point and 11-point scales produced less threshold disordering for the purpose of assessing changes in students’ self-efficacy in the context of a one-semester course.

****

Measuring Schools’ Efforts to Partner with Parents of Children Served Under IDEA: Scaling and Standard Setting for Accountability Reporting

Batya Elbaum, William P. Fisher, Jr., and W. Alan Coulter

Abstract

Indicator 8 of the State Performance Plan (SPP), developed under the 2004 reauthorization of the Individuals with Disabilities Education Act (IDEA 2004, Public Law 108-446), requires states to collect data and report findings related to schools’ facilitation of parent involvement. The Schools’ Efforts to Partner with Parents Scale (SEPPS) was developed to provide states with a means to address this new reporting requirement. Items suggested by stakeholder groups were piloted with a nationally representative sample of 2,634 parents of students with disabilities ages 5-21 in six states. Rasch scaling was used to calibrate a meaningful and invariant item hierarchy. The 78 calibrated items had measurement reliabilities ranging from .94 to .97. Using data from the pilot study, stakeholders established a recommended performance standard set at a meaningful point in the item hierarchy. Implications of the findings are discussed in relation to the need for rigorous metrics within state accountability systems.

****

An ADL Measure for Spinal Cord Injury

Anne Bryden and Nikolaus Bezruczko

Abstract

Occupational therapists do not have a comprehensive, objective method for measuring how persons with tetraplegia perform activities of daily living (ADL) in their homes and communities, because SCI ADL performance is usually assessed in rehabilitation settings. The ADL Habits Survey (ADLHS) is designed specifically to address this knowledge gap by surveying performance on relevant and meaningful activities in homes and communities. After a comprehensive task analysis and pilot development, 30 activities were selected that emphasize a broad range of hand and wrist, reaching, and grasping movements in compound activities. A sample of 49 persons with cervical spinal cord injuries responded to items. The sample was predominantly male, median age was 41 years, and ASIA motor classification levels ranged from C2 through C8/T1, with the majority concentrated in C4, C5, or C6 (68%). Each participant report was rated by an occupational therapist using a seven-category rating scale, and the item by participant response matrix (30 × 49) was analyzed with a Rasch model for rating scales. Results showed excellent participant separation (>4) and very high reliability (>.95), and both item and participant fit values were adequate (standardized infit within ±3 SD units). With only two exceptions, all participants fit the Rasch rating scale model, and only one item, “Light housekeeping,” presented significant fit issues. Principal components analysis of item residuals did not reveal serious threats to unidimensionality. A between-group fit comparison of participants with more versus less movement found invariant item calibrations, and ANOVA of participant measures found statistically significant differences across ASIA motor classification levels. These ADLHS results offer occupational therapists a new method for measuring ADL that is potentially more sensitive to functional changes in tetraplegia than most instruments in common use. Accommodation of step disorder with a three-category rating scale did not diminish measurement properties.
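The separation and reliability figures reported here are linked by a standard identity, R = G²/(1 + G²), where G is the person separation index. A one-line sketch:

```python
def separation_to_reliability(g):
    """Convert Rasch person separation G to separation reliability:
    R = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)
```

A separation of 4 corresponds to a reliability of about .94, and separations a little above 4.4 push reliability past .95, consistent with the values the abstract reports.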

****

Understanding Rasch Measurement: Selecting Cut Scores with a Composite of Item Types: The Construct Mapping Procedure

Karen Draney and Mark Wilson

Abstract

In this paper, we describe a new method we have developed for setting cut scores between levels of a test. We outline the wide variety of potential methods that have been used for such a process, and emphasize the need for a coherent conceptual framework under which the variety of methods could be understood. We then describe our particular method, based on an item response modeling framework, which uses the Wright Map, a graphical model of item and threshold difficulties, and a piece of computer software that provides probabilities of various responses for scores under consideration as cut scores. Finally, we describe a study we conducted for the Golden State Examination in Chemistry, in which we investigate the classification agreement for two groups using the method, and also investigate the reactions of the committee members to the procedure and the software, and the lessons we learned from this process.

****

 

Vol. 12, No. 4 Winter 2011

Reducing the Item Number to Obtain Same-Length Self-Assessment Scales: A Systematic Approach using Results of Graphical Loglinear Rasch Modeling

Tine Nielsen and Svend Kreiner

Abstract

The Revised Danish Learning Styles Inventory (R-D-LSI) (Nielsen, 2005), which is an adaptation of the Sternberg-Wagner Thinking Styles Inventory (Sternberg, 1997), comprises 14 subscales, each measuring a separate learning style. Of these 14 subscales, 9 are eight items long and 5 are seven items long. For self-assessment, self-scoring, and self-interpretation purposes it is deemed prudent that subscales measuring comparable constructs be of the same item length. Consequently, in order to obtain a self-assessment version of the R-D-LSI with an equal number of items in each subscale, a systematic approach to item reduction based on results of graphical loglinear Rasch modeling (GLLRM) was designed. This approach was then used to reduce the number of items in those subscales of the R-D-LSI that were more than seven items long, thereby obtaining the Danish Self-Assessment Learning Styles Inventory (D-SA-LSI), comprising 14 subscales each with an item length of seven. The systematic approach to item reduction based on results of GLLRM is presented and exemplified by its application to the R-D-LSI.

****

Using Rasch Modeling to Measure Acculturation in Youth

Melinda F. Davis, Mary Adam, Scott Carvajal, Lee Sechrest, and Valerie F. Reyna

Abstract

Ethnic differences in health outcomes are assumed to reflect levels of acculturation, among other factors. Health surveys frequently include language and social interaction items taken from existing acculturation instruments. This study evaluated the dimensionality of responses to typical bilinear items in Latino youth using Rasch modeling. Two seven-item scales measuring Anglo-Hispanic orientation were adapted from Marín and Gamba (1996) and Cuéllar, Arnold, and Maldonado (1995). Most of the items fit the Rasch model. However, there were gaps in both the Hispanic and Anglo scales. The Anglo items were not well targeted for the sample because most students reported that they always spoke English. The lack of variability found in a heterogeneous sample of Latino youth has negative implications for the common practice of relying on language as a measure of acculturation. Acculturation instruments for youth probably need more sensitive items to discriminate linguistic differences, or to measure other factors.

****

Measurement of Mothers’ Confidence to Care for Children Assisted with Tracheostomy Technology in Family Homes

Nikolaus Bezruczko, Shu-Pi C. Chen, Constance D. Hill, and Joyce M. Chesniak

Abstract

The purpose of this research was to develop an objective, linear measure of mothers’ confidence to care for children assisted with tracheostomy medical technology in their homes. Caregiver confidence is addressed in this research for three technologies, namely, a) tracheostomy, b) tracheostomy and ventilator, and c) BiPAP/CPAP, although detailed measurement results are only reported for tracheostomy and its co-calibration with tracheostomy-and-ventilator caregiving items. The sample consisted of 53 mothers responding to several caregiver questionnaires based on a caregiving task matrix after content and clinical validation. A major challenge was integrating this construct with overarching principles already established by Functional Caregiving, a multi-level humanistic caregiving model for children with intellectual disabilities. Empirical analyses included principal components analysis, followed by linear transformation of Tracheostomy item ratings to an objective, equal-interval scale with a Rasch model. Results show caregiver separation on the Tracheostomy caregiving scale was 2.66 and reliability was .88. In general, co-calibration improved measurement properties without affecting mothers’ caregiving confidence measures. Although the sample size was small, measuring mothers’ confidence to care for a child supported by complex medical technologies appears very promising.

****

Comparability of Item Quality Indices from Sparse Data Matrices with Random and Non-Random Missing Data Patterns

Edward W. Wolfe and Michael T. McGill

Abstract

This article summarizes a simulation study of the performance of five item quality indicators (the weighted and unweighted versions of the mean square and standardized mean square fit indices and the point-measure correlation) under conditions of relatively high and low amounts of missing data under both random and conditional patterns of missing data for testing contexts such as those encountered in operational administrations of a computerized adaptive certification or licensure examination. The results suggest that weighted fit indices, particularly the standardized mean square index, and the point-measure correlation provide the most consistent information between random and conditional missing data patterns and that these indices perform more comparably for items near the passing score than for items with extreme difficulty values.

****

The Influence of Labels Associated with Anchor Points of Likert-type Response Scales in Survey Questionnaires

Jean-Guy Blais and Julie Grondin

Abstract

Survey questionnaires are among the most widely used data-gathering techniques in the social science researcher’s toolbox, and many factors can influence respondents’ answers to items and affect data validity. Among these factors, research has accumulated demonstrating that the verbal and numeric labels associated with an item’s response categories can substantially influence the way respondents make their choices within the proposed response format. In line with these findings, the focus of this article is to use Andrich’s rating scale model to illustrate the influence that the quantifier adverb “totally,” used to label or emphasize extreme categories, can have on respondents’ answers.

****

Analysis of Letter Name Knowledge using Rasch Measurement

Ryan P. Bowles, Lori E. Skibbe, and Laura M. Justice

Abstract

Letter name knowledge (LNK) is a key predictor of later reading ability and has been emphasized strongly in recent educational policy. Studies of LNK have implicitly treated it as a unidimensional construct with all letters equally relevant to its measurement. However, some empirical research suggests that contextual factors can affect the measurement of LNK. In this study, we analyze responses from 909 children on measures of LNK using the Rasch model and its extensions, and consider two contextual factors: the format of assessment and the own-name advantage, which states that children are more likely to know letters in their own first names. Results indicate that both contextual factors have important impacts on measurement and that LNK does not meet the requirements of Rasch measurement even when accounting for the contextual factors. These findings introduce philosophical concerns for measurement of constrained skills which have limited content for assessment.

****

Understanding Rasch Measurement: Converging on the Tipping Point: A Diagnostic Methodology for Standard Setting

John A. Stahl and Kirk A. Becker

Abstract

This article discusses the strengths and weaknesses of the Angoff and Bookmark standard setting procedures. An alternative approach that focuses on the strengths of these procedures and adds three diagnostic indices is presented. This alternative approach is applied to three standard setting data sets and the results are discussed.

 
