#DOLOVE

We all strive to make #WorkLifeBalance part of our daily practice, but is it? With four years of undergrad, four years of med school, and a whole lot more years of residency, hours spent studying are what fill all the cracks between classes, clinical rotations, and the like. Studying becomes your personal mantra—your bread-and-butter—your life-force. You may feel duty-bound, mired by the responsibilities and expectations tied to your goal to become a DO. And as these responsibilities ramp up, there is the very real potential that the time and energy you once had for your passions and hobbies may be at an all-time low.
The brain-sweat, tears, and hard work that are vital on your journey to DO licensure do not need to evoke feelings of guilt for not studying every hour of the day. So let’s tackle study burnout and help you find ways to carve out time to keep doing what you love. Here’s how:
Schedule it
Perhaps most of your life is meticulously mapped on a calendar, moment by moment—classes, clinical rotations, crazy amounts of studying, eating, sleeping, laundry, brushing your teeth, more studying. But your hobbies and passions are just as important to your well-being and your mental headspace, and they deserve dedicated time on your calendar too. Yes, yes, we know, there are only so many hours in a day, but who says your passions need to take hours? Take a 10-minute run (instead of your normal 10-mile jog), cook a 15-minute meal (instead of a multi-course masterpiece), read one chapter of a fiction book (instead of the whole thing), doodle for 10 minutes…you get the idea. Your hobbies and passions, just in smaller doses. Clear that brilliant brain of yours, nourish yourself with things that bring you joy, and then get back to the books!
Stop canceling
You’ve made the appointment, but are you going to show up? This is the tough part, because we all know how easy it is to cancel, especially when you are too tired, too busy, or feeling too guilty (because you think you should be studying even more than you are). Showing up for yourself amidst COMLEX-USA exam prep might seem like a joke, but doing what you love is a big part of recharging and fueling what comes next. Be confident in the schedule you create and the choices you make, especially when they involve taking care of yourself. If you don’t do it, no one else is going to do it for you.
Mix it up
It’s easy to shelve the hobbies you’ve had for a long time. While they still bring you joy, they don’t give you the same butterflies-in-your-stomach feeling they did years ago. Maybe it’s time to try something new: rock climbing, origami, ballroom dancing. Step out of your comfort zone and make use of an entirely different part of your brain or body. Getting out of your study-cave and into something new might also help ignite some of those old passions, clear your mind, and help energize you so you can get back to the studying. Plus, adding to your arsenal of interests may eventually help you connect and empathize better with your patients as well.
When studying starts to bleed into the things you love, take a step back and remind yourself that you’re a person before you are a physician—and even more important: you’re a person even after you become one. During this month of passion and #DOLove, encourage yourself, empower yourself, and break through your schedule to keep doing the things that bring you joy and keep you, you.
COMLEX-USA Item Reduction
The NBOME Board of Directors approved plans to shorten COMLEX-USA Level 2-CE from 400 to 352 test items, beginning June 2020. The reduction affects only pretest content, so exam validity, content coverage, and reliability will not be impacted. The change should also reduce some of the stress associated with the time-pressured environment. Available test time will remain at 8 hours. Similar changes have been approved for COMLEX-USA Level 1 beginning May 2021.
Pretest questions are embedded in COMLEX-USA exams but do not count toward candidates’ scores. They provide useful information on the quality and relative difficulty of questions to ensure fairness for candidates, help generate item statistics (for quality control as well as equating purposes), and allow novel item formats to be field-tested. The total number of items per block will decrease from 50 to 44.
Further Review of COMLEX-USA Exam Scoring and Score-Reporting
NBOME has continued to study the uses of COMLEX-USA scores and score-reporting as they relate to the primary and intended purpose of the examinations (i.e., licensure), as well as secondary uses (the most frequently cited being residency program applications). Our Board of Directors has approved the continued use of pass-fail-only reporting for the COMLEX-USA Level 2-PE clinical skills exam, and recommended further study of pass-fail as well as numerical scoring for COMLEX-USA Level 1, Level 2-CE and Level 3. Further updates will be provided as early as July 2020.
Modification to COMLEX-USA Level 1 and Level 2-PE Test Cycles in 2020
In response to feedback from candidates and deans, NBOME has modified the 2020-2021 test cycle for COMLEX-USA Level 1. It now commences three weeks earlier than in prior years, running from May 5, 2020 through April 2021.
We have also adjusted the COMLEX-USA Level 2-PE test cycle in response to increased demand during the spring months. The 2020 test cycle will now end in early November 2020, with a new complete test cycle beginning November 30, 2020. This change should provide additional testing opportunities in times of higher demand, thus helping candidates and schools to better facilitate the residency program application process.
Prometric Test Center and COMLEX-USA Enhancements for New Test Cycles in 2020
To assure an optimal computer-based testing experience at Prometric Test Centers, modifications continue to be made for COMLEX-USA examinations. Effective with the new May 2020 test cycle for COMLEX-USA Level 1, and the new June 2020 test cycle for Level 2-CE, the Prometric test driver used for all COMLEX-USA examinations has been updated to that already being used in COMLEX-USA Level 3. We endeavor to provide an optimal testing experience for all COMLEX-USA candidates and feel confident that these changes will further enhance the COMLEX-USA program.
NBOME to Modify Attempt Limits for COMLEX-USA Effective July 1, 2022
The NBOME Board of Directors approved changes to eligibility for COMLEX-USA to limit the maximum number of attempts to 4 total per exam, effective July 1, 2022. This change is intended to minimize misclassification, enhance test security/integrity, and reinforce NBOME’s mission to protect the public. Exceptions petitioned by a state medical or osteopathic medical licensing board will be evaluated on a case-by-case basis. Further information will be outlined in the COMLEX-USA Bulletin of Information, planned for release in July 2020.
For more information, please contact NBOME Client Services at clientservices@nbome.org or 866.479.6828.
The NBOME is pleased to recognize the 2019 Item Writer and Case Author of the Year award winners from its distinguished National Faculty. Throughout the year, this group of individuals graciously volunteered their time and expertise to contribute to the COMLEX-USA and COMAT exam programs. In addition to their professional roles, these volunteers wear a variety of hats – writing and reviewing test items, serving as physician examiners for COMLEX-USA Level 2-PE, and supporting the NBOME mission to protect the public through competency assessment.
Each year, the NBOME Board selects the best-in-class item writers and case authors from a large group of contributors. Congratulations to these esteemed awardees for their exemplary commitment to producing valid and high quality exam content.
2019 COMLEX-USA Level 1 Item Writer of the Year: Martin Schmidt, PhD
Dr. Schmidt is a professor of biochemistry at DMU-COM in Des Moines, Iowa, and a long-standing member of our National Faculty. He has contributed to the COMLEX-USA Level 1 and COMAT Foundational Biomedical Sciences examinations.
2019 COMLEX-USA Level 2-CE Item Writer of the Year: John Dougherty, DO
Dr. Dougherty is the Founding Dean and Chief Academic Officer at Noorda College of Osteopathic Medicine (proposed) in Provo, Utah. He has been a member of the National Faculty since 2016.
2019 COMLEX-USA Level 2-PE Case Author of the Year: Robyn Dreibelbis, DO

Dr. Dreibelbis is vice-chair and assistant professor of Family Medicine at WesternU/COMP – Northwest in Lebanon, Oregon. Dr. Dreibelbis has been selected for this award as a member of the Case Development Committee.
2019 COMLEX-USA Level 3 Item Writer of the Year: Binh Phung, DO, MHA

Dr. Phung is a clinical assistant professor of Pediatrics at OSU-COM in Tulsa, Oklahoma and a pediatric hospitalist at the Children’s Hospital at St. Francis. Dr. Phung has focused his talents on the COMLEX-USA Level 3 examination in both multiple-choice questions and clinical decision-making content.
2019 Clinical Decision-Making (CDM) Case Writer of the Year: Brett Stecker, DO
Dr. Stecker is an assistant professor at the Alpert Medical School at Brown University and a physician advisor at Steward Medical Group at Morton Hospital in South Easton, Massachusetts. Dr. Stecker is experienced in working with the COMLEX-USA Level 1, 2-CE and 3 examinations, and was previously awarded Item Writer of the Year for COMLEX-USA Level 2-CE in both 2016 and 2018.
2019 COMLEX-USA Osteopathic Principles and Practice (OPP) Item Writer of the Year: Lauren Noto Bell, DO

Dr. Noto Bell is an associate professor at PCOM in Philadelphia, Pennsylvania, and a long-standing member of the National Faculty. She has been involved with all levels of the COMLEX-USA examinations and the COMAT Clinical exams. She was awarded Item Writer of the Year for OPP in 2017.
2019 COMLEX-USA Preventative Medicine/Health Promotion (PMHP) Item Writer of the Year: Todd Coffey, PhD

Dr. Coffey is chair and associate professor in the department of research and biostatistics at ICOM in Meridian, Idaho. He joined the National Faculty in 2018 and contributes to all levels of the COMLEX-USA examinations.
2019 COMAT Clinical Item Writer of the Year: Jessica Rogers, DO
Dr. Rogers is an Obstetrician and Gynecologist at Coyle Institute Female Pelvic Medicine & Reconstructive Surgery in Pensacola, Florida. She joined the National Faculty in 2014 and has been a significant contributor to both the COMLEX-USA and COMAT examinations.
2019 COMAT Foundational Biomedical Sciences (FBS) Item Writer of the Year: Lori Redmond, PhD
Dr. Redmond is a professor of Neuroscience at PCOM in Suwanee, Georgia and has been a member of our National Faculty since 2017. She was recruited for and has been a strong contributor to the COMAT Foundational Biomedical Sciences examinations.

NBOME recently sat down with Sandra Waters, MEM, Vice President for Collaborative Assessment & Initiatives, to learn more about the upcoming release of CATALYST, a new longitudinal assessment platform that will initially house COMSAE Phase 2 content when it launches this spring.
NBOME: Your team is debuting a new product this spring—COMSAE Phase 2 on CATALYST. We’re already familiar with COMSAE, but what exactly is CATALYST?
Sandra Waters: CATALYST is a longitudinal assessment platform designed to enhance learning. So, it isn’t actually content, it’s a new mechanism to deliver content to users.
NBOME: How is longitudinal assessment different from the more traditional learning approaches we’re used to?
SW: The notion of assessing someone over time—that really is the key. In a traditional class, individuals learn about a subject for a period of time, and then they learn about another subject, and then another subject. Longitudinal assessment uses something called topic interleaving, which exposes an individual to ALL of those different components at one time. It creates these bursts of learning and knowledge acquisition.
If an individual is performing well in a certain area, they don’t need to be assessed nearly as frequently in that area. In areas where an individual isn’t performing well, CATALYST can increase the volume and frequency of content related to that trouble spot. The intent is to fine-tune the learning component, make it more targeted, and use that to increase knowledge and skills.
CATALYST was developed to combine learning with assessment. The assessments NBOME normally conducts are taken at a single point in time, whereas the CATALYST platform enables an individual to assess their knowledge and skills over an extended period of time. And it aids learning by providing users with immediate feedback while the material is still front-of-mind. Research has shown that this is a much more effective way for an individual to learn—as opposed to sitting down, taking a test, and never truly understanding what you got wrong—or why.
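To make the adaptive approach described above concrete, here is a minimal, hypothetical sketch (not NBOME’s actual algorithm) of how a platform might weight upcoming questions toward a learner’s weaker topics:

```python
import random

# Hypothetical illustration only: weight the next question's topic toward areas
# where recent performance has been weaker, so trouble spots are sampled more often.
def next_topic(recent_accuracy):
    """recent_accuracy maps topic -> fraction of recent items answered correctly (0-1)."""
    # Topics with lower accuracy get proportionally higher weight; the 0.05 floor
    # keeps strong areas in occasional rotation rather than dropping them entirely.
    weights = {topic: max(1.0 - acc, 0.05) for topic, acc in recent_accuracy.items()}
    topics, w = zip(*weights.items())
    return random.choices(topics, weights=w, k=1)[0]

# Example: a learner strong in cardiology sees renal questions far more often.
performance = {"cardiology": 0.92, "renal": 0.55, "endocrine": 0.78}
print(next_topic(performance))
```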
NBOME: How did this all come about? What’s CATALYST’s origin story?
SW: When we first had the idea for CATALYST, we focused our efforts on designing the technology for board re-certification. The current approach involves a physician coming to a testing center every 6-10 years to take a closed book exam for 6-8 hours—not exactly the easiest feat when you’re running a full-time practice, seeing patients, and fulfilling all of the other responsibilities of a busy physician. CATALYST has the ability to change the whole playing field.
However, as we were developing and testing the platform, we continued to identify other ways to use it and other content we could put on there, including our own COMSAE content.
NBOME: Tell me more about why you decided to launch using COMSAE content.
SW: When we were developing CATALYST, we decided to pilot COMSAE Phase 2 on the platform. It just presented itself as an easy entry point for developing the added features that make CATALYST so special.
Each question includes a rationale explaining why the correct answer is correct and why the incorrect answers are not. CATALYST also provides references for further learning and understanding. It’s a self-contained way to test knowledge and skills while providing additional information.
Further, COMSAE on CATALYST is built for busy schedules and maximum flexibility. It’s designed to feed questions to users at self-selected intervals. For example, you could opt to receive 10 questions each week or 30 questions all on one day. As we discussed before, everyone learns a little differently, and we all have different needs and schedules. This platform helps speak to that.
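As a rough, hypothetical illustration of those self-selected intervals (again, not the platform’s actual implementation), the same question bank could be spread out or delivered all at once:

```python
from datetime import date, timedelta

# Hypothetical sketch: spread `total` questions across consecutive weeks,
# `per_week` at a time, starting from a chosen date.
def weekly_schedule(start, total, per_week):
    schedule, remaining, week = [], total, 0
    while remaining > 0:
        batch = min(per_week, remaining)
        schedule.append((start + timedelta(weeks=week), batch))
        remaining -= batch
        week += 1
    return schedule

print(weekly_schedule(date(2020, 2, 3), total=30, per_week=10))  # 10 questions each week
print(weekly_schedule(date(2020, 2, 3), total=30, per_week=30))  # all 30 on one day
```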
NBOME: With such strong focus on mobility and digitally nimble technology these days, what is the roll-out plan for COMSAE on CATALYST?
SW: COMSAE and other products offered on the CATALYST platform will be available on all devices, and also include a mobile app. Flexibility and convenience were extremely important to us as we developed the product.
NBOME: Who is eligible to purchase COMSAE Phase 2 on CATALYST and how does the system work? Can you walk me through the user experience?
SW: It is available to anyone who has an account with NBOME. Candidates may purchase the product through NBOME’s secure portal, at which point, they’ll be sent a welcome email along with login credentials to access the CATALYST platform. From here, they can customize the frequency of questions to suit their unique needs and learning goals. Based on those settings, they will begin to receive notifications when questions are available.
NBOME: If I were a student considering COMSAE on CATALYST, why would I want this over the traditional COMSAE format?
SW: I actually think you’d want both. The traditional COMSAE allows you to get game-day-ready in an environment that closely mimics COMLEX-USA. Questions are formatted to match in style, and it’s a timed administration—just like COMLEX-USA. You also receive a final report once you complete the assessment. COMSAE on CATALYST is much more of a learning tool. It focuses on mastery of the content. You receive question-by-question and performance-by-domain feedback, rather than just a final report.
Because they serve completely different purposes, I wouldn’t necessarily see one replacing the other.
NBOME: What future enhancements can we look forward to with the CATALYST platform and longitudinal learning?
SW: We’re working on plans to expand our content offerings on the platform. COMSAE Phase 1 is being considered as an option, as well as COMAT subjects. Stay tuned!
Osteopathic International Alliance (OIA) Conference

From October 4-6, 2019, NBOME Board Vice-Chair Geraldine O’Shea, DO, and President and CEO John R. Gimpel, DO, MEd, attended the annual Osteopathic International Alliance Conference in Bad Nauheim, Germany.
The OIA is in official relations status with the World Health Organization, and “envisions a world in which every person has access to high-quality osteopathic medicine.” Next year’s AGM and conference will be held in Rio de Janeiro, Brazil from September 30-October 2, 2020.
National Resident Matching Program® (NRMP®) Conference
From October 3-5, 2019, NBOME Associate Vice President for Strategy and Quality Initiatives, Melissa Turner, MS, attended the National Resident Matching Program® (NRMP®) Conference in Chicago. The NBOME provided attendees with a COMLEX-USA update.
This year’s stakeholder conference, titled “Transition to Residency: Conversations Across the Medical Education Continuum,” set a record with 300 registrants. Focusing on a variety of topics related to residency, speakers included Ezekiel Emanuel, MD, PhD, Helen Fisher, PhD, and Lawrence G. Smith, MD, MACP.
The COMLEX-USA Composite Examination Committee (CCEC) Meeting

The COMLEX-USA Composite Examination Committee (CCEC) met on October 14 and 15 in the Philadelphia Executive offices. This committee reviews all levels of the COMLEX-USA examination series, including statistics and candidate feedback, and provides a report to the NBOME Board of Directors. At this meeting, CCEC reviewed performance and innovations happening within the examination series, including the potential reduction of test items in Levels 1 and 2-CE, as well as the Level 2-PE team’s research into possible modifications to the Humanistic and Biomedical/Biomechanical Domains. CCEC convenes the Blueprint Subcommittee, which regularly reviews the COMLEX-USA Master Blueprint to assure it keeps up with the evolving practice of osteopathic medicine.
The committee also discussed hot topics related to licensure examinations, such as the possibility of switching to a pass/fail scoring system or keeping some form of numeric scoring. The CCEC is also reviewing the current limit on the maximum number of attempts per examination level. The Point-of-Care Knowledge, Education and Testing (POCKET) process was reviewed, and decisions were recommended regarding next steps for this process.
Osteopathic Medical Education Conference (OMED)

From October 25-28, 2019, the NBOME participated in the Osteopathic Medical Education Conference (OMED) in Baltimore, Maryland. The American Osteopathic Association’s (AOA) annual conference brought together thousands of osteopathic physicians, medical students, and other health professionals from across the country for medical education, inspiration, networking and entertainment.
NBOME exhibited at the conference, featuring our new COMAT Foundational Biomedical Sciences portfolio, as well as the COMLEX-USA examination series, the CATALYST platform, and opportunities for doctors to explore the NBOME National Faculty Program. Staff greeted attendees at the NBOME booth, had meaningful conversations with many visitors, and were on hand to answer questions from students, faculty, practicing physicians, and others.
On day one of the conference, the American Osteopathic Foundation hosted its annual Honors Gala, presenting awards to a number of NBOME National Faculty members; Richard Jermyn, DO, of RowanSOM received the AOF Educator of the Year award. In addition, in honor of NBOME’s 85th anniversary year, the NBOME made a contribution to the William Anderson, DO Minority Scholarship Fund.
Early Sunday morning, a team of NBOME runners and walkers came out to join in the Advocates for the American Osteopathic Association (AAOA) Fit for Life Run 2019. The run benefitted osteopathic student scholarships and the NBOME was a featured sponsor as well.
Association of American Medical Colleges (AAMC) Annual Conference

From October 8-12, NBOME sent John R. Gimpel, DO, MEd to attend the AAMC Learn, Serve, Lead 2019 Conference in Phoenix, AZ. The meeting covered many of the successes and challenges of academic medicine nationwide, and was attended by medical educators from across the country. Retired President and CEO of AACOM, Steve Shannon, DO, received an achievement award.
Council of Medical Specialty Societies (CMSS) & Organization of Program Directors Associations (OPDA) Annual Meeting
From November 21-22, 2019, NBOME Vice President for Collaborative Assessment & Initiatives Sandra Waters, MEM and Associate Vice President for Strategic & Quality Initiatives, Melissa Turner, MS, attended the Council of Medical Specialty Societies (CMSS) Annual Meeting and Specialty Forum in Arlington, VA.
Together, they presented NBOME and COMLEX-USA updates at The Organization of Program Directors Associations (OPDA) meeting. “OPDA is dedicated to promoting the role of the residency program director and program director societies in achieving excellence in graduate medical education.”
The NBOME Test Accommodations Committee (TAC) Meeting

On November 21-22, the NBOME Test Accommodations Committee met. The committee comprises osteopathic physicians and other subject matter experts who, in cooperation with NBOME staff, review applications for test accommodations from COMLEX-USA candidates.
The Committee discussed trends and developments in the test accommodations realm as they apply to high-stakes testing agencies like the NBOME.
Coming Up
In the next quarter, we’ll be making appearances at the following conferences and meetings:
PHILADELPHIA, PA. The National Board of Osteopathic Medical Examiners (NBOME), an independent, not-for-profit organization that provides competency assessments for osteopathic medical licensure and related health care professions, announced Lori Kemper, DO, MS, FACOFP, as its newest Secretary-Treasurer. At its December Board of Directors meeting, the NBOME elected Dr. Kemper to a two-year term.

“I’m thrilled to have even been considered, let alone chosen, as the NBOME’s next Secretary-Treasurer,” said NBOME board member Lori Kemper, DO, MS, FACOFP. “I look forward to working towards our mission in this new, exciting officer role.”
Dr. Lori Kemper’s more than 30-year career encompasses both independent practice and graduate medical education. Since 2007, she has been the dean of Midwestern University, Arizona College of Osteopathic Medicine, where she previously served as associate dean of graduate medical education and associate professor in the department of family medicine. She currently serves as a commissioner to the American Osteopathic Association (AOA) Commission on Osteopathic College Accreditation (COCA) and is the chair of the Board of Deans of the American Association of Colleges of Osteopathic Medicine (AACOM). Dr. Kemper currently serves the NBOME as a member of the Test Accommodations Committee and the Awards Committee.
“We’re ecstatic to announce Dr. Kemper’s election to Secretary-Treasurer of the NBOME,” said NBOME Board Chair Geraldine O’Shea, DO. “Her tenure on our Board of Directors has resulted in great strides for us as an organization, and we look forward to what she’ll help us accomplish moving forward.”
PHILADELPHIA, PA. The National Board of Osteopathic Medical Examiners (NBOME), an independent, not-for-profit organization that provides competency assessments for osteopathic medical licensure and related health care professions, today introduced Juan F. Acosta, DO, MS, as its newest board member. He was recommended to the NBOME Nominating Committee via the Assembly of Osteopathic Graduate Medical Educators (AOGME, formerly known as the Association of Osteopathic Directors and Medical Educators, AODME), filling the seat previously held by new NBOME Vice-Chair Richard J. LaBaere II, DO, MPH. He was elected at the annual NBOME Board Meeting in December.

“Joining the Board of Directors at the NBOME is an exciting opportunity for me,” said new NBOME board member Juan F. Acosta, DO, MS. “I’d like to express my gratitude to Dr. Gimpel and his colleagues on the NBOME Board for electing me to this position.”
Dr. Acosta recently moved to New York, where he serves as the Associate Medical Director for the Emergency Department at Saint Catherine of Siena Medical Center in Smithtown. He is also actively involved with the Disaster Medical Assistance Team (DMAT) and serves as a reviewer for the Journal of Emergency Medicine and a section editor for the West-JEM Journal. Dr. Acosta is an oral board examiner for the American Osteopathic Board of Emergency Medicine (AOBEM). He is presently the secretary for the American College of Osteopathic Emergency Physicians (ACOEP) and secretary for the Association of Osteopathic Directors and Medical Educators (AODME). Dr. Acosta also serves on the American Osteopathic Association’s Commission on Osteopathic College Accreditation (COCA) and the Committee on Continuing Medical Education (CCME).
“We are very fortunate to welcome Dr. Acosta to the Board,” said NBOME Board Chair Geri O’Shea, DO. “His enthusiasm, decorated professional career, and experience in graduate medical education add to our Board at an exciting time for the NBOME and the osteopathic medical profession.”
COMAT Product Updates
The COMAT exam series will expand in January 2020 to include the new Foundational Biomedical Sciences (FBS) Targeted exams. Each of the 14 exams focuses on a specific organ system or basic science discipline introduced to osteopathic medical students in years one and two. Click here to see the full list of available FBS Targeted subject exams.
Since the introduction of the FBS Comprehensive exam in December 2018, a total of 20 Colleges of Osteopathic Medicine have used or plan to use the FBS exams for pre-clerkship assessments to complement their use of the COMAT Clinical discipline exams for their end-of-rotation needs in years three and four.
The October issue of the Journal of Graduate Medical Education included research on the concurrent and predictive validity of the COMAT discipline exams and COMLEX-USA Level 2-CE. The findings indicated statistically significant, positive associations between COMAT and COMLEX-USA Level 2-CE scores, which can support the use of COMAT for osteopathic medical schools.
COMSAE Product Updates
After the first several score releases for COMLEX-USA Level 1 beginning mid-July 2019 and Level 2-CE beginning mid-August 2019, the NBOME conducted a comprehensive evaluation of scores on COMSAE Phase 1 and Phase 2 and subsequent scores on COMLEX-USA Level 1 and Level 2-CE. Following this evaluation, new score reports and scoring for both COMSAE Phase 1 and Phase 2 will be implemented on February 3, 2020. Please note that the COMSAE Phase 2 cut-score remains the same as in 2019.
In addition, the NBOME has conducted a concordance correlation study, which demonstrated a positive and significant correlation, around 0.70, between COMSAE Phase 1 and COMLEX-USA Level 1 and between COMSAE Phase 2 and COMLEX-USA Level 2-CE. This concordance finding is consistent with those of recent years for all forms purchased by COMs with timed administrations.
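For readers unfamiliar with how such a concordance figure is derived, the sketch below computes a Pearson correlation on made-up, illustrative score pairs; the data and names are hypothetical and do not reflect NBOME’s actual study:

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical COMSAE Phase 1 and subsequent COMLEX-USA Level 1 scores for the
# same fictional candidates, used only to illustrate the calculation; NBOME's
# study reported correlations around 0.70 using real candidate data.
comsae_phase1 = [420, 480, 510, 555, 600, 640, 700]
comlex_level1 = [500, 450, 560, 520, 640, 580, 660]
print(round(pearson(comsae_phase1, comlex_level1), 2))
```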
As always, caution should be exercised when using COMSAE scores to estimate subsequent COMLEX-USA scores or for uses other than those for which they were developed.
We will continue to communicate regularly with COMs regarding new information related to COMSAE scores and their relationship to COMLEX-USA scores, as well as other COMSAE program updates as they become available.

“When you show deep empathy toward others, their defensive energy goes down, and positive energy replaces it. That’s when you can get more creative in solving problems.” – Stephen Covey
Empathy has always been the root of human connection, and from it stems our capacity to help others—whether family, friend, or patient, it all comes down to the same core values. And yet, many question whether the humanistic domain belongs in the DO licensure exam. How important is it?
Having worked in osteopathic medical education and licensing for over 25 years, I am frequently posed the question: what makes DOs different? My answer is always the same: it’s about patient empathy. This isn’t to say that MDs don’t possess this trait; they do. However, there’s a heightened sense of empathy and patient understanding that seems to steer certain candidates toward osteopathic medicine.
The DO approach is based on the unique connection between mind, body, and spirit as it relates to patient care. It’s this holistic, 360-degree assessment and desire for deeper understanding that fuels empathy and a different shade of patient care. It also involves empathic inquiry: developing an understanding that goes beyond the problem at hand to the other life factors affecting the patient. As a doctor, understanding how these many dimensions interact and intersect on a deeper level is the basis of the DO approach.
To clarify, empathic doctors are not internalizing or ‘taking on’ a patient’s pain or discomfort in a therapeutic way. Rather, they’re attempting to understand the patient’s illness experience. A patient once told me, “I don’t need my doctor to love me, but I do need them to understand me.” That deeper level of understanding is what brings humanistic values back into the medical encounter, allowing for the establishment of a knowing relationship. Research has shown that empathy helps to build trust, is linked to better diagnoses, improves patient outcomes, and decreases malpractice lawsuits.
Now that we have a better understanding of the importance of empathy, how do we measure it in a clinical setting? The COMLEX-USA Level 2-Performance Evaluation has been assessing candidates’ interpersonal and communication skills and professionalism for the past 15 years, and empathy is one of six dimensions assessed. Based on evidence that patients place great value on their human connection to their doctor, there are several guiding principles that support the role empathy plays in patient care:
Empathy is connection.
Attend to the patient both verbally and non-verbally. Listen to them. Make eye contact. Actively respond to their condition or pain. Avoid giving patients the ‘clinical cold shoulder’ by focusing only on their symptoms.
Empathy is curiosity.
This is especially important for young doctors who don’t have a lot of patient experience. Learn from the patient. Explore their illness experience. Discuss their lifestyle, their belief system, their stress levels, what motivates them. Give patients the feeling of being understood.
Empathy is compassion.
This is the ability to imagine what a patient is experiencing without being overwhelmed by their pain or distress. Research has shown that people are selective when expressing empathy towards others. It’s hard to feel compassion, for example, when a patient is difficult, unlikeable, or struggling with unhealthy behaviors that put them at risk. But it’s these patients who are most deserving and in need of our compassion and understanding.
Empathy is not stress.
Stress is in opposition to empathy. It’s difficult to connect to a patient, or anyone for that matter, when we feel anxious and overwhelmed. Likewise, physicians who have difficulty managing their feelings towards patients are themselves at risk. Although the stress of working with patients is an unavoidable part of a physician’s work life, one goal of medical education should be to equip students with the skills to manage stress in healthy ways.
A generation ago, few were talking about the role of empathy in healthcare. But today, cognitive neuroscience has enabled us to look critically at what ignites and motivates our behaviors, including the empathic ones. This new knowledge and learning, coupled with a heightened focus on developing higher quality patient care, shines a bright light on the need and desire for greater empathic engagement. That said, empathy is not what makes good doctors; it’s what makes good doctors even better.
Demonstrating empathy is important to becoming a DO, and since 2004, passing this assessment has been required to obtain the DO degree, move into residency training, and obtain a license to practice osteopathic medicine.
Contributed by Tony Errichetti, PhD | Director of Doctor-Patient Assessment | NBOME
Philadelphia, PA — The National Board of Osteopathic Medical Examiners (NBOME), an independent, not-for-profit organization that provides competency assessments for osteopathic medical licensure and related health care professions, announced the installation of three officers to its board of directors.
In addition to the new officers, the NBOME recognized Dana C. Shaffer, DO, FACOFP, for his service as Board Chair over the past two years. NBOME President & CEO John R. Gimpel, DO, MEd, expressed his gratitude to Immediate Past Chair Dr. Shaffer.
The following individuals were elected to serve as officers for the NBOME’s Board of Directors:
Board Chair: Geraldine T. O’Shea, DO 
As the Chair of the NBOME Board, Dr. O’Shea will lead the NBOME’s 2020-2022 strategic plan and its vision to become the global leader in assessment for osteopathic medicine and related health care professions. Dr. O’Shea became a member of the NBOME Board in December 2009 and has served on the Awards Committee, COMLEX-USA Composite Examination Committee, Finance Committee, and the Marketing and Communications Task Force. She was installed as Vice-Chair in December 2017, previously served as Secretary-Treasurer from 2015-2017 and chaired the Finance Committee, and currently serves as a member of the Executive Committee, the Compensation Subcommittee, and the SAS for GME Outreach Task Force, and as Liaison Committee Chair.
Dr. O’Shea has practiced internal medicine at the Foothills Women’s Medical Center in Jackson, California, since 1998. A 1993 graduate of the Western University of Health Sciences College of Osteopathic Medicine of the Pacific, she completed her internal medicine residency at the Maricopa Medical Center in Phoenix, Arizona. Dr. O’Shea served as president of the Osteopathic Medical Board of California from 2006 to 2012, and in 2013 as president of the American Association of Osteopathic Examiners. She also previously served on the Federation of State Medical Boards (FSMB) Nominating Committee and has served on the FSMB’s Awards, Audit, and Finance Committees.
Dr. O’Shea is a trustee of the American Osteopathic Association (AOA) and serves as chair of the Strategic Planning Committee, the Bureau of Membership and the Membership Value Task Force. Before being appointed to the AOA Board of Trustees, Dr. O’Shea served the AOA in many capacities, including vice-chair of the Bureau on Federal Health Programs and vice-chair of the Council of Women’s Health Issues. As past president of the Osteopathic Physicians and Surgeons of California (OPSC), Dr. O’Shea was chair of the California delegation to the AOA’s House of Delegates between 2006 and 2014 and received the OPSC’s Lifetime Achievement Award in February 2012.
Board Vice-Chair: Richard J. LaBaere, II, DO, MPH, FAODME

Dr. LaBaere is the NBOME’s newly installed Vice-Chair for 2020-2022. He joined the NBOME Board in 2010 and served on the organization’s Blue Ribbon Panel on Enhancing COMLEX-USA and the Marketing and Communications Task Force. Dr. LaBaere previously served as Secretary-Treasurer on the Board of Directors, chairs the Finance Committee, and chairs the COMLEX-USA Composite Examination Committee. He also serves on the Compensation Committee and the Executive Committee, as well as the SAS for GME Outreach Task Force.
Dr. LaBaere is currently the associate dean for postgraduate training, the osteopathic postdoctoral training institution academic officer and an adjunct clinical professor of family medicine at A.T. Still University–Kirksville College of Osteopathic Medicine (ATSU-KCOM) in Missouri. He has served as regional assistant dean for the Michigan region at the Genesys Regional Medical Center in Grand Blanc, Michigan, where he began his career in 1993 in private practice and graduate medical education.
He has served in various roles as family medicine residency program director, director of medical education and designated institutional official for over 25 years. Dr. LaBaere has presented to local, state and national audiences and has received a number of awards, including being named the 2006 Osteopathic Family Physician of the Year by the Michigan Association of Osteopathic Family Physicians; he was inducted into the American Osteopathic Association’s Mentor Hall of Fame in 2007 and as a fellow in the collegium of the Association of Osteopathic Directors and Medical Educators (AODME) in 2008. He served as AODME president in 2013. Dr. LaBaere is certified by the American Board of Osteopathic Family Physicians. He earned his Bachelor of Science and Master of Public Health degrees from the University of Michigan in Ann Arbor and his DO degree from the Michigan State University College of Osteopathic Medicine.
Board Secretary-Treasurer: Lori A. Kemper, DO, MS, FACOFP

Dr. Kemper will serve as the NBOME Secretary-Treasurer for 2020-2022. A member of the NBOME Board, Dr. Kemper is also a member of the Test Accommodations Committee and the Awards Committee.
Dr. Kemper’s more than 30-year career encompasses both independent practice and graduate medical education. Since 2007, she has been the dean of Midwestern University, Arizona College of Osteopathic Medicine, where she previously served as associate dean of graduate medical education and associate professor in the department of family medicine. She currently serves as a commissioner to the American Osteopathic Association (AOA) Commission on Osteopathic College Accreditation (COCA) and is the chair of the Board of Deans of the American Association of Colleges of Osteopathic Medicine (AACOM).
Dr. Kemper earned her DO degree from the Kirksville College of Osteopathic Medicine in 1981 and a master’s degree in biological sciences from Arizona State University. She is board certified in family practice and is a fellow of the American College of Osteopathic Family Physicians. Dr. Kemper has practiced as a family physician since 1982, starting her career with the National Health Service Corps, where she provided care for the underserved population in south Phoenix, Arizona. She served as director of medical education and as the family medicine residency program director for Tempe St. Luke’s Hospital in Tempe, Arizona, from 1993 to 2007, where she also served as chief of staff from 2005 to 2007.
Dr. Kemper has earned numerous awards, including the Arizona Osteopathic Medical Association (AOMA)’s Excellence in Osteopathic Medical Education award (2010), Phoenix Magazine’s “Top Doc” award (2007, 1997), and the AOMA’s Physician of the Year Award (2006). Dr. Kemper served as the program director for OMED 2011, the annual Osteopathic Medical Conference and Exhibition. She chairs the Professional Education Committee for the Arizona Osteopathic Medical Association, of which she is past president.
PHILADELPHIA, PA. The National Board of Osteopathic Medical Examiners’ (NBOME) Board of Directors appointed 12 new leaders to its National Faculty chair positions.
The NBOME’s National Faculty is made up of over 700 active, engaged members from across the nation. These thought leaders have diverse expertise in all osteopathic health professions and specialties, osteopathic medical education and evaluation, and osteopathic physician licensure and regulation. Together, they serve on operational committees that review exam criteria, write and review exam items, and serve in other roles in our mission to protect the public through rigorous competency assessment of osteopathic medical practitioners.
On behalf of the NBOME Board of Directors and staff, we would like to congratulate and welcome the following National Faculty members who have been appointed to 2020 National Faculty Chair positions.
Foundational Biomedical Sciences Division Chair, Pharmacology
Adrienne Z. Ables, PharmD, MS, FNAOME – Virginia College of Osteopathic Medicine Carolinas Campus

COMAT Examination Chair, Emergency Medicine
Thomas E. Benzoni, DO, EM, AOBEM, FACEP – Des Moines University College of Osteopathic Medicine
Clinical Decision-Making and Key Features Chair
Peter F. Bidey, DO, MSEd – Philadelphia College of Osteopathic Medicine
COMLEX-USA Level 1 Examination Chair
Joyce A. Brown, DO, CHSE – Touro College of Osteopathic Medicine – Middletown

Clinical Sciences Department Chair, Radiology and Diagnostic Imaging
Samuel M. Cosmello, DO, RPh – Private Practice, Fayetteville, NC

Clinical Sciences Department Chair, Surgery, Surgical Specialties and Anesthesia
Jay M. Crutchfield, MD, FACS – A.T. Still University School of Osteopathic Medicine in Arizona

Foundational Biomedical Sciences Division Chair, Biochemistry
Martha A. Faner, PhD – Michigan State University-College of Osteopathic Medicine

Clinical Sciences Department Chair, Preventive Medicine and Health Promotion
Joyce M. Johnson, DO, MA – Georgetown University

Foundational Biomedical Sciences Division Chair, Physiology
Kathleen P. O’Hagan, PhD – Midwestern University Chicago College of Osteopathic Medicine

COMAT Examination Chair, Surgery
Michelle M. Sowden, DO – University of Vermont College of Medicine

Foundational Biomedical Sciences Department Chair
Robert J. Theobald, PhD – A.T. Still University-Kirksville College of Osteopathic Medicine

Clinical Sciences Preventive Medicine and Health Promotion Division Chair, Biostatistics and Epidemiology
Eduardo Velasco, MD, MSc, PhD – Touro University College of Osteopathic Medicine – California
“Our National Faculty is crucial to our mission of protecting the public,” said Sandra Waters, MEM, NBOME’s Vice President for Collaborative Assessment & Initiatives. “The NBOME is honored to have such talented and committed thought leaders that represent all aspects of clinical and foundational biomedical science disciplines.”
About the NBOME
The National Board of Osteopathic Medical Examiners (NBOME) is an independent, not-for-profit organization that provides competency assessments for osteopathic medical licensure and related health care professions. NBOME’s COMLEX-USA examination series is a requirement for graduation from colleges of osteopathic medicine and provides the pathway to licensure for osteopathic physicians in the United States and numerous international jurisdictions.
PHILADELPHIA, PA. The National Board of Osteopathic Medical Examiners (NBOME), an independent, not-for-profit organization that provides competency assessments for osteopathic medical licensure and related health care professions, salutes Dana C. Shaffer, DO, on the culmination of his two-year term as Chair of the NBOME Board of Directors. At its Board of Directors Annual Meeting and Gala Dinner in December, the NBOME recognized Dr. Shaffer for his exceptional leadership and service.

“My time with the NBOME and all the wonderful folks here have been instrumental in my career,” said outgoing NBOME Board Chair Dana C. Shaffer, DO. “It’s been a privilege to serve as the Board Chair, and I’m confident the future is bright for the NBOME and its leadership.”
Dana C. Shaffer, DO, the dean at the Kentucky College of Osteopathic Medicine (KYCOM) in Pikeville, Kentucky, was installed as Chair of the NBOME Board in December 2017. He serves as a member of the Executive Committee and Compensation Subcommittee. He previously served as Vice-Chair and Secretary-Treasurer of the NBOME Executive Committee, as a member of the Test Accommodations Committee, and as chair of the Finance Committee and the Liaison Committee. Prior to serving as dean at KYCOM, Dr. Shaffer served as senior associate dean and as senior associate dean of clinical affairs at Des Moines University College of Osteopathic Medicine from 2006 to 2013. Before that, Dr. Shaffer practiced the complete spectrum of family medicine in rural Iowa for 22 years, including osteopathic manipulative medicine, obstetrics and emergency medicine, as well as both inpatient and outpatient care.
“Dr. Shaffer has excelled as a leader for us at the NBOME,” said NBOME President & CEO, John R. Gimpel, DO, MEd. “His wisdom, judgment, experience and commitment have been great assets for us, and we thank him for his countless contributions to the NBOME and our mission.”
PHILADELPHIA, PA. The National Board of Osteopathic Medical Examiners (NBOME), an independent, not-for-profit organization that provides competency assessments for osteopathic medical licensure and related health care professions, today announced Kim E. LeBlanc, MD, PhD, as the recipient of its 2019 Clark Award for Patient Advocacy. The award was created to recognize those who have gone above and beyond the call of duty in their advocacy for patient safety, patient protection and quality of care. It recognizes those who have worked to assure patients that DOs have qualified for licensure by passing examinations (COMLEX-USA) that are designed for, and have evidence of validity for, the practice of osteopathic medicine. Dr. LeBlanc was presented with the award as part of NBOME’s Annual Board Meeting and Gala Dinner.

“The NBOME has been so important to my professional life, and I couldn’t be more honored to have been chosen as their next Clark Award winner. I appreciate all of my colleagues at the NBOME and in the osteopathic medical profession. Nothing would be possible without all of your hard work,” said Kim E. LeBlanc, MD, PhD. “I have worked with colleagues at the NBOME and on the COMLEX-USA examinations since my tenure at the Louisiana State Board of Medicine and have been proud to endorse the use of COMLEX-USA for osteopathic medical licensure in Louisiana and across the nation.”
Dr. LeBlanc recently transitioned back to Louisiana from his position as Executive Director of the Clinical Skills Evaluation Collaboration (CSEC), which creates and administers the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills examination. He was instrumental in advocating for equivalent licensure for DOs and acceptance of COMLEX-USA when he served as President of the Louisiana State Board of Medicine and on the Board of Directors of the Federation of State Medical Boards. Dr. LeBlanc was in the private practice of family medicine and sports medicine for nearly 20 years, during which time he became involved in academic medicine. He also served as team physician for the University of Louisiana Lafayette, several US Olympic teams, and several professional baseball, soccer, ice hockey, and football teams.
“Dr. LeBlanc’s role in advocating for DOs and COMLEX-USA is a vital piece of NBOME’s exciting history, so it is fitting that he should be awarded the NBOME Clark Award for Patient Advocacy in this our 85th Anniversary year,” said NBOME President & CEO, John R. Gimpel, DO, MEd. “Dr. LeBlanc has made a major difference in health care and medical licensure.”
PHILADELPHIA, PA. The National Board of Osteopathic Medical Examiners (NBOME), an independent, not-for-profit organization that provides competency assessments for osteopathic medical licensure and related health care professions, today announced Gary L. Slick, DO, MA, as the recipient of its 2019 Santucci Award. Thomas F. Santucci, Jr., DO, was the NBOME’s President and Chair of the Board from 1985 to 1987, at a pivotal time of change for the organization. The Santucci Award is the NBOME’s highest honor, awarded only to an individual who has distinguished himself or herself through sustained, outstanding contributions to the mission of the NBOME: protecting the public via competency assessment. Since 1978, Dr. Slick has served in numerous roles at the NBOME, including as Chair of the Board of Directors from 2015-2017.

“I’m truly humbled to have been chosen by my peers to receive The Santucci Award,” said Gary L. Slick, DO, MA. “I want to thank all of my fellow board members at the NBOME for this distinction. Everything the NBOME has accomplished has been a team effort, and I look forward to what’s to come in the future.”
Dr. Slick currently serves as the designated institutional official of the graduate medical education residency and fellowship programs under sponsorship of the Oklahoma State University Center for Health Sciences (OSU-CHS), the chief academic officer of the Osteopathic Medical Education Consortium of Oklahoma, professor of medicine at the OSU-CHS, and member of the board of directors of the Accreditation Council for Graduate Medical Education.
A nephrologist, Dr. Slick has served the NBOME in numerous volunteer capacities over four decades, including as an item writer, test construction committee member, and final exam reviewer in physiology and internal medicine for COMLEX-USA examinations. He has chaired numerous NBOME Board and testing committees, including serving as the inaugural chair for the COMAT Internal Medicine examination, a test now used in clerkship evaluation at almost every college of osteopathic medicine nationwide. Dr. Slick has been a member of the NBOME Board of Directors since 2005 and was installed as Board Chair in 2015.
“We are so pleased to recognize Dr. Slick with the NBOME’s highest honor,” said NBOME President & CEO, John R. Gimpel, DO, MEd. “Dr. Slick has made immeasurable contributions to the NBOME and the osteopathic medical profession since the 1970s, and how fitting that he should be a 2019 Santucci Award Winner in our 85th Anniversary year.”
As we continue to reflect on our 85th anniversary, we discussed the most memorable achievements in the history of NBOME with our Board of Directors.
What would you identify as NBOME’s greatest accomplishment since its founding?
Richard LaBaere II, DO, MPH: NBOME’s greatest accomplishment lies in the establishment of the COMLEX-USA series and its reputation as a nationally and internationally recognized assessment tool that is valid, reliable and relevant to what osteopathic physicians do. The NBOME has been tireless in implementing best practices in test development and testing, has made research a priority, and has employed a forward-looking approach to improvement and service.
Gary Slick, DO, MA: NBOME’s greatest accomplishment to date is being recognized nationally and at the federal and state level as one of two accepted licensing agencies in the U.S.
John Thornburg, DO, PhD: There has been significant evolution and growth from the original small ‘mom and pop’ organization, with only a few full-time employees, to what it is today. NBOME and COMLEX-USA have had much to overcome over the years and they have done so with tremendous grace.
What are you most proud to have been a part of since becoming involved with the NBOME?
Richard LaBaere II, DO, MPH: I am most proud of our thoughtful and deliberate growth in both capacity and relevance in the assessment and services NBOME provides. The implementation and further development of COMAT, the launch of a new testing blueprint, and the opening of a new clinical skills testing center are just a few great examples of strategic growth which has helped us in fulfilling our mission to protect the public. NBOME has been a reliable, steadfast partner to many affiliated organizations as well, willing and able to help others move forward during turbulent times of change.
William Anderson, DO: One of the most significant accomplishments that I am glad to have been a part of is the high standards that NBOME set for the profession.
John Thornburg, DO, PhD: One of NBOME’s biggest accomplishments has been the recent adaptation of COMLEX-USA to a competency-based blueprint with the highest standards of quality, enhancing our esteemed status as the one-and-only osteopathic medical assessment for licensure.
What is the biggest challenge you have seen the NBOME face and overcome?
William Anderson, DO: The USMLE examination has long been recognized as the licensure exam that allows medical students to practice independently. As a result, NBOME and COMLEX-USA have faced a great deal of competition and challenge while working to establish a unique path for osteopathic medical licensure. The fact that NBOME was able to meet these challenges and emerge successful as an equivalent evaluation speaks to the high standards of COMLEX-USA and its appropriateness as a tool to measure and assess osteopathic medical knowledge.
John Thornburg, DO, PhD: Over the years, NBOME has faced many significant challenges and has worked tirelessly to gain respect and acceptance across the medical community and the general public. The quality of our assessment products has been key to our success, as well as NBOME’s efforts to strengthen relationships with our many stakeholders, particularly residency program directors, FSMB, NBME, AOA, AOE, and COM deans.
What is the most dramatic change you have seen during your tenure at the NBOME?
Richard LaBaere II, DO, MPH: In the years since I became part of the NBOME, I’ve noticed incredible growth in the understanding of COMLEX-USA—in the past decade alone, more and more people have come to know about COMLEX-USA and how it reflects the performance of those training in the osteopathic profession.
Gary Slick, DO, MA: Originally, the NBOME only had one examination: COMLEX-USA. In recent years, however, there has been an explosion in the number of assessments developed—from COMSAE, to COMAT, to CORRE. These new assessments have allowed new knowledge to be assessed and a larger number of stakeholders to take advantage of our examinations, including students, COMs, physicians, etc.
John Thornburg, DO, PhD: When I first became involved with the NBOME, there were ten COMs, some with class sizes of less than 100. The subsequent increase in the number of COMs and their class sizes has resulted in a huge increase in revenue, as well as the need for more NBOME staff to meet this demand. While the COMLEX-USA series remains NBOME’s primary product, the role new assessment products have played is far beyond what could have been foreseen 10 years ago.
What comes next for NBOME? What are you most excited about?
Richard LaBaere II, DO, MPH: I am really excited about the development of new technology platforms like CATALYST to sustain and drive easy access and expedite ways to continue life-long learning. I am also excited about how we can use assessment data in novel ways to assist both students and residency program directors in achieving the very best match possible in graduate medical education, especially in light of the single accreditation system for graduate medical education in 2020.
William Anderson, DO: I anticipate NBOME’s next steps will be closely tied to the single accreditation system for graduate medical education and the ACGME. As the GME landscape changes, the osteopathic medical community will need to adapt alongside it, working to earn a place in the new ACGME system and position itself as an asset to the practice of medicine.
Contributors
Richard J. LaBaere II, DO, MPH serves as the Secretary-Treasurer on the Board of Directors, chairs the Finance Committee and vice-chairs the COMLEX-USA Composite Examination Committee. Dr. LaBaere is currently the associate dean for postgraduate training, the osteopathic postdoctoral training institution academic officer and an adjunct clinical professor of family medicine at A.T. Still University–Kirksville College of Osteopathic Medicine (ATSU-KCOM) in Missouri.

Gary L. Slick, DO, MA is Immediate Past Board Chair on the Board of Directors and member of the NBOME’s Compensation Subcommittee and Nominating Committee. Dr. Slick also currently serves as the designated institutional official of the graduate medical education residency and fellowship programs under sponsorship of the Oklahoma State University Center for Health Sciences (OSU-CHS).
John E. Thornburg, DO, PhD, serves as a National Faculty Chair in Foundational Biomedical Sciences at NBOME. Dr. Thornburg is also Professor Emeritus in the Department of Pharmacology and Toxicology at Michigan State University. In 2012, he was awarded the AOA’s Distinguished Service Certificate during AOA OMED.

William G. Anderson, DO, was an active member of the NBOME Board of Directors from 2003 through 2014 and was a member of its Executive Committee from 2007 to 2010. Dr. Anderson is a professor of surgery and senior adviser to the dean at the Michigan State University College of Osteopathic Medicine (MSU-COM).
Horber DT, Waters S. CATALYST: Transforming Physicians’ Assessment into Learning. Presentation delivered at the 2019 Meeting of the American Board of Medical Specialties, Chicago, IL, September 2019.
Session Summary
For years, physicians have criticized maintenance of certification as an ineffective requirement that is irrelevant to practice and cost-prohibitive. In response, several specialty Boards have implemented longitudinal assessment formats to ensure continuing physician competency. The National Board of Osteopathic Medical Examiners (NBOME) has developed CATALYST, an assessment platform supported by findings from cognitive learning that emphasize the value of retrieving previously learned content, providing immediate feedback, spacing questions over time, and interleaving topics in order to produce more complex and durable learning.
During 2017 and 2018, in conjunction with the American Osteopathic Association (AOA), the NBOME conducted 16-week pilot studies with three osteopathic specialty boards. Results provided overwhelming support for the CATALYST assessment platform: of the 196 diplomates surveyed, 95% agreed or strongly agreed that CATALYST would help them stay current in their specialties and over 98% preferred the CATALYST format to traditional Board examinations. A significant pilot finding was that different specialty Boards have different requirements and expectations, as do the physicians within each specialty. In order to provide greater customization within CATALYST, the NBOME is implementing a new CATALYST platform, with a follow-up study.
The presentation will describe CATALYST as an assessment format, summarize NBOME’s development path, including the new platform and pilot outcomes, and describe alternative uses for CATALYST. Lessons learned from this journey and planned next steps will provide insights to organizations seeking alternative modes of ongoing physician assessment. Audience participation and questions will be encouraged.
Learning Objectives
By attending this presentation, attendees will be able to:
• Describe CATALYST’s basis in cognitive learning theory
• Summarize the outcomes reported in the CATALYST pilot studies
• Describe next steps for CATALYST
Shaffer D, Waters S. Ensuring Ongoing Physician Competency with CATALYST. Presentation delivered at the 2019 Meeting of the International Association of Medical Regulatory Authorities, Chicago, IL, September 2019.
Abstract
The purpose of maintenance of certification in the United States is to ensure ongoing physician competency in order to safeguard patient safety. In recent years, maintenance of certification, with its generally unpopular traditional, high-stakes, multiple-choice examination, has been criticized as a cost-prohibitive process that is not relevant to physicians’ clinical practice. In response, some specialty Boards, among them the American Board of Anesthesiology, the American Board of Pediatrics, and the American Board of Internal Medicine, have implemented alternative assessment formats that focus on facilitating physician’s continued learning.
In keeping with its mission, the National Board of Osteopathic Medical Examiners (NBOME) has developed CATALYST, a longitudinal assessment designed to provide specialty Boards with a potential means of assessing ongoing physician competency. CATALYST is based on findings from cognitive learning that emphasize the retrieval of previously learned content, providing immediate feedback, spacing questions over time, and interleaving topics. The NBOME, in conjunction with the American Osteopathic Association (AOA), conducted 16-week pilot studies to gather data concerning how diplomates from three osteopathic specialty boards viewed the CATALYST assessment platform and the assessment process. Participants were recruited from the American Osteopathic Board of Internal Medicine (AOBIM), the American Osteopathic Board of Pediatrics (AOBP), and the American Osteopathic Board of Obstetrics and Gynecology (AOBOG).
Results indicated overwhelming support for the CATALYST platform: of the 196 diplomates surveyed, 95% agreed or strongly agreed that CATALYST would help them stay current in their specialty and 91% thought it would help them take better care of their patients. Over 98% stated that they would rather answer a fixed number of CATALYST questions periodically than take the traditional recertification examination.
This presentation will describe the use of CATALYST as an assessment format and summarize the pilot studies and their outcomes. As well, next steps for CATALYST, including the development of a new technology platform, will be discussed. Lessons learned will assist participants in considering exploration or potential enhancement of similar programs in their jurisdictions.
Behavioral Learning Objectives
By attending this presentation, attendees will be able to:
• Explain the elements of cognitive learning theory that support CATALYST as a longitudinal assessment.
• Describe the outcomes of the pilot studies with diplomates of three osteopathic specialty boards.
• Describe next steps for CATALYST.
References
The American Board of Anesthesiology – Part 3: MOCA Minute®. http://www.theaba.org/MOCA/MOCA-Minute. Accessed February 4, 2019.
The American Board of Pediatrics – MOCA-Peds. https://www.abp.org/mocapeds. Accessed February 4, 2019.
Madewell JE, Hattery RR, Thomas SR, Kun LE, Becker GJ, Merritt C, Davis LW. American Board of Radiology: Maintenance of Certification. Radiology. 2005;234(1):17-25. Published online January 1, 2005. https://doi.org/10.1148/rg.251045979.
Brown PC, Roediger HL, & McDaniel MA. Make It Stick: The Science of Successful Learning. Cambridge, MA: Harvard University Press, 2014.
Moulton CA, Dubrowski A, MacRae H, Graham B, Grober E, & Reznick R. Teaching Surgical Skills: What Kind of Practice Makes Perfect? A Randomized, Controlled Trial. Ann Surg. 2006 Sep;244(3):400-9.
Mirigliani L, Lorion A. When Life Gets in the Way: Getting SPs out of Their Heads and into the Role. Presentation delivered at the 2019 Association for Standardized Patient Educators Annual Conference, Orlando, FL, June 2019.
Overview
We ask Standardized Patients (SPs) to put the outside world aside during encounters and focus only on what is happening in the room, but this is not always easy, even for the best SPs. An SP distracted by real-life concerns may struggle with portrayal or recall, arrive late or call out, or respond unpredictably to co-workers or feedback. Without attention, that SP may continue to struggle, both at work and outside of it. Yet the integrity of the simulation must be protected, and sometimes the SP’s employment will be in jeopardy. This session will help participants be prepared to recognize potential signs of SPs who are struggling emotionally; be receptive to having conversations with those SPs; identify tools and resources that can assist the SPs; be able to set limits; and be able to hold to those limits, even if it means the SPs going through corrective action, up to and including termination.
Rationale
An SP’s emotional state can have serious repercussions, impacting other SPs and staff as well as the SP, making portrayal of some cases more difficult and potentially risking an examination’s standardization. Being aware that shifts in emotional state are a possibility, being able to address the situation with the SP, and having tools and resources readily available for the SP will help trainers and administrators intervene, address the root cause of unusual behavior, and potentially assist a valued SP. Setting and holding to limits will help the trainer or administrator protect him or herself and the simulation.
Objectives
Participants will be able to:
1. Help SPs recognize what may be “triggers” for them, including having to simulate something they are experiencing in real life.
2. Encourage SPs to do emotional “self-checks” prior to simulations.
3. Start a potentially uncomfortable conversation with the SP.
4. Have tools and resources at hand.
5. Set limits to preserve the integrity of the simulation.
Intended Discussion Questions
1. Have you been in this situation before, on either side of the conversation? If so, what did/did not go well and what did you learn?
2. What tools have you used/could you use to help SPs dealing with emotional issues to focus on the simulation and their responsibilities to the center, their co-workers, and the students?
3. What resources are available at your institution that could assist SPs struggling with emotional difficulties?
4. Given your role, how will you prepare your colleagues and share information?
References
Spencer, John and Jill Dales, “Meeting the Needs of Simulated Patients and Caring for the Person Behind Them?” Medical Education 40.1 (2006): 3-5.
Bokken, Lonneke, Van Dalen, Jan, and Jan-Joost Rethans, “Performance-related stress symptoms in simulated patients,” Medical Education 38.10 (2004): 1089-1094.
Varlander, Sara, “The Role of Students’ Emotions in Formal Feedback Situations,” Teaching in Higher Education 13.2 (2008): 145-156.
Lewis, Karen L., Carrie A. Bohnert, Wendy L. Gammon, Henrike Hölzer, Lorraine Lyman, Cathy Smith, Tonya M. Thompson, Amelia Wallace, and Gayle Gliva-McConvey, “The Association of Standardized Patient Educators (ASPE) Standards of Best Practice (SOBP),” Advances in Simulation 2:10 (2017).
“Building Workplace Resilience.” Guidance Resources Online. 2018. ComPsych Corporation. Retrieved from https://www.guidanceresources.com/groWeb/s/article.xhtml?nodeId=809859&conversationContext=1
Ronkowski E. Collaborative Cognitive Item Mapping. Paper presented at the 2019 Conference of the American Board of Medical Specialties, Chicago, IL, September 2019.
Learning Objectives
Attendees will leave this presentation with ideas on how to innovate traditional item-writing workshops through Collaborative Cognitive Item Mapping (CCIM). They will also have an understanding of how to implement the Plan-Do-Check-Act (PDCA) model to innovate test development in a data-driven manner.
Session Summary
Collaborative Cognitive Item Mapping (CCIM) is a dynamic, new form of item development that builds on the literature in automatic item generation (AIG). In CCIM, a small group of subject matter experts (SMEs) develops items that assess essential testing objectives related to a clinical presentation, such as neck masses. The SMEs select high-frequency, high-impact diagnoses related to the topic, then map out patient findings and clinical decision-making processes. An item editor transforms the map into a set of items.
CCIM is beneficial because it is collaborative, systematic, and intentional. Independent item writing (IIW) can be challenging for physicians who are used to constant interactions and movement; CCIM allows SMEs to develop items without the intimidation of the blank page. The systematic approach of CCIM ensures that items include necessary details, such as duration of symptoms, and results in better distractors as SMEs think through plausible options for multiple diagnoses at the same time. With IIW, it is difficult to control for SMEs writing similar items on the same topics. With CCIM, a small group, rather than an individual, decides the testing objectives and diagnoses; this results in items that reflect the breadth and scope of the topic.
To develop CCIM, we implemented the Plan-Do-Check-Act (PDCA) model. At a pilot workshop, participants wrote items through both IIW and CCIM. Through a collaboration of psychometricians, editors, and test developers, we fast-tracked a group of nearly 100 items for pretesting, and the results showed no statistically significant difference in item performance between the CCIM and IIW items. Our preliminary findings also suggest that CCIM can boost item production by as much as 30% compared to traditional workshops.
Beyond the NBOME board, executive leadership, and even our 700+ member National Faculty, there are dozens of staff members and collaborators helping us protect the public in their roles behind the scenes. To commemorate our anniversary, we turned to some NBOME insiders for their insights into the work and culture of the NBOME. Last week we heard from Shirley Bodett and Dennis J. Dowling, DO. This week a few more long-time NBOME staff and collaborators shared their perspectives on our work over the years.
Sydney Steele, JD has been NBOME’s General Counsel for over 25 years. With a deep knowledge of the Americans with Disabilities Act (ADA), Sydney has been instrumental in establishing our modern Test Accommodations practices. In 2019 he won NBOME’s Santucci Award for a career of sustained contributions to the mission of the NBOME.
What was going on at the organization when you started?
When I started, as I recall, there were only about 12 or so employees in the Conshohocken office. There was no full-time President. There was no office in Chicago. There was no PE exam. And there were very few, if any, ADA claims by test-takers.
What were some of the biggest shifts in the NBOME during your time here?
The NBOME has become substantially more sophisticated in its testing practices, including Level 2-PE, and has expanded testing into related health care professions. Technology has driven a lot of that, as has my role in developing our ADA accommodations for students with disabilities.
What is your fondest memory of your time with the NBOME?
Working with the talented and dedicated people at the NBOME, and watching the organization grow from about 12 or so employees without a full-time president, to what it is today.

NBOME’s principal Research Associate, Yi Wang, MS, has been with the organization for 18 years. She was awarded the President’s Award for Outstanding Service.
How has technology changed how the organization works?
When I started working with NBOME, all of our assessments used paper and pencil. In 2005, we moved COMLEX-USA to a computer-based format and began developing COMLEX-USA Level 2-PE. Following that, we built an entire portfolio of computer-based, and even web-based, assessments.
What’s your favorite thing about working at the NBOME?
Everybody probably says the same thing, but it’s really true that the people that make up the NBOME are really the best thing about it. I’ve been here nearly 20 years, and I’ve seen us grow from 20 employees in 2001 to nearly 130 today, but I still know I can count on every member of my team.
The first face you see when you enter our Philadelphia corporate offices is Rachel Maxwell. She keeps us running like a well-oiled machine as our Coordinator for Operations. She has been with the organization for 15 years.
What’s changed since you’ve worked here?
When I first started in 2004, we were just opening the testing center for the COMLEX-USA Level 2-PE exam. Nothing was done electronically the first few years. Each student filled out a paper application and mailed in a check to pay for their exam. We manually registered students for each testing date on paper and then uploaded the registrations into the system to run the exam. We sent all score reports via snail mail as well. We’ve come a long way since then. Computers have simplified a lot of processes, but they still keep us busy.
What is your fondest memory of your time with the NBOME?
Do I have to pick one? I’ve been here so long that I have so many fond memories of NBOME. When we were smaller, we would hold company events at the company president’s house, take a company outing for an afternoon of snacks and swimming, and the entire staff would even attend dinner with the board.
What’s your favorite thing about working at the NBOME?
I have met some very interesting people while working with our National Faculty, but more importantly I have made some wonderful friends among my coworkers as well. The NBOME keeps growing, but I still feel like we maintain a small-company feel where everyone knows each other and cares about each other, and like any family, we go through our ups and downs, good and bad.
We are pleased to congratulate Karen J. Nichols, DO, former president of the AOA and vice chair of the Accreditation Council for Graduate Medical Education (ACGME) board, for being named chair-elect of the ACGME.
“I have had the honor of serving on the ACGME board for five years and have clearly seen the laser-focus of the entire organization on our mission – ‘…to improve health care and population health by assessing and advancing the quality of resident physicians’ education through accreditation,'” said Dr. Nichols.
Dr. Nichols has a long, decorated history in osteopathic medicine. She served as the first woman president of the AOA, president of the Arizona Osteopathic Medical Association and president of the American College of Osteopathic Internists.
From 2002-2018, Dr. Nichols was dean of Midwestern University Chicago College of Osteopathic Medicine. Prior to that, she was assistant dean, post-doctoral education and division director, internal medicine, at the Midwestern University Arizona College of Osteopathic Medicine. A frequent national speaker on leadership, end-of-life care and osteopathic medicine, Dr. Nichols has also received seven honorary degrees and top awards from the AOA and the American Association of Colleges of Osteopathic Medicine.
She currently holds several positions at the ACGME. In addition to her newly appointed post, she is a member of the executive committee, chair of the governance committee, and a member of the standing committees for education, policy and monitoring.
“The ACGME has worked to transition to an accreditation model that encourages excellence and innovation. My vision is to work with our fine ACGME board, staff and volunteers to see that the ACGME continues to move forward while being thoughtful and current.”
All of us at the NBOME recognize Dr. Nichols’ accomplishments, and we sincerely wish her the best of luck in her new role with the ACGME.
Read more about Dr. Nichols’ role as chair-elect of the ACGME.
In honor of the NBOME’s 85th anniversary since our founding, we sat down with some inspirational members of the osteopathic medical community to discuss their thoughts and perceptions of the NBOME over the years and now.
As NBOME celebrates 85 years of osteopathic medical assessment, how do you feel the organization has impacted the osteopathic medical profession over recent decades?
John Potts, MD: The NBOME’s examinations, developed by their many highly capable and dedicated volunteers, have continued apace of the rapid advance of medical knowledge. As such, the NBOME has pushed both osteopathic medical students and the colleges of osteopathic medicine to ever-higher achievement.
Thomas Cavalieri, DO: The NBOME has impacted the osteopathic medical profession through its commitment to excellence and its steadfast adherence to protecting the public. Fulfilling this mission derives from NBOME’s ability to create an exam that truly integrates osteopathic principles and practice while providing evidence for the need for a distinct profession to have a distinct licensure exam.
Bill Burke, DO: The NBOME, through the actions of its Board and staff, has made an invaluable contribution to the growth and development of the osteopathic medical profession. The ability of DOs to obtain licensure in all 50 states is in large part due to the development and continuous modernization of the COMLEX-USA series. It is exciting to see the innovation coming from this organization, which will assist practicing physicians in maintaining their board certification through platforms like CATALYST.
William Mayo, DO: Throughout the entirety of its history, the NBOME has defended the distinction of DOs and our approach to our patients—sometimes even against strong opposition. Psychometrically valid, defensible exams such as COMLEX-USA make a strong case on behalf of the profession and have been endorsed by a number of organizations.
What advice would you give the NBOME as it completes its first 100 years between now and 2034?
Karen Nichols, DO, MA: I would encourage the NBOME to continue holding the bar high in order to ensure that qualified osteopathic physicians are prepared to serve the public.
John Potts, MD: These times are challenging in many ways and I can only predict more challenging times ahead for medical education, both osteopathic and allopathic. I expect the NBOME will continue to fulfill its mission as it has in the past, and continue to uphold the standards that further enable protecting the public.
Humayun Chaudhry, DO: NBOME faces the same challenges confronting all testing entities: the need to demonstrate the continued value of independent assessment as a critical adjunct to medical education and training. This is particularly important at a time when the broader environment seems less amenable to regulation overall.
William Mayo, DO: I would recommend that the NBOME continue to collaborate with the AOA, AACOM and the FSMB to promote distinctiveness across the continuum.
Thomas Cavalieri, DO: It is my hope that the NBOME remains steadfast in its commitment to protecting the public and assuring continued high-quality examinations that truly reflect the essence of osteopathic medicine.
Contributors

John R. Potts III, MD, is the Senior Vice President, Surgical Accreditation at the Accreditation Council for Graduate Medical Education (ACGME). Dr. Potts also serves as an adjunct professor of Surgery at the University of Texas Houston Medical School (UTHMS). He has also served on the ACGME’s Committee on Innovation in the Learning Environment and on the Standing Panel for Accreditation Appeals in the specialty of surgery.

Thomas A. Cavalieri, DO, is the dean at Rowan University School of Osteopathic Medicine and also serves as a professor of medicine and Osteopathic Heritage Endowed Chair for Primary Care Research. Dr. Cavalieri is a past chair on the NBOME’s Board of Directors, and a longtime National Faculty leader. He was first recruited to the National Faculty in the late 1980s as an exam writer, and oversaw the launch of the COMLEX-USA Level 2-PE in 2004.

Bill Burke, DO, is the Dean of the Ohio University Heritage College of Osteopathic Medicine-Dublin Campus and Chair of Osteopathic International Alliance. He served as a trustee of the American Osteopathic Association (AOA) and as the chair of its departments of Educational Affairs, Governmental Affairs, and Research and Development, as well as its Bureau of Communications and Committee on AOA Governance and Organizational Structure. He is a founding director of the International Primary Care Educational Alliance.

William S. Mayo, DO, was president of the American Osteopathic Association (AOA) for 2018–2019. Throughout his tenure, Dr. Mayo has served the AOA in many capacities. Additionally, Dr. Mayo is a past president of the Mississippi Osteopathic Medical Association and the Mississippi EENT Society. He has served on the Mississippi State Board of Medical Licensure since 2006 and was president from 2010-2012.

Karen J. Nichols, DO, MA, MACOI, CS, is the chair-elect of the Accreditation Council for Graduate Medical Education board of directors, and has served as president of the American Osteopathic Association, president of the Arizona Osteopathic Medical Association (AOMA), and president of the American College of Osteopathic Internists, being the first woman to hold all of those positions.

Humayun Chaudhry, DO, is the President and Chief Executive Officer of the Federation of State Medical Boards (FSMB) of the United States and was chair of the International Association of Medical Regulatory Authorities (IAMRA) from 2016 to 2018.
Beyond the NBOME board, executive leadership, and National Faculty, there are dozens of staff members and collaborators helping us protect the public in their roles behind the scenes. To commemorate our anniversary, we turned to some NBOME insiders for their insights into the work and the culture that have brought us to where we are today.

Senior Operations Specialist Shirley Bodett has been with us longer than any other staff member. In her 34 years with us, she has witnessed many of the changes that have shaped the modern-day NBOME.
NBOME: What was happening with the organization when you started?
When I was hired in 1984, there were only two other employees: the Executive Director, Carl W. Cohoon, and his assistant, Carol Thoma. I was hired to answer phones and do clerical work.
To create exams (one for each discipline), the discipline chair would look through coded cards and select test items based on categories. The staff would then use a word processor and floppy discs to put these questions into a two-column document. This was then sent to a printer, who published the exam books.
Exam scoring was contracted out to the University of Iowa, where score reports were printed and sent to us in triplicate for distribution. We entered candidate names into huge black books by hand, in alphabetical order, by school and graduating class. Later, we entered each candidate’s scores into that same book. When transcripts were ordered, we again opened these books to find the information needed to complete the transcript.
What were some of the biggest changes you’ve seen in the organization?
Computerization has completely changed how we do nearly everything. We’ve brought a lot of our processes in-house, and our vastly expanded staff is much more involved in item creation, editing, and review.
What is your fondest memory of your time with the NBOME?
Working with some of the same people for many years, and getting to know physician Board members, other subject matter experts, and staff as individuals rather than as defined by their professions.
A lifelong advocate for osteopathic manipulative medicine (OMM), Dennis J. Dowling, DO, FAAO, our Coordinator for OMM Assessment, began working with the NBOME 26 years ago. His work has been instrumental in launching our COMLEX-USA Level 2-PE.
When did you begin working with the NBOME?
I started in the early 90s after becoming a faculty member at NYCOM. One of my professors, Robert E. Mancini, PhD, DO, was a pharmacologist who became an osteopathic physician and later served as an NBOME president. Dr. Mancini got me involved with a task force he had put together to integrate Osteopathic Manipulative Medicine (OMM) with other questions.
What were some of the biggest changes in your time here?
In 1997 I expressed an interest in the examination of osteopathic manipulative skills and in using scoring rubrics to better reflect the process. We came to a major crossroads in the early 2000s that could easily have led to DO students taking a generic test for all medical students, with a tacked-on OMT station or two and no other osteopathic distinctions. But thanks to our work at the time, we now have a fully integrated osteopathic examination that is a much more effective way of testing osteopathic students preparing to enter postgraduate training.
How has technology changed in terms of how we operate?
Technology expands the ability to create much more material and to develop alternate processes of testing. It also opens us up to greater security risks than ever before. We have to keep up with advancing technology and capabilities while meeting the needs of the population that we are examining.
What’s your favorite thing about working at the NBOME?
There’s a camaraderie and a sense of purpose that permeates everything we do. We are truly trying to develop the best product for protecting the public and enhancing osteopathic medicine. Without the strength of the NBOME, osteopathic medicine would be a very different and much less effectual profession than exists today.
Next week we’ll catch up with former NBOME General Counsel, Sydney Steele, 2019 NBOME President’s Award winner Yi Wang, and Coordinator of Operations, Rachel Maxwell for their perspectives on 85 years of NBOME.
Sheryl Bushman, DO, served as our Chair from 2005-2007, overseeing a great investment in development for the board, the organization, and its products to guarantee their validity at a time of increased scrutiny. From 2011 to 2013, Janice Knebl, DO, came on as chair and oversaw the creation of the Blue Ribbon Panel to modernize COMLEX-USA to a competency-based model (which we’ve just finished implementing this year). Both of these women have had their own distinct impact in shaping our organization; they also happen to be the first and second chairwomen of the NBOME.
We sat down with these two important figures to hear their perspective of the NBOME’s 85 year history, and their own part in it.
When A.T. Still opened the first COM in the 19th century, it was pretty radical that women were able to study there. Famously, the first person to take the NBOME’s first exam was a woman, Margaret Barnes. How do you feel about the state of women in the NBOME, and in osteopathic medicine on the whole? Are we living up to the legacy?
Dr. Sheryl Bushman: The NBOME has always treated women with the utmost respect. It is part of our DNA. I recall that before I became Chair, they asked, “What should we call you? Chair-man Bushman doesn’t sound appropriate.” We’ve simply called the position “Chair” ever since. Even to this day, I see committee Chairs purposefully review the demographics of their members and try to build a membership that reflects the profession in terms of race, sex, age, location, etc. This encourages the NBOME’s culture of collaboration, intellectual stimulation, respect and sensitivity. A.T. Still would be proud to see how far we have come.
Dr. Janice Knebl: I am so very proud that while I was NBOME Chair, the Board of Directors was composed of 40% women. As I participated in the Coalition for Physician Accountability, which included all of the other major physician groups, we had a larger percentage of women physicians and board members than any of the other organizations. It is critical for the NBOME Board to reflect the “face” of osteopathic medicine, which is on average about 50% women in every College of Osteopathic Medicine class.
What do you think women bring to the table, particularly when it comes to leadership roles?
SB: Whether we are men or women, we all come to our leadership roles with a different style. I am certain that my role as Chair helped me develop my leadership skills in being able to provide difficult news clearly, directly, but gently.
JK: I believe that women bring empathy, a strong work ethic and collaboration to osteopathic medicine. Of course, these are generalities that don’t apply to all women. When working with women leaders in osteopathic medicine, I have seen them be solution-focused and inclusive of diverse opinions. Most of the women leaders I have worked with have given over 100% to their positions.
What does the NBOME do well when it comes to promoting gender diversity in leadership, and what do you think we could do better?
SB: As I’ve said, the NBOME has been committed to reflecting the demographics of the osteopathic profession since I first became involved in 1989. They do a good job. If there is a gap, I imagine it’s due more to a lack of awareness among the candidate pool than a lack of inclusivity on the NBOME’s part. Perhaps identifying a way to advertise or communicate opportunities could improve participation.
JK: NBOME intentionally recruited women for the Board of Directors during my tenure as Chair. In order to have gender diversity, there needs to be an intentional approach of inviting and encouraging women’s participation in all aspects of the organization. There needs to be an understanding and respect that women may have other roles and responsibilities during their careers that will change, enabling them to participate in the organization at different times. NBOME could consider supporting a leadership track for women and men who are identified for future leadership roles within the organization.
How do you look back on your experience with the NBOME?
SB: Among all the leadership positions I’ve held in my career, I treasure this position the most for several reasons. The NBOME is made up primarily of volunteers with great affection for the osteopathic profession and the desire to give back. Unlike many professional organizations, egos are left at the door. Patient wellbeing and student fairness are always at the forefront in our decisions, from test development to the cost of exams, etc. Working with colleagues across the entire spectrum of medical care for this organization is a true blessing.
JK: It was a true privilege for me to serve as an officer and Chair for the NBOME. Being involved with the NBOME and having the opportunity to be a leader in assessment for osteopathic medicine has been a true highlight of my career as an academic osteopathic physician. The mission of the NBOME to protect the public is noble and necessary for the public good and for all of us as patients.
Contributors

Sheryl Bushman, DO, currently works as Chief Medical Informatics Officer at Optimum Healthcare IT and serves on our COMLEX-USA Level 2-PE Advisory Committee. She served as the NBOME’s Board Chair from 2005 to 2007.

Janice Knebl, DO, currently practices and teaches geriatric medicine in Fort Worth, TX, in addition to chairing our COMLEX-USA Composite Examination Committee. She served as NBOME Board Chair from 2011 to 2013.
COMLEX-USA | New Level 1 and Level 2-CE Exams Have Launched
We are pleased to announce the completed launch of all elements of the enhanced COMLEX-USA exam series under the new COMLEX-USA Enhanced Master Blueprint. Level 1 successfully launched this spring, followed by Level 2-CE in late summer. These exams join Level 3 and Level 2-PE, which launched in 2018 and earlier this year, respectively. New passing standards for COMLEX-USA Levels 1 and 2 have also been implemented for the 2019-2020 test cycles.
This multi-stage release is the culmination of nearly 10 years of work in evidence-based design by experts and leaders from across the organization and the country who contributed in all areas to the creation and deployment of this state-of-the-art assessment.
The exams launched to heavy candidate volume with over 1,500 candidates completing each exam during the first weeks. To date, over 5,000 Level 1 examinations have been administered, with similar numbers for Level 2-CE.
These examinations also mark a move to Prometric’s new test driver, SURPASS, on which NBOME already administers its Core Osteopathic Recognition Readiness Examination (CORRE) as well as the latest version of COMLEX-USA Level 3. Since the move to SURPASS, some students have encountered performance problems during their administrations, including latency and examination restarts. Prometric continues to investigate the cause and made system upgrades in early June to address these issues. NBOME is currently offering online tutorials for Levels 1, 2-CE and 3 for candidates who would like to learn more about the new test interface being offered at Prometric Testing Centers.
Please visit the COMLEX-USA pages of our website to learn more.
COMAT | Foundational Biomedical Sciences (FBS) Exams Available this Academic Year
Since the inception of the COMAT Clinical Exams in 2011, osteopathic medical students have taken over 250,000 COMAT Clinical exams. As a result, we have seen dramatic improvement in COMLEX-USA Level 2-CE scores by osteopathic medical students. This is particularly important in the era of Single Accreditation for GME.
Celebrating 85 years of protecting the public through valid and reliable licensing exams, NBOME has spent the last 5 years developing an expanded COMAT portfolio to assist in DO student success.
COMAT Clinical exams initially focused on assessment of the clinical education and knowledge typically found in the year 3 and 4 COM curriculum. The success of this initial series of exams led the NBOME, in collaboration with its National Faculty, to expand its offerings and develop assessments for the Foundational Biomedical Sciences (FBS) curriculum, which takes place during years 1 and 2. After careful development and testing, the COMAT FBS Comprehensive (FBS-C) exam became available in December 2018. Since its inception, the 5-hour, 250-question COMAT FBS-C has successfully been utilized by many COMs across the country. This assessment has enabled both COM students and faculty to better understand the effectiveness of the school’s classroom curriculum and identify areas for student development.
Scheduled for release in January 2020, the suite of 14 FBS Targeted (FBS-T) exams further supports osteopathic medical students’ professional success. These exams are divided among 6 core science disciplines, including anatomy and pharmacology, and 8 body systems, including musculoskeletal and cardiovascular. Each 90-minute, 62-question COMAT FBS-T exam is designed to evaluate a student’s knowledge in a focused subject area. Timely score reports detail areas of strength and challenge and will provide COM faculty and students insight to guide COMLEX-USA Level 1 preparation.
Should you have any questions about COMAT or the new FBS examinations, please visit the COMAT pages of our website.
Contributed by: Michael Finley, DO | Senior VP for Assessment | NBOME
CATALYST | Continuous Learning Platform
Wouldn’t it be nice if you could make learning new material easier? And what if you could avoid taking another traditional multiple-choice exam to demonstrate what you’ve learned?
Inspired by research from leading cognitive psychologists, as well as by the success of the American Board of Anesthesiology’s MOCA-Minute, the NBOME began its research journey into developing its own continuous learning platform – CATALYST.
CATALYST is a formative assessment platform designed as a practice-relevant alternative to traditional assessments of physician competence. Based on the outcomes of several successful pilots conducted in 2017 and 2018, the NBOME has expanded its partnership with ITS to develop a more sophisticated platform that can be customized to meet various client assessment needs.
With the primary goal of eliciting user feedback on the newly designed platform, the CATALYST 2.0 Platinum Pilot was released on June 5. Participants included osteopathic medical students, residents, NBOME National Faculty, and NBOME staff. Learners were asked to answer 70 multiple-choice questions during a five-week period and were offered a choice of receiving 2 items a day, 14 items a week, or all 70 items at once. Following the completion of each item, participants were asked to gauge their confidence in answering the question and the question’s relevance to their specialty or field of study. Whether or not the question was answered correctly, the participant received immediate feedback, including the correct answer, a rationale, and references and links to additional learning resources.
Feedback has been very positive — 94% said the platform met their expectations and 91% found the system easy to navigate. What did participants like most about CATALYST? One responded that it was “very easy to navigate, good questions.” Another “liked that the platform showed the learning objectives of each question, helped identify why the question was being asked, and identified what the learning goals were for each question.” And a National Faculty member liked that “it could be done on my own time across multiple platforms and devices.”
The cross-functional CATALYST team, led by Sandra Waters, MEM, VP for Collaborative Assessment & Initiatives, is preparing for the next CATALYST release in September which will include an enhanced dashboard with normative statistics, highlighting of item components, and streamlined registration. Preparation has also begun for the delivery of COMSAE Phase 2 on CATALYST, providing COMLEX-USA Level 2-CE candidates the opportunity for alternative learning through formative assessment.
Contributed by: Dot Horber, PhD | Director for Continuous Professional Development | NBOME
In this section
Browne M, Wojnakowski M, Horber DT. Choosing Wisely: So Many Options for Assessment Administration. Which Will Enhance Your Exam’s Validity and Fairness? Paper presented at the 2019 Innovations in Testing Conference, Orlando FL, March 2019.
Short Description
With advances in assessment, credentialing organizations are presented with myriad options to “enhance” test format and administration. Two organizations have been conducting research and pilot testing to explore some options alone and in combination: use of resources while testing, and high-stakes testing under remote-proctored conditions.
Reference availability may increase an assessment’s fidelity to real-life clinical situations, but it raises many implementation questions: Which references will be useful, and what is the best way to make them available? What is the effect on test time needs, outcomes, and validity?
Remote proctoring is attractive to candidates as a convenience and can offer some cost savings. In reality, though, just how easy is it to test from home? What are the security implications? Copyright law treats remote-proctored tests differently; how can this be addressed?
The presenters will discuss obstacles encountered, comparison of outcomes, and best practices found.
Full Description
With advances in assessment, credentialing organizations are presented with myriad options to “enhance” test format and administration. Two organizations have been conducting research and beta testing to explore some options alone and in combination: use of resources while testing, and high-stakes testing under remote-proctored conditions.
As certification organizations move toward nontraditional assessments, the provision of reference resources during assessment is one of many areas of uncertainty. Although reference material availability likely increases an assessment’s fidelity to real-life clinical situations, it raises many implementation questions as well as concerns about test outcomes and validity.
Remote proctoring, long a hot topic, has rarely been contrasted with in-person proctoring in a high-stakes examination. The differences that materialized in candidate acceptance, test administration, and outcomes can inform much constructive discussion.
One organization is researching options for continued professional certification for its 50,000-plus certificants. A 1,500-participant research study incorporating open-book features and different proctoring conditions was completed in October 2018. The research divided the participants into six experimental conditions: in-person vs. remote proctoring, crossed with no resources, e-resources, or hard-copy resources. Presenters will discuss the development of the research design as well as the research outcomes.
Presenters from the second organization will discuss aspects of the development of an innovative item format that focuses on competency domains other than clinician knowledge recall. This format incorporates the use of online resources to locate clinical diagnostic and treatment information to answer questions. The item format contains a clinical case scenario with associated multiple-choice items that would require most examinees to access online resources in order to answer the questions, just as they would in current day-to-day practice. Presenters will discuss the item development process and relate quantitative and qualitative data obtained from the proof-of-concept study. Lessons learned from this study and planned next steps will provide insights to organizations seeking more authentic modes of assessment of clinical behavior and decision-making.
NCME Paper 2019
The Effects of Test Familiarity on Person-Fit and Aberrant Behavior
Hotaka Maeda, Ph.D. & Xiaolin Wang, Ph.D.
Abstract (50 words)
The person-fit to the Rasch model was evaluated for examinees taking multiple subject tests with a similar structure. The evaluation considered which test in the sequence (i.e., first, second) was taken. Compared to an examinee’s first test, person-fit improved for later tests. Test score reliability may improve with test familiarity.
Introduction
Aberrant behaviors are unusual test-taking behaviors that introduce noise to test data. They introduce nuisance constructs that are not intended to be measured and thus threaten measurement validity. One source of aberrant behavior is unfamiliarity with tests (Meijer & Sijtsma, 2001; Rupp, 2013). Examinees who take a new and unfamiliar test are likely to struggle to understand the test structure, gauge how much time they have for each item, navigate through a computer-based test, and handle their nerves. In contrast, examinees who are familiar with the test structure are likely to be less stressed, know how to prepare, and be able to complete the test efficiently. Compared to first-time takers’ results, scores for examinees who are familiar with the test structure may be less affected by the nuisance construct of test unfamiliarity and be more representative of their underlying ability. To the authors’ knowledge, this speculation has not been investigated and reported in the literature. Therefore, the purpose of this study is to examine the effects of test familiarity on person-fit and aberrant behavior using observed data.
Method
The instrument used in this study is a comprehensive medical achievement examination composed of eight clinical subject tests. Medical students typically take the test at the end of their clinical rotation in a given clinical subject. All clinical subject tests are structured identically:
- They are administered through the same platform.
- Item stems are worded similarly as they all target commonly encountered patient scenarios.
- All items in all tests are multiple-choice items with only one best answer.
Many examinees take all eight clinical subjects, but they do not take them in the same order. They can also choose to retake any clinical subject test. Therefore, the context of the instrument used in this study can be considered a quasi-experimental setting for assessing the effects of test familiarity on person-fit and aberrant behavior, where test familiarity can be defined by the number of clinical subject tests (including retakes) a candidate has taken.
Response data in all clinical subjects from July 2017 to June 2018 were used. Exploratory factor analysis with no rotation was conducted for each subject separately in order to identify high-quality items; items were removed from the data if their factor loadings on the first dimension were less than 0.1. The remaining data were then modeled using the Rasch model, and, for each subject, test forms were equated through concurrent calibration. Ability was estimated with maximum likelihood, standardized to N(0, 1), and bounded to [-5, 5] so that the values could be compared across subjects.
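A minimal sketch of these calibration steps is shown below, using simulated data rather than the study's. The principal-component screen is a rough stand-in for the one-factor EFA loadings, item difficulties are treated as already calibrated, and the sample sizes are arbitrary assumptions for illustration only.

```python
# Illustrative sketch: item screening and Rasch ability estimation on simulated data.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Simulated dichotomous responses: 500 examinees x 40 items (hypothetical sizes).
n_persons, n_items = 500, 40
true_theta = rng.normal(0, 1, n_persons)
true_b = rng.normal(0, 1, n_items)
prob = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
X = (rng.random((n_persons, n_items)) < prob).astype(int)

# Screen items: keep those loading at least 0.1 on the first component
# (a crude proxy for the unrotated one-factor EFA loading).
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
loadings *= np.sign(loadings.sum())          # orient the component positively
keep = np.abs(loadings) >= 0.1
X, b = X[:, keep], true_b[keep]              # difficulties treated as calibrated

# Maximum-likelihood ability under the Rasch model, bounded to [-5, 5].
def theta_mle(responses, difficulties):
    def neg_loglik(theta):
        p = 1 / (1 + np.exp(-(theta - difficulties)))
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded").x

theta_hat = np.array([theta_mle(X[i], b) for i in range(n_persons)])
theta_std = (theta_hat - theta_hat.mean()) / theta_hat.std()   # standardized scores
```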
Aberrant behavior was assessed using the lz* person-fit statistic (Snijders, 2001). The lz* is asymptotically distributed as N(0, 1), where positive values represent good person-fit and negative values represent poor fit. If examinees respond to the items in a reasonable manner (e.g., not aberrantly because of test unfamiliarity), lz* should be high, showing that their responses fit the model well. The lz* is uncorrelated with ability when aberrant behavior is not present. A typical cutoff for flagging poor person-fit is -1.645, which corresponds to the one-tailed .05 alpha level.
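For concreteness, the sketch below computes the uncorrected standardized log-likelihood statistic lz under the Rasch model; Snijders' lz* used in this study adds a correction for the estimated ability parameter, which is omitted here. The item difficulties, ability value, and aberrant response pattern are illustrative assumptions.

```python
# Uncorrected lz person-fit statistic under the Rasch model (illustration only).
import numpy as np

def lz_statistic(responses, theta, difficulties):
    """Negative values flag aberrant (poorly fitting) response patterns."""
    p = 1 / (1 + np.exp(-(theta - difficulties)))    # Rasch response probabilities
    logit = np.log(p / (1 - p))
    observed = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * logit ** 2)
    return (observed - expected) / np.sqrt(variance)

# Illustration: a below-average examinee who answers only the hardest items
# correctly (and misses the easy ones) produces a clearly negative value.
difficulties = np.linspace(-2, 2, 20)
aberrant = (difficulties > 1).astype(int)
print(lz_statistic(aberrant, theta=-0.5, difficulties=difficulties))
```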
The degree of person-fit (i.e., lz*) was regressed on the sequence of tests using two separate two-level random-intercept models. As examinees took multiple tests, the tests were modeled as nested within examinees. Model 1 included three exam-level predictors: 1) examinee age in years at the time of the exam, 2) standardized test score, and 3) whether the subject being taken is a retake. The only predictor at the examinee level was the number of times the person had ever retaken any clinical subject test (0, 1, 2, and >2). The model could be written as:
Model 1: lz* ~ age + test.score + subject.retake + total.retake
Model 2 included all the predictors in Model 1 in addition to the test sequence as a categorical variable from 1 to 11 (i.e., the order in which the examinees took the test, such as first test, second test, etc.).
Model 2: lz* ~ age + test.score + subject.retake + total.retake + test.sequence
The test sequence for some students did not start with “first” if they had taken the tests prior to July 2017. The test sequence can extend longer for students who retake some clinical subject tests.
Residual plots were used to confirm that the residuals were approximately normally distributed with the same mean and standard deviation at every fitted value. Because Model 1 was nested within Model 2, they were compared using a likelihood-ratio test.
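A hedged sketch of how the two nested random-intercept models could be fit and compared follows, using statsmodels' mixed-effects module. The file name and column names (lz, age, score, subject_retake, total_retake, test_sequence, examinee_id) are hypothetical placeholders, not the study's actual data or code.

```python
# Sketch: two-level random-intercept models and a likelihood-ratio comparison.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("person_fit_records.csv")   # one row per test administration (hypothetical)

# Model 1: test-level covariates plus the examinee-level retake history.
m1 = smf.mixedlm("lz ~ age + score + subject_retake + C(total_retake)",
                 data=df, groups=df["examinee_id"]).fit(reml=False)

# Model 2: adds the test sequence (first, second, ...) as a categorical predictor.
m2 = smf.mixedlm("lz ~ age + score + subject_retake + C(total_retake) + C(test_sequence)",
                 data=df, groups=df["examinee_id"]).fit(reml=False)

# Likelihood-ratio test for the nested models (fit by ML, not REML).
lr = 2 * (m2.llf - m1.llf)
df_diff = m2.params.size - m1.params.size    # same random structure, so this is the fixed-effect difference
print("LR chi-square:", lr, "df:", df_diff, "p:", stats.chi2.sf(lr, df_diff))
print(m2.summary())
```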
Result
For the purpose of this specific study, 1,422 out of 5,594 items were removed from analysis, many of which were pretest items. All subjects achieved unidimensionality after the removal of such items. In addition, response data from 55 tests were removed because of an abnormally high test sequence due to retakes (12 or more). The final sample size across all test subjects was 4,172 items on 42,903 test administrations given to 10,135 examinees (see Table 1). Each test contained an average of 96.7 items (SD = 9.3). A majority of examinees had no history of retaking any clinical subject test (68.4%). Only 6.7% of the tests were retakes.
Table 1. Number of Exams by Sequence and Clinical Subject

| Test Sequence | A | B | C | D | E | F | G | H | Total |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 71 | 672 | 568 | 394 | 2,656 | 585 | 468 | 507 | 5,921 |
| 2 | 138 | 1,261 | 881 | 809 | 476 | 730 | 646 | 678 | 5,619 |
| 3 | 172 | 705 | 744 | 824 | 577 | 769 | 767 | 884 | 5,442 |
| 4 | 192 | 700 | 760 | 825 | 473 | 744 | 753 | 738 | 5,185 |
| 5 | 198 | 697 | 660 | 721 | 642 | 803 | 681 | 774 | 5,176 |
| 6 | 231 | 598 | 737 | 705 | 924 | 698 | 676 | 667 | 5,236 |
| 7 | 527 | 590 | 683 | 617 | 574 | 615 | 629 | 689 | 4,924 |
| 8 | 1,181 | 352 | 334 | 288 | 575 | 262 | 287 | 353 | 3,632 |
| 9 | 207 | 124 | 175 | 90 | 109 | 99 | 90 | 125 | 1,019 |
| 10 | 228 | 51 | 64 | 26 | 75 | 37 | 41 | 35 | 557 |
| 11 | 75 | 4 | 7 | 8 | 80 | 8 | 6 | 4 | 192 |
| Total | 3,220 | 5,754 | 5,613 | 5,307 | 7,161 | 5,350 | 5,044 | 5,454 | 42,903 |
Note. Many examinees take all eight clinical subjects, but they do not take them in the same order. Although there are only eight clinical subjects, the test sequence can extend beyond eight because of retakes.
Mean lz* was 0.04 (SD = 1.09), while the mean standardized test score was 0.02 (SD = 1.14). The mean SE of the standardized test scores was 0.51 (SD = 0.07). The mean standardized test score for those who had a history of retaking any clinical subject test was lower (M = -0.36, SD = 1.13) than for those who did not (M = 0.25, SD = 1.08). The percentage of test records exhibiting poor person-fit (i.e., lz* < -1.645) was 6.7%. Standardized test scores were positively correlated with lz* (r = .23).
A likelihood-ratio test showed that the addition of the test sequence predictor significantly improved model fit, χ2(10) = 75.05, p < .001. Controlling for examinee age, total historical retake count, whether the subject being taken is a retake, and standardized test score, person-fit was poorest on the first test compared with all later tests (p < .05). The coefficients from Model 2 are shown in Table 2. Compared with the first test, person-fit improved by 0.07 on the second test and by 0.27 on the 11th test.
Table 2. Model 2 Coefficients

| Predictor | Coef | SE | df | t | p |
|---|---|---|---|---|---|
| (Intercept) | 0.41 | 0.05 | 32,754 | 8.65 | <.001 |
| Examinee-level predictors | | | | | |
| Retake total = 0 (Reference) | – | – | – | – | – |
| Retake total = 1 | -0.04 | 0.02 | 10,131 | -2.62 | .009 |
| Retake total = 2 | -0.11 | 0.02 | 10,131 | -4.61 | <.001 |
| Retake total > 2 | -0.19 | 0.03 | 10,131 | -6.73 | <.001 |
| Test-level predictors | | | | | |
| Standardized score | 0.19 | 0.01 | 32,754 | 38.17 | <.001 |
| Examinee age in years | -0.02 | 0.00 | 32,754 | -9.30 | <.001 |
| Retaking the clinical subject | 0.10 | 0.02 | 32,754 | 4.44 | <.001 |
| Test sequence = 1 (Reference) | – | – | – | – | – |
| Test sequence = 2 | 0.07 | 0.02 | 32,754 | 3.63 | <.001 |
| Test sequence = 3 | 0.08 | 0.02 | 32,754 | 4.27 | <.001 |
| Test sequence = 4 | 0.10 | 0.02 | 32,754 | 5.03 | <.001 |
| Test sequence = 5 | 0.13 | 0.02 | 32,754 | 6.62 | <.001 |
| Test sequence = 6 | 0.13 | 0.02 | 32,754 | 6.57 | <.001 |
| Test sequence = 7 | 0.10 | 0.02 | 32,754 | 4.66 | <.001 |
| Test sequence = 8 | 0.13 | 0.02 | 32,754 | 5.86 | <.001 |
| Test sequence = 9 | 0.10 | 0.04 | 32,754 | 2.79 | .005 |
| Test sequence = 10 | 0.20 | 0.05 | 32,754 | 4.12 | <.001 |
| Test sequence = 11 | 0.27 | 0.08 | 32,754 | 3.27 | .001 |
Note. Person-fit was modeled using a two-level random-intercept model.
Model 2 also showed that those who had a history of retaking any clinical subject test tended to have lower person-fit than those who did not (p <.05). However, retaking the same clinical subject test was associated with an increase in person-fit by 0.10 (p <.001).
Discussion
This study shows that person-fit to the Rasch model improves as examinees gain experience in taking a series of tests with a similar structure. Improvements in person-fit were observed beyond the first and second tests. Test familiarity increased lz* by 0.1 or more. For reference, an increase in lz* from 0 to 0.1 is equivalent to an increase in person-fit of 3.98 percentiles. The findings indicate that the reliability of the test scores may improve with test-taking experience, and they show the importance of examinee familiarity with the test structure. The improvement in person-fit with increased test familiarity supports the provision of practice materials in order to minimize the negative impact of test unfamiliarity and to promote measurement validity.
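The 3.98-percentile figure above follows directly from the statistic's asymptotic N(0, 1) distribution, assuming the comparison is anchored at lz* = 0; a quick check:

```python
from scipy.stats import norm
# Percentile gain from moving lz* from 0.0 to 0.1 on a standard normal scale.
print(100 * (norm.cdf(0.1) - norm.cdf(0.0)))   # ~3.98 percentile points
```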
When interpreting the data, retakes of the same clinical subject exams needed to be considered. The option to retake any test allowed the test sequence to go beyond the number of available clinical subjects (i.e., eight). Clearly, a person who has taken the same test multiple times (despite taking a different form every time) should be more familiar with the test than a first-time taker. The examinees who had retaken any of the clinical subject exams tended to be lower achievers and to have lower person-fit than non-retakers. However, their person-fit improved upon retaking the same clinical subject test. Also, the results suggest that poor person-fit occurred more often because of spuriously low-scoring aberrant behavior (e.g., running out of time) than because of spuriously high-scoring behavior such as item pre-knowledge, which led many of the poor performers to retake the test. Regardless of test-retaking behavior, however, familiarity with the test structure led to increases in person-fit.
The study is limited in that we did not directly investigate whether the improvement in person-fit is in fact associated with an increase in the accuracy of the standardized test scores. This is difficult to show empirically, but it should be pursued in future work. Further, a quasi-experimental design was used in which some factors were uncontrolled, including examinees' ability to retake any test at will. These test-retaking patterns were not random, as they were correlated with important variables such as the standardized test scores. The study should also be replicated using other psychometric models and test data.
References
Meijer, R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Rupp, A. A. (2013). A systematic review of the methodology for person fit research in Item Response Theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55, 3-38.
Snijders, T. (2001). Asymptotic null distribution of person-fit statistics with estimated person parameter. Psychometrika, 66(3), 331-342.
New York Colleges of Osteopathic Medicine Educational Consortium (NYCOMEC)
This presentation to osteopathic medicine residency directors focused on preparing for the Clinical Learning Environment Review (CLER), an ACGME program instituted as part of the Next Accreditation System. The goal of CLER is to ensure that residency programs train residents in ways that protect patient safety. The presentation focused on what patient safety requires, namely "learner safety." It presented how to debrief residents using "good judgment" (a focus on performance gaps) and "empathic inquiry" (debriefing that develops self-reflection and self-correction), and it provided examples of effective and ineffective feedback and debriefing approaches.
Parshall C, Julian E, Parikh S, Horber DT. Using Nudges for More Effective Exam Programs. Paper presented at the 2019 Innovations in Testing Conference, Orlando FL, March 2019.
Short description:
Nudges are small, deliberate tactics we can use to help our test-takers (and our SMEs) do the things they want to do. While our testing programs have many points that can derail candidates, through small and subtle changes we can help them persist through the life cycle of application, testing (and retesting), and ongoing certification. For example, framing tactics in messaging can effectively decrease the number of test-takers who fail to show. Nudges can also be used with SMEs to increase JTA survey response rates and committee volunteer numbers. Join us for a panel discussion with researchers and practitioners using nudges in testing.
Full description:
Behavioral nudges have long been used to help people remember to do things or follow through on things they have started. New research has identified which strategies are most effective, as well as research tools for increasing their success in a specific environment. As a result, the use of nudges is moving from ad hoc to intentional and systematic. Educators, corporate offices, and governmental institutions are formally incorporating nudges into their interactions with the public and their staff, and testing programs can use them to support examinees, subject-matter experts, staff, and employers in doing what they already want to do.
The underlying goal is to influence, or “nudge,” people in positive ways that are in their own best interest, as defined by themselves. This presentation will discuss ways that a variety of testing programs are already using nudges and will share the evidence of their effectiveness.
This session will have a panel that includes researchers and practitioners effectively using nudge tactics in the field of testing. They will share real-world successful (and unsuccessful) examples of nudges in testing.
Presentations will include:
- an overview of nudges: what they are, the evidence for their effectiveness, and a simple research plan for implementing nudges effectively.
- a discussion of common areas in testing programs where people have agreed to do things but often need help carrying them out: e.g., examinees would benefit from nudges to meet registration deadlines, study, stay honest, and show up for the test on time with appropriate accouterments; SMEs would benefit from nudges to volunteer, write items, and review items.
- a presentation of before-and-after data on how timely phone calls decreased candidate "no-shows" for a medical licensure performance exam; additional nudge interventions from the program's in-development continuous assessment will be included.
- a case study of nudging applied in a high school equivalency program, with specific behavioral techniques and overall results.
References:
Ariely, D. (2008). Predictably Irrational: The Hidden Forces That Shape Our Decisions. New York: HarperCollins.
Kahneman, D. (2011). Thinking, Fast and Slow. New York: Farrar, Straus, and Giroux.
Thaler, R.H., & Sunstein, C.R. (2008). Nudge: Improving Decisions About Health, Wealth, and Happiness. New Haven, CT: Yale University Press.
Authorship
Kimberly M. Hudson, PhD, National Board of Osteopathic Medical Examiners
Yue Yin, PhD, University of Illinois at Chicago
Tsung-Hsun Tsai, PhD, National Board of Osteopathic Medical Examiners
Grant Number/ Funding Information
Not applicable
Corresponding Author
Kimberly Hudson, 8765 West Higgins Road, Suite 200, Chicago, Illinois 60631; 773-714-0622; Kimberly.shay86@gmail.com
Key Words
Equating, Automated Test Assembly, Optimal Test Assembly, IRT, Rasch Model, CINEG
Abstract
As early as the 1960s, testing organizations began implementing Automated Test Assembly (ATA) to simplify the laborious process of manually assembling test forms and to enhance the psychometric properties of the examinations (Wightman, 1998; van der Linden, 2005). But it is unclear what impact transitioning to ATA has on equating outcomes. The purpose of this research study was to evaluate outcomes from different IRT scale linking and equating methods when a testing organization transitioned from manual test assembly to ATA.
After crossing each scale linking procedure with each equating method, I calculated error and bias indices (e.g., RMSD, MAD, MSD) and evaluated the decision consistency of the equating outcomes.
The results showed that the mean/mean scale linking procedure paired with the IRT preequating method produced the lowest bias and error and the highest level of decision consistency.
The results of this study support the importance of aligning psychometric and test development procedures. The findings suggest that the equating outcomes were related to the similarity of the statistical test specifications: ATA produced more parallel test forms with better psychometric properties than forms assembled manually. The modifications to assembly practices therefore warrant consideration of a new base form for scaling and standard setting.
Introduction
In high-stakes medical licensure testing programs, test developers and psychometricians work together to develop multiple test forms that can be administered simultaneously to examinees to enhance examination security. Although the volume of forms may differ between testing programs, it is crucial that all test forms are built according to the same test specifications (von Davier, 2010). Furthermore, scores on the test forms must be interchangeable and candidates should perceive no difference between the test forms administered (Kolen & Brennan, 2014). The test development processes and psychometric procedures are inherently connected and both must be considered when developing multiple test forms.
Traditionally, test developers have manually assembled multiple test forms according to a set of content requirements. Test developers typically evaluate statistical criteria such as mean proportion of correct responses (p-value) or mean point-biserial correlation upon completion and make adjustments to confirm that statistical specifications are met. Manual test assembly (MTA) is a time-intensive process, typically requiring the attention and work of multiple test developers. However, with the widespread use of computers, testing organizations can improve on this laborious manual process by developing and employing computer programs to assemble tests automatically. If staff members possess the necessary programming skills, they can create computer programs that assemble multiple test forms simultaneously while balancing content and statistical constraints.
When assembling tests manually, test developers use a variety of informational inputs, or constraints, to create multiple forms of an assessment that are balanced in terms of content, difficulty of items, item formats, contextual information of items (e.g., the patient’s life stage), item duration, word count, and exposure rate. Test developers first compile an item pool, which contains a selection of items that meet some basic requirements for inclusion on a test. Scorable items function as operational or anchor items and often have known item parameters based on a prior administration. Test developers iteratively select a group of items that meet the minimum proportions of each domain as specified by the test blueprint and evaluate the range of item statistics or average item statistics, such as p-values and point-biserial correlations. The number of parallel test forms and the number of constraints undoubtedly impacts the complexity of manually assembling forms. Moreover, many testing organizations implement this resource-intensive process across numerous testing programs on an annual or semi-annual basis.
Automated Test Assembly (ATA) is an efficient alternative to this laborious process, though it presents its own challenges (Wightman, 1998). Unlike MTA, ATA programs use the test information, the summation of item information across the ability continuum, in the creation of multiple parallel test forms. Thus, ATA improves on the manual procedure by not only saving time and resources but also enhancing the psychometric quality of forms balanced according to a predetermined set of constraints and a maximized objective function. ATA may improve reliability across examination forms because it standardizes the test development process. Therefore, the impact of ATA is not just a question of "Can the computer do it," but rather "Can the computer do it better?"
In medical licensure examinations there is a critical need for score comparability across test forms, not only to ensure that scores are an accurate, reliable representation of examinee ability, but also to make pass/fail distinctions based on those scores. Earning a passing score on a medical licensure examination allows examinees to enter supervised medical practice. Therefore, psychometricians work to maintain decision consistency regardless of the test assembly method and the form administered to examinees. Decision consistency refers to the agreement of an examinee's pass/fail decisions on two (or more) independent administrations of unique forms, and decision accuracy refers to the agreement between an examinee's pass/fail decision and the decision that would be made based on the examinee's true ability (Livingston & Lewis, 1995). Both indices are necessary to evaluate in high-stakes medical licensure testing. In this research, I compare the decision consistency of equated results after implementing ATA.
The results of this research provide a psychometric framework to evaluate results from different equating methods upon the implementation of ATA. When testing organizations implement new test development processes, it is critical to examine the impact on examinee scores (AERA, APA, & NCME, 2014). Testing organizations monitor and evaluate scores and decision consistency of scores on examinations that ultimately license examinees to practice medicine in supervised or unsupervised settings. Neglecting to examine this may inadvertently lead to passing unqualified physicians, or failing qualified physicians.
In ATA, psychometricians and test developers often define linear and/or non-linear constraints in order to maximize a specific objective function, typically the test information function (TIF), at a given score point on the true-ability continuum (van der Linden, 2005). In a high-stakes licensure examination, the minimum passing standard (or cut-score) is commonly used for optimization because it maximizes test information near the cut-score and minimizes the standard error of measurement (SEM) at the cut-score. This leads to increased reliability of scores closest to the cut-score and better accuracy of pass/fail distinctions. Therefore, ATA is designed to enhance the psychometric qualities based on prior item information (i.e., higher reliability coefficients, and lower standard error of measurement near the cut-score), and the efficiency of assembling test forms. However, research has not yet addressed the impact of transitioning from MTA to ATA on results from equating methods. In this study, I investigate the differences in equated results between MTA and ATA forms.
Most ATA processes use a framework of Item Response Theory (IRT) to construct forms with computer programs that integrate item-level information according to a set of predetermined constraints. The use of IRT typically goes hand-in-hand with the psychometric framework utilized by the testing program. In IRT, items have a set of unique characteristics; some items are more informative than others at different ability levels. Psychometricians investigate the individual contribution of an item to a test by reviewing the item information function (IIF). The TIF is the summation of IIFs across the ability continuum and, in ATA, represents the characteristics and composition of all items on each test form. Moreover, in the context of medical licensure examinations, using the minimum passing standard as the value for optimization ensures that scores are precise ability estimates for minimally qualified examinees. Thus, when the TIF is optimized at the cut-score, it ultimately reduces the probability of Type I error (unqualified examinees passing the examination). Furthermore, Hambleton, Swaminathan, and Rogers (1991) suggest that the test characteristic curve (TCC) creates the foundation for establishing the equality of multiple test forms, which is certainly the case when optimizing the TIF. The TIF provides aggregate information from each item on the examination, whereas the TCC gives the expected raw score at a given ability level, θ. If we wish to create parallel test forms, the TCC provides evidence that a given ability level relates to similar expected scores on two parallel forms of the same test. Furthermore, the use of content and statistical constraints in ATA computer programs provides evidence that all test forms are balanced in terms of statistical specifications.
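As an illustration of these quantities, the sketch below evaluates the Rasch item information function and the TIF (and the corresponding conditional standard error of measurement) at a hypothetical cut-score; the difficulties and cut-score are invented for the example and do not come from the examination studied here.

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch probability of a correct response for item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_information(theta, b):
    """Rasch item information: I_i(theta) = P_i(theta) * (1 - P_i(theta))."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

def test_information(theta, difficulties):
    """TIF: the sum of item information functions at a given theta."""
    return sum(item_information(theta, b) for b in difficulties)

# Hypothetical form and cut-score (illustrative values only).
form_difficulties = [-1.2, -0.6, -0.1, 0.3, 0.8, 1.5]
theta_cut = 0.4
tif = test_information(theta_cut, form_difficulties)
sem = 1.0 / np.sqrt(tif)   # conditional standard error of measurement at the cut-score
print(f"TIF at cut-score: {tif:.3f}, SEM at cut-score: {sem:.3f}")
```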
Once parallel forms are assembled, reviewed, published and administered, the results must be analyzed and equated. Equating refers to the use of statistical methods to ensure that scores attained from different test forms can be used interchangeably. Equating can be conducted through a variety of designs, approaches and methods (Kolen & Brennan, 2014). Although there are key differences between IRT and the Rasch model, this research will focus on the applicability of IRT equating methods to a testing program that utilizes the Rasch model as a psychometric framework. Within IRT equating methods, both preequating and postequating methods are widely implemented in K-12 educational settings to ensure scores can be used interchangeably (Tong, Wu, & Xu, 2008). Psychometricians may use IRT to preequate results prior to the start of examination administration, which assuages the tight turnaround time between examination administration and score release. Alternatively, postequating methods use response data from complete current examination administrations (Kolen & Brennan, 2014).
In IRT preequating methods, item parameters are linked from prior calibration(s) to the base form of an examination. For the purpose of this research, item difficulties will be the only item parameter used, which is in alignment with the testing program’s psychometric framework (the Rasch model). The base form (denoted as Form Y) is the form in which the cut-score was established. In order to implement preequating methods, item difficulties for scorable items must be estimated prior to examination administration. Prior to ATA, scorable item difficulties must be known to calculate and maximize the TIF. The alignment of previously calibrated item statistics that are used both for assembling forms using ATA and for preequating may support the applicability of this equating method.
Measurement Models and ATA
IRT allows test developers to "design tests to different sets of specifications and delegate their actual assembly to computer algorithms" (van der Linden, 2005, p. 11). By setting constraints for computerized test assembly, including blueprint domain representation or reasonable ranges for item statistics, test developers can create multiple forms of an examination that are parallel in difficulty. ATA can incorporate item details regardless of the psychometric paradigm used to calibrate or score examinees and can be applied to polytomously or dichotomously scored examinations. As discussed previously, this study uses data previously calibrated using the Rasch model.
While CTT, IRT, and Rasch approaches to ATA can utilize population-dependent item statistics (i.e., p-values and discrimination indices) as constraints, CTT has no metric equivalent to the TIF. In order to construct parallel test forms in ATA, Armstrong, Jones, and Wang (1994) maximized score reliability through a network-flow model. The authors stated that the CTT approach was advantageous because it was computationally less expensive and produced results comparable to the IRT approach to ATA. When that research was published, computational power was indeed a challenge; however, advances in computer memory and technology mean the cited advantage has not stood the test of time. As such, IRT and Rasch approaches to ATA are better supported in the literature and are the focus of this study.
Prior to beginning ATA, test assemblers must calibrate response data to estimate item parameters from the sample population. Psychometricians often examine the goodness-of-fit of the data to determine the best IRT model (i.e., 1-PL, 2-PL) or confirm that the data fit the Rasch model. Once the examination is administered, psychometricians anchor item parameters based on prior calibrations to estimate examinee ability (van der Linden, 2005). In this study, I calibrate data using the Rasch model and will provide some evidence supporting the appropriateness of the model.
In ATA, test assemblers often optimize the TIF and evaluate the similarity of the forms by comparing the TIFs. However, even well-matched TIFs do not necessarily yield equitable score distributions (van der Linden, 2005). Thus, psychometricians must also continuously evaluate and monitor the score distributions once the examination forms are administered. The main question of this study is which IRT equating method (IRT observed score, IRT true score, or IRT preequating) yields the most comparable scores and decision consistencies when transitioning from MTA to ATA. In the following section, I provide a foundation of linking, equating, and scale linking as it pertains to this study.
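The ATA procedure described later in this study involved roughly 70 content and statistical constraints and formal optimization of the TIF. Purely to make the idea of constrained assembly concrete, the toy sketch below uses a simple greedy heuristic that selects items with the most information at an assumed cut-score while satisfying a per-domain quota; the item pool, cut-score, and constraint settings are all hypothetical, and production ATA programs typically rely on mixed-integer optimization rather than a greedy pass.

```python
import numpy as np

def rasch_info(theta, b):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

def greedy_assemble(pool, theta_cut, form_length, min_per_domain):
    """Toy greedy ATA: pick items with the most information at the cut-score,
    subject to a minimum count per content domain."""
    remaining = list(pool)
    form = []
    # First satisfy the content quota within each domain.
    for domain, quota in min_per_domain.items():
        candidates = [it for it in remaining if it["domain"] == domain]
        candidates.sort(key=lambda it: rasch_info(theta_cut, it["b"]), reverse=True)
        for it in candidates[:quota]:
            form.append(it)
            remaining.remove(it)
    # Then fill the rest of the form by information at the cut-score.
    remaining.sort(key=lambda it: rasch_info(theta_cut, it["b"]), reverse=True)
    form.extend(remaining[: form_length - len(form)])
    return form

# Hypothetical item pool: difficulty and content domain per item.
rng = np.random.default_rng(1)
pool = [{"id": i, "b": rng.normal(0, 1), "domain": rng.choice(["A", "B"])} for i in range(50)]
form = greedy_assemble(pool, theta_cut=0.4, form_length=10, min_per_domain={"A": 3, "B": 3})
print(sum(rasch_info(0.4, it["b"]) for it in form))   # TIF of the assembled form at the cut-score
```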
Equating
Equating is the special case of linking in which psychometricians transform sets of scores from different assessment forms onto the same scale. By definition, equating methods are only applied to assessment forms that have the same psychometric and statistical properties and test specifications. The primary goal of equating is to allow scores to be used interchangeably, regardless of the form that an examinee was administered (Holland & Dorans, 2006; Kolen & Brennan, 2014).
Assessment programs can employ a variety of equating designs and methods, each with unique characteristics and assumptions, and often administer examinations both within and across years. For the purposes of this section, I denote an original form of an examination as Y and a new form as X, with the understanding that assessment programs may administer multiple new forms (X1, X2, ...) or multiple original forms (Y1, Y2, ...). The common-item nonequivalent groups (CINEG) design is commonly used; it requires that previously administered items from an original form be included on new forms as a set of common, or anchor, items. The CINEG design is considered more secure than the random groups design because only a set of common items from an original form is exposed, rather than an entire original form.
The CINEG design accounts not only for the difference in form difficulty but also for the difference in the populations of test-takers. The statistical role of the common items is to control for differences in the populations, thereby removing bias from the equating function. In order to implement the common items design, the common items must meet several requirements (Dorans et al., 2010). First, the common items must follow the same content and statistical specifications as the full-length test. Second, there should be a strong positive correlation between scores on the full-length test form and scores on the common items, because the common items follow the same specifications as the full-length test. Third, measurement and administration conditions for the common items must be similar across new and original forms. Lastly, prior research recommends that common item sets include at least 20% of the full-length test or at least 30 items (Angoff, 1971; Kolen & Brennan, 2014). Satisfying these requirements ultimately ensures that the reported scores, and the decisions based on them, are accurate and reliable. The testing program used for this study meets the conditions described above.
IRT equating methods can be applied to data calibrated using the Rasch model and are the focus of this study. In this section, IRT equating methods are discussed in detail; however, first psychometricians must use scale linking procedures to examine the relationship between newly estimated item parameters and original estimations of item parameters from two independent calibrations. Due to the assumption of item invariance, if item parameters are known, no equating or scale linking is necessary and IRT preequating methods can be implemented prior to test administration (Hambleton et al., 1991). However, in practice it is important to implement scale linking procedures because there are often differences in item parameter estimates (Stocking, 1991).
Scale linking is the process by which independently calibrated item difficulties are placed onto a common scale. Several methods can be used to calculate scaling constants in order to place the item difficulties from form X on the scale of form Y (Hambleton et al., 1991). The mean/mean, mean/sigma, and TCC methods are discussed as they apply to this study. Prior research supports the performance of TCC methods over other methods (i.e., mean/mean or mean/sigma) for scale linking because of the stability and precision of their results, even when item parameters have modest standard errors (Kolen & Brennan, 2014; Li et al., 2012). Other research has investigated the adequacy of different scale linking procedures within the Rasch model.
The mean/sigma method calculates scaling constants A and B from the means and standard deviations of the difficulty parameters of the common items on forms X and Y. There are two main TCC scale linking procedures, both iterative processes that use the item parameter estimates; the focus of the current study is the Stocking and Lord (1983) procedure. This method exploits the scale indeterminacy property of IRT: an examinee with a given ability has the same probability of answering an item correctly regardless of the scale used to report scores. The Stocking and Lord TCC procedure computes, for each common item i, the probability of a correct response on the original (form Y) scale and on the transformed new (form X) scale, taking examinee ability into account. Equation 10 gives the difference, D(θa), between the common-item TCCs on form Y and form X, and an iterative process then solves for A and B by minimizing F in Equation 11 across all examinees.
$D(\theta_a) = \sum_{i \in V} P_i(\theta_a;\, b_i^{Y}) - \sum_{i \in V} P_i(\theta_a;\, A\,b_i^{X} + B)$ (10)

$F = \frac{1}{N}\sum_{a=1}^{N} \left[D(\theta_a)\right]^2$ (11)
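The following is a minimal sketch of the three scale linking procedures under the Rasch model, assuming hypothetical common-item difficulties from two independent calibrations: the mean/mean and mean/sigma constants are computed in closed form, and a Stocking-Lord criterion in the spirit of Equations 10 and 11 is minimized numerically. This is an illustration only, not the ST program used operationally.

```python
import numpy as np
from scipy.optimize import minimize

def rasch_p(theta, b):
    """Rasch probabilities for a grid of thetas (rows) and item difficulties (columns)."""
    return 1.0 / (1.0 + np.exp(-np.subtract.outer(theta, b)))

def mean_mean(b_x, b_y):
    """Mean/mean constants; under the Rasch model all slopes are 1, so A = 1."""
    return 1.0, float(np.mean(b_y) - np.mean(b_x))

def mean_sigma(b_x, b_y):
    """Mean/sigma constants placing form X difficulties on the form Y scale."""
    A = np.std(b_y, ddof=1) / np.std(b_x, ddof=1)
    B = np.mean(b_y) - A * np.mean(b_x)
    return float(A), float(B)

def stocking_lord(b_x, b_y, theta_grid=np.linspace(-4, 4, 41)):
    """Stocking-Lord TCC criterion: choose A, B minimizing the squared difference
    between the common-item TCCs on the Y scale and the transformed X scale."""
    def loss(params):
        A, B = params
        tcc_y = rasch_p(theta_grid, np.asarray(b_y)).sum(axis=1)
        tcc_x = rasch_p(theta_grid, A * np.asarray(b_x) + B).sum(axis=1)
        return float(np.mean((tcc_y - tcc_x) ** 2))
    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x

# Hypothetical common-item difficulties from two independent calibrations.
b_y = np.array([-1.0, -0.4, 0.1, 0.6, 1.2])
b_x = b_y - 0.3 + np.random.default_rng(2).normal(0, 0.05, size=5)
print(mean_mean(b_x, b_y), mean_sigma(b_x, b_y), stocking_lord(b_x, b_y))
```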
Once item parameters are on the same scale, IRT equating methods can be employed. IRT true score equating is the most commonly used IRT equating method. In IRT true score equating, the true score τ is the expected number-correct score for an examinee i with ability θi (Kolen & Brennan, 2014). True score equating also assumes that there are no omitted responses (von Davier & Wilson, 2007). In a simple example, psychometricians first identify a true score on form X and estimate the corresponding ability level (Equation 12); the true score on form Y is then determined at that ability level (Equation 13). The equivalent score is therefore obtained through the inverse of the form X true-score function. The process is iterative and typically uses the Newton-Raphson method (Han et al., 1997; Kolen & Brennan, 2014).
$x = \tau_X(\theta_i) = \sum_{j \in X} P_j(\theta_i)$ and $\theta_i = \tau_X^{-1}(x)$ (12)

$\tau_Y(\theta_i) = \sum_{j \in Y} P_j(\theta_i)$ (13)
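As a concrete illustration of Equations 12 and 13, the sketch below performs Rasch true-score equating for interior raw scores by numerically inverting the form X TCC; the short forms and difficulty values are hypothetical, and raw scores of 0 or all-correct need special handling that is omitted here.

```python
import numpy as np
from scipy.optimize import brentq

def tcc(theta, difficulties):
    """True score (expected number-correct) at theta under the Rasch model."""
    return np.sum(1.0 / (1.0 + np.exp(-(theta - np.asarray(difficulties)))))

def true_score_equate(x_score, b_x, b_y):
    """IRT/Rasch true-score equating: find theta with tau_X(theta) = x_score,
    then return the form Y true score at that theta (Equations 12 and 13)."""
    theta = brentq(lambda t: tcc(t, b_x) - x_score, -8.0, 8.0)
    return tcc(theta, b_y)

# Hypothetical short forms already placed on a common scale.
b_x = [-1.0, -0.3, 0.2, 0.7, 1.3]
b_y = [-1.1, -0.4, 0.0, 0.6, 1.1]
for x in range(1, len(b_x)):          # interior raw scores only
    print(x, round(true_score_equate(x, b_x, b_y), 3))
```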
Unlike the IRT true score equating method, the IRT observed score equating method depends on the distribution of examinee abilities. It is similar to equipercentile equating without additional smoothing techniques, as previously discussed, and it requires specifying the distributional characteristics of examinees prior to equating, typically through a prior ability distribution (Kolen & Brennan, 2014).
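The observed score method requires the model-implied number-correct distribution at each ability level. A standard way to obtain it is the Lord-Wingersky recursion, sketched below with a hypothetical form and a hypothetical discrete ability distribution; the final equipercentile step of IRT observed score equating is omitted.

```python
import numpy as np

def rasch_p(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def lord_wingersky(theta, difficulties):
    """Lord-Wingersky recursion: distribution of the number-correct score at theta."""
    dist = np.array([1.0])                       # P(score = 0) before any items
    for b in difficulties:
        p = rasch_p(theta, b)
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1.0 - p)             # item answered incorrectly
        new[1:] += dist * p                      # item answered correctly
        dist = new
    return dist                                   # dist[x] = P(X = x | theta)

# Marginal observed-score distribution: weight the conditional distributions by
# a (hypothetical) discrete theta distribution, as in IRT observed score equating.
theta_grid = np.linspace(-4, 4, 21)
weights = np.exp(-0.5 * theta_grid ** 2)
weights /= weights.sum()
b_form = [-1.0, -0.3, 0.2, 0.7, 1.3]
marginal = sum(w * lord_wingersky(t, b_form) for w, t in zip(weights, theta_grid))
print(np.round(marginal, 4), marginal.sum())     # probabilities sum to 1
```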
Both of the IRT equating methods just described require data from the current test administration cycle. The IRT preequating method, in contrast, can be used when items are pretested prior to operational use. Once items are on the same scale, psychometricians generate raw-to-scale conversion tables prior to form administration, which reduces the workload at score release (Kolen & Brennan, 2014). Many testing organizations use IRT preequating to shorten the window between examination administration and score release. Testing organizations may also prefer IRT preequating because of its flexibility for computer-based examinations administered intermittently over a long testing cycle.
Researchers have compared the results of various equating designs, methodologies, and procedures, yet none have compared scale linking or equating outcomes between ATA and MTA forms. In the current study, RMSD, MSD, and MAD were calculated to examine the error and bias associated with scores; these indices are commonly used to evaluate the comparability of equating methods (Antal, Proctor, & Melican, 2014; Gao, He, & Ruan, 2012; Kolen & Harris, 1990).
Decision Consistency
In the context of testing programs that aim to categorize examinees into one or more groups based on their scores, such as medical licensure examinations, classification accuracy is a measurement of whether examinees were accurately classified based on their true ability (Lee, 2010).
Research Questions
The goal of this study is to compare the equating results when a testing organization moved from MTA to ATA. The research questions address the comparability of outcomes from three different methodological approaches to equating, after combining three IRT equating methods with three scale linking procedures.
- Which method of IRT equating (e.g., IRT observed score, IRT true score, or IRT preequating methods) minimizes error and bias associated between MTA and ATA developed forms?
- Which method of IRT equating (e.g., IRT observed score, IRT true score, or IRT preequating methods) yields the highest expected decision consistency of pass/fail distinctions between MTA and ATA developed forms?
METHOD
Data
In this study I used two years of response data from a large-scale medical licensure examination. From the 36 Y forms, I selected four forms (denoted Y1-Y4). There was item overlap among the 36 Y forms, which made it possible to calibrate data from all 36 Y forms concurrently using the Rasch model.
First, I aggregated key information for each form; pretest items were not embedded within each Y form. The pretest design used a total of 12 unique pretest blocks, each consisting of 50 items, with overlap among blocks. The test administration vendor randomly assigned pretest blocks to examinees, so pretest items needed to be reviewed and selected for the form-by-form calibration under the CINEG design. Figure 5 shows the design of an intact Y form (denoted form A) of operational items and a plausible assignment of six pretest blocks. In Figure 5, form A consists only of operational items; the test vendor randomly assigned a pretest block from group A (PTA) and a pretest block from group B (PTB). The diagram is a simplified depiction of the true design, which can ultimately yield more than 5,100 different combinations. I therefore employed a threshold of 30 responses to determine which pretest items had sufficient exposure for inclusion in the form-by-form calibration. Even with item difficulties anchored, requiring at least 30 exposures ensured there were sufficient data to investigate data-model fit.
The concurrent calibration of the operational and pretest items on the Y forms placed the item difficulties on the same scale of measurement. Additionally, I selected four X forms (denoted X1-X4). I used three criteria to select the eight forms for this study: (a) X forms with the highest volume of administrations after the first several weeks following the examination launch, (b) X and Y form pairs with at least 20 percent overlap or at least 30 common items for scale linking purposes, and (c) common item sets that were representative of the test blueprint (Angoff, 1971; Kolen & Brennan, 2014). The data design is shown in Figure 6; each selected X form shares a common item set with its corresponding Y form (Y1 with X1, Y2 with X2, Y3 with X3, and Y4 with X4).
Approximately 7,600 examinees took one of the Y forms in year 1, and 4,300 first-time examinees took one of the X forms in the first testing window of year 2. After selecting the four X forms as described above, I used data from the approximately 2,500 examinees who were administered one of them (roughly 1,300 on two of the forms and 1,200 on the other two). Table IV summarizes the data selected for this study.
Response data from year 1 on all 36 Y forms were concurrently calibrated using WINSTEPS® (Linacre, 2017). The estimated item difficulties were then used as anchors for each separate calibration of the X forms. Y is considered the base form of the examination, and therefore no equating was conducted on the original forms.
Data Analyses
All data management and analyses were conducted in RStudio (2016), unless otherwise specified. A criterion of α = .05 was used to examine the statistical significance of tests, unless otherwise specified.
1. Research Question 2: Equating Methods and Error
Which method of IRT equating (e.g., IRT observed score, IRT true score, or IRT preequating methods) minimizes error and bias associated between MTA and ATA developed forms?
I employed three scale linking procedures (mean/mean, mean/sigma, and Stocking-Lord TCC) and three equating methods (IRT observed score, IRT true score, and IRT preequating). I used the PIE computer program to implement the IRT observed score and IRT true score equating methods (Hanson, Zheng, & Cui, 2004a). To assess the equating results, I compared the root mean squared difference (RMSD), mean absolute difference (MAD), and mean signed difference (MSD) of the equated X scores relative to Y. I evaluated which method minimizes bias by identifying RMSD values close to 0 and which method minimizes error by identifying MSD and MAD values close to 0; higher indices indicate an accumulation of error and are not preferred. Prior research has found that IRT preequating methods often carry higher levels of error in examinee scores. However, because the same precalibrated item difficulties are used both for ATA and for preequating, I expected the ATA design to influence the equated results.
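For clarity, the three comparison indices can be computed as in the short sketch below; the score vectors shown are placeholders, not study data.

```python
import numpy as np

def equating_indices(equated, criterion):
    """Bias and error summaries comparing equated raw scores with base-form equivalents."""
    d = np.asarray(equated, dtype=float) - np.asarray(criterion, dtype=float)
    return {
        "RMSD": float(np.sqrt(np.mean(d ** 2))),   # root mean squared difference (bias)
        "MSD": float(np.mean(d)),                  # mean signed difference
        "MAD": float(np.mean(np.abs(d))),          # mean absolute difference
    }

# Hypothetical equated scores from one method versus the base-form equivalents.
print(equating_indices([101.2, 98.7, 110.4], [100.0, 99.5, 109.0]))
```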
2. Research Question 3: Equating Methods and Passing Rates
Which method of IRT equating (e.g., IRT observed score, IRT true score, or IRT preequating methods) yields the highest expected decision consistency of pass/fail distinctions between MTA and ATA developed forms?
Using the outcomes from research question 1, I estimated decision consistency indices using Huynh's (1990) methodology, which uses the probability density function, the item characteristic functions (ICFs), and the relative frequencies of a single population to estimate two common decision consistency indices: a raw agreement index, p, and kappa, κ (see Equations 18 and 19). The raw agreement index p is calculated from the cumulative distribution function of test scores and the relative frequencies of test scores. Kappa is calculated from the difference between the raw agreement index p and pc, the expected proportion of consistent decisions if there were no relationship between test scores; kappa therefore indicates the decision consistency beyond what is expected by chance (Subkoviak, 1985).
$\hat{p} = \sum_{\theta} f(\theta) \sum_{h=1}^{k} \left[\Delta_h(\theta)\right]^2$ (18)

where

$\hat{\kappa} = \dfrac{\hat{p} - \hat{p}_c}{1 - \hat{p}_c}$, (19)

$\hat{p}_c = \sum_{h=1}^{k} \left[\sum_{\theta} f(\theta)\, \Delta_h(\theta)\right]^2$, (20)

and

$\Delta_h(\theta) = F(c_h \mid \theta) - F(c_{h-1} \mid \theta)$, (21)

where $\theta$ represents the ability level at a given raw score; $\Delta_h(\theta)$ represents the difference in cumulative distribution functions at the raw cut-scores, evaluated at ability level $\theta$; $f(\theta)$ represents the relative frequency distribution at $\theta$; and $k$ represents the number of classifications.
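The sketch below computes the raw agreement index and kappa from conditional classification probabilities and a relative frequency distribution, mirroring the structure of Equations 18-21 as reconstructed above; Huynh's (1990) Rasch-based derivation of the conditional probabilities themselves is not reproduced, and the inputs shown are invented.

```python
import numpy as np

def decision_consistency(class_probs, freqs):
    """Raw agreement (p) and kappa from conditional classification probabilities.
    class_probs[t, h] = probability that an examinee at ability level t is classified
    into category h on a single administration; freqs[t] = relative frequency of level t."""
    class_probs = np.asarray(class_probs, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    # Agreement across two independent parallel administrations, weighted over ability.
    p = float(np.sum(freqs * np.sum(class_probs ** 2, axis=1)))
    marginal = freqs @ class_probs                   # marginal probability of each category
    p_chance = float(np.sum(marginal ** 2))          # agreement expected by chance
    kappa = (p - p_chance) / (1.0 - p_chance)
    return p, kappa

# Hypothetical pass/fail classification probabilities at five ability levels.
class_probs = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]   # [P(fail), P(pass)]
freqs = [0.1, 0.2, 0.4, 0.2, 0.1]
print(decision_consistency(class_probs, freqs))
```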
RESULTS
Research Question 2
To evaluate the adequacy of the results, I calculated the RMSD, MSD, and MAD. RMSD is a measure of bias, and MSD and MAD are measures of random error; values closer to 0 indicate smaller raw score differences between MTA and ATA forms. Overall, there were large differences in the amount of bias and error across forms and equating methods, so RMSD, MSD, and MAD are presented separately for each form (see Table XV). Across all forms, the scale linking and equating combination with the least error and bias was the mean/mean preequating method.
Table XV
BIAS AND ERROR INDICES BY EACH SCALE LINKING AND EQUATING METHOD

| Form | Index | Obs MM | Obs MS | Obs SL | True MM | True MS | True SL | Pre MM | Pre MS | Pre SL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | RMSD | 19.35 | 20.46 | 21.85 | 19.51 | 20.55 | 22.10 | **8.02** | 8.41 | 8.278 |
| 1 | MSD | -18.83 | -20.27 | -21.38 | -18.97 | -20.35 | -21.62 | **-7.71** | -8.05 | -7.96 |
| 1 | MAD | 18.83 | 20.27 | 21.38 | 18.97 | 20.35 | 21.62 | **7.71** | 8.05 | 7.96 |
| 2 | RMSD | 10.37 | 5.47 | 9.40 | 10.35 | 5.70 | 9.40 | **4.74** | 5.24 | 4.91 |
| 2 | MSD | -10.31 | **-4.18** | -9.35 | -10.29 | -4.29 | -9.34 | -4.31 | -4.75 | -4.46 |
| 2 | MAD | 10.31 | 4.34 | 9.35 | 10.29 | 4.49 | 9.34 | **4.32** | 4.75 | 4.47 |
| 3 | RMSD | 2.53 | 4.57 | **1.53** | 2.51 | 4.53 | **1.53** | 2.51 | 2.96 | 2.68 |
| 3 | MSD | -2.32 | -4.45 | -0.62 | -2.27 | -4.39 | **-0.58** | -1.69 | -2.08 | -1.89 |
| 3 | MAD | 2.38 | 4.48 | 1.29 | 2.35 | 4.41 | **1.28** | 1.99 | 2.34 | 2.15 |
| 4 | RMSD | 12.42 | 12.46 | 15.91 | 12.92 | 13.03 | 16.84 | **2.55** | 3.01 | 2.82 |
| 4 | MSD | -11.92 | -11.83 | -15.29 | -12.30 | -12.26 | -16.02 | **-1.90** | -2.27 | -2.20 |
| 4 | MAD | 11.92 | 11.83 | 15.29 | 12.30 | 12.26 | 16.02 | **2.07** | 2.43 | 2.31 |

Note. Obs = IRT observed score equating, True = IRT true score equating, Pre = IRT preequating. MM represents mean/mean scale linking, MS represents mean/sigma scale linking, and SL represents the Stocking and Lord TCC scale linking procedure. Because index values differ substantially across forms, results are shown for each form separately. Boldface signifies the most favorable result (value closest to 0) for each index within each form (by row).
Preequating Method
Across the three equating methods paired with the three scale linking procedures, the mean/mean scale linking procedure with the preequating method performed most favorably for three of the four forms (forms 1, 2, and 4). For form 1, the mean/mean preequating method resulted in lower bias and error than all other methods (RMSD = 8.02, MSD = -7.71, and MAD = 7.71), whereas the greatest bias was associated with the Stocking and Lord TCC procedure paired with the IRT true score equating method (RMSD = 22.10, MSD = -21.62, MAD = 21.62). For form 2, the mean/mean preequating method produced the most favorable results overall (RMSD = 4.74, MSD = -4.31, and MAD = 4.32), although the Stocking and Lord TCC procedure paired with the preequating method produced values only slightly higher (within 0.5 raw score points). This small difference of less than 0.5 raw score points in RMSD, MSD, and MAD between the scale linking procedures within the preequating method was present across all forms. For form 3, the mean/mean preequating method produced a slightly higher RMSD than the Stocking and Lord true score equating method (RMSD = 2.51 vs. 1.53), a difference of about 1 raw score point. Therefore, the mean/mean preequating method showed a slight improvement over the other scale linking procedures within the preequating method, although there was little practical difference across the scale linking procedures.
True Score and Observed Score Equating Methods
The results from the true score and observed score equating methods with each scale linking procedure were comparable across all forms. For form 1, the true and observed score methods yielded very consistent results: the mean/mean observed score and mean/mean true score methods produced similar levels of error (MSD = -18.83 and MSD = -18.97, respectively), and the maximum deviation in MSD between the mean/sigma true score and observed score methods was approximately 0.35 raw score points. For form 2, the Stocking and Lord TCC scale linking procedure paired with the observed score and true score methods produced similarly high amounts of error (RMSD = 9.40 for both). Unique to form 3, the Stocking and Lord true score and observed score methods produced the lowest bias (RMSD = 1.53) and error (MAD = 1.28 and 1.29, respectively) of all conditions. The results from combining each scale linking procedure with the true score and observed score methods varied across forms: in some cases the Stocking and Lord TCC procedure performed least favorably (forms 1 and 4), while in others the mean/sigma procedure performed least favorably (form 3). The findings are therefore inconclusive as to the preferred scale linking procedure for the IRT observed score and true score methods, although the evidence suggests that the Stocking and Lord TCC procedure produced higher levels of error for two forms.
Research Question 3
Overall, the mean/mean preequating method and the Stocking and Lord TCC preequating method performed the most favorably. The true score and observed score methods produced similar levels of decision consistency, indicating little practical difference between them.
Figure 10. Mean decision consistency indices, p (blue) and κ (red), across all forms. MM represents mean/mean scale linking, MS represents mean/sigma scale linking, and SL represents the Stocking and Lord TCC scale linking procedure.
The average decision consistency indices across all ATA forms improved relative to baseline estimates from MTA forms. For example, the raw agreement index was 1% to 3% greater for ATA forms than for MTA forms. For three of the four forms, the raw agreement index was highest for the preequating methods. Consistent with the findings for research question 2, the decision consistency evaluation indicated that the Stocking and Lord TCC scale linking procedure paired with the IRT true and observed score methods performed most favorably for form 3 (see Figure 11). Results for each form and equating method are displayed in Appendix B.
Discussion
To examine the differences in equating outcomes introduced by transitioning from MTA to ATA, I employed three scale linking procedures and three equating methodologies. I calculated the RMSD, MSD, and MAD to determine which combination of scale linking procedure and equating method resulted in the least bias and error. The preequating method with the mean/mean scale linking procedure produced the most favorable results for three of the four forms, even when the number of common items did not meet recommended criteria. Results from the decision consistency analyses indicated that the preequating method outperformed the true score and observed score equating methods in terms of the raw agreement index, whereas the true score and observed score methods produced the most favorable decision consistency in terms of kappa. The variation in the error of equated scores across forms suggests that MTA and ATA forms cannot be directly compared. If testing organizations begin to implement ATA for form assembly, they should give thoughtful consideration to the continued use of MTA forms as the base forms for equating purposes.
Research Question 2: Equating Methods and Error
Because IRT equating methods are implemented after scale linking, the quality of the equating results depends on the quality of the scale linking results. The common item sets were sufficiently sized only for form pairs 1 and 3, so the generalizability of the equating results for forms 2 and 4 is limited. Note, however, that all previously known item statistics were used for preequating, not just those included in the common item set.
It is important to note that raw scores were used to evaluate the error and bias associated with the equated scores for each method. Overall, the optimal method for three of the four forms was the mean/mean scale linking procedure paired with the preequating method, which produced the lowest bias as measured by RMSD. Although other methods produced slightly lower MSD or MAD in some cases, the differences were neither appreciable nor practical (typically less than 0.2 raw score points). For the two forms with sufficient common item sets, the mean/mean preequating method and the Stocking and Lord TCC true score method produced the most favorable results. Similarly favorable findings for the mean/mean preequating method were observed for the remaining forms; that is, despite an insufficient number of items for scale linking purposes, the results still supported preequating methods. This may be due in part to the similarity between the Rasch equating model and the mean/mean preequating method.
There were large differences in the magnitude of RMSD among the true score, observed score, and preequating methods across forms. Specifically, form 1 had the highest RMSD values across all equating methods, whereas form 3 had the lowest. The differences in RMSD across forms provide additional evidence that the new and original forms were not built to the same statistical specifications.
Results from prior research have indicated that preequating methods perform poorly in comparison with postequating methods. Kolen and Harris (1990) reported that the IRT preequating method produced the highest RMSD and MSD values relative to IRT postequating methods. Tong and Kolen (2005) compared the adequacy of equated scores from traditional equipercentile, IRT true score, and IRT observed score equating methods using three criteria and found that the IRT true score method performed less favorably than the IRT observed score method. In this respect, the results of the current study disagree with previous literature. The goal of the current study, however, was to evaluate differences in equating outcomes between MTA and ATA forms. No prior studies have compared outcomes from different equating methods when a testing organization transitioned from MTA to ATA, so the inconsistency with prior literature may relate to the change in test development procedures, differences in psychometric framework (i.e., IRT 2-PL versus Rasch model), or differences in the nature and purpose of the testing program (i.e., K-12 versus medical licensure). For example, much of the equating literature is based on K-12 assessment programs, which are built to different test specifications because their purpose may be to evaluate and monitor student growth rather than to pass or fail examinees. These programs often differ from medical licensure examinations in other ways as well, including shorter administration windows and mode of delivery. The difference in results can also be explained by the utility of a purposeful equating design. Although the testing program used for this study did not operationally implement a CINEG design, I employed that design by selecting data that conformed to its requirements. Two of the four forms had sufficiently sized common item sets, yet the favorable findings for the preequating method held for three of the four forms. This may be because the preequating methodology relied on a quality bank of linked items rather than on a small common item set.
The RMSD, MAD, and MSD are commonly used measures for gaining an overall understanding of the differences between equated scores and scores on the base form. The standard error of equating is another commonly used approach for evaluating the adequacy of equating results and can provide a better understanding of the error across the distribution of equated scores; it uses hypothetical replicate samples to approximate the standard deviation of each equated score (Kolen & Brennan, 2014). Future research can expand on this study by calculating the standard error of equating for the IRT true and observed score equating methods.
The results for this research question suggest that, before implementing ATA, the equating design should be thoroughly considered in light of the purpose of the assessment. In keeping with best practices in test assembly, whenever new test development procedures are implemented, results processing should be carefully considered (AERA, APA, & NCME, 2014). Changes in test assembly procedures necessitate reviewing and evaluating current psychometric procedures (e.g., standard setting and equating designs). A key recommendation for practitioners is to discontinue using MTA forms as the base form when an organization newly implements ATA procedures, as the findings of this research suggest that MTA forms vary too much in their statistical specifications to support their continued use as the base form.
Research Question 3: Equating Methods and Decision Consistency
Although there are many ways to evaluate decision consistency, it was measured here using two estimates: the proportion of raw agreement, p, and the kappa index, κ, which corrects the raw agreement index for chance agreement. Both the mean/mean preequating method and the Stocking and Lord TCC preequating method produced the highest raw agreement indices across forms. This finding supports the results for research question 2, in which the same methods showed the lowest error and bias. The raw agreement indices for the other equating methods differed only slightly within each form, typically within 1%. In comparison, the kappa estimates were inconsistent across forms (see Appendix B). Moreover, the decision consistency index for ATA forms was 1% to 3% greater than for MTA forms, providing additional evidence that ATA enhances the psychometric properties of examinations that make pass/fail distinctions.
Although decision consistency is an important aspect for psychometricians to explore, it does not fully describe the outcomes of examinations with pass/fail decisions. In particular, without simulation studies in which true ability is known, one cannot know for certain that decisions are accurate. Although there are ways to explore and provide evidence of decision accuracy, doing so was beyond the scope of the current study. Future research is warranted on approaches such as Lee's (2010) for evaluating both decision consistency and decision accuracy when testing programs newly implement ATA.
Limitations
The testing program did not implement a CINEG design operationally; however, the data lent themselves to a CINEG design because of the use of anchor blocks in ATA. Although I confirmed key equating requirements and controlled for others by carefully selecting the forms used in the study, the common item sets were not a perfect representation of the content or statistical specifications of the entire test. Moreover, the pretest item design changed between the MTA and ATA forms, which may have influenced the findings: pretest item blocks were assigned randomly to examinees on Y forms, whereas pretest item blocks were embedded within each X form. Although understandable when using operational data, these factors limit the findings. Lastly, the 0.3-logit criterion used to establish the common item set is the one employed operationally, although alternative methods exist for identifying outlying items in a common item set. Future research is therefore warranted to address the replicability of this study when considering the purpose of the examination and equating designs in conjunction with test design decisions (i.e., ATA).
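As an illustration of that operational screen, the sketch below flags common items whose linked difficulties drift by more than 0.3 logits from their original estimates; the difficulty values and threshold handling are hypothetical and are shown only to make the criterion concrete.

```python
import numpy as np

def stable_common_items(b_old, b_new, max_drift=0.3):
    """Flag common items whose recalibrated (and linked) difficulties drift by
    more than `max_drift` logits from their original estimates."""
    drift = np.abs(np.asarray(b_new, dtype=float) - np.asarray(b_old, dtype=float))
    keep = drift <= max_drift
    return keep, drift

b_old = np.array([-1.00, -0.40, 0.10, 0.60, 1.20])
b_new = np.array([-0.95, -0.05, 0.15, 0.58, 1.55])   # hypothetical relinked estimates
keep, drift = stable_common_items(b_old, b_new)
print(keep, np.round(drift, 2))                       # the 2nd and 5th items would be excluded
```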
In addition, the assessment used in this study has unique and complex characteristics (e.g., test specifications, blueprint, and constraints) that may limit the generalizability of the results. For example, the forms assembled using ATA programs involved approximately 70 content and statistical constraints (e.g., domain representation, life stage of the patient, clinical setting, and mean item p-value) and maximized the TIF to create parallel forms. Although MTA and ATA forms were built according to the same domain representation, other variables were not as tightly controlled in MTA as they were in ATA. Furthermore, ATA applies the more than 70 constraints in a standardized way, whereas during MTA test developers loosened constraints one by one, form by form. The degree of similarity between ATA and MTA constraints is therefore expected to have influenced the findings of this study. The results shed light on the enhanced quality of ATA forms; this improvement, however, necessitates reevaluating whether to continue using MTA forms as the base examination. Future research may address the similarities and differences between MTA and ATA procedures by simulating ATA conditions and assessing the outcomes from different equating methods, which would provide insight into how the similarity (or difference) between assembly procedures influences outcomes from equating methodologies.
Prior researchers have developed a variety of models that can be used to implement optimal test design; a brief overview of the different models is given in van der Linden (1998). It is well documented that forms developed via ATA have more favorable psychometric properties than MTA forms because of the overall test design and the defining attributes of ATA (Luecht, 1998; van der Linden, 2005). This study provides some evidence that ATA creates more parallel test forms, not only in terms of content and statistical specifications, but also with respect to test information, data-model fit, and decision consistency. Yet very few studies have provided empirical evidence of the quality improvement of ATA over MTA. Given the growing popularity of ATA, more research is warranted on the replicability of this study (e.g., through simulation studies), on other psychometric advantages of implementing ATA, and on applications to assessment programs with different purposes and test designs.
Summary
The widespread implementation of ATA procedures has alleviated the workload of test developers by allowing computer programs to create multiple parallel test forms with relative ease. ATA procedures provide an efficient and cost-effective alternative for assembling parallel test forms simultaneously. The integral psychometric goal of ATA is to minimize the SEM and maximize test score reliability (van der Linden, 2005). However, ATA is not only a question of whether computer programs can ease the workload, but also whether they improve the psychometric quality of the assembled test forms. The results presented in this study provide empirical evidence of the improvement in psychometric quality of ATA forms. Whenever testing organizations newly implement test development practices, it is important to evaluate the outcomes (AERA, APA, & NCME, 2014). Assessing the adequacy of score outcomes from various equating methods is one way to investigate the relationship between psychometric quality and the new implementation of ATA programs. In this study, I evaluated the adequacy of different equating methods by estimating the bias, error, and decision consistency associated with score outcomes of newly developed ATA forms.
The context of evaluating equating methodologies with respect to test assembly procedures is important in today’s operational psychometric work as many testing organizations move towards ATA. Although testing organizations may utilize item parameters estimated using the Rasch model or IRT models for ATA, no previous research has connected differences in test assembly procedures to outcomes of equating methods. The results of this study further support the importance of planning and aligning the psychometric procedures to the test development procedures. The findings of this study suggest that the error and consistency of scores were related to the similarity in statistical test specifications. ATA led to the development of parallel forms that had better psychometric properties and less variation in content and statistical specifications than test forms assembled manually.
The results indicated that, despite the differences in statistical specifications, the mean/mean preequating method performed the most favorably. This finding may be explained by the alignment among the mean/mean preequating method, the Rasch psychometric framework, and ATA's use of the same item difficulties to build each form. Conceptually, the mean/mean preequating method is similar to the Rasch equating method, which anchors known item difficulties; the two approaches are therefore aligned. Furthermore, because ATA uses the same known item difficulties to build forms with similar TIFs that peak at the examination's cut-score, these methods are complementary and work in tandem. Future research should expand on these findings by investigating outcomes from similar equating methodologies when ATA forms are used as the base forms.
Cited Literature
Ali, U. S., & van Rijn, P. W. (2016). An evaluation of different statistical targets for assembling
parallel forms in item response theory. Applied Psychological Measurement, 40(3),
163-179.
Antal, J., Proctor, T.P., & Melican, G.J. (2014). The effect of anchor test construction on scale
drift. Applied Measurement in Education, 27(3), 159-172, doi: 10.1080/08957347.2014.905785
American Educational Research Association, American Psychological Association, National
Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (AERA, APA, & NCME). (2014). Standards for educational and psychological testing (pp. 75-94). Washington, DC: American Educational Research Association.
Armstrong, R. D., Jones, D. H., & Wang, Z. (1994). Automated parallel test construction using
classical test theory. Journal of Educational Statistics, 19(1), 73-90. doi:10.2307/1165178
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible
paradigms? In Smith, E. V., & Smith, R. M. (Ed). Introduction to Rasch Measurement: Theory, models and applications. Maple Grove, MN: JAM Press.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.),
Educational measurement (pp. 508–600). Washington, DC: American Council on Education.
Babcock, B., & Albano, A. D. (2012). Rasch scale stability in the presence of item parameter and
trait drift. Applied Psychological Measurement, 36(7), 565-580. doi:10.1177/0146621612455090
Bock, R. D. (1997). A brief history of item response theory. Educational Measurement: Issues and Practice, 16(4), 21-33.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York:
Holt, Rinehart, and Winston.
Debeer, D., Ali, U. S., & van Rijn, P. W. (2017). Evaluating the statistical targets for assembling parallel mixed-format test forms. Journal of Educational Measurement, 54(2), 218-242.
Dorans, N., Moses, T., & Eignor, D. (2010). Principles and practices of test score equating (Report No. RR-10-29). Princeton, NJ: ETS Research Report Series.
Eignor, D. R. & Stocking, M. L. (1986). An investigation of possible causes for the inadequacy
of IRT preequating (Report No 86-14). Princeton, NJ: ETS Research Report Series. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/j.2330-8516.1986.tb00169.x/epdf
Embretson, S. E., & Reise, S. P. (2009). Item response theory for psychologists. New York, NY:
Psychology Press.
Gao, R., He, W., & Ruan, C. (2012). Does preequating work? An investigation into a preequated
testlet-based college placement examination using post administration data (Report No 12-12). Princeton, NJ: ETS Research Report Series.
Hambleton, R., Swaminathan, H., & Rogers, H. (1991). Fundamentals of Item Response Theory.
Newbury Park, CA: Sage.
Hambleton, R., & Slater, S. (1997). Item response theory models and testing practices: Current international status and future directions. European Journal of Psychological Assessment, 13(1), 21-28. doi: 10.1027/1015-5759.13.1.21
Hanson, B.A., Zheng, L., & Cui, Z. (2004a). PIE: A computer program for IRT equating
[computer program]. Iowa City, IA: education.uiowa.edu/centers/casma
Hanson, B.A., Zheng, L., & Cui, Z. (2004b). ST: A computer program for IRT scale linking
[computer program]. Iowa City, IA: education.uiowa.edu/centers/casma
Han, T., Kolen, M., & Pohlmann, J. (1997). A comparison among IRT true- and observed-score
equatings and traditional equipercentile equating. Applied Measurement in Education, 10(2), 105-121. doi:10.1207/s15324818ame1002_1
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied
Psychological Measurement, 9(2), 139-164.
Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In Brennan, R. L. (Ed.),
Educational measurement (pp. 187-220). Westport, CT: Praeger.
Huynh, H. (1990). Computation and statistical inference for decision consistency indexes based
on the Rasch Model. Journal of Educational Statistics, 15(4), 353-368. doi:10.2307/1165093
Huynh, H., & Rawls, A. (2011). A comparison between robust z and 0.3-logit difference procedures in assessing stability of linking items for the Rasch model. Journal of Applied Measurement, 12(2), 96.
Karabatsos, G. (2017). Elements of psychometric theory lecture notes. Personal Collection of G.
Karabatsos, University of Illinois at Chicago, Chicago, Illinois.
Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating
tests. Journal of Educational Measurement, 18(1), 1-11.
Kolen, M. J., & Brennan, R. L. (2014). Test equating scaling and linking (3rd ed). New York,
NY: Springer.
Kolen, M. J., & Harris, D. J. (1990). Comparison of item preequating and random groups
equating using IRT and equipercentile methods. Journal of Educational Measurement, 27(1), pp. 27-30.
Lee, W. (2010). Classification consistency and accuracy for complex assessments using item
response theory. Journal of Educational Measurement, 47(1), pp. 1-17.
Li, D., Jiang, Y., & von Davier, A. A. (2012). The accuracy and consistency of a series of IRT
true score equatings. Journal of Educational Measurement, 49(2), 167-189. doi: 10.1111/j.1745-3984.2012.00167.x
Lin, C.-J. (2008). Comparisons between Classical Test Theory and Item Response Theory in
Automated Assembly of Parallel Test Forms. Journal of Technology, Learning, and Assessment, 6(8).
Linacre, J. M. (2017). Winsteps® Rasch measurement [computer program]. Beaverton, OR:
Winsteps.com.
Linacre, J. M. (2017). Fit diagnosis: Infit outfit mean-square standardized. Retrieved from http://www.winsteps.com/winman/misfitdiagnosis.htm
Livingston, S. L. (2004). Equating test scores (without IRT). Princeton, NJ: ETS. Retrieved
from: https://www.ets.org/Media/Research/pdf/LIVINGSTON.pdf
Lord, F. M. (1977). Practical applications of item characteristic curve theory. Journal of
Educational Measurement, 14(2), 117-138. doi:10.1111/j.1745-3984.1977.tb00032.x
Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Redding, MA:
Addison-Wesley Publishing Company.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings.” Applied Psychological Measurement, 8(4), 453-461. doi: 10.1177/014662168400800409
Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied Psychological Measurement, 22(3), 224-236. doi: 10.1177/01466216980223003
Mead, R. (2008). A Rasch primer: The measurement theory of Georg Rasch. Psychometrics
services research memorandum 2008–001. Maple Grove, MN: Data Recognition Corporation.
Penfield, R. D. (2005). Unique properties of Rasch model item information functions. Journal of
Applied Measurement, 6(4), 355-365.
O’Neill, T., Peabody, M., Tan, R. J. B. & Du, Y. (2013). How much item drift is too much?
Rasch Measurement Transactions, 27(3), 1423-1424. Retrieved from: https://www.rasch.org/rmt/rmt273a.htm
RStudio Team (2016). RStudio: Integrated development for R [computer program]. Boston, MA: RStudio, Inc. Retrieved from http://www.rstudio.com/
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen,
Danish Institute for Educational Research. Chicago: The University of Chicago Press.
Smith, E. V., Stearus, M., Sorenson, B., Huynh, H. & McNaughton, T. (2018). Rasch DC
[computer program]. Unpublished, Department of Educational Psychology, University of
Illinois at Chicago, Chicago, IL.
Smith, E. V. (2004). Evidence of the reliability of measures and validity of measure
interpretation: A Rasch measurement perspective. In Smith, E. V., & Smith, R. M. (Ed.),
Introduction to Rasch Measurement: Theory, models and applications. Maple Grove, MN: JAM Press.
Smith, E. V. (2005). Effect of item redundancy on Rasch item and person estimates. Journal of
Applied Measurement, 6(2), 147-163.
Smith, R. M. (2003). Rasch measurement models: Interpreting Winsteps and FACETS output.
Maple Grove, MN: JAM Press.
Smith, R. M. (1991). The distributional properties of Rasch item fit statistics. Educational
and Psychological Measurement, 51, 541-565.
Smith, R. M. & Kramer, G. (1992). A comparison of two methods of test equating in the Rasch
model. Educational and Psychological Measurement, 52, 835-846.
Stephens, M. A. (1974). EDF Statistics for Goodness of Fit and Some Comparisons, Journal of
the American Statistical Association, 69, pp. 730-737.
Subkoviak, M. J. (1985). Tables of reliability coefficients for mastery tests. Paper presented at
the Annual Meeting of the American Educational Research Association, Chicago, IL.
Stocking, M. (1991, April). An experiment in the application of an automated item selection
method to real data. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Boston, MA.
Taylor, C. S., & Lee, Y. (2010). Stability of Rasch scales over time. Applied Measurement in
Education, 23(1), 87-113. doi: 10.1080/08957340903423701
Tong, Y., & Kolen, M. J. (2005). Assessing equating results on different equating criteria.
Applied Psychological Measurement, 29(6), 418-432. doi: 10.1177/0146621606280071
Tong, Y., Wu, S-S., Xu, M. (2008, March). A comparison of preequating and post-equating
using large-scale assessment data. Paper presented at the Annual Conference of the American Educational and Research Association. New York City, NY.
van der Linden, W. J. & Hambleton, R. K. (1997). Handbook of modern item response theory.
New York, NY: Springer.
van der Linden, W. J. (2005). Linear models for optimal test design. New York, NY: Springer.
doi:10.1007/0-387-29054-0
van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22(3), 195-211. doi: 10.1177/01466216980223001
von Davier, A. A. (2010). Equating and Scaling. In Peterson, P., Baker, E. & McGaw, B. (Ed.), International Encyclopedia of Education. Amsterdam: Academic Press, pp. 50-55.
von Davier, A. A., & Wilson, C. (2007). IRT true score test equating: A guide through assumptions and applications. Educational and Psychological Measurement, 67(6), 940-957. doi: 10.1177/0013164407301543
Wightman, L. F. (1998). Practical issues in computerized test assembly. Applied Psychological
Measurement, 22(3), 292-302. doi: 10.1177/01466216980223009
Wilson, M. (1988). Detecting and interpreting local item dependence using a family of Rasch
models. Applied Psychological Measurement, 12, 353-364.
Wright, B. D. (1977). Solving measurement problems with the Rasch Model. Journal of
Educational Measurement, 14(2), 97-115.
Wright, B. D. & Mok, M. C. (2004). An overview of the family of Rasch Measurement
Models. In Smith, E. V., & Smith, R. M. (Ed.), Introduction to Rasch Measurement: Theory, models and applications. Maple Grove, MN: JAM Press.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit statistics. Rasch Measurement Transactions, 8(3), 370. Retrieved from: https://www.rasch.org/rmt/rmt83b.htm
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In Brennan, R. L. (Ed.), Educational measurement (4th ed.). Westport, CT: Praeger.
Yi, H. S., Kim, S., & Brennan, R. L. (2007). A method for estimating classification consistency
indices for two equated forms. Applied Psychological Measurement, 32(4), 275-291.
APPENDIX B
TABLE XVII
COMPLETE RESULTS FROM DECISION CONSISTENCY ANALYSES
| Form | Equating | Scale Linking |  | Error |  | Error |
|  | Baseline Comparison |  | 0.949 | 0.006 | 0.682 | 0.033 |
|  | OS | MM | 0.958 | 0.007 | 0.764 | 0.035 |
|  | OS | MS | 0.955 | 0.007 | 0.736 | 0.035 |
|  | OS | SL | 0.948 | 0.008 | 0.743 | 0.035 |
|  | TS | MM | 0.958 | 0.007 | 0.763 | 0.035 |
|  | TS | MS | 0.955 | 0.007 | 0.734 | 0.035 |
|  | TS | SL | 0.948 | 0.008 | 0.745 | 0.035 |
|  | PE | MM | 0.971 | 0.007 | 0.633 | 0.063 |
|  | PE | MS | 0.971 | 0.007 | 0.641 | 0.061 |
|  | PE | SL | 0.972 | 0.007 | 0.646 | 0.061 |

|  | Baseline Comparison |  | 0.958 | 0.006 | 0.657 | 0.040 |
|  | OS | MM | 0.958 | 0.007 | 0.649 | 0.049 |
|  | OS | MS | 0.967 | 0.006 | 0.714 | 0.048 |
|  | OS | SL | 0.963 | 0.007 | 0.651 | 0.052 |
|  | TS | MM | 0.958 | 0.007 | 0.652 | 0.049 |
|  | TS | MS | 0.966 | 0.006 | 0.712 | 0.047 |
|  | TS | SL | 0.963 | 0.007 | 0.651 | 0.052 |
|  | PE | MM | 0.971 | 0.006 | 0.697 | 0.053 |
|  | PE | MS | 0.969 | 0.006 | 0.697 | 0.052 |
|  | PE | SL | 0.970 | 0.006 | 0.699 | 0.052 |

|  | Baseline Comparison |  | 0.947 | 0.006 | 0.647 | 0.034 |
|  | OS | MM | 0.960 | 0.006 | 0.617 | 0.044 |
|  | OS | MS | 0.954 | 0.007 | 0.631 | 0.040 |
|  | OS | SL | 0.965 | 0.006 | 0.587 | 0.050 |
|  | TS | MM | 0.960 | 0.006 | 0.612 | 0.044 |
|  | TS | MS | 0.954 | 0.007 | 0.623 | 0.040 |
|  | TS | SL | 0.965 | 0.006 | 0.580 | 0.051 |
|  | PE | MM | 0.959 | 0.006 | 0.699 | 0.037 |
|  | PE | MS | 0.958 | 0.006 | 0.709 | 0.036 |
|  | PE | SL | 0.958 | 0.006 | 0.701 | 0.037 |

|  | Baseline Comparison |  | 0.935 | 0.009 | 0.719 | 0.034 |
|  | OS | MM | 0.955 | 0.007 | 0.791 | 0.029 |
|  | OS | MS | 0.955 | 0.007 | 0.793 | 0.029 |
|  | OS | SL | 0.950 | 0.007 | 0.802 | 0.026 |
|  | TS | MM | 0.956 | 0.007 | 0.803 | 0.028 |
|  | TS | MS | 0.958 | 0.007 | 0.814 | 0.027 |
|  | TS | SL | 0.952 | 0.007 | 0.818 | 0.025 |
|  | PE | MM | 0.963 | 0.007 | 0.678 | 0.044 |
|  | PE | MS | 0.963 | 0.007 | 0.697 | 0.042 |
|  | PE | SL | 0.963 | 0.007 | 0.686 | 0.044 |
Note. Boldface signifies maximum value. MM represents mean/mean scale linking, MS represents the mean/sigma scale linking, and SL represents the Stocking and Lord TCC scale linking procedure. OS represents IRT observed score equating, PE represents IRT preequating, and TS represents IRT true score equating.
Association of Standardized Patient Annual Conference, June 9-11, 2019
RESEARCH PRESENTATION:
Standardizing Judgment: A Qualitative Study of How SPs Co-Construct Meaning
This presentation reported on the results of a discourse analysis of 22 Standardized Patient (SP) interviews. The research received IRB approval through the University of California, San Diego. The research questions were: 1) How do SPs maintain “standardization” in role performance and assessment? and 2) To what degree do SPs adhere to standardization? The analysis concluded that 1) the term “standardization” is co-constructed by test developers, psychometricians, SP trainers, and SPs, 2) SP trainers employ non-standardized approaches in their training, and 3) SPs are highly invested in maintaining a standard of role portrayal and assessment, but the personal resources they bring to that work are highly subjective.
PHILADELPHIA, PA. The National Board of Osteopathic Medical Examiners (NBOME), an independent, not-for-profit organization that provides testing for osteopathic medical licensure and related healthcare professions, is excited to mark 2019 as the 85th anniversary of its founding.
Since its founding in 1934, and particularly in recent years, the NBOME has experienced substantial growth and opportunities given the many changes in healthcare, medical education and medical regulation. The NBOME’s assessment portfolio has grown to encompass the continuum of education, training, and continuous professional development for osteopathic medicine and a number of other healthcare professions, both in the United States and in other countries.
“We continue to be steadfast in our commitment to the NBOME’s mission and producing valid, reliable, and defensible national standardized assessment tools, and acknowledging the critical role that osteopathically distinctive assessment plays in protecting the public,” said John R. Gimpel, DO, MEd, NBOME President and CEO. “We are thankful for the strategic direction and leadership our Board of Directors has provided throughout our history, and the unparalleled support NBOME receives from the osteopathic medical community and our numerous other partners in health care.”
Coinciding with the 85-year celebration is the launch of NBOME’s enhanced blueprint for the COMLEX-USA examination series. This multi-stage update to the organization’s flagship exam series is the culmination of nearly 10 years of work in evidence-based design led by NBOME’s National Faculty, including subject matter experts from the education, licensure, and clinical practice communities. In addition, NBOME has expanded the COMAT product portfolio with the Foundational Biomedical Sciences exam series and has also introduced the CATALYST longitudinal assessment platform for formative testing and continuous professional development.
A provisional logo and accompanying information resources focused on NBOME’s 85-year legacy, including a dedicated webpage and social media campaign, are planned for the remainder of the year.
About the NBOME
NBOME is an independent, non-governmental, non-profit assessment organization committed to protecting the public by providing the means to assess competencies for osteopathic medicine and related health care professions. NBOME’s COMLEX-USA examination series is a requirement for graduation from colleges of osteopathic medicine and provides the pathway to licensure for osteopathic physicians in the United States and numerous international jurisdictions.
The NBOME congratulates former board member Ronald Burns, DO, on his new appointment as president of the American Osteopathic Association on Saturday, July 27th 2019.
Dr. Burns assumed the presidency before an estimated 500 osteopathic physicians (DOs) at the American Osteopathic Association’s annual business meeting in Chicago. The organization represents the professional interests of the nation’s more than 145,000 DOs and osteopathic medical students.
From 2009 to 2018, Dr. Burns served on the NBOME Board, where he participated in the COMLEX-USA Level 3 Advisory Committee and the Executive Committee and chaired the Nominating Committee. He has also previously served on the Cognitive Testing Advisory Committee.
Dr. Burns is a board-certified family medicine physician with a private practice in Orlando, Florida. He has represented the state of Florida in the AOA House of Delegates for the past 19 years. He has also been a member of the Florida Osteopathic Medical Association for over 29 years and served as its president from 2004 to 2005.
“My goal is to ensure the AOA is ready to meet the current and future needs of its ever-growing body of constituents,” said Dr. Burns. “By necessity, that means having strong collaborative relationships with AOA’s affiliate organizations.”
We wish Dr. Burns the best of luck, and look forward to working with our longtime colleague in his new position.
Read more about Dr. Burns’ role as AOA President.
Mathematical programming has been widely used by professionals in testing agencies as a tool to automatically construct equivalent test forms. This study introduces the linear programming capabilities (modeling language plus solvers) of SAS Operations Research as a platform for rigorously engineering tests to specifications in an automated manner. To that end, real items from a medical licensing test are used to demonstrate the simultaneous assembly of multiple parallel test forms under two separate linear programming scenarios: (a) constraint satisfaction (one problem) and (b) combinatorial optimization (three problems). In the four problems from the two scenarios, the forms are assembled subject to various content and psychometric constraints. The assembled forms are then assessed using psychometric methods to ensure equivalence with respect to all test specifications. Results from this study support SAS as a reliable and easy-to-implement platform for form assembly. Annotated code is provided to promote further research and operational work in this area.
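For readers who want a sense of how such assembly models are posed, the sketch below frames a simultaneous two-form assembly as a small mixed-integer program. It is an illustrative analog only: the published study uses the SAS Operations Research modeling language and solvers, whereas this sketch uses the open-source PuLP modeler in Python, and the item pool, content blueprint, cut score, and information target are all assumed for demonstration.

```python
# Hedged sketch: simultaneous assembly of two non-overlapping parallel forms
# as a mixed-integer program. Not the study's SAS code; the pool, blueprint,
# cut score, and information target below are hypothetical.
import math
import random
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

random.seed(1)
POOL = 120          # hypothetical item pool size
FORM_LEN = 30       # items per form
FORMS = [0, 1]
CUT = 0.0           # assumed cut score on the logit scale

# Hypothetical item bank: Rasch difficulty and a content area per item.
difficulty = [random.uniform(-2.5, 2.5) for _ in range(POOL)]
content = [random.choice(["A", "B", "C"]) for _ in range(POOL)]
content_targets = {"A": 10, "B": 10, "C": 10}  # assumed blueprint per form

def rasch_info(b, theta):
    """Rasch item information p(1 - p) at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

info_at_cut = [rasch_info(b, CUT) for b in difficulty]
target_info = FORM_LEN * 0.18   # assumed target TIF value at the cut score

prob = LpProblem("parallel_form_assembly", LpMinimize)
x = LpVariable.dicts("x", (range(POOL), FORMS), cat=LpBinary)
dev = LpVariable.dicts("dev", FORMS, lowBound=0)   # |TIF - target| per form

for f in FORMS:
    # Form length and content blueprint constraints.
    prob += lpSum(x[i][f] for i in range(POOL)) == FORM_LEN
    for area, count in content_targets.items():
        prob += lpSum(x[i][f] for i in range(POOL) if content[i] == area) == count
    # Linearized absolute deviation of the form's information at the cut score.
    tif = lpSum(info_at_cut[i] * x[i][f] for i in range(POOL))
    prob += tif - target_info <= dev[f]
    prob += target_info - tif <= dev[f]

# Each item may appear on at most one form (non-overlapping parallel forms).
for i in range(POOL):
    prob += lpSum(x[i][f] for f in FORMS) <= 1

# Combinatorial-optimization variant: minimize total deviation from the target TIF.
prob += lpSum(dev[f] for f in FORMS)
prob.solve()

for f in FORMS:
    chosen = [i for i in range(POOL) if x[i][f].value() == 1]
    print(f"Form {f}: {len(chosen)} items, TIF at cut = "
          f"{sum(info_at_cut[i] for i in chosen):.2f}")
```

Dropping the objective and instead bounding each form’s deviation by a fixed tolerance recovers the constraint-satisfaction scenario described in the abstract, while the objective shown corresponds to the combinatorial-optimization scenario.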
To read this article as it was initially published in Applied Psychological Measurement click here.
Contributed by:
Can Shao | Research Scientist | Curriculum Associates
Hongwei “Patrick” Yang | Assistant Professor | The University of West Florida
Silu Liu | Epidemiological Research Manager | DCHealth Technology
Tsung-Hsun “Edward” Tsai, PhD | Associate Vice President of Assessment Services and Research | NBOME