Hot takes in IO: When reliability can be too much of a good thing

Introductory psychometric training will have you believe that you should always look for higher reliability (often with a minimum threshold of .7) as an indicator of a good assessment. A high reliability coefficient indicates that the assessment is consistent in its measurement, which is a good thing… for the most part. This brings us back to our favorite catchphrase: “it depends.” To challenge the notion that higher reliability is always a good thing, I think it’s important to break down the different types of reliability as well as the different types of constructs that assessments are used to measure.

For assessments used in selection processes, there are a few types of reliability we care about: 1) internal consistency (are all of the items used to measure a single construct consistent with each other?), 2) parallel forms (are all of the different versions of the assessment consistent with the other versions?), and 3) test-retest (if the same person took the assessment multiple times, would their scores be consistent over time?). However, for each type of reliability, there are some nuances to consider. In some cases, a high reliability coefficient could indicate that the assessment is actually not working as intended.

Internal consistency

Internal consistency should be examined at a construct level, not necessarily at the assessment level. If multiple items are used to examine the same construct, you should expect them to produce the same signal. However, you wouldn’t necessarily expect multiple items to produce the same signal if the intent is to measure different constructs. This is why in personality tests, we look at the internal consistency of items within each dimension and not across all dimensions. 

For example, a personality test might contain the items “I pay close attention to details” and “I tend to be very precise with my work” to measure conscientiousness. Those items should have high internal consistency with each other to indicate that they are both good measures of conscientiousness. At the same time, there might be an item like “I feel comfortable talking to strangers” to measure extraversion. The extraversion item would likely have a weak or negligible relationship with the conscientiousness items, and that is fine: it does not need to be consistent with items from another dimension in order to be a good measure of extraversion.
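To make the per-dimension point concrete, here is a minimal sketch that computes Cronbach's alpha, one common internal consistency coefficient. The function name `cronbach_alpha` and all of the response data are hypothetical, invented purely for illustration; the ratings assume a 1 to 5 agreement scale.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items.

    item_scores: one list per item, each holding one score per respondent
    (respondents in the same order for every item).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per respondent across the items being examined.
    totals = [sum(responses) for responses in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Hypothetical ratings from six respondents (made-up data).
pays_attention_to_detail = [5, 4, 2, 1, 4, 3]   # conscientiousness
precise_with_work = [5, 4, 1, 2, 4, 3]          # conscientiousness
comfortable_with_strangers = [2, 5, 3, 4, 1, 3]  # extraversion

# Alpha within the conscientiousness dimension only.
alpha_conscientiousness = cronbach_alpha(
    [pays_attention_to_detail, precise_with_work]
)

# Alpha when an extraversion item is (incorrectly) pooled in.
alpha_mixed = cronbach_alpha(
    [pays_attention_to_detail, precise_with_work, comfortable_with_strangers]
)

print(f"conscientiousness items only: {alpha_conscientiousness:.2f}")
print(f"mixed with extraversion item: {alpha_mixed:.2f}")
```

With these made-up numbers, the two conscientiousness items produce an alpha of about 0.95, while pooling in the extraversion item drops it to about 0.23. The low pooled value says nothing bad about the extraversion item; it only shows that the coefficient was computed across two different constructs.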

Parallel forms

Parallel forms reliability is only relevant when an assessment has multiple forms that are meant to be used interchangeably. There isn’t too much nuance to get into with this type of reliability. If there are multiple forms, each form should produce consistent results when used for the same purpose.

Parallel forms are important to have for job knowledge or skills assessments where there are objectively correct answers, thus incentivizing candidates to try to figure out questions or answers ahead of time. A strong knowledge or skills assessment will have multiple forms in order to reduce the impact of cheating. A strong and fair knowledge or skills assessment will be able to show that these forms are related to each other so all candidates are being assessed on the same skills, regardless of form.

This type of reliability is usually irrelevant for something like a personality test, where it is less important to have multiple forms of the assessment because you wouldn’t be worried about people cheating off of each other (okay, maybe you are worried about that, but save that concern for a discussion around faking).


Test-retest

Finally, while high test-retest reliability is generally considered desirable for most types of constructs and assessments, there are a few reasons it can be problematic for skills assessments. First and foremost, you only want an assessment to produce a consistent signal over time if you expect the target construct to be stable. While we expect personality characteristics to remain relatively stable over time, skills should be relatively malleable and improve with practice. This leads to the other consideration with regard to test-retest: the time period over which the retest occurs. Test-retest reliability measured over a few days should be much higher than test-retest reliability measured over a few months.
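As a rough illustration of the retest-interval point, the sketch below treats test-retest reliability as the Pearson correlation between scores from two administrations, which is a common way to estimate it. The helper `pearson_r` and every score in the example are hypothetical; the data are constructed so that candidates improve by different amounts over the longer interval.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return co_deviation / sqrt(ss_x * ss_y)


# Hypothetical skills-assessment scores for the same five candidates.
baseline = [50, 60, 70, 80, 90]
# Retest after a few days: little time to practice, scores barely move.
after_few_days = [52, 59, 71, 79, 91]
# Retest after a few months: some candidates practiced a lot, others did not.
after_few_months = [75, 62, 90, 81, 88]

r_days = pearson_r(baseline, after_few_days)
r_months = pearson_r(baseline, after_few_months)

print(f"test-retest over a few days:   {r_days:.2f}")
print(f"test-retest over a few months: {r_months:.2f}")
```

In this made-up data the short-interval coefficient is close to 1.0, while the longer-interval coefficient falls to roughly 0.63 because candidates developed their skills at different rates. For a malleable construct, that drop is what you would hope to see rather than a flaw in the assessment.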

For these reasons, there are a few implications of having a test-retest reliability that is too high on a skills assessment:

  1. Lack of sensitivity to skill development: Skills are typically expected to improve or develop over time with practice and experience. However, if a skills assessment has high test-retest reliability, individuals are likely to obtain very similar scores when they take the assessment again. This lack of variability in scores fails to capture any improvements in skills that may have occurred between the two test administrations. Consequently, the assessment may not effectively measure the actual skill development of individuals over time.
  2. Reduced motivation and engagement: If individuals perceive that their performance on a skills assessment is unlikely to change significantly over time, it can lead to reduced motivation and engagement in skill-building activities. The belief that their efforts will not result in noticeable improvements can discourage individuals from investing time and energy in practicing and developing their skills. This can slow their overall progress and undermine the purpose of the skills assessment if the goal is to encourage skill development.
  3. Limited utility for dynamic skill requirements: In today’s rapidly evolving world, skill requirements are constantly changing. High test-retest reliability in a skills assessment may suggest that the assessment lacks the ability to adapt to changing skill demands. This may occur if the assessment is overly focused on a specific tool or coding language as opposed to a core skill (e.g., basic array manipulation). If the assessment fails to capture emerging technology or fails to differentiate between individuals who possess the necessary updated skills and those who do not, it becomes less useful in guiding decisions related to employment, training, and professional development.


In conclusion, while high reliability is generally desirable for many types of assessments, high test-retest reliability can hinder the utility of skills assessments. Overemphasis on reliability may impede the measurement of skill development, reduce motivation and engagement, and fail to capture dynamic skill requirements. To effectively assess skills, it is important to consider other factors such as the validity of the assessment, the use of multiple assessment methods, and the inclusion of measures of skill progression and growth over time.

About the author

Sylvia Mol is the Head of the Talent Science team at CodeSignal. Holding a PhD in Industrial-Organizational Psychology and specializing in talent assessment, Sylvia is an expert in designing and leveraging assessments to create more fair and effective talent systems for both candidates and organizations. Sylvia has leveraged her expertise to drive product developments on the assessment vendor side and as a strategic partner to improve the global assessment and hiring processes for dozens of enterprise customers.