Human Language Assessments vs. Artificial Intelligence Assessments

  

As a company in an ever-changing marketplace, you’re always looking for ways to streamline recruitment. Language skills matter in many jobs, so it’s important to ensure that candidates have the right spoken and written language skills. Providing organizations and candidates with a balanced, fair, and comfortable assessment process has been the driving force behind our language assessment services since 2001. We collaborate with our talent acquisition partners to ensure a stress-free and efficient overall experience.

Let’s take a deep dive into AI assessments versus human assessments. Automated scoring technology for judging speech and writing has gradually gained global acceptance. Fewer people now ask, “Does it work?” The next question is, “Okay, how does it work for our purposes?” This blog aims to give readers the information they need to answer both questions.

One common misconception about automated scoring is that a computer has been trained to “do what humans do.” Automated scoring technology does not make computers behave like humans. Instead, it takes advantage of the fact that computers can be programmed to find and measure features of speaking and writing, combine and weight them in a multidimensional space, and determine which specific features and weightings best predict the score given by a human. Automated scoring does not rest on the claim that a computer can “understand” a spoken utterance, which computers cannot do. Likewise, human scoring does not rest on the claim that a human judge can accurately count millisecond-level subphonemic timing events in natural speech, which computers can do better than humans. Both types of evaluation can consistently produce proficiency scores for spoken utterances.
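To make the idea of "measure features, weight them, and predict the human score" concrete, here is a minimal sketch in Python. It is not how any particular scoring engine works. The two features (speech rate and pause ratio), the sample values, and the function names fit_weights and predict are all assumptions made up for illustration, and the toy human scores were constructed to follow an exact linear pattern so the result is easy to check.

```python
# Illustrative only: hypothetical features and made-up data, not a real scoring engine.

def fit_weights(features, human_scores):
    """Find least-squares weights (intercept first) that best predict human scores."""
    rows = [[1.0] + list(f) for f in features]  # prepend a constant term for the intercept
    n = len(rows[0])
    # Build the normal equations (X^T X) w = X^T y as an augmented matrix.
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, human_scores))]
        for i in range(n)
    ]
    # Solve with Gauss-Jordan elimination and partial pivoting.
    # Assumes the features are not redundant (the matrix is full rank).
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(weights, feature_row):
    """Apply the fitted weights to one new speech sample."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


# Each row: (speech rate in syllables per second, share of time spent pausing).
FEATURES = [(2.0, 0.5), (3.0, 0.25), (4.0, 0.25), (3.0, 0.5), (5.0, 0.0)]
# Toy human ratings for the same five samples.
HUMAN_SCORES = [2.0, 3.5, 4.5, 3.0, 6.0]

weights = fit_weights(FEATURES, HUMAN_SCORES)
new_sample_score = predict(weights, (4.0, 0.5))
```

The computer never "understands" the speech here. It only learns which measured features, and in what proportion, line up with the scores human raters gave, and then applies the same weights to new samples.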

The goal of this blog is to provide an understandable explanation of how current state-of-the-art assessments use automated scoring technology, and to review research showing that these technologies are effective in assessment contexts where a human would otherwise do the scoring. When used correctly and with care, automated scoring technology can make language tests much more interesting and useful in terms of delivery, task types, and scoring. At the same time, each part of the technology must be justified in the assessment’s validity argument. To ensure that score interpretation is consistent regardless of whether the test is computer-based or paper-based, the following questions may need to be investigated: Do students score the same on essay tasks when they handwrite or type them on computers? Is typing ability a disadvantage for some examinees? Do students use different strategic skills when reading passages on screen versus on paper?

One school of thought holds that because human judgments are subjective and fallible, machine scores should be compared not only with human judgments but also with external criteria or other measures of the same ability. Surprisingly, machine scores are more transparent than human judgments in some ways. Human raters evaluate language samples, consult scale descriptors, and use their judgment and experience to assign a final score. Yet there is no way to quantify how they weigh and combine the various pieces of information in an essay. In contrast, machine scoring is replicable: every piece of data analyzed, along with its precise weighting in the scoring model, can be verified in the machine algorithms.

So, in automated models, it is possible to leave out irrelevant information (like the length of a sentence) and explicitly weight important information (like discourse markers) in a way that humans can’t. Bernstein, Van Moere, and Cheng (2010) note that scoring models are data-driven, verifiable, and falsifiable.
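As a rough illustration of that transparency, the sketch below writes a scoring model down as plain data, so anyone can see which features count and by how much. The feature names, the weights, and the function score_with_breakdown are hypothetical and chosen only for this example. Sentence length is deliberately left out of the model, so it cannot influence the score.

```python
# Illustrative only: a hypothetical scoring model with made-up features and weights.
SCORING_MODEL = {
    "intercept": 1.0,
    "weights": {
        "discourse_markers_per_100_words": 0.4,
        "lexical_diversity": 2.0,
        # "mean_sentence_length" is intentionally absent, so it is ignored.
    },
}


def score_with_breakdown(features, model=SCORING_MODEL):
    """Return the total score and each feature's individual contribution to it."""
    contributions = {
        name: weight * features[name]
        for name, weight in model["weights"].items()
    }
    total = model["intercept"] + sum(contributions.values())
    return total, contributions


# Made-up measurements for one essay.
essay = {
    "discourse_markers_per_100_words": 3.0,
    "lexical_diversity": 0.5,
    "mean_sentence_length": 21.0,
}

total, parts = score_with_breakdown(essay)
```

Running the same essay through the model again gives the same score and the same breakdown every time, which is the sense in which a machine score can be checked and, if it is wrong, shown to be wrong.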

Humans are good at many things and not so good at many others. Let’s recap the highlights of AI vs. humans:

  •  Human evaluators can be used for language ability screening, but they are not a good solution when you’re hiring many candidates at once.
  •  When human evaluators conduct language interviews, their results are sometimes biased by factors like fatigue, distraction, and hiring quotas.
  •  Our customer data has shown that AI is much better at consistently and accurately evaluating candidates based on their academic language ability.
  •  Human evaluators are better at assessing the business language abilities needed in the workplace.

At ELAM, we continuously conduct benchmarking audits to ensure unbiased and consistent assessments. We want to make sure all candidates have an equal opportunity to demonstrate their language abilities!