“Can a Computer Outfake a Human [Personality]?”, 2023-10-06:
- Examined whether generative AI can outperform humans in faking personality.
- University students served as the human comparators.
- Both single stimulus and forced-choice assessments were used.
- Faking performance varied across the generative AI models.
- ChatGPT-4 was the clear winner, easily faking both types of assessments.
Faking on personality tests continues to be a challenge in hiring practices, and with the increasing accessibility of free generative AI large language models (LLMs), human and algorithmic responses are difficult to distinguish.
Four LLMs (GPT-3.5, Jasper, Google Bard, and GPT-4) were prompted to provide ideal responses to personality measures, tailored to a provided job description. Responses collected from the LLMs were compared to a previously collected student sample who were also directed to respond in an ideal fashion to the same job description.
Overall, score comparisons indicate the superior performance of GPT-4 on both the single stimulus and forced-choice personality assessments and reinforce the need to consider more advanced options in preventing faking on personality assessments.
Additionally, results from this study indicate the need for future research, especially as generative AI improves and becomes more accessible to a range of candidates.
[Keywords: personality, single stimulus, forced choice, generative AI, large language models]
…2. Material and methods: Archival data from a recently collected data set were used for the university student sample. Participants were recruited from undergraduate subject pools at four universities (two in Canada and two in the US; n = 869). The original design required students to answer honestly at Time 1 and to respond at Time 2 as if they were applying for a sales position (with a sales job description provided for context). For the purposes of the present study, only the Time 2 data were used. Extraversion and conscientiousness were the traits deemed most essential for the job, given previous meta-analytic work (cf. 1991).
Participants completed two personality assessments that used the same pool of adjectives but differed in design. We included both single stimulus and forced-choice assessments because research has found forced-choice assessments to be less fakable (2019). Each assessment measured the FFM-based traits of extraversion, conscientiousness, agreeableness, and openness. One assessment was single stimulus: each of the 4 scales contained 20 adjectives, and participants reported their endorsement on a 7-point unipolar format (“very untrue of me” to “very true of me”). The other was a 40-pair forced-choice assessment, which required participants to choose which of two desirability-matched adjectives described them best. We report only scores for extraversion and conscientiousness, as these are the target traits for the sales position. Details on these assessments and the faking induction can be found in et al 2005.
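The two scoring schemes implied above can be sketched as follows; this is a minimal illustration, not the actual instrument, and the function names and trait keys are assumptions:

```python
def score_single_stimulus(ratings):
    """Score one single stimulus scale: the mean of 20 adjective
    ratings, each on a 1-7 unipolar format."""
    assert len(ratings) == 20 and all(1 <= r <= 7 for r in ratings)
    return sum(ratings) / len(ratings)

def score_forced_choice(choices, keyed_trait="extraversion"):
    """Score a forced-choice assessment: count how often the adjective
    keyed to the target trait was chosen from each
    desirability-matched pair (choices holds the keyed trait of the
    adjective picked in each pair)."""
    return sum(1 for chosen in choices if chosen == keyed_trait)
```

Desirability matching within each pair is what makes the forced-choice format harder to fake: both options look equally attractive, so the respondent cannot simply pick the "better" adjective.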
We collected data using these same personality assessments in June 2023 from 4 LLMs (ChatGPT-3.5, ChatGPT-4, Jasper, and Google Bard). Initially, each LLM was asked: “Can you help me choose the most appropriate option based on a job description?” The LLMs were then provided with the job description and the questions from the personality measures, pasted individually into each LLM’s text field. When required, additional prompts were used to ensure that the LLM provided appropriate responses (e.g., “you must choose one”). Testing was done on multiple occasions, but in a single session for each LLM. ChatGPT-4 was tested 25 questions at a time due to OpenAI restrictions.
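The prompting procedure can be sketched as a generator of prompt strings; the opening request is quoted from the study, but the batching helper and its name are assumptions (the authors pasted items manually into each chat interface):

```python
OPENING = ("Can you help me choose the most appropriate option "
           "based on a job description?")

def build_prompts(job_description, items, batch_size=25):
    """Yield the opening request plus job description, then the
    assessment items in batches (e.g. 25 at a time, as was required
    for ChatGPT-4)."""
    yield OPENING + "\n\n" + job_description
    for i in range(0, len(items), batch_size):
        yield "\n".join(items[i:i + batch_size])
```

Follow-up nudges such as “you must choose one” would be issued interactively whenever a model declined to commit to an answer.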
…Means, standard deviations, t-tests, d scores (and 95% confidence intervals), and percentiles can be found in Table 1. Results are graphically illustrated in Figure 1. As can be seen from Figure 1, most of the LLMs performed at or above the median of the student population in faking the single stimulus measure (the exception being Google Bard for extraversion, which fell at the 33rd percentile of the student population). ChatGPT-4, however, clearly performed best, scoring at the 99th and 100th percentile of the student sample for extraversion and conscientiousness, respectively. Most of the LLMs had greater trouble faking the forced-choice assessment, with most falling at or below the median of the student population. Again, however, ChatGPT-4 performed best, with percentiles of 85.2 and 98.4 relative to the student population for extraversion and conscientiousness, respectively.
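The two statistics reported here, percentile ranks against the student sample and standardized mean differences (d scores), can be computed from raw scores as below; this is a generic sketch of the standard formulas, not the authors' analysis code:

```python
from statistics import mean, stdev

def percentile_rank(score, sample):
    """Percent of the comparison sample scoring below `score`,
    counting ties as half (a common convention)."""
    below = sum(1 for s in sample if s < score)
    ties = sum(1 for s in sample if s == score)
    return 100 * (below + 0.5 * ties) / len(sample)

def cohens_d(group1, group2):
    """Cohen's d: standardized mean difference using the pooled
    standard deviation of the two groups."""
    n1, n2 = len(group1), len(group2)
    s_pooled = (((n1 - 1) * stdev(group1) ** 2 +
                 (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean(group1) - mean(group2)) / s_pooled
```

An LLM scoring above every student would thus land at the 100th percentile, as ChatGPT-4 did for conscientiousness on the single stimulus measure.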
…Although results varied across the LLMs, GPT-4 outperformed most of the student population, on average faking better than 99.6% of the student population on the Likert-type measures and better than 91.78% of the student population on the forced-choice measures.
Single stimulus assessments are relatively easy to fake, which is why practitioners often opt for forced-choice assessments to combat faking. The primary practical implication of these results is that generative AI may soon make preventing faking on noncognitive assessments in personnel selection much more difficult.