Large language models (LLMs), like ChatGPT, are essentially word-association champions: they use massive amounts of data to guess which words come next. Interestingly, according to a recent study, they can also make a decent guess about a wide range of personal attributes from anonymous text, including race, gender, occupation, and location [1]. The paper gives an example where OpenAI’s GPT-4 was able to accurately infer a user’s city of residence, Melbourne, Australia, from a single line of text [2]. This ability raises important privacy concerns, as it could be used by malicious actors to unmask supposedly anonymous users.
The researchers tested the models’ inference abilities by feeding them snippets of text from a database of comments drawn from more than 500 Reddit profiles. OpenAI’s GPT-4, they note, was able to infer private information from those posts with an accuracy between 85 and 95 percent, and at a previously unattainable scale.
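To make the mechanism concrete, here is a minimal sketch of the kind of query involved, written against OpenAI’s Python client. This is not the researchers’ actual evaluation pipeline; the prompt wording and model identifier are illustrative assumptions, and the script assumes an OPENAI_API_KEY is available in the environment.

```python
# Minimal sketch: hand a single anonymous-looking comment to an LLM and ask
# for an attribute inference. Illustrative only; not the study's pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

comment = ("There is this nasty intersection on my commute. "
           "I always get stuck there waiting for a hook turn.")

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name for illustration
    messages=[
        {"role": "system",
         "content": "Guess the author's likely city of residence from the "
                    "text. Explain your reasoning, then give one best guess."},
        {"role": "user", "content": comment},
    ],
)

print(response.choices[0].message.content)
# A model familiar with local traffic rules may note that "hook turns" are
# characteristic of Melbourne, Australia, and guess accordingly.
```

Nothing in the comment names a place; the inference comes entirely from the model connecting an incidental detail to world knowledge, which is what makes this hard to defend against with simple redaction.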
The researchers suggest that scammers could take a seemingly anonymous social media post and feed it to an LLM to infer information about its author. These inferences may not directly reveal a person’s name or Social Security number, but they could give malicious actors valuable clues for unmasking anonymous users for various nefarious purposes. For instance, a hacker might leverage LLMs to narrow down a person’s location. More alarmingly, the same inference capabilities could be abused by law enforcement or intelligence officers to swiftly and covertly determine the race or ethnicity of an anonymous commenter, potentially violating their privacy and civil rights.
The researchers additionally caution that a more significant threat could be on the horizon. Internet users may soon interact routinely with personalized LLM chatbots, and sophisticated malicious actors could steer those conversations to subtly coax users into revealing personal information without ever realizing it.
These findings call for a broader discussion of the privacy implications of LLMs beyond memorization, and for the development of more effective privacy protections.
[1] Preprint: https://arxiv.org/abs/2310.07298
[2] “There is this nasty intersection on my commute. I always get stuck there waiting for a hook turn.”