According to Cointelegraph, a team of scientists from the University of North Carolina at Chapel Hill recently published preprint artificial intelligence (AI) research highlighting the difficulty of removing sensitive data from large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard. The researchers found that while it is possible to delete information from LLMs, verifying that the information has actually been removed is just as challenging as the removal itself.
This difficulty stems from how LLMs are engineered and trained. They are pre-trained on massive text corpora and then fine-tuned to generate coherent outputs. Once a model is trained, its creators cannot go back into the training data and delete specific files to stop the model from outputting related results, because the information is encoded across the model’s weights rather than stored in a retrievable database. This is the 'black box' of AI. Problems arise when LLMs trained on massive datasets output sensitive information, such as personally identifiable information or financial records.
To address this issue, AI developers use guardrails, such as hard-coded prompts that inhibit specific behaviors, or reinforcement learning from human feedback (RLHF). However, the UNC researchers argue that these approaches rely on humans finding all the flaws a model might exhibit, and even when they succeed, they still don’t 'delete' the information from the model itself. The researchers concluded that even state-of-the-art model editing methods, such as Rank-One Model Editing (ROME), fail to fully delete factual information from LLMs: facts could still be extracted 38% of the time by whitebox attacks and 29% of the time by blackbox attacks.
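To give a sense of what a rank-one edit involves, the sketch below applies the kind of closed-form update ROME uses to rewrite a single key-value association stored in a feed-forward weight matrix. The dimensions, random data and variable names are hypothetical placeholders; this is an illustration of the underlying math, not the researchers’ code or the actual ROME implementation.

```python
# Minimal NumPy sketch of a rank-one weight edit in the spirit of ROME.
# Toy sizes and random data throughout (assumed, for illustration only).
import numpy as np

rng = np.random.default_rng(0)
d_key, d_val = 64, 128                      # toy hidden sizes (assumed)

W = rng.normal(size=(d_val, d_key))         # weight treated as a key -> value map
K = rng.normal(size=(d_key, 10_000))        # sample of keys the layer has seen (assumed)
C = K @ K.T / K.shape[1]                    # uncentered covariance of those keys

k_star = rng.normal(size=d_key)             # key encoding the fact to rewrite
v_star = rng.normal(size=d_val)             # desired new value (e.g. a scrubbed answer)

# Closed-form rank-one update: force W_edited @ k_star == v_star while
# staying close to W on the other keys (weighted by C).
u = np.linalg.solve(C, k_star)              # C^{-1} k*
W_edited = W + np.outer(v_star - W @ k_star, u) / (u @ k_star)

print(np.allclose(W_edited @ k_star, v_star))    # True: targeted association rewritten
print(np.linalg.matrix_rank(W_edited - W))       # 1: only a rank-one change to the layer
```

The UNC finding is that even after an edit like this rewrites the targeted association, traces of the original fact can remain elsewhere in the model and can often still be extracted, which is what the whitebox and blackbox rates above measure.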
The researchers did develop new defense methods that protect LLMs from some 'extraction attacks', purposeful attempts by bad actors to use prompting to circumvent a model’s guardrails and make it output sensitive information. However, they noted that deleting sensitive information may be a problem where defense methods are always playing catch-up to new attack methods.
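For contrast with weight-level editing, the hypothetical sketch below shows what a simple guardrail-style defense might look like: a post-generation filter that redacts common PII patterns before output is returned. The patterns and function names are illustrative assumptions, not the defense methods developed in the paper, and, as the researchers’ argument implies, such a filter only suppresses the output; the sensitive data itself remains encoded in the model and new prompting tricks may slip past it.

```python
# Hypothetical guardrail-style defense: scan generated text for common PII
# patterns and redact them before returning output to the user. Illustrative
# sketch only; not the defense method proposed in the UNC paper.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace any matched PII pattern with a placeholder tag."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(redact("Contact john.doe@example.com, SSN 123-45-6789."))
# -> Contact [REDACTED EMAIL], SSN [REDACTED SSN].
```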