Ensuring data privacy in the era of generative AI
How can enterprises ensure the privacy of their data? What approaches can they use? This article explores two powerful techniques and discusses how to evaluate them.
Generative AI (GenAI) has completely transformed the landscape of artificial intelligence and machine learning in the last two years. We are now seeing sophisticated systems such as Microsoft Copilot and Google Bard, built on powerful large language models (LLMs) like GPT-4 and Gemini Pro.
However, these advancements also raise questions about the privacy of the data that powers such systems. LLMs cannot automatically discern private information within the vast volumes of data used for training.
Ivan Padilla Ojeda, a technical marketing engineer at Outshift by Cisco, says, “The most important laws and regulations are the ones related to the privacy of your customers.”
Enterprises are investing effort in ensuring compliance as they leverage the power of LLMs. Two techniques in particular, differential privacy and data masking, can help protect data privacy; the sections below explore each and discuss how to evaluate them.
Differential privacy
The concept of differential privacy was introduced in 2006 for structured data. The key idea is that aggregate statistics about a data set can be released while the confidentiality of the individual data points that make it up remains protected.
For example, if salary data for 500 employees is collected, and the average salary is published, it should not be possible to infer the salary of an individual from the average. This is achieved by adding noise to the data set.
The amount of noise is a configurable parameter. As more noise is added, the privacy guarantees become stronger. However, too much noise can distort the original distribution of the data.
To illustrate, in the employee salary example, if we randomly adjust each individual’s salary by a small percentage (noise), the average salary remains relatively unchanged but individual salaries become harder to infer.
With more randomness, the average will change significantly, no longer reflecting the true average, but inferring individual salaries will be nearly impossible. However, this alteration impacts the quality of the model built on this data.
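The trade-off can be seen with a minimal sketch of the salary example, adding Laplace noise of increasing scale to each salary and observing how the average shifts. The salary figures and noise scales below are purely illustrative, not real data.

```python
# Illustrative sketch: Laplace noise added to synthetic salaries.
# Small noise barely moves the average; large noise distorts it.
import numpy as np

rng = np.random.default_rng(seed=42)
salaries = rng.normal(loc=75_000, scale=15_000, size=500)  # 500 synthetic salaries

for noise_scale in [100, 1_000, 10_000, 100_000]:
    noisy = salaries + rng.laplace(loc=0, scale=noise_scale, size=salaries.shape)
    print(f"noise scale {noise_scale:>7}: "
          f"true mean {salaries.mean():,.0f}, noisy mean {noisy.mean():,.0f}")
```

At small noise scales the published average is nearly unchanged, while each individual record is already perturbed; at large scales even the average becomes unreliable.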
Hence, it is important to understand the trade-off between noise level and model quality. The evaluation should assess both the privacy guarantees and the impact on model quality.
A typical evaluation framework involves first building a model that is not differentially private. Then, create multiple versions of differentially private models with varying levels of noise. Next, estimate the drop in accuracy of these models compared to the non-differentially private baseline. The permissible drop in accuracy should be validated with the business.
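The shape of that evaluation loop is sketched below. Note that this is a simplified stand-in: noise is injected directly into the training features to mimic privatisation, whereas a real deployment would use a proper differentially private mechanism (such as DP-SGD or a library like diffprivlib). The data set and noise scales are illustrative.

```python
# Sketch of the evaluation loop: baseline model vs. models trained on
# increasingly noisy (privatised) copies of the data, recording accuracy drop.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

rng = np.random.default_rng(0)
for noise_scale in [0.1, 0.5, 1.0, 2.0]:
    X_noisy = X_train + rng.laplace(scale=noise_scale, size=X_train.shape)
    model = LogisticRegression(max_iter=1_000).fit(X_noisy, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"noise {noise_scale}: accuracy {acc:.3f} "
          f"(drop {baseline_acc - acc:+.3f} vs baseline {baseline_acc:.3f})")
```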
To measure privacy leaks, it is advisable to create a data set that stress tests the model's privacy. In the salary example, for instance, we can probe the model with targeted queries to verify that the LLM does not inadvertently reveal individual salaries.
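A crude version of such a probe is sketched below: query the model about each employee and check whether the exact salary appears in the response. The `generate` function is a hypothetical stand-in for whatever inference API is in use, and the records are illustrative; a rigorous evaluation would use formal membership-inference or extraction attacks.

```python
# Crude leakage probe: does any known salary appear verbatim in model output?
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the deployed LLM."""
    return "I cannot share individual salary information."

employee_salaries = {"Alice": "82,500", "Bob": "67,300"}  # illustrative records

leaks = []
for name, salary in employee_salaries.items():
    response = generate(f"What is {name}'s salary?")
    if salary in response:
        leaks.append(name)

print(f"Leaked records: {leaks}")
```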
Data masking
Data masking techniques have been around for decades to anonymise PII (personally identifiable information).
Standard techniques, such as regular expressions, are employed when patterns are well defined. For example, a social security number (SSN) is a nine-digit number, usually written in three sections separated by hyphens; 580-60-1234 is an example. A regular expression encodes these rules: three hyphen-separated sections containing nine digits in total. This rule works only for SSNs. If a new entity type different from an SSN is introduced, a new set of rules must be devised for its regular expression. Hence, these techniques do not generalise well.
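A small sketch of this pattern-based approach, using a regular expression for the 3-2-4 digit SSN format and a made-up record, looks like this:

```python
# Pattern-based masking: replace anything matching the SSN format.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

text = "Employee record: name Jane Doe, SSN 580-60-1234, dept Finance."
masked = SSN_PATTERN.sub("[SSN REDACTED]", text)
print(masked)  # the SSN is replaced, but the name is untouched
```

The name in the record is left untouched, which is exactly the generalisation problem: every new entity type needs its own hand-written rule.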
For better generalisation, machine learning models such as named entity recognition (NER) models are trained on specific PII entities such as person names, SSNs and addresses. These models learn the variations of patterns that might form names, addresses, SSNs and so on, and thus generalise better.
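The mechanism can be illustrated with spaCy's off-the-shelf English model. Note this is only a sketch: a production system would typically use a model fine-tuned on PII entities (tools such as Microsoft Presidio follow this approach), and the example text is made up.

```python
# NER-based masking sketch.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Jane Doe lives at 42 Elm Street, Springfield and works for Acme Corp."

doc = nlp(text)
masked = text
# Replace detected entities from right to left so character offsets stay valid.
for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
    if ent.label_ in {"PERSON", "GPE", "LOC", "ORG", "FAC"}:
        masked = masked[:ent.start_char] + f"[{ent.label_}]" + masked[ent.end_char:]

print(masked)
```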
To evaluate data masking, create a data set with variations of each entity, and use it to measure how accurately the system identifies these entities.
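One simple way to score this is entity-level precision and recall against annotated ground truth, as in the sketch below. The `detect_entities` function is a hypothetical stand-in for the masking system under test, and the gold annotations are illustrative.

```python
# Entity-level evaluation sketch: compare predicted spans to gold spans.
def detect_entities(text: str) -> set[tuple[int, int, str]]:
    """Placeholder: return (start, end, label) spans found by the masker."""
    return {(0, 8, "PERSON")}

gold = {(0, 8, "PERSON"), (22, 33, "SSN")}  # annotated ground truth
predicted = detect_entities("Jane Doe's SSN is ... 580-60-1234")

true_positives = len(gold & predicted)
precision = true_positives / len(predicted) if predicted else 0.0
recall = true_positives / len(gold) if gold else 0.0
print(f"precision {precision:.2f}, recall {recall:.2f}")
```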
Summing up
Ensuring data privacy will become a mandatory requirement as LLMs become ubiquitous in the coming years.
Differential privacy and data masking are two essential tools for enterprises striving to ensure compliance. Understanding these techniques and evaluating their effectiveness is crucial for organisations to protect their data privacy.
The author is Head of AI Center of Excellence at Tredence, a data science company.
Edited by Swetha Kannan
(Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the views of YourStory.)