“If we mostly know what sensitive data we have and where it is, that’s good enough for us,” said no one ever.
That’s why Normalyze blends traditional techniques with advanced technologies to deliver accurate, cost-efficient data classification.
A Smarter Approach to Data Classification
While regular expressions and NLP models do a good job of spotting sensitive data, they often fall short in understanding context, leading to frustrating false positives. That’s where LLMs come in—they’re brilliant at handling complex nuances, resulting in more precise classification for the most contextually sensitive and unstructured data. But you can’t use only LLMs because the monetary and performance costs would be prohibitive.
By combining the strengths of each method, we reduce costs, maximize performance, and provide top-tier accuracy in safeguarding sensitive information. Our customers can rely on accurate, cost-effective data classification as the foundation for their data security programs.
“We selected Normalyze for its reliable and comprehensive sensitive data identification, which is the basis to enforce least privilege access and eliminate potential data risks.”
– Sandeep Chandana, Director of Data Science & Analytics at Snowflake
Data Classification Approaches
Regular expressions and NLP models have long been the industry's standard approaches to data classification, each with its own use cases and limitations.
Regular expressions are effective for recognizing predefined patterns, such as credit card numbers or Social Security Numbers. However, they struggle with understanding context, which often leads to a high rate of false positives when similar patterns appear in non-sensitive contexts.
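To make the false-positive problem concrete, here is a minimal sketch (not our production detectors) showing how a bare pattern for U.S. Social Security numbers flags anything with the right shape, sensitive or not:

```python
import re

# A plain pattern for the 3-2-4 digit shape of a U.S. Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

samples = [
    "Employee SSN: 512-44-1234",          # genuinely sensitive
    "Order tracking code 512-44-1234",    # same shape, not an SSN
    "Part number 512-44-1234 restocked",  # same shape, not an SSN
]

for text in samples:
    if SSN_PATTERN.search(text):
        # The pattern alone cannot tell these cases apart, so every
        # match is flagged, including the non-sensitive ones.
        print(f"Flagged as possible SSN: {text!r}")
```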
Natural Language Processing (NLP) models improve upon regular expressions by recognizing entities based on language patterns and context, making them more versatile in identifying data types such as personal information (e.g., names and addresses) and intellectual property (e.g., specialized documents like chip design blueprints and proprietary recipes). Despite their improved capabilities, NLP models still have limitations when dealing with nuanced contexts or distinguishing between similar entities, which can result in false positives. Additionally, they require significant training to adapt to new data types.
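As an illustration of this style of entity recognition, the sketch below runs a generic off-the-shelf NER model (spaCy's small English model, used purely as an example and not a description of the models we run) over a sentence containing an ambiguous name:

```python
# Illustrative only: a generic off-the-shelf NER model, not the models
# described in this post. Requires:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Please ship the package to John Smith at 10 Smith Street, Boston.")
for ent in doc.ents:
    # Entity labels such as PERSON, GPE, or FAC come from the model's
    # training data; ambiguous spans like 'Smith Street' can still be
    # mislabeled when the surrounding context is thin.
    print(ent.text, ent.label_)
```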
Large Language Models (LLMs) represent the new frontier in data classification, helping to reduce false positives and significantly improve accuracy. Unlike traditional approaches, LLMs can understand and analyze the broader context in which data appears, making them highly effective at distinguishing between similar data types. For instance, determining whether a piece of text truly refers to a specific person can be challenging because context plays a critical role; LLMs excel at analyzing that context thoroughly and making accurate identifications, even in complex situations.
Example: Detecting Names
To provide an example, consider the case of person name detection:
- Regular expressions: Regular expressions might identify potential names by matching patterns in text, such as recognizing the pattern of a capitalized first name followed by a last name. For instance, ‘John Smith’ might be flagged as a potential name. However, regular expressions lack the context to determine if the detected name is actually a person’s name or something else, such as ‘Smith Street’ or ‘Smith & Co.,’ a business name.
- NLP models: NLP models can improve upon this by using training data to identify common names. For example, an NLP model might recognize ‘John Smith’ as a person’s name if it appears in a context it has been trained on. However, NLP models can still struggle when the context is ambiguous. If ‘Smith’ appears in ‘Smith Street,’ the NLP model might still incorrectly classify it as a person’s name because it lacks a deeper understanding of the context.
- LLMs: This is where LLMs come in—LLMs can analyze the surrounding context more deeply. For instance, LLMs can identify that ‘Smith’ in ‘Smith Street’ is a location, not a person’s name, based on the broader context provided in the sentence. This capability allows for richer and more accurate classification.
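A minimal sketch of how an LLM check could be layered on top of the earlier detectors, assuming a placeholder complete() function that wraps whatever model is in use (both the function and the prompt are illustrative, not our implementation):

```python
from typing import Callable

def is_person_name(candidate: str, sentence: str,
                   complete: Callable[[str], str]) -> bool:
    """Ask an LLM whether a candidate span refers to an actual person.

    `complete` is assumed to wrap some chat/completion API and return
    the model's text response; it is a placeholder, not a real client.
    """
    prompt = (
        "Does the span refer to a person's name in this sentence? "
        "Answer YES or NO.\n"
        f"Span: {candidate}\n"
        f"Sentence: {sentence}"
    )
    return complete(prompt).strip().upper().startswith("YES")

# Example usage with a stubbed-out model call:
# is_person_name("Smith", "Turn left onto Smith Street.", complete=my_llm)
# -> expected to return False, because the surrounding words signal a street.
```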
Other data that can be hard to classify accurately without context includes account numbers, customer IDs, or invoice numbers with similar or overlapping formats; sensitive information buried in free-text fields (e.g., emails, documents, or chat messages) or in unstructured data with nearly infinite permutations; and the gender associated with a person's name.
Ensuring Performance While Optimizing Cost
While LLMs are best suited to address the above situations, it’s not practical to perform all data classification with LLMs – the processing costs would be prohibitive, not to mention performance would suffer.
Normalyze addresses this challenge with a hybrid approach to data classification. We leverage regular expressions and NLP models to perform an initial pass at identifying sensitive data, significantly reducing the data volume that needs to be further analyzed. After this initial pass, we use LLMs for a more detailed analysis, focusing on the data that requires deeper contextual understanding. This approach helps us manage costs while maximizing the accuracy of our classification system.
Additionally, LLMs are used selectively—only for those entity types where context is crucial for accurate classification, such as distinguishing between personal names and other similar text elements.
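To illustrate the flow, here is a simplified sketch of how such a tiered pipeline could be wired together; the detector interfaces and the set of context-dependent types are placeholders, and our production pipeline is more involved:

```python
# Entity types where surrounding context matters enough to justify an LLM pass.
CONTEXT_DEPENDENT_TYPES = {"person", "address"}  # illustrative set

def classify(text, regex_detectors, nlp_model, llm_check):
    """Tiered classification: cheap detectors first, LLM only where needed.

    regex_detectors: dict of entity_type -> compiled regex (first pass)
    nlp_model:       callable returning (entity_type, span) candidates
    llm_check:       callable(entity_type, span, text) -> bool (final arbiter)
    All three are placeholders for whatever detectors are actually in use.
    """
    candidates = []

    # First pass: regular expressions for well-defined patterns.
    for entity_type, pattern in regex_detectors.items():
        for match in pattern.finditer(text):
            candidates.append((entity_type, match.group()))

    # First pass: NLP model for language-based entities.
    candidates.extend(nlp_model(text))

    # Second pass: send only context-dependent candidates to the LLM.
    results = []
    for entity_type, span in candidates:
        if entity_type in CONTEXT_DEPENDENT_TYPES:
            if llm_check(entity_type, span, text):
                results.append((entity_type, span))
        else:
            results.append((entity_type, span))
    return results
```

Because most candidates never reach the second pass, the expensive LLM calls are reserved for the small slice of data where context genuinely changes the answer.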
The Data Classification Process
The following is the process we use to classify data efficiently:
- Context-enhanced regular expressions: We use context-enhanced regular expressions to achieve precise entity recognition. For example, when classifying credit card data, we identify specific terms that may precede (e.g., Visa, Mastercard) or follow (e.g., CVV) the credit card number. In addition, we validate the credit card number to ensure its authenticity using established algorithms and known data patterns (see the sketch after this list).
- NLP models: We employ NLP models that are trained to recognize a wide range of specific data types, such as personal information (e.g., names and addresses) and intellectual property, including specialized documents like chip design blueprints and proprietary recipes. These models play a crucial role in ensuring accurate data identification across diverse data contexts.
- LLMs: Finally, we leverage LLMs to reduce false positives and provide higher accuracy. For certain entity types, such as ‘person,’ context is crucial to determine whether the reference text indeed points to an individual. For these types, we rely on LLMs to analyze the context thoroughly and make accurate identifications.
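As a rough sketch of the first step above, the snippet below pairs a card-number pattern with nearby context keywords and a Luhn checksum, a standard validation algorithm for card numbers; it is simplified relative to the validation we actually perform:

```python
import re

CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")
CONTEXT_KEYWORDS = ("visa", "mastercard", "card", "cvv")  # illustrative list

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum used to validate card numbers."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def looks_like_card(text: str) -> bool:
    """Flag only when the pattern, nearby keywords, and checksum all agree."""
    match = CARD_PATTERN.search(text)
    if not match:
        return False
    has_context = any(k in text.lower() for k in CONTEXT_KEYWORDS)
    return has_context and luhn_valid(match.group())

print(looks_like_card("Visa card: 4111 1111 1111 1111, CVV 123"))  # True
print(looks_like_card("Ticket ID 4111 1111 1111 1112"))            # False
```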
Choice of Models
When choosing our LLM technology, we kept a few key factors in mind to make sure we got the performance we needed without breaking the bank:
- Foundational vs. custom models: First, we looked at whether to use an off-the-shelf foundational model or build our own custom one specifically for data classification. In the end, we went with a foundational model that we could train for our needs—this gave us the best of both worlds: existing capabilities plus some customization.
- Model size and trade-offs: We opted for a smaller model to strike a balance between efficiency and performance. Smaller models are faster and more memory-efficient, which fits perfectly with our focus on quick processing and keeping costs under control. While larger models can offer higher accuracy, their higher costs and longer processing times make them less practical for our approach to data classification.
- Portability: We also made sure the model could run on different accelerator hardware without needing any one specific high-end GPU. Our goal was flexibility, so we can use the model across various platforms without being tied to one setup.
By considering these factors, we ensured our LLM choice hit the sweet spot of cost-effectiveness, performance, and easy deployment, helping us deliver efficient and accurate data classification.
The Foundation for Data Security
Data classification is the backbone of all your data security efforts, but it’s not just about identifying data—it’s about doing it smartly and accurately. When you know exactly what sensitive data you have and where it’s located, you can make smart, targeted decisions on how to protect it. Accurate classification is key for setting up the right access controls, applying encryption where it matters, and making sure your data loss prevention (DLP) tools are focusing on the right information. It also helps streamline incident response, meet compliance requirements, and decide when to securely delete old data. When your data is accurately classified, you can be confident that you are building your security program on a solid foundation.