Creating Custom Sensitive Information Types in Microsoft Purview

Background on Custom Sensitive Information Types

Sensitive Information Types are a type of data classifier in Microsoft nomenclature. You can find them in your tenant at https://compliance.microsoft.com --> Data Classification --> Classifiers (see screenshot below). Note that you must have at least the "Sensitivity Label Administrator" role to be able to create and edit Sensitivity Labels. This role is part of the Compliance Administrator role group (as well as other role groups).

Classifiers, including Sensitive Information Types, allow you to automatically identify and classify documents and files based on the information they contain. For example, if I wanted to identify documents in my environment that contain US Passport Numbers, I could use the built-in Sensitive Information Type "U.S. / U.K. Passport Number".

This classification capability can be very powerful. By identifying files that meet the criteria for a specific sensitive information type and then attaching a label, we can exert control over that document, implementing protective measures. For example, we can:

Prevent it from being emailed.
Encrypt it.
Control who can access it.
Put markings in the header, footer, or the content of the document.

In addition to the "U.S. / U.K. Passport Number" built-in sensitive information type I mentioned above, there are roughly 300 additional built-in sensitive information types (as of the time of this writing; new ones are always being added). The system, however, also provides functionality to build custom sensitive information types. In this post, I am going to show you exactly how to do that.

Use Case

Your company makes energy drinks. The ingredients and recipes for these drinks are carefully guarded secrets. If anyone, especially your competitors, were to acquire the secret formulas for your drinks, it would be devastating to your business. Your company uses Microsoft 365. Documents are stored on Teams, Email, OneDrive, and SharePoint.

As an information security officer in my company, I would like to find all content within our environment that contains proprietary formulas and ingredients for our energy drinks. This includes documents that contain the keywords:

Cayenne
Caffeine
Taurine
Espresso

Configuration

After ensuring that I have the appropriate permissions (see above), I will start at https://compliance.microsoft.com --> Data Classification --> Classifiers and click "Create Sensitive info type".

Step 1: Name our sensitive info type

Step 2: Create a Pattern

A pattern allows us to tell the sensitive info type what to look for. There are four methods for defining our primary element:

Regular Expression: Uses a very specific shorthand notation to identify and match patterns.
Keyword List: Provides a list of keywords to look for in files. There is a limit to how many keywords can be provided.
Keyword Dictionary: Similar to a keyword list but with an unlimited number of keywords.
Function: Finds text that is formatted in known patterns (e.g., SSN, Credit Card, etc.).

In our case, we are looking for a small list of specific keywords so it makes the most sense to select a keyword list.
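To make the idea of a keyword list concrete, here is a minimal Python sketch of what matching against our four keywords could look like. This is only an illustration of the concept, not how Purview implements it: the function name `find_keywords` is hypothetical, and the case-insensitive, whole-word matching is an assumption I made for the example.

```python
import re

# Keywords from the use case above.
KEYWORDS = ["Cayenne", "Caffeine", "Taurine", "Espresso"]


def find_keywords(text, keywords=KEYWORDS):
    """Return (keyword, start_offset) pairs for each hit in the text.

    Assumption for this sketch: matching is case-insensitive and only
    whole words count, so "espressos" does not match "espresso".
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(1).lower(), m.start()) for m in pattern.finditer(text)]


hits = find_keywords("Batch 7: add taurine, then espresso concentrate.")
print(hits)
```

Any single hit is enough for a match here, which mirrors a pattern that has a primary element and nothing else.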

Our pattern looks like this:

Note that:

I selected a high confidence level. The confidence level matters more when we define supporting elements and a character proximity. In this case we haven't defined those things, so the sensitive info type will match on any occurrence of any of the keywords.
I didn't select a supporting element. However, if our company had a policy to include the word "secret" or "formula" or "recipe" in documents that contained secret formulas or ingredients, I could include that as a supporting element to help improve the accuracy of my sensitive info type. Imagine, for example, that I send someone an email saying that I "needed a boost of caffeine to make it through the day". The sensitive info type would flag that email because of the word "caffeine". It would be a false positive, though, because the email does not contain a secret recipe or formula. If, however, I specified "high confidence" and required the word "secret" to be within 100 characters of "caffeine", the sensitive info type would correctly ignore that email because it does not contain the word "secret".
I can also add additional checks to refine the accuracy:

In our case, for example, our company might have a coding system that prefixes ingredients with a code. For example, cayenne might be IN-cayenne. In that case, we may want to match the keywords only when they carry the prefix "IN-".
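The effect of a supporting element, a character proximity, and a prefix check can be sketched in a few lines of Python. Again, this is a rough model under stated assumptions rather than Purview's actual matching engine: the function names are hypothetical, the supporting words are the "secret", "formula", and "recipe" examples from above, and measuring distance as the character gap between the two words is my own simplification.

```python
import re

PRIMARY = ["cayenne", "caffeine", "taurine", "espresso"]
SUPPORTING = ["secret", "formula", "recipe"]  # example supporting words
PROXIMITY = 100  # characters, as in the example above


def _find_words(words, text):
    """Return regex matches for whole-word, case-insensitive hits."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    return list(pattern.finditer(text))


def high_confidence_matches(text, required_prefix=""):
    """Return primary keywords that have a supporting word nearby.

    If required_prefix is given (for example "IN-"), a primary keyword
    only counts when that prefix appears immediately before it.
    """
    supporting_hits = _find_words(SUPPORTING, text)
    results = []
    for hit in _find_words(PRIMARY, text):
        if required_prefix:
            start = max(0, hit.start() - len(required_prefix))
            if text[start:hit.start()].lower() != required_prefix.lower():
                continue  # keyword is present but lacks the prefix
        # Assumption: distance is the character gap between the two words.
        near = any(
            max(s.start() - hit.end(), hit.start() - s.end()) <= PROXIMITY
            for s in supporting_hits
        )
        if near:
            results.append(hit.group(0).lower())
    return results


email = "I needed a boost of caffeine to make it through the day"
print(high_confidence_matches(email))
print(high_confidence_matches("Secret formula: 40 mg caffeine per can"))
```

The casual email produces no match because no supporting word appears near "caffeine", while the second text does. Passing `required_prefix="IN-"` narrows the results further to coded ingredients such as IN-cayenne.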

Step 3: Finalize

Below is our fully configured sensitive information type.

[Screenshot: sensitive information type final settings]

Next Steps

Sensitive information types, by themselves, do nothing. They can be used in a variety of ways to enforce data policies, sharing restrictions, retention, and privacy. Within Microsoft, they can be used in:

Data Loss Prevention rules
Retention Labels
Insider Risk Management
Communication Compliance
Microsoft Priva

And, of course, Sensitivity Labels. In a future post, we will put this sensitive info type to use in a sensitivity label. Stay tuned.