How We Built the Best Google Drive Compliance Engine
April 2, 2015
Google Drive is an incredible toolset. It provides your organization and users with the means to collaborate in real time, on any document, from any location, on any device. The productivity gains for your organization are immeasurable, but one of the major concerns we hear is that, while it’s easy for users to share and store information, there isn’t a secure, scalable way for administrators to monitor, classify, and protect data inside their domain.
With the addition of our new Google Drive Compliance Engine with regular expression searching, we’ve taken our solution to the next level by providing IT administrators and security teams the tools to search for specific types of content, including credit card numbers, social security numbers, and personally identifiable information (PII) within Google Drive.
Searching full document content using regex is an enormous undertaking, as you need to be able to quickly search millions of documents for potentially millions of character combinations. We took this one step further by performing these searches in real time, constantly scanning file content as users work and documents are updated.
How we built this solution is actually an interesting story, as it presented our development team with some exciting challenges that we addressed with creative and scalable solutions.
Challenge #1: Build a system to scan billions of documents in real time
To build this kind of scalable architecture, we leveraged two key solutions from the Google Cloud Platform and Google Developer teams. The first is Google Compute Engine, which gives our infrastructure the necessary processing power to actually scan the full content of millions of documents across thousands of domains.
The second solution we leveraged is a combination of two APIs—Drive Activity Report and Push Notifications—to receive an alert whenever a change is made to a file. If you’ve ever used Google Docs, Sheets, or Slides, you’ve probably noticed how the application says “Saving…” every few seconds as you make changes. Well, each one of these “save” events triggers a push notification, and BetterCloud is listening to every single one of those notifications in order to identify violations of your Drive Compliance policies.
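To make the notification flow concrete, here is a minimal Python sketch of how a receiver might interpret one incoming notification. It is an illustration only, not BetterCloud's implementation. It assumes notifications arrive as HTTP requests carrying the X-Goog-* headers described in Google's push notification documentation; the names ChangeEvent, parse_notification, should_rescan, and RESCAN_STATES are hypothetical, and the choice of which resource states trigger a rescan is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these resource states mean the file may have new content to scan.
RESCAN_STATES = {"add", "update", "change"}


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical container for the fields we care about in one notification."""
    channel_id: str
    resource_id: str
    state: str
    message_number: int


def parse_notification(headers: Mapping[str, str]) -> Optional[ChangeEvent]:
    """Build a ChangeEvent from request headers, or return None if they are incomplete."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    try:
        return ChangeEvent(
            channel_id=normalized["x-goog-channel-id"],
            resource_id=normalized["x-goog-resource-id"],
            state=normalized["x-goog-resource-state"],
            message_number=int(normalized["x-goog-message-number"]),
        )
    except (KeyError, ValueError):
        return None


def should_rescan(event: ChangeEvent) -> bool:
    """Decide whether this event should queue the file for a compliance scan."""
    return event.state in RESCAN_STATES


# Example with made-up header values.
example_headers = {
    "X-Goog-Channel-ID": "channel-123",
    "X-Goog-Resource-ID": "resource-abc",
    "X-Goog-Resource-State": "update",
    "X-Goog-Message-Number": "42",
}
event = parse_notification(example_headers)
print(event, should_rescan(event))
```

Keeping the parsing and the rescan decision as small pure functions means they can sit behind any web framework or queue consumer without changes.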
To put this into perspective, we’re currently scanning over 2 billion documents across our entire customer base, and adding millions more every day. This means we’re routinely processing tens of billions of push notifications, a truly amazing feat.
Challenge #2: Build a robust regex search engine to find specific strings in file content
If you’re not familiar with regex, consider the following scenario. You need to find a Google Sheet that contains a list of phone numbers, but you don’t know any of the contacts personally, so you can’t search for a specific number. What you do know, however, is that phone numbers follow specific patterns (888-888-8888 or 9999999999, for example). Regex search lets you search for anything that matches the pattern, rather than for specific phone numbers.
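The phone number scenario can be sketched in a few lines of Python. This is an illustration only: the pattern and the names PHONE_PATTERN and find_phone_numbers are hypothetical, and Python's re module is not a POSIX engine (the lookarounds used here to avoid matching inside longer digit runs are Python-specific).

```python
import re

# Hypothetical pattern: either ddd-ddd-dddd or exactly ten consecutive digits.
# The lookbehind/lookahead reject matches embedded in a longer run of digits.
PHONE_PATTERN = re.compile(
    r"(?<![0-9])(?:[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{10})(?![0-9])"
)


def find_phone_numbers(text: str) -> list:
    """Return every substring of text that matches the phone number pattern."""
    return PHONE_PATTERN.findall(text)


sample = "Call 888-888-8888 or 9999999999. Order 12345678901 is not a phone number."
print(find_phone_numbers(sample))  # prints ['888-888-8888', '9999999999']
```

One pattern finds both formats without knowing any actual number in advance, and the 11-digit order number is skipped because it is longer than the pattern allows.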
It’s search, but 100 times smarter.
So apply that same exercise and think about sensitive information, like social security numbers, credit card numbers, personal addresses, student ID numbers, or proprietary information that follows a specific text pattern. With regex, you can write one search to find any text or number that fits the pattern. This is the magic and genius of regex.
We chose to build a POSIX regex engine, which allows our customers to easily write their own regular expressions, or choose to use our pre-built expressions.
Challenge #3: Validate results to reduce noise
Of course, with a tool as powerful as regular expressions, there is the potential for false positives and false negatives. While we typically encourage customers to err on the side of capturing too much content in their policy, we also want to make Drive Compliance policies as accurate as possible.
To achieve a high level of accuracy, our team has employed a number of tools to perform additional checks on the content captured by the regex search to further validate the content and reduce the number of false positives in a search.
For example, the Luhn algorithm is a checksum formula that tests whether a sequence of digits could be a valid credit card number. We run this algorithm during all regex searches for credit card information, allowing us to separate plausible card numbers from other 16-digit numbers that merely match the pattern. We also perform similar validation tests on social security numbers, phone numbers, and zip codes.
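As a minimal sketch of this two-step approach, the Python below first finds 16-digit candidates with a regex and then keeps only those that pass the Luhn checksum. It is an illustration, not BetterCloud's implementation: the names CARD_CANDIDATE, luhn_valid, and find_card_numbers are hypothetical, and the candidate pattern is deliberately simple (real card numbers vary in length and are often written with spaces or dashes).

```python
import re

# Hypothetical candidate pattern: exactly 16 consecutive digits.
CARD_CANDIDATE = re.compile(r"(?<![0-9])[0-9]{16}(?![0-9])")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return regex candidates from text that also pass the Luhn check."""
    return [match for match in CARD_CANDIDATE.findall(text) if luhn_valid(match)]


# 4111111111111111 is a widely used test number that passes the checksum.
sample = "Card 4111111111111111 on file; tracking id 4111111111111112."
print(find_card_numbers(sample))  # prints ['4111111111111111']
```

The checksum cannot prove that a card exists. It only filters out digit strings that could never be a card number, which removes most of the noise from a pure pattern match.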
Challenge #4: Give customers convenience, but also complete flexibility
The pre-built regular expressions in Drive Compliance satisfy common use cases, such as searching for payment information, personal information, and other common types of sensitive data. But many customers who need this type of solution have complex requirements and need to tweak our searches to fit them.
To make this as easy as possible for our customers, we made the decision to display the full strings of our pre-built regular expressions for your administration or security personnel to view, edit, and build upon. We realize building regular expressions can be more of an art than a science, so we provide customers with the ability to troubleshoot and update the regex as necessary to fit their exact needs.
A new challenge: detecting obscenities with regex
While the pre-built regex searches that shipped with Drive Compliance satisfy a wide variety of use cases, a number of Google Apps customers asked for a way to find obscene or objectionable content in Google Drive. So, naturally, we set out to add this type of search to Drive Compliance, and it’s now available to all BetterCloud Enterprise customers.
This feature, called our “Profanity” content filter in the application, searches an organization’s external or publicly-shared Google Drive content for any documents with harsh or inappropriate language. Once identified, the Drive Compliance policy engine takes any actions outlined in your policy configuration, such as changing the sharing settings, transferring document ownership, notifying specific users (like a manager or teacher), and more.
As you might imagine, there are hundreds of obscene words and phrases to find and these phrases are continually changing. In fact, Google has dedicated an entire team of engineers to work on this for years. If you would like to view the list we used (caution: extremely offensive and not safe for work!), you can find it here on GitHub.
BetterCloud is the only compliance tool that provides you access to the full regex, so you can update this search, adding language that’s a concern for your organization or removing terms that aren’t.
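To show how a word list can become a single editable search, here is a small Python sketch that compiles a list into one case-insensitive regex and then rebuilds it after adding and removing terms. It is an assumption about one reasonable approach, not the regex shipped in the product: build_word_filter is a hypothetical helper, the placeholder words are deliberately mild, and the \b word boundary is a Python feature rather than a POSIX one.

```python
import re


def build_word_filter(words):
    """Compile a set of words into one case-insensitive, whole-word regex."""
    if not words:
        raise ValueError("word list must not be empty")
    # Escape each word so punctuation is matched literally, and try longer
    # entries first so multi-word phrases win over their shorter prefixes.
    escaped = sorted((re.escape(word) for word in words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


# Mild placeholder words standing in for a real list.
base_words = {"darn", "heck"}
pattern = build_word_filter(base_words)
print(pattern.findall("Well, darn it. Heck no! Checkmate and darning."))
# prints ['darn', 'Heck']

# An organization adds one term and removes another, then rebuilds the search.
custom_words = (base_words | {"shoot"}) - {"heck"}
custom_pattern = build_word_filter(custom_words)
print(custom_pattern.findall("Oh shoot, what the heck."))
# prints ['shoot']
```

Matching whole words only keeps innocent words such as "Checkmate" from triggering the filter, which matters once the list grows to hundreds of entries.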
While this type of policy is not necessarily tied to security, it can be critical to maintaining a professional image throughout your organization, identifying potential HR or misconduct issues, and regulating educational environments.
So what’s next?
From the initial wireframes all the way through the entire development process, we designed Drive Compliance to be a best-of-breed solution, working with customers all along the way to validate our direction and progress (and thank you all very much—we couldn’t have done this without you).
We feel very strongly that we have built the strongest security tool for Google Drive on the market. We look forward to continuing to enhance the product to provide our customers with a comprehensive solution, and we would love to help you evaluate the solution if you’re interested.