Last Updated on May 26, 2021 by Ashok Kumar
Data sprawl happens when enterprises collect, process, and store vast amounts of data, and find it increasingly difficult to keep track of what data they have, where it is located, and who has access to it.
What is Data Sprawl?
Our data (both structured and unstructured) is consumed by a wide range of applications and operating systems and stored on a variety of endpoints and servers. Our data might be stored locally, or on one or more cloud storage platforms, which might be located in different geographic zones. Having such a complex, distributed, and dynamic IT environment leads to what is referred to as “data sprawl”, and presents a huge risk to our data.
Not only is it difficult to monitor and control data stored on many disparate systems, but it also makes it harder for employees to quickly and accurately retrieve the information they need – thus resulting in a loss of productivity. To make matters worse, a lot of the data we store is duplicate or ROT (Redundant, Obsolete, and Trivial), which also leads to excessive utilization of resources. Likewise, companies often perform analytics on their data in order to make important business decisions, which is significantly more challenging when your data is spread across multiple environments.
With the advent of stringent data privacy laws, such as the GDPR, it has never been more important to know exactly where your sensitive data resides and to be able to retrieve it in a timely manner. For example, under the GDPR, individuals have the right to access, rectify, and erase personal data that is held about them, and organizations are required to respond to a Subject Access Request (SAR) within one month of its receipt. A failure to do so could result in a costly lawsuit or fine.
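To make the one-month deadline concrete, here is a minimal Python sketch for working out a response date. It assumes the deadline falls on the corresponding calendar date in the following month, or on the last day of that month if it is shorter. That convention is an assumption for illustration, and the function name sar_due_date is hypothetical. The sketch does not model extensions, weekends, or public holidays, so treat it as a starting point rather than legal guidance.

```python
import calendar
from datetime import date


def sar_due_date(received):
    """Return the corresponding calendar date one month after receipt.

    Assumption: if the next month has no such day (e.g. the 31st),
    the due date is the last day of that month.
    """
    if received.month == 12:
        year, month = received.year + 1, 1
    else:
        year, month = received.year, received.month + 1
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(received.day, last_day))
```

For example, a request received on January 31, 2021 would be due on February 28, 2021 under this convention.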
How to Manage Data Sprawl
The first step to managing data sprawl is to decide on the best place to store your data. These days, due to significant improvements in cloud security standards, you will probably find that an established cloud storage platform is your best choice. Additionally, storing all of your data in the cloud will make it more accessible to your employees – thus improving productivity. By storing all of your data in a centralized cloud repository you can establish a “single source of truth”. You will also need to establish a comprehensive set of data access governance (DAG) policies that detail how data should be collected, processed, and stored, and that cover access controls, retention and disposition, risk management, compliance, and more.
Data Discovery & Classification
Knowing exactly what data you have, where it is located, and who has access to it, is crucial if you want to prevent data sprawl. As I’m sure you can imagine, manually classifying data by sifting through files stored in multiple independent repositories would be a daunting task. Fortunately, technologies exist which can spare you the trouble. Automated data discovery and classification solutions do what they say on the tin. They scan your repositories (both on-premises and cloud) for sensitive data and classify it according to your chosen schema. The more sophisticated solutions will allow you to choose a classification taxonomy that aligns with the data privacy laws that are relevant to your industry.
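To give a rough feel for what pattern-based discovery involves, here is a minimal Python sketch that labels text files by the kinds of sensitive data they appear to contain. The labels, the regular expressions, and the function names are all hypothetical and chosen for illustration. Commercial tools rely on far more robust detection (validation checks, context, proximity rules), so this should not be read as how any particular product works.

```python
import re
from pathlib import Path

# Hypothetical, deliberately simple patterns; expect false positives and misses.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def classify_text(text):
    """Return the sorted labels whose pattern matches somewhere in the text."""
    return sorted(label for label, pattern in PATTERNS.items() if pattern.search(text))


def classify_tree(root):
    """Map each file under root to its labels, skipping files with no matches."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Undecodable bytes are ignored so binary files do not raise errors.
            text = path.read_text(encoding="utf-8", errors="ignore")
            labels = classify_text(text)
            if labels:
                results[path] = labels
    return results
```

Calling classify_tree on a shared folder would return a dictionary of file paths and labels, which you could then map onto whichever classification taxonomy you have chosen.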
Removing Duplicate and Irrelevant Data
As mentioned previously, organizations store a lot of data that is either duplicated or redundant. It would be a good idea to remove this data as soon as possible to ensure that the repositories you are working with are lean and clean.
Data deduplication tools are primarily used for backup and restore operations, but they can still be used to remove duplicate data from your repositories. Even if you choose not to use a data deduplication solution, classifying your data will make the process of removing duplicate data much easier, because duplicate files will generally end up classified under the same label.
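If you want to check a repository for exact duplicates yourself, one common approach is to compare content hashes. The Python sketch below assumes that two files count as duplicates only when their bytes are identical. The function names file_digest and find_duplicates are hypothetical, and the sketch is not a description of how any particular deduplication product works.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by content hash; return groups of two or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Each group returned by find_duplicates lists files with identical content, so you can keep one copy and review the rest. Near-duplicates, such as the same document saved in two formats, will not be caught by this approach.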
Removing ROT data is slightly more complicated since there are no automated tools that can accurately determine whether the data is needed or not. That said, there are numerous content searching tools that will allow you to sort data based on the date it was last accessed – thus giving you an insight into whether the data is still relevant. For example, if a given file hasn’t been accessed for several years, and doesn’t contain any sensitive data (PII, PCI, PHI, IP and so on), you can move the file to a “Redundant” folder, which will allow you to retrieve the data at a later date if required.
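As a rough illustration of the last-accessed approach, the Python sketch below lists files that have not been accessed for a given number of days and moves them into a “Redundant” folder. The function names and the folder location are hypothetical. The sketch also assumes the file system records access times reliably, which is not always the case (some systems are configured to update them rarely or not at all). It does not check for sensitive data, so that check would need to happen before anything is moved.

```python
import shutil
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_stale_files(root, max_age_days, now=None):
    """Return files under root whose last-access time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.stat().st_atime < cutoff:
            stale.append(path)
    return stale


def move_to_redundant(paths, redundant_dir):
    """Move files into redundant_dir so they can still be retrieved later.

    Name collisions are not handled here; a real script would need to
    rename or preserve the original folder structure.
    """
    target = Path(redundant_dir)
    target.mkdir(parents=True, exist_ok=True)
    for path in paths:
        shutil.move(str(path), str(target / path.name))
```

Keeping the “Redundant” folder outside the tree you scan avoids picking up the same files again on the next run.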
Data-Centric Audit & Protection (DCAP)
Another great way to prevent data sprawl is to use a Data Security Platform that gives you a data-centric view of your security. While monitoring access to sensitive data doesn’t directly prevent data sprawl, a Data Security Platform will give you valuable insights into where your data is located, how and when it is used, and a lot more.
To start with, most solutions can aggregate and correlate event data from multiple sources, including both on-premises and cloud environments, and display a summary of these events via a single dashboard. Many solutions also provide built-in data discovery and classification tools. Many also use machine learning techniques to establish typical usage patterns, which can provide useful information about who accesses what data, and when.
Another cause of data sprawl relates to inactive user accounts. Inactive (or “ghost”) user accounts are a big threat to the security of our networks. Because these accounts are rarely monitored, attackers often try to hijack them in order to gain access to the network without getting noticed.
Most solutions will automate the process of identifying and removing inactive user accounts, and since these accounts are typically tied to data, removing them (and their associated data) can help to prevent data sprawl.
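The identification step can be sketched in a few lines of Python. The example below assumes you can export each account’s last login time from your directory service into a dictionary. The function name, the data format, and the 90-day threshold are all hypothetical and chosen for illustration. Accounts with no recorded login are flagged as well, and the result is intended as a list for review rather than for automatic deletion.

```python
from datetime import datetime, timedelta


def find_inactive_accounts(last_logins, now, max_idle_days=90):
    """Return usernames whose last login is missing or older than max_idle_days.

    last_logins maps a username to a datetime, or to None if the
    account has never logged in.
    """
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(
        user
        for user, last_login in last_logins.items()
        if last_login is None or last_login < cutoff
    )
```

Service accounts often have no interactive logins, so they are likely to show up in this list and should be checked against other activity before any action is taken.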
Prevention is Better Than Cure
You should have policies in place that determine how data should be collected, why the data is collected, where the data should be stored, and for how long the data should be retained. All employees must be trained to ensure that they can adhere to these policies. All data should be stored in one repository, where it can be searched and sorted via a centralized dashboard. It would also be a good idea to prevent employees from storing data on their personal devices. Some companies provide their employees with “thin clients”, which don’t have any built-in storage capabilities.
Additionally, companies may want to consider using a mobile device management solution to prevent users from downloading data onto portable drives and devices. Finally, some Data Loss Prevention (DLP) solutions are able to automatically detect and block sensitive data as it leaves the network, which helps to keep your critical assets in one place and out of the wrong hands.
If you’d like to see how the Lepide Data Security Platform can help you to manage data sprawl and prevent data breaches, schedule a demo with one of our engineers.