There are but three inevitabilities in this world: death, taxes, and way too much data. Everything from web browsing and search engine results to Amazon orders and Facebook posts produces new digital information at a blistering pace. According to a 2014 EMC Digital Universe article, total data creation in 2013 amounted to 4.4 zettabytes (or 4.4 trillion gigabytes), a figure anticipated to double every two years. In every way imaginable, society is truly embracing the “Digital Age”… but is there a true understanding of the data created, collected, and stored within it?

Prior to the current incarnation of cloud computing and storage, data was traditionally stored on individual PCs and/or on storage devices in specialized data centers. Today’s organizations, meanwhile, seek better solutions for storing and managing data. Cloud storage solutions—including file storage, databases, and message queues—are now commonplace, and cloud vendors can provide reliable solutions that meet modern data storage needs.

However, a 2013 MIT Technology Review report determined that less than 0.5% of created and stored data is actually used, which raises the question: why is there such a great need to store it all? There are several viewpoints on this matter amongst my industry colleagues. Some say that storage is cheap, and we don’t know yet what the value of the data is or might someday be. Others think this overabundance of data represents an organizational risk, and argue it should be purged to avoid legal and reputational consequences.

Storage needs and costs vary from organization to organization. Many companies manage their own data centers and disk storage, while other companies use cloud-only storage. A growing number of companies even use both to some degree. That being said, there are no clear-cut statistics available that define average data storage by organization or organization type.

During one of my previous engagements, for example, I built and supported a SaaS Data Warehouse in the Scholarly Publishing industry. The database alone contained more than 4 terabytes of data, and file storage added over 10 times that quantity. While 44 terabytes isn’t an earth-shattering amount of data, other costs factored into the overall cost of that data. The high availability/disaster recovery (HADR) requirement doubled expenses all by itself; then there was the offline/backup solution, and so on from there. Every feature had a price tag, and the storage of the data itself represented only a fraction of the total capital required. This pattern more or less continues when it comes to data creation and growth. More and more customers want more and more data: social media data, website data, membership data, society data, etc. While data sources in this space continue to expand, though, customer use of this data remains difficult to quantify—in particular, whether said use justifies the cost of storing and maintaining it.
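To make that cost pattern concrete, here is a minimal back-of-the-envelope sketch in Python. The per-terabyte price, backup overhead, and HADR doubling factor are illustrative assumptions, not figures from the actual engagement.

```python
# Back-of-the-envelope cost sketch for the 44 TB example above.
# All figures here are illustrative assumptions, not actual vendor rates.

DB_TB = 4                   # primary database size (TB)
FILE_TB = 40                # file storage, roughly 10x the database
PRICE_PER_TB_MONTH = 25.0   # assumed blended storage price, $/TB-month

primary_tb = DB_TB + FILE_TB     # 44 TB primary footprint
hadr_tb = primary_tb             # HADR standby doubles the footprint
backup_tb = primary_tb * 1.5     # assumed offline/backup retention overhead

total_tb = primary_tb + hadr_tb + backup_tb
monthly_cost = total_tb * PRICE_PER_TB_MONTH

print(f"Total footprint: {total_tb:.0f} TB")                            # 154 TB
print(f"Estimated monthly storage bill: ${monthly_cost:,.2f}")          # $3,850.00
print(f"Raw data share of the footprint: {primary_tb / total_tb:.0%}")  # ~29%
```

Even in this simplified model, the original data accounts for less than a third of the bytes the organization pays for; the rest is copies of copies.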

In the 1990s, during the “Great Data Warehousing” build-out, many vendors talked about stale data—data that had no relevance to business processes and/or decision-making. In those days, storage was far costlier than it is today.

So what’s changed nowadays? In many ways, nothing. Useless data is simply that: useless data. There is a plethora of excuses for keeping all organizational data—“storage is cheap”, “we’ll figure out a way to use the data later”, “I was just holding these files until the real owner gets back”—but is keeping everything really a best practice? In my experience, I have rarely known a customer to validate “dead” data as meaningful.

Five years from now, this seemingly harmless data collection may outgrow both our expectations and our capacity to handle it. And to make matters worse, it’s likely that no one will even remember why much of that old data matters at all. Organizations may not realize it today, but hoarding data now is liable to result in bloated and untenable “Data Landfills” later.

In these budding “Data Landfills”, over 80% of the data is redundant, obsolete, or trivial (ROT). Even worse, this data is not information or a business asset. It’s just plain data—zettabytes of it.
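To give the “redundant” slice some teeth, here is a minimal sketch that flags byte-identical files by content hash. The share path is a hypothetical placeholder, and a real ROT assessment would also weigh obsolescence (last access) and triviality (business relevance).

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under `root` by the SHA-256 hash of their contents."""
    groups: defaultdict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # For very large files, hash in chunks instead of read_bytes().
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only hashes shared by more than one file: true byte-level duplicates.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

# Hypothetical usage: point it at a file share suspected of hoarding copies.
for digest, paths in find_duplicates("/mnt/shared_drive").items():
    print(f"{len(paths)} identical copies: {[str(p) for p in paths]}")
```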

So if organizations want to prevent this from happening, what should they do next?

  1. Turn data into information wherever possible. Create data quality rules for data transformations, and place them as close to the business as possible.
  2. Get rid of “n number of copies” rules. One version of the data is fine in most cases. What good is ten-year-old data if no one looks at it and no business decisions are made on it?
  3. Store only what is needed. Will it be important five years from now what some random person said about our company? Does the organization have any legal obligation to keep the data? If not, define the business case to support it.
  4. Stay relevant. Define requirements. Grow the data that is needed. Prune the dead data (a minimal pruning sketch follows this list).
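As a sketch of that pruning step, the snippet below deletes rows that have not been touched within a retention window. The table name, column name, and five-year window are illustrative assumptions; any real purge would run only after legal-hold and compliance review.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 5 * 365  # hypothetical five-year retention window

def prune_dead_rows(conn: sqlite3.Connection) -> int:
    """Delete rows not accessed within the retention window.

    Assumes an illustrative table customer_data(id, payload, last_accessed)
    where last_accessed holds an ISO-8601 timestamp.
    """
    cutoff = (datetime.now() - timedelta(days=RETENTION_DAYS)).isoformat()
    cur = conn.execute(
        "DELETE FROM customer_data WHERE last_accessed < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount

# Usage sketch: a real purge would run as a governed, audited job, and only
# after legal-hold and compliance checks have cleared the data for deletion.
conn = sqlite3.connect("warehouse.db")
print(f"Pruned {prune_dead_rows(conn)} stale rows")
```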

Landfills don’t always smell the best. Don’t make the organization stink with Data Landfills.

About The Author

Benny Mouton

Benny Mouton is a senior consultant with over 28 years of Information Technology experience. He has worked in a variety of industries including finance, state government, healthcare, telecom, scholarly publishing, and pharmaceuticals. He has designed, developed, and implemented both transactional and analytical systems over the course of his career. His primary focus is data management, data governance, and data architecture. He is a graduate of Louisiana State University with a degree in Quantitative Business Analysis/Computer Science.
