What is Dark Data?

Ben Schmidt
Author
I am going to help you build the impossible.

You hear it constantly in the startup world.

Data is the new oil.

Investors want to see your metrics. Advisors tell you to track every user interaction. Engineering teams build pipelines to capture every event, click, and server response. You are building a culture of data collection because you believe that information equals power.

But there is a catch.

Collecting data is easy. Storing data is relatively cheap. Analyzing data is difficult.

What happens to the vast majority of the information you collect? It sits in a cloud storage bucket. It gathers digital dust. It is never queried. It is never used to make a strategic decision or improve a product feature.

This is Dark Data.

It is not sinister. It is simply unused. However, for a founder trying to build a lean and efficient organization, ignoring this data is a mistake. It represents either a wasted opportunity or a hidden liability.

Defining the Idle Asset

Gartner defines dark data as the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes.

Think of it as the operational exhaust of your business.

When a user visits your site, your server creates logs. When a customer emails support, that text is archived. When you hold a Zoom meeting, the recording is saved to the cloud.

If you take that server log and use it to fix a bug, it is functional data. If you take that log and let it sit on a server for five years without ever looking at it, it has become dark data.

Industry surveys have estimated that more than half of the data organizations store is dark.

In a startup context, this usually happens because of a fear of missing out. You do not know what questions you will need to answer in the future. So, the default setting is to save everything. You assume that one day you will have the time and the data science team to mine this information for gold.

That day rarely comes.

Instead, you end up with a pile of mostly unstructured data. This includes documents, text messages, audio files, video files, and still images. Unlike structured data, which fits neatly into the rows and columns of a database, most dark data is messy. It is hard to search and hard to quantify.

The Financial and Technical Burden

There is a misconception that storage is so cheap it does not matter.

While the cost per gigabyte has dropped historically, the volume of data generated by modern startups has exploded.

If you are running a SaaS platform, you might be generating terabytes of log data a month. If you are also keeping high-definition video backups, that footprint grows even faster.

You are paying a monthly rent on digital real estate that nobody visits.
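To make that rent concrete, here is a rough back-of-the-envelope sketch. The price per gigabyte and the monthly growth rate are made-up assumptions for illustration, not quotes from any cloud provider, so plug in the figures from your own invoice.

```python
# Illustrative numbers only: the price and growth rate are assumptions,
# not quotes from any cloud provider.
PRICE_PER_GB_MONTH = 0.02   # hypothetical storage price in dollars
GROWTH_TB_PER_MONTH = 2     # hypothetical volume of new logs each month


def monthly_bills(months, growth_tb=GROWTH_TB_PER_MONTH, price=PRICE_PER_GB_MONTH):
    """Return the storage bill for each month when nothing is ever deleted."""
    bills = []
    stored_gb = 0
    for _ in range(months):
        stored_gb += growth_tb * 1000   # new data piles on top of the old
        bills.append(stored_gb * price)
    return bills


bills = monthly_bills(36)
print(f"Month 1 bill:  ${bills[0]:,.2f}")
print(f"Month 36 bill: ${bills[-1]:,.2f}")
print(f"Three-year total: ${sum(bills):,.2f}")
```

The amount of new data never changes from month to month in this sketch, yet the bill climbs every month, because the old data is never removed.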

Beyond the direct invoice from your cloud provider, there is technical debt. When your engineers need to migrate databases or upgrade infrastructure, the sheer volume of this dark data slows them down. It makes backups take longer. It increases the time it takes to restore systems after an outage.

Dark data acts as friction. It is weight in the boat that makes the startup move slower.

We have to ask ourselves a difficult question. Are we paying to store this data because it has value, or are we paying to store it because we are too lazy to decide what to delete?

The Liability in the Shadows

The financial cost is annoying, but the security risk is dangerous.

This is where dark data turns into a boardroom issue.

Data you do not use costs money

If you do not know what is in your data, you cannot protect it. Dark data often contains sensitive information that you are not even aware you possess.

Consider a folder of old customer support exports from three years ago. It might contain personally identifiable information (PII) or credit card fragments that were never properly sanitized.

If your startup suffers a data breach, hackers will not just steal your active database. They will steal everything they can access.

If that dark data leaks, you are still liable for it.

You cannot tell regulators or your customers that you did not know the data was there. Ignorance is not a defense in compliance frameworks like GDPR or CCPA.

Holding onto data you do not use increases your attack surface without increasing your business value. It creates a scenario where you have all the risk and none of the reward.
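A first pass at finding out what is hiding in old exports can be simple. The sketch below scans text for email addresses and for digit runs that pass the Luhn checksum used by payment card numbers. The function names and patterns are illustrative assumptions, and a check like this will miss plenty of PII, so treat it as a starting point rather than a compliance tool.

```python
import re

# Simplified patterns, assumed for illustration. Real PII detection needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text):
    """Return emails and card-like numbers found in a block of text."""
    findings = {"emails": EMAIL_RE.findall(text), "card_numbers": []}
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            findings["card_numbers"].append(digits)
    return findings


sample = "Refund for jane@example.com, card 4111 1111 1111 1111, ticket 1234 5678 9012 3456"
print(scan_text(sample))
```

The Luhn check is there to cut down on false alarms. A ticket number that happens to be sixteen digits long usually fails the checksum, while a real card number passes it.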

Mining for Value

It is not all bad news.

Sometimes, dark data is simply untapped potential. The goal for a founder is to convert dark data into functional data.

This is where modern machine learning and AI tools can actually help.

Previously, analyzing thousands of hours of customer support calls was impractical, because a human had to listen to every one of them. Now, you can use transcription services and natural language processing to analyze that dark data.

You can uncover patterns.

Are customers angry about a specific feature?

Is there a competitor mentioned frequently in your sales calls that you are ignoring?

This transforms the data from a storage cost into a business asset.
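Once the calls are transcribed, even a crude count can point you at a pattern. This is a minimal sketch that assumes you already have the transcripts as plain text. The term list is hypothetical, and plain keyword matching is a stand-in for the heavier natural language processing a real project would use.

```python
import re
from collections import Counter


def count_mentions(transcripts, terms):
    """Count how many transcripts mention each term at least once."""
    counts = Counter()
    for text in transcripts:
        lowered = text.lower()
        for term in terms:
            # Whole-word match so "export" does not match "exporter".
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            if re.search(pattern, lowered):
                counts[term] += 1
    return counts


# Hypothetical transcripts and terms, for illustration only.
transcripts = [
    "The export feature keeps crashing, we may switch to Acme.",
    "Acme quoted us a lower price.",
    "Love the new dashboard.",
]
mentions = count_mentions(transcripts, ["export", "Acme", "dashboard", "billing"])
for term, n in mentions.most_common():
    print(f"{term}: mentioned in {n} transcript(s)")
```

Counting transcripts rather than raw occurrences keeps one long, repetitive call from drowning out everyone else.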

The key is intent. You must have a specific question you want to answer. If you can define the question, you can look into your dark data to see if the answer exists.

If the answer is not there, and you cannot foresee a scenario where the data answers a critical business question, it is likely time to let it go.

Operational Hygiene for Founders

How do you handle this as you build?

You need a policy for data retention. This sounds corporate and boring, but it is essential for a well-run business.

Start by auditing what you collect. Look at your cloud storage bills. Look at your server logs. Look at your third-party tools.
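An audit can start as a short script. The sketch below walks a folder and lists every file that has not been modified in a given number of days, along with its size. It assumes the data sits on a local disk or a mounted export. A cloud bucket would need your provider's own listing tools instead, and the function name is just an illustration.

```python
import os
import time


def find_stale_files(root, max_age_days, now=None):
    """Return (path, size_in_bytes) for files not modified in max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400   # 86400 seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            if info.st_mtime < cutoff:
                stale.append((path, info.st_size))
    return stale


if __name__ == "__main__":
    stale = find_stale_files(".", max_age_days=365)
    total_bytes = sum(size for _path, size in stale)
    print(f"{len(stale)} files untouched for a year, {total_bytes / 1e9:.2f} GB in total")
```

Modification time is only a proxy for "never used", but it is usually enough to show where the dust has settled.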

Classify the data into three buckets.

First is critical business data. This is what you use every day to run the company.

Second is compliance data. This is what you are legally required to keep for a set period, like financial records or tax documents.

Third is everything else.

For that third bucket, you need to make a decision. Can you automate an analysis of it to gain insights? If not, can you set an automated deletion schedule?

Maybe you only keep server logs for 30 days instead of forever. Maybe you delete video recordings of meetings after 90 days.
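A deletion schedule can be written down as a small lookup table. This sketch uses the 30 and 90 day examples from above. The category names and the `is_expired` helper are hypothetical, and anything without a rule is kept until someone classifies it.

```python
from datetime import date, timedelta

# Retention periods in days. The category names are illustrative assumptions.
RETENTION_DAYS = {
    "server_logs": 30,
    "meeting_recordings": 90,
}


def is_expired(category, created_on, today):
    """Return True if an item has outlived the retention period for its category."""
    days = RETENTION_DAYS.get(category)
    if days is None:
        return False   # no rule yet: keep it until someone decides
    return today - created_on > timedelta(days=days)


today = date(2024, 2, 15)
print(is_expired("server_logs", date(2024, 1, 1), today))         # 45 days old, limit is 30
print(is_expired("meeting_recordings", date(2024, 1, 1), today))  # 45 days old, limit is 90
```

Compliance data belongs in the same table with whatever period the law requires, so the script never deletes something you are obliged to keep.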

By actively managing this, you reduce your costs. You reduce your legal risk. You make your engineering team faster.

Building a company is about focus. That focus should extend to your digital assets. Do not be a data hoarder. Be a data editor. Keep what matters and have the confidence to delete what does not.