Data Silos: More Silos More Problems

It’s hard to keep up with your data silos. It’s even harder to compare data from different silos. Here’s how you can standardize your data in order to conduct data analysis.

Outlier AI
9 min readAug 18, 2017

Most businesses are using dozens of software products to run their business, ranging from payments and advertising to analytics and databases. Each of these systems contains a wealth of valuable information about your business, but having that information in different products can make it hard to get a complete view of all your data. When your data is not collected into one place and instead isolated among different systems, you have data silos.

Silos can be very frustrating, because the most useful and valuable insights usually involve more than one part of your business. How do your marketing tactics relate to customer support requests? How do changes to your website affect product returns? If your data is isolated into silos it can be hard to answer these questions.

Even if you invest the time required to combine your data from various products into one place (often called a “data warehouse” or a “data lake”), there are always a few systems that resist consolidation due to obscure data formats or difficulty in exporting their data. These silos persist because the types and number of software tools we use in business are changing far faster than our ability to aggregate.

We’ll cover techniques for breaking down data silos and combining your data together to make analysis easier. Specifically we will cover:

  • Part 2 — Export, Transform, Load (ETL)
  • Part 3 — Joins
  • Part 4 — Fingerprinting
  • Part 5 — Correlations

Extract, Transform, Load

The process of consolidating your data from multiple different products and sources is known as Extract, Transform and Load (ETL). This should be a familiar phrase for anyone working in data since up to 90% of your time as a data analyst or engineer can be spent doing ETL. Unlike some of the terms we review here, this one is named for exactly what it involves:

  1. Extract. Downloading your data from software tools is easier today than it was years ago, but it still can be complex. Some tools lets you download everything at once via either their user interface or an API query, others require you to make a request and wait a few hours (or days) and even others have no easy way to get your data. This difficulty is why data collection tools (see Data Collection Tools) were created and are very popular today — they allow you to avoid this problem.
  2. Transform. Unfortunately, almost every product and tool you use will store its data in a different format. In order to be able to consolidate you need to transform the data from the format of the product into a common format that will allow you to combine it with others. Sometimes this is as simple as changing some terms (e.g. referring to California as “CA” instead of “CALIFORNIA”) and others it means changing the data format entirely (e.g. moving from JSON to SQL). There are many tools that can help with this process, but it is generally the hardest part.
  3. Load. Finally, you need to load the transformed data from all of your sources into one place. Typically this is a SQL-based system (see Data Storage Tools), but there are many different options available today.

While ETL is conceptually simple, the difficulties lie in all of the details involved in the data for each product you use. It can take a while to refine your ETL process even for a single product, and quite a long time to reliably ETL the data from all of your products. Many companies have dedicated people working on nothing but ETL to ensure they have the best data available at all times.

Even when your ETL is complete you haven’t broken down your data silos! The data from each source will be in the same place but still be separate, until you find some way to join it together. That’s what we’ll cover in the next few sections: different ways to combine your data from different sources after you have it in one place.

Joins

The good news about combining data from different sources is that every part of your business should have something in common. Perhaps it’s your product names, employees or customers, but the same pieces of information should appear across many different systems (if not all of them). We can use the common elements of data to join (hence “joins”) the two data sources together.

For example, let’s say we work at an e-commerce company and want to understand how our email marketing campaigns affect customer support request. Do some of our promotions create more support problems than others? To answer that, we’ll need to combine the records between our customer support system and email marketing system so we know which customers received emails and contacted customer support. Below are two example records, one from each system:

While much of the data between the two systems is different, you will notice that both use email address as part of their record. Since email addresses should be unique to a specific customer, we can use that information to combine everything we know from both systems together.

That was easy!

Unfortunately, you will rarely have the luxury of a unique identifier like email address available across systems by default. To ensure these identifiers are available, you will need to make sure that identifier is used as broadly as possible in both your workflows and in your data systems. It’s no coincidence that companies assign “Employee ID Numbers” and hospitals use “Patient ID Numbers” since they have high incentive to guarantee there is a unique identifier across every data system.

Now we will talk about more advanced ways to break down data silos even when there is no unique identifier available. Get out your detective kits, it’s time to track some fingerprints!

Fingerprinting

Joins seem like a great way to consolidate all of your data quickly and easily, but, unfortunately, you won’t always have the luxury of a common unique identifier. Your customer support system may identify customers by their email addresses, but your payments system might only have their physical addresses. How do you combine different sources of data with no common identifier?

Fingerprinting is a method of creating a new identifier from the information you do have in each system. Instead of using a single, unique, identifier you combine a few different pieces of data to create a composite identifier.

For example, returning to our online e-commerce company example, we might want to join the data from our Email Marketing System and our Website Logs to see which users visited our website from an email we sent them. Below are two example records from each:

As you can see, there are no unique identifiers in common across the two different systems. However, we can create a unique identifier (fingerprint) by combining a few different pieces of information together:

  • IP Address — This is the network address of a computer, which changes frequently in many cases. A customer might have a different IP address every time they connect, but they will use the same IP address for the entirety of their session.
  • Date/Time — This is the date and time that the user connected. Many different users might connect at the same time.

While neither the IP Address or the Date/Time is unique on their own, together they will uniquely identify a customer!

Combining different pieces of data together into fingerprints is a very common approach to joining data, but it does have some drawbacks:

  1. It may not be possible to guarantee the fingerprint you choose will be unique, so using them to join data can introduce errors into your data. You should expect to have duplicates or undercounting.
  2. Depending on the fingerprint, you may not be able to use it to identify the same user across multiple different interactions. In our example above, using IP address and Date/Time will not help us identify when a customer returns a week later because both their IP Address and Date/Time will be different.

The best fingerprints combine a wide variety of data to minimize these drawbacks, but no matter what you do it will not be a perfect substitute for a unique identifier.

Correlations

We’ve covered correlations previously in a number of different ways, including finding insights. If you remember, a correlation is a measure of the similarity of two metrics (if you don’t remember, see our series on Simple Statistics). That’s great, but how can correlations help us join data from different data sources and break down data silos?

When it’s not possible to join two data sources together using either unique identifiers or fingerprinting, you can instead think of them as two different sources of metrics. By looking for correlation relationships between the metrics from these different sources we can infer joins in the source data.

For example, let us assume that we want to understand how our email marketing campaigns affect our customer support traffic but have neither unique identifiers for a join nor reliable fingerprints to use. Consider the chart below, which plots two recent email marketing campaigns and two types of customer support requests:

If we measure the correlation coefficients between all of these metrics, we find that Email Campaign 1 (blue line) and Support Request: Discount Error (purple line) are highly related with a coefficient value of 0.85. Clearly something about that campaign is related to the rise in those types of requests, and we can delve into the individual records in our customer support and email systems to try and figure out why. There is no clear relation between Email Campaign 2 and either type of Support Requests, as their correlation coefficients are around 0.5.

Of course, it’s also obvious from looking at the chart, but you might have hundreds or thousands of metrics from each system. Measuring the correlation is a faster way to infer joins than visually comparing each pair of metrics!

This approach will let you answer high-level questions, such as “Which marketing campaigns resulted in more customer support requests?” It will not help you figure out which specific users received those emails or opened those support requests, so it is only useful as a high-level tool. The good news is that once you find a correlated relationship, you can pull up the individual data from each system and look for patterns that might be hiding in them.

In Review: Data silos are everywhere, and they can prevent you from doing important analysis by breaking your data up into multiple different places. You have many tools at your disposal to bring that data back together again, ranging from unique identifier joins to fingerprinting to correlation analysis. Often the best insights come from crossing between data silos, so mastering these techniques can produce big benefits.

--

--

Outlier AI

Outlier discovers unexpected changes and patterns in your data automatically.