Resources - CN Blog

Using Python To Combine Datasets For Digital Marketing

POST January 22, 2020

By Sarah Burke

Working in digital marketing, there are several reasons why you might want to combine two datasets from two different sources together. It could be combining Google Analytics data from two different accounts or properties, or turning two different reports into one super report - such as combining page data from Screaming Frog with data from Search Console. Whatever your reason may be, you arrived on this page for answers!

In this article, we’ll run through the different ways to combine datasets with Python so you can find the best method for your datasets and goals. We’ll go over:

Concatenation
Merging
Joining

Fire up a nice new Jupyter Notebook and let’s get started!

Step 1: Set Up

Import pandas

import pandas as pd

Read In Your Datasets

For an introduction on reading in CSV and XLSX files, take a look at Python & Google Analytics: Prepare Data For Analysis.

gsc1 = pd.read_csv("GSC 1.csv")

gsc2 = pd.read_csv("GSC 2.csv")

The process is similar for other file types like JSON or HTML - if you need to read in one of these or from a database, a quick Google search should return the answers you need.

Step 2: Choose Your Method

Concatenation

Concatenation is used to essentially add one dataset on to another. The datasets must have the same column or row names so you can add one to the other and create one big dataset.

In the example below, I have query data from two different Google Search Console accounts (actual queries hidden for privacy!). Both websites/accounts are under the same parent brand and my goal is to create one master report of their search query data.

After importing pandas and reading in my CSV files, I take a look at the head of each dataset to make sure everything pulled in okay and my columns are indeed identical between both datasets:

I also want to take a look at the info on each of these DataFrames so I can compare it to my concatenated DataFrame to make sure it all adds up.

1,000 entries on each - noted! I should then expect 2,000 on my concatenated DataFrame.

Now it’s time to concatenate. This is done with one simple line where you are passing your DataFrames into the concat function as a list.

As expected, there are 2,000 rows, and my same 5 columns remain.

As you can see, the data is concatenated on the row by default. If the dataset you’re using requires you to concatenate on the column instead of the row, specify the axis=1:

pd.concat([df1, df2], axis=1)

I’m also going to set this DataFrame to a new name so I can call it later on.

From here, you’ll need to do some work to clean up the data to handle any missing values and duplicate queries.

Merging

Merging should be used when you want to combine datasets that contain different dimensions. It requires at least one common column to merge the reports on.

Unlike the concatenation example, I may want to create a really powerful report that combines page data from Google Analytics and Screaming Frog. Each of these reports will contain different columns but will have the page URLs in common.

I’ll start by reading in my CSV files and viewing the heads:

A couple things to note here:

The URLs are hidden for privacy, but you can see with this little glimpse that Google Analytics uses only the page path, while Screaming Frog uses the full address including the domain. For the merge to work properly, I’ll need to change the URL structure on one of the datasets to make them match.

I plan on merging my DataFrames on the “Page” column but as you can see, Google Analytics calls it “Page” and Screaming Frog calls it “Address”. There are two different ways to handle this, I’ll go over both to show how the output differs:
- Make the column names match between datasets and use the on parameter
- Keep the column names as is and use the left_on and right_on parameters

Before we run through how each of these work, let’s take a look at the info on each DataFrame so we can compare it with our merged DataFrame.

Merge DataFrames On a Shared Column Name (on parameter):

I’ll start by changing the “Address” column on my Screaming Frog data to match Google Analytics’ “Page” column, and adding the domain onto my Google Analytics page paths so both DataFrames use the full URL.

Now, I can merge on the common Page column:

Since I’m using an inner join, rows only remain if they had a common URL between both datasets. The inner join is the default used by pandas and is the one you’ll probably want to use in most cases. But, be aware that pandas also supports left, right and outer joins if you need them.

Merge DataFrames Using Different Column Names (left_on and right_on parameters):

With this method, I’m specifying the different column names to merge on.

As you can see, my merged DataFrame contains the same amount as rows as using the on parameter, but I have an extra column. With this method, both columns that were merged on appear in the DataFrame. Personally, I prefer using the on parameter to keep the DataFrame cleaner.

Joining

Joining is essentially the same idea as merging, except it’s used when you want to join on an index instead of a column.

This is done by adding the join function to one of the DataFrames and passing in the other DataFrame. Using my DataFrames, it would look like:

ga.join(sf)

This will default to an inner join, but you can also specify a different type of join:

ga.join(sf, how='left')

Step 3: (optional): Export Your Data

At this point, you might have more work to do in your Jupyter Notebook or elsewhere, but if and when you’re ready, you can export your new dataset into a CSV or excel file.

gsc_all.to_csv('all_gsc.csv')

gsc_all.to_excel('all_gsc.xlsx')

Your file will save to the same place that the Jupyter Notebook you’re working in is located.

I hope this article helped you find the best way to combine your datasets. Now you can join me on the road to better reporting! Don’t forget to merge in carefully! If you can think of a pun for concatenation or would like to share how you’re actually using these methods to make meaningful datasets and reports, please share in the comments below!

While you’re here, check out all of the ways Cypress North can use data & analytics to help your business.

Categories: Data Analysis Python for Digital Marketing

POST January 22, 2020

By Sarah Burke

Related Resources

Blog by: Jack Novorr

6.16.2025

Data Analysis

Is Your Digital Marketing Strategy Putting You at Risk? Understanding CCPA’s New Legal Precedent

Capital One’s privacy lawsuit highlights growing risks for marketers relying on standard tracking technologies. Here's what you need to know.

Blog by: Jack Novorr

5.6.2025

Data Analysis

Using BigQuery to Overcome GA4 Data Retention Limits

Keep your GA4 data forever with BigQuery. Learn how to set up BigQuery to start storing raw GA4 data before it's gone for good.

Blog by: Kristen Nalewajek

9.27.2024

Data Analysis

Why US Businesses Need to Prioritize Data Privacy Now

The U.S. doesn't have a comprehensive national data privacy policy in place, but that doesn't mean businesses aren't being impacted. Learn more about the state-level policies reshaping digital marketing strategy and compliance.

Data - Blog - Google Collab [Background]

Blog by: Kristen Nalewajek

3.21.2024

Data Analysis

How to Get Started Using Python for Data Analysis in Google Colaboratory

Learn how to use the free Google Colab tool and perform data analysis with Python programming language in this tutorial for digital marketers and data analysts.

Blog by: Kristen Nalewajek

2.16.2024

Data Analysis

How to Save Universal Analytics Data

All historical data from Google’s Universal Analytics will be deleted on July 1, 2024. Learn more about what your options are for backing it up before it’s gone for good.

Blog by: Matt Mombrea

5.10.2023

Web Programming & Development

Is Google Analytics 4 a Tactical Move Away From Free Analytics?

There’s something fishy going on with the way that Google is handling GA4. To me, it’s playing out as a backdoor cash grab, hidden under a thin veil of a free and easy migration from UA.

Blog by: Sarah Burke

11.22.2022

Data Analysis

Data Lakes & Data Warehouses: What Are They? (& Why Your Company Probably Needs Both)

Data lakes and data warehouses have gained increased interest from organizations in recent years for their ability to support a single source of truth for data-driven decision-making across various departments. Understanding the strengths and applications of each is important not […]

Blog by: Olive Caresosa

4.21.2022

Conversion Optimization

How to Get Started with GA4: A Step-by-Step Guide

Need help setting up GA4 for your company or client’s website? Look no further! This post provides a step-by-step process for creating GA4 properties and best practices to make sure necessary events are tracked and the data flowing into GA4 are accurate.

Blog by: Sarah Burke

1.13.2022

Data Analysis

How To Change Your Google Analytics Attribution Model in GA4

One of the biggest changes to Google Analytics has arrived in 2022 - the ability to change your Google Analytics attribution models. This is a first for Google Analytics as this attribution model change will not just apply to a […]

Blog by: Greg Finn

12.17.2021

Data Analysis

Why You Should Set Up Google Analytics 4 Today

Let's face it. GA4 isn't GR8. Google Analytics 4 is a work in progress to put it kindly. However, in these final weeks of 2021 you have an opportunity to get GA4 installed and tuned up, giving your future self […]

Blog by: Olive Caresosa

10.25.2021

Data Analysis

What to Include in a PPC Dashboard

Learn What Metrics to Include in PPC Reports. Then, Download Our Free Data Studio Dashboard Template! Let’s be real, pay per click advertising is all about data. What campaigns are bringing in the most revenue? What landing pages are converting […]

Blog by: Greg Finn

5.31.2021

Data Analysis

6 Google Ads Custom Columns to Help Uncover More Data

You may already know you can create custom columns in the Google Ads online interface. But, if you're anything like me, you may not always think about how you can leverage custom columns to surface essential Google Ads performance metrics, […]

Blog by: Greg Finn

10.29.2020

Data Analysis

How To See Audience Performace Across Campaigns With Google Ads Reports

Google Ads makes it really easy to see performance at the campaign or ad group level, but analyzing audience performance across multiple Google Ads campaigns is easier said than done. You're left wondering.... What's working well? What's not? Combining like-minded […]

Blog by: Sarah Burke

10.8.2020

Data Analysis

Install Google Analytics on Web Stories With the Official WordPress Plugin

You read that right - the moment we've all been waiting for is here! Google’s Web Stories plugin is out of beta and now offers the ability to install Google Analytics on Web Stories directly in the plugin. If you […]

cross-domian tracking google tag manager

Blog by: Sarah Burke

9.30.2020

Data Analysis

Cross-Domain Tracking With Google Tag Manager: A Simple Guide

Cross-domain tracking can make your life a lot simpler if you find yourself having to analyze Google Analytics data from two different sites. It allows you to capture the full user journey from the moment they land on one domain […]

google analytics multi-channel report matching

Blog by: Sarah Burke

8.20.2020

Data Analysis

Why Don't Multi-Channel Funnel Reports Match Up With Other Reports in Google Analytics?

Why don't numbers from the multi-channel funnel reports match up with numbers for the same metrics in other Google Analytics reports? The discrepancy is largely due to differences in what Google considers direct traffic. Read our guide to gain a full understanding of attribution differences in Google Analytics reporting.

Blog by: Sarah Burke

5.29.2020

Data Analysis

Exploring a New Dataset With Python Part II: Using Seaborn To Visualize Data

Welcome to Part II of Exploring a new dataset with Python! If you missed Part I: The Basics, you can check it out here. In this article, we’ll be returning to our animal mug company’s dataset to continue our exploratory […]

Blog by: Sarah Burke

4.22.2020

Data Analysis

Exploring a New Dataset With Python: Part I

We’re taking it back to the basics in this article. Why? The day of a Digital Marketer is busy. We’re pulled in all sorts of different directions and are responsible for a lot of different things. In my personal experience, […]

Blog by: Sarah Burke

3.17.2020

Data Analysis

Strip Query Strings From URL Data: Python For Digital Marketing

If you’ve ever spent any time in a Google Analytics account, you’re all too familiar with the fact that the data isn’t always pretty. One exceptionally common scenario that us marketers run into all the time is page data being […]

Blog by: Sarah Burke

2.18.2020

Data Analysis

Pandas Groupby Function: Python for Digital Marketing

If you’ve been following along with our Python for Digital Marketing posts, you’ve imported data from Google Analytics into a Jupyter Notebook and may have even combined it with another dataset. If you’ve never been here in your entire life […]