Strip Query Strings From URL Data: Python For Digital Marketing
If you’ve ever spent any time in a Google Analytics account, you’re all too familiar with the fact that the data isn’t always pretty. One exceptionally common scenario that us marketers run into all the time is page data being split up due to query parameters in the URL. It’s not uncommon for a landing page from an ad campaign, for example, to take up hundreds of rows in an analytics report. Each click generates a unique combination of parameter values including click ID, device, placement, or any other parameters you may be tracking. Google Analytics sees each of these as a unique URL that goes into its own row, making it difficult for marketers to see true totals or averages for the page.
While these parameters are usually put in place for a reason and are extremely valuable for campaign tracking, I always recommend setting up a separate view in every Google Analytics account that filters out the parameters so you can flip back and forth depending on what you need. But, this doesn't always happen and there are plenty of times when you’re stuck cleaning up the messy data for one reason or another.
In this article, I’m going to show you how you use Python's pandas library to:
- Strip the query string off of each URL
- Aggregate the data for each URL
- Create a nice clean CSV file that has the aggregate value for each URL
We’ll use techniques from our past blogs, including exporting data from Google Analytics and reading it into a Jupyter Notebook, and grouping data using the groupby() function. So if you’re new to any of these steps, check those out first for full explainers.
To start, you should have the page data you need exported from Google Analytics along with all of the columns that you’re interested in.
Go ahead and import the pandas library and your Google Analytics file, and take a look at the info() and the head().
In the info() output, take note of the data types for each metric. If your numbers contain commas, they’ll come in as objects instead of a numeric data type and you’re going to have problems later on when it’s time to aggregate the data. Use your favorite method to make sure your numbers are actually coming in as numbers - for mine, I simply used the thousands separator when reading in my CSV.
Next up, we’re going to split the URL where the query string starts, on the question mark. The first part will be the page path (the part we want), and the second part will be the query string. I’m going to add str[0] so it only displays the first part.
This will replace the values in the Landing Page column.
Okay - our dataframe is cleaned up and we’re only looking at URL paths that have the query strings removed.
Now, we can group by each URL. This means that data will be aggregated for all duplicate URLs. In this next step, I’m defining which type of aggregation I want for each metric - I want the sum for Sessions and New Users, and the average for Pages/Session. I’m adding round(2) to keep my Pages/Session values to the nearest hundredth.
newga = ga.groupby('Landing Page').agg({'Sessions':'sum', 'New Users': 'sum', 'Pages / Session': 'mean'}).round(2).reset_index()
You can either take a look at your data here in the Jupyter Notebook, filtering by a certain URL path for example:
Or, you can export all of your data to a new CSV file to explore.
newga.to_csv('pages-noparams.csv')
This will give you the full list of your pages and their aggregate values. Now look at your clock and see how much time you saved! If crunching the numbers is still taking up a bit too much of your time, see how we can work together on your next data & analytics project. Happy reporting.
0 Comments