
Tutorial: Create a Remarketing List with Predictive Analytics

In this tutorial, you will learn how to create a predictive model for customer conversion based on a combination of in-house CRM data and Google Analytics Premium logs. It consists of an initial code lab using pre-generated sample data, followed by a detailed implementation guide that shows you how to put predictive analytics into practice using your own data.

The material in this repository complements an article introducing the topic on the Google Cloud Platform Solutions website.

Two-step tutorial

You will start by diving into the deep end and use the R statistics software to generate a report of how likely a given customer is to convert, based on website behaviour. This will be a step-by-step code lab that aims to give you a solid grounding in some of the statistical methods you can incorporate into your own predictive analytics. Hopefully, it will also whet your appetite for implementing this using your own Google Analytics Premium and CRM data.

Following the code lab, you can move on to the detailed implementation guide showing you how to analyze your own Google Analytics Premium logs using the powerful capabilities of BigQuery. The implementation guide will walk you through the required steps to:

  • set up your website to capture the all-important Google Analytics Client ID
  • use BigQuery to generate your own training dataset for creating a statistical model
  • create a remarketing list based on your statistical predictive model, again using the power of BigQuery
  • create an AdWords audience that will target only the website visitors that are most likely to convert into a sale

If you're all set to go, dive into the code lab to build a statistical model.

Code Lab: Create a Remarketing List with Predictive Analytics

In this code lab you will analyze a sample dataset and create a predictive statistical model based on a combination of CRM and website log data. You will be using the R software to analyze the sample data.

Before you Start

The code lab makes some assumptions about your skills and experience. These are not strict requirements, but ideally, you will have some experience using R and a basic understanding of statistical concepts like correlation and regression.

Follow these steps to set up R on your platform if you do not already have an R environment available:

  1. Download and install R for your platform using these installation instructions. The Code Lab was developed and tested using R v3.2.1, but a later version should work as well.
  2. Install the required R packages by following these steps:
    • Start your R interactive environment by executing the R command
    • Run the following commands at the interactive prompt
              > install.packages("ggplot2")
              > install.packages("ROCR")
              > install.packages("car")
              > install.packages("rms")

The Code Lab Scenario

The example scenario is that of a car dealership chain with showrooms across the nation. Your company website provides general information and advertises your latest promotions, but all actual car sales happen in person in one of your showrooms. Wouldn't it be great if you could find out which website visitors are most likely to visit your showroom and take a test drive if you reach out to them?

Conveniently, the internal company CRM system has a field, customer_level, that records the level of engagement the customer has had with the company to date. A value of 3, for example, means that the customer has been on a test drive. A value of 5 means that the customer has made a car purchase.

In an effort to analyze this existing data and better understand how their website has affected customers' decisions and behaviour in the past, the company has linked their historical Google Analytics Premium log data and the relevant CRM customer records together. They now want to build a statistical model from the historical data, with the aim of using this model to predict future behaviour and prioritize remarketing efforts towards website visitors who seem most likely to further engage, based on their browsing of the corporate website.

To this end, the company has added a field in the CRM called CV_flag to indicate whether a customer has at least taken a test drive. In other words, CV_flag will be 1 if the customer_level is 3 or higher, and 0 otherwise. They have created a set of training data based on historical GAP logs linked to specific customers in the CRM, including the CV_flag variable, and now it is your task to take this data and build a statistical model (in the implementation guide you will learn how to create this type of training data using your own data).
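The CV_flag rule is simple enough to spell out. As a quick illustration (not part of the tutorial code, which uses R and BigQuery), deriving the flag from customer_level looks like this in Python:

```python
def cv_flag(customer_level):
    # 1 if the customer has at least taken a test drive (level 3 or higher), else 0.
    return 1 if customer_level >= 3 else 0

# Levels 1-2: no test drive yet; level 3: test drive; level 5: purchase.
print([cv_flag(level) for level in [1, 2, 3, 4, 5]])  # [0, 0, 1, 1, 1]
```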

Building the Model

This model hinges on a binary outcome. Either the customer converts and at least books a test drive (success) or they don't (failure). To model this scenario, you will build a statistical model using logistic regression analysis to predict the likelihood of a conversion.

The formal model can be defined as follows:

Bernoulli function

  • Yi is the objective variable, in this case the conversion success or failure for each customer, i
  • Xi is the explanatory variable
  • alpha is the intercept and betai are the regression coefficients
  • pi is the predicted conversion probability, which will be used in the remarketing list
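Written out, this is the standard Bernoulli/logistic-regression setup (reconstructed here from the variable definitions above):

```latex
Y_i \sim \mathrm{Bernoulli}(p_i), \qquad
p_i = \frac{1}{1 + \exp\!\left(-\left(\alpha + \sum_{k} \beta_k X_{ik}\right)\right)}
```

where the linear predictor, alpha plus the sum of betak times Xik, is the log-odds of conversion for customer i.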

The sample data has been provided in sample_data/train_data.csv. If you open the file to have a look, you will see that it contains comma-separated values (CSV) where the first line contains the header data for each column. (After completing this code lab you will learn how to create this type of file using your own data.)

The first two fields, a_fullVisitorId and a_hits_customDimensions_value, are unique IDs that are generated by Google Analytics Premium to identify a specific customer device (Client ID). The next fields, from a_COUNT to a_page201406, are Google Analytics Premium fields that the data analysts deemed potentially relevant to the model. And finally, b_hits_customDimensions_value provides the CRM link back to the Google Analytics Premium data, and b_CV_flag refers to the conversion flag for that customer. You can find detailed descriptions for all Google Analytics Premium fields here.

Enough with the build-up, let's run some code! Open your R interactive execution environment so that you see a > prompt, typically by running the following command from the root folder of the cloned github repository:

              $ R

If your R execution environment is different, run the commands in the way that you usually work interactively with your R installation.

First, you need to read the data in the train_data.csv file and check that the correct columns have been imported. Execute the commands below to save the sample data in the variable data1 (do not enter the >; it merely represents the interactive R prompt):

              > data1 <- read.csv("./sample_data/train_data.csv")
              > names(data1)

You should see 23 columns at this point. Some of these columns are probably not relevant to the statistical model. Specifically, you can remove a_fullVisitorId, a_hits_customDimensions_value, a_mobile_flag and b_hits_customDimensions_value, or columns number 1, 2, 12 and 22. To do so, run the following command, which removes these columns from the variable data1.

              > data1 <- data1[c(-1,-2,-12,-22)]                          

Based on this data, you can now generate correlation coefficients for each pair of variables. The correlation coefficients tell you how correlated two variables are, where a value of 1 or -1 means they are perfectly positively or negatively correlated. A value of 0 means they are completely uncorrelated. For this model, you will select 0.9 as the upper threshold. The command below outputs the correlation coefficients for each variable pair.

Let's group correlations and assign them ASCII symbols in the output so we can see which ones are most interesting.

              > symnum(abs(cor(data1)), cutpoints = c(0, 0.2, 0.4, 0.6, 0.9, 1), symbols = c(" ", ".", "_", "+", "*"))

Look for the * symbol to indicate a value higher than 0.9. Note that each variable's own coefficient is always 1, so don't remove all the stars. Below is an extract from the sample data, showing two examples of strong correlation coefficients.

              a_SUM_totals_hits  a_SUM_totals_pageviews  0.954889672
              a_diffdays_oldest  a_diffdays_latest       0.968898001

Based on this, you should exclude one of each pair, in this case a_SUM_totals_hits and a_diffdays_oldest, from the model. This is easily done in R using the following command. The data1 variable now no longer contains columns 3 and 7:

              > data1 <- data1[c(-3,-7)]
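For intuition about what the 0.9 threshold means, here is a small plain-Python sketch of the Pearson correlation coefficient that R's cor() computes (the data values below are invented for illustration):

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation: covariance divided by the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / sqrt(varx * vary)

hits      = [3, 8, 2, 10, 5, 7]
pageviews = [6, 15, 5, 21, 11, 13]   # tracks hits almost perfectly
r = pearson(hits, pageviews)
print(r > 0.9)  # True -> the pair is nearly redundant; keep only one variable
```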

If you were to run through with this data to the end, you would find that the regression coefficient for the a_midnight_flag variable would be undefined when running the model. So in the interest of time, you can remove this variable from the data by running the following command.

              > data1 <- data1[,c(-13)]                          

Run the Regression

In this step, you will run the regression. You will use the R function glm to perform the logistic regression by specifying a binomial distribution (family=binomial). You can generate the initial model and print the results to the screen with the following commands. Note the use of b_CV_flag as the objective variable:

              > model <- glm(formula = b_CV_flag ~ ., data = data1, family = binomial("logit"))
              > result <- summary(model)
              > result

This shows the estimated coefficients. Before you accept this model, however, you should check for multicollinearity, dependencies between three or more variables. To do this, you can use the vif function and see if any variables have a VIF (Variance Inflation Factor) of more than 10. If so, you should also exclude those variables from the analysis. For this function to work, you'll need to load a couple of the previously installed libraries.

              > library(car)
              > library(rms)
              > vif(model)

You should see that two variables have a value of more than 10, a_SUM_totals_pageviews (over 30) and a_SUM_hits_hitNumber (over 27). In other words, multicollinearity occurs. Therefore, you should delete one of the variables, in this case a_SUM_totals_pageviews, and rerun the logistic analysis, giving the data set and model a new name, data1_2 and model1_2 respectively.

              > data1_2 <- data1[,c(-2)]
              > model1_2 <- glm(formula = b_CV_flag ~ ., data = data1_2, family = binomial("logit"))
              > result1_2 <- summary(model1_2)
              > result1_2

Below is the result output, showing the estimated model. The estimated coefficients are listed in the column Estimate for each of the included variables.

              Call:
              glm(formula = b_CV_flag ~ ., family = binomial("logit"), data = data1_2)

              Deviance Residuals:
                  Min       1Q   Median       3Q      Max
              -2.0110  -0.5281  -0.2039  -0.0923   3.2817

              Coefficients:
                                        Estimate Std. Error z value Pr(>|z|)
              (Intercept)             -2.3298773  1.3069659  -1.783   0.0746 .
              a_COUNT                  0.0578391  0.2152773   0.269   0.7882
              a_SUM_totals_timeOnSite -0.0004421  0.0003392  -1.303   0.1924
              a_SUM_hits_hitNumber     0.0389365  0.0433719   0.898   0.3693
              a_diffdays_latest       -0.1301964  0.0110120 -11.823   <2e-16 ***
              a_desktop_flag           0.9450177  1.1271481   0.838   0.4018
              a_tablet_flag            2.9104792  1.3544122   2.149   0.0316 *
              a_OS_Windows_flag        1.0340938  0.9495262   1.089   0.2761
              a_OS_Macintosh_flag      0.8657634  1.0523633   0.823   0.4107
              a_SUM_morning_visit      0.3425241  0.2438371   1.405   0.1601
              a_SUM_daytime_visit      0.2992608  0.2228069   1.343   0.1792
              a_SUM_evening_visit      0.2578074  0.2488274   1.036   0.3002
              a_page201404             0.3279693  0.3790867   0.865   0.3870
              a_page201405             1.0019170  0.5065072   1.978   0.0479 *
              a_page201406            -1.3596393  1.2626093  -1.077   0.2815
              ---
              Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

              (Dispersion parameter for binomial family taken to be 1)

                  Null deviance: 1062.20  on 1294  degrees of freedom
              Residual deviance:  758.65  on 1280  degrees of freedom
              AIC: 788.65

              Number of Fisher Scoring iterations: 7

Run the vif function again to check for multicollinearity, and the output should now show that there are no variables with VIF values over 10.
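For intuition, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other explanatory variables. A tiny illustrative Python sketch (the R² values below are hypothetical):

```python
def vif(r_squared):
    # Variance Inflation Factor: how much a coefficient's variance is inflated
    # by the variable's linear relationship with the other predictors.
    return 1.0 / (1.0 - r_squared)

print(vif(0.50))  # 2.0   -> mild collinearity, fine to keep
print(vif(0.97))  # ~33.3 -> well over the rule-of-thumb threshold of 10
```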

Verify the Model Accuracy (Gain and ROC)

One way to evaluate the value of a predictive model is to generate a cumulative Gain chart. Simply put, it visually shows the gain in response from using the predictive model as opposed to remarketing randomly across the customer database. The larger the distance between the Gain line and the baseline, the better the predictive model is. To generate the Gain chart in R, you will need to run the following commands.

              > prob <- data.frame(predict(model1_2, data1_2, type = "response"))
              > gain <- cumsum(sort(prob[, 1], decreasing = TRUE)) / sum(prob)
              > png('gain_curve_plot.png')
              > plot(gain, main = "Gain chart", xlab = "number of users", ylab = "cumulative conversion rate")
              > dev.off()

This saves the Gain chart as a PNG file called gain_curve_plot.png in your local directory. It should look like this:

Gain chart
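To make the gain computation concrete (sort the predicted probabilities in descending order, then accumulate their share of the total), here is a small Python sketch with made-up scores:

```python
# Hypothetical predicted conversion probabilities for four users.
probs = [0.9, 0.1, 0.6, 0.4]

ranked = sorted(probs, reverse=True)   # target the highest scorers first
total = sum(ranked)
gain, running = [], 0.0
for p in ranked:
    running += p
    gain.append(running / total)

print(gain)  # rises steeply at first: most of the expected conversions
             # come from contacting the top-scored users
```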

The ROC (Receiver Operating Characteristic) curve is a normalized version of the gain chart. Here, the AUC (the area under the ROC curve) becomes the indicator of how well the classification algorithm performs. It takes a value between zero and one, where 0.5 would be random and a higher number indicates a better model than randomness. You can generate the curve by running the R commands below.

              > library(ROCR)
              > library(ggplot2)
              > pred <- prediction(prob, data1_2$b_CV_flag)
              > perf <- performance(pred, measure = "tpr", x.measure = "fpr")
              > pdf('Rplots.pdf')
              > qplot(x = perf@x.values[[1]], y = perf@y.values[[1]], xlab = perf@x.name, ylab = perf@y.name, main = "ROC curve")
              > dev.off()

This saves the output graph in a PDF file named Rplots.pdf in your local directory, or opens a new window, depending on which R tool you use. Either way, it should look like this.

ROC plot
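AUC has a handy interpretation: it is the probability that a randomly chosen converter is scored higher than a randomly chosen non-converter. A small Python sketch of that pairwise definition (labels and scores invented):

```python
def auc(labels, scores):
    # Fraction of (positive, negative) pairs where the positive scores higher;
    # ties count as half a win.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.1]
print(auc(labels, scores))  # 5 of 6 pairs ranked correctly, about 0.833
```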

Check the model in R using an alternative function, lrm (logistic regression model):

              > Logistic_Regression_Model <- lrm(b_CV_flag ~ ., data1_2)
              > Logistic_Regression_Model

Check the AIC (Akaike Information Criterion) of the two models. A smaller number indicates a better model.

              > AIC(model) > AIC(model1_2)                          

As expected, the later model has a lower number and appears to be the better model. Use that model to create a simple coefficient list of all the variables. To do so, run the following commands to parse the information from model1_2 into something easy to copy and paste:

              > coef <- names(model1_2$coefficient)
              > value <- as.vector(model1_2$coefficient)
              > result <- data.frame(coef, value)
              > result
                                    coef         value
              1              (Intercept) -2.3298773233
              2                  a_COUNT  0.0578391477
              3  a_SUM_totals_timeOnSite -0.0004420572
              4     a_SUM_hits_hitNumber  0.0389364789
              5        a_diffdays_latest -0.1301964399
              6           a_desktop_flag  0.9450177277
              7            a_tablet_flag  2.9104792404
              8        a_OS_Windows_flag  1.0340938208
              9      a_OS_Macintosh_flag  0.8657633791
              10     a_SUM_morning_visit  0.3425240801
              11     a_SUM_daytime_visit  0.2992608455
              12     a_SUM_evening_visit  0.2578073685
              13            a_page201404  0.3279692978
              14            a_page201405  1.0019169671
              15            a_page201406 -1.3596393104

Congratulations! You've successfully completed a statistical model based on historic data and come to the end of this code lab. But there's plenty more to learn in the next section, where you will find a detailed implementation guide for how you can apply this approach to your own data. You will also learn how to take the coefficient values you generated in R and apply them to your visitor dataset in BigQuery to generate a remarketing list.

All the commands are also provided in the gap-bq-regression.r file. To execute this script, run the following command (assuming that you are running it from the base of the sample code):

              $ Rscript gap-bq-regression.r                          

If your R execution environment is different, run the gap-bq-regression.r script in the way you normally run R scripts. Running the script is the equivalent of the step-by-step interactive execution you just performed.

Implementation Guide

Hopefully you've been able to complete the code lab and are now ready to learn more about how to implement predictive analytics with Google Analytics Premium, BigQuery and R using your own data. In this implementation guide you will get a step-by-step view of what to do, as well as helpful templates to get you started.

You'll go through these main steps:

The steps

Step 1: Set Up Google Analytics Premium and your Website

Before you can start implementing predictive analytics, you need to have all of the following preliminary steps completed:

  • Sign up for Google Analytics Premium: Submit the signup form and our sales representative will walk you through the subscription process. After completing the process, you will receive a notification from the sales representative that you have successfully been signed up for Google Analytics Premium
  • Configure your website to use Google Analytics: If you're not using Analytics on your website already, you need to enable it
  • Create a new project and open the BigQuery Browser tool: Open the Google Developers Console and press [Create Project]. Open the project and click the [Big Data] - [BigQuery] menu to launch the BigQuery Browser tool, and take note of your Project ID
  • Reach out to your Google Analytics Premium Account Manager to submit a BigQuery export request with your Project ID: Your account manager will take care of your BigQuery export request and will give you a monthly credit of USD 500 towards usage of BigQuery for this project
  • That's it! You will see your Google Analytics logs imported into your project several times per day, with BigQuery tables named ga_sessions_YYYYMMDD

Creating an Entry Form to Capture the Client ID

You have at least two data sources: Google Analytics Premium logs and your own CRM or similar customer-based data. To link the two data sets, you need to have a unique key in both places to match a customer in your CRM with the web visitor that is browsing your website. The key you should use is called the Client ID and is a unique key that is generated by Google Analytics Premium based on the cookie information of the visitor. This idea is illustrated below:

Client ID

The question, then, is how do you get this Client ID inserted into your CRM database? One common approach is to set up a form on your website where visitors enter information, such as their name and email address, along with a hidden field for the Client ID, and then save that information in your internal CRM database. By saving the same unique ID in Google Analytics, you can tie the two records together. It is important to note that you should never save personally identifiable information (PII) in Google Analytics, only the Client ID.
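Conceptually, the link works like a key-based join. Here is a hypothetical in-memory illustration (field names and IDs are made up; in practice this join happens in BigQuery, as shown later in this guide):

```python
# CRM records keyed by the Client ID captured through the entry form.
crm = {
    "555.123": {"customer_level": 3},
    "555.999": {"customer_level": 1},
}

# Google Analytics rows carrying the same Client ID as a custom dimension.
ga_rows = [
    {"client_id": "555.123", "pageviews": 12},
    {"client_id": "555.777", "pageviews": 2},   # never submitted the form: no CRM match
]

linked = [
    {**row, "cv_flag": 1 if crm[row["client_id"]]["customer_level"] >= 3 else 0}
    for row in ga_rows
    if row["client_id"] in crm
]
print(linked)  # only the visitor who submitted the form is linked
```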

Add Google Analytics Client ID to the HTML Entry Form

To achieve this in practice, your simple HTML entry form might look something like the HTML snippet below on your website, where the /formComplete action performs the necessary steps to save this information in your CRM system.

              <form id="lead-class" method="POST" action="/formComplete"> 	<input id="email" type="text"/> 	<input id="proper name" type="text"/> 	. . . 	<input id="clientId" type="hidden"/> 	<input type="submit" value="Submit"/> </grade>                          

The key here is the hidden input field called clientId. With just the HTML form element, this value would be empty when the visitor submits the form. Therefore, you will need to add some JavaScript code on your web page that triggers when the visitor submits the lead-form and inserts the value:

              <script>
              document.getElementById('lead-form').addEventListener(
                'submit', function(event) {
                ga(function(tracker) {
                  clientId = tracker.get('clientId');
                  document.getElementById('clientId').value = clientId;
                  tracker.send('event', 'online form', 'submit', {
                    'dimension6': clientId
                  });
                });
              });
              </script>

The three key lines are:

              clientId = tracker.get('clientId');
              document.getElementById('clientId').value = clientId;
              tracker.send('event', 'online form', 'submit', {
                'dimension6': clientId

  • Line 1 gets the clientId value, a unique identifier assigned to this website visitor, from the Google Analytics tracker object
  • Line 2 sets the hidden form field you just created to the value of clientId
  • Line 3 saves the clientId as a custom dimension (dimension6) in Google Analytics

So when the website visitor submits the entry form, the unique clientId identifier is saved in your CRM database.

You may be wondering what the dimension6 element in the last line refers to. Google Analytics allows custom data chosen by you to be added to its log entries, and calls these custom dimensions. The number represents the index of the particular dimension, index 6 in this example. If you are already using multiple custom dimensions in your Google Analytics, you may choose another index, for example dimension7 instead.

Since you are using hit-level custom dimensions, you also need a way of injecting the clientId into every page view on your website. Otherwise, Google Analytics Premium will not store this value for each hit. A sample script is provided below, but you should adapt this to fit your site, if necessary. See the developer guide for examples of different approaches you can take. This example Google Analytics snippet sets the clientId for every page view in the custom dimension dimension6:

              <script>
                (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
                (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
                m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
                })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

                ga('create', 'UA-XXXXX-Y', 'auto');
                ga(function(tracker) {
                  var clientId = tracker.get('clientId');
                  ga('set', 'dimension6', clientId);
                });
                ga('send', 'pageview');
              </script>

Congratulations! You've successfully linked your CRM database with Google Analytics, so what's next? As it turns out, linking your different data sources opens up a number of exciting opportunities for generating more insight into your customers.

Step 2: Import your Own CRM Data into BigQuery

Before attempting this step, you need to activate the BigQuery API (if you haven't already) and download the Cloud SDK. This will allow you to use the bq command line tool to load data into BigQuery. You should also ensure that you are working with the correct Google Cloud Project (the one that you created when setting up the Google Analytics Premium export to BigQuery). To set the working project, you run the following command, which is part of the Cloud SDK:

              $ gcloud config set project <project-id>                          

Where you replace <project-id> with the unique ID of your project.

Your Google Analytics Premium logs are already imported automatically into BigQuery. But since the CRM database contains key columns like the conversion flag, you will need to import the CRM data into BigQuery as well to run queries against both data sources. You can do this easily using the bq command line tool. You can find an example mock data set in the sample_data/crm_mock_data.csv file if you simply want to try it out before uploading your own CRM data.

BigQuery also requires a database schema when you load data into a new table, so you will find the table schema associated with the mock data in the file sample_data/crm_schema.json.

Create a BigQuery Dataset

BigQuery has the concept of a Dataset that acts as a container for one or more tables. You can create a new Dataset using the bq mk command, giving the name of the new Dataset to create as an argument. In this case, the Dataset is called gap_bq_r, but you can call it something else if you like, as long as you use the same name consistently.

              $ bq mk gap_bq_r

The Google Analytics Premium logs export will produce tables in BigQuery with names like ga_sessions_20150630, where the last part is the date. To collate these tables, Google Analytics Premium creates a Dataset with a numeric name like 12345678. So the full path for a specific log file would look like this:

              <project-id>:12345678.ga_sessions_20150630                          

Load Mock CRM Data into BigQuery

Next, you can load your own CRM data into the dataset you created. The commands below use the sample data files, but you can replace them with your own database exports if you have them. The command to load a CSV or JSON file into a BigQuery table is bq load. You then give the table name, input data file and table schema as arguments. If the table does not exist, it will be created under your chosen Dataset.

              $ bq load --skip_leading_rows=1 gap_bq_r.CRM ./sample_data/crm_mock_data.csv ./sample_data/crm_schema.json                          

In the example above, the new table will be called CRM. The --skip_leading_rows=1 optional argument tells the load job to ignore the header row of the CSV, since that information will come from the schema JSON file. This job will execute synchronously, but you can automate a similar job to run asynchronously. After the load command completes, verify that the data was uploaded correctly by showing the first three rows using the head command. For example, using the sample data:

              $ bq head -n 3 gap_bq_r.CRM

Of course, you can do all this from the BigQuery web interface as well, but by learning some of the key uses of the command line it is easier to automate and script solutions. For example, if you want to upload CRM data from thousands of existing files, it will be a lot easier and less cumbersome to use the command line directly.

Step 3: Clean the Data

Before you start running regression analyses against your newly imported data sources, it's important to verify their validity and clean the data to make it ready for processing. In this step you will commonly run across a few data-quality problems, including:

  • Missing data
  • Invalid or inconsistent data types
  • Distribution bias or uniformity

In this section you will learn some techniques for addressing them, and by the end you will have prepared a complete BigQuery SQL statement that can extract the relevant data from the Google Analytics Premium logs as well as your CRM database. You can find a sample template SQL query statement based on the mock data in the file join_data_sources.sql, but it's important to note that your own final query will be different from the template, since your own data sources will differ from the mock data provided as a sample.

The next few sections will explain in more detail what the different lines in the join_data_sources.sql template mean, and will hopefully give you the insight to start crafting your own BigQuery SQL statement to join your own data sources together.

Match the Google Analytics data with your own CRM database

One critical part of the logic behind the SQL statement is to perform a BigQuery table join between the CRM database and Google Analytics Premium log tables. The part of the SQL statement that does this is:

              JOIN
                [<PROJECT-NAME>:gap_bq_r.CRM] b
              ON
                a.hits.customDimensions.value = b.hits_customDimensions_value

It links the Google Analytics (a) and CRM (b) tables by performing a JOIN on the rows where the variable hits.customDimensions.value is the same in both tables. Recall that variable? That's right, it's the one generated through the HTML entry form and JavaScript example in Step 1. It's important that you replace the <PROJECT-NAME> part with the name of your specific project, and the Dataset and table name if you're using your own data.

Processing missing values

For statistically meaningful results, you should only include variables with a sufficiently high fill rate in the analysis phase. This means that variables that have a large number of missing values will be excluded. For the variables that will be included in your model, you should ensure that any missing values are of the right data type. For continuous variables, for instance, replace the empty value with a number that does not significantly affect the distribution of the data, like the average (mean) or median value.

Here is one example of this technique in the sample query template that uses the built-in BigQuery IF() function:

              IF(SUM(INTEGER(totals.pageviews)) IS NULL, 1, SUM(INTEGER(totals.pageviews)))

This line checks whether SUM(INTEGER(totals.pageviews)) is a null field and, if true, replaces it with the value 1 instead. Otherwise it leaves the value as is.
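Replacing a missing continuous value with the mean (or median), as suggested above, is the other common option. A small Python illustration with invented numbers:

```python
from statistics import mean

pageviews = [12, None, 7, None, 41, 9]          # None marks a missing value
present = [v for v in pageviews if v is not None]
fill = mean(present)                            # 17.25 for these numbers
filled = [v if v is not None else fill for v in pageviews]
print(filled)  # [12, 17.25, 7, 17.25, 41, 9]
```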

Convert Nominal or Mutually Exclusive Variables into Dummy Variables

You may want to include nominal variables, such as whether the visitors used desktops or tablets, Windows or MacOS, in your regression model. For instance, Google Analytics Premium logs provide a column called device.operatingSystem. The values here can include Windows and Macintosh. Unfortunately, with the data in that form, you won't be able to include it as an explanatory variable in the model.

To do so, you need to create dummy variables for each category of value (e.g. Windows). As an example, you can create a new table column called OS_Windows_flag and insert the value 1 if device.operatingSystem is Windows, or the value 0 if it is anything else. You need to create a column for each alternative category as well, for example OS_Macintosh_flag. Here is one example from the sample template:

              CASE WHEN device.operatingSystem = 'Windows' THEN 1 ELSE 0 END
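The same dummy-variable idea sketched in Python (the category list here is illustrative, not exhaustive):

```python
def os_dummies(operating_system, categories=("Windows", "Macintosh")):
    # One 0/1 flag column per category; any other OS yields all zeros.
    return {f"OS_{c}_flag": int(operating_system == c) for c in categories}

print(os_dummies("Windows"))   # {'OS_Windows_flag': 1, 'OS_Macintosh_flag': 0}
print(os_dummies("Linux"))     # {'OS_Windows_flag': 0, 'OS_Macintosh_flag': 0}
```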

Convert Hit Level Logs into Per-User Format

The BigQuery SQL query also converts the data from hit-level rows and aggregates it based on the unique user. In other words, each row in the result set will represent one unique website visitor. This is done by grouping the rows on the unique user identifier.

You will also want to use BigQuery aggregate functions, like SUM(), MAX(), and MIN(), to represent the data in the most appropriate, grouped manner. For example, the following lines calculate the most recent and oldest timestamps for the visits using the DATEDIFF() function:

              DATEDIFF(TIMESTAMP('2014-09-10 00:00:00'), MAX(date)) AS diffdays_latest,
              DATEDIFF(TIMESTAMP('2014-09-10 00:00:00'), MIN(date)) AS diffdays_oldest,
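The per-user aggregation step can be sketched in Python with pandas as well. This is a toy illustration of the same GROUP BY with SUM()/MAX()/MIN() pattern; user IDs, column names, and the snapshot date are made up for the example.

```python
import pandas as pd

# Illustrative hit-level rows: one row per hit, several hits per user.
hits = pd.DataFrame({
    "user_id":   ["u1", "u1", "u2"],
    "pageviews": [2, 3, 4],
    "date":      pd.to_datetime(["2014-09-01", "2014-09-05", "2014-09-08"]),
})

snapshot = pd.Timestamp("2014-09-10")

# Aggregate to one row per user; the DATEDIFF() recency columns become
# day differences between the snapshot date and the newest/oldest hit.
per_user = hits.groupby("user_id").agg(
    sum_pageviews=("pageviews", "sum"),
    diffdays_latest=("date", lambda d: (snapshot - d.max()).days),
    diffdays_oldest=("date", lambda d: (snapshot - d.min()).days),
)
print(per_user.loc["u1"].tolist())  # [5, 5, 9]
```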

You may also want to include specific events that the user performs, and can track these using Google Analytics Premium Events. In the line below, all the hits for a specific eventLabel are summed together using the BigQuery SUM() function.

              SUM(hits.eventInfo.eventLabel='<website_path>') AS page201404

Once you have executed your BigQuery SQL statement to join and extract the appropriate data, you can proceed to create a statistical model, which is the next step.

Step 4: Create the Regression Model

Does this step look familiar? Well, it should. This is the step that you performed in the Code Lab earlier. You can take the sample R script provided for the Code Lab and update it to fit your own data and modelling needs. Once you're satisfied with the regression model and you have the coefficient for every variable, you can continue to the next step, creating a remarketing list based on your regression model.
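The Code Lab performs this fit with R's glm(). For readers more comfortable in Python, the same step can be sketched with scikit-learn on a toy dataset; the features and labels here are invented for illustration, not the tutorial's sample data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative training set: two explanatory variables per user
# (e.g. total pageviews and session count) and a 0/1 conversion label.
X = np.array([[1, 1], [2, 1], [3, 2], [8, 5], [9, 6], [10, 7]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# The fitted intercept and coefficients (beta) are what you carry
# forward into the BigQuery scoring query in the next step.
print(model.intercept_, model.coef_)
print(model.predict([[2, 1], [9, 6]]))  # expect [0, 1] on this toy data
```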

Step 5: Create the Remarketing List

Having completed the regression analysis using R, you should now have regression coefficients (beta) for each explanatory variable (Xi) in your model. You can apply these coefficients against each user (j) in the Google Analytics exported data to calculate their conversion probability. Performing this computation in R would take a very long time if the number of users is high. That is why you will use BigQuery instead, since this service can easily calculate a score for each user through built-in mathematical functions, like exp, as demonstrated in this formula:

Probability formula: Pj = 1 / (1 + exp(-(β0 + Σi βi·Xij)))
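As a sanity check before running the BigQuery job, you can evaluate the logistic score for one user in Python. The intercept and the first two coefficients below are taken from the sample scoring query; the variable names and the user's values are illustrative stand-ins for the query's COUNT and SUM_totals_timeOnSite columns.

```python
import math

# Coefficients from the sample scoring SQL (intercept first);
# the feature names are illustrative mappings, not the export schema.
intercept = -2.3298773233
betas = {"count": 0.0578391477, "time_on_site": -0.0004420572}

def conversion_score(user):
    """Logistic score scaled to 0-100, as in the BigQuery query."""
    z = intercept + sum(beta * user[name] for name, beta in betas.items())
    return 100 / (1 + math.exp(-z))

# A hypothetical user with 40 hits and 300 seconds on site.
score = conversion_score({"count": 40, "time_on_site": 300})
print(round(score, 2))
```

The score always lands between 0 and 100, so it maps directly onto the percentage threshold used later in the Audience condition.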

Run BigQuery Job

Below, you can see the top part of a BigQuery SQL statement using sample data. The coefficients are multiplied with the corresponding variables, and the result is multiplied by 100 to create a score between 0 and 100 for each user. The results are sorted in descending order of probability. You can find the complete SQL query template in the file generate_remarketing_list.sql.

              SELECT hits.customDimensions.value,
                (1 / (1 + exp(-(-2.3298773233 + 0.0578391477*(COUNT) + (-0.0004420572)*(SUM_totals_timeOnSite)
                  + 0.0389364789*(SUM_hits_hitNumber) + ... 1.0019169671*(page201405)
                  + (-1.3596393104)*(page201406))))) * 100 AS CV_probability ...
              hits_customDimensions_value    CV_probability
              6600000098.3400000057          95.18373911417612
              3100000022.2400000081          94.07515998878802
              8000000037.7400000053          93.55577449415483
              940000034.2400000085           93.30393516985602
              240000000.9400000018           92.78654733606425
              ...

The first results column contains the Google Analytics ID for each user (hits_customDimensions_value), and the second column shows the probability of that user converting, based on the regression model. In other words, the model is estimating a 95% likelihood that the top user will at least book a test drive.

By sorting this list based on the conversion probability in descending order, you will have a prioritised list of users to follow up with for remarketing outreach, for example through Google AdWords, DoubleClick, and Google Analytics Premium.

Download the Remarketing List from BigQuery

Before you can import this list into Google Analytics Premium, you will have to download the results from BigQuery. To do this, click the "Download as CSV" button and save the file with a representative name, for example remarketing_list.csv.

Step 6: Import Remarketing List into Google Analytics Premium

Before importing the remarketing list into Google Analytics, you need to have completed the following:

  • Link your AdWords account with the Google Analytics Premium account that you will import the remarketing list into
  • Enable remarketing in your Google Analytics Premium account
  • Create a custom dimension and (optionally) a custom metric to store the conversion probability

If you are including the Client ID with every hit, you may wish not to create a custom metric for the conversion probability and use just a custom dimension, since the value can get very large. If you approach it from a session or user level, then the metric can be very useful. It all comes down to the specific approach that's right for your scenario.

Create a New Google Analytics Data Set

Log in to your Google Analytics Premium account and navigate to the "Admin" section. Create a new data set by clicking the "+ NEW DATA SET" button under the "Data Import" menu item for the relevant property. This starts a step-by-step setup wizard.

New data set

  1. Select the "Custom Information" blazon and go to the next step.
  2. Select "Processing time" as the Import behavior. Do annotation that this ways the uploaded conversion probability data will simply exist candy when the user visits the website again with the same Client ID.
  3. Proper name the data set up and choose which views you want to make employ of it. You'll have to select at least one view or else the information set up volition be inactive.
  4. Define the schema for the data set up. Specifically, you should choose the "clientId" Custom Dimension (custom dimension half dozen in the sample) as the fundamental and the "Conversion Probability" Custom Metric. Click "Salve" and move to the next stride.
  5. Select "Yes" when asked about "Overwrite hit information" and define the header line in the CSV file as shown in the screenshot below. Note the link to Custom Dimension 6, Custom Dimension nine and Custom Metric i equally defined above.

CSV Header

Import Remarketing List into Google Analytics Premium

Now that you have defined the Data Set in Google Analytics Premium, it's time to import the remarketing list CSV file. Again, navigate to "Data Import" and click the "Manage Uploads" link in your newly created Data Set.

Click the "Upload file" button, choose the CSV file y'all exported from BigQuery containing the conversion probabilities and click the "Upload" push button.

Create Remarketing Audience in Google Analytics Premium

After the import processing has completed, you can create Remarketing Audiences. Note that it may take some time before the newly imported data is ready to use. Once ready, go to the "Admin" section and choose the "Remarketing" and "Audiences" menu options. Click the "+ NEW AUDIENCE" button to create a new Audience and select the AdWords account that you'd like to share the remarketing list with.

New Audience

Then, select the "Conditions" bill of fare in the "Advanced" section of the Audition Architect screen. Here you can ascertain the status for including a user in the Audience. In the screenshot beneath the condition chosen is to only include users with a Conversion Probability of more 75%. Click "Apply" and, subsequently a while, the remarketing listing volition get-go to populate based on the condition listed.

Audience builder
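The thresholding logic the Audience condition applies can be previewed locally against the downloaded CSV. This sketch uses a hypothetical file layout with made-up Client IDs and probabilities; your actual column headers come from the BigQuery export.

```python
import csv
import io

# Hypothetical CSV matching the shape of the downloaded remarketing list;
# the Client IDs and probabilities here are invented for illustration.
csv_text = """client_id,cv_probability
user_a,95.18
user_b,80.42
user_c,46.30
"""

# Keep only users above the 75% threshold used in the Audience condition.
reader = csv.DictReader(io.StringIO(csv_text))
audience = [row["client_id"] for row in reader
            if float(row["cv_probability"]) > 75]
print(audience)  # ['user_a', 'user_b']
```

This is a quick way to estimate how large the Audience will be before committing to a particular cutoff.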

As you experiment, be sure to read the usage guidelines with regard to what data you can send to Google Analytics, in particular the restriction on sending any personally identifiable information (PII), and your Google Analytics Premium agreement, including the Advertising Features policy.

Hopefully, this tutorial has shown you how to get started and gather more insight into your data based on the Google Analytics Premium + BigQuery integration. This is only a starting point, so now you can experiment with your own data and come up with other ways to gain value from online and offline data sources.


Source: https://github.com/GoogleCloudPlatform/google-analytics-premium-bigquery-statistics
