Unbaked Cohorted Data
The Problem
My company provides performance marketing services through machine learning–driven mobile advertising. In our industry, clients often evaluate the effectiveness of our strategies using cohorted data, which tracks events and return on ad spend (ROAS) relative to the day a user installs the app. This approach enables consistent performance measurement across acquisition dates. The most common cohort used is Day 7 (D7), but since it takes seven days for this data to fully “bake,” there’s often a need for earlier indicators of campaign performance. This example highlights my approach to surfacing unbaked data in Looker, enabling our commercial and operations teams to make faster, more informed decisions. Field names and customer data will be obfuscated.
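To make the cohorting concrete: a user who installs on March 1 belongs to the March 1 cohort, and that cohort’s D7 ROAS is the revenue those users generate within seven days of installing, divided by the spend attributed to March 1. Below is a minimal sketch of that calculation, using hypothetical installs, revenue, and spend tables rather than our production schemas.
-- Hypothetical tables: installs(user_id, install_date),
-- revenue(user_id, event_date, revenue), spend(install_date, spend)
WITH cohort_rev AS (
    SELECT
        i.install_date AS cohort_day
        -- revenue counts toward D7 only if it lands within 7 days of install
      , SUM(IF(r.event_date >= i.install_date
               AND r.event_date < DATE_ADD('day', 7, i.install_date), r.revenue, 0)) AS rev_d7
    FROM installs i
    LEFT JOIN revenue r ON r.user_id = i.user_id
    GROUP BY i.install_date
)
SELECT
    c.cohort_day
  , s.spend
  , c.rev_d7
  , c.rev_d7 / NULLIF(s.spend, 0) AS roas_d7
FROM cohort_rev c
JOIN spend s ON s.install_date = c.cohort_day
ORDER BY c.cohort_day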
Here’s a visual representation of the problem, with unbaked data represented in magenta. This data was previously unavailable to our teams, resulting in a seven-day lag before fully baked data became available.
How I Solved It
Unbaked cohorted data has broad utility across a variety of teams, making it a strong candidate for a self-service tool. Rather than building and maintaining a set of static Tableau dashboards, I centralized access by designing a flexible, user-friendly dataset in Looker. However, serving this data came with technical challenges: accurate cohorting required querying a large hourly table, which would have created a performance bottleneck for a Looker Explore, and implementing the cohort logic directly in Looker would have been unwieldy and computationally expensive.
To solve for this, I designed a multi-layered warehousing architecture that balanced precision and speed. Aggregation and cohort logic were pushed upstream into an ETL built from the hourly table, which was folded into a derived SQL table that aggregated the cohort metrics at a more usable grain. I then layered this into a Looker PDT to further improve query times and used the LookML view layer to expose lightweight, intuitive fields. This approach ensured fast performance, simple maintenance, and ease of use.
Step 1: Creating an ETL
The source table for this pipeline ingests an average of 144 million rows per hour. As a result, generating a full week of data for a single D7 cohort would mean aggregating roughly 24.2 billion rows (144 million rows/hour × 24 hours × 7 days), leading to multi-hour query times that are unsuitable for self-service analytics in Looker.
To make this data accessible and performant, I reduced cardinality by omitting dimensions below campaign granularity, such as creatives and source apps, and pre-aggregated the cohort-based metrics. Thinking ahead, I also truncated the event date to the day to reduce row count and prepare for an eventual union, and included a handful of fields that would later serve as join keys once the pipeline’s data was surfaced in Looker. The result is a much leaner table that I could query to create a derived table in Looker. An obfuscated version of the hourly partitioned Dagger pipeline is below.
SELECT
org_level
, ad_type
, acct_mgr_email
, ops_mgr_email
, client_segment
, goal_primary
, goal_secondary
, goal_tertiary
, goal_primary_val
, goal_secondary_val
, goal_tertiary_val
, promo_id
, promo_title
, promo_class
, promo_daily_cap_micros
, promo_tracker_name
, promo_tracker_type
, traffic_source
, ARBITRARY(promo_target_country_list) AS promo_target_country_list
, ARBITRARY(promo_label_list) AS promo_label_list
, creative_format
, opt_state_current
, client_id
, client_title
, target_app_genre
, target_app_code
, target_app_title
, supply_partner_id
, supply_partner_name
, conversion_goal_id
, conversion_goal_title
, opt_state_final
, is_engaging
, is_monetized
, is_filtered_out
, is_video_ad
, is_view_attributed
, asset_size
, model_variant
, device_platform
, primary_kpi_id
, primary_kpi_title
, tracker_provider
, show_endcard
, CAST(NULL AS VARCHAR) AS device_group
, ARBITRARY(video_length_bucket) AS video_length_bucket
, is_non_targeted
, app_title_id
, is_precision_targeted
, SUM(bid_count_est) AS bid_count_est
, SUM(no_bid_count_est) AS no_bid_count_est
, SUM(spend_micro_units) * 1000000 AS spend
, SUM(spend_micro_units) AS spend_micro_units
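-- Cohort windows: revenue and value only count toward D1/D3/D7 when the event
-- occurred within 1/3/7 days of the cohort anchor: install time for standard
-- promos, click/impression time for re-engagement promos (with COALESCE fallbacks).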
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '1' DAY, rev_micros, 0)) * 1000000 AS rev_d1
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '3' DAY, rev_micros, 0)) * 1000000 AS rev_d3
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '7' DAY, rev_micros, 0)) * 1000000 AS rev_d7
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '1' DAY, rev_micros, 0)) AS rev_micros_d1
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '3' DAY, rev_micros, 0)) AS rev_micros_d3
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '7' DAY, rev_micros, 0)) AS rev_micros_d7
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '1' DAY, net_value_micros, 0)) * 1000000 AS net_value_d1
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '3' DAY, net_value_micros, 0)) * 1000000 AS net_value_d3
, SUM(IF(from_iso8601_timestamp(event_time) - from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, event_time)
)
) < INTERVAL '7' DAY, net_value_micros, 0)) * 1000000 AS net_value_d7
, SUM(imps) AS imps
, SUM(clicks) AS clicks
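-- Truncate the cohort anchor to the day: this reduces row count and matches
-- the daily grain of the baked table ahead of the eventual union.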
, DATE_TRUNC('day', CAST(from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, imp_time, event_time)
)
) AS DATE)) AS event_day
FROM schema_name.table_name
WHERE 1 = 1
AND dt = '{{ dt }}'
AND promo_tracker_type <> 'SKAN'
AND is_filtered_out <> 'true'
GROUP BY
org_level
, ad_type
, acct_mgr_email
, ops_mgr_email
, client_segment
, goal_primary
, goal_secondary
, goal_tertiary
, goal_primary_val
, goal_secondary_val
, goal_tertiary_val
, promo_id
, promo_title
, promo_class
, promo_daily_cap_micros
, promo_tracker_name
, promo_tracker_type
, traffic_source
, creative_format
, opt_state_current
, client_id
, client_title
, target_app_genre
, target_app_code
, target_app_title
, supply_partner_id
, supply_partner_name
, conversion_goal_id
, conversion_goal_title
, opt_state_final
, is_engaging
, is_monetized
, is_filtered_out
, is_video_ad
, is_view_attributed
, asset_size
, model_variant
, device_platform
, primary_kpi_id
, primary_kpi_title
, tracker_provider
, show_endcard
, CAST(NULL AS VARCHAR)
, is_non_targeted
, app_title_id
, is_precision_targeted
, DATE_TRUNC('day', CAST(from_iso8601_timestamp(
IF(promo_group = 'reengage',
COALESCE(click_time, imp_time, event_time),
COALESCE(install_time, click_time, imp_time, event_time)
)
) AS DATE))
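With the heavy aggregation pushed upstream, querying the pipeline output is cheap. As a rough illustration (not a production query), D7 ROAS per promo and install day reduces to a simple GROUP BY over the pre-computed columns, using the table reference that appears in the Step 2 union:
SELECT
    event_day
  , promo_id
  , SUM(spend)                          AS spend
  , SUM(rev_d7)                         AS rev_d7
  , SUM(rev_d7) / NULLIF(SUM(spend), 0) AS roas_d7
FROM schema.pipeline_from_above
GROUP BY event_day, promo_id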
Step 2: Creating the Derived Table
To ensure a seamless transition between the existing baked data and the newly available unbaked data from my pipeline, I implemented a dynamic union strategy that automatically appends only new partitions. Specifically, I filtered the unbaked dataset to include only date partitions more recent than the maximum date found in the baked table, allowing the two datasets to be unioned cleanly without duplication or gaps in coverage. To give end users transparency and control, I also added a column that flags each row as either "Baked" or "Unbaked." This simple addition enables flexible filtering in Looker, so users can choose whether to evaluate performance on finalized data, explore early signals from in-flight data, or view both simultaneously.
From a performance and architecture perspective, I surfaced this unioned data in Looker as a persistent derived table (PDT), which ensured fast response times while minimizing repeated computation on large volumes of raw data. The resulting source is lightweight and composable, allowing downstream Explores and visualizations to reference it efficiently. This approach not only reduced load times but also simplified adoption for end users, giving them faster access to time-sensitive data.
WITH temp_date AS(
SELECT MAX(from_iso8601_timestamp(log_date)) AS max_log_ts
FROM data_source_cohort_v2
WHERE from_iso8601_timestamp(log_date) <= NOW()
)
SELECT *
, (SELECT max_log_ts FROM temp_date) AS max_log_ts
, 'Unbaked' AS cohort_status
FROM schema.pipeline_from_above
WHERE 1 = 1
AND from_iso8601_timestamp(log_date) <= NOW()
AND from_iso8601_timestamp(log_date) > (SELECT max_log_ts FROM temp_date)
AND event_date <= NOW()
AND event_date > (SELECT max_log_ts FROM temp_date)
UNION ALL
SELECT
org_level
, ad_type
, mgr_email
, ops_email
, client_seg
, goal_1
, goal_2
, goal_3
, goal_1_val
, goal_2_val
, goal_3_val
, promo_id
, promo_name
, promo_class
, promo_daily_cap
, promo_tracker
, promo_tracker_typ
, click_origin
, promo_target_cntry_list
, promo_action_tags
, creative_fmt
, current_opt_state
, client_id
, client_name
, target_app_cat
, target_app_id
, target_app_name
, partner_id
, partner_name
, conv_goal_id
, conv_goal_name
, final_opt_state
, interactive_flg
, maf_flg
, uncredited_flg
, video_flg
, viewthrough_flg
, asset_size
, model_var
, client_platform
, kpi_id
, kpi_name
, tracker_ref
, use_partner_endcard
, device_family
, video_length_bucket
, nonpersonalized_flg
, target_app_title_id
, extended_match_flg
, est_bid_count
, est_no_bid_count
, total_spend
, spend_micro
, rev_day1
, rev_day3
, rev_day7
, rev_micro_day1
, rev_micro_day3
, rev_micro_day7
, aov_day1
, aov_day3
, aov_day7
, aov_micro_day1
, aov_micro_day3
, aov_micro_day7
, imps
, clicks
, inst_day1
, inst_day3
, inst_day7
, inst_ctit10_day1
, inst_ctit10_day3
, inst_ctit10_day7
, inst_ctit15_day1
, inst_ctit15_day3
, inst_ctit15_day7
, inst_ctit20_day1
, inst_ctit20_day3
, inst_ctit20_day7
, inst_ctit30_day1
, inst_ctit30_day3
, inst_ctit30_day7
, inst_ctit60_day1
, inst_ctit60_day3
, inst_ctit60_day7
, inst_ctit1hr_day1
, inst_ctit1hr_day3
, inst_ctit1hr_day7
, inst_ctit24h_day1
, inst_ctit24h_day3
, inst_ctit24h_day7
, total_evt_day1
, total_evt_day3
, total_evt_day7
, target_first_day1
, target_first_day3
, target_first_day7
, target_day1
, target_day3
, target_day7
, final_first_day1
, final_first_day3
, final_first_day7
, final_day1
, final_day3
, final_day7
, client_rev_day1
, client_rev_day3
, client_rev_day7
, client_rev_micro_day1
, client_rev_micro_day3
, client_rev_micro_day7
, CAST(NULL AS DOUBLE) AS pred_client_rev_ct
, CAST(NULL AS DOUBLE) AS pred_client_rev_vt
, CAST(NULL AS DOUBLE) AS pred_client_rev_micro_ct
, inc_sq_cap_client_rev_day1
, inc_sq_cap_client_rev_day3
, inc_sq_cap_client_rev_day7
, inc_sq_client_rev_day1
, inc_sq_client_rev_day3
, inc_sq_client_rev_day7
, pred_clicks
, pred_inst_ct
, pred_inst_vt
, pred_installs
, pred_target_evt_ct
, pred_target_evt_vt
, pred_target_evt
, pred_client_rev_ct
, pred_client_rev_vt
, pred_client_rev_micro_ct
, pred_client_rev_micro_vt
, pred_client_rev_micro
, place_client_rev_micro_day1
, place_client_rev_micro_day3
, place_client_rev_micro_day7
, coal_client_rev_micro_day1
, coal_client_rev_micro_day3
, coal_client_rev_micro_day7
, avg_bid_shade_mult
, avg_second_disc
, CAST(NULL AS DATE) AS log_date
, log_date_str
, (SELECT max_log_ts FROM temp_date) AS max_log_ts
, 'Baked' AS cohort_status
FROM data_source_cohort_v2
WHERE 1 = 1
AND from_iso8601_timestamp(log_date) <= NOW()
AND is_filtered_out <> 'true'
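In Looker, this union is persisted rather than run on demand. Here is a minimal sketch of the PDT wiring, assuming the view keeps the baked source’s name and rebuilds off the daily_3pm datagroup referenced in the model file in Step 3; the placeholder SELECT stands in for the full union query above.
view: data_source_cohort_v2 {
  # Persisted derived table (PDT): the union is materialized once per datagroup
  # cycle instead of being recomputed for every Explore query.
  derived_table: {
    datagroup_trigger: daily_3pm
    # The full Step 2 union query replaces this placeholder.
    sql: SELECT * FROM unbaked_and_baked_union ;;
  }
}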
Step 3: LookML
Within Looker, I defined dimensions and measures for all fields in the derived table. Because the heavy lifting (transformations, aggregations, and filtering) was handled upstream in the pipeline, the LookML definitions were intentionally lightweight. Most fields mapped directly to the pre-aggregated table or leveraged simple sum logic for measure exposure. I structured the explore within a new model and incorporated joins to several related views to enhance usability (view names have been obfuscated). One view worth highlighting is the advanced date field layer, which I developed to provide more flexible time-based slicing across cohort data. It’s the focus of the next example.
include: "/models/generic_datagroups.lkml"
include: "/views/unprocessed_cohorts/unprocessed_cohorts.view.lkml"
include: "/views/insights/*.view.lkml"
include: "/views/daily_clients/*.view.lkml"
include: "/views/advanced_date_fields/*.view.lkml"
connection: "generic_trino"
persist_with: daily_3pm
explore: cohort_explore {
  from: data_source_cohort_v2
  label: "Unbaked Cohorted Data"
  always_filter: {
    filters: [cohort_explore.log_date: "7 days ago for 7 days"]
  }
  join: processed_cohorts {
    from: data_source_cohort_v2
    type: left_outer
    sql_on: ${cohort_explore.log_date} = ${processed_cohorts.log_date} ;;
    relationship: many_to_one
  }
  join: unprocessed_cohorts {
    type: left_outer
    sql_on: ${cohort_explore.log_date} = ${unprocessed_cohorts.log_date} ;;
    relationship: many_to_one
  }
}
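Continuing the PDT sketch from Step 2 on the field side: because the columns arrive pre-aggregated, most definitions are one-liners. The column names below come from the pipeline output and the cohort_status flag added in Step 2; the measures are illustrative rather than the production view.
view: data_source_cohort_v2 {
  # "Baked" / "Unbaked" flag added in the Step 2 union, exposed for filtering.
  dimension: cohort_status {
    type: string
    sql: ${TABLE}.cohort_status ;;
  }

  dimension_group: log {
    type: time
    timeframes: [date, week, month]
    sql: ${TABLE}.log_date ;;
  }

  # Pre-aggregated columns only need simple sums at the Explore layer.
  measure: total_spend {
    type: sum
    sql: ${TABLE}.spend ;;
  }

  measure: total_rev_d7 {
    type: sum
    sql: ${TABLE}.rev_d7 ;;
  }

  measure: roas_d7 {
    type: number
    sql: ${total_rev_d7} / NULLIF(${total_spend}, 0) ;;
    value_format_name: decimal_2
  }
}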