Skip to main content

Level 1: SQL Exploration

Objective: Find and understand your dataset.

Story beat Before you analyze anything, you must locate who in the dataset lives in RJ and what orders belong to them. You’re drawing a map of the crisis zone.


Problem 1.1: Know Your Audience (Simple Filtering)

Question

Business Value: Before we analyze behavior, we need to know the size of our cohort in Rio de Janeiro.

  • Goal: List all customer profiles located in the state of Rio de Janeiro ('RJ').

  • Success Criteria:

    • "How many unique customers do we have in Rio de Janeiro?" (Hint: It should be around 12,000+ rows).
    • "Besides the city 'rio de janeiro', what other cities appear in this list?"
💡 Hint
  • Which table contains customer location information?
  • What column would tell you the state where a customer lives?
  • Question 1: List all customer profiles located in the state of Rio de Janeiro ('RJ')
SELECT *
FROM customers
WHERE customer_state = 'RJ';

  • Question 2: How many unique customers do we have in Rio de Janeiro?
SELECT COUNT(DISTINCT customer_id) AS rj_customers
FROM customers
WHERE customer_state = 'RJ';

  • Question 3: Besides the city 'rio de janeiro', what other cities appear in this list?
SELECT
customer_city,
COUNT(DISTINCT customer_id) AS n_customers
FROM customers
WHERE customer_state = 'RJ'
GROUP BY customer_city
ORDER BY n_customers DESC, customer_city;


Problem 1.2: The Transaction History (Basic Join)

Question

Business Value: A customer profile is useless without their purchase history. We need to attach orders to these people.

  • Goal: Join the Orders table to the Customers table to find every order placed by an RJ customer.

  • Success Criteria:

    • "Does the row count match your previous query, or is it different? Why?"
    • "Can you see the order_status column in the results?"
💡 Hint
  • You need to combine customer and order information
  • What field connects customers to their orders?
  • Build upon your previous query from Problem 1

-Question 1: Join the Orders table to the Customers table to find every order placed by an RJ customer:

SELECT
c.customer_id,
c.customer_city,
c.customer_state,
o.order_id,
o.order_status,
o.order_purchase_timestamp
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
WHERE c.customer_state = 'RJ';

  • Question 2:Does the row count match your previous query, or is it different? Why?

    Số lượng row khác nhau giữa problem 1.2 và problem 1.1( ở 1.2 lớn hơn) do ở 1.1 lấy kết quả theo unique customers, tức mỗi hàng tương ứng với 1 KH; còn problem 1.2 lấy tất cả các đơn hàng của KH sống ở bang RJ, 1 dòng tương ứng với 1 đơn hàng do đó 1 KH có thể có nhiều đơn hàng => số rows nhiều hơn problem 1.1

  • Question 3:"Can you see the order_status column in the results?"

    Cột order_status xuất hiện vì sau khi join bảng Orders với bảng Customers, tất cả thông tin đơn hàng liên quan đến mỗi khách hàng ở RJ đều được đưa vào kết quả.


Problem 1.3: The Timeline (Date Handling)

Question

Business Value: The VP needs to know if this is a recent problem or a historical one. We need to establish the date range of our data.

  • Goal: Find the date range of all orders placed in RJ.

  • Success Criteria:

    • "What is the date of the very first order in RJ?"
    • "When was the last order placed?"
    • "Does this cover the Black Friday period?"
💡 Hint
  • Think about aggregate functions for finding earliest and latest values
  • Which column contains the purchase date/time?
  • Consider formatting the output for better readability
  • Question 1: “What is the date of the very first order in RJ?” Hướng giải quyết
  • JOIN bảng orders với customers dựa trên customer_id
  • Chỉ giữ khách có customer_state = 'RJ'
  • Lấy giá trị nhỏ nhất (MIN) của order_purchase_timestamp
SELECT
DATE(MIN(o.order_purchase_timestamp)) AS first_order_date
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
WHERE c.customer_state = 'RJ';

  • Question 2: “When was the last order placed?” Hướng giải quyết
  • JOIN như câu trên
  • Lấy giá trị lớn nhất (MAX) của order_purchase_timestamp
SELECT
DATE(MAX(o.order_purchase_timestamp)) AS last_order_date
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
WHERE c.customer_state = 'RJ';

  • Question 3: "Does this cover the Black Friday period?"

    Có, khoảng thời gian này bao gồm Black Friday của năm 2016 và 2017

  • **


Problem 1.4: The Funnel Audit (Aggregation & Nulls)

Question

Business Value: Not all orders make it to the customer. We need to see how many orders were actually delivered vs. cancelled or unavailable.

  • Goal: Count the number of orders per order_status for RJ customers.

  • Success Criteria:

    • "How many orders in RJ were 'canceled'?"
    • "How many were 'delivered'?"
    • Critical Check: "Are there any statuses where the count is surprisingly high?"
💡 Hint
  • You need to count orders grouped by their status
  • What SQL clause helps you organize data into categories?
  • How can you sort the results to see the most common statuses first?
  • Question 1: “How many orders in RJ were canceled?” Hướng giải quyết:
  • JOIN customers (c) với orders (o) qua customer_id
  • Lọc khách hàng ở bang RJ: customer_state = 'RJ'
  • Lọc thêm trạng thái đơn hàng: order_status = 'canceled'
SELECT
COUNT(*) AS canceled_orders_in_RJ
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
WHERE c.customer_state = 'RJ'
AND o.order_status = 'canceled';

  • Question 2: “How many were delivered?” Hướng giải quyết:
  • JOIN customers (c) với orders (o) qua customer_id
  • Thay điều kiện trạng thái thành order_status = 'delivered'
SELECT
COUNT(*) AS delivered_orders_in_RJ
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
WHERE c.customer_state = 'RJ'
AND o.order_status = 'delivered';

  • Question 3: “Are there any statuses where the count is surprisingly high?” Hướng giải quyết:
  • JOIN bảng customers(c) với orders(o) bằng customer_id
  • Lọc khách sống ở RJ: customer_state = 'RJ'
  • Group theo order_status.
  • Sắp xếp giảm dần để xem trạng thái nào bất thường (quá cao hoặc quá thấp).
SELECT
o.order_status,
COUNT(*) AS order_count
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
WHERE c.customer_state = 'RJ'
GROUP BY o.order_status
ORDER BY order_count DESC;


Problem 1.5: The "Pulse Check" (3-Table Join)

Question

Business Value: Now we need the "Voice of the Customer." We need to see the average star rating for these specific orders.

  • Goal: Calculate the average review score for all delivered orders in RJ.

  • Success Criteria:

    • "What is the average score for RJ? (Is it below 4.0?)"
    • "How does this compare if you remove the 'RJ' filter and look at the whole country?"
💡 Hint
  • You need to bring in review data from another table
  • What aggregate function calculates averages?
  • Remember to filter for both RJ customers and delivered orders only
  • Question 1: "What is the average score for RJ? (Is it below 4.0?)" Hướng giải quyết:
  • Cần JOIN 3 bảng: 1.customers → để lọc khách ở RJ 2.orders → để chỉ lấy đơn có order_status = 'delivered' 3.order_reviews → để lấy review_score
  • JOIN qua khóa:
  • customers.customer_id = orders.customer_id
  • orders.order_id = order_reviews.order_id
  • Lấy trung bình điểm review: AVG(review_score)
SELECT
AVG(r.review_score) AS avg_review_score_RJ
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
JOIN order_reviews r
ON o.order_id = r.order_id
WHERE c.customer_state = 'RJ'
AND o.order_status = 'delivered';

  • Question 2: “How does this compare if you remove the 'RJ' filter and look at the whole country?”
SELECT
AVG(r.review_score) AS avg_review_score_brazil
FROM orders o
JOIN order_reviews r
ON o.order_id = r.order_id
WHERE o.order_status = 'delivered';