Technical Report

YouTube Trending Video Analysis - Methodology and Findings

Executive Summary

This analysis investigates what drives YouTube video performance and virality using a combination of historical trending data and live API data. We merged the Kaggle YouTube Trending Videos dataset (covering trending videos from November 2017 to June 2018) with current video statistics pulled from the YouTube Data API to create a longitudinal dataset of thousands ofunique videos.

Key Findings:

  • Early likes are the strongest predictor of both current view count (71% feature importance) and long term view growth (62% feature importance)
  • Music dominates in total views but not in engagement rate — People & Blogs and Comedy have higher like-to-view ratios
  • Most videos trend within 1-2 days of publishing; videos that don’t trend within a week almost never do
  • Random forest models predicting current views and view growth achieved R² scores of 0.679 and 0.669 respectively, while predicting time to trend proved harder (R² = 0.274), reflecting the inherent unpredictability of viral timing

Project Context

Motivation: YouTube is the world’s largest video platform, yet the factors that drive a video to trend remain poorly understood. This project investigates whether early engagement signals can predict long term video success, and what content characteristics are associated with trending speed and view growth over time.

Stakeholders:

  • Content creators seeking to optimize upload strategy
  • Brands and marketers evaluating YouTube as an advertising channel
  • Data science students interested in social media analytics

Success Criteria:

  • Successfully merge Kaggle historical data with live YouTube API data
  • Build reproducible cleaning and analysis pipelines as Python packages
  • Train predictive models with interpretable feature importances
  • Produce visualizations suitable for both technical and non-technical audiences

Data Sources

  • Primary dataset: Kaggle YouTube Trending Videos Dataset — 40,949 rows of US trending video data from November 2017 to June 2018, including video metadata, tags, view/like/comment counts at time of trending
  • Supplementary data: YouTube Data API v3 — used to fetch current view, like, and comment counts for each unique video ID in the Kaggle dataset
  • Data access notes: Kaggle dataset is publicly available under CC0 license. YouTube API requires a Google Cloud API key with a daily quota of 10,000 units. Data was collected in April 2026.

Methodology

1. Data Acquisition

The Kaggle dataset was downloaded programmatically using the kagglehub Python library. The YouTube Data API was queried using the google-api-python-client library, fetching video statistics in batches of 50 (the API maximum per request) across 128 total API calls to cover all 6,351 unique video IDs. Both datasets were loaded into pandas DataFrames and merged on video_id.

2. Cleaning Pipeline

The cleaning pipeline is implemented in cleaning.py and callable via run_cleaning_pipeline(). Key transformations include:

  • Merged Kaggle and API DataFrames on video_id using an inner join, yielding 37,095 rows
  • Renamed duplicate columns (title_x/y, views_x/y, likes_x/y) to descriptive names (views_2017, views_current, etc.)
  • Converted trending_date, publish_time, and published to datetime objects
  • 663 videos from the Kaggle dataset returned no API results, indicating they were deleted or made private since 2017

3. Analysis Workflow

The full analysis is implemented in analysis.py and callable via run_analysis_pipeline(df). The workflow consists of two stages:

Exploratory Data Analysis:

  • Growth analysis: computed view, like, and comment growth for each unique video
  • Trending patterns: analyzed trending frequency by day of week and month
  • Category analysis: compared average views and engagement rates across 15 content categories
  • Time to trend: calculated days between publish date and first trending appearance

Predictive Modeling:

Three Random Forest regression models were trained using scikit-learn:

  • Model 1 — Predict current views from 2017 engagement metrics (likes, dislikes, comments, category). R² = 0.679
  • Model 2 — Predict time to trend from engagement and publish timing features. R² = 0.274
  • Model 3 — Predict view growth (2017 to current) from initial engagement metrics. R² = 0.669

All continuous targets were log-transformed using np.log1p() to handle skewed distributions. An 80/20 train/test split was used with random_state=42 for reproducibility.

4. Tooling

  • pandas, numpy — data manipulation
  • google-api-python-client — YouTube Data API v3
  • kagglehub — Kaggle dataset download
  • scikit-learn — Random Forest models, train/test split, evaluation metrics
  • matplotlib, seaborn — visualizations
  • python-dotenv — API key management
  • Environment managed with uv and .venv
  • Full pipeline reproducible via run_cleaning_pipeline() and run_analysis_pipeline(df)

Results & Diagnostics

EDA Findings

View growth is heavily right-skewed — a small number of music videos (Ed Sheeran’s “Perfect”, Maroon 5’s “Girls Like You”) grew by over 4 billion views since 2017, while the median video grew by about 1.6 million views. Trending frequency was uniform across days of the week, suggesting YouTube’s algorithm does not favor any particular day. The dataset covers November 2017 through June 2018, so monthly patterns reflect dataset coverage rather than true seasonality.

Music dominated average current views at over 1 billion per video — roughly 10x the next category (Film & Animation). However, People & Blogs had the highest like rate at 2.8%, suggesting stronger audience loyalty despite lower absolute view counts. Most videos trended within 1-2 days of publishing, with a sharp dropoff after day 5.

Distribution of View Growth

Average Views by Category

Like Rate by Category

Trending Videos by Day of Week

Trending Videos by Month

Days from Publishing to Trending

Model Results

Model Target Top Feature
Model 1 Current views 0.679 likes_2017 (71%)
Model 2 Time to trend 0.274 comments_2017 (29%)
Model 3 View growth 0.669 likes_2017 (62%)

Models 1 and 3 performed similarly well, both explaining roughly 67-68% of variance. Model 2 performed worse, reflecting that viral timing contains significant randomness that early metrics cannot fully capture.

Model 1: Predicted vs Actual Current Views Model 1: Feature Importance

Model 2: Predicted vs Actual Time to Trend Model 2: Feature Importance

Model 3: Predicted vs Actual View Growth Model 3: Feature Importance

Discussion & Next Steps

Interpretation: The consistent dominance of early likes across both view prediction models suggests that YouTube’s recommendation algorithm heavily weights initial engagement signals. This is actionable for creators — maximizing early engagement in the first 24-48 hours after upload appears to be the most important factor for long term success. The lower R² for time to trend (0.274) suggests that while engagement helps, there is significant luck involved in when exactly a video gets picked up by the trending algorithm.

Limitations:

  • The Kaggle dataset only covers US trending videos from a 7-month window in 2017-2018, which may not reflect current YouTube algorithm behavior
  • Videos deleted since 2017 are excluded, potentially introducing survivorship bias toward more successful content
  • The dataset does not include thumbnail quality, title length, or description keywords, which may be important predictors
  • Random Forest models are not easily interpretable beyond feature importance

Future Directions:

  • Incorporate NLP features from titles, tags, and descriptions
  • Analyze channel-level effects — do established channels trend faster than new ones?
  • Build a time series model to track how individual video trajectories evolve over weeks and months