Frontend code can be found here.
- Initial Channel Scraping (Every ~6 hours via crontab)
```mermaid
graph TD
    A[Crontab] --> B[index.js]
    B --> C[Puppeteer Browser]
    C --> D[YouTube Channel Pages]
    D --> E[Video Data]
    E --> F[temp/*_videos.json]
```
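The JSON files dropped into `temp/` are the contract between the Node scraper and the Python pipeline. As a rough sketch of that contract (the field names here are placeholders; the real schema is whatever `index.js` emits):

```python
import json
from pathlib import Path

# Hypothetical shape of one record in temp/<channel_id>_videos.json.
# The real schema is defined by index.js; these field names are
# placeholders chosen to match what the downstream pipeline needs.
example = [
    {
        "video_id": "dQw4w9WgXcQ",
        "title": "Example: election night coverage",
        "channel_id": "UC_example",
        "published": "2024-11-05T00:00:00Z",
    }
]

Path("temp").mkdir(exist_ok=True)
out = Path("temp") / "UC_example_videos.json"
out.write_text(json.dumps(example, indent=2))

# Downstream, process_stream.py only has to parse the same structure back.
videos = json.loads(out.read_text())
print(videos[0]["title"])
```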
- Video Processing Stream (Continuous)
```mermaid
graph TD
    A[process_stream.py] --> B[Watchdog Observer]
    B --> C[VideoDataHandler]
    C --> D[Detect new *_videos.json]
    D --> E[Filter target_keyword-related videos]
    E --> F[Update master_videos.csv]
    E --> G[video_data MongoDB]
    E --> H[Trigger comment scrape]
```
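A minimal sketch of this watcher using the `watchdog` library. The keyword value, CSV columns, and MongoDB database name (`youtube`) are assumptions, not the actual `process_stream.py` internals:

```python
import csv
import json
import time

from pymongo import MongoClient
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

TARGET_KEYWORD = "election"  # assumed value of target_keyword
video_data = MongoClient()["youtube"]["video_data"]  # database name is an assumption


class VideoDataHandler(FileSystemEventHandler):
    """React to new *_videos.json files dropped into temp/ by index.js."""

    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith("_videos.json"):
            return
        # Real code should wait until index.js has finished writing the file.
        with open(event.src_path) as f:
            videos = json.load(f)
        # Keep only videos whose titles mention the target keyword.
        matches = [v for v in videos if TARGET_KEYWORD in v.get("title", "").lower()]
        if not matches:
            return
        with open("master_videos.csv", "a", newline="") as f:
            writer = csv.writer(f)
            for v in matches:
                writer.writerow([v["video_id"], v["channel_id"], v["title"]])
        video_data.insert_many(matches)
        # The comment scrape for each match would be triggered here
        # (see the next section).


observer = Observer()
observer.schedule(VideoDataHandler(), path="temp", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
```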
- Comment Scraping Flow
```mermaid
graph TD
    A[process_stream.py] --> B[scrape_comments function]
    B --> C[comment_scrape.js]
    C --> D[Puppeteer with MITM Proxy]
    D --> E[xhr_scrape_ds.py intercepts]
    E --> F[Process XHR responses]
    F --> G[Save to CSV]
    G --> H[Move to data/channel_id/video_id/]
```
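The hand-off from Python to the Node scraper could look like the sketch below; the CLI argument passed to `comment_scrape.js` and the intermediate CSV filename are assumptions:

```python
import shutil
import subprocess
from pathlib import Path


def scrape_comments(channel_id: str, video_id: str) -> None:
    """Run the Puppeteer comment scraper for one video, then file the output."""
    # Launch the Node scraper; behind the MITM proxy, xhr_scrape_ds.py
    # intercepts the XHR comment responses and writes them to a CSV.
    subprocess.run(["node", "comment_scrape.js", video_id], check=True)

    # Move the resulting CSV into the per-channel, per-video layout.
    dest = Path("data") / channel_id / video_id
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(f"{video_id}_comments.csv", str(dest / "comments.csv"))
```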
- Analysis Pipeline
```mermaid
graph TD
    A[analysis.py] --> B[Load all data]
    B --> C[Join with channel-to-state mapping]
    C --> D[CommentAnalyzer]
    D --> E[Keyword analysis]
    D --> F[Sentiment analysis]
    D --> G[Engagement metrics]
    E & F & G --> H[MongoDB collections]
```
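A condensed sketch of this stage, assuming pandas for the joins, TextBlob as a stand-in sentiment library, and hypothetical column names (`channel_id`, `state`, `text`); the real `CommentAnalyzer` likely differs:

```python
import pandas as pd
from pymongo import MongoClient
from textblob import TextBlob  # stand-in sentiment library

db = MongoClient()["youtube"]  # database name is an assumption


def analyze(comments: pd.DataFrame, state_map: pd.DataFrame) -> None:
    # Geographic attribution: every comment inherits its channel's state
    # from the channel-to-state mapping.
    merged = comments.merge(state_map, on="channel_id", how="left")

    # Per-comment sentiment polarity in [-1, 1].
    merged["sentiment"] = merged["text"].map(lambda t: TextBlob(t).sentiment.polarity)

    # Engagement metrics aggregated per state, stored for the frontend.
    state_metrics = (
        merged.groupby("state")
        .agg(comment_count=("text", "count"), avg_sentiment=("sentiment", "mean"))
        .reset_index()
    )
    db["state_analysis"].insert_many(state_metrics.to_dict("records"))
```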
- Data Collection:
  - `index.js` scrapes channel video listings periodically
  - `process_stream.py` watches for new video data and manages the pipeline
  - `comment_scrape.js` + `xhr_scrape_ds.py` handle comment collection
- Data Processing:
  - Videos are filtered for election-related content
  - Comments are processed and organized by channel/video
  - Geographic attribution is maintained throughout
- Analysis:
  - `analysis.py` aggregates all data
  - `comment_analysis.py` provides specialized content analysis
  - Results are stored in MongoDB for the frontend to access
- MongoDB Collections (queried as sketched below):
  - `video_data`: Raw video information
  - `comments_with_video`: Processed comments with video context
  - `state_analysis`: State-level aggregated metrics
  - `video_analysis`: Video-level analysis results
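For example, a frontend or API layer could read these collections as follows; the collection names come from the list above, while the database name and query fields are illustrative:

```python
from pymongo import MongoClient

db = MongoClient()["youtube"]  # database name is an assumption

# Aggregated metrics for a single state.
ohio = db["state_analysis"].find_one({"state": "Ohio"})

# Video-level results for one channel, newest first; the "published"
# and "avg_sentiment" fields are illustrative.
videos = db["video_analysis"].find({"channel_id": "UC_example"}).sort("published", -1)
for v in videos:
    print(v["video_id"], v.get("avg_sentiment"))
```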
