Caskalytics
Overview
Data related to Cask is scattered across different sources. The goal of this application is to collect and aggregate that data, provide unified access to it, and generate useful statistics.
Sources
- Salesforce
- Web Beacons
- Social media analytics (YouTube, LinkedIn, Twitter)
- Meltwater
- Pro Ranking (SEO)
- GitHub webhooks
- AWS S3 access logs
Motivation
To generate and display aggregates and trends in one central location, with a front end that serves the marketing team.
Requirements
The system should automatically fetch the latest data from the respective APIs and keep the historical data
The system should notify the stakeholders in case of failures
The system should be extensible so that more sources can be added
Retrieval should be optimized and should not incur additional cost, meaning that data already retrieved should not be pulled again.
Data should be processed without any data loss.
The statistics should be aggregated at different time intervals:
Hourly
Daily
Weekly
Monthly
Every 3 months
Every 1 year
The system should be able to process the backlog and catch up after major outages.
The system should be able to visualize metrics as dashboard widgets: Line, Bar, Pie, Scatter, etc.
The system should allow notifications to be configured based on constraints specified for metrics:
External API call failure
High or low mark reached
Weekly or daily digest
The system should be highly available, with reports accessible 24x7x365
The system should render charts as well as provide raw data to feed into external applications like Tableau.
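The aggregation intervals listed in the requirements could be implemented by truncating each timestamp to the start of its bucket and summing per bucket. The following is a minimal sketch, not the project's implementation; the interval names (`hourly`, `quarterly`, and so on), the Monday week start, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_start(ts: datetime, interval: str) -> datetime:
    """Truncate ts to the start of the aggregation bucket it falls in."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if interval == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    if interval == "daily":
        return day
    if interval == "weekly":
        # Assumption: weeks start on Monday (weekday() == 0).
        return day - timedelta(days=ts.weekday())
    if interval == "monthly":
        return day.replace(day=1)
    if interval == "quarterly":  # "Every 3 months"
        first_month = 3 * ((ts.month - 1) // 3) + 1
        return day.replace(month=first_month, day=1)
    if interval == "yearly":  # "Every 1 year"
        return day.replace(month=1, day=1)
    raise ValueError(f"unknown interval: {interval}")


def aggregate(events, interval):
    """Sum (timestamp, value) pairs per bucket for the given interval."""
    totals = defaultdict(int)
    for ts, value in events:
        totals[bucket_start(ts, interval)] += value
    return dict(totals)
```

Because every interval reduces to a truncation function, the same raw events can be re-aggregated at any granularity without fetching them again.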
Assumptions
All the sources have developer APIs that support retrieval of data
Information generated does not need different access for different roles
Infrastructure
- 2-node backend cluster for availability and replication (keeping the replication factor low to save costs)
- S3 bucket for regular data backups
- Lean single-node cluster for the frontend (could be deployed on one of the backend nodes as well)
Design
Partitioning of TimePartitionedFileset
Each data source will be in its own TPFS instance
Source: "SourceNameTPFS"
Format: Parquet Record with fields - ts, attributes
Cube Name: “SourceNameCube”
- Example
Github: “GithubTPFS”
Format: Parquet Record with fields - ts, repo, stars, forks, watchers, pulls
Cube Name: “GithubCube”
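To make the GitHub example concrete, the record and its time partition could be modeled as below. This is only a sketch: the field types, the use of epoch seconds for `ts`, the UTC time zone, and the partition fields (year, month, day, hour, minute) are assumptions, and `GithubRecord` and `partition_key` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GithubRecord:
    """One row of the GitHub data set, mirroring the fields listed above."""
    ts: int        # assumed: seconds since the Unix epoch, UTC
    repo: str
    stars: int
    forks: int
    watchers: int
    pulls: int


def partition_key(record: GithubRecord) -> dict:
    """Derive the (assumed) time partition a record would be written to."""
    t = datetime.fromtimestamp(record.ts, tz=timezone.utc)
    return {
        "year": t.year,
        "month": t.month,
        "day": t.day,
        "hour": t.hour,
        "minute": t.minute,
    }
```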
API
External APIs to be used
API | API Provider | Metrics gathered |
---|---|---|
Force | Salesforce.com | Raw Leads, MQLs, Sales Opportunities |
YouTube Reporting API | YouTube.com | Views, Subscribers |
LinkedIn API | LinkedIn.com | Followers |
Twitter4j | Open Source | Followers |
AWS API | Amazon.com | S3 product download logs |
GitHub Webhooks | GitHub | GitHub statistics |
Pro Ranking API | Pro Ranking | Website ranking |
API Calls
- Use a Workflow Custom Action to run periodic RESTful calls to the APIs
- A Spark job can read the data from the filesystem and update the cube
- To allow each call to be scheduled independently, each call will have its own workflow
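One way a periodic call could avoid pulling the same data twice, and still catch up after an outage, is to keep a per-source watermark that only advances when a fetch succeeds. The sketch below illustrates that idea under stated assumptions: `run_fetch`, the `state` dictionary, and the `fetch(since, until)` callable are hypothetical and do not correspond to any CDAP or vendor API.

```python
def run_fetch(fetch, state, now):
    """Fetch only records newer than the stored watermark.

    fetch: hypothetical callable wrapping one external API; it is expected
           to return records with since < timestamp <= until.
    state: mutable dict holding the watermark for this source.
    now:   timestamp of the current scheduled run.
    """
    since = state.get("last_fetched", 0)
    # If fetch raises, the watermark is left untouched, so the next
    # scheduled run retries the same window and catches up.
    records = fetch(since, now)
    state["last_fetched"] = now
    return records
```

Since each call has its own workflow, each source would keep its own `state`, and a failed run can raise the API failure notification without affecting the other sources.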
REST Endpoints
Method | End Point | Description | Response |
---|---|---|---|
GET | /pipeline/{time period} | Returns the data related to marketing and sales leads; time period ∈ {week, month} | { start: 06-06-2016, end: 06-07-2016, rawleads: 180, mql: 60, inquiries: 100, opp: 20 } |
GET | /awareness/webtraffic/{time period} | Returns the traffic-related information for the website and blog | { start: 06-06-2016, end: 06-07-2016, sessions: 200, newVisitors: 68, returningVisitors: 80, blogViewers: 100 } |
GET | /awareness/socialmedia/subscribers | Returns the subscribers on various social media sites | { youtube: { views: 23, subscribers: 2900 }, linkedin: 68, twitter: 80 } |
GET | /awareness/seo | Returns the share-of-voice numbers | { cask: 25, informatica: 25, talend: 25, snaplogic: 25 } |
GET | /adoption/downloads | Returns the number of downloads for CDAP | { downloads: [ { version: 3.5, dl: 2000 } ] } |
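The endpoint table above can be read as a small routing rule: two endpoints take a time period restricted to week or month, and three take no parameter. A minimal sketch of that validation follows; the `resolve` function and its return shape are hypothetical, and it accepts paths with or without a leading slash.

```python
VALID_PERIODS = {"week", "month"}
# Endpoints that take a {time period} path parameter.
PERIOD_ROUTES = {"/pipeline", "/awareness/webtraffic"}
# Endpoints without parameters.
FIXED_ROUTES = {
    "/awareness/socialmedia/subscribers",
    "/awareness/seo",
    "/adoption/downloads",
}


def resolve(path):
    """Map a GET path to (route, period); period is None for fixed routes."""
    path = "/" + path.strip("/")
    if path in FIXED_ROUTES:
        return path, None
    prefix, _, period = path.rpartition("/")
    if prefix in PERIOD_ROUTES and period in VALID_PERIODS:
        return prefix, period
    raise ValueError(f"unknown endpoint or time period: {path}")
```

For instance, `resolve("/pipeline/week")` yields the route `/pipeline` with period `week`, while an unsupported period such as `year` is rejected.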
UI
- The UI could be deployed on a thin Coopr node
- The probable UI stack is jQuery embedded in a Bootstrap dashboard
- Chart.js and C3.js would be used to render charts
Trends
UI should allow refining all metrics to different time granularities (Hourly, Daily, Weekly, Monthly, Every three months, Every year)
Visualize metrics in the form of a dashboard (widgets: Line, Bar, Pie, Scatter, ...)
Dashboard and backend should support overlaying week-over-week, month-over-month or year-over-year for any metric
Backend should allow for raw querying of data through SQL commands
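The week-over-week, month-over-month, and year-over-year overlays all amount to pairing each bucket with the bucket a fixed number of positions earlier. A minimal sketch, assuming the metric is already aggregated into an evenly spaced list of values; `overlay` and its tuple layout are hypothetical.

```python
def overlay(series, lag):
    """Pair each value with the value `lag` buckets earlier.

    Returns a list of (current, previous, percent_change) tuples;
    percent_change is None when the previous value is zero.
    """
    result = []
    for i in range(lag, len(series)):
        previous, current = series[i - lag], series[i]
        if previous == 0:
            change = None
        else:
            change = (current - previous) * 100.0 / previous
        result.append((current, previous, change))
    return result
```

With a daily series, `lag=7` gives a week-over-week overlay; with a weekly series, `lag=1` does the same.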
Reports
Email and text notifications can be sent using SendGrid or the Amazon SNS service
- Users can subscribe or unsubscribe through the front end, backed by APIs
Generate daily and weekly digest reports and email them to stakeholders
Export data to PDF/Excel, available for download in the UI
Alerts
Allow users to specify threshold values for metrics that trigger email alerts
High-mark and low-mark alerts sent to users via email and SMS (tentative)
API call failure alerts sent to admins and developers
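The high-mark and low-mark alerts could be evaluated with a simple per-metric threshold check before a message is handed to the notification service. The following is a sketch under assumptions: `Threshold`, `check`, the message wording, and the choice to treat the marks as inclusive bounds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """User-specified marks for one metric; either mark may be omitted."""
    metric: str
    low: Optional[float] = None
    high: Optional[float] = None


def check(threshold: Threshold, value: float) -> Optional[str]:
    """Return an alert message if a mark is reached, otherwise None."""
    if threshold.high is not None and value >= threshold.high:
        return f"{threshold.metric}: high mark {threshold.high} reached (value {value})"
    if threshold.low is not None and value <= threshold.low:
        return f"{threshold.metric}: low mark {threshold.low} reached (value {value})"
    return None
```

A returned message would then be passed to whichever delivery channel is chosen (email, or SMS if that is adopted).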
Created in 2020 by Google Inc.