The purpose of this page is to document the plan for redesigning GitHub statistics collection in the Caskalytics app.


Current Implementation of Github Metrics

  • API:

  • Use a Workflow Custom Action to make periodic REST calls to the GitHub API.

  • Results will be written into the GitHub partition of the Fileset.

  • A MapReduce job will periodically read from the GitHub partition of the Fileset, and update the Cube dataset.
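
As a rough illustration of the collection step, the sketch below maps one GitHub repository response to a record that could be appended to the GitHub partition. The output field names follow the stats response documented on this page. The helper names (repo_url, to_stats_record, to_partition_line), the collectedAt field, the one-JSON-line-per-record layout, and the choice of GitHub source fields are all assumptions, not the Caskalytics implementation.

```python
import json

GITHUB_API = "https://api.github.com"  # public GitHub REST API base URL


def repo_url(org, repo):
    """URL of the GitHub 'get a repository' REST call."""
    return f"{GITHUB_API}/repos/{org}/{repo}"


def to_stats_record(repo_json, total_pull_requests, collected_at):
    """Map a GitHub repository response to one stats record.

    Assumption: the GitHub fields chosen here are a guess at the mapping.
    The repository response carries no pull request total, so the caller
    has to supply it from a separate call.
    """
    return {
        "name": repo_json["full_name"],
        "size": repo_json["size"],
        "forks": repo_json["forks_count"],
        "watchers": repo_json["watchers_count"],
        "stargazers": repo_json["stargazers_count"],
        "openIssues": repo_json["open_issues_count"],
        "totalPullRequests": total_pull_requests,
        "collectedAt": collected_at,  # hypothetical field: collection time in seconds
    }


def to_partition_line(record):
    """Serialize a record as one JSON line (assumed partition file layout)."""
    return json.dumps(record, sort_keys=True)
```

The periodic action would call repo_url, fetch the response, and append the serialized line to the partition; the MapReduce job would then read those lines back.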


  • These will be REST endpoints used to get repo stats for Caskalytics
  • Dataset will contain two stores: a Table to hold the raw messages and a Cube to hold the metrics.
  • As the raw data is written to the Table store, the metrics in the Cube will be updated as needed
    • Endpoints (method, endpoint, description, parameters, response):

      GET /{org}/{repo}/stats
        Description: Returns the stats of the given repo
        Parameters:
          org (required): String - the org for the repo
          repo (required): String - the name of the repo
        Response:
          {
            "name": "russorat/savage-leads",
            "size": 481,
            "forks": 0,
            "watchers": 1,
            "stargazers": 1,
            "openIssues": 3,
            "totalPullRequests": 2
          }

      GET /{org}/{repo}/messages/{messageType}
        Description: Returns the messages for a given repo. A list of events can be found here:
        Parameters:
          org (required): String - the org for the repo
          repo (required): String - the name of the repo
          messageType (required): String - the type of message to return
          startTime (optional, default 0): start time to search for, in seconds
          endTime (optional, default now): end time to search for, in seconds
        Response:
          {
            totalMessages: 2,
            messages: ["{...}","{...}"]
          }

      GET /{sender}/stats
        Description: Returns statistics for a given GitHub user (sender). If no sender is found, an empty stats list is returned.
        Parameters:
          sender (required): String - the GitHub username to get stats for
        Response:
          {
            "sender": "russorat",
            "stats": {
              "issue_comment": 1,
              "issues": 3,
              "create": 1,
              "ping": 1,
              "push": 1
            }
          }
        Response when the sender is not found:
          {
            "sender": "russoratsdfsdf",
            "stats": {}
          }

      GET /topSenders/{messageType}?limit={limit}
        Description: Returns an array of the top senders for the given message type
        Parameters:
          messageType (required): String - the type of message to get the top senders for
          limit (optional, default 10): long - the number of results to return
        Response:
          [
            {
              "sender": "russorat",
              "stats": {
                "push": 1
              }
            }
          ]

      GET /{org}/{repo}/metric?metric={metric}
        Description: Returns a given custom metric for a repo
        Parameters:
          org (required): String - the org for the repo
          repo (required): String - the name of the repo
          metric (required): String - the custom metric to return
        Response:
          {
            repoName: "russorat/savage-leads",
            metricName: "repository.watchers",
            metric: 0
          }

      GET /{messageId}
        Description: Returns the raw message given a message id
        Parameters:
          messageId (required): String - the GitHub message id to return. Can be found using the messages endpoint
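
To make the response shapes of the sender endpoints concrete, here is a minimal in-memory sketch of how the per-sender stats and the top-senders list could be derived from a list of messages. The message dictionaries with "sender" and "messageType" keys and the function names are assumptions for illustration; the real service reads from the dataset rather than from a Python list.

```python
from collections import Counter


def sender_stats(messages, sender):
    """Count messages per message type for one sender.

    An unknown sender yields an empty stats object, matching the
    documented behaviour of the stats endpoint.
    """
    counts = Counter(m["messageType"] for m in messages if m["sender"] == sender)
    return {"sender": sender, "stats": dict(counts)}


def top_senders(messages, message_type, limit=10):
    """Return the senders with the most messages of the given type.

    The default limit of 10 mirrors the documented default.
    """
    counts = Counter(m["sender"] for m in messages if m["messageType"] == message_type)
    return [
        {"sender": name, "stats": {message_type: count}}
        for name, count in counts.most_common(limit)
    ]
```

For example, sender_stats(messages, "russoratsdfsdf") returns {"sender": "russoratsdfsdf", "stats": {}} when no message from that user exists.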

Github Dataset

    • Code Block
        "ref": "refs/heads/testbranch",
        "before": "0000000000000000000000000000000000000000",
        "after": "6d6db4855be89fb10f5b09a214a20b6125cd7be8",
        "created": true,
        "deleted": false,
        "forced": true,
        "base_ref": "refs/heads/master",
        "compare": "",
        "commits": [],
      GET /{org}/{repo}/messages/{messageType}?startTime={startTime}&endTime={endTime}&limit={limit}&offset={offset}
        Description: Returns a list of message Ids for the given repo and message type
        Parameters:
          org (required): String - the org for the repo
          repo (required): String - the name of the repo
          messageType (required): String - the type of message to search (push, issue, pull_request, etc.)
          startTime (optional, default 0): long - the start time as a unix timestamp in seconds
          endTime (optional, default now): long - the end time as a unix timestamp in seconds
          limit (optional, default 10): int - the number of results to return
          offset (optional, default 0): int - the offset used for paging
        Response:
          "totalMessages": 1,
          "messageIds": [


Github Raw Dataset

  • Dataset to store the raw messages captured from Github
  • Key is the X-GitHub-Delivery header of the message
  • The table has three columns, one for the messageId (String), one for the messageType (String), and one for the jsonPayload (String)
  • This table is RecordScannable so the data can be viewed in the UI.
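
A minimal sketch of how one incoming webhook delivery could be turned into a row of this table. X-GitHub-Delivery and X-GitHub-Event are headers GitHub sends with each webhook; using the delivery id as the messageId value, taking the message type from X-GitHub-Event, and the function name raw_row are assumptions.

```python
def raw_row(headers, body):
    """Build (row key, columns) for the raw dataset from one webhook delivery.

    Assumptions: messageId equals the delivery id, and the message type
    comes from the X-GitHub-Event header.
    """
    delivery_id = headers["X-GitHub-Delivery"]
    columns = {
        "messageId": delivery_id,
        "messageType": headers["X-GitHub-Event"],
        "jsonPayload": body,  # stored as the unparsed JSON string
    }
    return delivery_id, columns
```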

Github Parsed Dataset

  • Dataset will contain a Table to hold the parsed messages.
  • The JSON message is first flattened, and then each value is inserted as a column in the Table. A final field called rawPayload is also written to capture the full payload. Additional columns for eventId and messageType are also added.
  • The key to the table will be <fullRepoName>-<messageType>-<timestampInSeconds><inverseTimestampInSeconds>-<X-GitHub-Delivery>. This will allow scanning by message and by time with the most recent messages returned first.
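
The flattening and key construction described above can be sketched as follows. Several details are assumptions: nested names are joined with a dot (suggested by metric names such as repository.watchers), list elements are addressed by index, the time component of the key is the inverse timestamp alone, and the inverse is computed as the largest signed 64-bit value minus the timestamp, zero-padded so that string order matches numeric order. The eventId column is assumed to hold the delivery id.

```python
import json

MAX_LONG = 2**63 - 1  # assumed base for the inverse timestamp


def flatten(value, prefix=""):
    """Flatten nested JSON into {dotted.path: scalar}.

    Empty objects and empty lists produce no columns.
    """
    out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            out.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            out.update(flatten(child, f"{prefix}.{index}" if prefix else str(index)))
    else:
        out[prefix] = value
    return out


def row_key(full_repo_name, message_type, timestamp_s, delivery_id):
    """Key whose lexicographic order puts the newest message first."""
    inverse = MAX_LONG - timestamp_s
    return f"{full_repo_name}-{message_type}-{inverse:019d}-{delivery_id}"


def parsed_row(full_repo_name, message_type, timestamp_s, delivery_id, payload_json):
    """Build (row key, columns) for the parsed dataset."""
    columns = flatten(json.loads(payload_json))
    columns["rawPayload"] = payload_json
    columns["eventId"] = delivery_id  # assumption: eventId is the delivery id
    columns["messageType"] = message_type
    return row_key(full_repo_name, message_type, timestamp_s, delivery_id), columns
```

Because a later timestamp yields a smaller inverse value, a forward scan over the prefix <fullRepoName>-<messageType>- returns the most recent messages first.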

Github Metrics

  • Data is stored in a Cube dataset
  • The Cube will have the following properties
    • Resolutions (in seconds): 60, 3600, 86400, 604800 (minute, hour, day, week)
    • Dimensions: 
      • repository
      • message_type
      • repository, message_type
      • sender
      • sender, message_type
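
To illustrate how the resolutions and dimension groups listed above interact, here is a small in-memory stand-in for the Cube. It is only a sketch of the aggregation idea, not the CDAP Cube API: the class CountCube, its methods, and the epoch-aligned bucketing are assumptions. Each message increments one counter per resolution and per dimension group.

```python
from collections import defaultdict

RESOLUTIONS = (60, 3600, 86400, 604800)  # seconds
AGGREGATIONS = (
    ("repository",),
    ("message_type",),
    ("repository", "message_type"),
    ("sender",),
    ("sender", "message_type"),
)


class CountCube:
    """Toy cube that counts events per time bucket and dimension group."""

    def __init__(self):
        self.cells = defaultdict(int)

    def add(self, timestamp_s, dims, value=1):
        """Record one event in every resolution and every dimension group."""
        for resolution in RESOLUTIONS:
            bucket = timestamp_s - timestamp_s % resolution  # epoch-aligned bucket start
            for group in AGGREGATIONS:
                key = (resolution, bucket, tuple((d, dims[d]) for d in group))
                self.cells[key] += value

    def query(self, resolution, timestamp_s, **dims):
        """Read the count for the bucket containing timestamp_s."""
        bucket = timestamp_s - timestamp_s % resolution
        for group in AGGREGATIONS:
            if set(group) == set(dims):
                key = (resolution, bucket, tuple((d, dims[d]) for d in group))
                return self.cells.get(key, 0)
        raise KeyError("no aggregation covers exactly these dimensions")
```

Queries are only possible for dimension combinations that were declared up front, which is why the list of dimension groups matters when designing the Cube.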
