Tracker Audit Log

Audit Log

Use-cases

Case #1

  • Rishab is a data scientist/engineer at a company that implements a Data Lake. He is analyzing the effectiveness of the recommendation engine on the company's e-commerce site. For this investigation, he wants to analyze a dataset that includes the click logs for the last year. He is looking for clean, up-to-date click log data. He wants to use part of the data to build a model and the rest to score the model and validate its predictions.

  • Before he can conduct an analysis, Rishab needs to confirm the dataset is available in the Data Lake.

  • To do so, he wishes to find all entities that include “click log”.

  • He arrives at the Finder home screen (from nav, search results, other entry points?). For this analysis, Rishab is most concerned with the recency, accuracy, and integrity of the data.

    • Enters “click log” in the Search Box and clicks Search.

    • He arrives at the Results Page. 

      • Results returned

      • By default, they are sorted by creation time

      • Each Result includes:

        • Snippet of the metadata that matches his query in context.

          • Important to help him evaluate the relevance of the results.

        • Date Created

          • To know how recent/new it is.

  • He clicks the result and arrives at the Entity Detail Page where he can view all of the metadata associated with an entity. 

  • Rishab wishes to verify the validity of the sources of this dataset. To do so, he clicks the Lineage Tab to trace the creation of this dataset to its source.

  • Finder displays the lineage for this dataset as a diagram. The selected dataset displays in the center; to the left is the entity that precedes it and to the right is the one it precedes.

  • Rishab discovers that it has been created from two separate sources.

  • He then clicks one of the sources, which takes him to the Entity Page of that dataset.

  • He clicks on a program to see what has been done to the dataset.

  • Rishab clicks the Audit Logs Tab to see how active this dataset has been: when it was last updated and who is using it, writing to it, or reading from it.

  • Rishab clicks the appropriate action to make this dataset a new source for his existing Click Log processing pipeline.

  • This takes him to the Hydrator Studio where he can edit the Master Click Log pipeline.

 

Storing Audit Log

  • Goal: Read AuditLog messages from Kafka and write messages to Table dataset.

    • Reusing the MetadataConsumer flowlet from the Navigator App to handle reading messages from Kafka

      • Because of this, the app requires a Kafka config in order to be installed

        • { "config": { "auditLogKafkaConfig": { "zookeeperString": "<host>:<port>/cdap/kafka", "topic" : "audit" } } }
    • New Flowlet (AuditLogPublisher) for writing Kafka messages to Dataset

      • Dataset is a Table class

      • Data is stored using the inverse timestamp so that the most recent message is always stored and returned first

      • Dataset key format: <namespace>DELIMITER<type>DELIMITER<name>DELIMITER<inverseTimeInMilliSecondsLong>DELIMITER<UUID>

      • DELIMITER is currently "\1"

      • Dataset Columns: 

        • timestamp - Long - the time at which the message was generated

        • entityId - EntityId - the entity id that the message refers to. Only entity types with a namespace are supported.

        • user - String - the name of the user that generated the message. If the user is blank, a default value of "unknown" is inserted.

        • actionType - String - The type of action that was taken. For more details, see: Audit information publishing

        • entityType - String - The EntityType from the id, in lowercase

        • entityName - String - The name of the Entity

        • metadata - AuditPayload - The change that was made, either a metadata change or an access. For all other types, the payload is empty
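
The key format and the "unknown" user default described above can be sketched as follows. This is a minimal illustration, not the app's actual code: the function names, the use of Long.MAX_VALUE minus the timestamp as the inverse time, and the 8-byte big-endian encoding are all assumptions made for the example. It shows why an inverse timestamp makes the most recent message sort first within one entity's key range.

```python
import uuid

DELIMITER = "\x01"      # the "\1" delimiter from the key format above
LONG_MAX = 2**63 - 1    # assumption: inverse time is Long.MAX_VALUE - timestamp


def inverse_time(timestamp_ms: int) -> int:
    """Map newer (larger) timestamps to smaller values."""
    return LONG_MAX - timestamp_ms


def make_row_key(namespace, entity_type, name, timestamp_ms, unique_id=None) -> bytes:
    """Build <namespace>D<type>D<name>D<inverseTime>D<UUID> as bytes (hypothetical helper)."""
    unique_id = unique_id or str(uuid.uuid4())
    delim = DELIMITER.encode("utf-8")
    prefix = DELIMITER.join([namespace, entity_type, name]).encode("utf-8")
    # Assumption: a fixed-width big-endian long, so that byte-wise ordering
    # of keys matches numeric ordering of the inverse timestamp.
    inverse = inverse_time(timestamp_ms).to_bytes(8, "big")
    return prefix + delim + inverse + delim + unique_id.encode("utf-8")


def normalize_user(user) -> str:
    """Return the user name, or the default "unknown" when it is blank."""
    return user if user and user.strip() else "unknown"


# Two messages for the same entity, one second apart.
older = make_row_key("default", "dataset", "clicklog", 1_000, "a")
newer = make_row_key("default", "dataset", "clicklog", 2_000, "a")

# A scan in ascending key order reaches the newer message first.
scan_order = sorted([older, newer])
```

Because the namespace, type, and name form the key prefix, all messages for one entity sit next to each other, and a scan over that prefix yields them newest first without any extra sorting.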

Reading Audit Log

  • Goal: Expose the AuditLog dataset as a REST API for consumption by the UI

    • Fields returned

      • totalResults - The total number of results for the query. Counting stops early once there are more than 100 results, since the UI cannot display more than that.

      • offset - The starting offset of the first result

      • results - An array of result records, at most limit entries long, ordered with the most recent timestamp first

    • REST API Design
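
The response fields listed under "Fields returned" can be illustrated with a small sketch. It is not the REST API design itself: the function name, the default limit of 10, and the choice to report exactly 100 when the count is cut off are assumptions made for the example.

```python
MAX_TOTAL = 100  # the 100-result threshold from the text; reporting exactly 100 is an assumption


def build_response(records, offset=0, limit=10):
    """Build the response body from audit records (hypothetical helper).

    records: list of dicts that each carry a numeric "timestamp".
    """
    # Most recent timestamp first, matching the order of the stored keys.
    ordered = sorted(records, key=lambda r: r["timestamp"], reverse=True)
    return {
        "totalResults": min(len(ordered), MAX_TOTAL),
        "offset": offset,
        "results": ordered[offset:offset + limit],
    }


# 150 sample records with timestamps 0..149.
sample = [{"timestamp": t, "user": "unknown"} for t in range(150)]
response = build_response(sample, offset=5, limit=10)
```

Here the UI would receive ten records starting at the sixth most recent one, together with a total that never exceeds the threshold.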

 

 

Created in 2020 by Google Inc.