Tracker Audit Log
Audit Log
Use-cases
Case #1
Rishab is a data scientist/engineer at a company that implements a Data Lake. He is analyzing the effectiveness of the recommendation engine on the company's e-commerce site. For this investigation, he wants to analyze a dataset that includes the click log for the last year. He is looking for clean click log data that is up-to-date. He wants to use part of the data to build a model and the rest to score the model and validate its predictions.
Before he can conduct an analysis, Rishab needs to confirm the dataset is available in the Data Lake.
To do so, he wishes to find all entities that include “click log”.
He arrives at the Finder home screen (from nav, search results, other entry points?). For this analysis, Rishab is most concerned with the recency, the accuracy, and the integrity of the data.
Enters “click log” in the Search Box and clicks Search.
He arrives at the Results Page.
Results returned
By default, they are sorted by creation time
Each Result includes:
Snippet of the metadata that matches his query in context.
Important to help him evaluate the relevance of the results.
Date Created
To know how recent/new it is.
He clicks the result and arrives at the Entity Detail Page where he can view all of the metadata associated with an entity.
Rishab wishes to verify the validity of the sources of this dataset. To do so, he clicks the Lineage Tab to trace the creation of this dataset to its source.
Finder displays the lineage for this dataset as a diagram. The selected dataset displays in the center; to the left is the entity that precedes it and to the right is the one it precedes.
Rishab discovers that it has been created from two separate sources.
He then clicks one of the sources which takes him to the Entity Page of that dataset.
He clicks on a program to see what has been done to the dataset.
Rishab clicks the Audit Logs Tab to see how active this dataset has been: when it was last updated, and who is using it, writing to it, or reading from it.
Rishab clicks the appropriate action to make this dataset a new source for his existing Click Log processing pipeline.
This takes him to the Hydrator Studio where he can edit the Master Click Log pipeline.
Storing Audit Log
Goal: Read AuditLog messages from Kafka and write them to a Table dataset.
Reusing the MetadataConsumer flowlet from the Navigator App to handle reading messages from Kafka
Because of this, the app requires a Kafka config in order to be installed:
{ "config": { "auditLogKafkaConfig": { "zookeeperString": "<host>:<port>/cdap/kafka", "topic" : "audit" } } }
New Flowlet (AuditLogPublisher) for writing Kafka messages to Dataset
Dataset is a Table class
Data is stored using the inverse timestamp so that the most recent message is always stored and returned first
Dataset key format: <namespace>DELIMITER<type>DELIMITER<name>DELIMITER<inverseTimeInMilliSecondsLong>DELIMITER<UUID>
DELIMITER is currently "\1"
Dataset Columns:
timestamp - Long - timestamp of the message generated
entityId - EntityId - the entity id that the message refers to. Only entity types with a namespace are supported.
user - String - the name of the user that generated the message. If the user is blank, a default value of "unknown" is inserted.
actionType - String - The type of action that was taken. For more details, see: Audit information publishing
entityType - String - The EntityType from the id, lowercase
entityName - String - The name of the Entity
metadata - AuditPayload - The change that was made, either a metadata change or an access. For all other types, the payload is empty
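The key format and column defaults above can be sketched as follows. This is only an illustration in Python, not the flowlet's actual code: the function names are hypothetical, a plain dict stands in for a Table row, and it assumes the inverse timestamp is computed as the largest signed 64-bit value minus the message timestamp. The inverse time is zero-padded here so that string ordering matches numeric ordering; the real dataset may encode the long differently.

```python
import uuid

DELIMITER = "\x01"      # the "\1" separator from the key format above
LONG_MAX = 2**63 - 1    # assumption: inverse time = Long.MAX_VALUE - timestamp


def inverse_time(timestamp_ms):
    """Return a value that shrinks as the timestamp grows."""
    return LONG_MAX - timestamp_ms


def make_row_key(namespace, entity_type, name, timestamp_ms, unique_id=None):
    """Build <namespace>\\1<type>\\1<name>\\1<inverseTime>\\1<UUID> (hypothetical helper)."""
    unique_id = unique_id or str(uuid.uuid4())
    # Zero-pad to 19 digits so lexicographic order equals numeric order.
    inverse = str(inverse_time(timestamp_ms)).zfill(19)
    return DELIMITER.join([namespace, entity_type, name, inverse, unique_id])


def make_row(entity_type, name, timestamp_ms, user, action_type, payload=None):
    """Build the column values for one audit message (hypothetical helper)."""
    return {
        "timestamp": timestamp_ms,
        "user": user if user else "unknown",   # blank user falls back to "unknown"
        "actionType": action_type,
        "entityType": entity_type.lower(),     # stored lowercase
        "entityName": name,
        "metadata": payload or {},             # empty for non-metadata, non-access types
    }
```

Because a newer message has a smaller inverse time, an ascending scan over keys with the same namespace, type, and name prefix returns the most recent message first. The trailing UUID keeps two messages with the same timestamp from overwriting each other.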
Reading Audit Log
Goal: Expose the AuditLog dataset as a REST API for consumption by the UI
Fields returned
totalResults - the total number of results for the query. If there are more than 100 results, counting stops early, since the UI cannot show more than that.
offset - The starting offset of the first result
results - An array of result records with a max length of limit and most recent timestamp first
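As a rough sketch of how these three fields could be assembled from a scan that already yields records most recent first: the function name `build_response`, the default `limit`, and the exact early-exit rule (stop once the count exceeds 100) are assumptions made for illustration, not the handler's actual behavior.

```python
MAX_COUNTED_RESULTS = 100  # threshold from the totalResults description


def build_response(records, offset=0, limit=10):
    """Assemble totalResults, offset, and results from an ordered scan.

    `records` is any iterable that yields audit records most recent first.
    """
    total = 0
    page = []
    for index, record in enumerate(records):
        total += 1
        # Keep only the records that fall inside the requested window.
        if offset <= index < offset + limit:
            page.append(record)
        # Assumed early exit: stop scanning once the count passes the cap.
        if total > MAX_COUNTED_RESULTS:
            break
    return {"totalResults": total, "offset": offset, "results": page}
```

Under this assumption, a query with many matches reports a `totalResults` just over the cap rather than the true count, which is enough for the UI to tell that there are more results than it can show.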
REST API Design