Audit Log

Use-cases

Case #1

Rishab is a data scientist/engineer at a company that has implemented a Data Lake. He is analyzing the effectiveness of the recommendation engine on the company's e-commerce site. For this investigation, he wants to analyze a dataset that includes the click log for the last year. He is looking for clean, up-to-date click log data. He wants to use part of the data to build a model and the rest to score the model and validate its predictions.
Before he can conduct an analysis, Rishab needs to confirm the dataset is available in the Data Lake.
To do so, he wishes to find all entities that include “click log”.
He arrives at the Finder home screen (from nav, search results, other entry points?). For this analysis, Rishab is most concerned with the recency, the accuracy, and the integrity of the data.
- Enters “click log” in the Search Box and clicks Search.
- He arrives at the Results Page.
  - Results returned
  - By default, they are sorted by creation time
  - Each Result includes:
    - Snippet of the metadata that matches his query in context.
      - Important to help him evaluate the relevance of the results.
    - Date Created
      - To know how recent/new it is.
He clicks the result and arrives at the Entity Detail Page where he can view all of the metadata associated with an entity.
Rishab wishes to verify the validity of the sources of this dataset. To do so, he clicks the Lineage Tab to trace the creation of this dataset to its source.
Finder displays the lineage for this dataset as a diagram. The selected dataset displays in the center; to the left is the entity that precedes it and to the right is the one it precedes.
Rishab discovers that it has been created from two separate sources.
He then clicks one of the sources which takes him to the Entity Page of that dataset.
He clicks on a program to see what has been done to the dataset.
Rishab clicks the Audit Logs Tab to see how active this dataset has been - when was it last updated, who is using it, writing to it, reading from it.
Rishab clicks the appropriate action to make this dataset a new source for his existing Click Log processing pipeline.
This takes him to the Hydrator Studio where he can edit the Master Click Log pipeline.

Storing Audit Log

Goal: Read AuditLog messages from Kafka and write messages to Table dataset.
- Reusing the MetadataConsumer flowlet from the Navigator App to handle reading messages from Kafka
  - Because of this, the app requires a Kafka config in order to be installed
    - { "config": { "auditLogKafkaConfig": { "zookeeperString": "<host>:<port>/cdap/kafka", "topic" : "audit" } } }
- New Flowlet (AuditLogPublisher) for writing Kafka messages to Dataset
  - Dataset is a Table class
  - Data is stored using the inverse timestamp so that the most recent message is always stored and returned first
  - Dataset key format: <namespace>DELIMITER<type>DELIMITER<name>DELIMITER<inverseTimeInMilliSecondsLong>DELIMITER<UUID>
  - DELIMITER is currently "\1"
  - Dataset Columns:
    - timestamp - Long - timestamp of the message generated
    - entityId - EntityId - the entity id that the message refers to. Only entity types with a namespace are supported.
    - user - String - the name of the user that generated the message. If the user is blank, a default value of "unknown" is inserted.
    - actionType - String - The type of action that was taken. For more details, see: Audit information publishing
    - entityType - String - The EntityType from the id, lowercase
    - entityName - String - The name of the Entity
    - metadata - AuditPayload - The change that was made, either a metadata change or an access. For all other types, the payload is empty
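The key layout above can be sketched as follows. This is a minimal illustration, not the AuditLogPublisher code: `build_row_key` is a hypothetical helper, and it assumes the inverse timestamp is computed as the maximum signed 64-bit value minus the timestamp and written as 8 big-endian bytes, which the text above does not spell out.

```python
import struct
import uuid

DELIMITER = b"\x01"    # the "\1" separator named above
MAX_LONG = 2**63 - 1   # assumption: mirrors Java's Long.MAX_VALUE


def build_row_key(namespace, entity_type, name, timestamp_ms, unique_id=None):
    """Hypothetical helper: assemble a key in the documented field order."""
    # Assumption: "inverse" means MAX_LONG - timestamp, so newer is smaller.
    inverse = MAX_LONG - timestamp_ms
    unique_id = unique_id or uuid.uuid4()
    parts = [
        namespace.encode("utf-8"),
        entity_type.encode("utf-8"),
        name.encode("utf-8"),
        # Fixed-width big-endian bytes sort in the same order as the numbers.
        struct.pack(">q", inverse),
        unique_id.bytes,
    ]
    return DELIMITER.join(parts)


older = build_row_key("default", "dataset", "clicklog", 1457467029557)
newer = build_row_key("default", "dataset", "clicklog", 1457467030000)

# A table scan returns rows in ascending key order, emulated here by sorting.
keys_in_scan_order = sorted([older, newer])
```

Because the inverse value shrinks as the timestamp grows, the newer message sorts first within the same entity prefix, and the trailing UUID keeps two messages with the same millisecond timestamp from colliding.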

Reading Audit Log

Goal: Expose the AuditLog dataset as a REST API for consumption by the UI

Fields returned
- totalResults - the total number of results for the query. If there are more than 100 results, the count stops early, since the UI cannot show more than that.
- offset - The starting offset of the first result
- results - An array of result records with a max length of limit and most recent timestamp first
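How these three fields relate to offset and limit can be sketched as below. `build_response` and `MAX_TOTAL` are hypothetical names, and capping totalResults at exactly 100 is an assumption drawn from the note above, not a confirmed detail of the service.

```python
MAX_TOTAL = 100  # assumption: the cap mentioned for totalResults


def build_response(records, offset=0, limit=10):
    """Hypothetical helper: page a list of records, newest first.

    Each record is a dict with a "time" key holding a millisecond timestamp.
    """
    ordered = sorted(records, key=lambda r: r["time"], reverse=True)
    return {
        "totalResults": min(len(ordered), MAX_TOTAL),
        "results": ordered[offset:offset + limit],
        "offset": offset,
    }


records = [{"time": t} for t in range(1, 26)]
page = build_response(records, offset=10, limit=10)
```

Here `page["results"]` holds the 11th through 20th most recent records, and `page["offset"]` echoes the requested offset so the UI can compute the next page.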

REST API Design

HTTP Request Type: GET

Endpoint: /namespaces/{namespace-id}/apps/_Tracker/services/AuditLog/methods/auditlog/{type}/{name}

Request Params:

Name	Required	Description	Default Value
type	yes	The type of the entity to search for, e.g. dataset or stream. Any namespaced entity can be searched for. Possible values: dataset, stream, stream_views
name	yes	The name of the entity to search for
startTime	no	The start time to search for. Accepts "now - 1d" syntax. Milliseconds granularity for timestamps.	0
endTime	no	The end time to search for. Accepts "now - 1d" syntax. Milliseconds granularity for timestamps.	now
offset	no	The offset to start the results at for paging	0
limit	no	The max number of results to return in the results	10
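One way the startTime and endTime values could be resolved to millisecond timestamps is sketched below. `resolve_time` is a hypothetical helper. The set of unit suffixes (s, m, h, d) and support for "+" offsets are assumptions, since the table only shows the "now - 1d" form.

```python
import re
import time

# Assumption: these unit suffixes; only "d" appears in the table above.
UNIT_MS = {"s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
_EXPR = re.compile(r"^now(?:\s*([+-])\s*(\d+)([smhd]))?$")


def resolve_time(value, now_ms=None):
    """Hypothetical helper: turn a parameter value into epoch milliseconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    text = value.strip()
    # A plain number is taken as a millisecond timestamp.
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    match = _EXPR.match(text)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    sign, amount, unit = match.groups()
    if sign is None:  # the bare word "now"
        return now_ms
    delta = int(amount) * UNIT_MS[unit]
    return now_ms - delta if sign == "-" else now_ms + delta


one_day_ago = resolve_time("now - 1d", now_ms=1457467029557)
```

Passing `now_ms` explicitly keeps the function deterministic for testing. A value that is neither a number nor a recognized expression raises `ValueError`, which a handler could map to the 400 response.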

Response Status:

200 - Returns the audit log entries requested.

400 - Bad request. Returned when the input values are invalid, such as an incorrect date format, negative offsets/limits, or an invalid range. The response includes an appropriate error message.

500 - Unknown server error.
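The checks behind the 400 response might look roughly like this. `validate_query` and its messages are hypothetical, and treating a start time later than the end time as the "invalid range" case is an assumption.

```python
def validate_query(start_ms, end_ms, offset, limit):
    """Hypothetical helper: return (status, message) for resolved parameters."""
    if offset < 0:
        return 400, "offset cannot be negative"
    if limit < 0:
        return 400, "limit cannot be negative"
    # Assumption: a range is invalid when it starts after it ends.
    if start_ms > end_ms:
        return 400, "startTime cannot be later than endTime"
    return 200, "OK"


ok = validate_query(0, 1457467029557, offset=0, limit=10)
bad_range = validate_query(1457467029557, 0, offset=0, limit=10)
```

Returning the message alongside the status lets the caller place it in the response body, matching the requirement that a 400 carry an explanatory error.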

Response Body:

{
	"totalResults": 1,
	"results": [{
		"time": 1457467029557,
		"entityId": {
			"namespace": "default",
			"application": "testCubeAdapter",
			"type": "Workflow",
			"program": "ETLWorkflow",
			"entity": "PROGRAM"
		},
		"user": "unknown",
		"type": "METADATA_CHANGE",
		"payload": {
			"previous": {
				"SYSTEM": {
					"properties": { },
					"tags": [ ]
				}
			},
			"additions": {
				"SYSTEM": {
					"properties": { },
					"tags": [
						"ETLMapReduce",
						"Batch",
						"Workflow",
						"ETLWorkflow"
					]
				}
			},
			"deletions": {
				"SYSTEM": {
					"properties": { },
					"tags": [ ]
				}
			}
		}
	}],
	"offset": 0
}

Example response when no results are found:

{
	"totalResults": 0,
	"results": [ ],
	"offset": 0
}
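A client could assemble the request for the endpoint and parameters described in this section along these lines. `audit_log_url` is a hypothetical helper, and the base URL in the example is a placeholder, not a real CDAP address.

```python
from urllib.parse import quote, urlencode


def audit_log_url(base_url, namespace, entity_type, name, **params):
    """Hypothetical helper: build the GET URL for one entity's audit log."""
    path = "/namespaces/{}/apps/_Tracker/services/AuditLog/methods/auditlog/{}/{}".format(
        quote(namespace, safe=""),
        quote(entity_type, safe=""),
        quote(name, safe=""),
    )
    # Optional parameters left as None are omitted so server defaults apply.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return base_url.rstrip("/") + path + ("?" + query if query else "")


url = audit_log_url(
    "http://host.example:1234/",  # placeholder base URL
    "default", "dataset", "clicklog",
    startTime="now-1d", limit=20,
)
```

Percent-encoding each path segment guards against entity names containing characters such as "/" or spaces.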
