Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

Implement configuring impersonation at the application level. Enable impersonation in Explore queries.

Goals

Application Impersonation: Earlier work implemented impersonation for programs and data operations, but it could only be configured at the namespace level. We need the ability to configure this at the application level, so that we can run programs as different users without having to manage an additional namespace for each additional user.

Explore Impersonation: Earlier work implemented impersonation in Hive for Explore queries, so that they run as the namespace user if one was provided. For better security, we would like to run Explore queries as the user who submits them.

Entity ownership: Entities created by applications should be owned by the application owner. Access permissions to those entities could be given to other users at create time or at a later time.

User Stories

(Similar to the Secure Impersonation user stories, but with application-level impersonation)
1. As a CDAP admin, I would like to map an application (and the entities it contains) to a Kerberos principal. When CDAP programs of this application are submitted to YARN, the applications should be run as that user.
2. As a CDAP application developer, my application should access HDFS, HBase, Hive, and other resources as the user/principal configured for it, instead of the global 'cdap' (or other, configured) user.
3. As a CDAP admin, I would like Explore queries to run as the user submitting the query.
4. As a CDAP application/dataset/stream owner, I would like to give other users access to the application/dataset/stream during creation or afterwards.

Scenarios

Scenario 1: App Creation

Alice is a human user and would like to create an app using an artifact. Alice has ADMIN access on the CDAP instance. She specifies a Kerberos principal, Louis, as the owner. After the app has been created, she would like the following to be true:

Louis should get all the privileges (READ/WRITE/EXECUTE) on all the entities created by the deployed application, with CDAP authorization.
Louis should own all streams/dataset created by deployed app i.e. he will own the HDFS files, HBase tables, and Hive tables.
All programs should run with Louis' credentials (e.g. Kerberos ticket) i.e if another user Bob, who has sufficient privileges to run a program (EXECUTE on program and READ on namespace, if CDAP Authorization is turned on), starts the program then the program should run as Louis.

Scenario 2: Dataset Creation/Maintenance

Alice is a human user. Alice would like to create a dataset without deploying an application, and during creation she wants to specify an owner who will own the dataset, i.e. the HDFS files/HBase tables/Hive tables. She specifies the principal for a headless user Louis, whose account she has access to, as the owner.
Alice would like to perform dataset maintenance operations (truncate, delete, update) from the REST APIs, CLI, or UI, and she would like these operations to be performed as the dataset owner Louis.
Another user, Bob, who has sufficient privileges to administer the dataset, can also perform the maintenance operations. All operations will be performed as the dataset owner Louis.

Scenario 3: Access Control

Jules is a human user who does not have CDAP credentials and wants to run a Hive query outside of CDAP. Her access to the data can be controlled by group permissions.
Mary is a headless user who owns a CDAP program that reads from a dataset owned by Louis. An admin adds Mary to the group for the dataset. The program owned by Mary can now read the dataset.
Eve is a human user who has both LDAP and Kerberos credentials. She logs into the CDAP UI with her LDAP credentials and submits a query. While submitting the query she provides her Kerberos principal and password. The query should be run as her Kerberos principal.

Design

Currently, whenever we need to perform a data operation or launch a program in YARN, we look up the namespace that the entity exists in and impersonate the principal mapped to that namespace. If there is no mapping, we perform actions as the current user (the cdap system user). Now, we will also need to maintain a principal mapping for entities such as applications, streams, and datasets.

Entity ownership

The ownership information for entities will be stored in an "owner.meta" table. The table will store the mapping from an entity to its owner's Kerberos principal (as a string). This information, along with the permissions on the entity, will be pushed down to the storage provider and used to control access (future work).

This will introduce an additional step during entity creation. An entry will need to be made to the owner.meta table.

The table will not be used to store ACLs for this release as that will be handled by the storage provider but in future releases, we can expand this to manage the ACLs. This feature will be useful for storage providers that don't support ACLs. It will also be useful in providing a layer of abstraction over authorization backends like Apache Sentry and Apache Ranger.

Note: If an entity exists with an associated owner and the same entity is being created by some other user, then this operation will fail. Also, if this entity creation was triggered by some other operation, then the complete operation will fail too. For example, Alice has deployed an app in CDAP which created a dataset called 'employees'. If Bob now tries to deploy another app which creates the same 'employees' dataset, the app deployment will fail. If Bob wants to read the 'employees' dataset from his app, he should get the dataset dynamically in his program. He will be able to read the dataset if the conditions of Scenario 3.2 are met.

Rows in owner.meta will be of the format

The row key will be constructed from the entity id and will capture the Entity hierarchy. e.g. for a stream it will be constructed using the namespace and stream id.

rowkey: {<created from entity id>}, column {'c'}, and the owner's principal as the value
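The row layout above can be illustrated with a small sketch. The exact key encoding is not specified in this document, so the entity-type prefix, the ':' separator, and the class name OwnerRowKeys are all hypothetical; only the single column 'c' holding the owner's principal comes from the text above.

```java
final class OwnerRowKeys {
  // Column name taken from the design text; everything else here is an assumption.
  static final String COLUMN = "c";

  // Hypothetical separator between the parts of the entity hierarchy.
  private static final char SEPARATOR = ':';

  private OwnerRowKeys() { }

  /**
   * Builds a row key that captures the entity hierarchy, e.g. for a stream
   * the entity type followed by the namespace and the stream id.
   */
  static String rowKey(String entityType, String... hierarchy) {
    StringBuilder key = new StringBuilder(entityType);
    for (String part : hierarchy) {
      key.append(SEPARATOR).append(part);
    }
    return key.toString();
  }
}
```

For example, under these assumptions OwnerRowKeys.rowKey("stream", "default", "st1") yields "stream:default:st1", and the value stored under column 'c' would be the owner's principal string.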

User management

To allow headless users access to the system, other authorized users need to impersonate them. To allow this impersonation we set the following convention:

All keytabs are present on the local filesystem on which CDAP master is running.
These keytabs are present under a path which needs to be specified in cdap-security.xml, in the form:
/<dir1>/<dir2>/${name}/${name}.keytab
${name} will be replaced with the short name of the owner's principal. It can be used anywhere in the path, e.g. /home/${name}/kerberos/keytabs/${name}.keytab

<property>
   <name>keytab.path</name>
   <value>/dir1/dir2/${name}/${name}.keytab</value>
</property>
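As a rough illustration of the ${name} substitution described above, the following sketch resolves a keytab path from a principal. The class and method names are hypothetical, and the short-name extraction is a simplification (it just strips the host and realm parts); a real deployment would more likely derive the short name through Hadoop's auth_to_local rules.

```java
final class KeytabPaths {
  private KeytabPaths() { }

  /**
   * Simplified short name: the part of the principal before the first '/'
   * or '@'. This is an assumption for illustration, not CDAP's actual logic.
   */
  static String shortName(String principal) {
    int end = principal.length();
    int at = principal.indexOf('@');
    if (at >= 0) {
      end = at;
    }
    int slash = principal.indexOf('/');
    if (slash >= 0 && slash < end) {
      end = slash;
    }
    return principal.substring(0, end);
  }

  /** Replaces every occurrence of ${name} in the configured template. */
  static String resolve(String template, String principal) {
    return template.replace("${name}", shortName(principal));
  }
}
```

With the keytab.path value shown above and the principal louis/host.net@KDC.NET, this sketch resolves to /dir1/dir2/louis/louis.keytab.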

Pushing permissions to storage engines after creation (Out of 4.1 Scope)

The permissions assigned for entities will need to be pushed down to storage providers so that access outside the system will have the same restrictions. Both HBase and HDFS support ACLs and they will be used to assign finer grained permissions to the underlying tables or files.

Directory permissions

The directory structure will be as follows. CDAP will own the parent directories for the namespace. The directories will be group writable, and everyone who has app deployment privileges will be part of that group so that they can create subdirectories. For any cleanup, for example when the namespace is being deleted, the system user will impersonate the subdirectory owners to do the deletion. With this impersonation in place, the system user will not need access permissions on user directories.

The groups for the directories will be specified while the entry is being created and once the directory is created the system will do a chgrp to change it to the provided group.

e.g.

drwxrwxr-x - cdap supergroup 0 2017-01-16 04:39 /cdap/namespaces/

To be able to create a namespace the user will need to be a part of the "supergroup".

A group can also be specified in cdap-security.xml with the property "namespace.creators". If a group is specified for this property, then CDAP will change the group of /cdap/namespaces to the specified group, allowing users in that group to create namespaces.

The namespace directory will be owned by the namespace owner.

During the creation of a namespace, a group can be specified. This group will have write and execute permissions on the namespace directory, allowing the users of this group to deploy applications in the namespace. Note: This will require a change in our existing namespace creation API.

drwxrwxr-x - accountadmin accountgroup 0 2017-01-16 04:39 /cdap/namespaces/account

To be able to create anything under that namespace, the user will have to be a part of the "accountgroup".

Stream:

drwxr-xr-x - account1 accountgroup 0 2017-01-17 02:41 /cdap/namespaces/account/streams/st1

All the directories will be owned by the headless users, whose keytabs need to be present so that they can be impersonated. Additionally, during the creation of an app, stream, or dataset the user can specify a group, and CDAP will change the group of the associated files on HDFS and tables on HBase and Hive so that the given group has read access.

Explore Impersonation

For Explore impersonation we won't be using keytabs. A human user will log in using their credentials, and to run Explore queries they will have to provide a Kerberos username and password. The system will authenticate with the KDC on behalf of the user and use the TGT to create a UGI for the user through the static method

static UserGroupInformation getUGIFromTicketCache(java.lang.String ticketCache, java.lang.String user)

This UGI will then be used to impersonate the queries.

The RemoteUGIProvider provides methods that are called when a UGI is needed to impersonate a user. During the call to RemoteUGIProvider#createUGI, the Kerberos TGT can be obtained from the master through a REST API (/impersonation/credentials).

class ImpersonationInfo currently contains a principal and their keytab. This will change to include the path to the ticket cache for the user.

Workflows

UI:

The explore window shows up when the user clicks on the explore icon on any explorable entity. If kerberos is enabled in the cluster then a modal window will show up the first time the explore icon is clicked. Through this window, the user can provide the Kerberos principal that the explore query should run as and the TGT for that principal.

The UI forwards the principal and the TGT to the router which forwards it to CDAP master. Both these routes support SSL. Once master has the TGT it can be serialized to HDFS with permissions set to 600.
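The "permissions set to 600" step can be sketched as follows. This is only an illustration on the local POSIX filesystem using java.nio; the design above targets HDFS, where the Hadoop FileSystem API would be used instead. The class name TicketFiles and both method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

final class TicketFiles {
  private TicketFiles() { }

  /**
   * Creates the file with owner-only permissions (600) before any ticket
   * bytes are written, so the content is never readable by other users.
   * Assumes a POSIX filesystem; other platforms will reject these attributes.
   */
  static Path writeOwnerOnly(Path target, byte[] ticketBytes) {
    Set<PosixFilePermission> ownerOnly = PosixFilePermissions.fromString("rw-------");
    try {
      Files.deleteIfExists(target);
      Files.createFile(target, PosixFilePermissions.asFileAttribute(ownerOnly));
      // Set explicitly as well, in case the process umask altered the mode at creation.
      Files.setPosixFilePermissions(target, ownerOnly);
      Files.write(target, ticketBytes);
      return target;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns the permissions in "rwxrwxrwx" notation, e.g. "rw-------". */
  static String permissionString(Path path) {
    try {
      return PosixFilePermissions.toString(Files.getPosixFilePermissions(path));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Creating the file with restricted permissions first, rather than writing and then restricting, avoids a window in which the serialized ticket is readable by others.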

The Explore container can then use the TGT on HDFS to create a UserGroupInformation object and use that to impersonate the principal for running the query. Once created, the UGI will be cached.
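The caching of a created UGI could look roughly like the sketch below. It is generic over the cached value so that it stays self-contained; in practice the value would be a UserGroupInformation. The class name PrincipalCache is hypothetical, and ticket expiry, which a real cache would need to handle, is only represented by an explicit invalidate call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class PrincipalCache<V> {
  private final ConcurrentMap<String, V> byPrincipal = new ConcurrentHashMap<>();

  /**
   * Returns the cached value for the principal, invoking the loader only
   * when no value has been cached yet.
   */
  V get(String principal, Function<String, V> loader) {
    return byPrincipal.computeIfAbsent(principal, loader);
  }

  /** Drops the cached value, e.g. when the underlying ticket has expired. */
  void invalidate(String principal) {
    byPrincipal.remove(principal);
  }
}
```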

CLI:

The user would need to do a kinit before they would be able to launch an Explore query from the CLI. The CLI would then pick up the TGT, and the rest of the flow is the same as for the UI.

REST:

For running Explore queries through the REST APIs the user will need to provide the TGT and the principal along with the query.

Upgrade tool

None

Open Questions

Currently, Hive impersonation does not work when the engine is set to Spark. Do we need to fix this in 4.1?

Notes

The principal configured for an application MUST have privileges to create tables in the (HBase) namespace it is deployed in. What happens if cdap is the entity creating this HBase namespace? How will the custom principal have CREATE privileges in that namespace?
We will use AuthorizationHandler and PrivilegesManager for managing ACLs on the entities during and after creation.
The specification for impersonation is at Secure Impersonation Specification

API changes

New Programmatic APIs

New internal APIs:
Impersonation Store: Stores the user keytab information

public class ImpersonationStore {
  public void addImpersonationInfo(final ImpersonationInfo impersonationInfo) throws IOException {  }

  public ImpersonationInfo getImpersonationInfo(final String principal) throws IOException, ImpersonationInfoNotFound {  }

  // idempotent
  public void delete(final String principal) throws IOException {  }
}

Permission Store: Stores the entity ownership information.

public class PermissionStore {
  public void addOwner(final EntityId entityId, final String principal) throws IOException {  }
  public ImpersonationInfo getOwner(final EntityId entityId) throws IOException, NotFoundException {  }
  // idempotent
  public void deleteOwner(final EntityId entityId) throws IOException {  }
}

public final class ImpersonationInfo {
  private final String principal;
  private final String keytabURI;
}
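To make the intended semantics of the ownership store more concrete, here is a minimal in-memory sketch. It is not the proposed implementation (which is backed by the owner.meta table); the class name InMemoryOwnerStore, the use of a plain string as the entity key, and the choice of IllegalStateException are assumptions for illustration. It shows two behaviours described in this document: adding an owner fails if the entity already has one, and deletion is idempotent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InMemoryOwnerStore {
  // entity key -> owner's Kerberos principal (as a string)
  private final Map<String, String> owners = new ConcurrentHashMap<>();

  /** Records the owner of an entity; fails if the entity already has an owner. */
  void addOwner(String entityKey, String principal) {
    String existing = owners.putIfAbsent(entityKey, principal);
    if (existing != null) {
      throw new IllegalStateException(
        "Entity " + entityKey + " is already owned by " + existing);
    }
  }

  /** Returns the owner's principal, or null if no owner is recorded. */
  String getOwner(String entityKey) {
    return owners.get(entityKey);
  }

  /** Idempotent: deleting a missing entry is not an error. */
  void deleteOwner(String entityKey) {
    owners.remove(entityKey);
  }
}
```

In the 'employees' example from the Design section, Alice's deployment would succeed in addOwner, while Bob's later attempt to create the same dataset would hit the failure path and abort his deployment.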

Potential new external APIs (TBD):
Allowing group and permissions for FileSets/Streams/(other?)

New REST APIs

Entity Ownership:

Please see Secure Impersonation Specification#EntityOwnership

Remote Owner Service

We need a Remote implementation of OwnerAdmin so that a program container or CDAP service container which performs requests under impersonation (as the namespace, app, dataset, or stream owner) can look up owner information internally if needed.

For example, an Explore query on a stream is handled by ExploreQueryExecutorHttpHandler. The handler impersonates the namespace owner. When the query actually runs, it might need to look up other CDAP resources (for example, the stream configuration). This call itself does impersonation by doing a doAs for the resource involved (in this case the stream). The Impersonator, which is responsible for providing the UGI to be impersonated for this call, tries to look up owner information for the resource and fails, because it tries to access the owner.meta table, which is a system table and cannot be accessed under user impersonation.

This requires adding a Remote implementation of OwnerAdmin which the program container and the CDAP service container can use to get the owner information. We will also need to add a handler in cdap-app-fabric which will serve the requests from the remote client. Since this handler will reside inside CDAP master and run as the cdap user, it can query the owner store through the owner admin.

We will expose the following endpoints: (Note: Currently, we only support owner for namespace, app, artifact, stream, dataset)

Operation	Path	Method	Request Body	Response Code	Response
Adding Owner	/v1/owner/	POST	{"namespacedEntityId": {}, "kerberosPrincipalId": {}}	200 - On success; 409 - if owner information for entity already exists; 500 - Any internal errors	
Deleting Owner	/v1/owner/	DELETE	{"stream": "stream", "namespace": "default", "entity": "STREAM"}	200 - On success; 500 - Any internal errors	
Getting Owner	/v1/owner/	GET	{"stream": "stream", "namespace": "default", "entity": "STREAM"}	200 - On success; 500 - Any internal errors	{"principal": "user/host.net@KDC.NET"}
Getting Impersonation Information	/v1/owner/impinfo	GET	{"namespacedEntityId": {}, "impersonatedOpType": {}}	200 - On success; 500 - Any internal errors	{"principal": "user/host.net@KDC.NET", "keytabURI": "/some/path"}
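The response codes listed for the Adding Owner endpoint could be derived from the store outcome along the following lines. This is a sketch only: the OwnerWriter interface and the AddOwnerHandlerSketch class are hypothetical stand-ins, not the actual handler in cdap-app-fabric.

```java
import java.io.IOException;

/** Hypothetical minimal view of the owner store used by the handler. */
interface OwnerWriter {
  /** Returns true if the owner was recorded, false if one already existed. */
  boolean addIfAbsent(String entityKey, String principal) throws IOException;
}

final class AddOwnerHandlerSketch {
  private AddOwnerHandlerSketch() { }

  /** Maps the outcome of adding an owner to the documented status codes. */
  static int statusFor(OwnerWriter store, String entityKey, String principal) {
    try {
      // 200 on success, 409 if owner information for the entity already exists.
      return store.addIfAbsent(entityKey, principal) ? 200 : 409;
    } catch (IOException e) {
      // 500 for any internal error, such as the backing table being unavailable.
      return 500;
    }
  }
}
```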

Entity Creation:

Please see: Secure Impersonation Specification#EntityCreation

CLI Impact or Changes

CDAP-8079 (Open): Provide a way for the user to specify Kerberos credentials when launching Explore queries through the CLI in an impersonated environment.
(optional) Create CLI for the above REST APIs

UI Impact or Changes

CDAP-8078 (Open): Provide a way for the user to specify Kerberos credentials when launching Explore queries through the UI in an impersonated environment.
(optional) Create UI for the above REST APIs

Security Impact

We will need to implement authorization on the above REST APIs (which manage the impersonation metadata). Authorization will also need to be added when programmatically accessing this metadata (such as when launching the programs or performing dataset operations involving impersonation).

Impact on Infrastructure Outages

This will rely on HBase for storing metadata (similar to how we store other application metadata). If HBase or the dataset service is unavailable, impersonation will not work.

Test Scenarios

Test ID	Test Description	Expected Results
IMP100	(default namespace) Deploy an application from an artifact, for principal X, and run a program.	The program should run as X. Datasets/streams should have their HDFS/HBase resources owned by X.
IMP101	(default namespace) Deploy another application from the same artifact, without specifying a principal, and run a program.	The program should run as the cdap system user. Datasets/streams should have their HDFS/HBase resources owned by the cdap system user.
IMP102	Run IMP100 and IMP101 in a custom namespace that doesn't have impersonation configured.	Expectation should be the same.
IMP103	Run IMP100 and IMP101 in a namespace that already has impersonation configured.	< Expected behavior TBD >
IMP104
IMP105
IMP106

Releases

Release 4.1.0

Related Work

Future work