...

Let's take a closer look at the RecordScannable interface.

Defining the Record Schema

The record schema is given by returning the Java type of the records; CDAP derives the schema from that type:

...
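
As an illustration, a minimal sketch of what such a record class and its getRecordType() method could look like (the Entry class and its fields are assumptions, inferred from the schema shown below):

Code Block
import java.lang.reflect.Type;

// A plain Java class whose fields determine the record schema.
public class Entry {
  private final String key;
  private final int value;

  public Entry(String key, int value) {
    this.key = key;
    this.value = value;
  }
}

// Inside the dataset class implementing RecordScannable<Entry>:
@Override
public Type getRecordType() {
  return Entry.class;
}

For such an Entry class, CDAP derives this schema: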

Code Block
(key STRING, value INT)

Limitations

  • The record type must be a structured type, that is, a Java class with fields, because SQL tables require a structured type at the top level. This means the record type cannot be a primitive, collection, or map type; however, these types may appear nested inside the record type (see the sketch after this list).

  • The record type must be that of an actual Java class, not an interface. The same applies to the types of any fields contained in the type. The reason is that interfaces only define methods but not fields; hence, reflection would not be able to derive any fields or types from the interface.

    The one exception to this rule is that Java collections, such as List and Set, as well as Java Map, are supported. This is possible because these interfaces are so commonly used that they deserve special handling. These interfaces are parameterized and require special care, as described in the next section.

  • The record type must not be recursive. In other words, neither the record type itself nor any of its nested field types may, directly or indirectly, contain a member of its own type, because a recursive type cannot be represented as a SQL schema.

  • Fields of a class that are declared static or transient are ignored during schema generation. This means that the record type must have at least one non-transient and non-static field. For example, the java.util.Date class has only static and transient fields. Therefore a record type of Date is not supported and will result in an exception when the dataset is created.

  • A dataset can only be used in ad-hoc queries if its record type is completely contained in the dataset definition. This means that if the record type is or contains a parameterized type, then the type parameters must be present in the dataset definition. The reason is that the record type must be instantiated when executing an ad-hoc query. If a type parameter depends on the jar file of the application that created the dataset, then this jar file is not available to the query execution runtime.

    For example, you cannot execute ad-hoc queries over an ObjectStore<MyObject> if MyObject is contained in the application jar. However, if you define your own dataset type MyObjectStore that extends or encapsulates an ObjectStore<MyObject>, then MyObject becomes part of the dataset definition for MyObjectStore.
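
To make these rules concrete, here is an illustrative sketch (all class and field names are hypothetical):

Code Block
import java.util.List;

// Valid record type: a structured class with at least one
// non-static, non-transient field; collections may appear nested.
public class Purchase {
  private String customer;            // customer STRING
  private List<String> items;         // nested collection: allowed
  private static int instanceCount;   // static: ignored by schema generation
  private transient long cachedTotal; // transient: ignored by schema generation
}

// Invalid record type: recursive, it contains a member of its own type
// and therefore cannot be represented as a SQL schema.
public class Node {
  private String name;
  private Node next;
}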

Parameterized Types

Suppose instead of being fixed to String and int, the Entry class is generic with type parameters for both key and value:

...
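
As a sketch of what this could look like: the generic record class, together with a getRecordType() that reconstructs the full parameterized type using Guava's TypeToken (the field names and the keyType/valueType members are assumptions; TypeToken is the usual tool for this, though the actual listing may differ):

Code Block
import java.lang.reflect.Type;
import com.google.common.reflect.TypeParameter;
import com.google.common.reflect.TypeToken;

public class Entry<KEY, VALUE> {
  private final KEY key;
  private final VALUE value;

  public Entry(KEY key, VALUE value) {
    this.key = key;
    this.value = value;
  }
}

// Inside a dataset class implementing RecordScannable<Entry<KEY, VALUE>>,
// where keyType and valueType are TypeToken fields describing the actual
// type arguments. Entry.class alone would lose them to type erasure.
@Override
public Type getRecordType() {
  return new TypeToken<Entry<KEY, VALUE>>() { }
    .where(new TypeParameter<KEY>() { }, keyType)
    .where(new TypeParameter<VALUE>() { }, valueType)
    .getType();
}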

While this seems a little more complex at first sight, it is the de facto standard way of dealing with Java type erasure.

Complex Types

Your record type can also contain nested structures, lists, or maps, and they will be mapped to type names as defined in the Hive language manual. For example, if your record type is defined as:

...
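
For instance, a hypothetical record type with a nested list, map, and structure (all names illustrative):

Code Block
import java.util.List;
import java.util.Map;

public class Movie {
  private String title;                  // title STRING
  private List<String> actors;           // actors array<string>
  private Map<String, Integer> ratings;  // ratings map<string,int>
  private Address location;              // location struct<street:string, city:string>
}

public class Address {
  private String street;
  private String city;
}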

Refer to the Hive language manual for more details on schema and data types.

StructuredRecord Type

There are times when your record type cannot be expressed as a plain old Java object. For example, you may want to write a custom dataset whose schema may change depending on the properties it is given. In these situations, you can implement a record-scannable dataset that uses StructuredRecord as the record type:

...
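
A sketch of the corresponding getRecordType() (the StructuredRecord package shown is the one used by older CDAP releases; newer releases use the io.cdap.cdap prefix):

Code Block
import java.lang.reflect.Type;
import co.cask.cdap.api.data.format.StructuredRecord;

// Inside a dataset class implementing RecordScannable<StructuredRecord>:
@Override
public Type getRecordType() {
  return StructuredRecord.class;
}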

The CDAP Table and ObjectMappedTable datasets implement RecordScannable in this way and can be used as references.

Scanning Records

The second requirement for enabling SQL queries over a dataset is to provide a means of scanning the dataset record by record. Similar to how the BatchReadable interface makes datasets readable by MapReduce programs by iterating over pairs of key and value, RecordScannable iterates over records. You need to implement a method to partition the dataset into splits, and an additional method to create a record scanner for each split:

...
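
A sketch of these two methods for a dataset that delegates to an embedded table that is already readable in batch (the embedded table field and the Scannables helper usage are assumptions about one possible implementation):

Code Block
import java.util.List;
import co.cask.cdap.api.data.batch.RecordScanner;
import co.cask.cdap.api.data.batch.Scannables;
import co.cask.cdap.api.data.batch.Split;

// Partitioning is delegated to the embedded table.
@Override
public List<Split> getSplits() {
  return table.getSplits();
}

// Each split is scanned by adapting the table's split reader so that
// every value it produces is returned as one record.
@Override
public RecordScanner<Entry> createSplitRecordScanner(Split split) {
  return Scannables.valueRecordScanner(table.createSplitReader(split));
}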

An example demonstrating an implementation of RecordScannable is included in the CDAP Sandbox in the directory examples/Purchase, namely the PurchaseHistoryStore.

Writing to Datasets with SQL

Data can be inserted into datasets using SQL. For example, you can write to a dataset named ProductCatalog with this SQL query:

...
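
As an illustration only, such a query could take this shape in Hive SQL (the dataset_ table-name prefix and the column names are assumptions; the exact naming depends on the CDAP version and namespace):

Code Block
INSERT INTO TABLE dataset_productcatalog
SELECT id, name, price
FROM dataset_newproducts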

Let's take a closer look at the RecordWritable interface.

Defining the Record Schema

Just like in the RecordScannable interface, the record schema is given by returning the Java type of each record, using the method:

...
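
That is, the same getRecordType() method already sketched above for RecordScannable:

Code Block
// Returns java.lang.reflect.Type
Type getRecordType();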

The same rules that apply to the type of the RecordScannable interface apply to the type of the RecordWritable interface. In fact, if a dataset implements both RecordScannable and RecordWritable, both interfaces must use identical record types.

Writing Records

To enable inserting SQL query results, a dataset needs to provide a means of writing a record into itself. This is similar to how the BatchWritable interface makes datasets writable from MapReduce programs by providing a way to write pairs of key and value. You need to implement the RecordWritable method:

...
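
A sketch of one possible write() implementation, for a dataset that persists each record as a row of an embedded Table (the table field, the column name, and the Entry getters are assumptions):

Code Block
import java.io.IOException;
import co.cask.cdap.api.dataset.table.Put;

// Inside a dataset class implementing RecordWritable<Entry>: store the
// record's value in a single column, keyed by the record's key.
@Override
public void write(Entry entry) throws IOException {
  table.put(new Put(entry.getKey()).add("value", entry.getValue()));
}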

Note that a dataset can implement either RecordScannable or RecordWritable, or both.

Formulating Queries

When creating your queries, keep these limitations in mind:

...