...

These properties are set when the FileSet is created, and they apply to all files in the dataset. Each time you use a FileSet in your application code, you can address either the entire dataset or an individual file, which you select by passing its relative path as a runtime argument. Addressing an individual file is supported only for MapReduce programs.

Creating a FileSet

To use a FileSet in an application, create it as part of the application configuration:

...

  • setUseExisting(true): This directs the FileSet to accept an existing location as its base path and an existing Hive table for exploration. Because the existing location may contain files that predate the FileSet, neither the location nor the Hive table is deleted when the dataset is dropped, and truncating the FileSet has no effect. This ensures that no pre-existing data is deleted.

  • setPossessExisting(true): Similarly, this allows reuse of an existing location. The FileSet will assume ownership of existing files in that location and of the Hive table, which means that those files and the Hive table will be deleted when the dataset is dropped or truncated.
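
The difference between the two options can be illustrated with a small toy model. This is only a sketch in plain Java and not the CDAP implementation: the class FileSetReuseSketch, its ReuseMode enum, and all of its methods are hypothetical names introduced for illustration. It models only what the two bullets above describe for files, the location, and the Hive table on drop; how truncation treats the Hive table is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the two reuse options; hypothetical, not the CDAP implementation. */
final class FileSetReuseSketch {

    /** Hypothetical stand-ins for setUseExisting(true) and setPossessExisting(true). */
    enum ReuseMode { USE_EXISTING, POSSESS_EXISTING }

    private final ReuseMode mode;
    private final List<String> files;          // stand-in for files at the base location
    private boolean locationPresent = true;    // stand-in for the base location itself
    private boolean hiveTablePresent = true;   // stand-in for the existing Hive table

    FileSetReuseSketch(ReuseMode mode, List<String> preExistingFiles) {
        this.mode = mode;
        this.files = new ArrayList<>(preExistingFiles);
    }

    /** Truncate: no effect under USE_EXISTING; removes the files under POSSESS_EXISTING. */
    void truncate() {
        if (mode == ReuseMode.POSSESS_EXISTING) {
            files.clear();
        }
    }

    /** Drop: keeps everything under USE_EXISTING; deletes files, location, and table otherwise. */
    void drop() {
        if (mode == ReuseMode.POSSESS_EXISTING) {
            files.clear();
            locationPresent = false;
            hiveTablePresent = false;
        }
    }

    List<String> files() { return new ArrayList<>(files); }
    boolean locationPresent() { return locationPresent; }
    boolean hiveTablePresent() { return hiveTablePresent; }
}
```

In this model, a FileSetReuseSketch created with USE_EXISTING still reports all of its pre-existing files after truncate() and drop(), while one created with POSSESS_EXISTING reports none.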

Using a FileSet in MapReduce

Using a FileSet as the input or output of a MapReduce program works the same way as for any other dataset:

...

You must specify both the input path and the output path. If either one is missing, your MapReduce program will fail with an error.
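
One way to catch a missing path early is to check the runtime arguments before starting the program. The following is a minimal sketch in plain Java, assuming the paths arrive as entries in a string map. The class name and the two argument keys are placeholders invented for this example; they are not CDAP's actual argument names.

```java
import java.util.Map;

/** Hypothetical pre-flight check; not part of the CDAP API. */
final class FileSetPathCheckSketch {

    // Placeholder keys for this sketch only; substitute the real argument names.
    static final String INPUT_PATH_KEY = "example.input.path";
    static final String OUTPUT_PATH_KEY = "example.output.path";

    /** Throws IllegalArgumentException unless both paths are present and non-blank. */
    static void requireInputAndOutput(Map<String, String> runtimeArgs) {
        requirePresent(runtimeArgs, INPUT_PATH_KEY);
        requirePresent(runtimeArgs, OUTPUT_PATH_KEY);
    }

    private static void requirePresent(Map<String, String> args, String key) {
        String value = args.get(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing runtime argument: " + key);
        }
    }
}
```

Calling requireInputAndOutput with a map that contains only one of the two keys raises an exception that names the missing argument, which is easier to diagnose than a failure inside the running job.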

Using a FileSet Programmatically

You can interact with the files of a FileSet directly, through the Location abstraction of the file system. For example, a Service can use a FileSet by declaring it with a @UseDataSet annotation, and then obtaining a Location for a relative path within the FileSet:

...

See the Apache™ Twill® API documentation for additional information about the Location abstraction.
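
To illustrate how a relative path maps onto a location under the FileSet's base path, here is a minimal sketch that uses java.nio.file.Path as a stand-in. It is not the Twill Location API, and the class name, method name, and example paths are all hypothetical. Rejecting absolute paths is a design choice of this sketch, not documented FileSet behavior.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Stand-in for resolving a relative path under a base path; not the Twill Location API. */
final class RelativeLocationSketch {

    /** Resolves relativePath under basePath; absolute paths are rejected in this sketch. */
    static Path resolve(Path basePath, String relativePath) {
        Path relative = Paths.get(relativePath);
        if (relative.isAbsolute()) {
            throw new IllegalArgumentException("Expected a relative path: " + relativePath);
        }
        return basePath.resolve(relative).normalize();
    }
}
```

For example, resolving "2015/01/part-0" against a base path of data/lines yields data/lines/2015/01/part-0.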

Exploring FileSets

A FileSet can be explored with ad-hoc queries if you enable exploration when you create it; this is described under FileSet Exploration.