MongoDB database plugin

Introduction

A separate database plugin to support MongoDB-specific features and configurations.

Use-Case

  • Users can choose and install MongoDB source and sink plugins.

  • Users should see the MongoDB logo on the plugin configuration page for a better experience.

  • Users should get relevant information from the tool tip:

    • The tool tip should describe accurately what each field is used for.

  • Users should not have to specify any redundant configuration.

  • Users should get field level lineage for the source and sink that is being used.

  • Reference documentation should be updated to account for the changes.

  • The source code for the MongoDB database plugin should be placed in a repo under the data-integrations org.

  • Data pipelines using the source and sink plugins should run on both the MapReduce and Spark engines.

User Stories

  • Users should be able to install MongoDB-specific database source and sink plugins from the Hub

  • Users should have each tool tip accurately describe what each field does

  • Users should get field level lineage information for the MongoDB source and sink

  • Users should be able to set up a pipeline without specifying redundant information

  • Users should get updated reference document for MongoDB source and sink

  • Users should be able to read all the DB types

Plugin Type

Batch Source
Batch Sink 
Real-time Source
Real-time Sink
Action
Post-Run Action
Aggregate
Join
Spark Model
Spark Compute

Design Tips

MongoDB driver reference: http://mongodb.github.io/mongo-java-driver/3.10/driver/

Design

The suggestion is to move the existing mongodb-plugins module to the mongodb-plugins repository.



MongoDB Overview

Document database

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.

{
  "_id" : ObjectId("5d3f1c2a2f547625b0bbb397"),
  "string" : "AAPL",
  "int32" : 10,
  "double" : 23.23,
  "array" : [ "a1", "a2" ],
  "object" : { "inner_field" : "val" },
  "binary" : { "$binary" : "YmluYXJ5IGRhdGE=", "$type" : "00" },
  "undefined" : undefined,
  "boolean" : false,
  "date" : ISODate("2019-07-29T16:17:46.109Z"),
  "null" : null,
  "regex" : /./,
  "dbpointer" : DBRef("source", "5d079ee6d078c94008e4bb3a"),
  "javascript" : var l = 1;,
  "javascriptwithscope" : { "$code" : var l = 1; , "$scope" : { "scope" : "scope_val" } },
  "symbol" : "a",
  "timestamp" : Timestamp(1564417066, 1),
  "long" : NumberLong(9223372036854775807),
  "decimal" : NumberDecimal("3.100000"),
  "minkey" : { "$minKey" : 1 },
  "maxkey" : { "$maxKey" : 1 }
}

BSON

BSON is a binary serialization format used to store documents and make remote procedure calls in MongoDB. The BSON specification is located at bsonspec.org.

Document limitations

  • The maximum BSON document size is 16 megabytes.

  • In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key. If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for the _id field.

Flexible schema

Unlike SQL databases, where you must determine and declare a table’s schema before inserting data, MongoDB collections, by default, do not require their documents to have the same schema.

  • The documents in a single collection do not need to have the same set of fields and the data type for a field can differ across documents within a collection. 

  • To change the structure of the documents in a collection, such as add new fields, remove existing fields, or change the field values to a new type, update the documents to the new structure.

Query filter documents

A query filter document and query operators can be used to specify conditions.

The following example uses '{ status: { $in: [ "A", "D" ] } }' query filter document to retrieve all documents from the 'inventory' collection where 'status' equals either "A" or "D":

db.inventory.find( { status: { $in: [ "A", "D" ] } } )

The operation corresponds to the following SQL statement:

SELECT * FROM inventory WHERE status in ("A", "D")
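
To make the semantics of the filter concrete, the following is a minimal Python sketch that evaluates the same '$in' condition against in-memory documents. It does not use a MongoDB driver; the 'matches' helper and the sample 'inventory' list are hypothetical, and only plain equality and '$in' on scalar fields are handled.

```python
def matches(document, query_filter):
    """Return True if the document satisfies every condition in the filter.

    Illustrative only: supports plain equality and the $in operator
    on scalar fields, nothing else.
    """
    for field, condition in query_filter.items():
        value = document.get(field)
        if isinstance(condition, dict) and "$in" in condition:
            if value not in condition["$in"]:
                return False
        elif value != condition:
            return False
    return True


# Hypothetical sample data standing in for the 'inventory' collection.
inventory = [
    {"item": "journal", "status": "A"},
    {"item": "notebook", "status": "P"},
    {"item": "paper", "status": "D"},
]

query_filter = {"status": {"$in": ["A", "D"]}}
selected = [doc for doc in inventory if matches(doc, query_filter)]
```

With this sample data, 'selected' contains the 'journal' and 'paper' documents, mirroring the rows the SQL statement would return.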

Sink Properties

| User Facing Name | Widget Type | Description | Constraints |
|---|---|---|---|
| Label | textbox | Label for UI. | |
| Reference Name | textbox | Uniquely identified name for lineage. | |
| Host | textbox | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
| Port | number | Port that MongoDB is listening to. | Optional (default 27017) |
| Database | textbox | MongoDB database name. | Required |
| Collection | textbox | Name of the database collection to write to. | Required |
| ID Field | textbox | Allows the user to specify which of the incoming fields should be used as an object identifier. | Optional. Object ID will be generated if no value is specified. |
| Username | textbox | User identity for connecting to the specified database. | |
| Password | password | Password to use to connect to the specified database. | |
| Connection Arguments | keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
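
The connection properties above map naturally onto a standard MongoDB connection string ('mongodb://user:password@host:port/database?options'). The sketch below shows one way the pieces could be combined. The 'build_connection_string' function and its parameter names are assumptions made for illustration, not the plugin's actual code.

```python
from urllib.parse import quote_plus, urlencode


def build_connection_string(host, database, port=27017, username=None,
                            password=None, connection_arguments=None):
    """Assemble a mongodb:// URI from plugin-style properties (hypothetical helper)."""
    credentials = ""
    if username:
        # User name and password must be percent-encoded inside the URI.
        credentials = quote_plus(username)
        if password:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    uri = f"mongodb://{credentials}{host}:{port}/{database}"
    if connection_arguments:
        # Connection Arguments become the query-string options.
        uri += "?" + urlencode(connection_arguments)
    return uri


uri = build_connection_string(
    host="localhost",
    database="test",
    username="admin",
    password="p@ss",
    connection_arguments={"ssl": "true", "connectTimeoutMS": "3000"},
)
```

Percent-encoding the credentials matters because characters such as '@' and ':' would otherwise be read as URI delimiters.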



Sink Data Types Mapping

To support all data types in the sink, we can use the MongoDB extended JSON format and/or infer the data type of a record field from its name.

The table below does not honor non-standard MongoDB data types and lists how CDAP data types are stored.

The following MongoDB data types are missing: Undefined, Regular Expression, DBPointer, JavaScript, Symbol, JavaScript (with scope), Timestamp, Min key, Max key.

| CDAP Schema Data Type | MongoDB Data Types | Comment |
|---|---|---|
| boolean | Boolean | |
| bytes | Binary data, ObjectId (if 'ID Field' specified) | |
| date | Date | |
| double | Double | |
| decimal | Decimal128 | The Decimal128 type supports up to 34 digits of precision. |
| float | Double | |
| int | 32-bit integer | |
| long | 64-bit integer | |
| string | String, ObjectId (if 'ID Field' specified) | |
| time | String | |
| timestamp | Date | |
| array | Array | |
| record | Object | |
| enum | String | |
| map | Object | |
| union | Depends on the actual value. | For example, if it's a union ["string","int","long"] and the value is actually a long, the Mongo document will have the field as a 64-bit integer. If a different record comes in with the value as a string, the Mongo document will end up with a String for that field. |
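
The mapping above can be pictured as a lookup table plus a rule for unions. The sketch below is illustrative only: the names 'SINK_TYPE_MAPPING', 'resolve_union', and 'mongo_type_for' are hypothetical, the ObjectId case for 'ID Field' is left out, and because Python has a single integer type, the int/long distinction is approximated by checking the 32-bit range rather than the value's declared type.

```python
# CDAP schema type -> MongoDB type, as listed in the table above.
# The ObjectId case (when 'ID Field' is specified) is omitted here.
SINK_TYPE_MAPPING = {
    "boolean": "Boolean",
    "bytes": "Binary data",
    "date": "Date",
    "double": "Double",
    "decimal": "Decimal128",
    "float": "Double",
    "int": "32-bit integer",
    "long": "64-bit integer",
    "string": "String",
    "time": "String",
    "timestamp": "Date",
    "array": "Array",
    "record": "Object",
    "enum": "String",
    "map": "Object",
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def resolve_union(union_types, value):
    """Pick the union branch matching the runtime value (simplified assumption)."""
    if isinstance(value, str) and "string" in union_types:
        return "string"
    if isinstance(value, int) and not isinstance(value, bool):
        if "int" in union_types and INT32_MIN <= value <= INT32_MAX:
            return "int"
        if "long" in union_types:
            return "long"
    raise ValueError(f"value {value!r} matches no branch of {union_types}")


def mongo_type_for(cdap_type, value=None):
    """Return the MongoDB type name used to store a CDAP value."""
    if isinstance(cdap_type, list):  # a union is written as a list of types
        cdap_type = resolve_union(cdap_type, value)
    return SINK_TYPE_MAPPING[cdap_type]


long_case = mongo_type_for(["string", "int", "long"], 9223372036854775807)
string_case = mongo_type_for(["string", "int", "long"], "AAPL")
```

Here 'long_case' resolves to the 64-bit integer type and 'string_case' to String, matching the union behavior described in the table.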





Source Properties

| User Facing Name | Widget Type | Description | Constraints |
|---|---|---|---|
| Label | textbox | Label for UI. | |
| Reference Name | textbox | Uniquely identified name for lineage. | |
| Host | textbox | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
| Port | number | Port that MongoDB is listening to. | Optional (default 27017) |
| Database | textbox | MongoDB database name. | Required |
| Collection | textbox | Name of the database collection to read from. | Required |
| Output Schema | schema | Specifies the schema of the documents. | Required |
| On Record Error | radio-group | Specifies how to handle errors in record processing. An error is thrown if a value cannot be parsed according to the provided schema. | Possible values are: Skip error, Fail pipeline. Default: 'Fail pipeline' |
| Input Query | json-editor | Optionally filter the input collection with a query. This query must be represented in JSON format and use the MongoDB extended JSON format to represent non-native JSON data types. | |
| Username | textbox | User identity for connecting to the specified database. | |
| Password | password | Password to use to connect to the specified database. | |
| Authentication Connection String | textbox | Auxiliary MongoDB connection string to authenticate against when constructing splits. | |
| Connection Arguments | keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
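
As an illustration of the Input Query property, the sketch below builds a filter that uses MongoDB extended JSON wrappers ('$date', '$numberLong') for values that plain JSON cannot express. The field names 'date' and 'long' are borrowed from the sample document in the overview; the variable names are hypothetical, and only Python's standard 'json' module is used.

```python
import json

# Filter: documents whose "date" is on or after 2019-07-29 and whose
# "long" field equals 9223372036854775807, written in extended JSON.
input_query = {
    "date": {"$gte": {"$date": "2019-07-29T00:00:00Z"}},
    "long": {"$numberLong": "9223372036854775807"},
}

# The property value is the JSON text itself.
input_query_text = json.dumps(input_query)

# Round-trip check: the text must parse back as valid JSON.
parsed = json.loads(input_query_text)
```

Wrapping the 64-bit value as a string inside '$numberLong' avoids precision loss in JSON parsers that treat all numbers as doubles.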





Source Data Types Mapping

The source requires Output Schema to be set. Based on the schema, the source expects each field in a document to be of a specific MongoDB data type.

The On Record Error property lets the user decide whether the pipeline should fail, the record should be skipped, or the record should be sent to the error dataset.
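
The skip and fail behaviors can be sketched as a small reader loop. This is a simplified illustration, not the plugin's implementation: the policy strings 'skip-error' and 'fail-pipeline', the 'read_records' generator, and the 'parse_int32' parser are all hypothetical, and routing records to an error dataset is not shown.

```python
def read_records(documents, parse, on_record_error="fail-pipeline"):
    """Yield parsed records, applying the chosen error-handling policy.

    'parse' is any callable that raises ValueError for a document that
    does not fit the output schema.
    """
    for document in documents:
        try:
            yield parse(document)
        except ValueError:
            if on_record_error == "skip-error":
                continue  # drop the bad record and keep going
            raise  # 'fail-pipeline': propagate the error to the caller


def parse_int32(document):
    """Hypothetical parser: expects an integer in the 'int32' field."""
    value = document.get("int32")
    if not isinstance(value, int) or isinstance(value, bool):
        raise ValueError(f"expected int for 'int32', got {value!r}")
    return {"int32": value}


docs = [{"int32": 10}, {"int32": "ten"}, {"int32": 7}]
kept = list(read_records(docs, parse_int32, on_record_error="skip-error"))
```

With 'skip-error', the malformed middle document is dropped and the other two are kept; with the default policy, the same input raises on the second document.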

The following table shows what MongoDB data types can be read as CDAP types.

| CDAP Schema Data Type | MongoDB Data Types |
|---|---|
| boolean | Boolean |
| bytes | Binary data, ObjectId |
| date | - |
| double | Double |
| decimal | Decimal128 |
| float | - |
| int | 32-bit integer |
| long | 64-bit integer |
| string | String, Symbol |
| time | - |
| timestamp | Date |
| array | Array |
| record | Object (examples follow the table) |
| enum | - |
| map | Object (example follows the table) |
| union | - |

record: The following schema:

{"type":"record","name":"object","fields":[{"name":"inner_field","type":"string"}]}

is used for the 'object' field:

{ "object" : { "inner_field" : "val" } }

* We can map all non-standard data types to record, like JavaScript (with scope) in the example below.

The following schema:

{ "type":"record", "name":"javascriptwithscope", "fields":[ {"name":"$code","type":"string"}, {"name":"$scope","type":{"type":"record","name":"scope-object-record","fields":[{"name":"scope","type":"string"}]}} ] }

is used for the 'javascriptwithscope' field:

{ "javascriptwithscope" : { "$code" : var l = 1; , "$scope" : { "scope" : "scope_val" } } }

map: The following schema:

{"type":"map","keys":"string","values":"string"}

is used for the 'object' field:

{ "object" : { "inner_field" : "val" } }
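
The difference between reading an Object as a record and as a map can be sketched as follows. This is a minimal illustration under stated assumptions: 'read_value' is a hypothetical helper, it works on an already-decoded Python dictionary rather than BSON, and it handles only the 'string', 'record', and 'map' schemas used in the examples above.

```python
def read_value(value, schema):
    """Convert a decoded document value according to a CDAP-style schema (simplified)."""
    if schema == "string":
        if not isinstance(value, str):
            raise ValueError(f"expected string, got {value!r}")
        return value
    if isinstance(schema, dict) and schema.get("type") == "record":
        # A record names each expected field explicitly.
        return {
            field["name"]: read_value(value[field["name"]], field["type"])
            for field in schema["fields"]
        }
    if isinstance(schema, dict) and schema.get("type") == "map":
        # A map accepts arbitrary keys; every value shares one schema.
        return {
            read_value(key, schema["keys"]): read_value(item, schema["values"])
            for key, item in value.items()
        }
    raise ValueError(f"unsupported schema: {schema!r}")


document = {"object": {"inner_field": "val"}}

record_schema = {"type": "record", "name": "object",
                 "fields": [{"name": "inner_field", "type": "string"}]}
map_schema = {"type": "map", "keys": "string", "values": "string"}

as_record = read_value(document["object"], record_schema)
as_map = read_value(document["object"], map_schema)
```

Both schemas read the same nested Object here; the record form fixes the set of field names up front, while the map form accepts whatever keys the document happens to contain.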


Approach

Move the existing mongodb-plugins module to the mongodb-plugins project. Add MongoDB-specific properties to the configuration and add support for MongoDB-specific data types. Update the UI widget JSON definitions.

Pipeline Samples



Releases

Release X.Y.Z

Related Work

Database plugin enhancements

Created in 2020 by Google Inc.