...
Story 2: Multiple Sources
Option 1: Introduce different types of connections: one for data flow, one for control flow.
Code Block |
---|
{
  "stages": [
    {
      "name": "customersTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource", ...
      }
    },
    {
      "name": "customersFiles",
      "plugin": {
        "name": "TPFSParquet",
        "type": "batchsink", ...
      }
    },
    {
      "name": "purchasesTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource"
      }
    },
    {
      "name": "purchasesFiles",
      "plugin": {
        "name": "TPFSParquet",
        "type": "batchsink", ...
      }
    }
  ],
  "connections": [
    { "from": "customersTable", "to": "customersFiles", "type": "data" },
    { "from": "customersFiles", "to": "purchasesTable", "type": "control" },
    { "from": "purchasesTable", "to": "purchasesFiles", "type": "data" }
  ]
} |
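To illustrate how a consumer of this config might interpret typed connections, here is a minimal Python sketch. It assumes that stages joined by "data" connections form one data-flow group and that "control" connections only order those groups. The function name `split_by_connection_type` and the trimmed-down config (plugin details omitted) are hypothetical and not part of any existing API.

```python
# Hypothetical, trimmed-down copy of the typed-connection config
# (plugin details omitted for brevity).
config = {
    "stages": [
        {"name": "customersTable"},
        {"name": "customersFiles"},
        {"name": "purchasesTable"},
        {"name": "purchasesFiles"},
    ],
    "connections": [
        {"from": "customersTable", "to": "customersFiles", "type": "data"},
        {"from": "customersFiles", "to": "purchasesTable", "type": "control"},
        {"from": "purchasesTable", "to": "purchasesFiles", "type": "data"},
    ],
}


def split_by_connection_type(config):
    """Group stages linked by data connections and return the groups plus
    the control-flow edges between them (as pairs of group indices)."""
    names = [stage["name"] for stage in config["stages"]]
    parent = {name: name for name in names}

    def find(name):
        # Walk up to the representative of the group, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Stages joined by a data connection belong to the same group.
    for conn in config["connections"]:
        if conn["type"] == "data":
            parent[find(conn["to"])] = find(conn["from"])

    # Number the groups in order of first appearance in the stage list.
    group_index = {}
    groups = []
    for name in names:
        root = find(name)
        if root not in group_index:
            group_index[root] = len(groups)
            groups.append([])
        groups[group_index[root]].append(name)

    # A control connection becomes an ordering edge between two groups.
    control_edges = [
        (group_index[find(conn["from"])], group_index[find(conn["to"])])
        for conn in config["connections"]
        if conn["type"] == "control"
    ]
    return groups, control_edges


groups, control_edges = split_by_connection_type(config)
print(groups)
print(control_edges)
```

Under these assumptions the customers stages and the purchases stages end up in separate groups, with a single control edge ordering the first before the second.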
An alternative could be to introduce the concept of "phases". Each phase has its own DAG. Phases can be connected, with connections between phases always denoting control flow rather than data flow, and connections within a phase always denoting data flow.
Code Block |
---|
{
  "phases": [
    {
      "name": "phase1",
      "stages": [
        {
          "name": "customersTable",
          "plugin": {
            "name": "Database",
            "type": "batchsource", ...
          }
        },
        {
          "name": "customersFiles",
          "plugin": {
            "name": "TPFSParquet",
            "type": "batchsink", ...
          }
        }
      ],
      "connections": [
        { "from": "customersTable", "to": "customersFiles" }
      ]
    },
    {
      "name": "phase2",
      "stages": [
        {
          "name": "purchasesTable",
          "plugin": {
            "name": "Database",
            "type": "batchsource"
          }
        },
        {
          "name": "purchasesFiles",
          "plugin": {
            "name": "TPFSParquet",
            "type": "batchsink", ...
          }
        }
      ],
      "connections": [
        { "from": "purchasesTable", "to": "purchasesFiles" }
      ]
    }
  ],
  "connections": [
    { "from": "phase1", "to": "phase2" }
  ]
} |
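As a rough sketch of how the phase-level connections could drive execution, the Python snippet below derives a run order for phases by topologically sorting the top-level "connections". It assumes Python 3.9+ (for the standard-library `graphlib` module). The function name `phase_run_order` and the trimmed-down config (stages omitted) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical, trimmed-down phase config (stages and their data-flow
# connections are omitted; only the phase-level control flow is kept).
config = {
    "phases": [{"name": "phase1"}, {"name": "phase2"}],
    "connections": [{"from": "phase1", "to": "phase2"}],
}


def phase_run_order(config):
    """Return phase names in an order that respects phase-level connections."""
    # Map each phase to the set of phases that must finish before it starts.
    predecessors = {phase["name"]: set() for phase in config["phases"]}
    for conn in config["connections"]:
        predecessors[conn["to"]].add(conn["from"])
    return list(TopologicalSorter(predecessors).static_order())


order = phase_run_order(config)
print(order)
```

Each phase's own DAG would then be run as an ordinary data pipeline once all of its predecessor phases have completed. A cycle between phases would raise `graphlib.CycleError`, which is one way such a config could be rejected at validation time.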
...
Option 2: Make it so that connections into certain plugin types imply control flow rather than data flow. For example, introduce a "condition" plugin type. Connections into a condition imply control flow rather than data flow.
Code Block |
---|
{
  "stages": [
    {
      "name": "customersTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource", ...
      }
    },
    {
      "name": "customersFiles",
      "plugin": {
        "name": "TPFSParquet",
        "type": "batchsink", ...
      }
    },
    {
      "name": "afterDump",
      "plugin": {
        "name": "AlwaysRun",
        "type": "condition"
      }
    },
    {
      "name": "purchasesTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource"
      }
    },
    {
      "name": "purchasesFiles",
      "plugin": {
        "name": "TPFSParquet",
        "type": "batchsink", ...
      }
    }
  ],
  "connections": [
    { "from": "customersTable", "to": "customersFiles" },
    { "from": "customersFiles", "to": "afterDump" },
    { "from": "afterDump", "to": "purchasesTable" },
    { "from": "purchasesTable", "to": "purchasesFiles" }
  ]
} |
You could also say that connections into a source imply control flow, or that connections into an action imply control flow.
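The rules in this option can be compared with a small Python sketch that infers each connection's kind from the plugin type of its target stage. Everything here is illustrative: the `classify` helper, the `stage_types` lookup, and the second connection list (which drops the "afterDump" stage and connects the sink directly to the next source) are assumptions, not part of the proposal.

```python
# Hypothetical stage-name -> plugin-type lookup, taken from the example stages.
stage_types = {
    "customersTable": "batchsource",
    "customersFiles": "batchsink",
    "afterDump": "condition",
    "purchasesTable": "batchsource",
    "purchasesFiles": "batchsink",
}

# Connections as in the "condition" example.
with_condition = [
    ("customersTable", "customersFiles"),
    ("customersFiles", "afterDump"),
    ("afterDump", "purchasesTable"),
    ("purchasesTable", "purchasesFiles"),
]

# Assumed variant without a condition stage: the sink connects
# straight to the next source.
without_condition = [
    ("customersTable", "customersFiles"),
    ("customersFiles", "purchasesTable"),
    ("purchasesTable", "purchasesFiles"),
]


def classify(connections, stage_types, control_targets):
    """Label a connection "control" if its target's plugin type is in
    control_targets, otherwise "data"."""
    return {
        (src, dst): "control" if stage_types[dst] in control_targets else "data"
        for src, dst in connections
    }


# Rule 1: connections into a condition imply control flow.
into_condition = classify(with_condition, stage_types, {"condition"})

# Rule 2: connections into a source imply control flow.
into_source = classify(without_condition, stage_types, {"batchsource"})

print(into_condition)
print(into_source)
```

Under the second rule no extra stage is needed to express ordering, at the cost of giving up any data-flow meaning for connections into a source.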