Story 2: Multiple Sources

Option 1: Introduce different types of connections: one for data flow and one for control flow.

Code Block
{
  "stages": [
    {
      "name": "customersTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource", ...
      }
    },    
    {
      "name": "customersFiles",
      "plugin": {
        "name": "TPFSParquet",
        "type": "batchsink", ...
      }
    },
    {
      "name": "purchasesTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource"
      }
    },
    {
      "name": "purchasesFiles",
      "plugin": {
        "name": "TPFSParquet",
        "type": "batchsink", ...
      }
    }
  ],
  "connections": [
    { "from": "customersTable", "to": "customersFiles", "type": "data" },
    { "from": "customersFiles", "to": "purchasesTable", "type": "control" },
    { "from": "purchasesTable", "to": "purchasesFiles", "type": "data" }
  ]
}
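To illustrate how a runtime might interpret typed connections, the following is a minimal Python sketch, not part of the proposal itself. It assumes the "type" field takes only the values "data" and "control" as in the config above. The function name split_by_flow is hypothetical. The sketch groups stages that are joined by data connections and reports the control connections between those groups.

```python
from collections import defaultdict

# Connections copied from the Option 1 config above.
connections = [
    {"from": "customersTable", "to": "customersFiles", "type": "data"},
    {"from": "customersFiles", "to": "purchasesTable", "type": "control"},
    {"from": "purchasesTable", "to": "purchasesFiles", "type": "data"},
]


def split_by_flow(connections):
    """Group stages joined by data edges; return the groups and the
    control edges between group representatives (hypothetical helper)."""
    parent = {}

    def find(stage):
        # Union-find lookup with path halving.
        parent.setdefault(stage, stage)
        while parent[stage] != stage:
            parent[stage] = parent[parent[stage]]
            stage = parent[stage]
        return stage

    for conn in connections:
        a, b = find(conn["from"]), find(conn["to"])
        # Only data edges merge two stages into the same group.
        if conn["type"] == "data" and a != b:
            parent[b] = a

    groups = defaultdict(list)
    for stage in parent:
        groups[find(stage)].append(stage)

    # A control edge orders one whole group after another.
    control = [
        (find(c["from"]), find(c["to"]))
        for c in connections
        if c["type"] == "control"
    ]
    return dict(groups), control


groups, control = split_by_flow(connections)
```

Under these assumptions, the customers stages and the purchases stages end up in two separate groups, with a single control edge ordering the second group after the first.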

An alternative could be to introduce the concept of "phases". Each phase has its own DAG. Phases can be connected to each other: connections between phases always denote control flow, and connections within a phase denote data flow.

Code Block
{
  "phases": [
    {
      "name": "phase1",
      "stages": [
        {
          "name": "customersTable",
          "plugin": {
            "name": "Database",
            "type": "batchsource", ...
          }
        },   
        {
          "name": "customersFiles",
          "plugin": {
            "name": "TPFSParquet",
            "type": "batchsink", ...
          }
        }
      ],
      "connections": [
        { "from": "customersTable", "to": "customersFiles" }
      ]
    },
    {
      "name": "phase2",
      "stages": [
        {
          "name": "purchasesTable",
          "plugin": {
            "name": "Database",
            "type": "batchsource"
          }
        },
        {
          "name": "purchasesFiles",
          "plugin": {
            "name": "TPFSParquet",
            "type": "batchsink", ...
          }
        }
      ],
      "connections": [
        { "from": "purchasesTable", "to": "purchasesFiles" }
      ]
    }
  ],
  "connections": [
    { "from": "phase1", "to": "phase2" }
  ]
}
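As a rough illustration of how the top-level "connections" could drive execution, here is a small Python sketch that derives a run order for the phases. It is an assumption about how a scheduler might consume this config, not a description of an existing implementation. The function name phase_order is hypothetical, and the stage lists are omitted because only the phase names matter for ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Phase names and phase-level connections from the config above.
pipeline = {
    "phases": [
        {"name": "phase1"},
        {"name": "phase2"},
    ],
    "connections": [{"from": "phase1", "to": "phase2"}],
}


def phase_order(pipeline):
    """Return phase names in an order that respects control flow
    (hypothetical helper)."""
    # Map each phase to the set of phases that must finish before it starts.
    predecessors = {phase["name"]: set() for phase in pipeline["phases"]}
    for conn in pipeline["connections"]:
        predecessors[conn["to"]].add(conn["from"])
    # static_order() raises graphlib.CycleError if the phases form a cycle.
    return list(TopologicalSorter(predecessors).static_order())


order = phase_order(pipeline)
```

Each phase's own DAG would then be run as an ordinary data pipeline once all of its predecessor phases have completed.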

Option 2: Make it so that connections into certain plugin types imply control flow rather than data flow. For example, introduce a "condition" plugin type. Connections into a condition imply control flow rather than data flow.

Code Block
{
  "stages": [
    {
      "name": "customersTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource", ...
      }
    },    
    {
      "name": "customersFiles",
      "plugin": {
        "name": "TPFSParquet",
        "type": "batchsink", ...
      }
    },
    {
      "name": "afterDump",
      "plugin": {
        "name": "AlwaysRun",
        "type": "condition"
      }
    },
    {
      "name": "purchasesTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource"
      }
    },
    {
      "name": "purchasesFiles",
      "plugin": {
        "name": "TPFSParquet",
        "type": "batchsink", ...
      }
    }
  ],
  "connections": [
    { "from": "customersTable", "to": "customersFiles" },
    { "from": "customersFiles", "to": "afterDump" },
    { "from": "afterDump", "to": "purchasesTable" },
    { "from": "purchasesTable", "to": "purchasesFiles" }
  ]
}
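The rule that the target plugin type determines the kind of connection can be sketched in Python as follows. This is only an illustration: the function name classify and the control_types parameter are hypothetical. The proposal states the rule only for connections into a condition. Treating connections out of a condition as control flow as well is an assumption made here, and it is marked in the code.

```python
# Plugin type of each stage, taken from the Option 2 config above.
stage_types = {
    "customersTable": "batchsource",
    "customersFiles": "batchsink",
    "afterDump": "condition",
    "purchasesTable": "batchsource",
    "purchasesFiles": "batchsink",
}

connections = [
    ("customersTable", "customersFiles"),
    ("customersFiles", "afterDump"),
    ("afterDump", "purchasesTable"),
    ("purchasesTable", "purchasesFiles"),
]


def classify(connections, stage_types, control_types=frozenset({"condition"})):
    """Label each connection "control" or "data" from plugin types
    (hypothetical helper)."""
    kinds = {}
    for src, dst in connections:
        # Rule from the proposal: a connection into a control-type stage
        # is control flow.
        into_control = stage_types[dst] in control_types
        # ASSUMPTION (not stated in the proposal): a connection leaving a
        # control-type stage is also control flow, since it carries no records.
        out_of_control = stage_types[src] in control_types
        kinds[(src, dst)] = "control" if into_control or out_of_control else "data"
    return kinds


kinds = classify(connections, stage_types)
```

The set of control-implying plugin types is a parameter, so other rules can be tried without changing the pipeline config.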

You could also say that connections into a source imply control flow, or that connections into an action imply control flow.