Tokenizer

Introduction



Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
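For instance, splitting a sentence on a single-character delimiter can be sketched in plain Java (a minimal illustration only, not the plugin's actual implementation):

import java.util.Arrays;
import java.util.List;

public class TokenizeExample {
    public static void main(String[] args) {
        String sentence = "Spark /is engine for /bigdata processing";
        // String.split takes a regex; "/" is a plain character, so this
        // breaks the sentence into tokens wherever the delimiter occurs.
        List<String> tokens = Arrays.asList(sentence.split("/"));
        System.out.println(tokens); // prints [Spark , is engine for , bigdata processing]
    }
}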

Use-Case

  • User wants to extract the hashtags from Twitter feeds. The user would tokenize the feed text using the space character (“ ”) as the delimiter and then identify the tokens that start with “#”, as in the sketch below.
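A minimal sketch of this use-case in plain Java (the feed text and names here are illustrative assumptions, not part of the plugin):

import java.util.ArrayList;
import java.util.List;

public class HashtagExample {
    public static void main(String[] args) {
        String feed = "Loving the new release #bigdata #spark";
        List<String> hashtags = new ArrayList<>();
        // Tokenize on spaces, then keep only the tokens that start with "#".
        for (String token : feed.split(" ")) {
            if (token.startsWith("#")) {
                hashtags.add(token);
            }
        }
        System.out.println(hashtags); // prints [#bigdata, #spark]
    }
}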



User Stories

  • As a Hydrator user, I want to tokenize the data in a column from the source schema and emit the tokens into an output schema that has a single column containing the tokenized data.

  • As a Hydrator user, I want a configuration for specifying the column name from the input schema on which tokenization has to be performed.

  • As a Hydrator user, I want a configuration to specify the delimiter to be used for tokenization.

  • As a Hydrator user, I want a configuration to specify the output column name into which the tokenized data will be emitted. (A configuration sketch follows this list.)
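These stories map naturally onto a plugin configuration class. Below is a minimal sketch assuming the Cask Hydrator plugin API (co.cask.cdap packages); the class name is made up here, and the field names follow the properties listed in the Design section:

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.plugin.PluginConfig;

// Sketch of the Tokenizer configuration covering the three user stories.
public class TokenizerConfig extends PluginConfig {

    @Name("columnToBeTokenized")
    @Description("Column name on which tokenization is to be done")
    private String columnToBeTokenized;

    @Name("patternSeparator")
    @Description("Delimiter/pattern used to separate tokens")
    private String patternSeparator;

    @Name("outputColumn")
    @Description("Output column name for the tokenized data")
    private String outputColumn;
}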

Conditions

  • The source field to be tokenized can only be of string type, as checked in the validation sketch below.

  • The user can tokenize only a single column from the source schema.
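A hypothetical validation sketch for the first condition, using a plain map of field names to type names rather than the real schema API (all names here are made up for illustration):

import java.util.Map;

public class ValidateExample {
    // Reject configurations that point at a missing or non-string source field.
    static void validateSourceField(Map<String, String> fieldTypes, String column) {
        String type = fieldTypes.get(column);
        if (type == null) {
            throw new IllegalArgumentException("Column '" + column + "' not found in input schema");
        }
        if (!"string".equals(type)) {
            throw new IllegalArgumentException(
                "Column '" + column + "' must be of type string, but is " + type);
        }
    }

    public static void main(String[] args) {
        validateSourceField(Map.of("topic", "string", "sentence", "string"), "sentence"); // passes
    }
}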


Example

Input source:

topic  | sentence
-------+------------------------------------------
Java   | Hello world / is the /basic application
HDFS   | HDFS/ is a /file system
Spark  | Spark /is engine for /bigdata processing

Tokenizer:

  • The user wants to tokenize the “sentence” data using “/” as the delimiter.

  • Mandatory inputs from the user:

    • Column on which tokenization is to be done: “sentence”

    • Delimiter for tokenization: “/”

    • Output column name for tokenized data: “words”

  • The Tokenizer plugin will tokenize the “sentence” data from the input source and put the tokenized data into “words” in the output.

Output:

topic  | sentence                                  | words
-------+-------------------------------------------+--------------------------------------------
Java   | Hello world / is the /basic application   | [hello world, is the, basic application]
HDFS   | HDFS/ is a /file system                   | [hdfs, is a, file system]
Spark  | Spark /is engine for /bigdata processing  | [spark, is engine for, bigdata processing]


Design

This is a plugin of type sparkcompute and is meant to work with Spark only.
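Internally, the tokenization could be built on top of Spark ML's RegexTokenizer, which splits on a regex pattern and lowercases tokens by default (consistent with the lowercased output in the example above). The following is a sketch of that idea, not necessarily the plugin's final code:

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class TokenizerSketch {
    // Tokenize the configured input column into the configured output column.
    static Dataset<Row> tokenize(Dataset<Row> input, String inputCol,
                                 String outputCol, String pattern) {
        RegexTokenizer tokenizer = new RegexTokenizer()
            .setInputCol(inputCol)    // e.g. "sentence"
            .setOutputCol(outputCol)  // e.g. "words"
            .setPattern(pattern);     // e.g. "/" (used as the split pattern; gaps=true by default)
        return tokenizer.transform(input);
    }
}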

Properties:

  • columnToBeTokenized: Column name on which tokenization is to be done

  • patternSeparator: Pattern/delimiter used to separate the tokens

  • outputColumn: Output column name for the tokenized data



Input JSON:

{
    "name": "Tokenizer",
    "plugin": {
        "name": "Tokenizer",
        "type": "sparkcompute",
        "label": "Tokenizer",
        "properties": {
            "columnToBeTokenized": "sentence",
            "patternSeparator": "/",
            "outputColumn": "words"
        }
    }
}

 


Checklist

  • User stories documented
  • User stories reviewed
  • Design documented
  • Design reviewed
  • Feature merged
  • Examples and guides
  • Integration tests
  • Documentation for feature
  • Short video demonstrating the feature
