.. _tokenize:

Tokenize Strings
~~~~~~~~~~~~~~~~

A ``tokenize`` function is available in H2O-3, which converts strings into tokens, then stores the tokenized text into a single column, making it easier for additional processing. 

Simple Tokenize Example
'''''''''''''''''''''''

Below is a simple example showing strings from frames tokenized into a single column. Refer to the following demos for a more extensive demo using tokenized text in Word2Vec:

- Python: https://github.com/h2oai/h2o-3/blob/master/h2o-py/demos/word2vec_craigslistjobtitles.ipynb
- R: https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.word2vec.craigslistjobtitles.R

.. example-code::
   .. code-block:: r
	
	library(h2o)
	h2o.init()
	
	# Create four simple, single-column R data frames by inputting values.
	s1 <- as.character(as.h2o(" this is a string "))
	s2 <- as.character(as.h2o("this is another string"))
	s3 <- as.character(as.h2o("this is a longer string"))
	s4 <- as.character(as.h2o("this is tall, this is taller"))

	# Combine the datasets into a single dataset. 
	ds <- h2o.rbind(s1, s2, s3, s4)
	ds
	                            C1
	1            this is a string 
	2       this is another string
	3      this is a longer string
	4 this is tall, this is taller

	# Tokenize the dataset.
	# Notice that tokenized sentences are separated by <NA>.
	tokenized <- h2o.tokenize(ds, " ")
	tokenized
	      C1
	1       
	2   this
	3     is
	4      a
	5 string
	6   <NA>

	[24 rows x 1 column]

   .. code-block:: Python

    import h2o
    h2o.init()

    # Create four simple, single-column Python data frames by inputting values
    df1 = h2o.H2OFrame.from_python({'String':[' this is a string ']})
    df1 = df1.ascharacter()
    df2 = h2o.H2OFrame.from_python({'String':['this is another string']})
    df2 = df2.ascharacter()
    df3 = h2o.H2OFrame.from_python({'String':['this is a longer string']})
    df3 = df3.ascharacter()
    df4 = h2o.H2OFrame.from_python({'String':['this is tall, this is taller']})
    df4 = df4.ascharacter()

    # Combine the datasets into a single dataset. 
    combined = df1.rbind([df2, df3, df4])
    combined
    String
    ----------------------------
    this is a string
    this is another string
    this is a longer string
    this is tall, this is taller

    # Tokenize the dataset.
    # Notice that tokenized sentences are separated by empty rows.
    tokenized = combined.tokenize(" ")
    tokenized.describe
    C1
    -------

    this
    is
    a
    string

    this
    is
    another
    string

    [24 rows x 1 column]