Tokenize Strings

H2O-3 provides a tokenize function that splits strings into tokens and stores the tokenized text in a single column, which makes further processing easier.

Simple Tokenize Example

The example below tokenizes the strings in a frame into a single column. For a more extensive example that uses tokenized text with Word2Vec, refer to the following demos:

  • R
  • Python
library(h2o)
h2o.init()

# Create four single-column H2O frames of string type from literal values.
# Note that the first string has a leading and a trailing space.
s1 <- as.character(as.h2o(" this is a string "))
s2 <- as.character(as.h2o("this is another string"))
s3 <- as.character(as.h2o("this is a longer string"))
s4 <- as.character(as.h2o("this is tall, this is taller"))

# Combine the datasets into a single dataset.
ds <- h2o.rbind(s1, s2, s3, s4)
ds
                            C1
1            this is a string
2       this is another string
3      this is a longer string
4 this is tall, this is taller

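# Tokenize the dataset, splitting on single spaces.
# Each tokenized string is followed by an <NA> row that separates it from the next one.
# Row 1 is an empty token because the first string starts with a space.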
tokenized <- h2o.tokenize(ds, " ")
tokenized
      C1
1
2   this
3     is
4      a
5 string
6   <NA>

[24 rows x 1 column]
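
To make the layout of the tokenized column concrete, the following plain-Python sketch mimics the output above without using H2O. It is an illustration only, not H2O's implementation: the function name `tokenize_rows` is hypothetical, `None` stands in for `<NA>`, and dropping trailing empty tokens is an assumption inferred from the 24-row count shown above.

```python
import re


def tokenize_rows(rows, split):
    """Flatten strings into one token column, with None after each string."""
    pattern = re.compile(split)
    column = []
    for text in rows:
        tokens = pattern.split(text)
        # Assumption: trailing empty tokens are dropped, leading ones are kept.
        while tokens and tokens[-1] == "":
            tokens.pop()
        column.extend(tokens)
        column.append(None)  # stands in for the <NA> separator row
    return column


rows = [
    " this is a string ",
    "this is another string",
    "this is a longer string",
    "this is tall, this is taller",
]

column = tokenize_rows(rows, " ")
print(len(column))   # number of rows in the flattened column
print(column[:6])    # first tokenized string plus its separator
```

Under these assumptions the four strings yield 5, 4, 5, and 6 tokens, plus one separator each, which accounts for the 24 rows. The leading space in the first string produces the empty token in row 1.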