Importing time series data from R into sparklyr.flint is simple and straightforward, and is best illustrated with a few small examples.
Firstly, one needs to establish a Spark connection by calling sparklyr::spark_connect, e.g.,

library(sparklyr)

sc <- spark_connect(master = "yarn-client", spark_home = "/usr/lib/spark")
to connect to a Spark cluster in YARN client mode, or
library(sparklyr)

sc <- spark_connect(master = "local")
to connect to Spark in local mode.
For those unfamiliar with Spark connections, chapter 7 of “Mastering Spark with R” by Javier Luraschi, Kevin Kuo, and Edgar Ruiz contains some very helpful explanations of several modes of connecting to Spark from sparklyr.
Next, the time series data needs to be imported into a Spark dataframe. This can be accomplished with methods such as sparklyr::spark_read_csv, sparklyr::spark_read_json, etc., if the data source is a file on disk, e.g.,

sdf <- spark_read_csv(sc, "/tmp/data.csv", header = TRUE)
or alternatively, using sparklyr::copy_to if the data is in an R dataframe, e.g.,

example_time_series_data <- data.frame(
  t = c(1, 3, 4, 6, 7, 10, 15, 16, 18, 19),
  v = c(4, -2, NA, 5, NA, 1, -4, 5, NA, 3)
)

sdf <- copy_to(sc, example_time_series_data, overwrite = TRUE)
Finally, in order to interpret the time series data in sdf unambiguously, the Flint time series library needs to know the name and the unit of the time column, and whether all rows of the Spark dataframe above are already sorted by time. All of this information is encapsulated in a TimeSeriesRDD object derived from sdf, as shown below:

ts_rdd <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
At this point, ts_rdd contains all the data and metadata Flint needs to perform various analyses on example_time_series_data, and the results of those analyses are returned to us as separate TimeSeriesRDD objects.
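As a quick sketch of one such analysis, a moving average of the value column could be computed with summarize_avg and a look-behind time window (this assumes sparklyr.flint is attached and uses the column names from the example above; the exact set of summarizers available may vary by package version):

library(sparklyr.flint)

# average of `v` over a 3-second look-behind window at each time point;
# the result is itself a TimeSeriesRDD
ts_avg <- summarize_avg(ts_rdd, column = "v", window = in_past("3s"))

# convert the result back into a Spark dataframe and collect it into R
to_sdf(ts_avg) %>% collect()

Because the time unit was declared as "SECONDS" when building ts_rdd, the "3s" window is interpreted relative to the values in the t column.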