Importing time series data from R into sparklyr.flint is simple and straightforward, and is best illustrated with a few small examples.
Firstly, one needs to establish a Spark connection by calling sparklyr::spark_connect, e.g.,

library(sparklyr)

sc <- spark_connect(master = "yarn-client", spark_home = "/usr/lib/spark")
to connect to a Spark cluster in YARN client mode, or
library(sparklyr)

sc <- spark_connect(master = "local")
to connect to Spark in local mode.
For those unfamiliar with Spark connections, chapter 7 of “Mastering Spark with R” by Javier Luraschi, Kevin Kuo, and Edgar Ruiz contains some very helpful explanations of several modes of connecting to Spark from sparklyr.
Next, the time series data needs to be imported into a Spark dataframe. This can be accomplished with methods such as sparklyr::spark_read_csv, sparklyr::spark_read_json, etc., if the data source is a file on disk, e.g.,

sdf <- spark_read_csv(sc, "/tmp/data.csv", header = TRUE)
or alternatively, using sparklyr::copy_to if the data is in an R dataframe, e.g.,

example_time_series_data <- data.frame(
  t = c(1, 3, 4, 6, 7, 10, 15, 16, 18, 19),
  v = c(4, -2, NA, 5, NA, 1, -4, 5, NA, 3)
)

sdf <- copy_to(sc, example_time_series_data, overwrite = TRUE)
Finally, in order to interpret the time series data in sdf unambiguously, the Flint time series library needs to know the name and the unit of the time column, and whether all rows of the Spark dataframe above are already sorted by time. All of this information is encapsulated in a TimeSeriesRDD object derived from sdf, as shown below:

ts_rdd <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
At this point, ts_rdd contains all the data and metadata Flint needs to perform various analyses on example_time_series_data, and the results of those analyses are returned to us as separate TimeSeriesRDD objects.
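As a quick sketch of one such analysis, a moving average of the value column could be computed with summarize_avg and a look-behind time window (this assumes sparklyr.flint is attached and uses the column names from the example above; the exact set of summarizers available may vary by package version):

library(sparklyr.flint)

# average of `v` over a 3-second look-behind window at each time point;
# the result is itself a TimeSeriesRDD
ts_avg <- summarize_avg(ts_rdd, column = "v", window = in_past("3s"))

# convert the result back into a Spark dataframe and collect it into R
to_sdf(ts_avg) %>% collect()

Because the time unit was declared as "SECONDS" when building ts_rdd, the "3s" window is interpreted relative to the values in the t column.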