Introducing sparklyr.flint: A time-series extension for sparklyr

We are pleased to announce that sparklyr.flint, a sparklyr extension for analyzing time series at scale with Flint, is now available on CRAN. Flint is an open-source library for working with time-series in Apache Spark which supports aggregates and joins on time-series datasets.

Yitao Li (RStudio)https://www.rstudio.com
09-07-2020

In this blog post, we will showcase sparklyr.flint, a brand new sparklyr extension providing a simple and intuitive R interface to the Flint time series library. sparklyr.flint is available on CRAN today and can be installed as follows:


install.packages("sparklyr.flint")

The first two sections of this post will be a quick bird’s eye view on sparklyr and Flint, which will ensure readers unfamiliar with sparklyr or Flint can see both of them as essential building blocks for sparklyr.flint. After that, we will feature sparklyr.flint’s design philosophy, current state, example usages, and last but not least, its future directions as an open-source project in the subsequent sections.

Quick Intro to sparklyr

sparklyr is an open-source R interface that integrates the power of distributed computing from Apache Spark with the familiar idioms, tools, and paradigms for data transformation and data modelling in R. It allows data pipelines working well with non-distributed data in R to be easily transformed into analogous ones that can process large-scale, distributed data in Apache Spark.

Instead of summarizing everything sparklyr has to offer in a few sentences, which is impossible to do, this section will solely focus on a small subset of sparklyr functionalities that are relevant to connecting to Apache Spark from R, importing time series data from external data sources to Spark, and also simple transformations which are typically part of data pre-processing steps.

Connecting to an Apache Spark cluster

The first step in using sparklyr is to connect to Apache Spark. Usually this means one of the following:

Importing external data to Spark

Making external data available in Spark is easy with sparklyr given the large number of data sources sparklyr supports. For example, given an R dataframe, such as


dat <- data.frame(id = seq(10), value = rnorm(10))

the command to copy it to a Spark dataframe with 3 partitions is simply


sdf <- copy_to(sc, dat, name = "unique_name_of_my_spark_dataframe", repartition = 3L)

Similarly, there are options for ingesting data in CSV, JSON, ORC, AVRO, and many other well-known formats into Spark as well:


sdf_csv <- spark_read_csv(sc, name = "another_spark_dataframe", path = "file:///tmp/file.csv", repartition = 3L)
# or
sdf_json <- spark_read_json(sc, name = "yet_another_one", path = "file:///tmp/file.json", repartition = 3L)
# or spark_read_orc, spark_read_avro, etc

Transforming a Spark dataframe

With sparklyr, the simplest and most readable way to transformation a Spark dataframe is by using dplyr verbs and the pipe operator (%>%) from magrittr.

Sparklyr supports a large number of dplyr verbs. For example,


sdf <- sdf %>%
  dplyr::filter(!is.null(id)) %>%
  dplyr::mutate(value = value ^ 2)

Ensures sdf only contains rows with non-null IDs, and then squares the value column of each row.

That’s about it for a quick intro to sparklyr. You can learn more in sparklyr.ai, where you will find links to reference material, books, communities, sponsors, and much more.

What is Flint?

Flint is a powerful open-source library for working with time-series data in Apache Spark. First of all, it supports efficient computation of aggregate statistics on time-series data points having the same timestamp (a.k.a summarizeCycles in Flint nomenclature), within a given time window (a.k.a., summarizeWindows), or within some given time intervals (a.k.a summarizeIntervals). It can also join two or more time-series datasets based on inexact match of timestamps using asof join functions such as LeftJoin and FutureLeftJoin. The author of Flint has outlined many more of Flint’s major functionalities in this article, which I found to be extremely helpful when working out how to build sparklyr.flint as a simple and straightforward R interface for such functionalities.

Readers wanting some direct hands-on experience with Flint and Apache Spark can go through the following steps to run a minimal example of using Flint to analyze time-series data:


    +-------------------+-----+---------+
    |               time|value|value_sum|
    +-------------------+-----+---------+
    |1970-01-01 00:00:01|    1|      1.0|
    |1970-01-01 00:00:02|    4|      5.0|
    |1970-01-01 00:00:03|    9|     14.0|
    |1970-01-01 00:00:04|   16|     29.0|
    +-------------------+-----+---------+
     In other words, given a timestamp t and a row in the result having time equal to t, one can notice the value_sum column of that row contains sum of values within the time window of [t - 2, t] from ts_rdd.

Intro to sparklyr.flint

The purpose of sparklyr.flint is to make time-series functionalities of Flint easily accessible from sparklyr. To see sparklyr.flint in action, one can skim through the example in the previous section, go through the following to produce the exact R-equivalent of each step in that example, and then obtain the same summarization as the final result:

Why create a sparklyr extension?

The alternative to making sparklyr.flint a sparklyr extension is to bundle all time-series functionalities it provides with sparklyr itself. We decided that this would not be a good idea because of the following reasons:

So, considering all of the above, building sparklyr.flint as an extension of sparklyr seems to be a much more reasonable choice.

Current state of sparklyr.flint and its future directions

Recently sparklyr.flint has had its first successful release on CRAN. At the moment, sparklyr.flint only supports the summarizeCycle and summarizeWindow functionalities of Flint, and does not yet support asof join and other useful time-series operations. While sparklyr.flint contains R interfaces to most of the summarizers in Flint (one can find the list of summarizers currently supported by sparklyr.flint in here), there are still a few of them missing (e.g., the support for OLSRegressionSummarizer, among others).

In general, the goal of building sparklyr.flint is for it to be a thin “translation layer” between sparklyr and Flint. It should be as simple and intuitive as possibly can be, while supporting a rich set of Flint time-series functionalities.

We cordially welcome any open-source contribution towards sparklyr.flint. Please visit https://github.com/r-spark/sparklyr.flint/issues if you would like to initiate discussions, report bugs, or propose new features related to sparklyr.flint, and https://github.com/r-spark/sparklyr.flint/pulls if you would like to send pull requests.

Acknowledgement

Thanks for reading!

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Li (2020, Sept. 7). RStudio AI Blog: Introducing sparklyr.flint: A time-series extension for sparklyr. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2020-09-07-sparklyr-flint/

BibTeX citation

@misc{sparklyr.flint-0.1.1,
  author = {Li, Yitao},
  title = {RStudio AI Blog: Introducing sparklyr.flint: A time-series extension for sparklyr},
  url = {https://blogs.rstudio.com/tensorflow/posts/2020-09-07-sparklyr-flint/},
  year = {2020}
}