News from the sparkly-verse

Packages/Releases Spark R

Highlights of the most recent updates to sparklyr and friends

Edgar Ruiz (Posit, https://www.posit.co/)
2024-04-22

Highlights

sparklyr and friends have received some important updates in the past few months. Here are some highlights:

pysparklyr 0.1.4

spark_apply() now works on Databricks Connect v2. The latest pysparklyr release uses the rpy2 Python library as the backbone of the integration.

Databricks Connect v2 is based on Spark Connect. At this time, it supports Python user-defined functions (UDFs), but not R user-defined functions. Using rpy2 circumvents this limitation. As shown in the diagram, sparklyr sends the R code to the locally installed rpy2, which in turn sends it to Spark. The rpy2 installed in the remote Databricks cluster then runs the R code.

Figure 1: R code via rpy2. The diagram shows how sparklyr transmits the R code via the rpy2 Python package, and how Spark uses it to run the R code.

A big advantage of this approach is that rpy2 supports Arrow. In fact, it is the recommended Python library to use when integrating Spark, Arrow and R. This means that the data exchange between the three environments will be much faster!

As in the original implementation, schema inference works, and it still carries a performance cost. But unlike the original, this implementation returns a ‘columns’ specification that you can use the next time you run the call.

spark_apply(
  tbl_mtcars,
  nrow,
  group_by = "am"
)

#> To increase performance, use the following schema:
#> columns = "am double, x long"

#> # Source:   table<`sparklyr_tmp_table_b84460ea_b1d3_471b_9cef_b13f339819b6`> [2 x 2]
#> # Database: spark_connection
#>      am     x
#>   <dbl> <dbl>
#> 1     0    19
#> 2     1    13
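The suggested schema can be passed back through the `columns` argument of `spark_apply()`, so the next run can skip the inference step. The following is a minimal sketch, assuming the Databricks Connect v2 backend and the `tbl_mtcars` table from the call above. The wrapper name `apply_with_schema()` is hypothetical and only serves to keep the snippet self-contained.

```r
# Hypothetical helper: re-runs the grouped row count, this time supplying
# the schema suggested by the first run so it does not need to be inferred.
apply_with_schema <- function(tbl, schema = "am double, x long") {
  sparklyr::spark_apply(
    tbl,
    nrow,
    group_by = "am",
    columns = schema
  )
}

# Usage (needs a live Databricks Connect session, so it is commented out):
# apply_with_schema(tbl_mtcars)
```

With the schema supplied, the message recommending a schema should no longer be needed, since the column names and types are already known.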

A full article about this new capability is available here: Run R inside Databricks Connect

sparkxgb

sparkxgb is an extension of sparklyr that enables integration with XGBoost. The current CRAN release does not support the latest versions of XGBoost, a limitation that recently prompted a full refresh of sparkxgb. The improvements are currently in the development version of the package, which can be installed from GitHub:

remotes::install_github("rstudio/sparkxgb")

library(sparkxgb)
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)

xgb_model <- xgboost_classifier(
  iris_tbl,
  Species ~ .,
  num_class = 3,
  num_round = 50,
  max_depth = 4
)

xgb_model %>% 
  ml_predict(iris_tbl) %>% 
  select(Species, predicted_label, starts_with("probability_")) %>% 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 5
#> Database: spark_connection
#> $ Species                <chr> "setosa", "setosa", "setosa", "setosa", "setosa…
#> $ predicted_label        <chr> "setosa", "setosa", "setosa", "setosa", "setosa…
#> $ probability_setosa     <dbl> 0.9971547, 0.9948581, 0.9968392, 0.9968392, 0.9…
#> $ probability_versicolor <dbl> 0.002097376, 0.003301427, 0.002284616, 0.002284…
#> $ probability_virginica  <dbl> 0.0007479066, 0.0018403779, 0.0008762418, 0.000…

sparklyr 1.8.5

The new version of sparklyr does not have user-facing improvements. Internally, however, it has crossed an important milestone: support for Spark version 2.3 and below has effectively ended. The Scala code needed to support those versions is no longer part of the package. As per Spark’s versioning policy, Spark 2.3 reached ‘end-of-life’ in 2018.

This is part of a larger, ongoing effort to make the immense code-base of sparklyr a little easier to maintain, and hence reduce the risk of failures. As part of the same effort, the number of upstream packages that sparklyr depends on has been reduced. This has been happening across multiple CRAN releases, and in this latest release tibble and rappdirs are no longer imported by sparklyr.
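Because support for Spark 2.3 and below has ended, it can be useful to check a cluster's version before upgrading sparklyr. The following is a minimal sketch in base R. The helper `is_supported_spark()` is hypothetical and not part of sparklyr, and the 2.4 threshold is an assumption derived from the statement above that 2.3 and below are no longer supported.

```r
# Hypothetical helper: TRUE when a Spark version is newer than the 2.3 line.
# The "2.4" cutoff is an assumption based on the text, not a sparklyr constant.
is_supported_spark <- function(version) {
  as.numeric_version(version) >= as.numeric_version("2.4")
}

is_supported_spark("2.3.4")
#> [1] FALSE
is_supported_spark("3.5.1")
#> [1] TRUE
```

With a live connection, the value returned by `spark_version(sc)` could be passed to the helper instead of a hard-coded string.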

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Ruiz (2024, April 22). Posit AI Blog: News from the sparkly-verse. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2024-04-22-sparklyr-updates/

BibTeX citation

@misc{sparklyr-updates-q1-2024,
  author = {Ruiz, Edgar},
  title = {Posit AI Blog: News from the sparkly-verse},
  url = {https://blogs.rstudio.com/tensorflow/posts/2024-04-22-sparklyr-updates/},
  year = {2024}
}