safetensors 0.1.0

Packages/Releases R

Announcing safetensors, a new R package allowing for reading and writing files in the safetensors format.

Daniel Falbel (Posit)https://www.posit.co/
2023-06-15

safetensors is a new, simple, fast, and safe file format for storing tensors. The design of the file format and its original implementation are being led by Hugging Face, and it’s getting largely adopted in their popular ‘transformers’ framework. The safetensors R package is a pure-R implementation, allowing to both read and write safetensor files.

The initial version (0.1.0) of safetensors is now on CRAN.

Motivation

The main motivation for safetensors in the Python community is security. As noted in the official documentation:

The main rationale for this crate is to remove the need to use pickle on PyTorch which is used by default.

Pickle is considered an unsafe format, as the action of loading a Pickle file can trigger the execution of arbitrary code. This has never been a concern for torch for R users, since the Pickle parser that is included in LibTorch only supports a subset of the Pickle format, which doesn’t include executing code.

However, the file format has additional advantages over other commonly used formats, including:

There are additional advantages compared to other file formats common in this space, and you can see a comparison table here.

Format

The safetensors format is described in the figure below. It’s basically a header file containing some metadata, followed by raw tensor buffers.

Diagram describing the safetensors file format.

Basic usage

safetensors can be installed from CRAN using:

install.packages("safetensors")

We can then write any named list of torch tensors:

library(torch)
library(safetensors)

tensors <- list(
  x = torch_randn(10, 10),
  y = torch_ones(10, 10)
)

str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]

tmp <- tempfile()
safe_save_file(tensors, tmp)

It’s possible to pass additional metadata to the saved file by providing a metadata parameter containing a named list.

Reading safetensors files is handled by safe_load_file, and it returns the named list of tensors along with the metadata attribute containing the parsed file header.

tensors <- safe_load_file(tmp)
str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]
#>  - attr(*, "metadata")=List of 2
#>   ..$ x:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 0 400
#>   ..$ y:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 400 800
#>  - attr(*, "max_offset")= int 929

Currently, safetensors only supports writing torch tensors, but we plan to add support for writing plain R arrays and tensorflow tensors in the future.

Future directions

The next version of torch will use safetensors as its serialization format, meaning that when calling torch_save() on a model, list of tensors, or other types of objects supported by torch_save, you will get a valid safetensors file.

This is an improvement over the previous implementation because:

  1. It’s much faster. More than 10x for medium sized models. Could be even more for large files. This also improves the performance of parallel dataloaders by ~30%.

  2. It enhances cross-language and cross-framework compatibility. You can train your model in R and use it in Python (and vice-versa), or train your model in tensorflow and run it with torch.

If you want to try it out, you can install the development version of torch with:

remotes::install_github("mlverse/torch")

Photo by Nick Fewings on Unsplash

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Falbel (2023, June 15). Posit AI Blog: safetensors 0.1.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/

BibTeX citation

@misc{safetensors,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: safetensors 0.1.0},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/},
  year = {2023}
}