The text package attempts to provide user-friendly access and pipelines to HuggingFace’s transformer language models in R.
AI-based language analysis has recently gone through a “paradigm shift” (Bommasani et al., 2021, p. 1), thanks in part to a new technique known as the transformer language model (Vaswani et al., 2017; Liu et al., 2019). Companies including Google, Meta, and OpenAI have released such models, among them BERT, RoBERTa, and GPT, which have achieved unprecedentedly large improvements across most language tasks, such as web search and sentiment analysis. While these language models are accessible in Python, and for typical AI tasks through HuggingFace, the R package text makes HuggingFace and state-of-the-art transformer language models accessible as social scientific pipelines in R.
We developed the text package (Kjell, Giorgi & Schwartz, 2022) with two objectives in mind:
To serve as a modular solution for downloading and using transformer language models. This, for example, includes transforming text to word embeddings as well as accessing common language model tasks such as text classification, sentiment analysis, text generation, question answering, translation and so on.
To provide an end-to-end solution that is designed for human-level analyses including pipelines for state-of-the-art AI techniques tailored for predicting characteristics of the person that produced the language or eliciting insights about linguistic correlates of psychological attributes.
This blog post shows how to install the text package, transform text to state-of-the-art contextual word embeddings, use language analysis tasks, and visualize words in word embedding space.
The text package sets up a Python environment to access the HuggingFace language models. The first time you use the text package after installing it, you need to run two functions: textrpp_install() and textrpp_initialize().
# Install text from CRAN
install.packages("text")
library(text)
# Install the Python packages required by text in a conda environment (with defaults)
textrpp_install()
# Initialize the installed conda environment
# save_profile = TRUE saves the settings so that you do not have to run textrpp_initialize() again after restarting R
textrpp_initialize(save_profile = TRUE)
See the extended installation guide for more information.
The textEmbed() function is used to transform text to word embeddings (numeric representations of text). The model argument enables you to set which language model to use from HuggingFace; if you have not used the model before, it will automatically download the model and necessary files.
# Transform the text data to BERT word embeddings
# Note: To run faster, try something smaller: model = 'distilroberta-base'.
word_embeddings <- textEmbed(texts = "Hello, how are you doing?",
                             model = 'bert-base-uncased')
word_embeddings
comment(word_embeddings)
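To make "numeric representations of text" more concrete, here is a minimal base-R sketch that does not call the text package at all. It assumes that a text-level embedding is obtained by averaging token vectors (mean pooling) and compares two texts with cosine similarity. The token values and the helper cosine_similarity() are made up for illustration; real BERT embeddings have hundreds of dimensions.

```r
# Toy token embeddings for two short texts (all values are made up).
# Rows are tokens, columns are embedding dimensions.
tokens_a <- rbind(
  hello = c(0.9, 0.1, 0.3, 0.0),
  how   = c(0.2, 0.8, 0.1, 0.1),
  are   = c(0.1, 0.7, 0.2, 0.2),
  you   = c(0.3, 0.6, 0.4, 0.1)
)
tokens_b <- rbind(
  hi    = c(0.8, 0.2, 0.3, 0.1),
  there = c(0.3, 0.5, 0.2, 0.3)
)

# Mean pooling (an assumption of this sketch): average the token vectors
# into a single vector per text
text_a <- colMeans(tokens_a)
text_b <- colMeans(tokens_b)

# Hypothetical helper: cosine similarity between two numeric vectors
cosine_similarity <- function(u, v) {
  sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
}

# Similarity between the two pooled text vectors
cosine_similarity(text_a, text_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction; a vector compared with itself gives exactly 1 up to floating-point error.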
The word embeddings can now be used for downstream tasks such as training models to predict related numeric variables (e.g., see the textTrain() and textPredict() functions).
(To get token-level output and the output of individual layers, see the textEmbedRawLayers() function.)
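As a rough illustration of what "training models to predict related numeric variables" can mean, the following base-R sketch fits a ridge regression on simulated embeddings and evaluates it on held-out rows. It is not the implementation behind textTrain() or textPredict(); ridge regression is simply an illustrative choice here, and the data, the penalty value, and the helpers ridge_fit() and ridge_predict() are all hypothetical.

```r
# Simulated data: 50 "texts", each represented by an 8-dimensional embedding.
# (These are random numbers, not real language-model output.)
set.seed(1)
n <- 50
d <- 8
embeddings <- matrix(rnorm(n * d), nrow = n, ncol = d)
true_weights <- rnorm(d)
scores <- as.vector(embeddings %*% true_weights + rnorm(n, sd = 0.5))

# Hold out the last 10 rows for evaluation
train_idx <- 1:40
test_idx <- 41:50

# Hypothetical helper: closed-form ridge regression on centered data,
# beta = (X'X + lambda * I)^-1 X'y
ridge_fit <- function(x, y, lambda = 1) {
  x_means <- colMeans(x)
  y_mean <- mean(y)
  x_c <- sweep(x, 2, x_means)
  beta <- solve(crossprod(x_c) + lambda * diag(ncol(x)),
                crossprod(x_c, y - y_mean))
  list(beta = as.vector(beta), x_means = x_means, y_mean = y_mean)
}

# Hypothetical helper: apply the fitted coefficients to new embeddings
ridge_predict <- function(fit, x_new) {
  as.vector(sweep(x_new, 2, fit$x_means) %*% fit$beta + fit$y_mean)
}

fit <- ridge_fit(embeddings[train_idx, ], scores[train_idx], lambda = 1)
predicted <- ridge_predict(fit, embeddings[test_idx, ])

# Correlation between predicted and observed held-out scores
cor(predicted, scores[test_idx])
```

Evaluating on rows that were not used for fitting mirrors the general idea of out-of-sample prediction; in practice the penalty would be tuned rather than fixed at 1.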
There are many transformer language models at HuggingFace that can be used for various language model tasks such as text classification, sentiment analysis, text generation, question answering, translation and so on. The text package comprises user-friendly functions to access these.
classifications <- textClassify("Hello, how are you doing?")
classifications
comment(classifications)
generated_text <- textGeneration("The meaning of life is")
generated_text
For more examples of available language model tasks, see textSum(), textQA(), textTranslate(), and textZeroShot() under Language Analysis Tasks.
Visualizing words in the text package is achieved in two steps: first, a function pre-processes the data; second, a function plots the words, including adjusting visual characteristics such as color and font size.
To demonstrate these two functions, we use example data included in the text package: Language_based_assessment_data_3_100. We show how to create a two-dimensional figure with words that individuals have used to describe their harmony in life, plotted according to two different well-being questionnaires: the harmony in life scale and the satisfaction with life scale. So, the x-axis shows words that are related to low versus high harmony in life scale scores, and the y-axis shows words related to low versus high satisfaction with life scale scores.
word_embeddings_bert <- textEmbed(Language_based_assessment_data_3_100,
                                  aggregation_from_tokens_to_word_types = "mean",
                                  keep_token_embeddings = FALSE)
# Pre-process the data for plotting
df_for_plotting <- textProjection(Language_based_assessment_data_3_100$harmonywords,
                                  word_embeddings_bert$text$harmonywords,
                                  word_embeddings_bert$word_types,
                                  Language_based_assessment_data_3_100$hilstotal,
                                  Language_based_assessment_data_3_100$swlstotal
)
# Plot the data
plot_projection <- textProjectionPlot(
  word_data = df_for_plotting,
  y_axes = TRUE,
  p_alpha = 0.05,
  title_top = "Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  p_adjust_method = "bonferroni",
  points_without_words_size = 0.4,
  points_without_words_alpha = 0.4
)
plot_projection$final_plot
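The intuition behind a "bicentroid" projection can be sketched in a few lines of base R. This is a simplified illustration under our own assumptions, not the algorithm implemented in textProjection(), which also handles grouping by scale scores and significance testing. The words, the simulated embeddings, and the grouping into high and low scorers are all made up.

```r
set.seed(42)

# Simulated data: 6 words with 5-dimensional embeddings (all values are made up)
words <- c("peace", "calm", "balance", "stress", "conflict", "chaos")
toy_embeddings <- matrix(rnorm(length(words) * 5), nrow = length(words),
                         dimnames = list(words, NULL))

# Shift the two groups apart on the first dimension so the toy data
# contains a clear low-versus-high direction
toy_embeddings[1:3, 1] <- toy_embeddings[1:3, 1] + 3
toy_embeddings[4:6, 1] <- toy_embeddings[4:6, 1] - 3

# Hypothetical grouping: words used mostly by high versus low scorers
high_group <- c("peace", "calm", "balance")
low_group <- c("stress", "conflict", "chaos")

# Centroids of the two groups and the unit-length direction from low to high
centroid_high <- colMeans(toy_embeddings[high_group, , drop = FALSE])
centroid_low <- colMeans(toy_embeddings[low_group, , drop = FALSE])
direction <- centroid_high - centroid_low
direction <- direction / sqrt(sum(direction^2))

# Center each word on the midpoint between the centroids, then take the
# dot product with the direction to get its position along the axis
midpoint <- (centroid_high + centroid_low) / 2
projection <- as.vector(sweep(toy_embeddings, 2, midpoint) %*% direction)
names(projection) <- words

# Words ordered from the "low" end to the "high" end of the axis
sort(projection)
```

Repeating the same steps with a second grouping variable would give a second coordinate per word, which is the idea behind plotting one questionnaire on the x-axis and another on the y-axis.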
This post demonstrates how to carry out state-of-the-art text analysis in R using the text package. The package aims to make it easy to access and use transformer language models from HuggingFace to analyze natural language. We look forward to your feedback and contributions toward making such models available for social scientific and other applications more typical of R users.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/OscarKjell/ai-blog, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kjell, et al. (2022, Oct. 4). Posit AI Blog: Introducing the text package. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/
BibTeX citation
@misc{kjell2022introducing,
  author = {Kjell, Oscar and Giorgi, Salvatore and Schwartz, H Andrew},
  title = {Posit AI Blog: Introducing the text package},
  url = {https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/},
  year = {2022}
}