---
title: "Getting started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{css, echo=FALSE}
.custom_note {
  border: solid 3px #0b788e;
  background-color: #08505e;
  padding: 5px;
  margin-bottom: 10px;
  border-radius: 3px;
}
.custom_note > p, .custom_note > p > code {
  color: white;
  background: #08505e;
}
.custom_note > p > code > a:any-link {
  text-decoration-color: white;
}
```
Using `tidypolars` requires importing data as Polars `DataFrame`s or
`LazyFrame`s. You can read files with the [various `read_*_polars()` functions](https://tidypolars.etiennebacher.com/reference/#import-data)
(such as `read_parquet_polars()`) to import them as `DataFrame`s, or with
`scan_*_polars()` functions (such as `scan_parquet_polars()`) to import them as
`LazyFrame`s. There are several functions to import various file formats, such
as CSV, Parquet, or JSON.
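
As a minimal sketch of the eager and lazy import paths described above, the
chunk below writes a small CSV file to a temporary location and reads it back
both ways. It assumes that `read_csv_polars()` and `scan_csv_polars()` follow
the same `read_*_polars()` / `scan_*_polars()` naming pattern and take the file
path as their first argument. The `mtcars` data and the temporary file are only
there for illustration.

```{r, eval = FALSE}
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Illustrative input only: write a built-in dataset to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Eager import: the whole file is read into a Polars DataFrame right away.
cars_eager <- read_csv_polars(csv_path)

# Lazy import: this only builds a LazyFrame, the file is not fully read yet.
cars_lazy <- scan_csv_polars(csv_path)

# The lazy query is evaluated only when the result is requested,
# here by converting it to a standard R data.frame.
four_cyl <- cars_lazy |>
  filter(cyl == 4) |>
  as.data.frame()

nrow(four_cyl)
```

With the lazy variant, Polars can in principle skip rows and columns the query
does not need, which is why the `scan_*_polars()` functions are usually the
better fit for large files.
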

Note: in examples and some tutorials, `as_polars_df()` and `as_polars_lf()` are
sometimes used to convert an existing R `data.frame` to a Polars `DataFrame` or
`LazyFrame`. These are merely convenience functions for quickly converting an
existing dataset to Polars, which is useful for showcase purposes. However,
converting from R to Polars has a cost and hurts performance. In real-life use
cases, be sure to load the data with the `read_*_polars()` or `scan_*_polars()`
functions mentioned above.

Here, we're going to use the `who` dataset that is available in the `tidyr`
package. I import it both as a classic R `data.frame` and as a Polars `DataFrame`
so that we can easily compare `dplyr` and `tidypolars` functions.
```{r setup}
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
who_df <- tidyr::who
who_pl <- as_polars_df(tidyr::who)
```
`tidypolars` provides methods for `dplyr` and `tidyr` S3 generics. In simpler
terms, this means that you can use the same functions on a Polars `DataFrame` or
`LazyFrame` as in a classic `tidyverse` workflow and it should just work (if it
doesn't,
please [open an issue](https://github.com/etiennebacher/tidypolars/issues)).
Note that you still need to load `dplyr` and `tidyr` in your code.
Here's an example of some `dplyr` and `tidyr` code on the classic R `data.frame`:
```{r}
who_df |>
  filter(year > 1990) |>
  drop_na(newrel_f3544) |>
  select(iso3, year, matches("^newrel(.*)_f")) |>
  arrange(iso3, year) |>
  rename_with(.fn = toupper) |>
  head()
```
We can simply use our Polars dataset instead:
```{r}
who_pl |>
  filter(year > 1990) |>
  drop_na(newrel_f3544) |>
  select(iso3, year, matches("^newrel(.*)_f")) |>
  arrange(iso3, year) |>
  rename_with(.fn = toupper) |>
  head()
```
If you use the Polars lazy API, you need to call `compute()` at the end of the
chained expression to evaluate the query:
```{r}
who_pl_lazy <- as_polars_lf(tidyr::who)
who_pl_lazy |>
  filter(year > 1990) |>
  drop_na(newrel_f3544) |>
  select(iso3, year, matches("^newrel(.*)_f")) |>
  arrange(iso3, year) |>
  rename_with(.fn = toupper) |>
  compute() |>
  head()
```

Note: several functions trigger the evaluation of a lazy query: `compute()`,
`collect()`, `as.data.frame()`, and `as_tibble()`. To get a Polars `DataFrame`,
use `compute()`. To get a standard R `data.frame`, for example to use it in a
statistical analysis, use any of the other three functions. Be warned that if
the dataset is too big for your available memory, this will crash the R
session.

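
The difference between the evaluation functions listed above can be sketched on
a small example. The chunk below builds one lazy query and evaluates it twice,
once with `compute()` and once with `as.data.frame()`. The `mtcars` data and
the object names are illustrative assumptions, not part of the `who` example.

```{r, eval = FALSE}
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Illustrative data only: convert a built-in dataset to a LazyFrame.
cars_lazy <- as_polars_lf(mtcars)

# Building the query does not run anything yet.
query <- cars_lazy |>
  filter(cyl == 4) |>
  select(mpg, cyl)

# compute() evaluates the query and keeps the result in Polars.
res_polars <- compute(query)

# as.data.frame() evaluates the query and returns a standard R data.frame,
# so the full result has to fit in R's memory.
res_r <- as.data.frame(query)

class(res_polars)
class(res_r)
```

Keeping the result in Polars with `compute()` makes sense when more
`tidypolars` steps follow. Converting to an R `data.frame` is best left for
the point where the data is small enough for modelling or plotting.
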
`tidypolars` also supports many functions from `base`, `lubridate` or `stringr`.
When these are used inside `filter()`, `mutate()` or `summarize()`, `tidypolars`
will automatically convert them to use the Polars engine under the hood. Take
a look at the vignette ["R and Polars expressions"](https://tidypolars.etiennebacher.com/articles/r-and-polars-expressions) for more information.
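
As a small, hedged illustration of the paragraph above, the chunk below uses
one function from each of `base`, `stringr` and `lubridate` inside `mutate()`
on a Polars `DataFrame`. The toy dataset and column names are made up for this
sketch, and it assumes that `nchar()`, `str_to_upper()` and `year()` are among
the functions that `tidypolars` translates. The linked vignette is the place to
check what is actually supported.

```{r, eval = FALSE}
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)
library(stringr)
library(lubridate, warn.conflicts = FALSE)

# Hypothetical toy data, used only for illustration.
people <- as_polars_df(
  data.frame(
    name = c("ana", "bob"),
    born = as.Date(c("1990-05-01", "2001-11-23"))
  )
)

# These calls are written as usual R code. tidypolars is expected to
# translate them to Polars expressions rather than run them in R.
res <- people |>
  mutate(
    name_upper = str_to_upper(name),
    name_length = nchar(name),
    birth_year = year(born)
  ) |>
  as.data.frame()

res
```
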