Doing More with Bigger Data: An Introduction to Arrow for R Users
1 Setup
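The setup chunk isn't shown, but the rest of the post assumes roughly these packages are loaded (tictoc is an assumption here, inferred from the "sec elapsed" timings further down):

```{r}
#| eval: false
library(tidyverse)  # dplyr, readr, tibble, ggplot2, ...
library(arrow)      # open_dataset(), write_dataset(), to_duckdb()
library(duckdb)     # needed for the DuckDB section at the end
library(tictoc)     # assumed here, for the "sec elapsed" timings
```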
2 Getting the data
We use a dataset of item checkouts from Seattle Public Library, available online from the city of Seattle's open data portal.
The following code will get you a cached copy of the data. The data is quite big, so it will take some time to download. I highly recommend using curl::multi_download() to get very large files, as it's built for exactly this purpose: it gives you a progress bar and it can resume the download if it's interrupted.
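A sketch of that download, assuming the local path used later in this post; the URL below is a placeholder, not the real location of the file:

```{r}
#| eval: false
# Create the data directory if needed, then download with resume support.
# The URL is a placeholder; substitute the actual location of the Seattle
# checkouts CSV.
dir.create("~/Github/data", showWarnings = FALSE, recursive = TRUE)
curl::multi_download(
  urls = "https://example.com/seattle-library-checkouts.csv",  # placeholder
  destfiles = "~/Github/data/seattle-library-checkouts.csv",
  resume = TRUE
)
```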
3 Reading the data as a CSV file
If we attempt to read the dataset using read_csv(), it can take a very long time or fail altogether (in my case, I had to force-quit R because the session stopped responding). This is primarily due to the sheer size of the dataset: 41,389,465 rows and 12 columns, occupying 9.21 GB on disk.
```{r}
#| eval: false
#| include: false
read_csv("~/Github/data/seattle-library-checkouts.csv") |>
nrow()
```
Let's use the open_dataset() function from the arrow package to read the dataset instead. Rather than loading the entire file into memory, open_dataset() scans just the beginning of the file to infer its structure (the schema), and only reads the rest of the data when it's actually needed. This lets us work with larger-than-memory datasets without running into memory issues.
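Opening the file looks roughly like this. Declaring ISBN as a string is an assumption on my part: arrow only scans the first rows to guess column types, and a mostly-empty ISBN column can trip up that guess, so it's safer to spell it out:

```{r}
#| eval: false
# Open the CSV lazily with arrow; nothing is loaded into memory yet
seattle_csv <- open_dataset(
  sources = "~/Github/data/seattle-library-checkouts.csv",
  col_types = schema(ISBN = string()),  # assumed: help the type guesser
  format = "csv"
)
seattle_csv
```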
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: string
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
Even so, it still took quite a while (about 35 seconds) just to count the 41,389,465 rows.
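Counting the rows forces a full scan of the CSV; a sketch of that call (the tictoc timing is an assumption):

```{r}
#| eval: false
# Counting rows still requires scanning the whole CSV once
tic()
seattle_csv |> nrow()
toc()
```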
Taking a glimpse at the dataset also took a while (about 14 seconds).
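A sketch of the glimpse call, again timed with tictoc (an assumption):

```{r}
#| eval: false
# glimpse() pulls a preview of every column, so it also scans the file
tic()
seattle_csv |> glimpse()
toc()
```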
FileSystemDataset with 1 csv file
41,389,465 rows x 12 columns
$ UsageClass <string> "Physical", "Physical", "Digital", "Physical", "Physi…
$ CheckoutType <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Horizo…
$ MaterialType <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOOK",…
$ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
$ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2, 3,…
$ Title <string> "Super rich : a guide to having it all / Russell Simm…
$ ISBN <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
$ Creator <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim Par…
$ Subjects <string> "Self realization, Conduct of life, Attitude Psycholo…
$ Publisher <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Dial …
$ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c2005.…
14.27 sec elapsed
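Next, let's count the total Checkouts by year. A sketch of that query (the TotalCheckouts name and the tictoc timing are assumptions):

```{r}
#| eval: false
# Count total checkouts per year; collect() triggers the actual scan
tic()
seattle_csv |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
toc()
```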
15.567 sec elapsed
And counting the Checkouts by year took about 16 seconds.
4 Reading the data as a Parquet file
Thanks to arrow, this code will work regardless of how large the underlying dataset is. But it's currently rather slow: on my computer, it took 20 seconds or longer to run. That's not terrible given how much data we have, but we can make it much faster by switching to a better format.
4.1 Rewriting the Seattle library data as a Parquet file
```{r}
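# Group by CheckoutYear so write_dataset() creates one Hive-style
# "CheckoutYear=..." folder of Parquet files per year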
seattle_csv |>
group_by(CheckoutYear) |>
write_dataset(path = "~/GitHub/data/seattle-library-checkouts", format = "parquet")
```
It took about 30 seconds to write the dataset out as partitioned Parquet files, which together are 4.42 GB in size, less than half the size of the original CSV file. Let's take a look at what we just produced:
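One quick way to inspect the output (pq_path is just a helper name introduced here; tibble comes from the tidyverse):

```{r}
#| eval: false
# List the Parquet files that write_dataset() produced, with their sizes
pq_path <- "~/GitHub/data/seattle-library-checkouts"
tibble(
  files = list.files(pq_path, recursive = TRUE),
  size_MB = file.size(file.path(pq_path, files)) / 1024^2
)
```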
Our single 9 GB CSV file has been rewritten into 18 parquet files. The file names use a "self-describing" convention from the Apache Hive project: Hive-style partitioning names folders with a "key=value" convention, so as you might guess, the CheckoutYear=2005 directory contains all the data where CheckoutYear is 2005. Each file is between 100 and 300 MB and the total size is now around 4.4 GB, a bit less than half the size of the original CSV file. This is as we expect, since parquet is a much more efficient format.
4.2 Using dplyr with arrow
Now that we've created these parquet files, we need to read them in again. We use open_dataset() again, but this time we give it the directory rather than a single file:
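Something like this, with seattle_pq as the name used for the Parquet-backed dataset in the sketches below (the name is chosen here, not shown in the original output):

```{r}
#| eval: false
# Open the partitioned Parquet dataset; the CheckoutYear=... folders are
# picked up automatically as a partitioning column
seattle_pq <- open_dataset("~/GitHub/data/seattle-library-checkouts")
```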
Now we can write our dplyr pipeline. For example, we could count the total number of books checked out in each month for the last five years:
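Here's a sketch of that pipeline. The column names come from the schema above; treating "the last five years" as CheckoutYear >= 2018, and the TotalCheckouts name, are assumptions chosen to match the printed query object below:

```{r}
#| eval: false
# Build (but don't yet execute) the query
query <- seattle_pq |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear, CheckoutMonth)
```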
Writing dplyr code for arrow data is conceptually similar to dbplyr (Chapter 21 of R for Data Science): you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, and which is only executed when you call collect(). If we print out the query object, we can see a little information about what we expect Arrow to return once execution takes place; calling collect() is what actually retrieves the results:
FileSystemDataset (query)
CheckoutYear: int32
CheckoutMonth: int64
TotalCheckouts: int64
* Grouped by CheckoutYear
* Sorted by CheckoutYear [asc], CheckoutMonth [asc]
See $.data for the source Arrow object
Let's compare the time it takes to count the Checkouts by year using the CSV file versus the Parquet files, and see if it's worth the trouble.
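First the CSV-backed dataset, timed with base R's system.time(), which matches the user/system/elapsed output below; the exact pipeline is a sketch:

```{r}
#| eval: false
# Count checkouts per year against the CSV-backed dataset
system.time(
  seattle_csv |>
    group_by(CheckoutYear) |>
    summarize(TotalCheckouts = sum(Checkouts)) |>
    collect()
)
```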
user system elapsed
17.798 3.286 16.653
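And the same query against the Parquet-backed seattle_pq dataset:

```{r}
#| eval: false
# The same count, now against the partitioned Parquet dataset
system.time(
  seattle_pq |>
    group_by(CheckoutYear) |>
    summarize(TotalCheckouts = sum(Checkouts)) |>
    collect()
)
```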
user system elapsed
0.391 0.049 0.108
The query against the Parquet files took 0.108 seconds, while the same query against the CSV file took 16.653 seconds. That makes the Parquet version roughly 150 times faster. Totally worth the trouble.
5 Using duckdb with arrow
There's one last advantage of parquet and arrow: it's very easy to turn an arrow dataset into a DuckDB database by calling arrow::to_duckdb():
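The code behind the timing below isn't shown; here is a sketch that mirrors the earlier query (the filter, summary, and tictoc timing are assumptions):

```{r}
#| eval: false
# Hand the Arrow dataset to DuckDB (no data copy), then query it with dplyr
tic()
seattle_pq |>
  to_duckdb() |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  collect()
toc()
```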
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
0.454 sec elapsed
It took a little longer than querying the Arrow dataset directly. However, the neat thing about to_duckdb() is that the transfer doesn't involve any memory copying, which speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.