Dictionaries

library(zstdlite)
library(bench)

Dictionary-based compression

Using a dictionary can be beneficial when compressing lots of small objects with similar structure or content.

Without a dictionary, the compressor starts from scratch for every new object. A dictionary trained on representative samples gives it starting information common to all objects, which amounts to a head start.

The following notes are from the zstd dictionary documentation:

Why should I use a dictionary?

Zstd can use dictionaries to improve compression ratio of small data. Traditionally small files don’t compress well because there is very little repetition in a single sample, since it is small. But, if you are compressing many similar files, like a bunch of JSON records that share the same structure, you can train a dictionary ahead of time on some samples of these files. Then, zstd can use the dictionary to find repetitions that are present across samples. This can vastly improve compression ratio.

When is a dictionary useful?

Dictionaries are useful when compressing many small files that are similar. The larger a file is, the less benefit a dictionary will have. Generally, we don’t expect dictionary compression to be effective past 100KB. And the smaller a file is, the more we would expect the dictionary to help.

How do I train a dictionary?

Gather samples from your use case. These samples should be similar to each other. If you have several use cases, you could try to train one dictionary per use case. If the dictionary training function fails, that is likely because you either passed too few samples, or a dictionary would not be effective for your data.

How large should my dictionary be?

A reasonable dictionary size, the dictBufferCapacity, is about 100KB. The zstd CLI defaults to a 110KB dictionary. You likely don’t need a dictionary larger than that. But, most use cases can get away with a smaller dictionary. The advanced dictionary builders can automatically shrink the dictionary for you, and select the smallest size that doesn’t hurt compression ratio too much. See the shrinkDict parameter. A smaller dictionary can save memory, and potentially speed up compression.

How many samples should I provide to the dictionary builder?

We generally recommend passing ~100x the size of the dictionary in samples. A few thousand should suffice. Having too few samples can hurt the dictionary’s effectiveness. Having more samples will only improve the dictionary’s effectiveness. But having too many samples can slow down the dictionary builder.

How do I determine if a dictionary will be effective?

Simply train a dictionary and try it out.

When should I retrain a dictionary?

You should retrain a dictionary when its effectiveness drops. Dictionary effectiveness drops as the data you are compressing changes. Generally, we do expect dictionaries to “decay” over time, as your data changes, but the rate at which they decay depends on your use case. Internally, we regularly retrain dictionaries, and if the new dictionary performs significantly better than the old dictionary, we will ship the new dictionary.
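
The sample-size guidance above (about 100x the dictionary size in samples) can be checked before training. The following is a minimal base-R sketch, not part of zstdlite: `sample_budget()` is a hypothetical helper, the sample data is made up for illustration, and `serialize()` is assumed to be a reasonable stand-in for the bytes the trainer actually sees.

```r
# Hypothetical helper (not part of zstdlite): compare the total size of the
# training samples against the ~100x guideline quoted above.
sample_budget <- function(samples, dict_size) {
  # Assumption: serialize() approximates the byte representation used in training
  total_bytes <- sum(vapply(samples, \(x) length(serialize(x, NULL)), integer(1)))
  list(
    total_bytes = total_bytes,
    ratio       = total_bytes / dict_size,        # guideline: roughly 100 or more
    enough      = total_bytes >= 100 * dict_size
  )
}

# Illustrative data: 200 small named integer vectors
set.seed(1)
samples <- lapply(1:200, \(i) setNames(sample(50), paste0("item", 1:50)))
budget  <- sample_budget(samples, dict_size = 5000)
str(budget)
```

If `enough` is `FALSE`, either gather more samples or request a smaller dictionary.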

Example

In the following example, using a dictionary gives compressed output that is about 37% smaller (63.1% of the no-dictionary size) and takes roughly 75% of the time.

set.seed(2024)
countries <- rownames(LifeCycleSavings)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create 'test' and 'train' datasets
# In this example, each sample is a named vector ranking the countries.
# Each ranking is compressed separately and stored (say, in a database).
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
train_samples <- lapply(
  1:1000, 
  \(x) setNames(sample(length(countries)), countries)
)

test_samples <- lapply(
  1:1000, 
  \(x) setNames(sample(length(countries)), countries)
)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create a dictionary
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dict <- zstd_train_dict_serialize(train_samples, size = 5000, optim = FALSE)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Set up compression contexts, with and without the dictionary
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cctx_nodict <- zstd_cctx(level = 3) # No dictionary. For comparison
cctx_dict   <- zstd_cctx(level = 3, dict = dict)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# When using the dictionary, what is the size of the compressed data compared
# to not using a dictionary here?
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
s1 <- lapply(test_samples, \(x) zstd_serialize(x, cctx = cctx_nodict)) |> lengths() |> sum()
s2 <- lapply(test_samples, \(x) zstd_serialize(x, cctx = cctx_dict  )) |> lengths() |> sum()
cat(round(s2/s1 * 100, 1), "%")
#> 63.1 %

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Simple benchmark to test speed when using a dictionary.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bench::mark(
  "No Dict" = lapply(test_samples, \(x) zstd_serialize(x, cctx = cctx_nodict)),
  "Dict"    = lapply(test_samples, \(x) zstd_serialize(x, cctx = cctx_dict  )),
  check = FALSE
)[, 1:5]
#> # A tibble: 2 × 5
#>   expression      min   median `itr/sec` mem_alloc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 No Dict        17ms   17.8ms      55.6      18MB
#> 2 Dict         12.9ms   13.1ms      75.6      18MB
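
The example above only compresses. Data compressed with a dictionary also needs that same dictionary to decompress, so a round trip might look like the sketch below. It assumes `zstd_dctx()` accepts a `dict` argument in the same way as `zstd_cctx()`, and that `zstd_unserialize()` takes a raw vector and a `dctx` argument; check the package documentation for the exact signatures.

```r
library(zstdlite)

set.seed(2024)
countries <- rownames(LifeCycleSavings)
samples   <- lapply(1:1000, \(x) setNames(sample(length(countries)), countries))

# Train a dictionary on the samples, as in the example above
dict <- zstd_train_dict_serialize(samples, size = 5000, optim = FALSE)

# Assumption: zstd_dctx() takes `dict` just as zstd_cctx() does
cctx <- zstd_cctx(level = 3, dict = dict)
dctx <- zstd_dctx(dict = dict)

original   <- samples[[1]]
compressed <- zstd_serialize(original, cctx = cctx)
restored   <- zstd_unserialize(compressed, dctx = dctx)

# Check that the round trip recovers the object
identical(original, restored)
```

The dictionary has to be kept alongside the compressed data, since it is needed again at decompression time.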