--- title: "diseasystore: quick start guide" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{diseasystore: quick start guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(diseasystore) ``` ```{r hidden_options, include = FALSE} if (rlang::is_installed("withr")) { withr::local_options("tibble.print_min" = 5) withr::local_options("tibble.print_max" = 5) withr::local_options("diseasystore.verbose" = FALSE) withr::local_options("diseasystore.DiseasystoreGoogleCovid19.n_max" = 1000) } else { opts <- options( "tibble.print_min" = 5, "tibble.print_max" = 5, "diseasystore.verbose" = FALSE, "diseasystore.DiseasystoreGoogleCovid19.n_max" = 1000 ) } # We have a "hard" dependency for duckdb to render parts of this vignette suggests_available <- rlang::is_installed("duckdb") not_on_cran <- interactive() || as.logical(Sys.getenv("NOT_CRAN", unset = "false")) ``` # Available diseasystores To see the available `diseasystores` on your system, you can use the `available_diseasystores()` function. ```{r available_diseasystores} available_diseasystores() ``` This function looks for `diseasystores` on the current search path. By default, this will show the `diseasystores` bundled with the base package. If you have [extended](extending-diseasystore.html) `diseasystore` with either your own `diseasystores` or from an external package, then attaching the package to your search path will allow it to show up as available. Note: `diseasystores` are found if they are defined within packages named `diseasystore*` and are of the class `?DiseasystoreBase`. Each of these `diseasystores` may have their own vignette that further details their content, use and/or tips and tricks. This is for example the case with `?DiseasystoreGoogleCovid19`. # Using a diseasystore To use a `diseasystore` we need to first do some configuration. The `diseasystores` are designed to work with data bases to store the computed features in. Each `diseasystore` may require individual configuration as listed in its documentation or accompanying vignette. For this Quick start, we will configure a `?DiseasystoreGoogleCovid19` to use a local `{duckdb}` data base Ideally, we want to use a faster, more capable, data base to store the features in. The `diseasystores` uses `{SCDB}` in the back end and can use any data base back end supported by `{SCDB}`. ```{r google_setup_hidden, include = FALSE, eval = suggests_available && not_on_cran} # The files we need are stored remotely in Google's API google_files <- c("by-age.csv", "demographics.csv", "index.csv", "weather.csv") remote_conn <- diseasyoption("remote_conn", "DiseasystoreGoogleCovid19") # In practice, it is best to make a local copy of the data which is # stored in the "vignette_data" folder. # This folder can either be in the package folder # (preferred, please create the folder) or in the tempdir(). local_conn <- purrr::detect( "vignette_data", checkmate::test_directory_exists, .default = tempdir() ) # Then we download the first n rows of each data set of interest try({ purrr::discard( google_files, ~ checkmate::test_file_exists(file.path(local_conn, .)) ) |> purrr::walk(\(file) { paste0(remote_conn, file) |> readr::read_csv(n_max = 1000, show_col_types = FALSE, progress = FALSE) |> readr::write_csv(file.path(local_conn, file)) }) }) # Check that the files are available after attempting to download files_missing <- purrr::some( google_files, ~ !checkmate::test_file_exists(file.path(local_conn, .)) ) if (files_missing) { data_available <- FALSE } else { data_available <- TRUE } ds <- DiseasystoreGoogleCovid19$new(target_conn = DBI::dbConnect(duckdb::duckdb()), source_conn = local_conn, start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-15")) ``` ```{r google_setup, eval = FALSE, eval = not_on_cran && suggests_available && data_available} ds <- DiseasystoreGoogleCovid19$new( target_conn = DBI::dbConnect(duckdb::duckdb()), start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-15") ) ``` When we create our new `diseasystore` instance, we also supply `start_date` and `end_date` arguments. These are not strictly required, but make getting features for this time interval simpler. Once configured we can query the available features in the `diseasystore` ```{r google_available_features, eval = not_on_cran && suggests_available && data_available} ds$available_features ``` These features can be retrieved individually (using the `start_date` and `end_date` we specified during creation of `ds`): ```{r google_get_feature_example_1, eval = not_on_cran && suggests_available && data_available} ds$get_feature("n_hospital") ``` Notice that features have associated "key_*" and "valid_from/until" columns. These are used for one of the primary selling points of `diseasystore`, namely [automatic aggregation](#automatic-aggregation). Go get features for other time intervals, we can manually supply `start_date` and/or `end_date`: ```{r google_get_feature_example_2, eval = not_on_cran && suggests_available && data_available} ds$get_feature("n_hospital", start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-02")) ``` # Dynamically expanded The `diseasystore` automatically expands the computed features. Say a given "n_hospital" has been computed between 2020-03-01 and 2020-03-15. In this case, the call `$get_feature("n_hospital", start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-20")` only needs to compute the feature between 2020-03-16 and 2020-03-20. # Time versioned Through using `{SCDB}` as the back end, the features are stored even as new data becomes available. This way, we get a time-versioned record of the features provided by `diseasystore`. The features being computed is controlled through the `slice_ts` argument. By default, `diseasystores` uses today's date for this argument. The dynamical expansion of the features described above is only valid for any given `slice_ts`. That is, if a feature has been computed for a time interval on one `slice_ts`, `diseasystore` will recompute the feature for any other `slice_ts`. This way, feature computation can be implemented into continuous integration (requesting features will preserve a history of computed features). Furthermore, post-hoc analysis can be performed by computing features as they would have looked on previous dates. # Automatic aggregation The real strength of `diseasystore` comes from its built-in automatic aggregation. We saw above that the features come with additional associated "key_*" and "valid_from/until" columns. This additional information is used to do automatic aggregation through the `?DieasystoreBase$key_join_features()` method (see [extending-diseasystore](extending-diseasystore.html) for more details). To use this method, you need to provide the `observable` that you want to aggregate and the `stratification` you want to apply to the aggregation. To see which features are considered "observables" and which are considered "stratifications" you can use the included methods: ```{r available_observables, eval = not_on_cran && suggests_available && data_available} ds$available_observables ``` ```{r available_stratifications, eval = not_on_cran && suggests_available && data_available} ds$available_stratifications ``` Lets start with an simple example where we request no stratification (`NULL`): ```{r google_key_join_features_example_1, eval = not_on_cran && suggests_available && data_available} ds$key_join_features(observable = "n_hospital", stratification = NULL) ``` This gives us the same feature information as `ds$get_feature("n_hospital")` but simplified to give the observable per day (in this case, the number of people hospitalised). To specify a level of `stratification`, we need to supply a list of `quosures` (see `help("topic-quosure", package = "rlang")`). ```{r google_key_join_features_example_2, eval = not_on_cran && suggests_available && data_available} ds$key_join_features(observable = "n_hospital", stratification = rlang::quos(country_id)) ``` The `stratification` argument is very flexible, so we can supply any valid R expression: ```{r google_key_join_features_example_3, eval = not_on_cran && suggests_available && data_available} ds$key_join_features(observable = "n_hospital", stratification = rlang::quos(country_id, old = age_group == "90+")) ``` # Dropping computed features Sometimes, it is need to clear the compute features from the data base. For this purpose, we provide the `drop_diseasystore()` function. By default, this deletes all stored features in the default `diseasystore` schema. A `pattern` argument to match tables by and a `schema` argument to specify the schema to delete from[^1]. ```{r drop_diseasystore_example_1, eval = not_on_cran && suggests_available && data_available} SCDB::get_tables(ds$target_conn, show_temporary = FALSE) ``` ```{r drop_diseasystore_example_2, eval = not_on_cran && suggests_available && data_available} drop_diseasystore(conn = ds$target_conn) SCDB::get_tables(ds$target_conn, show_temporary = FALSE) ``` # diseasystore options `diseasystores` have a number of options available to make configuration easier. These options all start with "diseasystore.". To see all options related to `diseasystore`, we can use the `diseasyoption()` function without arguments. ```{r diseasyoption_list} diseasyoption() ``` This returns all options related to `diseasystore` and its sister package `{diseasy}`. If you want the options for a specific `package`, you can use the `namespace` argument. Notice that several options are set as empty strings (""). These are treated as `NULL` by `diseasystore`[^2]. Importantly, the options are _scoped_. Consider the above options for "source_conn": Looking at the list of options we find "diseasystore.source_conn" and "diseasystore.DiseasystoreGoogleCovid19.source_conn". The former is a general setting while the latter is specific setting for `?DiseasystoreGoogleCovid19`. The general setting is used as fallback if no specific setting is found. This allows you to set a general configuration to use and to overwrite it for specific cases. To get the option related to a scope, we can use the `diseasyoption()` function. ```{r diseasyoption_example_1} diseasyoption("source_conn", class = "DiseasystoreGoogleCovid19") ``` As we saw in the options, a `source_conn` option was defined specifically for `?DiseasystoreGoogleCovid19`. If we try the same for the hypothetical `DiseasystoreDiseaseY`, we see that no value is defined as we have not yet configured the fallback value. ```{r diseasyoption_example_2} diseasyoption("source_conn", class = "DiseasystoreDiseaseY") ``` If we change our general setting for `source_conn` and retry, we see that we get the fallback value. ```{r diseasyoption_example_3} options("diseasystore.source_conn" = file.path("local", "path")) diseasyoption("source_conn", class = "DiseasystoreDiseaseY") ``` Finally, we can use the `.default` argument as a final fallback value in case no option is set for either general or specific case. ```{r diseasyoption_example_4} diseasyoption( "non_existent", class = "DiseasystoreDiseaseY", .default = "final fallback" ) ``` [^1]: If using `SQLite` as the back end, it will instead prepend the schema specification to the pattern before matching (e.g. "ds\\..*"). [^2]: R's `options()` does not allow setting an option to `NULL`. By setting options as empty strings, the user can see the available options to set. ```{r cleanup, include = FALSE} if (exists("ds")) rm(ds) gc() if (!rlang::is_installed("withr")) { options(opts) } ```