Title: | A Data Model Package for the MLinHCT Project |
---|---|
Description: | Extracts, Loads, and Transforms data from the SQL Server containing HCT data to a `dm` object with cleaned tables. |
Authors: | Jesse Smith [aut, cre] |
Maintainer: | Jesse Smith <[email protected]> |
License: | AGPL (>= 3) |
Version: | 1.0.0 |
Built: | 2024-11-13 05:30:52 UTC |
Source: | https://github.com/jesse-smith/dmhct |
Connect to SQL Server Where IRB_MLinHCT Data is Stored
con_irb_mlinhct( server = "SVWPBMTCTDB01", database = "IRB_MLinHCT", trusted_connection = TRUE, dsn = NULL, ... )
server |
|
database |
|
trusted_connection |
|
dsn |
|
... |
Additional named arguments to pass to |
[Microsoft SQL Server]
An ODBC connection object
Connect to SQL Server Where Data is Stored
con_sql_server(dbname = c("IRB_MLinHCT", "EDW"))
dbname |
|
[Microsoft SQL Server]
An ODBC connection object
Connect to SQL Server Where EDW Data is Stored
con_stjude_edw( server = "stjude-edw.database.windows.net", database = "EDW", authentication = "ActiveDirectoryIntegrated", ... )
server |
|
database |
|
authentication |
|
... |
Additional named arguments to pass to |
[Microsoft SQL Server]
An ODBC connection object
Collect All Tables in a `dm` Object
dm_collect(dm_remote, data_table = FALSE)
dm_remote |
|
data_table |
|
[dm]
A new `dm` object containing the collected (local) tables
Combine Table with Like Information
dm_combine(dm_std = dm_standardize(), quiet = FALSE)
dm_std |
A standardized |
quiet |
Should update messages be suppressed? |
The updated `dm` object
Compute All Tables in a `dm` Object
dm_compute(dm_remote, quiet = TRUE)
dm_remote |
|
quiet |
|
[dm]
The updated object with tables computed
Disconnect a `dm` Object from the Remote Server
dm_disconnect(dm)
dm |
|
[dm]
The `dm` (invisibly)
`dm_elt()` encompasses the entire legacy dmhct pipeline; however, this
pipeline is deprecated and no longer under active development. While this
function will be retained for backwards compatibility, it is strongly
recommended that new code use the new pipeline instead.
dm_elt(dm_remote = dm_sql_server(), reset = FALSE, close = NULL)
dm_remote |
|
reset |
|
close |
|
[dm]
A `dm` object
`dm_extract()` extracts and (optionally) loads the remote database housing
the MLinHCT data into the current R session. Unless `.legacy = TRUE`, column
and table names are standardized during extraction, but no other operations
are performed. When `.legacy = TRUE`, the legacy version of `dm_extract()` is
used; see details for this behavior. Note that legacy behavior will be
deprecated and eventually removed in future releases, so it is strongly
recommended that any new code use `.legacy = FALSE`.
dm_extract( dm_remote = dm_sql_server(), ..., .collect = TRUE, .legacy = FALSE, .reset = FALSE, .excl_dsmb = FALSE, .quiet = FALSE, reset = .reset )
Legacy behavior is more opinionated than the current version of `dm_extract()`.
First, only a subset of tables and columns are extracted. Second, HLA tables
and Cerner tables are combined into a single HLA table and a single Cerner
table. Third, some column standardization occurs (though it is limited to
simple `as()` transformations, `trimws(toupper(x))` on character variables,
and replacement of implicit missing values with explicit `NA`s). Finally,
some filtering of "uninformative" observations may occur. In the current
pipeline, these changes are deferred to later steps to give more control
to the user.
[dm]
A `dm` object with all tables and columns extracted from the remote source.
`dm_extract_legacy()` is a previous, less extensible version of `dm_extract()`.
It selects tables and columns of potential interest. It combines all
Cerner tables into one and joins HLA tables.
dm_extract_legacy(dm_remote = dm_sql_server(), collect = TRUE, reset = FALSE)
dm_remote |
|
collect |
|
reset |
|
[dm]
The updated `dm` object
`dm_hct()` chains together `dm_extract()`, `dm_standardize()`, and
`dm_combine()` to provide a single wrapper function for data preparation.
dm_hct(dm_remote = dm_sql_server(), ..., .excl_dsmb = FALSE, .quiet = FALSE)
The prepared `dm` object
Check Whether a `dm` Object is Connected to a Remote Server
dm_is_remote(dm)
dm |
The |
[lgl(1)]
Whether the `dm` is remote or not
Pivot Tables in Entity-Attribute-Value Format
dm_pivot(dm_cmb = dm_combine(), quiet = FALSE)
dm_cmb |
A |
quiet |
Should update messages be suppressed? |
The updated `dm` object
Create a `dm` Object of HCT Data from Connection to SQL Server
dm_sql_server(con = con_sql_server(), quiet = FALSE)
con |
|
quiet |
Should update messages be suppressed? |
[dm]
A `dm` containing HCT data
`dm` for MLinHCT
`dm_standardize()` takes a local copy of the SQL Server database as input and
standardizes all columns across all tables. Standardization procedures are
based on both column type and the typing prefix of the column name. Specifically,
columns are standardized using the following workflow:
Columns with type `character` or chr/cat/lgl/mcat/intvl prefixes are passed to `std_chr()`
Columns with type `logical` or the lgl prefix are passed to `std_lgl()`
Columns with type `numeric` or `integer`, or num/pct prefixes, are passed to `std_num()`
Columns with the intvl prefix are passed to `std_intvl()`
Columns with types `Date`, `POSIXct`, or `POSIXlt`, or with dt/dttm/date prefixes, are passed to `std_date()`
After standardization, tables are sorted into alphabetical order before returning.
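The routing rules above can be sketched in base R. `pick_standardizers()` is a hypothetical helper, not part of dmhct, and it assumes the typing prefix is the text before the first underscore in the column name, which this manual does not confirm. It only reports which `std_*()` functions the listed rules would select for a column.

```r
# Hypothetical helper (not part of dmhct): list which std_*() functions the
# rules above would select for one column, given its values and its name.
# Assumption: the typing prefix is the text before the first underscore.
pick_standardizers <- function(x, name) {
  prefix <- sub("_.*$", "", name)
  rules <- c(
    std_chr   = is.character(x) ||
      prefix %in% c("chr", "cat", "lgl", "mcat", "intvl"),
    std_lgl   = is.logical(x) || prefix == "lgl",
    std_num   = is.numeric(x) || prefix %in% c("num", "pct"),
    std_intvl = prefix == "intvl",
    std_date  = inherits(x, c("Date", "POSIXct", "POSIXlt")) ||
      prefix %in% c("dt", "dttm", "date")
  )
  # Keep only the names of the rules that evaluated to TRUE
  names(rules)[rules]
}

# Column names below are made up for illustration
pick_standardizers(c("yes", "no"), "lgl_relapse")
pick_standardizers(as.Date("2020-01-01"), "dt_transplant")
pick_standardizers(c(1.5, 2), "pct_chimerism")
```

Note that a column can match more than one rule; for example, a character column with the lgl prefix is selected for both `std_chr()` and `std_lgl()` under this reading.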
dm_standardize(dm_local = dm_extract(), quiet = FALSE)
dm_local |
A local |
quiet |
Whether to suppress progress messages |
The input `dm` with standardized column values
A previous version of the dmhct pipeline performed all transformations of
tables simultaneously; to ensure backwards compatibility, this behavior has
been retained in `dm_transform()`. However, it is strongly recommended that
new code not use `dm_transform()` and instead use the updated pipeline.
dm_transform(dm_local = dm_extract_legacy(), reset = FALSE)
dm_local |
|
reset |
|
[dm]
The updated `dm` object
`intvl_to_matrix()` converts the interval representation standardized by
`std_intvl()` to a 4-column `numeric` matrix. Columns represent open or
closed bounds and the location of those bounds.
intvl_to_matrix(x)
x |
A |
A 4-column numeric matrix:
`left_closed`: Whether the left bound is closed or open
`left_bound`: The left bound of the interval
`right_bound`: The right bound of the interval
`right_closed`: Whether the right bound is closed or open
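To make the column layout concrete, here is a minimal base-R sketch. `intvl_matrix_sketch()` is a hypothetical function, not the dmhct implementation; it assumes the input uses bracket notation such as `[0,5)` and that closed bounds are encoded as 1 and open bounds as 0, neither of which this manual states.

```r
# Hypothetical sketch (not the dmhct implementation): parse bracket-notation
# intervals such as "[0,5)" into the 4-column layout described above.
# Assumption: closed bounds are encoded as 1 and open bounds as 0.
intvl_matrix_sketch <- function(x) {
  pattern <- "^([\\[(])\\s*([0-9.]+)\\s*,\\s*([0-9.]+)\\s*([\\])])$"
  ok <- grepl(pattern, x, perl = TRUE)
  part <- function(i) {
    # Extract capture group i; strings that do not match become NA
    as.character(ifelse(ok, sub(pattern, paste0("\\", i), x, perl = TRUE), NA))
  }
  cbind(
    left_closed  = as.numeric(part(1) == "["),
    left_bound   = as.numeric(part(2)),
    right_bound  = as.numeric(part(3)),
    right_closed = as.numeric(part(4) == "]")
  )
}

intvl_matrix_sketch(c("[0,5)", "(10, 20]"))
```

Each input string becomes one row, so the result can be filtered or compared column by column with ordinary numeric operations.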
`na_patterns` is a collection of regular expressions that commonly represent
missing data, especially when the character vector should be converted to
something else. These are designed to match strings that have already been
standardized.
na_patterns
An object of class `character` of length 8.
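As an illustration of how such a pattern collection can be applied, the sketch below replaces matching strings with `NA`. The four patterns are stand-ins chosen for this example; the 8 regexes actually shipped in `na_patterns` are not listed in this manual, and `na_if_matches()` is a hypothetical helper.

```r
# Illustrative patterns only: the 8 regexes shipped in dmhct's `na_patterns`
# are not listed in this manual, so these stand-ins are assumptions.
example_na_patterns <- c("^$", "^NA$", "^N/A$", "^UNKNOWN$")

# Hypothetical helper: replace any string matching at least one pattern with NA
na_if_matches <- function(x, patterns) {
  combined <- paste0("(", patterns, ")", collapse = "|")
  x[grepl(combined, x)] <- NA_character_
  x
}

na_if_matches(c("POSITIVE", "", "N/A", "UNKNOWN", "12"), example_na_patterns)
```

Because the patterns are anchored and upper-case, they only behave as intended on strings that have already been trimmed and case-normalized, which matches the note above about standardized input.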
numeric
`non_numeric()` is designed primarily for interactive checking of numeric
conversions. It helps quickly determine what values in a vector cannot be
converted to `numeric` (either directly or via `std_num()`); this is
particularly useful for checking steps of a data cleaning pipeline.
non_numeric(x, unique = TRUE, sort = unique, std_num = FALSE)
x |
A vector |
unique |
Whether unique values should be returned; if |
sort |
Whether return values should be sorted; most useful when
|
std_num |
Whether to use |
The values of `x` that resulted in `NA_real_` after conversion; this
includes any `NA` values in `x` before conversion
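The idea can be reproduced in a few lines of base R. `non_numeric_sketch()` is a hypothetical stand-in that only covers direct conversion with `as.numeric()`; it does not implement the `std_num` option of the real function.

```r
# Hypothetical base-R sketch of the idea (not the dmhct implementation):
# return the values of `x` that become NA under as.numeric().
non_numeric_sketch <- function(x, unique = TRUE, sort = unique) {
  converted <- suppressWarnings(as.numeric(x))
  out <- x[is.na(converted)]
  if (unique) out <- base::unique(out)
  if (sort) out <- base::sort(out, na.last = TRUE)
  out
}

# "1e3" converts cleanly; "ten", "5 mg", and NA do not
non_numeric_sketch(c("1", "2.5", "ten", NA, "ten", "1e3", "5 mg"))
```

Running this on a column before and after a cleaning step shows which raw values still block conversion.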
`character` Vectors
`std_chr()` standardizes `character` vectors to ASCII text with no unnecessary
whitespace and a given case. By default, it will retain newlines inside text,
though it will condense consecutive newlines and any carriage returns into a
single newline.
std_chr( x, case = c("upper", "lower", "title", "sentence"), keep_inner_newlines = TRUE, na = "^$" )
x |
A character vector |
case |
The case to convert to. |
keep_inner_newlines |
Whether to retain line breaks inside text. |
na |
Regex patterns to consider |
The standardized character vector
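The whitespace, newline, and case handling described for `std_chr()` can be approximated in base R. `std_chr_sketch()` is a hypothetical function written for this example; it skips the ASCII conversion step and supports only upper and lower case, so it is a partial sketch rather than the package's behavior.

```r
# Hypothetical sketch (not the dmhct implementation). Assumptions: no ASCII
# transliteration is attempted, and only "upper"/"lower" cases are handled.
std_chr_sketch <- function(x, case = c("upper", "lower"),
                           keep_inner_newlines = TRUE, na = "^$") {
  case <- match.arg(case)
  # Collapse runs of carriage returns and newlines into a single newline
  x <- gsub("[\r\n]+", "\n", x)
  if (!keep_inner_newlines) x <- gsub("\n", " ", x, fixed = TRUE)
  # Condense spaces/tabs, drop spaces around newlines, trim both ends
  x <- gsub("[ \t]+", " ", x)
  x <- gsub(" ?\n ?", "\n", x)
  x <- trimws(x)
  x <- if (case == "upper") toupper(x) else tolower(x)
  # Strings matching the `na` pattern become missing
  x[grepl(na, x)] <- NA_character_
  x
}

std_chr_sketch(c("  graft  versus host ", "line one\r\n\r\nline two", ""))
```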
`std_date()` standardizes a date vector and returns a vector in `Date` or
`POSIXct` format, depending on whether there is sub-daily information
available in the data.
std_date( x, force = c("none", "dt", "dttm"), orders = c("mdy", "dmy", "ymd", "mdyr", "dmyr", "ymdr", "mdyR", "dmyR", "ymdR", "mdyT", "dmyT", "ymdT", "mdyTz", "dmyTz", "ymdTz", "Tmdyz", "Tdmyz", "Tymdz", "mdyRz", "dmyRz", "ymdRz", "mdyrz", "dmyrz", "ymdrz", "Tmdy", "Tdmy", "Tymd", "Tmdyz", "Tdmyz", "Tymdz"), tz_heuristic = c(5L, 6L), warn = TRUE, train = TRUE, na = na_patterns, range_value = c("start", "end", "na"), range_sep = c("-", "to", ","), ... )
x |
A vector of |
force |
Whether to force conversion to |
orders |
A |
tz_heuristic |
Hours to consider in determining presence of sub-daily information. Only exact hours (i.e. 5:00:00) will be combined. The default corresponds to accidental encoding of the CST-UTC offset as hours. |
warn |
Should warnings be thrown when necessary? |
train |
|
na |
Regular expressions to convert to |
range_value |
The value to use if the date is given as a range; can be
the start date, the end date, or fill with |
range_sep |
Separators used for date ranges |
... |
Additional arguments to pass to
|
A `Date` or `POSIXct` vector
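The choice between `Date` and `POSIXct` can be sketched as follows. `std_date_sketch()` is a hypothetical function that takes already-parsed `POSIXct` values in UTC; it reflects one plausible reading of `tz_heuristic`, namely that times falling exactly on midnight or on one of the listed hours carry no real sub-daily information. The package's actual logic may differ.

```r
# Hypothetical sketch (not the dmhct implementation). Assumptions: `x` is
# POSIXct stored in UTC, and times exactly at 00:00:00 or at an hour listed
# in `tz_heuristic` are treated as date-only values.
std_date_sketch <- function(x, tz_heuristic = c(5L, 6L)) {
  lt <- as.POSIXlt(x[!is.na(x)], tz = "UTC")
  offset_only <- lt$min == 0 & lt$sec == 0 &
    lt$hour %in% c(0L, tz_heuristic)
  # Drop to Date only when no value carries genuine time-of-day information
  if (all(offset_only)) as.Date(x, tz = "UTC") else x
}

# Exact 05:00/06:00 times look like an encoded UTC offset, so a Date is returned
std_date_sketch(as.POSIXct(c("2021-03-01 05:00:00", "2021-03-02 06:00:00"), tz = "UTC"))

# A real time of day keeps the vector as POSIXct
std_date_sketch(as.POSIXct("2021-03-01 14:30:00", tz = "UTC"))
```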
`std_intvl()` standardizes the various representations of numeric intervals
found in the MLinHCT dataset. These intervals are assumed to be percentage
values and thus lie between 0 and 100. Explicit intervals with upper and lower
bounds, as well as implicit intervals using < and >, are handled (<= and >=
are currently not supported). The return value simplifies to </>/<=/>= or
a single numeric value if possible and uses standard interval notation if not.
std_intvl( x, less_than = c("LESS THAN", "[A-Z ]*NOTHING TO SUGGEST[A-Z ]*SENSITIVITY[A-Z ]*(?=[0-9])"), greater_than = c("GREATER THAN"), na = na_patterns, std_chr = TRUE, warn = TRUE, ... )
x |
A |
less_than |
Regex patterns to consider |
greater_than |
Regex patterns to consider |
na |
Regex patterns to consider |
std_chr |
Whether to standardize the strings before parsing |
warn |
Whether to emit a warning when potential numeric values are not able to be converted to an interval |
... |
Arguments passed on to
|
A `character` vector
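As a small illustration of the implicit-interval handling, the sketch below rewrites textual comparisons as `<` and `>` prefixes. `implicit_intvl_sketch()` is hypothetical, and the compact output form (`<5`) is an assumption; the manual does not show the exact strings that `std_intvl()` returns.

```r
# Hypothetical sketch (not the dmhct implementation): rewrite textual
# comparisons as "<" / ">" prefixes. The two default patterns mirror the
# simple defaults shown in the usage above; the output format is assumed.
implicit_intvl_sketch <- function(x, less_than = "LESS THAN",
                                  greater_than = "GREATER THAN") {
  x <- sub(paste0("^\\s*(", less_than, ")\\s*"), "<", x)
  x <- sub(paste0("^\\s*(", greater_than, ")\\s*"), ">", x)
  x
}

implicit_intvl_sketch(c("LESS THAN 5", "GREATER THAN 95", "50"))
```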
`logical` Representations in Various Formats
`std_lgl()` converts other classes to `logical` vectors. All but `character`
use `as.logical()`; `character` vectors are converted by first (optionally)
standardizing with `std_chr()` and then assigning logical values based on the
regular expressions in `true`, `false`, and `na`.
std_lgl( x, true = c("^TRUE$", "^1$", "^YES", "^POS", "^ALIVE", "^ON THERAPY"), false = c("^FALSE$", "^0$", "^NO", "^NEG", "^EXPIRED", "^DECEASED", "^OFF THERAPY"), na = na_patterns, std_chr = TRUE, warn = TRUE )
x |
A vector to convert |
true |
Regex patterns to consider |
false |
Regex patterns to consider |
na |
Regex patterns to consider |
std_chr |
Whether to standardize a |
warn |
Whether to warn if |
A `logical` vector
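The regex-based mapping for character input can be sketched with the default `true` and `false` patterns from the usage above. `std_lgl_sketch()` is a hypothetical function; it replaces the optional `std_chr()` step with a plain `toupper(trimws(x))` and ignores the `na` and `warn` arguments.

```r
# Hypothetical sketch (not the dmhct implementation). Assumption: upper-casing
# and trimming stand in for the optional std_chr() step.
std_lgl_sketch <- function(
    x,
    true = c("^TRUE$", "^1$", "^YES", "^POS", "^ALIVE", "^ON THERAPY"),
    false = c("^FALSE$", "^0$", "^NO", "^NEG", "^EXPIRED", "^DECEASED",
              "^OFF THERAPY")) {
  x <- toupper(trimws(x))
  matches_any <- function(patterns) grepl(paste(patterns, collapse = "|"), x)
  # Anything matching neither pattern set stays NA
  out <- rep(NA, length(x))
  out[matches_any(false)] <- FALSE
  out[matches_any(true)] <- TRUE
  out
}

std_lgl_sketch(c("yes", "No", "Positive", "deceased", "pending"))
```

Several default patterns are anchored only at the start (for example `^NO`), so they also match longer strings beginning with those letters; that is worth keeping in mind when supplying custom patterns.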
`std_num()` converts all base classes, as well as `int64`, `factor`, `Date`,
and `POSIXt` vectors to the simplest numeric form possible.
std_num(x, na = na_patterns, std_chr = TRUE, warn = TRUE, ...)
x |
A vector to convert to numeric |
na |
Regex patterns to consider |
std_chr |
Whether to standardize a |
warn |
Whether to warn when strings cannot be converted; passed to |
... |
Arguments passed on to
|
`character` vectors are standardized using `std_chr()` by default, then
converted. `factor`s are treated as `character` vectors, rather than using
the underlying integer representation. `double` and `int64` vectors will be
converted to `integer` if this does not cause overflow or loss of precision.
`Date` is converted to `integer`, and `POSIXt` is converted to `integer` if
the range allows, otherwise `double`.
A `numeric` vector
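The "simplest numeric form" rule for `double` input can be illustrated in base R. `simplify_numeric()` is a hypothetical helper that covers only the double-to-integer case; it does not handle `int64`, `factor`, or date classes.

```r
# Hypothetical sketch (not the dmhct implementation): convert a double vector
# to integer only when that causes no overflow and no loss of precision.
simplify_numeric <- function(x) {
  if (!is.double(x)) return(x)
  vals <- x[!is.na(x)]
  fits <- all(is.finite(vals)) && all(abs(vals) <= .Machine$integer.max)
  whole <- fits && all(vals == trunc(vals))
  if (whole) as.integer(x) else x
}

typeof(simplify_numeric(c(1, 2, NA)))  # whole numbers within integer range
typeof(simplify_numeric(c(1.5, 2)))    # fractional part would be lost
typeof(simplify_numeric(3e10))         # exceeds the integer range
```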