Coalescing refers to the process of checking the value of each input parameter in the order in which they are listed and returns the first non-missing value. In SAS, the behavior of each of the coalesce functions depends on the processing context, either in a data step or in a PROC SQL statement. These differences are important to acknowledge, especially when the code is migrated from PROC SQL to a data step.
Suppose, for example, that the following piece of code needs to be modified into a data step:
proc sql; create table foo2 as select coalesce(colum1, column2) as coalesce_column from foo; quit;
The intuitive approach would be to simply rewrite the code as follows:
data foo2; set foo; coalesce_column = coalesce(colum1, column2); run;
Although this approach seems to be correct at first glance, it may cause a potential issue: the coalesce function in the context of a data step treats parameters as numeric values. More specifically, if column1 and column2 are characters, the automatic SAS conversion kicks in by trying to convert these values to numerics. If such a conversion fails, a null value will be passed to the function. Simply switching to the character-compatible function, coalescec , may still have some side-effects as we will demonstrate in the following sections.
This article is therefore intended to review the behavior of the various coealesce functions in SAS, and to outline the hurdles when using these functions. All the examples will be driven from the following base table:
%let c_long_var_length = 300; %let n_short_var_length = 3; data root(drop=i character_to_append); *character variables; c_var_short = 'ABCDEFGHIJKLMN'; *create and populate a string which has a size greater than 200; length c_var_long $&c_long_var_length; do i = 1 to &c_long_var_length; character_to_append = '*'; if i = 1 then do; character_to_append = '>'; end; else if i = &c_long_var_length then do; character_to_append = '
The base table will allow us to study the behaviors of the various coalesce functions with variables of different lengths and types as summarized as follows:
Variable | Type | Length |
---|---|---|
c_var_long | Char | 300 |
c_var_short | Char | 14 |
n_var_long | Num | 8 |
n_var_short | Num | 3 |
In order to coalesce data in a data step, SAS offers two options:
It is important to note that if the coalescec function returns a value to a variable that has not previously been assigned a length, then that variable is given a length of 200 bytes. As a consequence, if one of the arguments has a length above 200 characters, the result may be truncated. To avoid any data truncation, the length of the output variable needs to be set ahead of the function call.
data coalescec_with_data_step; set root(drop=n_var:); /*The resulting variable has a length of 200. Note that it has not the same length of the parameters of the function*/ coalescec_empty_with_c_short = coalescec('', c_var_short); *the resulting variable is truncated as it has a length of 200 by default; coalescec_empty_with_c_long = coalescec('', c_var_long); length coalescec_empty_with_c_long2 $&c_long_var_length; coalescec_empty_with_c_long2 = coalescec('', c_var_long); run;
The resulting dataset has the following structure:
Variable | Type | Length |
---|---|---|
c_var_long | Char | 300 |
c_var_short | Char | 14 |
coalescec_empty_with_c_long | Char | 200 |
coalescec_empty_with_c_long2 | Char | 300 |
coalescec_empty_with_c_short | Char | 200 |
Using the coalescec function with a parameter of size greater than 200 (here, c_var_long ) will return by default a result with a length of 200. Setting explicitly the size of the output variable through the length statement is the only way to prevent a possible truncation.
The coalesce function accepts only numerics and returns by default a numeric of length 8, even if the underlying parameters have a shorter length. Unlike its sibling character-based function, there is no risk of truncation as the maximum length for a numeric in SAS is **8. Specifying the length of the return value in advance allows however the user to have full control of the length of the result.
data coalesce_with_data_step; set root(drop=c_var:); *the result has a length of 8 (default); coalesce_empty_with_n_short = coalesce('', n_var_short); coalesce_empty_with_n_long = coalesce('', n_var_long); *the resulting variable is truncated; length coalesce_empty_with_n_long2 3; coalesce_empty_with_n_long2 = coalesce('', n_var_long); run;
Note that the third variable of the output table ( coalesce_empty_with_n_long2 ) is not equal anymore to n_var_long as the numeric has been truncated. Truncating non-integers is not recommended due to the lost of precision.
Name | Value | Type | Length |
---|---|---|---|
coalesce_empty_with_n_short | 10 | Num | 8 |
coalesce_empty_with_n_long | 10.236455 | Num | 8 |
coalesce_empty_with_n_long2 | 10.236328 | Num | 3 |
The coalescec function is still available in the context of a PROC SQL statement. Similarly to the data step, setting explicitly the length through the column attribute may avoid an undesired truncation.
proc sql; create table coalescec_with_sql_step as select coalescec('', c_var_short) as coalescec_empty_with_c_short, /*the data will be truncated to 200 characters, like in the DATA-step*/ coalescec('', c_var_long) as coalescec_empty_with_c_long, /*in order to avoid truncation, the length needs to be specified*/ coalescec('', c_var_long) as coalescec_empty_with_c_long2 length=&c_long_var_length from root ; quit;
The output table is structured as follows:
Variable | Type | Length |
---|---|---|
coalescec_empty_with_c_long | Char | 200 |
coalescec_empty_with_c_long2 | Char | 300 |
coalescec_empty_with_c_short | Char | 200 |
It is recommended to use the coalesce SQL function for both numeric and character variables as SAS determines then the type of the output column based on the types of the arguments. While you need to use two functions, one for each data type, to coalesce variables in a data step, only one function is required in SQL.
Choosing the coalesce function over the coalescec function shifts the responsibility of defining the result type to the SQL processor. This behavior is in line with the SQL standard.
proc sql; create table coalesce_with_sql_step as select coalesce('', c_var_short) as coalesce_empty_with_c_short, coalesce('', c_var_long) as coalesce_empty_with_c_long, coalesce('', c_var_short) as coalesce_empty_with_c_short2 length=5, coalesce(., n_var_long) as coalesce_empty_with_n_long, coalesce(., n_var_short) as coalesce_empty_with_n_short, coalesce(., n_var_long) as coalesce_empty_with_n_long2 length=3, coalesce(., n_var_short) as coalesce_empty_with_n_short2 length=3 from root ; quit;
Note that the length of the ouput column is equal to the maximum length of the input parameters.
Variable | Type | Length |
---|---|---|
coalesce_empty_with_c_long | Char | 300 |
coalesce_empty_with_c_short | Char | 14 |
coalesce_empty_with_c_short2 | Char | 5 |
coalesce_empty_with_n_long | Num | 8 |
coalesce_empty_with_n_long2 | Num | 3 |
coalesce_empty_with_n_short | Num | 8 |
coalesce_empty_with_n_short2 | Num | 3 |
We have covered so far the use of the coalesce functions when all the parameters are of the same type. Let’s review the behaviours when there is a type mismatch.
In a data-step, the arguments are converted if required. More specifically, if an numeric argument is used with the coalescec function, SAS will convert it to a character variable. Such a conversion is straightforward and is unlikely to fail. SAS will issue however a note in the log. With the coalesce function, SAS will try to convert a character variable into a numeric variable. If such a conversion fails, the argument will be set to NULL.
data coalesce_type_issues; set root; *SAS is converting the numerical parameter as a string and issue a note; coalescec_c_short_with_n_short = coalescec(c_var_short, n_var_short); /*SAS is trying to convert the character parameter as a numerical. The conversion fails, and the converted value is set to null */ Therefore the returned value is equal to the value of the second parameter; coalesce_c_short_with_n_short = coalesce(c_var_short, n_var_short); run;
Running the previous piece of code will produce the following entry in the log:
NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 89:40 NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column). 92:31 NOTE: Invalid numeric data, c_var='string value', at line 92 column 31. c_var=string value n_var=10.3 c_long_var=>****************************************************************************************** ****************************************************************************************************** ****************************************************************************************************** ****< coalescec_c_with_n=string value coalesce_c_with_n=10.3 _ERROR_=1 _N_=1
While SAS will attempt to convert the parameters to the appropriate types, the coalesce function generates an error if there is a mismatch.
proc sql; select coalesce(c_var_short, n_var_short) as coalesce_c_with_n from root ; select coalesce(n_var_short, c_var_short) as coalesce_n_with_c from root ; select coalescec(n_var_short, n_var_short) as coalescec_n_with_n from root ; quit;
The previous piece of code will issue three consecutive errors of the same type:
ERROR: The COALESCE function requires its arguments to be of the same data type.
The following table summarizes the differences between coalesce and coalescec depending on the context of the processing step. These differences are critical to keep in mind especially when code is being re-factored to use a different processing context.
Context | DATA Step | SQL Step | ||
---|---|---|---|---|
Function | coalesce | coalescec | coalesce | coalescec |
Type of the Parameters | Numeric | Character | Character or Numeric. All the parameters need to be of the same type. | Character |
Result Length | 8 by default unless specified otherwise | 200 by default unless specified otherwise | The length is adjusted based on the length of the parameters | 200 by default unless specified otherwise |
Risk if length is not specified | No loss of precision | Possible truncation | No loss of precision | Possible truncation |
Type Conversion | Conversion from character to numeric may fail. Generates a note in the log. | Automatic conversion. Generates a note in the log. | Generates an error if there is an issue with conversion. | Automatic conversion. Generates a note in the log. |