AutoReg

Types of variables

The independent variables are divided into categories so that the model can make fuller use of the information in the data.
The choices available to users are described below.


Response variable (R).
The response variable (labeled R) is the target used in estimating the model.
It is mandatory, and exactly one variable must be labeled R.


Qualitative variable (C).
A qualitative variable (C, stored as a number or as a character string) consists of classes that cannot be merged according to any numerical logic; an example is the variable "color of eyes" (brown, blue, green, black, ...).
These variables are treated as categorical variables (through the class option of the SAS genmod procedure) without any adjustment.
In correlation analysis, C variables are assessed with a value derived from the Simpson concentration index. The procedure compares each C variable, one at a time, with the other Q, O, X and C variables, producing a two-way table (m x n, where m and n are the numbers of modalities of the two variables) that contains the number of observations for each pair of modalities.
From this table, the Simpson concentration index is calculated on the distribution of each row and of each column, giving (m + n) indexes.
We then take the average of the m row indexes weighted by the marginal row counts, and the average of the n column indexes weighted by the marginal column counts; finally, we average the two resulting macro-indexes and use this value as a pseudo-correlation. If this value is higher than the threshold (input parameter taglio_correlazione), the two variables are considered correlated.
Since this value is not a true correlation, there is an input parameter (simpson) to control it: setting the parameter to 0 (the default value is 1) disables the use of this pseudo-correlation index.
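The calculation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the plain Simpson index (the sum of squared proportions) rather than the derived value that the procedure actually applies, which is not specified here, and the function names and the threshold value 0.8 are hypothetical.

```python
import numpy as np

def simpson_index(counts):
    """Plain Simpson concentration index: sum of squared proportions."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    return float(np.sum(p ** 2))

def simpson_pseudo_correlation(table):
    """Average of the count-weighted row-wise and column-wise Simpson indexes."""
    table = np.asarray(table, dtype=float)
    row_totals = table.sum(axis=1)   # marginal row counts (weights for the m row indexes)
    col_totals = table.sum(axis=0)   # marginal column counts (weights for the n column indexes)
    row_indexes = np.array([simpson_index(row) for row in table])
    col_indexes = np.array([simpson_index(col) for col in table.T])
    row_macro = np.average(row_indexes, weights=row_totals)
    col_macro = np.average(col_indexes, weights=col_totals)
    return float((row_macro + col_macro) / 2)

# Hypothetical 2 x 2 table of counts for two qualitative variables.
table = [[8, 2],
         [1, 9]]
taglio_correlazione = 0.8            # assumed threshold value, for illustration only
value = simpson_pseudo_correlation(table)
correlated = value > taglio_correlazione
```

With the plain index used in this sketch, a table where every row and column is concentrated in a single cell gives 1, while a table with evenly spread counts gives a lower value (0.5 for a uniform 2 x 2 table).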


Qualitative variable to compact by concentration (K).
If we have a qualitative variable that we want to group according to the concentration of the target variable, we can use this option (K). It is particularly useful when the variable has too many modalities and we want homogeneous groups.
K variables are handled in one of two ways, both when the variable is compressed (before and during the regression) and when the correlation is calculated: if the number of classes is higher than the initial number of classes chosen for the compression of X variables (see below), the procedure treats the variable like an X variable; otherwise it groups it as an ordinal variable (O, see below).
In both cases, the classes are ordered by the concentration of the target variable within each class.
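Ordering the classes of a K variable by the concentration of the target can be illustrated with the short Python sketch below. It assumes a 0/1 target and takes "concentration" to mean the share of target = 1 within each class; both are assumptions made for the example, and the function name is hypothetical.

```python
from collections import defaultdict

def order_classes_by_target_rate(classes, target):
    """Return (rank per class, target rate per class), ranks ascending by rate."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(classes, target):
        sums[c] += t
        counts[c] += 1
    rates = {c: sums[c] / counts[c] for c in counts}
    # Ties are broken by class name so that the result is deterministic.
    ordered = sorted(rates, key=lambda c: (rates[c], str(c)))
    ranks = {c: rank for rank, c in enumerate(ordered)}
    return ranks, rates

# Hypothetical data: three classes and a binary target.
classes = ["a", "b", "a", "c", "b", "c"]
target = [1, 0, 1, 0, 0, 1]
ranks, rates = order_classes_by_target_rate(classes, target)
```

Once the classes carry such a rank, they can be grouped in the same way as an ordered variable.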


Ordinal variable (O).
Ordinal variables (O) are already grouped into defined classes and, unlike the qualitative variables above, have an internal order. An example is a variable that identifies the level of education (0 = none, 1 = elementary, 2 = medium, 3 = high, 4 = university, ...).
To test whether an ordinal variable is correlated with another variable, we use the Spearman correlation (except when it is compared with a nominal variable).
During the model regression building phase, ordinal variables are managed with a complex process:

Note that a variable labeled O must be stored in a numeric column.
Before starting the procedure, it is important to know the number of distinct classes of the variable; as for X variables, see the related help page.


Numeric variable to compact (X).
Quantitative variables in the data can be used in two ways: either by inserting them in the model without any conversion (see below), or by compacting them into a qualitative variable using percentiles.
On the one hand, compression requires less effort from the user, who does not need to check that the concentration of the target variable is monotonic with respect to the current variable; on the other hand, the model estimated in this way can be misleading (overfitting) and processing time may increase significantly (as for O variables, see the related help page).
Note that the user can set the percentage width of the classes through the input parameter passo.
For correlation analysis, the index used for an X variable depends on the type of the variable it is compared with: Spearman correlation if the second variable is ordinal; the Simpson concentration index if it is nominal; Pearson correlation in all other cases (Q or X variables).
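The percentile-based compression of an X variable can be sketched in Python as below. This is a minimal illustration, not the procedure's own code: it assumes that passo is the width of each class in percentage points (for example 25 gives four classes), and the function name is hypothetical.

```python
import numpy as np

def compact_by_percentiles(values, passo):
    """Assign each value to a class delimited by percentiles spaced `passo` points apart."""
    values = np.asarray(values, dtype=float)
    cut_levels = np.arange(passo, 100, passo)            # e.g. 25, 50, 75 for passo = 25
    edges = np.unique(np.percentile(values, cut_levels)) # drop duplicate cut points
    # Class i contains the values v with edges[i-1] < v <= edges[i].
    return np.searchsorted(edges, values, side="left")

# Hypothetical data: the integers 1..100 split into classes of 25% each.
values = np.arange(1, 101)
classes = compact_by_percentiles(values, passo=25)
class_sizes = np.bincount(classes).tolist()
```

A smaller passo produces more classes, which is where the risk of overfitting and the longer processing time mentioned above come from.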


Quantitative variable (Q).
If the variable is numerical, we can try to insert it into the model without compression. Before the calculation, the user must check that the concentration of the target is monotonic with respect to the variable. If it is not, the user must correct the problem by changing the data (a typical example is a variable that has the highest concentration at both tails of its distribution: in such cases, the original variable is usually flipped so that the two tails overlap in the new variable).
In correlation analysis, Q variables are evaluated in the same way as X variables, so the index depends on the type of the variable they are compared with: Spearman correlation if the second variable is ordinal; the Simpson concentration index if it is nominal; Pearson correlation in all other cases (Q or X variables).
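The monotonicity check required for Q variables, and the flipping of a variable whose target concentration is highest at both tails, might look like the Python sketch below. It assumes a 0/1 target, measures concentration as the mean of the target within percentile bins, and flips the variable by taking its absolute distance from a chosen center; all of these choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def target_rate_by_bin(x, target, n_bins=5):
    """Mean of the target within percentile bins of x, in increasing order of x."""
    x = np.asarray(x, dtype=float)
    target = np.asarray(target, dtype=float)
    levels = np.linspace(0, 100, n_bins + 1)[1:-1]       # inner percentile levels
    edges = np.unique(np.percentile(x, levels))
    bins = np.searchsorted(edges, x, side="left")
    return np.array([target[bins == b].mean() for b in np.unique(bins)])

def is_monotonic(rates):
    """True if the rates never decrease or never increase."""
    steps = np.diff(rates)
    return bool(np.all(steps >= 0) or np.all(steps <= 0))

def fold_around(x, center):
    """Flip the variable around `center` so that the two tails overlap."""
    return np.abs(np.asarray(x, dtype=float) - center)

# Hypothetical data: the target equals 1 only in the two tails of x.
x = np.arange(-50, 51)
target = (np.abs(x) > 30).astype(int)
before = is_monotonic(target_rate_by_bin(x, target))
after = is_monotonic(target_rate_by_bin(fold_around(x, 0), target))
```

In this example the original variable fails the check, while the flipped variable passes it, because both tails now lie at the high end of the new variable.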


Identifier variable (I).
If we want to keep a variable in all output files without using it in the regression (for example because it is a key of the table), we must label it with I.


Variable not to be used (N).
If we don't want to use a variable, it must be labeled N.
Note that the process will not consider a variable without any label.


Correlation summary.
For a better understanding of the method used to evaluate the correlation between variables, have a look at the summary table below.


                             O (Ordinal var.)  X (Numeric var. to compact)  Q (Quantitative var.)  C (Qualitative var.)
O (Ordinal var.)             Spearman (S)      Spearman (S)                 Spearman (S)           Simpson (C)
X (Numeric var. to compact)  Spearman (S)      Pearson (P)                  Pearson (P)            Simpson (C)
Q (Quantitative var.)        Spearman (S)      Pearson (P)                  Pearson (P)            Simpson (C)
C (Qualitative var.)         Simpson (C)       Simpson (C)                  Simpson (C)            Simpson (C)
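The summary table follows a simple precedence rule, which the Python sketch below makes explicit: Simpson whenever a C variable is involved, otherwise Spearman whenever an O variable is involved, otherwise Pearson. The function name is hypothetical, and the sketch only covers the four types shown in the table.

```python
def correlation_method(type_a, type_b):
    """Return the index used to compare two variables of types O, X, Q or C."""
    pair = {type_a, type_b}
    if not pair <= {"O", "X", "Q", "C"}:
        raise ValueError("types must be one of O, X, Q, C")
    if "C" in pair:
        return "Simpson"     # any comparison involving a qualitative variable
    if "O" in pair:
        return "Spearman"    # any remaining comparison involving an ordinal variable
    return "Pearson"         # X or Q compared with X or Q

method = correlation_method("X", "O")
```

The rule is symmetric, so the order of the two arguments does not matter.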





Creation date: 17 Sep 2010
Translation date: 30 Dec 2012
Last change: 17 May 2013

Translation reviewed by Giulia Di Lallo