# AutoReg

## Choosing the number of classes for a variable to group

When we use X or O variables,
we should be aware of the computational load they place on the machine.

The macro variable *passo* is the parameter with the strongest influence:
the value passed to the process sets the width of each generated class, as a percentage of the population.

The default value of this macro input variable is 10, so the program tries to create classes that each contain
10% of the population: unless the values are heavily concentrated, the new variable will probably
have 9-11 distinct classes.
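The effect of *passo* on the number of classes can be illustrated with a small Python sketch. It is not the AutoReg code: the function name `assign_classes` and the rule that tied values always share a class are assumptions made only for this example.

```python
def assign_classes(values, passo=10):
    """Assign each value to a class covering roughly `passo` percent
    of the population (hypothetical helper, not part of AutoReg)."""
    n = len(values)
    # Rank of the first occurrence of each value, so that ties share a class.
    first_rank = {}
    for rank, v in enumerate(sorted(values)):
        first_rank.setdefault(v, rank)
    # Percentile position of the value, cut into bands of width `passo`.
    return [int((first_rank[v] * 100) // (n * passo)) for v in values]


# Evenly spread values: the default passo gives ten classes.
spread = list(range(100))
print(len(set(assign_classes(spread, passo=10))))

# Half of the population concentrated on one value: fewer classes survive.
concentrated = [0] * 50 + list(range(1, 51))
print(len(set(assign_classes(concentrated, passo=10))))
```

With evenly spread data the class count is close to 100 divided by *passo*; a concentration of identical values collapses several bands into one, which is why the count can drift away from ten.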

The algorithm first uses these classes individually and then merges them step by step, so the potential
number of regressions can be estimated as a function of the number of classes: we count the groupings
available at each step of the stepwise method until all classes are merged into a single one,
excluding any backward steps.

If, for example, we have a single class (not a good thing in statistics), the procedure computes only
one regression because there are no classes to merge or split.

With two classes (*a* and *b*), we have two regressions: the first, *a-b*, with the two classes
kept separate, and the second, *a-a*, after the two classes have been merged into one.

If we decide to divide the variable into three classes, the number of potential regressions rises to four
(*a-b-c / a-a-c / a-b-b / a-a-a*); it rises to seven with four classes
(*a-b-c-d / a-a-c-d / a-b-b-d / a-b-c-c / a-a-a-d / a-a-c-c / a-a-a-a*), which corresponds to an input parameter
*passo* of about 25-30.
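The groupings listed above can be reproduced with a short Python sketch. It assumes that only adjacent classes are merged and that the first candidate is kept at every stage; the real procedure would choose the merge according to its own statistical criterion, so `merge_path` is a hypothetical helper that only illustrates the counting.

```python
def merge_path(n_classes):
    """List the groupings evaluated along one forward-only merge path."""
    groups = [[chr(ord("a") + i)] for i in range(n_classes)]

    def pattern(gs):
        # Label every class with the first class of its group, e.g. a-a-c-d.
        return "-".join(g[0] for g in gs for _ in g)

    evaluated = [pattern(groups)]  # regression with all classes kept separate
    while len(groups) > 1:
        # One candidate regression for each pair of adjacent groups.
        candidates = [
            groups[:i] + [groups[i] + groups[i + 1]] + groups[i + 2:]
            for i in range(len(groups) - 1)
        ]
        evaluated.extend(pattern(c) for c in candidates)
        # Assumption: always keep the first candidate (no backward steps).
        groups = candidates[0]
    return evaluated


print(" / ".join(merge_path(4)))
# a-b-c-d / a-a-c-d / a-b-b-d / a-b-c-c / a-a-a-d / a-a-c-c / a-a-a-a
```

Whichever candidate is kept, a stage with *k* groups always offers *k-1* adjacent merges, so the total count does not depend on the path chosen.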

As the number of classes increases, the number of potential regressions increases as well; if we think of it
as a function of the number of classes, it follows this trend: *f(n) = f(n-1) + (n-1)*, with *f(1) = 1*,
so the number of potential regressions for a variable with *n* classes is the number of potential
regressions for a variable with *(n-1)* classes plus *(n-1)*.
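As a quick check of the recurrence, the Python sketch below evaluates it iteratively and compares it with the equivalent closed form *f(n) = 1 + n(n-1)/2*, obtained by summing the increments 1, 2, ..., *n-1*. The function names are illustrative and not part of AutoReg.

```python
def potential_regressions(n):
    """Evaluate f(n) = f(n-1) + (n-1), starting from f(1) = 1."""
    total = 1
    for k in range(2, n + 1):
        total += k - 1
    return total


def potential_regressions_closed(n):
    """Closed form of the same recurrence: 1 + n(n-1)/2."""
    return 1 + n * (n - 1) // 2


for n in (1, 2, 3, 4, 10, 20):
    print(n, potential_regressions(n), potential_regressions_closed(n))
```

Both functions agree for every number of classes, so the closed form can be used directly when estimating the workload for a given *passo*.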

The relative growth rate decreases as the number of classes increases, although each additional class
adds more regressions than the previous one.
With the default value of the *passo* parameter (about ten classes),
we have 46 regressions for each X or O variable at each step of the procedure.

To give a clearer picture of the numbers above, here is a summary table.

| Number of classes | Value of *passo* | Potential regressions |
|---|---|---|
| 1 | 100 | 1 |
| 2 | 50 | 2 |
| 3 | 34 | 4 |
| 4 | 25 | 7 |
| 5 | 20 | 11 |
| 6 | 17 | 16 |
| 7 | 15 | 22 |
| 8 | 13 | 29 |
| 9 | 11 | 37 |
| 10 | 10 | 46 |
| 11 | 9 | 56 |
| 12 | 8 | 67 |
| 13 | 8 | 79 |
| 14 | 7 | 92 |
| 15 | 7 | 106 |
| 16 | 6 | 121 |
| 17 | 6 | 137 |
| 18 | 6 | 154 |
| 19 | 5 | 172 |
| 20 | 5 | 191 |

*Creation date: 17 Sep 2010*

*Translation date: 30 Dec 2012*

*Last change: 14 Apr 2013*

*Translation reviewed by Giulia Di Lallo*