question about training samples for logit



Hi all.

It has been a long time since I've done analysis like the following, so
I am a bit rusty. I've been searching the web and various newsgroups and
haven't found much applicable information, so I thought I would throw
this question out to you.


I have a 16-million record dataset. Each record is a person.
The 16 million people are broken down into 4 groups with the following
distribution:
62% group 1
1.5% group 2
3.5% group 3
33% group 4


38,000 of the 16 million have done X. The rest have not.
The distribution of the Xers across the four groups is:
25% group 1
8.5% group 2
25% group 3
41.5% group 4
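
To make the imbalance concrete, here is a rough sketch that turns the percentages above into per-group counts and X rates. The figures are the ones quoted in this post; the function and variable names are just illustrative, and the counts are rounded because the percentages are themselves rounded.

```python
# Per-group sizes and X rates implied by the percentages quoted above.
TOTAL = 16_000_000   # people in the dataset
X_TOTAL = 38_000     # people who have done X

# Share of all people in each group, and share of all Xers in each group.
group_share = {1: 0.62, 2: 0.015, 3: 0.035, 4: 0.33}
x_share = {1: 0.25, 2: 0.085, 3: 0.25, 4: 0.415}


def group_rates(total, x_total, group_share, x_share):
    """Return {group: (group_size, n_xers, x_rate)} using rounded counts."""
    rates = {}
    for g in group_share:
        n_group = round(total * group_share[g])
        n_x = round(x_total * x_share[g])
        rates[g] = (n_group, n_x, n_x / n_group)
    return rates


rates = group_rates(TOTAL, X_TOTAL, group_share, x_share)
for g, (n_group, n_x, rate) in rates.items():
    print(f"group {g}: {n_group:>10,} people, {n_x:>6,} Xers, rate {rate:.4%}")
```

So X is rare everywhere: roughly 0.1% of group 1 at the low end and roughly 1.7% of group 3 at the high end.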


I want to predict the probability of doing X.
I have a set of variables that I believe will work well to predict X,
above and beyond the variable that places someone in group 1, 2, 3, or 4.
I want to run the models separately on each group, because I believe
that their error terms are unrelated.


Is there a "proper" way to construct a training set for this
experiment?


One thought was to randomly sample 70% of the Xers in group 1 (for
example), leaving 30% as a holdout sample, and then select a random
sample of the same size of non-Xers in group 1.
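
In code, the scheme I have in mind for a single group would look something like the sketch below. It uses only the Python standard library; `split_and_match`, the toy ID ranges, and the fixed seed are made up for illustration, and the real IDs would come from the dataset.

```python
import random


def split_and_match(x_ids, non_x_ids, train_frac=0.7, seed=0):
    """Split Xers into train/holdout, then draw an equal-sized random
    sample of non-Xers to pair with the training Xers."""
    rng = random.Random(seed)

    # Shuffle the Xers and cut at train_frac.
    shuffled = list(x_ids)
    rng.shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    train_x = shuffled[:n_train]
    holdout_x = shuffled[n_train:]

    # Same number of non-Xers, sampled without replacement.
    train_non_x = rng.sample(list(non_x_ids), n_train)
    return train_x, holdout_x, train_non_x


# Toy example: 100 Xers and 10,000 non-Xers in one group (made-up IDs).
x_ids = range(100)
non_x_ids = range(100, 10_100)
train_x, holdout_x, train_non_x = split_and_match(x_ids, non_x_ids)
print(len(train_x), len(holdout_x), len(train_non_x))
```

One thing I am unsure about with this design: a model fit on a 50/50 sample will not reflect the true base rate of X in the group, so I assume the predicted probabilities would need some adjustment before they could be read as population probabilities.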


Any advice, directions, examples, experience, etc. would be much
appreciated.


-jen
