Re: about PCA and variability??



On Jun 26, 12:19 am, cprice <cpr...@xxxxxxxxx> wrote:
From your original set of many X variables, PCA lets you create a new
set of variables. This new set will have just as many variables as
your original set, but the point here is that, with only a few of
them, you could still capture almost all of the variance of the
original X variables.

As an exaggerated example, your outcome might be that the 1st new
variable captures 98% of the variance of all of the original
variables. This means that this one new variable does a good job of
representing the entire set of original variables. That is why there
is so much focus on wanting your new variable to have a large
variance.

As another example, if instead your outcome was that the 1st new
variable only accounted for 4% of the variance of the original
variables, we could say it does not do a good job of representing the
set of original variables.
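
To make the two cases above concrete, here is a minimal Python sketch (not part of the original post) that computes the share of total variance captured by the first PC. It assumes only two X variables, so the eigenvalues of the 2x2 covariance matrix have a closed form; the data and the function names are made up for illustration. With only two variables the first PC can never capture less than 50%, so a figure like 4% can only arise with many more variables.

```python
import math

def covariance(u, v):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def first_pc_share(x1, x2):
    """Fraction of total variance captured by the first PC of (x1, x2)."""
    a, b, c = covariance(x1, x1), covariance(x1, x2), covariance(x2, x2)
    # The eigenvalues of [[a, b], [b, c]] are the variances of the two PCs.
    centre = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest, smallest = centre + half_gap, centre - half_gap
    return largest / (largest + smallest)

# Hypothetical data: x2 is almost exactly 2 * x1, so one PC nearly suffices.
related = first_pc_share([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                         [2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Hypothetical data: uncorrelated variables with equal variance,
# so neither PC represents the pair better than the other.
unrelated = first_pc_share([1.0, -1.0, 1.0, -1.0],
                           [1.0, 1.0, -1.0, -1.0])

print(f"related:   {related:.4f}")
print(f"unrelated: {unrelated:.4f}")
```

For more than two variables you would normally hand the covariance matrix to an eigenvalue routine in whatever software you use; the ratio of the largest eigenvalue to the sum of all eigenvalues is the same quantity.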

Now, for a new PC variable that does capture a large percentage of the
original variance, does this mean this variable is good at predicting
some other response variable? The usual answer here is no, not
necessarily: the new variables are chosen only to explain the variance
of some set of variables, and do not take into account how any of the
variables relate to some response variable. In PCA, there are no
response or predictor variables.
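
A deliberately extreme, made-up example may help here. In the Python sketch below (not part of the original post; all names and numbers are hypothetical), x1 has a large variance but is unrelated to the response y, while x2 has a small variance and determines y exactly. It again assumes just two X variables, so the PC directions follow from the major-axis angle of the 2x2 covariance matrix.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(u, v):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def correlation(u, v):
    return covariance(u, v) / math.sqrt(covariance(u, u) * covariance(v, v))

def pc_scores(x1, x2):
    """Scores of both PCs of (x1, x2), largest-variance component first."""
    a, b, c = covariance(x1, x1), covariance(x1, x2), covariance(x2, x2)
    # Angle of the major axis of the covariance matrix [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2 * b, a - c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    m1, m2 = mean(x1), mean(x2)
    pc1 = [cos_t * (p - m1) + sin_t * (q - m2) for p, q in zip(x1, x2)]
    pc2 = [-sin_t * (p - m1) + cos_t * (q - m2) for p, q in zip(x1, x2)]
    return pc1, pc2

# Hypothetical data: x1 varies a lot but is unrelated to y;
# x2 varies little, yet y is exactly 3 * x2.
x1 = [10.0, -10.0, 10.0, -10.0, 10.0, -10.0, 10.0, -10.0]
x2 = [1.0, 1.0, -1.0, -1.0, 2.0, 2.0, -2.0, -2.0]
y = [3.0 * v for v in x2]

pc1, pc2 = pc_scores(x1, x2)
var1, var2 = covariance(pc1, pc1), covariance(pc2, pc2)
share_pc1 = var1 / (var1 + var2)

print(f"share of variance in PC1: {share_pc1:.3f}")
print(f"corr(PC1, y): {correlation(pc1, y):.3f}")
print(f"corr(PC2, y): {correlation(pc2, y):.3f}")
```

In this contrived data the first PC carries nearly all of the X variance and tells you nothing about y, while the low-variance second PC predicts y perfectly. Real data is rarely this clean, but it shows that the two notions of "explained variance" are different things.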

However, I am aware of another opinion, which I like. If you are using
some set of predictor variables to model some response variable, then
right off the bat, the reason you are doing this in the first place is
because you think this set of predictor variables is reasonable to try
and predict your response variable. If you then find one or two new PC
variables that can represent your original X variables, then it is
hardly any more of a stretch to use them for prediction, than it
already was to use your original X variables.

This could be a moot point though, because other methods do exist that
will create new variables that do explicitly take into account the
covariance between a response variable and some other set of
variables, such as partial least squares. I think there have been some
good posts on this topic in this group already.

Also, at the risk of bringing up more confusion, if your original
variables are in different units, and have greatly varying variances
for each of them, you will want to do your PCA from the correlation
matrix of your original X variables, as opposed to the covariance
matrix of your original X variables. This should be as simple as
checking off some option on whatever software you are using. The
reasoning is that PCs from a covariance matrix will be greatly skewed
towards the original variables with large variances.
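
As a rough illustration of that skew, the Python sketch below (not part of the original post) compares the first PC's loadings computed from the covariance matrix with those computed from the correlation matrix. The income and age figures are invented, the function names are hypothetical, and the two-variable closed form is an assumption made to keep the example short.

```python
import math

def covariance(u, v):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def first_pc_loadings(a, b, c):
    """Unit-length loadings of the first PC for the matrix [[a, b], [b, c]]."""
    theta = 0.5 * math.atan2(2 * b, a - c)  # angle of the major axis
    return math.cos(theta), math.sin(theta)

# Hypothetical data in very different units: dollars and years.
income = [30000.0, 45000.0, 52000.0, 61000.0, 75000.0]
age = [25.0, 31.0, 38.0, 44.0, 52.0]

var_income = covariance(income, income)
var_age = covariance(age, age)
cov_ia = covariance(income, age)
r = cov_ia / math.sqrt(var_income * var_age)

# Covariance matrix: the variable with the huge variance dominates.
from_cov = first_pc_loadings(var_income, cov_ia, var_age)

# Correlation matrix: both variables are put on an equal footing.
from_corr = first_pc_loadings(1.0, r, 1.0)

print("loadings (income, age) from covariance: ", from_cov)
print("loadings (income, age) from correlation:", from_corr)
```

Working from the correlation matrix is equivalent to dividing each variable by its standard deviation before running the PCA on the covariance matrix, which is what the software option mentioned above typically does.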

-CP

I am trying to think about it like this. Suppose there is no variability
in a given variable, say X, i.e. it is constant. I guess we would say
that such a variable would have no usefulness in predicting the value
of a response variable. I mean if X is always 5 then knowing what X is
will give us no chance to have any insight into the value of our
response variable (say Y).

But the converse of this idea, that if X is extremely variable (has a
lot of variability), then ... what? I cannot go to the next step
here.

Any help would be appreciated!
Thank you.



Hi, and thanks to all who have responded to my query. My question is, I
suppose, say in regard to the above post, why are we interested in
whether one of the original variables or one of the new derived
variables might capture 98% of the variance of all of the original
variables or not? I mean, I guess this is a very basic question, but
why are we interested in the variation of the variables in the first
place?
I recall from regression that we talk about the explained versus
unexplained variation. This is, as I recall, what percent of the total
variation of the response variable is explained by the explanatory
variable(s). So this is in the context of predicting the outcome of
the response variable and the higher the explained variation the
better the model (less error in the prediction).
So I am assuming that in PCA, the reason we are interested in how much
of the overall variation is covered by a variable is something to do
with how well it will work for predicting some response variable
outcome. But this does not seem to be right because we are not talking
about how much of the variation in the response variable (if there is
a response variable) is explained by our variable or variables
(either the original variables or the derived components). Rather we
are just talking about how much of the overall variation in our entire
set of variables is accounted for by one variable or one component or
a set of variables or a set of components. But I don't understand why
we are interested in this overall variation in the first place.
THANKS