Re: Principal Component Analysis
- From: "Gaj Vidmar" <gaj.vidmar@xxxxxxxxxxxx>
- Date: Thu, 15 May 2008 14:59:34 +0200
Let me start with the last issue, i.e., similarity/distance measure for
mixed data.
There's enough research on that (especially recently); I'll just mention
Gower's index (General Coefficient of Similarity, if I recall correctly) and
its extensions (to include ordinal data etc.).
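To make the idea concrete, here is a minimal sketch of a simplified Gower-type similarity between two cases. It is not the full coefficient: it assumes only numeric and categorical variables, no missing values and equal weights. The variable names and records are hypothetical, made up for illustration.

```python
def gower_similarity(a, b, kinds, ranges):
    """Simplified Gower similarity between two records.

    kinds[k] is "num" or "cat"; ranges[k] is the observed range of
    numeric variable k (None for categorical variables).
    """
    scores = []
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            # Numeric: 1 minus the range-scaled absolute difference.
            if rng > 0:
                scores.append(1.0 - abs(x - y) / rng)
            else:
                scores.append(1.0)
        else:
            # Categorical: simple match / mismatch.
            scores.append(1.0 if x == y else 0.0)
    return sum(scores) / len(scores)


# Hypothetical records: (age, family history, systolic blood pressure)
records = [
    (40, "yes", 120),
    (60, "no", 140),
    (50, "yes", 130),
]
kinds = ["num", "cat", "num"]

# Observed range of each numeric variable, computed from the data.
ranges = []
for k, kind in enumerate(kinds):
    if kind == "num":
        values = [r[k] for r in records]
        ranges.append(max(values) - min(values))
    else:
        ranges.append(None)

sim_01 = gower_similarity(records[0], records[1], kinds, ranges)
sim_02 = gower_similarity(records[0], records[2], kinds, ranges)
print(sim_01, sim_02)
```

One minus this similarity can be used as a distance between cases, for instance as input to a clustering routine.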
If you just have numerical and binary data, Pearson correlation will not be
such a terrible idea (please note that I'm writing from a strictly applied,
"help-the-client--considering-all-the-tradeoffs" perspective).
---
To add to another poster championing PLS, Minitab also does it.
---
On a general note, to summarise things simplistically:
- There is a point in the abovementioned poster's concern with losing the
information/variables that are most relevant to predicting the outcome if
reducing the dimensionality of the predictors without regard to the outcome
(the whole critique of Principal Components Regression by people immensely
more qualified than me, including a whole book, is also related to that).
- But there is also a point that with "too many variables" [with regard to
the number of cases], in order to avoid capitalisation on chance (overfit,
lack of generalisability or however-you-call-it) while avoiding learning
stuff too advanced for "only semi-smart people" (regularised discriminant
analysis, shrinkage a la Prof. Harrell etc.), you can get valid results with
simple methods precisely and only by ignoring the outcome when reducing the
dimensionality first (with PCA, clustering [followed by selection of
"representatives" or producing a score on each "variable group"], CATPCA
[after wise discretisation of numeric variables], even FA, or some other
way), doing, of course, everything cum grano salis (i.e., with subject
matter knowledge, "feeling" for data etc.).
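
As a rough illustration of the second point (reduce first, ignoring the outcome, then model with something simple), here is a sketch that extracts only the first principal component by power iteration on the correlation matrix and then regresses the outcome on that single score. All data and names are hypothetical. A real analysis would use a proper PCA routine and probably more than one component.

```python
import math


def standardize(col):
    """Centre a column and scale it to unit sample standard deviation."""
    n = len(col)
    m = sum(col) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
    return [(v - m) / sd for v in col]


def first_component(columns, iterations=200):
    """Loadings and scores of the first principal component, found by
    power iteration on the predictors' correlation matrix.
    The outcome is deliberately not used here."""
    z = [standardize(c) for c in columns]
    n, p = len(z[0]), len(z)
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    scores = [sum(w[j] * z[j][i] for j in range(p)) for i in range(n)]
    return w, scores


def simple_regression(x, y):
    """Ordinary least squares of y on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical, strongly related predictors and a hypothetical outcome.
predictors = [
    [60, 72, 81, 95, 68, 77],    # weight
    [21, 24, 27, 31, 23, 26],    # bmi
    [70, 80, 90, 104, 78, 86],   # waist
]
outcome = [110, 122, 130, 145, 118, 127]

loadings, scores = first_component(predictors)
intercept, slope = simple_regression(scores, outcome)
print(loadings, intercept, slope)
```

The outcome enters only in the last step, which is exactly the trade-off described above: less risk of capitalising on chance, at the price of possibly discarding outcome-relevant variation.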
---
Yeah, as they say, it all depends.
Best regards,
Gaj Vidmar, PhD
Institute for Rehabilitation, Republic of Slovenia
& Univ. of Ljubljana, Fac. of Medicine, Inst. of Biomedical Informatics
"David" <david_arteta@xxxxxxxxxxx> wrote in message
news:d6e6fb0f-6a55-4fae-9d44-fe4c4c166f0a@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On May 15, 8:32 am, John Uebersax <jsueber...@xxxxxxxxx> wrote:
Hi David,
Some suggestions:
1. If your categorical variables are ordered-categorical, then you
can calculate:
a. Pearson correlations between each pair of continuous variables.
b. Polyserial correlations between continuous and ordered-
categorical variables
c. Polychoric correlations between ordered-categorical variables
then place these in a single matrix and analyze that matrix by PCA. A
program like LISREL/Prelis will do all this for you more-or-less
automatically.
2. Although I agree with what others have posted, personally I prefer
the approach you originally suggested: to approach data-reduction and
the modeling of your response variable as two separate steps.
3. Since you just want to select a subset of non-redundant variables,
you have other options besides PCA. For example, you can use
hierarchical cluster analysis on the correlation matrix. That will
divide your variables into clusters. Then you can pick 'exemplars'
from each cluster and use those in your data model. This gives you
more flexibility, because you can use other measures of similarity/
redundancy among your variables besides correlation coefficients. For
example, if your categorical variables are non-ordered (i.e., purely
nominal variables), you can calculate the canonical correlation
between each pair of them. Then you can cluster analyze the matrix of
canonical correlation coefficients to divide the variables into
separate groups, and then select exemplars from each group.
Possibly you can include the canonical correlations in the overall
matrix as described in point 1 above -- I'm not sure, because they
might tend to run lower overall than Pearson correlations.
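
The procedure in point 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: distance between variables is taken as 1 - |r| with r the Pearson correlation, clustering is plain average linkage, and the exemplar is the variable closest on average to the others in its cluster. The data and variable names are hypothetical, and any other similarity measure could be substituted for the Pearson correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_variables(columns, n_clusters):
    """Average-linkage agglomerative clustering of variables.

    columns maps a variable name to its list of values; the distance
    between two variables is 1 - |r|.
    """
    names = list(columns)
    dist = {
        (a, b): 1.0 - abs(pearson(columns[a], columns[b]))
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean pairwise distance between the two clusters.
                d = sum(dist[a, b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, dist


def pick_exemplar(cluster, dist):
    """Return the variable closest on average to the rest of its cluster."""
    if len(cluster) == 1:
        return cluster[0]
    return min(cluster, key=lambda a: sum(dist[a, b] for b in cluster if b != a))


# Hypothetical data: six cases, five variables (binary coded 0/1).
columns = {
    "weight": [60, 72, 81, 95, 68, 77],
    "bmi": [21, 24, 27, 31, 23, 26],
    "waist": [70, 80, 90, 104, 78, 86],
    "age": [34, 61, 45, 52, 29, 70],
    "family_history": [0, 1, 0, 1, 0, 1],
}

clusters, dist = cluster_variables(columns, n_clusters=2)
exemplars = [pick_exemplar(c, dist) for c in clusters]
print(clusters, exemplars)
```

The number of clusters is chosen by hand here; in practice one would inspect the dendrogram and use subject-matter knowledge when picking the exemplars.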
Hope this helps.
John Uebersax PhD
On May 8, 6:21 pm, David <david_art...@xxxxxxxxxxx> wrote:
Dear list,
is it possible to use PCA on categorical data?
I have a group of 30 continuous and categorical data and would like to
select a subset for modeling a response variable. I read that PCA
would help me doing this data reduction, but all the examples I have
seen involve continuous data.
Thanks for your help,
D.
Thank you all for your input. Here are some comments:
- Art, you suggest some PCA methods, but my initial worry about using
PCA is losing interpretability
- Paige, you suggest PLS, but is PLS not doing effectively what PCA
does or Principal Component Regression? I have just read through it
quickly, and had a look at Faraway's "Practical Regression and ANOVA
using R" and it says "On the other hand, PLS is virtually useless for
explanation purposes". So how can I trace back my regressors after
doing PLS?
- John, you suggest calculating a correlation matrix for all pairwise
comparison of my variables and then performing hierarchical clustering
to select a representative of each of the groups. That sounds very
interesting. So if I have 30 variables, should I end up with a 30x30
correlation matrix that could be fed to a clustering algorithm? My
categorical variables are generally non-ordered, like "family history"
yes-no. What kind of correlation measurement could I use for non-
ordered categorical variables?
Thanks for your useful comments
D.