# Re: Categorical Data Help

Mean and Mode

As far as I know, "mean" has no interpretation for categorical data,
because the categories have no numeric values to average.
The mode is just the category that appears most often.
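
To make the mode concrete, here is a minimal Python sketch that uses the X1 and X2 values from the example table further down. The helper name `modes` is my own, and returning every tied category (rather than picking one) is an assumption about how you would want ties handled.

```python
from collections import Counter

def modes(values):
    """Return all categories tied for the highest count, in first-seen order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return [category for category, n in counts.items() if n == top]

# Example columns copied from the table later in this post.
x1 = ["A", "C", "B", "B", "C"]
x2 = ["yes", "no", "no", "yes", "yes"]

print(modes(x1))  # ['C', 'B']  (B and C both appear twice)
print(modes(x2))  # ['yes']
```

Note that X1 has two modes here, which is one reason the mode can be a weak summary for small samples.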

Covariance

It would be easy to dismiss your question by just saying that
"covariance and variance do not apply to categorical data." But, in
fact, it's a very interesting question. One can legitimately ask: is
variation in one categorical variable associated with variation in
another? Since the concept of association is valid here, it's just a
matter of measuring association in a way that is interpretable as a
covariance.

Perhaps this could be done by replacing both categorical
variables with dichotomous dummy variables. A categorical variable
with K levels can be replaced with (K-1) binary dummy variables.

In the example shown below, categorical variable X1 is replaced
with dummy variables d1 and d2. Categorical variable
X2 is replaced with dummy variable d3.

X1   d1   d2   X2    d3
-----------------------
A    0    0    yes   1
C    1    1    no    0
B    1    0    no    0
B    1    0    yes   1
C    1    1    yes   1
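
As a rough illustration of the (K-1) dummy-variable idea, the Python sketch below uses the common indicator coding with the first level as the reference category. The function name `dummy_code` and the choice of reference level are assumptions of mine. Note that the table above codes level C as (1, 1), whereas indicator coding gives (0, 1); either set of columns, together with a constant, spans the same space, so a set-level measure such as the canonical correlation is the same under both codings.

```python
def dummy_code(values, levels):
    """Indicator-code `values`, using levels[0] as the reference category.

    Returns one row per observation with len(levels) - 1 binary entries.
    """
    non_reference = levels[1:]
    return [[1 if v == level else 0 for level in non_reference] for v in values]

# Example columns copied from the table above.
x1 = ["A", "C", "B", "B", "C"]
x2 = ["yes", "no", "no", "yes", "yes"]

x1_dummies = dummy_code(x1, ["A", "B", "C"])  # columns: is_B, is_C
x2_dummies = dummy_code(x2, ["no", "yes"])    # column: is_yes (matches d3)

for row in zip(x1, x1_dummies, x2, x2_dummies):
    print(row)
```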

One can calculate a coefficient of correlation between
two *sets* of variables--it's called the canonical correlation. It is
analogous to the Pearson correlation. Said differently, the Pearson
correlation (in absolute value) is the special case of the canonical
correlation in which each set contains only one variable. The (first)
canonical correlation can be understood as measuring the extent to
which variation in one set of variables is associated with variation
in the other.

In the example above, one could calculate the canonical correlation
between set {d1, d2} and set {d3}. This could be understood as a
"correlation" of the categorical variables X1 and X2 (but it might be
safer to call it the canonical correlation of dummy-coded versions of
X1 and X2).
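
Because the second set here contains a single variable, the canonical correlation reduces to the multiple correlation R from a least-squares regression of d3 on d1 and d2, so it can be computed without any special library. A minimal Python sketch under that assumption follows. The function name is made up for illustration, and sets with more than one variable on each side would need a general canonical correlation routine instead.

```python
import math

# Dummy variables copied from the example table.
d1 = [0, 1, 1, 1, 1]
d2 = [0, 1, 0, 0, 1]
d3 = [1, 0, 0, 1, 1]

def centered(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def canonical_corr_two_vs_one(x1, x2, y):
    """Canonical correlation between the set {x1, x2} and the single variable y.

    With one variable in the second set, this equals the multiple
    correlation R from regressing y on x1 and x2.
    """
    a, b, c = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y, syy = dot(a, c), dot(b, c), dot(c, c)
    det = s11 * s22 - s12 * s12
    if det == 0 or syy == 0:
        raise ValueError("degenerate data: constant or collinear variables")
    # Least-squares weights from the 2x2 normal equations (Cramer's rule).
    w1 = (s22 * s1y - s12 * s2y) / det
    w2 = (s11 * s2y - s12 * s1y) / det
    r_squared = (w1 * s1y + w2 * s2y) / syy
    return math.sqrt(max(r_squared, 0.0))

rho = canonical_corr_two_vs_one(d1, d2, d3)
print(round(rho, 4))  # 0.4082, i.e. sqrt(1/6)
```

With only five observations this number is purely illustrative; it says nothing reliable about an association between X1 and X2.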

Variance

You might be able to get a variance analog following the
dummy coding approach. For example, in the case of X1 above, you
could consider something like the average squared distance of
paired values {d1, d2} from the centroid--the centroid being
{mean(d1), mean(d2)}.
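
For what it's worth, that quantity is easy to compute. The Python sketch below takes the d1 and d2 columns from the table and averages the squared Euclidean distance of each (d1, d2) pair from the centroid. The function name is my own, and dividing by n rather than n-1 is an assumption.

```python
# Dummy variables for X1, copied from the example table.
d1 = [0, 1, 1, 1, 1]
d2 = [0, 1, 0, 0, 1]

def mean_sq_dist_from_centroid(*columns):
    """Average squared Euclidean distance of each row from the column means."""
    n = len(columns[0])
    centroid = [sum(col) / n for col in columns]
    total = 0.0
    for row in zip(*columns):
        total += sum((x - c) ** 2 for x, c in zip(row, centroid))
    return total / n

spread = mean_sq_dist_from_centroid(d1, d2)
print(round(spread, 4))  # 0.4
```

This is the same as adding up the (divide-by-n) variances of d1 and d2. One caveat: unlike the canonical correlation, this number changes with the dummy coding chosen, so it should be read with the coding in mind.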

These are just rough ideas. It is likely that someone has already
treated this subject in the published literature.

Hope this helps.

--
John Uebersax, PhD
