Re: mixture distribution
- From: Greg Heath <heath@xxxxxxxxxxxxxxxx>
- Date: Sun, 25 May 2008 19:15:33 -0700 (PDT)
On May 22, 4:40 pm, W <wzhang...@xxxxxxxxx> wrote:
Hello:
I have a group of diagnosis test data, 40 of which were from positive
samples and 200 of which were from negative samples.
How many measurement variables?
I want to
calculate a cutoff value to separate the pos and negs so that I can
classify the future data points to either positive or negative group.
Pretending I didn't know the labels of the samples, I used a mixed normal
model and optimized a cutoff value, but the separation wasn't very
good.
No. You should use the class-conditional normal model and use the
labels (supervised learning) to optimize the cutoff value.
What, exactly, do you want to minimize?
Is the 5:1 ratio a good estimate for out-of-sample data?
If not, you will have to adjust for the expected a priori
probabilities.
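
A minimal sketch of the class-conditional approach in R, under some loud assumptions: the data below are simulated stand-ins (not the original poster's), there is a single measurement variable, larger values indicate a positive, and the prior values and the helper names posterior_pos() and bayes_cutoff() are mine, chosen for illustration. It shows how the cutoff moves when the assumed a priori probability of a positive changes.

```r
# Simulated stand-in data (an assumption, not the poster's data)
set.seed(1)
x_neg <- rnorm(200, mean = 0, sd = 1)    # 200 labeled negatives
x_pos <- rnorm(40,  mean = 3, sd = 1.5)  # 40 labeled positives

# Class-conditional normal fits: one mean and sd per labeled class
m_neg <- mean(x_neg); s_neg <- sd(x_neg)
m_pos <- mean(x_pos); s_pos <- sd(x_pos)

# Posterior probability of "positive" at x for a given prior (Bayes rule)
posterior_pos <- function(x, prior_pos) {
  num <- prior_pos * dnorm(x, m_pos, s_pos)
  num / (num + (1 - prior_pos) * dnorm(x, m_neg, s_neg))
}

# Cutoff where the posterior crosses 0.5; the search interval is a
# heuristic that suits this simulated data, not a general guarantee
bayes_cutoff <- function(prior_pos) {
  uniroot(function(x) posterior_pos(x, prior_pos) - 0.5,
          lower = m_neg, upper = m_pos + 3 * s_pos)$root
}

cut_sample <- bayes_cutoff(40 / 240)  # prior taken from the 5:1 sample ratio
cut_rare   <- bayes_cutoff(0.05)      # hypothetical 5% prevalence in the field
```

A posterior of 0.5 corresponds to minimizing the overall error rate, which may not be the criterion you want; the point here is only that a smaller prior for positives pushes the cutoff upward.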
I don't think average error, equal error, or equal error rate
is appropriate for your problem. You probably should try to minimize
the false positive error rate given a stringent false negative error
rate. If the resulting false positive error rate is too high then you
have to use 2 cutoffs ... one for each class. Any measurement
resulting in a discriminant value lying between the two cutoffs should
be assigned to a "NOT CLASSIFIED" category.
Plotting both errors vs cutoff (ROC) will allow you
to do this rather easily.
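
A rough sketch of that cutoff sweep in R. The data are simulated stand-ins, the 5% ceilings max_fnr and max_fpr are hypothetical values picked for illustration, and I assume a single measurement where "positive" means x >= cutoff. The classify() helper is my own naming.

```r
# Simulated stand-in data (an assumption, not the poster's data)
set.seed(2)
x_neg <- rnorm(200, mean = 0, sd = 1)
x_pos <- rnorm(40,  mean = 3, sd = 1.5)

# Sweep every observed value as a candidate cutoff
cutoffs <- sort(c(x_neg, x_pos))
fnr <- sapply(cutoffs, function(k) mean(x_pos < k))   # positives called negative
fpr <- sapply(cutoffs, function(k) mean(x_neg >= k))  # negatives called positive

max_fnr <- 0.05  # hypothetical ceiling on the false negative rate
max_fpr <- 0.05  # hypothetical ceiling on the false positive rate

# Highest cutoff that still meets the FNR ceiling: this minimizes FPR
# for a single cutoff, because FPR can only fall as the cutoff rises
cut_low <- max(cutoffs[fnr <= max_fnr])
fpr_at_cut_low <- mean(x_neg >= cut_low)

# If that FPR is too high, add a second cutoff: the lowest one meeting
# the FPR ceiling. Values in between are left unclassified.
cut_high <- min(cutoffs[fpr <= max_fpr])

classify <- function(x) {
  ifelse(x >= cut_high, "positive",
         ifelse(x < cut_low, "negative", "NOT CLASSIFIED"))
}

table(classify(c(x_neg, x_pos)))
```

If cut_high turns out to be at or below cut_low, a single cutoff already satisfies both ceilings and nothing ends up unclassified. Plotting fnr and fpr against cutoffs gives the picture described above.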
When I went back to validate using the sample labels, I found
too many (30%) negative values were classified as positive values
(sensitivity was high but specificity was low).
In general, you should partition your data into 3 subsets:
a training set (in-sample) to calculate the discriminant, a validation
set (in-sample) to determine the cutoff, and a test set (out-of-sample)
to estimate the error rates on unseen data. However, often the
training set doubles as a validation set and there is no
independent validation set. This is not recommended for serious
work. If the performance estimated by the test set is not
quite satisfactory you should repartition the data and repeat instead
of biasing the estimate by twiddling parameters to improve the test
set error. In particular, the test set is not to be used, in any
way, to determine either the discriminant or the cutoff. It is only
used to estimate performance on unseen data.
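
One way the three-way partition could look in R, as a sketch only. The data are simulated, the 50/25/25 split and the 10% false negative ceiling are hypothetical choices, and the discriminant here is simply the log-likelihood ratio of two fitted normals (the function name discr() is mine).

```r
# Simulated stand-in data (an assumption, not the poster's data)
set.seed(3)
x <- c(rnorm(200, 0, 1), rnorm(40, 3, 1.5))
y <- rep(c("neg", "pos"), times = c(200, 40))

# Stratified random partition: about 50% train, 25% validation, 25% test
part <- character(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  q <- length(idx) %/% 4
  part[idx] <- sample(rep(c("train", "valid", "test"),
                          times = c(length(idx) - 2 * q, q, q)))
}
tr <- part == "train"; va <- part == "valid"; te <- part == "test"

# Training set: fit the class-conditional normals and build the discriminant
m_neg <- mean(x[tr & y == "neg"]); s_neg <- sd(x[tr & y == "neg"])
m_pos <- mean(x[tr & y == "pos"]); s_pos <- sd(x[tr & y == "pos"])
discr <- function(x) {
  dnorm(x, m_pos, s_pos, log = TRUE) - dnorm(x, m_neg, s_neg, log = TRUE)
}

# Validation set: highest cutoff keeping the false negative rate <= 10%
d_va_pos <- discr(x[va & y == "pos"])
cands <- sort(discr(x[va]))
cutoff <- max(cands[sapply(cands, function(k) mean(d_va_pos < k) <= 0.10)])

# Test set: touched once, only to estimate error rates on unseen data
d_te <- discr(x[te])
test_fnr <- mean(d_te[y[te] == "pos"] < cutoff)
test_fpr <- mean(d_te[y[te] == "neg"] >= cutoff)
```

With only 10 positives in the test set, test_fnr moves in steps of 0.1, which is one reason a single partition gives a noisy estimate.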
In order to reduce the variance of the test set estimate you may
need to average over many trials with different random partitions.
This can be done using 10-fold cross validation or bootstrapping.
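
A sketch of stratified 10-fold cross validation in R, again on simulated stand-in data. To keep it short I make a simplifying assumption: each fold classifies with a fixed rule (prior-weighted log-likelihood ratio greater than 0, priors taken from the training folds) rather than tuning a cutoff on a separate validation set.

```r
# Simulated stand-in data (an assumption, not the poster's data)
set.seed(4)
x <- c(rnorm(200, 0, 1), rnorm(40, 3, 1.5))
y <- rep(c("neg", "pos"), times = c(200, 40))

# Stratified fold assignment: each fold keeps the 5:1 class ratio
k <- 10
fold <- integer(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  fold[idx] <- sample(rep(1:k, length.out = length(idx)))
}

# Each fold is held out once; the model is fit on the other nine
pred <- character(length(y))
for (f in 1:k) {
  tr <- fold != f
  te <- !tr
  prior_pos <- mean(y[tr] == "pos")
  m_neg <- mean(x[tr & y == "neg"]); s_neg <- sd(x[tr & y == "neg"])
  m_pos <- mean(x[tr & y == "pos"]); s_pos <- sd(x[tr & y == "pos"])
  score <- log(prior_pos) + dnorm(x[te], m_pos, s_pos, log = TRUE) -
    log(1 - prior_pos) - dnorm(x[te], m_neg, s_neg, log = TRUE)
  pred[te] <- ifelse(score > 0, "pos", "neg")
}

# Error rates pooled over all held-out predictions
cv_fnr <- mean(pred[y == "pos"] == "neg")
cv_fpr <- mean(pred[y == "neg"] == "pos")
```

Repeating the whole procedure with different seeds and averaging cv_fnr and cv_fpr would reduce the variance further.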
Hope this helps.
Greg
The code I used was
written in R. I first guessed what the mean and sd could be for the
positive and negative values, and then ran optim() for
optimization. Is there any better way to do this?
Thanks much!
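# Negative log-likelihood of a two-component normal mixture.
# paras = c(p, u1, s1, u2, s2): mixing weight of component 1,
# then the mean and sd of each component.
lmix3 <- function(paras, x)
{
  p  <- paras[1]
  u1 <- paras[2]
  s1 <- paras[3]
  u2 <- paras[4]
  s2 <- paras[5]
  # Mixture density at each x, then negative log
  y <- -log(p * dnorm((x - u1) / s1) / s1 + (1 - p) * dnorm((x - u2) / s2) / s2)
  sum(y)
}
# tmpu1, tmps1, tmpu2, tmps2 are the initial guesses; tmpmixdata is the data vector
tmpp0 <- c(p = 0.9, u1 = tmpu1, s1 = tmps1, u2 = tmpu2, s2 = tmps2)
tmpoptim.rs <- optim(tmpp0, lmix3,
                     x = tmpmixdata, control = list(maxit = 2000))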