Re: Parsing large CSV data file.
- From: "Ashish Uthama" <first.last@xxxxxxxxxxxxx>
- Date: Fri, 29 May 2009 16:48:23 -0400
On Wed, 27 May 2009 03:42:01 -0400, arun <aragorn168b@xxxxxxxxx> wrote:
Hi,
I have a 180MB csv file. It has about 360 columns and 240000 rows.
Every cell has 1 character 'A', 'C', 'G', 'T' (nucleotides). I tried
to buffer the whole file using fread as follows (as it was suggested
to be the fastest in this forum).
fid = fopen('250k.csv','r');
str = fread(fid, inf, '*char').';
This takes roughly about 7-23 seconds. Then, because this gives out a
character vector, I have to get the array of values from this using
'\n' as the separator. So, I 'regexp' this vector str as,
rc = regexp(str, '\n', 'split') %takes roughly about 1-3 minutes!!
Now, my only option is to go line by line over these 240000 rows and
do the manipulations necessary. Every row has only 2 of the 4
characters. My objective is to find how many of those 2 char. are
present out of 360 and replace the individual cells with 0 (most
occurring char.) and 1 (least occurring char). Suppose AT are the 2
char in a row and they occur at 280 and 80, A = 0 and T = 1 in this
row. Because the size of the matrix is too huge, I decided to use
'sparse' command on every row as I create and add to the current
sparse matrix. I also had to use a 'regexprep' expression to remove
the 'commas' and 'numbers' from the row before doing this... So, the
whole loop looks like,
********
while length(rc) >=3, %I reduce the size of rc after reading each row
coz of memory issues.
cur = regexprep(rc{i},'[0123456789,]',''); %remove commas and
numbers, otherwise 'unique' gives the comma and numbers as well
uniq = unique(cur); %search for unique values
pos.snp1=[];pos.snp2=[];
if length(uniq) == 2, %find which positions do each of the 2 char.
occurs
pos.snp1 = findstr(uniq(1),cur);
pos.snp2 = findstr(uniq(2),cur);
else
error('Error in unique(cur) expression: more than 2 SNP''s
found');
end
%assign values
if length(pos.snp1) >= length(pos.snp2),
temp(pos.snp1) = 0;
temp(pos.snp2) = 1;
else
temp(pos.snp2) = 0;
temp(pos.snp1) = 1;
end
%this row is not necessary from rc anymore, so remove it.
rc(i) = '';
%sparse the temp vector, store in data. this keeps growing... but
couldn't find better alternative.
data = [data;sparse(temp)];
end
*******
My problem is that, this has been running for the last 9 hours. Is
there any other faster way to perform this loop operation of
accomplishing this entire task?
thank you very much,
arun.
Arun,
Couldnt you do this processing by reading in one line at a time?
This was interesting enough for me to give it a try:
in.csv:
a,t,a,t,a,a
g,c,g,c,g,c
a,g,c,t,a,g
a,g,a,g,g,a
fid=fopen('in.csv','rt');
curLine=fgetl(fid); %read one line at a time
finalMat=zeros(4,6); %ensure you update this
rowInd=1; %you could use a FOR loop instead..
while(curLine~=-1)
curLine=curLine(1:2:end);%Remove commas, update this if you have this format: 'a','c' instead
[uniqueNucleoTides uNInds reqInds]=unique(curLine); %the third output is very useful later on..
if(length(uniqueNucleoTides)~=2)
%error/warn..
finalMat(rowInd,:)=nan(1,6); %update '6'
else
reqInds(reqInds==1)=0;
reqInds(reqInds==2)=1;
if(length(find(reqInds==0)) >= length(find(reqInds==1)) )
%do nothing; Check if this behavior is ok for '=' condition above
else
reqInd=~reqInds;
end
finalMat(rowInd,:)=reqInds;
end
curLine=fgetl(fid);
rowInd=rowInd+1;
end
fclose(fid);
finalMat
.
- Prev by Date: Re: Creating a 3d plot from 3 vectors
- Next by Date: Help! reading binary data from .rf file
- Previous by thread: ifft treat column&row vector differently?
- Next by thread: Help! reading binary data from .rf file
- Index(es):
Relevant Pages
|