Re: Parsing large CSV data file.



On Wed, 27 May 2009 03:42:01 -0400, arun <aragorn168b@xxxxxxxxx> wrote:

Hi,
I have a 180MB csv file. It has about 360 columns and 240000 rows.
Every cell has 1 character 'A', 'C', 'G', 'T' (nucleotides). I tried
to buffer the whole file using fread as follows (as it was suggested
to be the fastest in this forum).

fid = fopen('250k.csv','r');
str = fread(fid, inf, '*char').';

This takes roughly about 7-23 seconds. Then, because this gives out a
character vector, I have to get the array of values from this using
'\n' as the separator. So, I 'regexp' this vector str as,

rc = regexp(str, '\n', 'split') %takes roughly about 1-3 minutes!!

Now, my only option is to go line by line over these 240000 rows and
do the manipulations necessary. Every row has only 2 of the 4
characters. My objective is to find how many of those 2 char. are
present out of 360 and replace the individual cells with 0 (most
occurring char.) and 1 (least occurring char). Suppose AT are the 2
char in a row and they occur at 280 and 80, A = 0 and T = 1 in this
row. Because the size of the matrix is too huge, I decided to use
'sparse' command on every row as I create and add to the current
sparse matrix. I also had to use a 'regexprep' expression to remove
the 'commas' and 'numbers' from the row before doing this... So, the
whole loop looks like,

********
while length(rc) >=3, %I reduce the size of rc after reading each row
coz of memory issues.
cur = regexprep(rc{i},'[0123456789,]',''); %remove commas and
numbers, otherwise 'unique' gives the comma and numbers as well
uniq = unique(cur); %search for unique values
pos.snp1=[];pos.snp2=[];
if length(uniq) == 2, %find which positions do each of the 2 char.
occurs
pos.snp1 = findstr(uniq(1),cur);
pos.snp2 = findstr(uniq(2),cur);
else
error('Error in unique(cur) expression: more than 2 SNP''s
found');
end
%assign values
if length(pos.snp1) >= length(pos.snp2),
temp(pos.snp1) = 0;
temp(pos.snp2) = 1;
else
temp(pos.snp2) = 0;
temp(pos.snp1) = 1;
end
%this row is not necessary from rc anymore, so remove it.
rc(i) = '';
%sparse the temp vector, store in data. this keeps growing... but
couldn't find better alternative.
data = [data;sparse(temp)];
end
*******
My problem is that, this has been running for the last 9 hours. Is
there any other faster way to perform this loop operation of
accomplishing this entire task?


thank you very much,
arun.

Arun,
Couldnt you do this processing by reading in one line at a time?

This was interesting enough for me to give it a try:

in.csv:
a,t,a,t,a,a
g,c,g,c,g,c
a,g,c,t,a,g
a,g,a,g,g,a


fid=fopen('in.csv','rt');

curLine=fgetl(fid); %read one line at a time

finalMat=zeros(4,6); %ensure you update this

rowInd=1; %you could use a FOR loop instead..
while(curLine~=-1)

curLine=curLine(1:2:end);%Remove commas, update this if you have this format: 'a','c' instead
[uniqueNucleoTides uNInds reqInds]=unique(curLine); %the third output is very useful later on..

if(length(uniqueNucleoTides)~=2)
%error/warn..
finalMat(rowInd,:)=nan(1,6); %update '6'
else
reqInds(reqInds==1)=0;
reqInds(reqInds==2)=1;
if(length(find(reqInds==0)) >= length(find(reqInds==1)) )
%do nothing; Check if this behavior is ok for '=' condition above
else
reqInd=~reqInds;
end
finalMat(rowInd,:)=reqInds;
end
curLine=fgetl(fid);
rowInd=rowInd+1;
end
fclose(fid);

finalMat

.



Relevant Pages

  • Parsing large CSV data file.
    ... I have a 180MB csv file. ... My objective is to find how many of those 2 char. ... the 'commas' and 'numbers' from the row before doing this... ... there any other faster way to perform this loop operation of ...
    (comp.soft-sys.matlab)
  • importing csv files- entire record goes into each column
    ... This used to work before but now whenever I import a .csv file entire records are being put into each column variable but the commas are still there. ... char was a comma... ... Anyplace else I can check? ...
    (microsoft.public.excel.misc)
  • Re: How do I allow blank columns in a .CSV file?
    ... I am using the csv file to upload data from excel to a website - the website ... "Dave Peterson" wrote: ... not buying much by making sure that there are extra commas in the file. ...
    (microsoft.public.excel.misc)
  • Re: Creating CSV Comma Delimited File
    ... You may be able to eliminate the lines with just commas on it by resetting the ... new workbook as a .csv file. ... don't care about those extra columns. ... MsgBox "Upload File Saved for TECHNICIANS" ...
    (microsoft.public.excel.misc)
  • Re: How to import a large list of contacts?
    ... > may be possible to export the Excel file as a CSV file that will be almost ... > example of the text with commas in the CSV file. ... > aka Kuay Tim ... > I've got an excel file with about 40 e-mail address's. ...
    (microsoft.public.windows.inetexplorer.ie6_outlookexpress)