Speed up large file reading
- From: "Benjamin Kraus" <bkraus@xxxxxx>
- Date: Thu, 7 Jun 2012 02:08:07 +0000 (UTC)
I'm trying to speed up two pure m-file functions that I wrote for reading pieces of data from a binary data file. The data files range in size from about 800MB to as large as 2GB+.
I have the functions working, but they are horribly slow. I think this is mostly due to the way the file is organized, which forces many separate calls to fread rather than one call that reads a large chunk of the file. The majority of the file (and hence of the reading time) consists of millions of "data blocks". Each data block has a fixed-size header followed by a variable amount of data. The data block header has the following definition in C:
struct DataBlockHeader
{
    short Type;                     // Data type; 1=spike, 4=Event, 5=continuous
    unsigned short MSBTimestamp;    // Upper 8 bits of the 40 bit timestamp
    unsigned long LSBTimeStamp;     // Lower 32 bits of the 40 bit timestamp
    short Channel;
    short Unit;
    short NumberOfWaveforms;
    short NumberOfWordsInWaveform;
}; // 16 bytes
The product of "NumberOfWaveforms" and "NumberOfWordsInWaveform" gives the size, in 16-bit words, of the data following the header.
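To make the header layout concrete, here is a minimal sketch that decodes one header after it has been read as eight int16 values. The function name decodeBlockHeader and the output field names are hypothetical, and the sketch assumes little-endian storage (so the low half of the 32-bit timestamp field comes first); neither is confirmed by the format's documentation here.

```matlab
% Hypothetical helper (save as decodeBlockHeader.m): decode one 16-byte
% data block header read as eight int16 values, e.g.
%   raw = fread(fid, 8, '*int16');
function h = decodeBlockHeader(raw)
    raw = int16(raw(:));
    % Reinterpret the three timestamp words as unsigned without changing bits.
    u = double(typecast(raw(2:4), 'uint16'));
    h.Type    = double(raw(1));
    h.Channel = double(raw(5));
    h.Unit    = double(raw(6));
    % ASSUMPTION: little-endian file, so u(2) is the low half of LSBTimeStamp.
    lsb = u(2) + u(3) * 2^16;
    % A 40-bit integer is represented exactly by a double.
    h.Timestamp = u(1) * 2^32 + lsb;
    % Convert to double before multiplying; int16 arithmetic would saturate.
    h.DataWords = double(raw(7)) * double(raw(8));
    h.DataBytes = 2 * h.DataWords;
end
```

With this, the next header starts h.DataBytes bytes after the end of the current one.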
I'm writing two different functions. The first is supposed to simply count the number of data blocks of each 'Type', 'Channel', and 'Unit' (this information is *not* available in the file header, nor is the order of the data predictable in any way). The second function is to extract all the data that has a specific 'Type', and has a 'Channel' and 'Unit' within a supplied list.
Because of the variable-size data blocks and the different meanings of 'Channel' and 'Unit' depending on the 'Type', my implementation of the first function looks something like this:
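count1 = zeros(96,26); % These sizes are determined from file header
count4 = zeros(255,1);
count5 = zeros(64,1);
dsz = 0; % Data block size (in 16-bit words) of the previous block
dh = fread(fid, 8, '*short'); % First header: 8 shorts = 16 bytes
while(~feof(fid))
    % dh holds the previous block's data (dsz words) followed by the next header
    if(dh(dsz+1)==1)
        count1(dh(dsz+5),dh(dsz+6))=count1(dh(dsz+5),dh(dsz+6))+1;
    elseif(dh(dsz+1)==4)
        count4(dh(dsz+5))=count4(dh(dsz+5))+1;
    elseif(dh(dsz+1)==5)
        count5(dh(dsz+5))=count5(dh(dsz+5))+1;
    end
    % Convert to double first: an int16 product saturates at 32767
    dsz = double(dh(dsz+7))*double(dh(dsz+8));
    dh = fread(fid, dsz+8, '*short'); % This block's data plus the next header
end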
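The same traversal can also be written against an array that is already in memory, which at least makes the logic easy to check on synthetic data. This is only a sketch: the function name countBlocks and the variable w are hypothetical, it assumes the whole data-block region has been read into one int16 vector, and whether it is actually faster on real files is untested here.

```matlab
% Hypothetical function (save as countBlocks.m): count data blocks by
% walking an int16 vector w that holds headers and data back to back.
function [count1, count4, count5] = countBlocks(w)
    count1 = zeros(96,26);   % Same sizes as in the loop above
    count4 = zeros(255,1);
    count5 = zeros(64,1);
    n = numel(w);
    p = 1;                   % Index of the first word of the current header
    while p + 7 <= n         % Stop when no complete 8-word header remains
        ch = double(w(p+4)); % Channel
        switch w(p)          % Type
            case 1
                un = double(w(p+5)); % Unit
                count1(ch,un) = count1(ch,un) + 1;
            case 4
                count4(ch) = count4(ch) + 1;
            case 5
                count5(ch) = count5(ch) + 1;
        end
        % Skip the header (8 words) and its data; double avoids int16 saturation.
        p = p + 8 + double(w(p+6)) * double(w(p+7));
    end
end
```

It would be called as w = fread(fid, Inf, '*int16'); after seeking past the file header, followed by [count1, count4, count5] = countBlocks(w);. Holding a 2GB file in memory this way needs about 2GB of RAM, so reading in large chunks may be necessary for the biggest files.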
The company that created this format (and wrote the software producing the data) has a (Windows only) software utility that can do this on the order of seconds. They also distribute a (closed source, Windows only, and extremely buggy) MEX library that also takes on the order of seconds (it crashed MATLAB three times while I was trying to take that measurement). My function takes on the order of minutes.
For example, on one unusually small file I've been using for testing (141MB), it took their MEX library about 2 seconds, their standalone GUI client about 2 seconds, and my function about 60 seconds.
Can anybody think of ways to improve the execution time of these functions so that they run within the same order of magnitude as the closed source versions? I would like to avoid a MEX file implementation: I'm trying to keep this readily cross-platform, my C is much rustier than my MATLAB, and I want to avoid some of the bugginess of the closed source version. I'll go that route if necessary, though.
- Ben