Re: comparison of large data sets
- From: "Ed Prochak" <ed.prochak@xxxxxxxxxxxxxxxxxx>
- Date: 26 Sep 2005 13:42:44 -0700
Darek wrote:
> Hi
> I have a problem:
> There are 2 systems with, i.e two tables
> System A:
> tableA1 10 mln rows (about 5GB)
> tableA2 50 mln rows (about 15GB)
>
> System B:
> tableB1 10 mln rows (about 5GB)
> tableB2 50 mln rows (about 15GB)
>
>
> I describe in one select, what I want:)
> select b1.*, b2.*
> FROM tableA1 a1, tableA2, a2, tableB1 b1, tableB2 b2
> WHERE a1.col1 = a2.col2
> and b1.col1 = b2.col2
> AND b1.col1 = SUBSTR(a1.col1)
> and b1.state <> a1.state
>
> So I want to find any differences between system A and B
>
> Question is:
> Is this reasonable to do that through database (I have csv flat files)
> ?
Yes a database will likely do it faster than a simple standalone
program. A good choice would be (if available) one of the better DBMS
products like ORACLE, DB2, SQL Server, FIrebird/Interbase, and a few
others. THey all should be able to handle this volume of data.
> Can any database handle this comparison in 12 hours?
The comparison only? sure. Even including the time to load the
database, it should still beat 12hours.
> I wonder about comparing the flat files, without database, to itereate
> once over every of the sorted files.
Ever consider looking up sort algorithms? (What do the schools teach
these younger programmers? what happened to fundamental algorithms?)
>
> Sorting of each flat file takes about 15min. But there's a problem of
> joining them later to denormalize each of the systems, or simply to
> compare data from 4 files.
ever hear of merge sort? The join would be similar.
>
> Do you have any point of view, what shell I use to do this
If you really are using shell scripts, I'd say avoid those. If you
really want to "script" this, use PERL.
> Can you give some piece of advice, what way I should choose (database
> or flat files)?
>
> Cheers
> Darek
I'd prefer a database system first, and only if forced to would I
program the flat file processing, choosing PERL.
But is this a one-time job, or will something like this run multiple
times? in the later case, I might end up choosing a compiled language.
Let us know what you pick in the end.
HTH,
ed
.
- Follow-Ups:
- Re: comparison of large data sets
- From: Darek
- Re: comparison of large data sets
- References:
- comparison of large data sets
- From: Darek
- comparison of large data sets
- Prev by Date: comparison of large data sets
- Next by Date: Re: comparison of large data sets
- Previous by thread: comparison of large data sets
- Next by thread: Re: comparison of large data sets
- Index(es):
Relevant Pages
|
|