Re: comparison of large data sets




Darek wrote:
> Hi
> I have a problem:
> There are 2 systems with, i.e two tables
> System A:
> tableA1 10 mln rows (about 5GB)
> tableA2 50 mln rows (about 15GB)
>
> System B:
> tableB1 10 mln rows (about 5GB)
> tableB2 50 mln rows (about 15GB)
>
>
> I describe in one select, what I want:)
> select b1.*, b2.*
> FROM tableA1 a1, tableA2, a2, tableB1 b1, tableB2 b2
> WHERE a1.col1 = a2.col2
> and b1.col1 = b2.col2
> AND b1.col1 = SUBSTR(a1.col1)
> and b1.state <> a1.state
>
> So I want to find any differences between system A and B
>
> Question is:
> Is this reasonable to do that through database (I have csv flat files)
> ?

Yes a database will likely do it faster than a simple standalone
program. A good choice would be (if available) one of the better DBMS
products like ORACLE, DB2, SQL Server, FIrebird/Interbase, and a few
others. THey all should be able to handle this volume of data.

> Can any database handle this comparison in 12 hours?

The comparison only? sure. Even including the time to load the
database, it should still beat 12hours.

> I wonder about comparing the flat files, without database, to itereate
> once over every of the sorted files.

Ever consider looking up sort algorithms? (What do the schools teach
these younger programmers? what happened to fundamental algorithms?)

>
> Sorting of each flat file takes about 15min. But there's a problem of
> joining them later to denormalize each of the systems, or simply to
> compare data from 4 files.

ever hear of merge sort? The join would be similar.

>
> Do you have any point of view, what shell I use to do this

If you really are using shell scripts, I'd say avoid those. If you
really want to "script" this, use PERL.

> Can you give some piece of advice, what way I should choose (database
> or flat files)?
>
> Cheers
> Darek

I'd prefer a database system first, and only if forced to would I
program the flat file processing, choosing PERL.

But is this a one-time job, or will something like this run multiple
times? in the later case, I might end up choosing a compiled language.

Let us know what you pick in the end.
HTH,
ed

.



Relevant Pages

  • Re: another way
    ... > Is it possible to store persistant data in anything other than a ... store data in a continuous loop. ... tell if that is a database or flat files, but it seems to persist out ...
    (comp.databases.oracle.server)
  • Re: Fragile Fences
    ... That's how Visual Age for Java operated. ... I don't think Eclipse relies on the flat files. ... it keeps private pre-parsed copy in a hidden database somewhere. ...
    (comp.lang.java.programmer)
  • Re: How to copy a database?
    ... if you have access to the database, you can copy it or you can use the import wizard. ... copying a ZIP file (which contain the flat files) could help you. ... can deliver Ascii files or facttables and dimensiontables. ...
    (microsoft.public.sqlserver.datawarehouse)
  • Re: comparison of large data sets
    ... Darek wrote: ... Is this reasonable to do that through database? ... I wonder about comparing the flat files, without database, to itereate once over every of the sorted files. ... joining them later to denormalize each of the systems, or simply to compare data from 4 files. ...
    (comp.databases)
  • Re: Database type independence
    ... > This system stores it's information in flat files. ... loading all company customers or all products in the memory.... ... question was about DATABASE independence. ...
    (comp.object)