Re: general question about how best to cache expensive query results
- From: "j.k." <spaecious@xxxxxxxxx>
- Date: 7 May 2007 04:31:08 -0700
On May 4, 5:12 am, Ed Prochak <edproc...@xxxxxxxxx> wrote:
On May 2, 8:12 pm, "j.k." <spaeci...@xxxxxxxxx> wrote:
What are general approaches to this kind of problem?
Don't let pub_date be null. Seriously, you are making workarounds for
failing to meet what is the obvious business requirement.
I wish that I could not let pub_date be null, but I have no control
over that. Third parties -- with whom I have no relation -- create the
feeds (which contain entries) and sometimes don't include the field
that becomes pub_date.
Your
assumption about the ID's is bogus because someone could back to fill
in publication date with a value that does not fit your assumption.
You did not mention any guarantee that the date eventually provided
will fit your model.
To use Robert's sample data:
id pub_date
1 2007-01-01
2 2007-01-09
3 (NULL)
4 2007-02-01
what prevents someone from updating id=3 to pub_date=2007-01-06?
The data could never be updated (nothing touches the entry after it is
inserted, so pub_date would never be updated), but you are correct
that the pub_dates are not sequential. I hadn't thought things
through when I sent the original email. The original query does not
work for other reasons. That was my first attempt to limit the entries
that are returned to those within the last 5 days or so, but it fails
horribly when, for example, I've just inserted a 5 day-old entry that
would be the result of that original query. And it fails for nulls too
in more cases than I envisioned originally.
At present, I'm not sure how I'm going to deal with it. The problem is
what you identified further down.
So can you provide a default date when a new article is logged in this
table?
In some cases, I can guess a good enough default date or create one
that will work, but unfortunately not in all cases.
If the ID's are guaranteed to be sequential with pub_date, then
possible defaults are either the system date when the article is
loaded, or the pub_date (or maybe pub_date+1day?) of the previous
article. Lacking rules for a default pub_date, then the data entry
needs to require the pub_date before inserting the article.
The ids are guaranteed to be sequential, but the pub_dates are not
sequential. Your suggestions are both good, and they would both work
for some cases, but there are still some cases where I couldn't use
either one.
If you really must continue to use NULL on pub_date, then you must
consider whether the rows with NULL really belong in your result set.
Yes, I'm not sure. The actual query is something like "get all entries
for a given feed in descending order of pub_date, limited to 100 or
so," except that some feeds don't have a pub_date, and I can't
determine a good guess in all cases.
You really have two result sets: one set fitting the date requested,
and one for all the articles with NULL pub_date. what of there was a
row:
0 NULL
if you selected for pub_date 2007-01-01, WHY is it not in your result
set? What business rule? That rule should suggest wht the real
criteria is on your query, since it is not just pub_date. By pub_date
alone, either all the NULL dates should be included or all shouled be
excluded. You need to know and state the rules. Your current query is
imbuing the id attribute with properties it doesn't explicitly have.
You may need to rethink your data model, not just your query.
The only reason the pub_date would be part of the query I gave in
prose above is to provide a sort order so that if there are too many
entries and the results are limited to 100, it is the 100 most recent
ones. So the question reduces to where nulls go in the sort order, and
both first and last are problematic, as you noted.
The only thing I can think to do at present is to have an additional
column that specifically provides the sort criterion. For entries
within a given feed (but not across feeds), it should have the
property that sort_criterion_A is less than sort_criterion_B if and
only if entry A should be sorted after entry B. I don't know the
pub_date in some cases, but since the entries come to me in order from
most recent to least recent (within a single feed), I can insert them
with a sort_criterion value that preserves the ordering.
The primary id actually fulfills this right now, since it is auto-
generated from a sequence, incremented once for each new entry, and
the entries for a given feed are inserted from least recent to most
recent. I have to sleep on this a little, but the explicit sort
criterion idea might work.
One obvious solution to me is to create a table for storing just this
value, but something seems wrong to me about potentially lots of
little 1-column, 1-row tables.
This is not obvious. I do not see this as a "1-column, 1-row table".
What I was thinking was that I need to store entry id = 144194 (for
example) somewhere, update it once an hour, and select it many times a
second. The 1-column, 1-row table I was thinking of was:
create table cache (
cached_id integer not null
);
And thus the other questions I gave of what to use for primary key,
etc.
If there are multiple of this values that need to periodically
updated, then they could all be in 1 miscellaneous table, but then I
would have the problem of deciding what to make the primary key for
each row, and it would have to be hard-coded in the update and select
queries to distinguish this cached value from other cached values in
the same table, as well as dealing with values having different types.
the last part of this ^^^ paragraph doesn't make sense. What "cache"
are you talking about?
Well, I said 'cached value' and not 'cache', meaning any value that is
stored some place and read from that place rather than being re-
calculated. I was asking what the options are for persisting some
value rather than recalculating it; i.e., do all such values belong in
one table, or in multiple tables, etc.
Queries deal with tables and columns.
Agreed. My perhaps poorly phrased questions were about what form the
tables and columns should take.
IF I were to add another table, I think it would have the default date
I was describing above, so it would need the id attribute as well.
IOW, it would be a pub_date search table, where the attributes are the
date column (constrained to NOT NULL) and the id for that date. The PK
would be the composite date and id. It still has the issue of
defining the default date!
I will try to fully normalize the last remaining holdouts in the
entries table, including the pub_date. Everything else is very
normalized and no nulls are possible, but I allowed a few potentially
null columns to remain in entries (one of which was pub_date) because
I was worried about performance. That table is already the bottleneck
due to the extent that it is already normalized; the query requires 5
or so joins, and there are many in the table, and new entries are
frequently inserted. Maybe I will start another topic when I have it
fully normalized if the performance is inadequate. (Yeah, I know;
should have been BCNF first before thinking about performance. Forgive
me Knuth or Hoare.)
There is nothing "hard-coded" here, just an extra level of indirection
on SELECT and triggers on INSERT and UPDATE (pgsql does have triggers,
right?)
The hard-coding I was thinking of was the name of the 'cache' table
(and the relevant column) I showed above. The insert, update, and
select would all need to know that. What I was originally hoping was
that there was some standard way of being able to specify the ideal
query that I would like to perform and specify additionally that it
shouldn't be performed more than once every hour, and the database
would somehow take care of the rest. I guess I was hoping for
something like a view that just updates once an hour.
The obvious solution I was alluding to was to create a table that
stores the value, and have a trigger that updates it once an hour, and
a stored procedure that retrieves the value if there, creating it if
not. This seemed messy to me.
Most of it was clear enough. The best strategy is always to meet the
business requirements. Since you haven't given us all of them
(understandably in the limited space of a newsgroup posting) we must
make assumptions. I hope I made my assumptions clear.
Thanks very much for taking the time to respond so thoroughly. I
apologize for my lack of clarity. I was focusing on the perceived
solution and just wanted the quick means to that end, rather than
presenting the underlying problem and the business requirements. I've
had to think things through a lot more thanks to your response though.
Ed
PS (I personally dislike generic ID attributes as the PK of entities.
When not carefully designed into the data model, it too often leads to
pain and sorrow.)
I usually prefer a more meaningful id as well, but wasn't able to
identify anything suitable among the other attributes.
Cheers,
joseph
.
- References:
- Prev by Date: Howto embed select statements into other select statements
- Next by Date: Re: Schema for Full Text DB Using Stemmer
- Previous by thread: Re: general question about how best to cache expensive query results
- Next by thread: Re: escaping table/field names and values
- Index(es):
Relevant Pages
|