From: Calin A. Culianu
Subject: Re: [Pan-devel] ancient DB schema
Date: Tue, 8 Jun 2004 09:43:08 -0400 (EDT)

This looks reasonable, although for the purposes of performance I would
suggest further normalizing the Articles table so that it doesn't
contain the actual subject text.
For the binaries newsgroups, I have been able to really increase
performance and minimize disk space usage by doing the following:
If it's a multipart binary (as determined using the heuristics already in
Pan, namely the subject ends in [xx/yy] or (xx/yy) and the article is over
400 lines), then we can assume that all the subjects are the same except
for the xx/yy part. So why not strip that part, put the subjects in a
separate table, and save only the 'subject id', the part number, and the
total number of parts in the Articles table?
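
A rough sketch of that split, in Python for brevity (Pan itself is C).
The function name, the regex, and the MIN_LINES constant are illustrative
assumptions, not taken from Pan's source; Pan's real heuristics may differ
in detail.

```python
import re

# Assumed pattern: a trailing "[xx/yy]" or "(xx/yy)" at the end of the subject.
_PART_RE = re.compile(
    r"^(?P<base>.*?)\s*"
    r"(?:\[(?P<p1>\d+)/(?P<n1>\d+)\]|\((?P<p2>\d+)/(?P<n2>\d+)\))\s*$"
)

# Line-count threshold mentioned in the text ("over 400 lines").
MIN_LINES = 400


def split_multipart_subject(subject, line_count):
    """Return (base_subject, part, parts), or None if not a multipart binary."""
    if line_count <= MIN_LINES:
        return None
    m = _PART_RE.match(subject)
    if m is None:
        return None
    # Exactly one of the two bracket styles matched; take whichever is set.
    part = int(m.group("p1") or m.group("p2"))
    parts = int(m.group("n1") or m.group("n2"))
    return m.group("base"), part, parts
```

Every part of the same post then yields the same base subject, so only
the (part, parts) pair differs from row to row.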
In fact, in a typical 1 million+ header group, there are usually only
about 1000-2000 unique subjects. So you save a LOT of disk space by doing
this, and queries and sorts on the Articles table get much faster since
less data needs to be scanned per query.
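
To make the layout concrete, here is a minimal sketch using Python's
sqlite3 module. SQLite, the table and column names, and the helper
functions are all assumptions for illustration; they are not the schema
from the attachment or from my working code.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the two hypothetical tables and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS subjects (
            id   INTEGER PRIMARY KEY,
            text TEXT NOT NULL UNIQUE      -- subject with the xx/yy part removed
        );
        CREATE TABLE IF NOT EXISTS articles (
            id         INTEGER PRIMARY KEY,
            message_id TEXT NOT NULL UNIQUE,
            subject_id INTEGER NOT NULL REFERENCES subjects(id),
            part       INTEGER,            -- xx, NULL if not multipart
            parts      INTEGER             -- yy, NULL if not multipart
        );
    """)
    return conn


def intern_subject(conn, text):
    """Return the id for this subject text, inserting it only if it is new."""
    conn.execute("INSERT OR IGNORE INTO subjects (text) VALUES (?)", (text,))
    row = conn.execute(
        "SELECT id FROM subjects WHERE text = ?", (text,)
    ).fetchone()
    return row[0]


def add_article(conn, message_id, base_subject, part=None, parts=None):
    """Store one header row that points at the shared subject text."""
    subject_id = intern_subject(conn, base_subject)
    conn.execute(
        "INSERT INTO articles (message_id, subject_id, part, parts) "
        "VALUES (?, ?, ?, ?)",
        (message_id, subject_id, part, parts),
    )
    return subject_id
```

With this layout, a 45-part post costs one row in subjects and 45 small
integer-only rows in articles, rather than 45 copies of the subject
string.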
Anyway, the DB changes I am working on now aren't as comprehensive as
what you propose here. As a first pass, I am _only_ changing the bits of
Pan that deal with article headers, and putting only that stuff in the
DB, as that's where we have really big problems with memory consumption
and that's where we benefit most from using a DB. This is the lazy man's
approach. I don't want to change Pan too much; I only want to tweak it
to scale better.
I leave it to you guys to decide how to totally transform Pan into
using a full-fledged DB backend and creating 'virtual groups' or whatever
it was you were discussing.
-Calin
On Fri, 4 Jun 2004, K. Haley wrote:
> I'm attaching an old DB schema I came up with. It is based on the one
> posted by Charles a long time ago. There are still some unanswered
> questions as to where some of the info should go. The biggest one is
> whether or not articles in folders should be in their own table. FYI I
> chose to use integer primary keys for space and speed savings.