Re: [Pan-devel] More database thoughts
From: K. Haley
Subject: Re: [Pan-devel] More database thoughts
Date: Sun, 20 Jun 2004 00:00:41 -0600
User-agent: Mozilla Thunderbird 0.7 (Windows/20040616)
Let's see if it gets posted this time.
Tom wrote:
> I like Calin's incremental approach. If you add the DB stuff to what's
> already there, it makes it easy to cross-check for debugging purposes.
> Also, unless you're confident that SQLite will never corrupt the DB,
> having the files as a backup is an insurance factor. In any case, adding
> a "(re)build DB" menu option someplace might be a good starting point.
Pan currently uses one directory for each server. These directories
contain files for each group for which you've downloaded headers,
containing the article info for that group. This would seem to make
cross-checking complicated. As for corruption, Pan's current setup gets
corrupted occasionally as is. The only real solution here is to use more
than one DB file. The first would hold the server, group, and
group-server tables. The article and article-server stuff could be in
one or more additional tables. It's an interesting trade-off.
If stored in one table, then article status would be tracked for
cross-posts; however, all article data is lost if the file is corrupted.
If stored in one table per group, then only that group's data would be
lost and the user could nuke the file if it's not wanted; however,
article status would not be tracked for cross-posts. It would also be
more difficult to implement.
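For illustration, the single-article-table variant of the split layout could be sketched with Python's sqlite3 module and SQLite's ATTACH DATABASE. The file names, table names, and columns below are invented for the example and are not Pan's actual schema, and it assumes the article tables live in a second file. The point is only that the article file can be deleted without touching the server and group tables, while a cross-posted article still has one shared read flag.

```python
import os
import sqlite3
import tempfile


def open_split_db(directory):
    """Open the core DB file and attach a separate article DB file to it."""
    core_path = os.path.join(directory, "core.db")          # hypothetical name
    articles_path = os.path.join(directory, "articles.db")  # hypothetical name
    conn = sqlite3.connect(core_path)
    conn.execute("ATTACH DATABASE ? AS art", (articles_path,))
    # No REFERENCES clauses across files: SQLite foreign keys cannot span
    # attached databases, so the links are plain integer columns.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS main.server (
            id INTEGER PRIMARY KEY, host TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS main.grp (
            id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS main.group_server (
            group_id INTEGER, server_id INTEGER,
            PRIMARY KEY (group_id, server_id));
        CREATE TABLE IF NOT EXISTS art.article (
            id INTEGER PRIMARY KEY, message_id TEXT UNIQUE,
            is_read INTEGER DEFAULT 0);
        CREATE TABLE IF NOT EXISTS art.article_group (
            article_id INTEGER, group_id INTEGER,
            PRIMARY KEY (article_id, group_id));
    """)
    return conn


def demo():
    """Cross-post one article, mark it read once, then lose the article file."""
    with tempfile.TemporaryDirectory() as d:
        conn = open_split_db(d)
        conn.executemany("INSERT INTO main.grp (name) VALUES (?)",
                         [("alt.test",), ("misc.test",)])
        conn.execute("INSERT INTO art.article (message_id) VALUES ('<1@example>')")
        conn.executemany("INSERT INTO art.article_group VALUES (1, ?)",
                         [(1,), (2,)])
        # One UPDATE marks the article read for every group it was posted to.
        conn.execute("UPDATE art.article SET is_read = 1 "
                     "WHERE message_id = '<1@example>'")
        conn.commit()
        read_in = conn.execute(
            "SELECT g.name FROM main.grp g "
            "JOIN art.article_group ag ON ag.group_id = g.id "
            "JOIN art.article a ON a.id = ag.article_id "
            "WHERE a.is_read = 1 ORDER BY g.name").fetchall()
        conn.close()

        # Simulate nuking a corrupted article file; the core file is untouched.
        os.remove(os.path.join(d, "articles.db"))
        conn = open_split_db(d)
        groups = conn.execute("SELECT COUNT(*) FROM main.grp").fetchone()[0]
        articles = conn.execute("SELECT COUNT(*) FROM art.article").fetchone()[0]
        conn.close()
        return read_in, groups, articles
```

A one-file-per-group variant would attach a different article file for each group instead, which is where the cross-post tracking gets lost.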
> It looks to me that the Article structure is a big memory user. The
> thing is, you rarely if ever display anything other than part 0 or 1, so
> to me it makes sense to only keep the part 0/1 in memory, and retrieve
> from DB/display the others on an as-needed basis. Correct me if I'm
> wrong, but I suspect that few (text or binary) groups would have more
> than 100,000 "unique" subjects (part 0/1). It seems to me that trying to
> truncate the (xx/yy) from the subject string would be a small saving by
> comparison.
My idea is to extend the duplicate checking to include authors. This
would offer more space savings in most groups. Whether or not the
subject gets truncated is another matter. The same table that holds the
subjects will hold the authors as well. No need for an additional
authors table since both are used for finding duplicates.
TABLE duplicates
    text
    ref_cnt
    id

Article
    subject    duplicates:id
    author     duplicates:id
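As a rough sketch of how the shared duplicates table with a reference count might behave, here is one possible version using Python's sqlite3 module. The column types, the UNIQUE constraint, and the helper functions (intern_text, release_text, add_article, delete_article) are assumptions made for the example, not anything taken from Pan.

```python
import sqlite3

# Schema modelled on the outline above; types and constraints are assumed.
SCHEMA = """
CREATE TABLE duplicates (
    id      INTEGER PRIMARY KEY,
    text    TEXT UNIQUE NOT NULL,
    ref_cnt INTEGER NOT NULL
);
CREATE TABLE article (
    id      INTEGER PRIMARY KEY,
    subject INTEGER NOT NULL REFERENCES duplicates(id),
    author  INTEGER NOT NULL REFERENCES duplicates(id)
);
"""


def intern_text(conn, text):
    """Return the id for text, bumping ref_cnt or inserting a new row."""
    row = conn.execute(
        "SELECT id FROM duplicates WHERE text = ?", (text,)).fetchone()
    if row is not None:
        conn.execute(
            "UPDATE duplicates SET ref_cnt = ref_cnt + 1 WHERE id = ?", (row[0],))
        return row[0]
    cur = conn.execute(
        "INSERT INTO duplicates (text, ref_cnt) VALUES (?, 1)", (text,))
    return cur.lastrowid


def release_text(conn, dup_id):
    """Drop one reference and delete the row once nothing uses it."""
    conn.execute(
        "UPDATE duplicates SET ref_cnt = ref_cnt - 1 WHERE id = ?", (dup_id,))
    conn.execute(
        "DELETE FROM duplicates WHERE id = ? AND ref_cnt <= 0", (dup_id,))


def add_article(conn, subject, author):
    """Store an article whose subject and author point into duplicates."""
    subject_id = intern_text(conn, subject)
    author_id = intern_text(conn, author)
    cur = conn.execute(
        "INSERT INTO article (subject, author) VALUES (?, ?)",
        (subject_id, author_id))
    return cur.lastrowid


def delete_article(conn, article_id):
    """Remove an article and release its subject and author strings."""
    row = conn.execute(
        "SELECT subject, author FROM article WHERE id = ?",
        (article_id,)).fetchone()
    if row is None:
        return
    conn.execute("DELETE FROM article WHERE id = ?", (article_id,))
    for dup_id in row:
        release_text(conn, dup_id)
```

With this shape, a thousand posts from the same author store the author string once, and the row disappears when the last article referencing it is deleted.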
As for the Article structure using a lot of memory, all we really need
is a small cache of 100-200 entries for the visible articles.
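One way such a small cache could look is a least-recently-used map in front of the database, sketched below in Python. The ArticleCache class and its loader callback are hypothetical names for the example; LRU eviction is an assumption, since the message only asks for a cache of a couple of hundred entries.

```python
from collections import OrderedDict


class ArticleCache:
    """Keep at most `capacity` recently used articles in memory."""

    def __init__(self, loader, capacity=200):
        self._loader = loader          # callable: article_id -> article data
        self._capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, article_id):
        if article_id in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(article_id)
            return self._entries[article_id]
        # Cache miss: fetch from the backing store (e.g. the DB).
        article = self._loader(article_id)
        self._entries[article_id] = article
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return article

    def __len__(self):
        return len(self._entries)
```

The loader would be whatever fetches one article row from the DB, so scrolling the header pane only touches the database for articles that are not already visible.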