[Pan-devel] Re: Move to database backend
From: Tom Enterline
Subject: [Pan-devel] Re: Move to database backend
Date: Sun, 14 Mar 2004 21:47:28 -0500
On Wed, 2004-03-10 at 12:37, Charles Kerr wrote:
> I've written down about eight pages in my notebook on what I think
> would be good. In general I'm trying to keep as much information in
> the database, rather than in memory. Ideally the Article struct
> would go away. article-thread is already redundant because Pan
> threads an article when it's downloaded instead of each time the
> group is loaded.[1]
>
> Right now I'm using SQLite, as it's fast, embeddable and portable.[2]
>
> So far I've got the headers being inserted into the database as
> they are downloaded, and multiparts and plaintext articles are
> all threaded inside of the database so that we don't have to
> rethread the entire group every time we load Pan. This is running
> parallel to file-headers, though, since I don't have articlelist
> reading out of the database yet. But I've got some ideas on how to
> do that.
>
> My big concern right now is speed. Hard Drive speed seems to be crucial
> for keeping the headers in a database -- when I download headers right
> now, disk access is the bottleneck rather than bandwidth. So an
> experienced DB person's opinion on how to tune the tables would be
> great.
>
> I'll transcribe my notes tonight or tomorrow, and mail them and my
> code changes to pan-devel.
Charles,
Glad to hear about your progress.
Some general notes:
I don't know what your database experience is, so please excuse me if I
say something you already know.
Converting a program from using files to using a database is sort of
like converting a program from, say, C to Perl. You can do a straight
one-for-one conversion, but you wouldn't be taking advantage of the new
features. The straight conversion would be doing things the "C way", and
not the "Perl way". That said, the easiest way is probably to do the
straight conversion, then take advantage of the new features as you gain
experience.
-------------------------------------------------------
Particulars:
(Disclaimer: I've only looked at a little of the PAN code, so if some of
the below doesn't make sense, ignore it or let me know where it doesn't
make sense.)
My thoughts on how PAN might work after a first step of converting to
use a database:
Downloading headers:
Would work pretty much like it does now. Instead of each header being
written as several lines in a file, it is inserted as one record/row in
the database. Committing the inserts in groups of 500-1000 rows should
work well.
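A minimal sketch of the batched commit, written in Python with the standard sqlite3 module for brevity rather than Pan's C. The table name `headers`, its columns, and the helper `insert_headers` are assumptions made for illustration, not Pan's actual schema or code. Batching is meant to keep the number of commits, and so the number of times SQLite has to flush to disk, small.

```python
import sqlite3

BATCH_SIZE = 500  # the message suggests 500-1000 rows per commit


def insert_headers(conn, headers, batch_size=BATCH_SIZE):
    """Insert header rows, committing once per batch. Returns the commit count."""
    commits = 0
    pending = 0
    for row in headers:
        conn.execute(
            "INSERT INTO headers (message_id, subject, part, total_parts) "
            "VALUES (?, ?, ?, ?)",
            row,
        )
        pending += 1
        if pending >= batch_size:
            conn.commit()  # one disk sync for the whole batch
            commits += 1
            pending = 0
    if pending:  # commit the final, partial batch
        conn.commit()
        commits += 1
    return commits


# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE headers (message_id TEXT PRIMARY KEY, subject TEXT, "
    "part INTEGER, total_parts INTEGER)"
)
rows = [(f"<{i}@example.invalid>", f"post {i}", 1, 1) for i in range(1200)]
n_commits = insert_headers(conn, rows)
n_rows = conn.execute("SELECT COUNT(*) FROM headers").fetchone()[0]
```

With 1200 rows and a batch size of 500, this commits three times instead of 1200 times.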
Downloading articles:
Again, similar to how it works now. One possible improvement would be to
parse the downloaded article for the additional header fields and store
them in the database row created when the headers were downloaded. Since
articles are typically much larger than headers, each download takes
longer, so committing after each update would probably be OK for binary
groups and maybe OK for text groups.
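A rough sketch of that update step, again in Python with only the standard library. The column names `user_agent` and `organization`, the choice of which extra header fields to keep, and the helper `store_extra_headers` are all assumptions for illustration.

```python
import sqlite3
from email import message_from_string


def store_extra_headers(conn, article_text):
    """Parse a downloaded article and update the row created at header download."""
    msg = message_from_string(article_text)
    cur = conn.execute(
        "UPDATE headers SET user_agent = ?, organization = ? "
        "WHERE message_id = ?",
        (msg.get("User-Agent"), msg.get("Organization"), msg.get("Message-ID")),
    )
    conn.commit()  # one commit per article
    return cur.rowcount  # number of rows updated


# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE headers (message_id TEXT PRIMARY KEY, subject TEXT, "
    "user_agent TEXT, organization TEXT)"
)
conn.execute(
    "INSERT INTO headers (message_id, subject) VALUES (?, ?)",
    ("<1@example.invalid>", "hello"),
)
conn.commit()

article = (
    "Message-ID: <1@example.invalid>\n"
    "Subject: hello\n"
    "User-Agent: ExampleReader/1.0\n"
    "Organization: Example Org\n"
    "\n"
    "Body text.\n"
)
updated = store_extra_headers(conn, article)
row = conn.execute(
    "SELECT user_agent, organization FROM headers WHERE message_id = ?",
    ("<1@example.invalid>",),
).fetchone()
```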
Opening a group:
Retrieve only part 0 or part 1 of each multi-part file. This should
cut memory usage compared to the current design.
Reading or saving a multi-part file:
Pull the information for the rest of the parts from the database for
queuing, etc.
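The two steps above (opening a group, then reading a multi-part file) could be sketched as two queries. This is only an illustration in Python: the `file_key` column that ties the parts of one post together, the other column names, and both helper functions are hypothetical.

```python
import sqlite3


def open_group(conn):
    """Load one row per post: only rows whose part number is 0 or 1."""
    return conn.execute(
        "SELECT file_key, subject, total_parts FROM headers "
        "WHERE part IN (0, 1) ORDER BY file_key"
    ).fetchall()


def parts_for(conn, file_key):
    """Fetch the message IDs of every part of one post, in part order."""
    cur = conn.execute(
        "SELECT message_id FROM headers WHERE file_key = ? ORDER BY part",
        (file_key,),
    )
    return [r[0] for r in cur]


# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE headers (message_id TEXT PRIMARY KEY, subject TEXT, "
    "part INTEGER, total_parts INTEGER, file_key TEXT)"
)
conn.executemany(
    "INSERT INTO headers VALUES (?, ?, ?, ?, ?)",
    [
        ("<a3>", "file.bin (3/3)", 3, 3, "a"),
        ("<a1>", "file.bin (1/3)", 1, 3, "a"),
        ("<b1>", "text post", 1, 1, "b"),
        ("<a2>", "file.bin (2/3)", 2, 3, "a"),
    ],
)
conn.commit()

group_view = open_group(conn)      # what the article list would hold in memory
queue = parts_for(conn, "a")       # fetched only when the user reads or saves
```

In this sketch a post that has both a part 0 and a part 1 would appear twice in the group view; a real schema would need to pick one of them.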
-----------------------------------------------------
I look forward to seeing your notes.