[Pan-devel] RFC: Detecting multiparts (was: .94 weirdness with detecting
From: Charles Kerr
Subject: [Pan-devel] RFC: Detecting multiparts (was: .94 weirdness with detecting attachments)
Date: Fri, 8 Aug 2003 10:49:09 -0700
User-agent: Mutt/1.5.4i
>>> Charles... I'm seeing the same behavior with non-binary groups. i.e.
>>> text-only documents published in multiple posts and labeled as such. For
>>> example, "some subject here 1/5" and "some subject here 2/5" and "some
>>> subject here 3/5" etc. will show up as "broken parts" because Pan is
>>> trying to "decode" non-binary multipart posts and is freaking out because
>>> there is no way to decode these....
>> I don't think this is the same bug: Chris is reporting that Pan is
>> incorrectly treating multiparts as non-multiparts; you're reporting that
>> Pan is incorrectly treating non-multiparts as multiparts. :)
> Oh, Ok... So, it's the "reverse" of Chris' bug :-)
Clearly Pan's letting both false positives and false negatives through when
looking for multiparts. Maybe we should revise the multipart detection code.
Here's a rough draft for a better detection scheme. I'm posting it here
so that people can refine it and/or shoot holes in it.
background
----------
* there are no standard headers, other than the Subject: header,
to link multiparts together or even to denote binary attachments.
* we can't thread properly until we've guessed the multipart state,
so looking to other articles for context is problematic.
* the best tools we have are the Subject: header, the group name,
and the number of lines in the article.
tools
-----
* likely_binary_group is true if the newsgroup name contains
any of: "binaries", "fan", "mag", "sex", false otherwise
* likely_binary_subject is true if the Subject: header contains
any of: "jpeg", "jpg", "gif", "tiff", "png", false otherwise
* part = 0, or if either "(x/y)" or "[x/y]" is in Subject:, then x.
(Work backwards from the end of the string, in case someone's
posting a set of multiparts and (x/y) appears in the Subject: twice)
* parts = 0, or if either "(x/y)" or "[x/y]" is in Subject:, then y.
(Work backwards here too)
* lines = number of lines in article
* is_reply = true if Subject: begins with "Re:", false otherwise
* is_binary: true or false. This is what we're trying to guess.
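To make the tools above concrete, here is a rough Python sketch of them. This is only an illustration, not Pan's code: the function names, the case-insensitive matching, and the requirement that the brackets around x/y match are all assumptions on my part. "Working backwards" is approximated by scanning every marker and keeping the last one.

```python
import re

# Keyword lists taken from the draft; matching case-insensitively is an assumption.
BINARY_GROUP_HINTS = ("binaries", "fan", "mag", "sex")
BINARY_SUBJECT_HINTS = ("jpeg", "jpg", "gif", "tiff", "png")

# Matches "(x/y)" or "[x/y]". Requiring matching bracket types is an assumption.
PART_RE = re.compile(r"\((\d+)/(\d+)\)|\[(\d+)/(\d+)\]")


def likely_binary_group(group):
    """True if the newsgroup name contains one of the binary-ish keywords."""
    name = group.lower()
    return any(hint in name for hint in BINARY_GROUP_HINTS)


def likely_binary_subject(subject):
    """True if the Subject: text mentions a common image file type."""
    text = subject.lower()
    return any(hint in text for hint in BINARY_SUBJECT_HINTS)


def parse_part_and_parts(subject):
    """Return (part, parts) from the last x/y marker, or (0, 0) if there is none."""
    last = None
    for match in PART_RE.finditer(subject):
        last = match  # keep overwriting so the final marker in the string wins
    if last is None:
        return 0, 0
    # Exactly one alternative matched, so exactly two groups are not None.
    x, y = (g for g in last.groups() if g is not None)
    return int(x), int(y)


def is_reply(subject):
    """True if the Subject: text begins with "Re:" (ignoring case and leading spaces)."""
    return subject.lstrip().lower().startswith("re:")
```

For example, parse_part_and_parts("holiday pics (2/9) beach.jpg (3/5)") returns (3, 5), because the last marker wins when the Subject: contains two.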
guessing
--------
1. start with is_binary = false.
2. if part > 0,
and parts > 0,
and parts >= part,
set is_binary to true.
3. if is_binary is true,
and we're not in a likely binary group,
and we don't have a likely binary subject,
and parts > 1,
then it's probably a set of text posts like John mentioned above.
set is_binary to false.
4. if is_binary is false,
and we're in a likely_binary_group,
and either lines > 500 or we've got a likely binary subject,
and both part and parts are 0,
then it's likely a single-part binary where the user omitted the "(1/1)".
set is_binary to true.
5. if is_binary is true,
and is_reply is true,
and the part is 0 or 1,
then it's probably a follow-up to a multipart (I've never seen a follow-up
to a part > 1).
set is_binary to false.
UNLESS: once in a blue moon people will post binaries as follow-ups, so
hedge our bets:
leave is_binary as true if lines > 500.
6. if is_binary is true,
and the subject contains any of: "Frequently Asked Questions", "FAQ",
"Weekly", "Monthly",
then it's a FAQ or periodic posting being posted in pieces.
set is_binary to false.
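The guessing rules above can be sketched as a single function, applied in the order listed. Again this is a hedged illustration rather than Pan's implementation: the function name and parameter names are made up, the inputs are assumed to have been computed already by the tools described earlier, and matching the FAQ keywords case-sensitively is an assumption.

```python
# Keywords from the last rule; case-sensitive matching is an assumption.
PERIODIC_HINTS = ("Frequently Asked Questions", "FAQ", "Weekly", "Monthly")


def guess_is_binary(subject, part, parts, lines, is_reply,
                    binary_group, binary_subject):
    """Apply the draft's rules in order and return the final is_binary guess."""
    # Start out assuming the article is not a binary.
    is_binary = False

    # A sane x/y marker suggests a binary multipart.
    if part > 0 and parts > 0 and parts >= part:
        is_binary = True

    # Outside binary-ish groups, with no binary-ish subject, a multi-piece
    # post is probably text published in several pieces.
    if is_binary and not binary_group and not binary_subject and parts > 1:
        is_binary = False

    # In a binary-ish group, a big or binary-looking article with no marker
    # is likely a single-part binary where the poster omitted "(1/1)".
    if (not is_binary and binary_group
            and (lines > 500 or binary_subject)
            and part == 0 and parts == 0):
        is_binary = True

    # A reply to part 0 or 1 is probably a follow-up, unless it is large
    # enough (lines > 500) that it might really be a binary.
    if is_binary and is_reply and part in (0, 1) and lines <= 500:
        is_binary = False

    # FAQs and periodic postings are often posted in pieces.
    if is_binary and any(hint in subject for hint in PERIODIC_HINTS):
        is_binary = False

    return is_binary
```

With these rules, a text series such as "some subject here (1/5)" in a non-binary group comes out as false, while the same marker on a "pic.jpg (1/3)" post in a binaries group comes out as true.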
--
cheers,
Charles