
@** Release history.

\font\df=cmbx12
\def\date#1{{\medskip\noindent\df #1\medskip}}
\parskip=1ex
\parindent=0pt

\date{Release 0.1: November 2002}

Initial release.

\date{Release 1.0: February 2003}

First production release.

@** Development log.

\date{2002 August 28}

Created development tree and commenced implementation.

\date{2002 September 1}

Release 0.1 circulated for review.

\date{2002 September 6}

Added the ability to compute descriptive statistics
of the dictionary built by parsing the \.{--mail} and \.{--junk}
folders, using the facilities of the \.{statlib.w} program.
Statistics are written to standard output.

Added a \.{--plot} option to plot a histogram of words in
a newly parsed dictionary (not a lookup dictionary
loaded with \.{--read}).  Creating the plot requires the
\.{GNUPLOT} and \.{PBMPlus} utilities to be installed.

\date{2002 September 7}

Well, after a huge amount of hunkering down and twiddling,
parsing of MIME multi-part messages and decoding of parts
encoded in \.{Base64} and \.{Quoted-Printable} encoding now
seems to be working.  This drastically improves the quality of
parsing, particularly for junk where these forms of encoding are
used as ``stealth'' to evade other content-based filters.

\date{2002 September 8}

Added the ability to read mail folders compressed with
\.{gzip} or other compressors detected by the Autoconf
script.  This saves a lot of space when you're keeping
large training archives around.  This will work only on
systems with suitable decompressors and the |popen|
facility.

\date{2002 September 9}

Added the \.{--pdiag} option to write the parser diagnostics
to a designated file.  Previously this was controlled by
a gnarly |#define|.

Added a ``\.{X-Annoyance-Filter-Decoder}'' line to the
\.{--pdiag} output to indicate the activation of decoders
(including the sink) for MIME parts in the message.  These
lines are not seen by the token parser.

Fixed a bug in parsing of tokens including ISO accented
characters$\ldots$signed characters strike again.

\date{2002 September 10}

Added a \.{--ptrace} option to include the actual tokens
parsed as indented, quoted lines following each line of
parser input in the \.{--pdiag} file.

Added code to |classifyMessage| which appends lines
to the message header in the \.{--pdiag} file giving
the aggregate junk probability and the most significant
words and their individual probabilities.

Separated the mail and junk thresholds, which may now be
set independently by the \.{--threshjunk} and \.{--threshmail}
options.  The \.{--classify} command now writes ``\.{INDT}''
(for ``indeterminate'') if a message falls between the
two thresholds and exits with a return status of 4.

Added the \.{--binwrite} and \.{--binread} options to
export and import a |dictionary| as a portable (assuming
IEEE floating point on all platforms) binary file.  This
will permit easier distribution of dictionary databases
and may be faster to load than the |lookupDictionary|.

Added the \.{--clearjunk} and \.{--clearmail} options
to clear counts of junk and mail.  This can be used,
in conjunction with the \.{--binwrite} option, to
prepare databases for use by folks who do not wish to
prepare their own.

\date{2002 September 11}

Added the ability to enforce minimum and maximum length
constraints on tokens returned by |tokenParser|.  The
limits are set to accept tokens from 1 to 255 characters
in the |tokenParser| constructor, and may be changed
at any time with the |setTokenLengthLimits| method.
Note that the length limits are not reset by a call to
|setSource|.

Set the default token parser length limits to accept
tokens between 1 and 64 characters.  This will doubtless
be the subject of yet more command line options before
long.

Modified the code which decides whether a mail folder
is compressed to check for the argument being a symbolic
link.  If so, the link target is tested for the extension
indicating a compressed file.  I only follow links one
level---if this poses a problem, your life is probably
too complicated.

Fixed computation of probability to avoid crashes if
no words are present in a category.  Probabilities don't
make any sense in such circumstances, but you may wish
to create such a database for use with
\.{--binread}.

Added logic to |dictionary::exportToBinaryFile| and
|dictionary::importFromBinaryFile| to save and restore the
count of messages contributing to the dictionary
in the |messageCount| array in a pseudo-word
called ``\.{\ COUNTS\ }'' (obligatorily) at the start
of the dictionary.  These counts are required should we need
to recompute the probability subsequent to loading the
dictionary.

Added the \.{--newword} and \.{--sigwords} options to
specify the probability given to words in a message which
don't appear in the dictionary and the number of ``most
significant'' words whose probabilities are used to
determine the aggregate probability a given message is junk.

\date{2002 September 12}

Added logic to cope with the body of a message being
encoded in a \.{Content-Transfer-Encoding}.  While
processing the header, this and the \.{Content-Type}
are parsed as in MIME headers, with their arguments
saved in |bodyContentType|, |bodyContentTypeCharset|, and
|bodyContentTransferEncoding|.  At the end of the header, if
a |bodyContentTransferEncoding| has been specified, the
values are transferred to the corresponding |mime|$\ldots$
variables and |multiPart| is set with an end terminator
of the null string.  The latter disables the decoder's
test for a part end sentinel and the warning for an
unterminated part.

Messages with \.{Subject} lines which contain ISO~8859
encoded characters employ a form of \.{Quoted-Printable}
encoding to permit these characters to appear in a mail
header where only 7-bit ASCII is permitted.  I added
code to |mailFolder| to detect these lines and call a new
|decodeEscapedText| method of |quotedPrintableMIMEdecoder|
to decode them if properly formed.  This will permit parsing
of ISO subject lines, which may prove critical in
discriminating among messages with very short body
copy.

Yikes!  As far as I can determine from the RFCs, what we're
supposed to do with continued header lines is just
concatenate them, discarding all white space on the
continuation even if this runs together tokens on
adjacent lines.  At least, if you don't do this, encoded
words split across continued \.{Subject} lines end up with
nugatory white space in the middle.  So, I fixed
|@<Check for continuation of mail header lines@>| to
``work this way''.  Given our definition of tokens, it's
likely to fix more things than it breaks anyway.

Added documentation to the \.{CWEB} file for yesterday's new
options.

\date{2002 September 13}

\.{Subject} lines can, of course, also contain sequences
encoded in \.{Base64}, tagged with a ``\.{?B?}'' following
the {\it charset} specification.  Added decoding of these
sequences, along with the requisite |decodeEscapedText|
method of |base64MIMEdecoder|.

Made a slight revision to the definition of tokens in
the |tokenParser|.  While ``\.{-}'' and ``\.{'}'' continue
to be considered part of a token if embedded within it, they
can no longer be the first or last characters of a token.
This improves recognition of words in typical text, based
on tests against the big collection.  A new |not_at_ends|
array of |bool| is used to define which characters may not
begin or end a token.

Completely rewrote how the |tokenParser| determines
character types in parsing for tokens.  Previously,
characters were classified by looking them up in a
collection of global arrays of |bool|.  To permit
changing the definition of a token on the fly, I
defined a new class, |tokenDefinition|, which collects
together the lookup tables which determine which
characters constitute a token and indicate the
sets of characters (if any) which cannot exclusively
make up a token and which cannot be the first or
last character of a token.  In addition, the minimum
and maximum acceptable length for tokens are stored
and methods permit testing all of these quantities.
You can initialise the values as you wish with the methods
provided, or use pre-defined initialiser functions
for ISO-8859 and ASCII alphanumeric sets.

Well, let's declare this a red banner day for
the \PRODUCT!  No, you're not dreaming$\ldots$we're
actually ending this day with {\it fewer} command line
options than those which greeted the dawn, and the whole
concept of the ``lookup dictionary'' has been banished, along
with snowdrifts of prose in the documentation explaining
the difference between a ``dictionary'' and a ``lookup
dictionary'' and the things you could or couldn't
do with, or to, them respectively.  The original idea
was that you work with |dictionary| objects when assembling
the database of mail and junk, and then export the results
as a lean and mean lookup dictionary which could be
loaded like lightning to classify subsequent messages.
Well, it turns out that if you use binary I/O for the
|dictionary|, it's just as fast as loading the
lookup dictionary, and all of the confusion is
eliminated.  Further, the user is thereby encouraged to
keep a dictionary on hand which can be updated at any
time to incorporate new examples of mail and junk.
This is all much more the Bayesian spirit of
eternal refinement than settling on a probability
set without subsequent refinement.

Since the lookup dictionary is no more, there's no need
to distinguish the |dictionary| read and write commands
as binary.  Hence, the \.{--binread} and \.{--binwrite}
options have been renamed \.{--read} and \.{--write},
freed up by the lookup dictionary elimination.

\date{2002 September 14}

The direct concatenation of multiple-line header items
added a couple of days ago broke
|@<Process multipart MIME header declaration@>| thanks
to fat-fingered character counting in the recognition of
sentinels.  I fixed this, and modified the code to
perform all parsing on a canonicalised string to avoid
case sensitivity problems.  Note that the
\.{boundary} itself {\it is} and must remain
case sensitive.

Fixed some \.{gcc -Wall} natters which had crept in since
the option was accidentally removed by \.{autoconf}.

Added the ability to read a |mailFolder| from standard input.
If the |fname| argument to the constructor is ``\.{-}'',
|cin| is used as the input stream.

Renamed the \.{--csv} option \.{--csvwrite} in keeping with
nefarious plans soon to be disclosed, and added a
pseudo ``\.{\ COUNTS\ }'' word to the start of the
CSV file giving the number of mail and junk messages
in the dictionary as is done in binary dictionary dumps.
Changed the sort order for the CSV file so that words
with identical probabilities are sorted into lexical
order.

Added a \.{--csvread} option to import a dictionary from
a CSV file in the format created by \.{--csvwrite}.  The
CSV file is {\it added} to the existing in-memory dictionary;
multiple \.{--csvread} and \.{--read} commands may be used
to assemble a dictionary.  The CSV file imported need not be
sorted in any particular order and may contain comments
whose first nonblank character is ``\.{;}'' or ``\.{\#}''.
In the process, I found and fixed a bug in updating the
message counts which applied to both \.{--csvread} and
the existing \.{--read} code, but which only manifested
itself when loading multiple dictionaries.

Wheels within wheels$\ldots$MIME \.{multipart} messages
can, of course, be nested.  You can be blithely parsing your
way through a message when you trip over a part with a
\.{Content-type} of ``\.{multipart/alternative}'', which
pushes a new part boundary onto the stack, to be popped
when the end sentinel of that nested section is encountered.
What fun.  We consequently introduce a new |partBoundaryStack|
to keep track of the nested part boundary sentinels, along
with all of the defensive code needed to cope with the realities
of real world mail.

\date{2002 September 15}

Loosened up the test for \.{multipart} \.{Content-type} so
that ``\.{multipart/related}'' types will be recognised.

Added the long-awaited \.{--transcript} option.  (Thanks,
Kern, for suggesting it!)  A transcript of the input message
for a \.{--test} or \.{--classify} operation is written to the
argument file name (standard output if the argument is
``\.{-}''), with
\.{X-Annoyance-Filter-Junk-Probability} and
\.{X-Annoyance-Filter-Classification}
items appended to the header indicating the calculated
junk probability and classification according to the
thresholds.

Finished the first cut of multiple byte character set
decoders and interpreters.  A {\it decoder} scans the
mail body (encoded or not), and parses the byte stream
into logical characters up to 32 bits in width.  An
{\it interpreter} expresses these characters in a form
suitable for analysis.  Ideographic languages are typically
interpreted as one word per character, other languages
as one letter per character.  These components must, of
course, be utterly bullet-proof as they will be
subjected to every possible kind of garbage in the
course of parsing real-world mail.  At the moment,
we have decoders for EUC and Big5, and interpreters for
GB2312 and Big5.

Added a decoder for EUC-encoded Korean (\.{euc-kr}) as
an example of how to handle an alphabetic language with
a non-Western character set.

\date{2002 September 16}

Modified |EUC_MBCSdecoder| to discard the balance of any
encoded line in which an invalid EUC second byte is
encountered.  After encountering such garbage, the
rest of the line is usually junk and there's no
profit in blithering through it.

Added logic to scan \.{application} binary byte streams
for possible embedded tokens.  The new \.{--binword}
option sets the length of the shortest sequence of contiguous
ASCII alphanumeric characters or dollar signs (with
possible embedded hyphens and apostrophes, but
not permitting these characters at the start or end of
a token) which is accepted as a token.  The default is 5 characters,
which is a tad more discriminating than the \UNIX/ \.{strings}
program, which defaults to 4 printable characters.  You can disable the
scanning of binary streams entirely by setting
\.{--binword} to zero.  Scanning binary streams might
seem to be a curious endeavour, but it's highly effective
at percolating text embedded in viruses and worm attachments
to junk mail to the top of the junk probability hit parade,
then screening them out when they arrive in incoming
mail.

Although the \.{Subject} line is the most important, any line
in a mail header may actually contain quoted sequences
specifying a character set and \.{Quoted-Printable} or
\.{Base64} encoded characters.  I modified
|@<Check for encoded header line and decode@>| to
no longer restrict decoding to the subject line.

Once decoded, if the \.{charset} specification in a
header line quoted sequence is a character set we
understand, it is now decoded and interpreted.
\.{ISO-8859} sets of all flavours are decoded but
not processed further.

Fixed a few \.{gcc -Wall} quibbles in |tokenDefinition|
which popped up on the Solaris compiler but didn't seem to
perturb the almost identical version of \.{gcc} on
Linux.

Modified the \.{--test} option so that if the
\.{--transcript} option has been previously specified
with standard output as the destination (``\.{-}''),
the junk probability is not written to standard output
at the end of the transcript.

\date{2002 September 17}

The \.{Base64} decoder could hang if one of the lines it
was decoding contained white space.  Fixed.

Added logic to detect and discard header items which
begin with our own |Xfile| sentinel.  This shouldn't
happen in the normal course of things, but somebody may
try to spoof a downstream filter by sending mail which
contains a sentinel purporting to be a classification
by us of its legitimacy.  Deleting our own header items also
allows us to process our own transcripts containing them
and reproduce the same results as if they hadn't been
added.

Cleaned up the horrific |@<Activate MIME decoder if required@>|
section which ``jes' grew'' in |mailFolder::nextLine| as
more and more complexities were cranked in to MIME
part decoding, multiple byte character sets, parsing ASCII
strings out of binary data streams, etc.

\date{2002 September 18}

Cleaned up documentation of command line options, clarifying
that they are logically commands which must be specified in
the order in which they are to be executed.  In the process,
I added an example of invoking \PRODUCT\ as a pre-processor
for a mail sorting program such as \.{Procmail} to the
``Quick and dirty user guide''.

Added a new \.{\PRODUCT -run} shell script to
execute the program in default filter mode with the executable
and dictionary installed in the default ``\.{\$HOME/.\PRODUCT}''
directory.  Oh, you haven't heard about that$\ldots$well, stay
tuned$\ldots$details in the next episode.

Incremental refinement of the \.{README} and \.{INSTALL} files,
with many keystrokes to go before we put these documents to sleep.

Added \.{--verbose} tell-tales for the \.{--plot} and
\.{--statistics} options.

Replaced the \.{\PRODUCT.1} manual page with a cop-out which
directs the esteemed reader to the PDF program documentation.
This thing is changing so rapidly that the last thing I need is
to maintain four copies of the bloody command line option
documentation. {\it Four?}  Think about it: the program
(\.{CWEB}), its embedded \.{--help} option text, a Web page
(nonexistent at the moment, thank Bob), and a manual page. 
Keeping all four simultaneously in sync is something which
could appeal only to an accountant. I'm a programmer, not an
accountant---I drink their blood, but I don't do their work.

The code which discards header lines we've generated
attempted to remove lines from the transcript even when no
transcript was being generated, for example, when adding a
message we'd previously processed to the \.{--mail} or
\.{--junk} database.  This caused a |NULL| pointer reference
in
|@<Check for lines with our sentinel already present in the header@>|---fixed.

Hours of patient, unremunerated toil cleaning up \.{Makefile.in}
to bash things into a distributable form.  I added an
\.{install} target which installs the program in the default
\.{\$HOME/.\PRODUCT} directory, creating a customised
\.{run} program (\.{\PRODUCT -run} in the build directory)
which supplies the home directory which \.{sendmail} doesn't.
Massive clean-up of \.{Makefile.in}, yielding a template which
is far more generic for our next foray into software land.

\date{2002 September 19}

Further testing revealed that the segmentation fault
in |dictionary::purge| which I thought I fixed a week or
so ago was still lurking to bite the unwary soul whose
dictionary contained a large number of words
eligible for purging.  As far as I can determine,
when you |erase| an item from a |set|, not only
does the iterator argument to the |erase| become
invalid, but in certain cases (though not always) an
iterator to the {\it previous} item, which was not
erased, becomes invalid as well, leading to perdition when
you attempt to pick up the scan for purgable
words from that point.  After a second tussle with
|remove_if|, no more fruitful than the last (for
further detail, see the |dictionary::purge|
implementation), I gave up and rewrote |purge|
to resume the scan from the {\it start} of the
|set| every time it erases a member.  This may not
be efficient, but at least it doesn't crash!
In circumstances where a large percentage of the
dictionary is going to be purged, it would probably
be better to scan for contiguous groups of
words eligible for purging, then |erase| them
with the flavour of the method which takes a start
and end iterator, but given how infrequently
\.{--purge} is likely to be used, I don't think
it's worth the complication.

In a fit of false economy, I accidentally left the door
open to the possibility that with an improbable albeit
conceivable sequence of options we might try to classify
a message without updating the probabilities in the
dictionary to account for words added in this run.  I
added calls on |updateProbability()| in the appropriate
places to guarantee this cannot happen.  The only
circumstance in which this will result in redundant
computation of probabilities is while building dictionaries,
and the probability computation time is trivial next to the
I/O and parsing in that process.

In the normal course of events the vast majority of runs
of the program will load a single dictionary and use it
to classify a single message.  Since we've guaranteed
that the probabilities will always be updated before
they're written to a file, there's no need to recompute
the probabilities when we're only importing a single
dictionary.  I added a check for this and optimised out
the probability computation.  When merging dictionaries with
multiple \.{--read} and/or \.{--csvread} commands, the
probability is recomputed after adding words to the
dictionary.

If you used a dictionary in which rare words had not been
removed with \.{--purge} to classify a message, you got
screwball results because the $-1$ probability used to
flag rare words was treated as if it were genuine.  It
occurred to me that folks building a dictionary by progressive
additions might want to keep unusual words around on the
possibility they'd eventually be seen enough times to
assign a significant probability.  I fixed
|@<Classify message tokens by probability of significance@>|
to treat words with a probability of $-1$ as if they had
not been found, thus simulating the effect of a
\.{--purge}.  Minor changes were also required to CSV
import to avoid confusion between rare words and the
pseudo-word used to store message counts.  Note that it's
still more efficient to \.{--purge} the dictionary you use
on classification runs, but if you don't want to keep
separate purged and unpurged dictionaries around, you don't
need to any more.

Added a new \.{--annotate} option, which takes an argument
consisting of one or more single character flags (case
insensitive) which request annotations to be added to
the \.{--transcript}.  The first such flag is ``\.{w}'', which
adds the list of words and probabilities used to rank the
message in the same form as included in the \.{--pdiag}
report.  To avoid duplication, I broke the code which
generates the word list out into a new
|addSignificantWordDiagnostics| method of |classifyMessage|.

Added a ``\.{p}'' annotation which causes parser
diagnostics to be included in the \.{--transcript}.
This gets rid of all the conditional compilation based
on |PARSE_DEBUG| and automatically copies the
diagnostics to standard error if |verbose| is set.
Parser diagnostics are reported with the
|reportParserDiagnostic| method of |mailFolder|;
other classes which report errors do so via a
pointer to the |mailFolder| they're acting on
behalf of.

Well, my sleazy reset to the beginning trick for |dictionary|
|purge| really was intolerably slow for real world
dictionaries.  I pitched the whole mess and replaced it
with code which makes a |queue| of the words we wish to
leave in the dictionary, then does a |clear| on the
dictionary and re-|insert|s the items which survived.
This is simple enough to entirely avoid |map|
iterator hooliganism and runs like lightning, albeit
using more memory.

Break out the champagne!  The detestable |MIME_DEBUG|
conditional compilation is now a thing of the past, supplanted
by a new ``\.{d}'' \.{--annotate} flag.  No need to
recompile every time you're inclined to psychoanalyse
a message the parser spit up.

Added a |name| method to |MIMEdecoder| and all its
children, then took advantage of that to dispense with
the horrific duplication of decoder diagnostic code in
|@<Verify Content-Transfer-Encoding and activate decoder if necessary@>|.
What was previously dispersed among the several branches of
the decoder activation is now collected together in a single
case after the decoder has been chosen.

Modified \.{Makefile.in} to delete the fussy \.{core.}{\it process}
files Linux has taken to producing.

Fixed \.{configure.in} to specify \.{-Wall} if we're building with
GCC.

\date{2002 September 20}

On Solaris, GCC is prone to hang if invoked with \.{-O2} (at least
as of version 2.95.3).  I twiddled the \.{configure.in} to change
the compile option to \.{-O} for Solaris builds.

\.{ctangle} and \.{cweave} spewed copious warnings on a
GCC \.{-Wall} build.  To avoid modifying these programs, which
are prefectly compliant ANSI C, I changed \.{Makefile.in} to
suppress the \.{-Wall} option for them when the compiler is
detected as GCC.

\.{make dist} didn't do a \.{make distclean} before generating the
distribution archive, which could result in build-specific files
being included in the archive.  Fixed.

\date{2002 September 21}

Added documentation on how to integrate \PRODUCT\ into a
\.{.forward} pipeline to \.{Procmail}, and build
a \.{.procmailrc} rule set for typical user-level
filtering.  It's 03:40 and I'm going to get some sleep
before proofing this text---at the moment it's something
between a random scribble and a first draft.

Okay, I just couldn't {\it stand it}$\ldots$I just {\it had} to take
another crack at the infernal |dictionary::purge| method.  One
of the many bees in my bonnet buzzed the idea into my ear
that I could avoid both the extra memory consumption of
yesterday's scheme and the risk of instability in the
container by testing the probability of the first item
in the |map|, adding it to the |queue| of survivors if
its probability is significant, then performing
an |erase(begin())|.  Cool, huh?  No iterators, no mess, no
two copies of any word in memory.

The hits just keep on coming$\ldots$the stupid built-in purge
in |dictionary::resetCat| also ran afoul of the ``stale iterator''
problem.  I blew it away---henceforth, it's up to you to do a
\.{--purge} after a \.{--clearmail} or \.{--clearjunk}.  With the
new tolerance for un-purged dictionaries, no great harm will be
done if you forget.

Added a \.{\\subsection} macro to create subheads within
documentation sections.  The section number is automatically
grabbed from the \.{cwebmac.tex} definition, but lower level
numbering is manual, permitting you to add additional
levels of hierarchy with a specification like:\hfill\break
\.{\\subsection\{4.2.1\}\{Twiddling little details\}}.

It turns out that all the cheesy mess I put in to patch the
user's home directory into the \.{\PRODUCT -run} script
wasn't necessary after all since \.{sendmail} is kind
enough to change to the user's home directory before
piping a message to a program.  This means we can just
\.{cd} to \.{.\PRODUCT} relative to the home directory.
This also means one can remove the absolute path name
from the \.{.forward} file, which cleans up the
documentation on integration with \.{Procmail}.

Added a rather tacky \.{check} target to the \.{Makefile.in}
to serve as a ``sanity check'' that doesn't require an
extensive training database.  The scheme is to train the
program with the source code for \.{\PRODUCT .w} serving
as the mail collection and \.{statlib.w} the junk bucket.
Then those programs themselves are tested, and the transcripts
verified to confirm they were correctly classified.  Astute
observers will ask where I get off using something which isn't
a well-formed mail folder to train the program.  Well, it
works thanks to a gimmick I put into the probability
calculation to keep it from dividing by zero if one or
both of the message counts were zero.  That keeps anything
untoward from happening when we're missing message headers,
and the difference in the word content of the two files is
so extreme that they reliably score correctly.

Added a new Perl gizmo, \.{TestFolder/testfolder.pl}, which walks
through a mail folder, breaks out each message, and passes it
through \PRODUCT\ to obtain the probability and classification.
(The \PRODUCT\ command is defined by a string within the
Perl program, so you can modify it as you wish to evaluate the
effects of other settings.)  At the end of the folder, the total
message count, number of messages scored as junk and mail, and
the mean probability of messages in the folder are printed.

Added a ``back'' command to \.{SplitMail/splitmail.pl}.  As
you walk through a mail folder, the start address of each
message you've seen is kept in a stack.  The ``\.{b}''
command pops the stack and backs up to the previous message.
This should reduce the pain when you're sorting a folder
and accidentally hit ``\.{d}'' when you meant to save the
message somewhere.  You can even go back after a search
operation.

Moved the \.{splitmail.pl} and \.{testfolder.pl} from their
own dedicated directories into a new \.{utilities}
directory which \.{Makefile.in} includes in the archive.
If and when these utilities require common code, such as
the CSV parser, it will be easier to manage them all in the same
directory.

Added help, requested by the ``\.{?}'' key, to \.{splitmail.pl}
at both the disposition and the ``more'' prompt while viewing
message text.  If you assign additional folder destinations
to disposition keys, they are automatically included in the
help output.

Now that \.{splitmail.pl} is equipped with a ``back'' mechanism,
there's no reason not to interpret a void disposition as
a request to advance to the next message---if it's a fat-finger,
just go back.  Trolling through a target-sparse folder can now be
done at the expense of only one keystroke per message.

\date{2002 September 22}

Went ahead and added code to dereference symbolic links up to
50 deep when deciding whether files are \.{gzip} compressed in
|mailFolder|.  What the heck, it's the equinox (well, it was a
couple of hours ago) and the full Moon to boot---better to
write silly code than to try to balance eggs on their little
ends!

Much work on the documentation today, but little on the code.
Slowly the python peristalsis moves us toward release.

\date{2002 September 23}

We're off to see the lizard, the wonderful lizard of WIN32\null!
Naturally, all of our carefully crafted code to set up pipelines
to decompress dictionaries evaporated under the harsh sun
of WIN32\null.  I added conditional compilation to disable
everything that incompetent empire self-defined by its own {\it limes}
and rusty Gates doesn't comprehend.

Building for WIN32 with DJGPP resulted in a natter about
comparison of the |size_type| of a |multimap| to an
|unsigned int|.  The Linux compiler accepted this without
a quibble.  I added a |static_cast| to clear up the
confusion.

OK, it built on WIN32 with DJGPP 2.953 and even passed the
rudimentary tests I threw at it.  So, I copied the executable
back to the development directory, then discovered and fixed
numerous bugs in the archive creation code in \.{Makefile.in}
when the WIN32 distribution is enabled.  Got better.  A
Zipped WIN32 build is now posted in the Web directory and
linked to from the home page.

The \.{configure.in} script didn't check for the
\.{-lm} math library.  This somehow managed to work
on Linux and Solaris, but failed on FreeBSD.  I added
the necessary \.{AC\_CHECK\_LIB} macro.
(Reported by Neil Darlow).

Fixed several typos in the documentation of |computeJunkProbability|
and reformatted the formula as a stacked fraction so it fits
better on the page.

Added logic to \.{configure.in} to test for the presence
of the \.{system} function and the \.{gnuplot} and
\.{ppmtogif} utilities required by the \.{--plot}
option.  If any of them is missing, the option will be
disabled when the program is compiled.

Added a test to \.{configure.in} for the presence of
\.{readlink} and disabled the code that chases
symbolic links in file name arguments if it's absent.
I also added a ``probable loop'' warning if this
code exceeds the maximum link depth limit.

Added a configurator test for the presence of \.{popen}
and code to disable the ability to read compressed
files if it's not present.  This allowed me to remove the
special case for \.{WIN32} I added last night
to build on \.{DJGPP}---it's now subsumed into the
test for \.{popen}.

Designated this version as ``Release Candidate 1'' and indicated
this by setting |VERSION| to \.{"0.1-RC1"}.

Proofed the program documentation and the formatting of
the code listing and fixed numerous typos and infelicitous
layout.

Defined \.{-t} as a shortcut single-letter option
for \.{--test} and \.{-r} as a shortcut for \.{--read}.

Release 0.1-RC1.

\date{2002 September 24}

Hugh Daniel took a look at the program and had many comments
and suggestions.  Until otherwise noted, the following
items result from them.

Corrected ``vertical interlace'' terminology in the
document to ``vertical retrace''.  I'm forever screwing
that one up.

Renamed \.{--purge} to \.{--prune}, which is a more
precise (and less intimidating) description of what it
does.  For the moment, \.{--purge} is still accepted
to ease the transition.  Fixed the \.{check} target in
\.{Makefile.in} to use \.{--prune}.

Added the hideous logic to \.{Makefile.in} to report
overall pass/fail status for the \.{check} target.

Clarified the infectious nature of the GPL in \.{COPYING}.
While I was at it, I added information about the public
domain status of DCDFlib.

Okay, back to self-generated items$\ldots$\,.  Changed
the \.{--plot} option to use \.{pnmtopng} to generate
the plot in PNG format instead of GIF.

Release 0.1-RC2.

\date{2002 September 26}

Added the ability to treat a directory as a mail folder
consisting of messages in individual files in the directory.
The contents of the directory are simply logically
concatenated and are not restricted to one message per
file; they may be \UNIX/ mail folders in their own right.

After a huge amount of wasted effort trying to do this
in an ultra-clean \CPP/ fashion by defining an
|idirstream| flavour of |istream| which returns the
concatenated contents of files in a directory, I gave up.  (I got
{\it that close}, but couldn't make it work with the
|getline| function for |string| without stooping to
ugliness and making assumptions about the guts of the
|iostream| package I believed unwarranted.)  This dead
end is why you see no log entries for yesterday.

So, I ripped all that out and simply added logic to
|mailFolder| to detect when it's passed a directory
and wrap a loop traversing the directory around the main
input loop; when end of file is encountered and we're
traversing a directory, we look for the next file and
commence processing it, declaring a genuine end of file
only at the end of the directory.

This interacts in an interesting way with the MIME
decoders.  Recall that they are passed the actual
|istream| from which the |mailFolder| normally
reads and take charge of it until the end of the
encoded section is reached.  I added {\it no} logic
to them specific to directory traversal---when they
hit the end of the stream, they declare a missing
terminator at the end of the section and bail out.
But that's {\it good}---we don't want a missing
terminator to gobble up the contents of a subsequent
file in the directory folder.  (Although if each file
begins with a ``\.{From\ }'' line, it will cause the
decoder to bail out.)  This way, it's only after
arriving back from the decoder that we detect we're
at the end of a file in the directory and progress
to the next item, if any, in the directory.

Yes, all of this is conditional on the presence of
|opendir| and |stat|, which are required to detect
and traverse the directory; the whole mess goes away
if \.{configure.in} doesn't detect them.  Yes, files
in the directory may be compressed.  And, yes, files
in the directory may be symbolic links to compressed files.
But no, you can't recursively traverse directories;
directories within a directory folder are simply
ignored, which nicely avoids a special case for
``\.{.}'' and ``\.{..}''.

In the process of putting in all this junk, I discovered
that the existing code for decompressing mail folders
failed to call |pclose| to close out the pipeline, which
is unkind.  I added a destructor which makes sure it's
called when necessary.
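
The idea of letting a destructor guarantee the |pclose| can be sketched
as below.  The class name |PipeReader| and its interface are invented for
this illustration and are not the program's own; |popen| and |pclose| are
the POSIX calls, so the sketch assumes a system which provides them.

```cpp
#include <cstdio>   // FILE, fgets; popen and pclose on POSIX systems
#include <string>

// Hypothetical wrapper: owns a read-only pipeline and closes it exactly once.
class PipeReader {
public:
    explicit PipeReader(const std::string &command)
        : pipe_(popen(command.c_str(), "r")) {}

    // The destructor closes the pipeline if the caller never did.
    ~PipeReader() { close(); }

    bool isOpen() const { return pipe_ != nullptr; }

    // Read one line without its trailing newline; false at end of input.
    bool readLine(std::string &line)
    {
        line.clear();
        if (pipe_ == nullptr) {
            return false;
        }
        char buf[256];
        bool got = false;
        while (fgets(buf, sizeof buf, pipe_) != nullptr) {
            got = true;
            line += buf;
            if (!line.empty() && line.back() == '\n') {
                line.pop_back();
                break;
            }
        }
        return got;
    }

    // Close the pipeline and return the status from pclose, or -1 if
    // it was already closed (or never opened).
    int close()
    {
        if (pipe_ == nullptr) {
            return -1;
        }
        int status = pclose(pipe_);
        pipe_ = nullptr;
        return status;
    }

private:
    PipeReader(const PipeReader &) = delete;
    PipeReader &operator=(const PipeReader &) = delete;

    FILE *pipe_;
};
```

A caller might construct it with a decompression command and read lines
until |readLine| returns false; the pipeline is reaped whether the reader
is closed explicitly or simply goes out of scope.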

Added a new \.{fragmail.pl} program to the \.{utilities}
directory.  It splits up a monolithic mail folder into
a directory with one message per file, making up
file names from the message sequence in the input
folder.

Added a new \.{signatures} target to \.{Makefile.in}
which creates \pdfURL{GnuPG}{http://www.gnupg.org/} signatures for
each of the downloadable files and added a command to the
\.{publish} target which copies them to the distribution
directory.

Added code to \.{configure.in} to test for the presence
of \.{pdftotext}, which we will eventually use to crack
PDF files.  Let's be realistic, however.  This is cool
(and will open the door to a general application specific
binary file cracker, which I've been itching to do), but
in terms of the mission statement of \PRODUCT\ and
present day junk mail, is far from important.  I've found
precisely one PDF file in each of my mail and junk
archives, so with a plane to catch tomorrow, I'm
not going to stay up any later tonight worrying about
refinements of this kind.

Release 0.1-RC3.

\date{2002 September 29}

Added logic to \.{Makefile.in} to prepare an HTML version of
the \.{man} page automatically from the \PRODUCT\.{.1}
\.{troff} file.  The output will require fixup since
it is intended to be run from a CGI script, but should
eliminate much of the duplication of labour inherent in
maintaining parallel documentation in HTML and \.{man} page
format.

\date{2002 October 1}

Expanded documentation of command line options in conjunction
with preparation of a manual page using the
\.{docutil/options.pl} translator.

Added ``{\mc USAGE}'', ``{\mc EXIT~STATUS}'', and ``{\mc FILES}''
sections to the manual page; all of these are specific
to the man page and are not derived from
\.{\PRODUCT.w}.

\date{2002 October 2}

Much work yesterday and today on automating the generation
of documentation from the \.{CWEB} source file.  I wrote a
Perl program, \.{docutil/options.pl}, to compile the options
documentation from \.{\PRODUCT.w} into \.{troff} format
with the \.{-man} macros.  Actually, although containing
special cases for the options, this is reasonably general
and may be deployed for other common documentation in the
future.

The output from \.{man2html} has some infelicitous links and
formatting for HTML intended to be shipped with the product
and included on its Web page.  I wrote a Perl hack,
\.{docutil/fixman2html.pl}, to correct these items,
and modified the \.{Makefile.in} targets to generate
a first draft HTML in \.{\PRODUCT\_man\_raw.html}, which
is post-processed by the fixup program into the
final \.{\PRODUCT\_man.html} file, which is now included
in the distribution by the \.{dist} target and copied
to the Web directory by \.{publish}, both of which targets
generate it if necessary.

Added a \.{mantroff} target to \.{Makefile.in} to preview
the \.{troff} format manual page using ``\.{groff\ -X}'' (if
available on the system---if not, don't do that).

Wrote a \.{docutil/cwebextract.pl} Perl program which searches
a \.{CWEB} file for a named section (which can be a regular
``\.{@@}'' section, so long as the search target appears on the
same line as the ``\.{@@}'').  If the section is found (matching
is case insensitive and the search target given on the command
line matches the first line containing a substring which
it matches), the contents of the documentation section are written
to standard output, trimming leading and trailing blank lines.
The end of the documentation section is the next line which
begins with an at sign or the end of file.

Moved the \TeX\ definitions used to generate the options
list to the top of \.{\PRODUCT.w} so they don't confuse
the automatic extraction and translation process.

Modified \.{docutil/cwebtex2man.pl} to ignore \TeX\
\.{\bslash bigskip} commands, carefully avoiding
generating a nugatory \.{.PP} in the \.{troff}
output due to two consecutive blank lines once the command
has been ignored.

Added the \.{docutil} directory and its contents to the
distribution generation target in \.{Makefile.in}.

Generation of the ``{\mc OPTIONS}'' section of the
\.{\PRODUCT.1} manual page from the corresponding section
of \.{\PRODUCT.w} is now completely {\bf Turbo~Digital}$^{\rm TM}$.
The invariant parts of the manual page are now defined in
the ``manual page macro'' file \.{\PRODUCT.manm}.  The
\.{Makefile.in} now understands that \.{\PRODUCT.1} is
generated by processing this file with \.{docutil/manm\_expand.pl}
which expands \.{\bslash"\%include} statements in the macro
file by extracting the specified section from the named
\.{CWEB} file with \.{docutil/cwebextract.pl}, translating
it into manual page \.{troff} with \.{docutil/cwebtex2man.pl},
and inserting it in the output file in place of the
include statement.  This completely eliminates all manual
labour when updating the options in the manual page
and guarantees that changes to the option documentation
in \.{\PRODUCT.w} are propagated to the manual page
document.  The same mechanism can be used for other common
documentation as the need arises.

\date{2002 October 3}

Subtly obfuscated the E-mail address to which bugs should be
reported in the manual page so the process of transforming it
into HTML won't result in a deadly \.{mailto:} link or a
sniffable address in the page.  Visual fidelity for human readers
is maintained.

Updated the Web document to reflect the existence of the HTML
manual page and added links to it.

Added a reference to the PDF document to the ``{\mc SEE~ALSO}''
section of \.{\PRODUCT.manm}.  Fixed an embarrassing hyphenation
of a file name by prefixing the offending word with the
\.{troff} ``don't hyphenate'' escape ``\.{\bslash\%}''.  (Apparently,
even in \.{nh} mode, \.{troff} will hyphenate a word which contains
an embedded hyphen unless you explicitly forbid it.)

Added the \.{.w} files to the \.{winarch.zip} archive used
to transfer files to build for Win32.  While they aren't
strictly required, they're awfully handy to have should you
encounter compile errors, which are reported with line numbers
from the \.{CWEB} file.  Looking it up while on Windows and patching
the \CPP/ file is a lot quicker than booting back into a real
operating system to explore the problem.

In |@<Check whether folder is a directory of messages@>| there
was an erroneous reference to |dirFolder| not conditional on
|HAVE_DIRECTORY_TRAVERSAL|---fixed.

The |mailFolder| constructor which accepts a file name in a
|string| re-used the |ifstream| |isc|, which was previously
used only when reading compressed files.  This caused compile
errors on systems where |COMPRESSED_FILES| was not defined.
We now unconditionally define |isc| in the |mailFolder| class
definition.

With these fixes, the \.{makew32.bat} build on Win32 now works
once again.

Added a \.{testw32.bat} file which runs a rudimentary test of
the Win32 build similar to the \.{check} target in \.{Makefile.in}.
I added this file to both the \.{dist} and \.{winarch} archive
generation targets in \.{Makefile.in}.

Modified \.{Makefile.in} to replace the hard-coded
\.{/ftp/\PRODUCT} destination with a \.{PUBDEST}
declaration at the top of the file which defaults
to the same directory.  This permits overriding the
default publication destination for use at another
site or for nondestructive testing of new releases
simply by editing the \.{Makefile}.  Some day, it
might make sense to permit overriding this with an
option at \.{./configure} time, but this is not
that day.

Release 0.1-RC4.

\date{2002 October 11}

Integrated the application string parsers for Flash and PDF
formats, which were developed in a separate stand-alone
test program.  These include the classes |applicationStringParser|
(mother of all application parsers), |flashStream|,
|flashTextExtractor|, and |pdfTextExtractor|, the latter
compiled in only if all the utilities it needs to decode
PDF via a pipe to \.{pdftotext} are present.  At the moment, these
aren't hooked up to the mail folder, but are merely exercised
by code in the \.{--jig}.

Integrated Knuth and Levy's \.{CWEB} version 3.64 in the
\.{cweb} directory.  The \.{CWEAVE} and \.{CTANGLE} programs
are built with a change file, \.{common-bigger.ch}, which increases
the input line length limit to 400 characters as I did in
the earlier 3.63 release.

Added plumbing to invoke Flash and PDF parsers for attachments
with those application types.  Thanks to the inability to take
a class member function as an unqualified function pointer,
this is somewhat tacky, requiring a pointer to the
|mailFolder| to obtain decoded data.

\date{2002 October 12}

Added decoders and interpreters for Shift-JIS and Unicode
(UCS-2, UTF-8, and UTF-16 encodings).  These are used to
decode and interpret these character sets in Flash animations
whose fonts are so tagged.

Added logic to invoke the new Unicode UTF-8 decoder when
a MIME part's \.{charset=} designates it so encoded.

\date{2002 October 13}

In the process of testing UTF-8 decoding of Unicode messages,
I stumbled over a bug in ignoring HTML comments embedded
within tokens, a common trick in junk mail to evade
na\"\i ve filters, for example, ``\.{remo<!---->ve\ your<!---->self}''.
(Yes, I know a valid HTML comment is supposed to contain a
space after the initial and before the final sentinel,
but junk mail often violates this rule, counting on sloppy
browsers not to enforce the standard, so we must comply in
the interest of ``seeing what the user would''.)  HTML
comments are now completely discarded, even when embedded
within tokens.
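
The effect described above can be sketched as a simple filter applied
before tokenisation.  The function name |stripHtmlComments| is
hypothetical, and discarding the remainder of the text after an
unterminated comment is a choice made for this sketch, not something the
log specifies.

```cpp
#include <string>

// Hypothetical filter: remove every "<!-- ... -->" span so that a word
// split by an embedded comment is rejoined into a single token.
std::string stripHtmlComments(const std::string &text)
{
    std::string out;
    std::string::size_type pos = 0;

    while (pos < text.size()) {
        std::string::size_type start = text.find("<!--", pos);
        if (start == std::string::npos) {
            out.append(text, pos, std::string::npos);   // no more comments
            break;
        }
        out.append(text, pos, start - pos);             // text before comment
        std::string::size_type end = text.find("-->", start + 4);
        if (end == std::string::npos) {
            break;          // unterminated comment: drop the rest (assumption)
        }
        pos = end + 3;      // resume just past the closing sentinel
    }
    return out;
}
```

No space is required after the opening or before the closing sentinel, so
the sloppy form used in junk mail is handled the same way as a
well-formed comment.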

The \.{dist} target in \.{Makefile.in} failed to clean
the \.{cweb} directory before including it in the source
archive, which could have the result of leaving objects
and binaries not compatible with the system on which
the user is installing.  I modified the target to descend
into the \.{cweb} directory and \.{make\ clean}.  This
promptly ran into another problem because the \.{CWEB}
\.{Makefile} deletes the \CEE/ source for \.{CWEAVE},
using the bootstrapped \.{CTANGLE} to re-build it.  This
is clean, but runs afoul of my rebuilding both programs
directly in the outer \.{Makefile}.  I saved the original
\.{CWEB} makefile as \.{Makefile.ORIG} and modified
the \.{clean} target in the actual \.{Makefile} to
leave \.{cweave.c} around.  I also modified our own
\.{clean} target to clean the \.{cweb} directory as well.

Attempting to build \.{.dvi} or \.{pdf} targets after
you'd cleaned the \.{cweb} directory failed for lack of
\.{cweave}; I added a dependency to \.{Makefile.in} to
ensure it's rebuilt when needed.

Since certain recent versions of \.{gcc} libraries
have begun to natter if \CPP/ include files specify
the \.{.h} extension (which, for years, was
{\it required} by those self-same libraries), I eliminated
them from our list of includes, which finally seems to work
on \.{gcc} 2.96.  Doubtless this will torpedo somebody using
an earlier version.

Broke up the unreadably monolithic list of include files
into sections which explain what's what.

{\bf Dooooh!}  Forgot to disable the declaration of the
|pdfTextExtractor| in |mailFolder| when
\.{HAVE\_PDF\_DECODER} was not defined, which was the undoing of
the Win32 build; fixed.

Release 0.1-RC5.

\date{2002 October 19}

Added a check in |classifyMessages| to verify that a dictionary
has been loaded before attempting to classify a message.  If
no dictionary is present, a warning is written to standard
error and the junk probability is returned as 0.5.

Added a warning if command line arguments are specified after a
\.{--classify} command.  Since this command always exits
with an exit code indicating the classification, specifying
subsequent arguments is always an error.

Added a bunch of consistency checking for combinations of
options which don't make any sense and suggest the user
doesn't understand in which order they should be specified.
To facilitate this, I modified the code for the
\.{--classify} option to set a new |lastOption| flag
to bail out of the option processing loop and set
|exitStatus| to the classification rather than exiting
directly before the option consistency checks are
performed.  This cleans up the control structure in
any case.

In the process of adding the above code, I discovered that
the |any()| method of |bitset| seems to be broken in the
\.{glibc} which accompanies \.{gcc} 2.96.  I tested
|count()| against zero and that seems to work OK.
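
The workaround amounts to the following minimal sketch; the helper name
|anySet| is invented here for illustration.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical helper: equivalent to b.any(), expressed with count()
// for libraries whose any() cannot be trusted.
template <std::size_t N>
bool anySet(const std::bitset<N> &b)
{
    return b.count() != 0;
}
```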

Implemented phrase tokens.  You can consider phrases of
consecutive tokens as primitive tokens by specifying the
minimum and maximum words composing a phrase with the
\.{--phrasemin} and \.{--phrasemax} options.  These default to
1 and 1, which suppresses all phrase-related flailing around.
If set otherwise, tokens are assembled into a queue and all
phrases within the length bounds are emitted as tokens.
How well this works is a research question we may now
address with the requisite tool in hand.
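
The queue-based assembly described above can be sketched as a sliding
window over the token stream.  This is an illustration only: the class
name |PhraseAssembler|, the use of a single space to join words, and the
assumption that the minimum is at least one and no greater than the
maximum are all choices made for this sketch rather than details taken
from the program.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical phrase assembler: keeps the most recent maxWords tokens
// and, for each new token, emits every phrase ending at that token whose
// length lies between minWords and maxWords.
class PhraseAssembler {
public:
    PhraseAssembler(std::size_t minWords, std::size_t maxWords)
        : minWords_(minWords), maxWords_(maxWords) {}

    std::vector<std::string> add(const std::string &token)
    {
        window_.push_back(token);
        if (window_.size() > maxWords_) {
            window_.pop_front();            // forget the oldest token
        }

        std::vector<std::string> phrases;
        for (std::size_t n = minWords_;
             n <= maxWords_ && n <= window_.size(); ++n) {
            std::string phrase;
            for (std::size_t i = window_.size() - n; i < window_.size(); ++i) {
                if (!phrase.empty()) {
                    phrase += ' ';          // word separator (assumption)
                }
                phrase += window_[i];
            }
            phrases.push_back(phrase);
        }
        return phrases;
    }

private:
    std::size_t minWords_;
    std::size_t maxWords_;
    std::deque<std::string> window_;
};
```

With both bounds set to one, each call returns just the token itself, so
the default behaviour is unchanged.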

\date{2002 October 20}

Added code to import a binary dictionary file with the
\.{--read} option using memory-mapped I/O if \.{./configure}
detects that facility and defines \.{HAVE\_MMAP}.  This
isn't a big win on individual runs of the program, but if
you're installing it on a high volume server, multiple
read-only references to the dictionary file (be sure to
make the file read-only, by the way) can simply bring the
file into memory where it is re-used by multiple instances
of the program.  (Of course, if the system has an efficient
file system cache, that may work just as well, but there's
no harm in memory mapping in any case.)  Thanks to the
\CPP/ theologians who deprecated the incredibly useful
|strstream| facility, which is precisely what you need to
efficiently access a block of memory mapped data as a stream,
I included a copy of the definition of this facility in
\.{mystrstream.h} so we don't have to depend on the
\CPP/ library providing it.

I was a little worried about writing phrases in CSV format
without quoting the fields, but I did an experiment with
Excel and discovered it doesn't quote such fields either---it
only uses quotes if the cell contains a comma or a quote
(in which case it forces the quote by doubling it).  Since
our token definition doesn't permit either a comma or a quote
within a token, we're still safe.
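
For reference, the quoting rule observed in that experiment can be
written as a small function.  This is a sketch of the rule as described
above (quote only when a comma or quote is present, doubling embedded
quotes), not code from the program, which writes its fields unquoted;
the name |csvField| is hypothetical.

```cpp
#include <string>

// Hypothetical helper: quote a CSV field only when it contains a comma
// or a double quote, doubling any embedded double quotes.
std::string csvField(const std::string &field)
{
    if (field.find_first_of(",\"") == std::string::npos) {
        return field;                   // safe to write as-is
    }
    std::string quoted = "\"";
    for (std::string::size_type i = 0; i < field.size(); ++i) {
        if (field[i] == '"') {
            quoted += '"';              // double the embedded quote
        }
        quoted += field[i];
    }
    quoted += '"';
    return quoted;
}
```

A phrase containing only words and spaces passes through unchanged, which
is why unquoted output remains valid for the current token definition.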

\date{2002 October 21}

Added a \.{--phraselimit} option to discard phrases longer than
the specified limit on the fly.  This prevents dictionary bloat
due to ``phrases'' generated by concatenation of gibberish
from headers and strings decoded from binary attachments.  These
will usually be eliminated by a \.{--prune}, but that doesn't
help if the swap file's already filled up with garbage phrases
before reaching the end of the mail folder.  The default
\.{--phraselimit} is 0, which imposes no limit on the length
of phrases.

\date{2002 October 22}

When the default |getNextEncodedLine| of a
|MIMEdecoder| encountered the ``\.{From\ }'' line of the
next message in a mail folder, it failed to store the line
as the part boundary, which in turn caused |mailFolder|
to mis-count the number of messages in a folder being
parsed when training.  I fixed this, and in the process re-wrote
an archaic \CEE/ string test used in
|@<Check for start of new message in folder@>| to use
a proper \CPP/ |string| comparison.

Corrected some ancient URLs in \.{README}, and
added information on the SourceForge project there
and in \.{annoyance-filter.manm}.

Release 0.1-RC6.

\date{2002 October 23}

Modified \.{docutil/fixman2html.pl} to include an absolute
URL for the ``Fourmilab Home Page'' link.  This gets people
back to the site when the resulting manual page is posted
on SourceForge.

Updated the \.{distclean} target in \.{Makefile.in} to get
rid of several intermediate files which had crept in since
the last housecleaning.  These made it more difficult to
detect any new files which required adding to the CVS repository.

Added the \.{utilities/maildir\_filter.pl} utility contributed
by Travis Groth.  This has been added with CVS but not
yet committed.

\date{2002 October 26}

Added a \.{--biasmail} option to set the frequency bias for
words and phrases found in legitimate mail.  Previously this was
fixed at 2, which remains the default.

Added \.{autoconf} plumbing to detect all the myriad stuff
required to support POP3 proxying.  We attempt to distill
all of these detections down to a \.{POP3\_PROXY\_SERVER}
definition which controls all code related to that
capability.

\date{2002 October 27}

Integrated the stand-alone POP3 test article as a new
|POP3Proxy| class with a hard-coded exerciser in the
\.{--jig}.  At the moment, it's purely a proxy---it doesn't
interpose the filter.

\date{2002 October 30}

After much struggling, the POP3 proxy now seems to be working,
so it's time to integrate it fully into the program.

Added a \.{--pop3port} option to specify the port on which
the POP3 proxy listens for connections.  If not specified, the
port number defaults to 9110.

Added a \.{--pop3server} option to specify the server
and, optionally, port (which defaults to 110 if not given)
to which the POP3 proxy server will connect.  This must
be the last option (a warning is given if it isn't), and
causes the server to immediately begin operation.  I removed
the server test code from the \.{--jig} and physically
moved it to a subsection within the ``POP3 proxy server''
section, following the class definition.
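
Splitting such an argument into server and port might be sketched as
follows.  The ``\.{server:port}'' syntax with a colon separator and the
function name |parseServerSpec| are assumptions made for this
illustration; the log states only that the port is optional and defaults
to 110.

```cpp
#include <cstdlib>
#include <string>
#include <utility>

// Hypothetical parser: split "server" or "server:port" into its parts,
// supplying the default port when none is given.
std::pair<std::string, int> parseServerSpec(const std::string &spec,
                                            int defaultPort = 110)
{
    std::string::size_type colon = spec.rfind(':');
    if (colon == std::string::npos) {
        return std::make_pair(spec, defaultPort);
    }
    // No validation of the port text is attempted in this sketch.
    int port = std::atoi(spec.substr(colon + 1).c_str());
    return std::make_pair(spec.substr(0, colon), port);
}
```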

\date{2002 October 31}

Disabled the \.{--jig}, since there's nothing in it
at the moment.

Added proper conditional setting of |POP3_PROXY_SERVER| based
on the capabilities sensed by \.{autoconf} and fixed one
compile problem if the proxy server is disabled.  At the
moment, we assume that if |socket| and |signal| are
defined, everything else we'll need will also be defined.

\date{2002 November 1}

Cleaned up POP3 proxy code and added documentation of
the related command line options.  I still need to add a
main document section on how to install and operate a
proxy server.

\date{2002 November 2}

We weren't activating the byte stream parser for spoofed
mail worm attachments which trick Microsoft Outlook into
executing an attachment through the incredibly subtle
stratagem of declaring the attachment as an innocuous file
type such as audio or image, but with an extension which
denotes an executable file.  Brain-dead Outlook decides
whether to block or confirm executable content based upon
the former, but then actually executes the file based upon
the latter.  Can you say ``duh''?

Well, thanks to this particular piece of Redmond rot, tens
of millions of these worms continue to pollute the net since,
even though the hole has been plugged, millions of the bottom-feeders
who use such software continue to use unpatched versions and/or
run machines which are already infected and actively propagating
the worm.

All right, enough polemic.  What this means for \PRODUCT\ is
that when we see an attachment with a \.{Content-Type} which
usually denotes something we're not interested in parsing, but
then discover its file name is one of the suspicious executable
Microsoft file types, we need to feed it through the byte
stream parser just as if it were tagged with an
``\.{application}'' file type.  Doing so will extract the
inevitable embedded strings, which will act as a signature for
subsequent encounters with the same or similar worm. 
(SourceForge bug 631503, reported by Neil Darlow.)
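
The decision described above can be sketched as a predicate on the
declared type and the file name.  Everything named here is hypothetical:
the function |needsByteStreamParser| is invented for illustration, and
the list of extensions is only a plausible sample, not the list the
program actually uses.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Fold a string to lower case for case-insensitive comparison.
static std::string lowerCase(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return s;
}

// Hypothetical predicate: should this attachment go through the byte
// stream parser?  True for "application/..." types, and also for any
// type whose file name ends in a suspicious executable extension.
bool needsByteStreamParser(const std::string &contentType,
                           const std::string &fileName)
{
    if (lowerCase(contentType).compare(0, 12, "application/") == 0) {
        return true;
    }
    // Illustrative sample only; the real list is not given in the log.
    static const char *const suspicious[] = {
        ".exe", ".scr", ".pif", ".bat", ".com"
    };
    const std::string name = lowerCase(fileName);
    for (const char *ext : suspicious) {
        const std::string e(ext);
        if (name.size() >= e.size() &&
            name.compare(name.size() - e.size(), e.size(), e) == 0) {
            return true;    // innocuous type, executable name: parse it
        }
    }
    return false;
}
```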

Improved diagnostics for parser errors by saving the
``\.{From\ }'' line and \.{Message-ID} (if any) from the
header and then labeling any parser diagnostics written
to standard error with the \.{--verbose} option with them.
The labels are written only before the first diagnostic for
each message in a folder, and diagnostics are now indented to
better distinguish them from the labels.

Diagnostics from |MBCSdecoder| objects were written to
standard error without any identification of the message
in which they occurred.  I added the ability to link
an |MBCSdecoder| to its parent |mailFolder| with the
new |setMailFolder| method.  If linked, diagnostics from
the decoder are emitted via the |reportDecoderDiagnostic|
method of the linked folder, permitting them to be labeled
with the message identification as described in the previous
paragraph.  It's still possible to use an |MBCSdecoder|
without linking it to a |mailFolder|---if the link is
|NULL|, diagnostics are sent to standard error as before.

Improved diagnostics from the various |MBCSdecoder| classes.
All reports of invalid two-byte sequences now report both
hexadecimal bytes, and other invalid value diagnostics
report the offending hexadecimal value.

Added the ability to search for a literal substring as well
as a regular expression in \.{utilities/splitmail.pl}.
If the search target begins with ``\.{+}'' (which is invalid
in a regular expression), the balance of the pattern is
searched for with case-insensitive comparison.  Since so
many of the message headers you're likely to be looking for
contain regular expression meta-characters, it's a lot
more convenient to specify an explicit target than remember
what they all are and quote them.

Corrected the diagnostic for an unknown character set in
a header line string to say ``Header line'' rather than the
obsolete and misleading ``Subject line'' it used to say.

Added ``\.{us-ascii}'' to the list of character sets which
require no multi-byte decoding or interpretation when they appear in
header line quoted strings.  Junk mail sometimes encodes even
ASCII subject lines (and sometimes other headers) as
Base64 or Quoted-Printable to hide the text from na\"\i ve
filters.

Added a script to build under Cygwin, \.{makew32.sh}.  Attempting
to link in our own copies of \.{getopt.c} and \.{getopt1.c}
runs afoul of the Cygwin linker ({\it why?}), so I removed
them from the compiles and link done by this script.

Building on Cygwin failed because the library I was using
didn't define \.{in\_addr\_t}.  I'd seen this earlier on
Solaris, but had inadvertently added a new reference since
I'd last tested there.  I changed the offending reference
(in a \.{static\_cast} of all places), to our cop-out type
\.{u\_int32\_t}, which \.{autoconf} guarantees will always be there.
With that fix, the program built {\it and worked} on Cygwin,
including POP3 proxying!

The check for non-white space following a soft line break
in a Quoted-Printable MIME part failed for a POP3 proxy
message containing CR/LF line terminators.  I broadened the
definition of white space in |@<Character is white space@>|
to include carriage return.
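
The effect of the broadened definition can be shown with a minimal
sketch.  The function names and the exact set of characters other than
carriage return are assumptions for illustration; they are not copied
from the program.

```cpp
#include <string>

// Hypothetical white space test which, like the fix above, counts a
// carriage return as white space.
inline bool isWhiteSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Hypothetical check: a Quoted-Printable line ends in a soft line break
// if its last non-white-space character is an equal sign.
bool endsWithSoftBreak(const std::string &line)
{
    std::string::size_type end = line.size();
    while (end > 0 && isWhiteSpace(line[end - 1])) {
        --end;              // skip trailing blanks, tabs, CR, and LF
    }
    return end > 0 && line[end - 1] == '=';
}
```

Without the carriage return in the white space set, a line delivered
through the proxy as ``\.{text=}'' followed by CR would be rejected as
having non-white space after the soft break.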

\date{2002 November 3}

Scribbled a first cut \.{README.WIN} file to be included in the
Win32 executable archive which explains the issues involving
the included Cygwin DLL\null.  I modified \.{Makefile.in} to
include this file, the DLL, and \.{COPYING.GNU} (the GPL) in
the Win32 archive.

Tested the Win32 archive on a Cygwin-free machine.  Seems to
work OK, including POP3 proxy from another machine on the LAN.

Verified that POP3 proxy on a Cygwin-free machine running
Windows 98 works with the version of Outlook furnished with
that system, which can be configured to retrieve messages from
``\.{localhost}'' on our default port of 9110.  Note, however, that
one must first configure the account (defaulting to port 110),
then edit the properties of the account, using the ``Advanced''
tab to specify the POP3 port of 9110.

Messages embedded within other messages with the
\.{Content-Type} specification of \.{message/rfc822}
did not have their own MIME parts correctly decoded
because |mailFolder| failed to scan the header of the
embedded message for its own \.{Content-Type} and
\.{boundary} specifications.  Fixed.  This should get rid
of the previously mysterious long gibberish strings which
decoded out of forwarded messages with image and other
binary attachments.  The strings were due to the Base64
decoder not being activated for the embedded message's
attachments.

\date{2002 November 5}

Implemented the first cut of fast dictionary support.  Having
created a dictionary in memory, you can export it to a file
in fast dictionary format with the \.{--fwrite} option.  The
\.{--fread} option loads such a dictionary and, if loaded,
it takes precedence over a regular |dictionary|.  This
permits fast classification of messages without all the
overhead of creating a full-fledged in-memory dictionary.

Added memory-mapping of the fast dictionary when |HAVE_MMAP|
is defined.  In the interest of code commonality, the header
fields are read from an |istrstream| bound to the memory
mapped block, but access to the hash and word tables is
pure pointer-whack.

Fixed a typo in \.{configure.in} which caused a harmless
but ugly warning when running the script.

Disabled static linking for SunOS systems in \.{configure.in}
due to GCC's inability to find the networking libraries
when static linking.

Added a list of optional capabilities detected by
\.{configure} to the \.{--version} output.  This makes
long-distance diagnosis of configuration problems easier.

The check for attempting to start a POP3 proxy server
without having loaded a dictionary didn't test for a
fast dictionary's having been loaded.  Fixed.

The destructor for |fastDictionary| attempted to |delete|
the in-memory dictionary even when it was, in fact, memory
mapped from a file.  I added conditional code to replace the
|delete| with a |munmap| and |close| of the file.  In addition,
I added logic to unmap and close the file if an error was detected
while reading its header.

Modified the ``\.{check}'' target in the \.{Makefile.in}
to use a fast dictionary for the junk test.  This guarantees
the fast dictionary code will be exercised in the normal course
of building and installation.

Added the \.{-x} option to the invocation of the shell in the
Cygwin \.{makew32.sh} script so we can see what's going on
during the build.

\date{2002 November 6}

Created a \.{pop3proxy.pif} file as a skeleton PIF the user
can edit (with ``Properties'' from the right click menu)
to set up an auto-start POP3 proxy server.

Discovered that \.{README.WIN} (the description of
Cygwin related issues for the Windows executable archive)
was missing from the comprehensive source archive.  It
was also missing from the CVS tree.  Both fixed.

Added confirmation messages for exporting and loading
fast dictionary files when \.{--verbose} is set.

Added an option to the \.{tar} command used to create
the source archive to exclude the CVS subdirectories.
This works only with GNU \.{tar}, but that should be OK,
since we only create distributions on systems so equipped.

Release 1.0-RC1.

\date{2003 January 22}

Added code to the |POPDEBUG| output to echo both status
replies from the server and the body of multi-line reply
messages.

Eliminated some obsolete disabled code in
|@<Read status line from server@>| in POP3 proxy support.

Promoted the POP3 trace facility from conditional compilation
to a full-fledged option, \.{--pop3trace}, which causes the
trace output to be written to |cerr|, tagged with a prefix
of ``\.{POP3:\ }''.  Added trace output to show replies
sent to the client, both status lines and multi-line bodies.

Removed the disables (got that?) of |HAVE_DIRENT_H| and
|HAVE_POPEN| for \.{WIN32} builds, permitting directory
traversal when building dictionaries and expansion of
compressed files (if \.{gzip} is installed on the system).
These were previously disabled when we
built with \.{DJGPP}, which didn't support these features;
Cygwin does.

\date{2003 January 23}

Made the |writeMessageTranscript| methods of |mailFolder|
|const|, as they don't change any member of the class.

Added a new |sizeMessageTranscript| method to |mailFolder|
which computes the size of the file written by |writeMessageTranscript|.
If you intend to export the transcript with a different per-line
overhead than the one byte added by |writeMessageTranscript|, you
can pass a |lineOverhead| argument to specify the overhead; the
default value is one.
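
As a rough illustration of the size computation, the following sketch
totals the bytes a transcript would occupy when every line carries a
fixed terminator overhead.  The free function |transcriptSize| and its
|vector<string>| argument are hypothetical stand-ins chosen for the
example; they are not the actual |mailFolder| interface.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for mailFolder::sizeMessageTranscript.
// Each line contributes its own length plus lineOverhead bytes
// (one byte for a bare line feed, two for CR/LF, and so on).
std::size_t transcriptSize(const std::vector<std::string> &lines,
                           std::size_t lineOverhead = 1)
{
    std::size_t total = 0;
    for (std::vector<std::string>::const_iterator p = lines.begin();
         p != lines.end(); ++p) {
        total += p->size() + lineOverhead;
    }
    return total;
}
```

Knowing the size in advance is what allows a buffer to be allocated
once, rather than grown repeatedly while the transcript is written.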

I finally figured out what was causing ``hangs'' when transferring
large messages as a POP3 filter on \.{WIN32} platforms (Cygwin
builds).  Well, it {\it wasn't} hung---it had just slowed
down by several factors of a thousand and nobody noticed
the difference.  ``Why?'', you ask.  Well, it turns out that
after all the real work is done, |popFilter| called
|writeMessageTranscript| with an |ostringstream| to create
the reply message body to be returned to the POP3 client.
This apparently trivial operation, which is essentially
instantaneous on a Linux or Solaris box with GCC and its
libraries, runs a tad slower under the Cygwin version of
the very same compiler and libraries.  How much slower?
Well, for a half-megabyte file, about 1500 times slower!
Worse, the slow-down grows much faster than linearly with
the size of the file; I tested a one megabyte file and
gave up after several hours of watching it.  Presumably
there is some idiocy in the allocator used to expand the
|string| within the |ostringstream| which is causing it
to take longer and longer as the string grows.  I rewrote
the code in question to use a trusty |ostrstream|
directed at a dynamically allocated buffer (that's
what |sizeMessageTranscript|, discussed above, is for),
and the whole thing runs too fast to measure under both
Linux and Cygwin now.  Ain't ``source compatibility'' fun?

Moved the include of \.{mystrstream.h} outside the conditional
for |HAVE_MMAP|, as it is now needed by the |popFilter| code
as well.

Added \.{mystrstream.h} to the files included in the \.{WIN32}
transfer archive by the \.{winarch} target in \.{Makefile.in}.

To avoid possible copying of the string containing a large message
body and to make the code consistent, modified
|@<Create mail folder to read reply from POP3 server@>| to use
an |istrstream| directed at the data of the |reply| string
rather than an |istringstream|.  Given the adventures we've had
with |ostringstream|, the less I have to do with these beasts
the better.

Added the ability to limit the size of single |send| calls
writing a multi-line reply body back to a POP3 client in
|@<Relay multi-line reply, if any, to the client@>|.  If
|POP3_MAX_CLIENT_WRITE| is defined, multiple sends no larger
than that value will be used.  Otherwise, all the data will
be sent in a single monolithic |send| as before.  This was
added in the process of chasing down the Cygwin ``hang'' problem,
and for the moment I've left the code in place in case it should
be needed in the future.
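
A minimal sketch of the chunked relay, assuming a sender callback in
place of the real socket |send| so the loop can be exercised without a
network connection.  The names |relayInChunks|, |chunkSender|, and
|recordChunk| are invented for this example, and a limit of zero is
used here to mean ``no limit'', which plays the role of
|POP3_MAX_CLIENT_WRITE| being undefined.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sender type standing in for send(): returns the number
// of bytes accepted, or a value <= 0 on error.
typedef long (*chunkSender)(const char *data, std::size_t length,
                            void *context);

// Relay length bytes, never handing the sender more than maxWrite
// bytes at once.  A maxWrite of zero means a single monolithic write.
bool relayInChunks(const char *data, std::size_t length,
                   std::size_t maxWrite, chunkSender sender,
                   void *context)
{
    std::size_t sent = 0;
    while (sent < length) {
        std::size_t chunk = length - sent;
        if (maxWrite != 0 && chunk > maxWrite) {
            chunk = maxWrite;
        }
        long n = sender(data + sent, chunk, context);
        if (n <= 0) {
            return false;               // sender reported an error
        }
        sent += static_cast<std::size_t>(n);  // tolerate partial writes
    }
    return true;
}

// Demonstration sender: records each chunk size in the
// vector<size_t> passed through context and accepts every byte.
long recordChunk(const char *, std::size_t length, void *context)
{
    static_cast<std::vector<std::size_t> *>(context)->push_back(length);
    return static_cast<long>(length);
}
```

With a ten byte body and a limit of four, |recordChunk| sees chunks of
four, four, and two bytes.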

The |mailFolder| constructor which takes an |istream| argument
did not clear |dirFolder| when built with |HAVE_DIRECTORY_TRAVERSAL|.
This ran the risk that, at the end of the folder,
we would erroneously call |readdir| to look for
the next file in a nonexistent directory.  This was particularly
a risk for POP3 proxying, where the mail folder is created on
the stack and static initialisation doesn't occur.  I added
an explicit clear of |dirFolder| in the |istream| constructor
of |mailFolder|.

Added a program, \.{fromtest.pl}, to the \.{utilities} directory,
which scans a mail folder and checks for occurrences of the
initial string ``\.{From\ }'' not preceded by the start of
file or a blank line.  Most Unix mail folders obey this convention,
but the original definition of BSD mail folders required {\it every}
occurrence of ``\.{From\ }'' at the start of a line to be quoted
(traditionally with ``\.{>}'').  You can use this program to test your mail
folders and determine which kind your mail system creates.

\date{2003 January 24}

Modified the \.{winarch} target in \.{Makefile.in} to exclude
any \.{CVS} directories it may encounter.

Added |strstream|, |istrstream|, and |ostrstream| to the \CPP/
library type list in \.{cweb/c++lib.w}.

\date{2003 February 15}

Started to dig into compile incompatibilities in the ``new and
improved'' libraries which accompany \.{gcc} 3.2.2.  In
the language lawyering verbiage below, ``Stroustrup'' refers
to ``The \CPP/ Programming Language, Special Edition'' by
Bjarne Stroustrup, ISBN~0-201-70073-5.

First of all, my local copy of |strstream| in \.{mystrstream.h}
ran afoul of other changes in the ``standard'' library.  I
merged the \.{backward/strstream} and \.{backward/strstream.h}
files from the 3.2.2 library and installed them as
\.{mystrstream-GCC3.h}, which is included if \.{GCC3}
is defined.  I have yet to add the \.{autoconf} logic to
detect this; at the moment I'm specifying this when I
invoke the \.{Makefile}.

An include of the now {\it verboten} \.{iostream.h} remained
in \.{statlib.w}; I pulled the ``\.{.h}''.

In addition, \.{statlib.w} ran afoul of the dreaded ``implicit
typename is deprecated'' warning in GCC 3.2.  I added the required
|typename| qualifier before constructs such as
|dataTable<T>::iterator@, p| in the methods of |dataTable|.
See section {\bf C.13.5} in Stroustrup for details.
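
The fix can be sketched as follows.  The |dataTable| shown here is a
simplified, hypothetical version (assumed to derive from |vector<T>|)
with an invented |sum| method; only the placement of |typename| is the
point.

```cpp
#include <vector>

// Simplified, hypothetical dataTable: a vector with one extra method.
template <class T>
class dataTable : public std::vector<T> {
public:
    T sum() const;
};

template <class T>
T dataTable<T>::sum() const
{
    T total = T();
    // The iterator type depends on the template parameter T, so the
    // compiler must be told explicitly that it names a type.
    for (typename dataTable<T>::const_iterator p = this->begin();
         p != this->end(); ++p) {
        total += *p;
    }
    return total;
}
```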

Previously, \.{gcc} treated the buffer argument of |ostream::write|
like a \CEE/ |void *| pointer.  Now one must explicitly
coerce it with a |reinterpret_cast<const char *>|.  The same goes
for |istream::read|, where the argument must be coerced
with |reinterpret_cast<char *>|.  This played havoc with
our binary I/O code in |dictionaryWord| and |fastDictionary|,
requiring ugly casts all around.  I may go back and prettify
these with a macro, but not before I get the sucker past
all the other compile problems.
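
For illustration, here is a small sketch of the casts in question,
writing an |unsigned int| in host byte order and reading it back.  The
helper names |writeBinary| and |readBinary| are invented for this
example and do not appear in |dictionaryWord| or |fastDictionary|.

```cpp
#include <istream>
#include <ostream>
#include <sstream>

// ostream::write takes a const char *, so the address of any other
// object type must be coerced explicitly.
void writeBinary(std::ostream &os, unsigned int value)
{
    os.write(reinterpret_cast<const char *>(&value), sizeof value);
}

// istream::read stores into its buffer, so it takes a plain char *.
bool readBinary(std::istream &is, unsigned int &value)
{
    is.read(reinterpret_cast<char *>(&value), sizeof value);
    return !is.fail();
}
```

A pair of wrappers like these is one way the casts could be confined
to a single place rather than scattered through the binary I/O code.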

In days of yore, when everybody knew that an STL |vector|
was just a dynamically sized array, you were allowed to treat
an iterator of the |vector| as a \CEE/ pointer to access
the contents of the object, as long as you made sure all
references were within bounds: no more.  No longer can you,
for example, write the entire contents of a |vector<char>|
to a stream with a single |write|.  Instead, you must painfully
iterate over every element in the |vector|, doing I/O on
each one individually.  This is potentially a huge
performance hit which may motivate abandonment of
the STL |vector| in favour of a \CEE/ array which can be
written in one swell foop.  Fortunately, all the cases
where this occurs in \PRODUCT\ are in exporting
|fastDictionary| objects, which happens so infrequently
we don't care how fast it runs.

\.{Gcc} 3.2 also complains if you declare the values of default
arguments in a method within a class, then repeat them in the
implementation declared subsequently.  I've always written code
this way, considering it to better document what's going on,
particularly since the poor sucker who has to fix the code
later on is probably going to be looking at the implementation
and may be unaware of the default argument values declared
back in the class definition.  Well, it turns out that one can
read section {\bf 7.5} of Stroustrup as prohibiting this pursuant
to the ``default argument cannot be repeated or changed in a
subsequent declaration in the same scope'' prescription and,
indeed, the example of default arguments in class methods in
section 10.2.3 is coded this way.  Okay, what can I do but
``fix'' it, but to my mind this reduces the maintainability
of the code.  I think you should be able to use precisely the
same declaration of the function in its definition and
implementation, including default arguments and attributes
such as |const|.  The compiler should verify that they're
identical, but then both the definition and implementation
serve as stand-alone descriptions of the calling sequence
and method properties.
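
A minimal sketch of the accepted form, using a hypothetical |greeter|
class invented for the example: the default value appears only in the
class definition, and a comment in the implementation is the nearest
one can get to documenting it there.

```cpp
#include <string>

class greeter {
public:
    // The default argument is stated here and nowhere else.
    std::string greet(const std::string &name, int repeat = 1) const;
};

// Repeating "= 1" in this definition would be rejected, so the
// default is recorded in a comment for whoever reads this later.
std::string greeter::greet(const std::string &name,
                           int repeat /* = 1 */) const
{
    std::string result;
    for (int i = 0; i < repeat; i++) {
        result += "Hello, " + name + "\n";
    }
    return result;
}
```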

Oh, {\it come on}, guys!  Now you're telling me I have to
do a |reinterpret_cast<char *>| to |istream::read| into
a bloody |unsigned char|!  You can imagine what this did
to |dictionaryWord::importFromBinaryFile|.  Unfortunately,
I not only had to imagine it, I had to do it.

\date{2003 February 16}

With \.{gcc} 2.96, when you include \.{math.h}, it doesn't
define |abs| for |double|, as it's supposed to do according to
section {\bf 22.3} of Stroustrup.  Consequently, I defined
my own |abs(double)| in the global context to get the job
done.  Well, on 3.2.2, the existence of this function creates
an overloading ambiguity against the built-in one, which has
now been added to \.{math.h}.  It turns out that if you include
\.{cmath} in 2.96, you {\it do} get |abs(double)|, although
that file and \.{math.h} are documented as being identical.
So, I replaced the include of \.{math.h} with \.{cmath} and
eliminated my private copy of |abs|.  Now it compiles on both
of 'em.
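
In sketch form, with a hypothetical |deviation| function standing in
for the code in question, the portable arrangement relies on the
|double| overload of |abs| supplied by \.{cmath} rather than a private
definition.  The |std::| qualification is how a current compiler
expects it to be written; whether 2.96 required it is not something
this example tests.

```cpp
#include <cmath>

// Absolute difference of two values, using the double overload of
// abs declared by <cmath> instead of a home-grown abs(double).
double deviation(double a, double b)
{
    return std::abs(a - b);
}
```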

They've gone and eliminated |fstream::attach(int fd)| from
the standard---just try and plumb a pipe into your input
or output stream the way you effortlessly used to!  As a
first cut attempt to detour past this off-ramp to oblivion,
I tried building with |HAVE_POPEN| undefined, and promptly
fell into a self-dug abyss: bad conditional declaration
of the file handle used to read compressed mail folders
and messages in |mailFolder|.  I fixed that, and for the
first time, we actually built and passed ``\.{make{ }check}''
under 3.2.2!  Just don't try it with compressed mail folders
quite yet$\ldots$\,.

Now, of course, we must deal with this.  I installed the
\.{fdstream.hpp} package developed by
\pdfURL{Nicolai M. Josuttis}{http://www.josuttis.com/}
in the source directory, extending it to permit declaration
of |fdistream| and |fdostream| objects with a default
file descriptor of zero, which can be specified later by
a new |attach| method, thus requiring fewer changes to existing
code which uses the |fstream::attach| mechanism.  There is
little or no error checking---you can screw things up mightily
by swapping file descriptors on the fly, but then you could
before with |fstream::attach|!

To test this class and dip my toe into the acid bath of
post-|fstream::attach| plumbing, I modified
|pdfTextExtractor| to use |fdistream| to read the pipe from
\.{pdftotext}, which is a simpler case than the tangle
associated with compressed file decoding.  This worked
the first time, meaning I should look over my shoulder when
migrating the |attach| references in the compressed file
code to the new mechanism.  Note that the existing code
has lots of {\it ad hoc} tweaks, all tagged with \.{OLDWAY},
to enable the currently-working code.  Before we're ready
to ship, all of the \.{OLDWAY} dust-bunnies should be cleaned
up and a clean build and regression test run on 2.96 and
3.2.2 parameterised exclusively by the \.{configure} script.

Added code to |mailFolder| to use a new |fdistream| to read the
pipe when decompressing mail folder files and compressed files
in mail directories.

In the \.{gcc} 3.2.2 library, closing and opening an
|ifstream| does not clear |ios::eofbit| in the descriptor
as it used to.  (I consider this a stone bug---when you close
one file and open another, only an idiot would consider the
end of file condition from the previous file still
asserted.)  In any case, I added a |clear()| of the
|ifstream| we use while traversing a directory in
|@<Advance to next file if traversing directory@>|
so this doesn't sabotage reading messages in a directory.
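
The pattern looks roughly like the following sketch, which reuses one
|ifstream| for a list of files.  The function |countLines| is
hypothetical and merely counts lines; the explicit |clear()| after
each |close()| is the part that corresponds to the fix.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Count the lines in a series of files, reusing a single ifstream.
std::size_t countLines(const std::vector<std::string> &paths)
{
    std::ifstream is;
    std::string line;
    std::size_t n = 0;
    for (std::size_t i = 0; i < paths.size(); i++) {
        is.open(paths[i].c_str());
        if (is) {
            while (std::getline(is, line)) {
                n++;
            }
        }
        is.close();
        // Reading to the end leaves eofbit (and failbit) set; clear
        // them so the next open starts from a clean state even on a
        // library which does not reset them itself.
        is.clear();
    }
    return n;
}
```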

Re-tested directory traversal, with and without compressed
files in the directory, on \.{gcc} 2.96 and 3.2.2 to verify
the modified code works on both.  It does.

\date{2003 February 18}

Added logic to \.{configure.in} to test whether the \CPP/ library
is compatible with the \.{fdstream.hpp} package.  If so, we use
it; otherwise we assume it's an old library which supports
the |attach| method for |fstream| I/O.  The \.{config.h.in}
variable |HAVE_FDSTREAM_COMPATIBILITY| will be defined if
\.{fdstream.hpp} is to be used.

Added a test to \.{configure.in} which determines whether the
\CPP/ library is compatible with the new \.{mystrstream\_new.h}.
If so, it's included.  Otherwise, the earlier \.{mystrstream.h}
is used as before.  If the new |strstream| package works,
|HAVE_NEW_STRSTREAM| is defined in \.{config.h.in}.

With these changes, the source configures and builds correctly
on \.{gcc} 2.96 and 3.2.2 without any tweaks or changes.

As suggested by Kern Sibbald, I changed the default
\.{--phraselimit} to 48 characters.

As reported by Jim Hamilton, some mail systems which store
individual messages as separate files in folder directories
do not prefix each message file with the ``\.{From\ }''
sentinel we were counting on to mark message boundaries.  This
resulted in bad message counts, affecting probability
computation and, worse, failure to reset decoder modes, etc.\ after
a malformed message.  I added a new |expectingNewMessage| flag,
which is set at the start of every new file |mailFolder| reads
(whether a composite mail folder or a file within a directory).
When |expectingNewMessage| is set, the first line of the file with
a nonblank character in the leftmost character position is
considered the start of a new message regardless of its
contents.

\date{2003 February 19}

Added the ability to parse a composite mail folder file using
either pure BSD (``\.{From\ }'' always denotes start of message
and is quoted in every other case) or ``consensus \UNIX/'' format,
where ``\.{From\ }'' only marks the start of a new message when
it appears after a blank line.  Sun ``\.{Content-Length:}''
folders are {\it not} supported, as they were a disastrously
poor idea---you can generally treat them as usual \UNIX/ folders.
By default, folders are parsed using \UNIX/ semantics.  A new
\.{--bsdfolder} option marks the following \.{--mail} or
\.{--junk} folder as following BSD rules.  Note that you
must specify \.{--bsdfolder} before {\it each} BSD-style
folder; it is not modal.  This is a change in default behaviour:
folders were previously parsed using BSD rules, while
\UNIX/ is now the default.
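
The difference between the two rules can be sketched as a single
predicate.  The function |startsNewMessage| and its arguments are
hypothetical, not the actual |mailFolder| code; the caller is assumed
to pass |previousLineBlank| as |true| for the first line of a file.

```cpp
#include <string>

// Decide whether a line marks the start of a new message.
//   bsdFolder true:  every line beginning "From " starts a message.
//   bsdFolder false: "From " counts only after a blank line (or at
//                    the start of the file, which the caller reports
//                    as previousLineBlank == true).
bool startsNewMessage(const std::string &line, bool previousLineBlank,
                      bool bsdFolder)
{
    if (line.compare(0, 5, "From ") != 0) {
        return false;
    }
    return bsdFolder || previousLineBlank;
}
```

A body line such as ``\.{From here on, things changed.}'' following
non-blank text therefore splits the message under BSD rules but not
under \UNIX/ rules.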

The very large |case| statement which processes command line
options ran afoul of \.{CWEAVE}'s maximum token per scrap
capacity limit.  I added a \.{cweb/cweave-bigger.ch} file to
increase the limit to 5000 tokens (from 2000), and modified
\.{cweb/Makefile} to apply the change file when building
\.{CWEAVE}.  I probably ought to break the option processing
|case| into one piece for each option, but as there's little
or nothing to be said about each one, that really wouldn't
improve the readability of the code.

\date{2003 February 20}

Completed the implementation of \.{--autoprune}.  This new option
permits you to specify a memory size, in bytes, at which a
dictionary to which words are being added with the \.{--mail}
or \.{--junk} options will automatically be pruned by
discarding all words which appear only once.  A new
|dictionaryWord::estimateMemoryRequirement| method estimates
the memory occupied by an in-memory word, and this is used
to compute the total dictionary size.  |dictionary::purge| has
been extended to accept an optional argument which, if nonzero,
causes the pruning of the dictionary to be based on the number
of occurrences of the word rather than our ability to compute
its probability.

If the user sets \.{--autoprune} too low, we can fall into a
thrashing situation when the non-unique words in the dictionary
exceed the pruning threshold.  To keep this from happening, whenever
the dictionary size after an automatic prune exceeds 90\% of
the \.{--autoprune} threshold, the threshold is increased by
25\%.
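
The adjustment rule amounts to the following arithmetic, shown here as
a hypothetical helper |adjustPruneThreshold| using integer division
(so the 90\% and 25\% figures are rounded down for thresholds which
are not multiples of 20); the actual code may compute them
differently.

```cpp
#include <cstddef>

// Return the threshold to use after an automatic prune.  If the
// dictionary still exceeds 90% of the threshold once pruned, raise
// the threshold by 25% so we do not prune again almost immediately.
std::size_t adjustPruneThreshold(std::size_t sizeAfterPrune,
                                 std::size_t threshold)
{
    if (sizeAfterPrune > threshold - threshold / 10) {
        threshold += threshold / 4;
    }
    return threshold;
}
```

For example, with a threshold of 1000 bytes, a dictionary which still
occupies 950 bytes after pruning raises the threshold to 1250, while
one which shrinks to 500 bytes leaves it unchanged.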

\date{2003 February 21}

Modified the \.{makew32.sh} script to build with \.{gcc} 3.{\it x}
rather than 2.{\it x}.  Note that this means the source should
be \.{./configure}d for a \.{gcc} 3.{\it x} build before
creating \.{winarch} to transport to the Cygwin machine.

When building on Cygwin with \.{gcc} 3, \.{getopt.h} managed
to get included twice for some reason.  I changed the
condition around our local copy to |__GETOPT_H__| to agree
with the symbol in the library include to prevent this from
happening.

Updated the \.{cygwin.dll} included in the Win32 executable
distribution to the January 24, 2003 version we're currently
using on Ovni.

Release 1.0.

\date{2003 June 24}

As reported by and fixed by Wolfgang Schnerring,
\.{utilities/splitmail.pl} had an assignment statement
in the \.{dispose\_of\_message} subroutine which was
missing the dollar sign before the variable name.  I
integrated his fix.  Thank you!

\date{2003 August 27}

A |pdfTextExtractor| was not restartable---once instantiated,
it could only be used once; calling |close| and then
re-initialising with the parent |applicationStringParser|
class |setMailFolder| left the extractor at end of file.
This required fixes both in |pdfTextExtractor|, where the
|close| method failed to reset |initialised| to |false|,
and in |applicationStringParser|, whose |close| method
did not reset the |eof| and |error| flags.

\date{2003 August 28}

Added a parser diagnostic to |mailFolder::nextLine| to
indicate when an |applicationStringParser| is closed.

The |close| method of |pdfTextExtractor| failed to close
the input stream it used to read the output from the pipe
connected to \.{pdftotext}, which caused (for some bizarre
reason) the raw binary PDF file to be returned, not the
decoded text.  I added the requisite |close| of the
stream.

When |pdfTextExtractor| was transcribing the decoded
attachment to the temporary file to be read by \.{pdftotext},
it checked for end of file but not error conditions.  I
modified it to use |isOK()| to govern the copy loop.

The |flashTextExtractor| and its parent |flashStream| were
not restartable because they did not propagate the
|close| up to the |applicationStringParser| from which
all are derived, and because |flashTextExtractor| did
not reset its own |initialised| and |textOnly| at end
of file.  Fixed.

Because the |flashStream| decoder usually terminates upon
seeing a \.{stagEnd} tag in the input stream, it failed to read
from the MIME decoder until end of file was encountered. This
caused an extraneous blank line to be inserted in the
transcript at the end of the MIME-encoded data and before the
part end sentinel. I added logic to |flashTextExtractor::nextString|
to call |get8()| until an end of file is reported before
returning the logical end of file for the flash stream.

The input stream |close| I added to |pdfTextExtractor::close|
ran afoul of the |fdistream| logic used to cope with \.{gcc} 3
which, helpfully, does not define a |close| method.  I made the
|close| conditional on |HAVE_FDSTREAM_COMPATIBILITY| not
being defined.

This time, our attempt to rebuild the Win32 version was torpedoed
by |getopt| in yet another innovative way.  This time, the
care we took to avoid including our own \.{getopt.h}
stabbed us in the back, because the library's version
(I still haven't figured out why it's
being included) doesn't define the long version of
|getopt|, and wants a different symbol to do so than
our include file.  So, I added |WIN32| conditional code
before the include of our version to force it to be
included and define the long option version of |getopt|.
This GCC/Cygwin ``compatibility'' is turning out to be a running
bad joke.

Release 1.0a.

\date{2003 September 23}

A file whose name contained the string ``\.{.gz}'' (or
whatever other compressed file extension was configured)
would be fed through the decompressor even if the sequence
was embedded in the middle of the file name.  I modified
the tests to deem a file compressed only if the
|Compressed_file_type| string appears at the end of the
file name.  This applies both to files named directly on
the command line and files within directories.
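
A minimal sketch of the corrected test, with |hasCompressedSuffix| as
a hypothetical helper and the suffix passed as an argument rather than
taken from |Compressed_file_type|:

```cpp
#include <string>

// True only when fileName ends with suffix (for example ".gz").
// A name which merely contains the suffix somewhere in the middle
// is not considered compressed.
bool hasCompressedSuffix(const std::string &fileName,
                         const std::string &suffix)
{
    return !suffix.empty() &&
           fileName.size() >= suffix.size() &&
           fileName.compare(fileName.size() - suffix.size(),
                            suffix.size(), suffix) == 0;
}
```

With this test \.{inbox.gz} is decompressed, while \.{inbox.gz.old}
is read as an ordinary file.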

A PDF file which has been marked by its creator as view-only
will not be processed by \.{pdftotext}---no output is generated
and the message ``{\tt Error: Copying of text from this
document is not allowed.}'' is sent to standard output. 
There's nothing we can do about this, absent making a version
of \.{pdftotext} which bypasses the PDF file security
mechanisms.  While there's something to be said for this,
it's well beyond the mandate of \PRODUCT .

An assertion added to |flashStream::ignoreTag| in the process of
debugging problems due to multiple flash attachments could fail
when \.{--bsdfolder} mode was used to scan a mail or junk
folder.  I commented out the assertion.

\date{2003 September 24}

Phil Karn (KA9Q) reported that on the latest Debian distribution,
compilations failed due to a missing definition of |assert|.
As far as I can determine, \.{assert.h} was pulled in by other
includes in earlier libraries, but now must be included explicitly.
I added the requisite includes to \.{annoyance-filter.w} and
\.{statlib.w}.

Release 1.0b.

%%%%%%%%%    Add new entries before this line    %%%%%%%%%
\parskip=0pt plus1pt
\parindent=20pt
