231 lines
9.4 KiB
Groff
231 lines
9.4 KiB
Groff
|
.TH ENT "1" "July 2007" "ent" "http://www.fourmilab.ch/random/"
|
||
|
.SH NAME
|
||
|
\fBent\fR \- pseudorandom number sequence test
|
||
|
.PP
|
||
|
This page describes a program, \fBent\fR, which applies various tests to
|
||
|
sequences of bytes stored in files and reports the results of those tests.
|
||
|
The program is useful for those evaluating pseudorandom number generators
|
||
|
for encryption and statistical sampling applications, compression
|
||
|
algorithms, and other applications where the information density of a file
|
||
|
is of interest.
|
||
|
.SH SYNOPSIS
|
||
|
\fBent\fR [ \-bcftu ] [ \fIinfile\fR ]
|
||
|
.SH DESCRIPTION
|
||
|
\fBent\fR performs a variety of tests on the stream of bytes in \fIinfile\fR (or
|
||
|
standard input if no \fIinfile\fR is specified) and produces output as follows
|
||
|
on the standard output stream:
|
||
|
.PP
|
||
|
.nf
|
||
|
Entropy = 7.980627 bits per character.
|
||
|
|
||
|
Optimum compression would reduce the size
|
||
|
of this 51768 character file by 0 percent.
|
||
|
|
||
|
Chi square distribution for 51768 samples is 1542.26, and randomly
|
||
|
would exceed this value 0.01 percent of the times.
|
||
|
|
||
|
Arithmetic mean value of data bytes is 125.93 (127.5 = random).
|
||
|
Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
|
||
|
Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).
|
||
|
.fi
|
||
|
.PP
|
||
|
The values calculated are as follows:
|
||
|
.PP
|
||
|
Entropy
|
||
|
.PP
|
||
|
The information density of the contents of the file, expressed as
|
||
|
a number of bits per character. The results above, which resulted
|
||
|
from processing an image file compressed with JPEG, indicate that
|
||
|
the file is extremely dense in information -- essentially random.
|
||
|
Hence, compression of the file is unlikely to reduce its size. By
|
||
|
contrast, the C source code of the program has entropy of about
|
||
|
4.9 bits per character, indicating that optimal compression of the
|
||
|
file would reduce its size by 38%. \fB[Hamming, pp. 104-108]\fR
|
||
|
.PP
|
||
|
Chi-square Test
|
||
|
.PP
|
||
|
The chi-square test is the most commonly used test for the
|
||
|
randomness of data, and is extremely sensitive to errors in
|
||
|
pseudorandom sequence generators. The chi-square distribution is
|
||
|
calculated for the stream of bytes in the file and expressed as an
|
||
|
absolute number and a percentage which indicates how frequently a
|
||
|
truly random sequence would exceed the value calculated. We
|
||
|
interpret the percentage as the degree to which the sequence
|
||
|
tested is suspected of being non-random. If the percentage is
|
||
|
greater than 99% or less than 1%, the sequence is almost certainly
|
||
|
not random. If the percentage is between 99% and 95% or between 1%
|
||
|
and 5%, the sequence is suspect. Percentages between 90% and 95%
|
||
|
and 5% and 10% indicate the sequence is "almost suspect". Note
|
||
|
that our JPEG file, while very dense in information, is far from
|
||
|
random as revealed by the chi-square test.
|
||
|
.PP
|
||
|
Applying this test to the output of various pseudorandom sequence
|
||
|
generators is interesting. The low-order 8 bits returned by the
|
||
|
standard Unix rand() function, for example, yields:
|
||
|
.PP
|
||
|
.nf
|
||
|
Chi square distribution for 500000 samples is 0.01, and randomly
|
||
|
would exceed this value 99.99 percent of the times.
|
||
|
.fi
|
||
|
.PP
|
||
|
While an improved generator \fB[Park & Miller]\fR reports:
|
||
|
.PP
|
||
|
.nf
|
||
|
Chi square distribution for 500000 samples is 212.53, and
|
||
|
randomly would exceed this value 95.00 percent of the times.
|
||
|
.fi
|
||
|
.PP
|
||
|
Thus, the standard Unix generator (or at least the low-order bytes
|
||
|
it returns) is unacceptably non-random, while the improved
|
||
|
generator is much better but still sufficiently non-random to
|
||
|
cause concern for demanding applications. Contrast both of these
|
||
|
software generators with the chi-square result of a genuine random
|
||
|
sequence created by timing radioactive decay events.
|
||
|
.PP
|
||
|
.nf
|
||
|
Chi square distribution for 32768 samples is 237.05, and
|
||
|
randomly would exceed this value 75.00 percent of the times.
|
||
|
.fi
|
||
|
.PP
|
||
|
See \fB[Knuth, pp. 35-40]\fR for more information on the chi-square
|
||
|
test. An interactive chi-square calculator is available at this
|
||
|
site.
|
||
|
.PP
|
||
|
Arithmetic Mean
|
||
|
.PP
|
||
|
This is simply the result of summing the all the bytes (bits if
|
||
|
the \fB\-b\fR option is specified) in the file and dividing by the file
|
||
|
length. If the data are close to random, this should be about
|
||
|
127.5 (0.5 for \fB\-b\fR option output). If the mean departs from this
|
||
|
value, the values are consistently high or low.
|
||
|
.PP
|
||
|
Monte Carlo Value for Pi
|
||
|
.PP
|
||
|
Each successive sequence of six bytes is used as 24 bit X and Y
|
||
|
co-ordinates within a square. If the distance of the
|
||
|
randomly-generated point is less than the radius of a circle
|
||
|
inscribed within the square, the six-byte sequence is considered a
|
||
|
"hit". The percentage of hits can be used to calculate the value
|
||
|
of Pi. For very large streams (this approximation converges very
|
||
|
slowly), the value will approach the correct value of Pi if the
|
||
|
sequence is close to random. A 32768 byte file created by
|
||
|
radioactive decay yielded:
|
||
|
.PP
|
||
|
.nf
|
||
|
Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).
|
||
|
.fi
|
||
|
.PP
|
||
|
Serial Correlation Coefficient
|
||
|
.PP
|
||
|
This quantity measures the extent to which each byte in the file
|
||
|
depends upon the previous byte. For random sequences, this value
|
||
|
(which can be positive or negative) will, of course, be close to
|
||
|
zero. A non-random byte stream such as a C program will yield a
|
||
|
serial correlation coefficient on the order of 0.5. Wildly
|
||
|
predictable data such as uncompressed bitmaps will exhibit serial
|
||
|
correlation coefficients approaching 1. See \fB[Knuth, pp. 64-65]\fR for
|
||
|
more details.
|
||
|
.SH OPTIONS
|
||
|
.IP \fB\-b\fR
|
||
|
The input is treated as a stream of bits rather than of 8-bit
|
||
|
bytes. Statistics reported reflect the properties of the
|
||
|
bitstream.
|
||
|
.IP \fB\-c\fR
|
||
|
Print a table of the number of occurrences of each possible byte
|
||
|
(or bit, if the \fB\-b\fR option is also specified) value, and the
|
||
|
fraction of the overall file made up by that value. Printable
|
||
|
characters in the ISO 8859-1 Latin1 character set are shown along
|
||
|
with their decimal byte values. In non-terse output mode, values
|
||
|
with zero occurrences are not printed.
|
||
|
.IP \fB\-f\fR
|
||
|
Fold upper case letters to lower case before computing statistics.
|
||
|
Folding is done based on the ISO 8859-1 Latin1 character set, with
|
||
|
accented letters correctly processed.
|
||
|
.IP \fB\-t\fR
|
||
|
Terse mode: output is written in Comma Separated Value (CSV)
|
||
|
format, suitable for loading into a spreadsheet and easily read by
|
||
|
any programming language. See Terse Mode Output Format below for
|
||
|
additional details.
|
||
|
.IP \fB\-u\fR
|
||
|
Print how-to-call information.
|
||
|
.SH FILES
|
||
|
If no \fIinfile\fR is specified, \fBent\fR obtains its input from standard input.
|
||
|
Output is always written to standard output.
|
||
|
.SH TERSE MODE OUTPUT FORMAT
|
||
|
Terse mode is selected by specifying the \fB\-t\fR option on the command line.
|
||
|
Terse mode output is written in Comma Separated Value (CSV) format, which
|
||
|
can be directly loaded into most spreadsheet programs and is easily read
|
||
|
by any programming language. Each record in the CSV file begins with a
|
||
|
record type field, which identifies the content of the following fields.
|
||
|
If the \fB\-c\fR option is not specified, the terse mode output will consist of
|
||
|
two records, as follows:
|
||
|
.PP
|
||
|
.nf
|
||
|
0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation
|
||
|
1,file_length,entropy,chi_square,mean,Pi_value,correlation
|
||
|
.fi
|
||
|
.PP
|
||
|
where the italicised values in the type 1 record are the numerical values
|
||
|
for the quantities named in the type 0 column title record. If the \fB\-b\fR
|
||
|
option is specified, the second field of the type 0 record will be
|
||
|
"File-bits", and the file_length field in type 1 record will be given in
|
||
|
bits instead of bytes. If the \fB\-c\fR option is specified, additional records
|
||
|
are appended to the terse mode output which contain the character counts:
|
||
|
.PP
|
||
|
.nf
|
||
|
2,Value,Occurrences,Fraction
|
||
|
3,v,count,fraction
|
||
|
. . .
|
||
|
.fi
|
||
|
.PP
|
||
|
If the \fB\-b\fR option is specified, only two type 3 records will appear for the
|
||
|
two bit values v=0 and v=1. Otherwise, 256 type 3 records are included,
|
||
|
one for each possible byte value. The second field of a type 3 record
|
||
|
indicates how many bytes (or bits) of value v appear in the input, and
|
||
|
fraction gives the decimal fraction of the file which has value v (which
|
||
|
is equal to the count value of this record divided by the file_length
|
||
|
field in the type 1 record).
|
||
|
.SH BUGS
|
||
|
Note that the "optimal compression" shown for the file is computed from
|
||
|
the byte- or bit-stream entropy and thus reflects compressibility based on
|
||
|
a reading frame of the chosen width (8-bit bytes or individual bits if the
|
||
|
\fB\-b\fR option is specified). Algorithms which use a larger reading frame, such
|
||
|
as the Lempel-Ziv \fB[Lempel & Ziv]\fR algorithm, may achieve greater
|
||
|
compression if the file contains repeated sequences of multiple bytes.
|
||
|
.SH SEE ALSO
|
||
|
\fIIntroduction to Probability and Statistics\fR
|
||
|
.br
|
||
|
http://www.fourmilab.ch/rpkp/experiments/statistics.html
|
||
|
.PP
|
||
|
\fB[Hamming]\fR
|
||
|
.br
|
||
|
Hamming, Richard W. \fICoding and Information Theory.\fR Englewood
|
||
|
Cliffs NJ: Prentice-Hall, 1980.
|
||
|
.PP
|
||
|
\fB[Knuth]\fR
|
||
|
.br
|
||
|
Knuth, Donald E. \fIThe Art of Computer Programming, Volume 2 /
|
||
|
Seminumerical Algorithms\fR. Reading MA: Addison-Wesley, 1969. ISBN
|
||
|
0-201-89684-2.
|
||
|
.PP
|
||
|
\fB[Lempel & Ziv]\fR
|
||
|
.br
|
||
|
Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data
|
||
|
Compression". \fIIEEE Transactions on Information Theory\fR \fB23\fR, 3,
|
||
|
pp. 337-343.
|
||
|
.PP
|
||
|
\fB[Park & Miller]\fR
|
||
|
.br
|
||
|
Park, Stephen K. and Keith W. Miller. "Random Number Generators:
|
||
|
Good Ones Are Hard to Find". \fICommunications of the ACM\fR, October
|
||
|
1988, p. 1192.
|
||
|
.SH COPYING
|
||
|
This software is in the public domain. Permission to use, copy, modify,
|
||
|
and distribute this software and its documentation for any purpose and
|
||
|
without fee is hereby granted, without any conditions or restrictions.
|
||
|
This software is provided "as is" without express or implied warranty.
|
||
|
.SH AUTHOR
|
||
|
John Walker
|
||
|
.br
|
||
|
October 20th, 1998
|