libs/math/doc/sf_and_dist/distributions/hypergeometric.qbk - platform/external/boost - Git at Google

 [section:hypergeometric_dist Hypergeometric Distribution]

 ``#include <boost/math/distributions/hypergeometric.hpp>``

    namespace boost{ namespace math{

    template <class RealType = double,
              class ``__Policy``   = ``__policy_class`` >
    class hypergeometric_distribution;

    template <class RealType, class Policy>
    class hypergeometric_distribution
    {
    public:
       typedef RealType value_type;
       typedef Policy   policy_type;
       // Construct:
       hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
       // Accessors:
       unsigned total()const;
       unsigned defective()const;
       unsigned sample_count()const;
    };

    typedef hypergeometric_distribution<> hypergeometric;

    }} // namespaces

 The hypergeometric distribution describes the number of "events" /k/
 from a sample /n/ drawn from a total population /N/ ['without replacement].

 Imagine we have a sample of /N/ objects of which /r/ are "defective"
 and N-r are "not defective"
 (the terms "success\/failure" or "red\/blue" are also used).  If we sample /n/
 items /without replacement/ then what is the probability that exactly
 /k/ items in the sample are defective?  The answer is given by the pdf of the
 hypergeometric distribution `f(k; r, n, N)`, whilst the probability of
 /k/ defectives or fewer is given by F(k; r, n, N), where F(k) is the
 CDF of the hypergeometric distribution.

 [note Unlike almost all of the other distributions in this library,
 the hypergeometric distribution is strictly discrete: it can not be
 extended to real valued arguments of its parameters or random variable.]

 The following graph shows how the distribution changes as the proportion
 of "defective" items changes, while keeping the population and sample sizes
 constant:

 [graph hypergeometric_pdf_1]

 Note that since the distribution is symmetrical in parameters /n/ and /r/, if we
 change the sample size and keep the population and proportion "defective" the same
 then we obtain basically the same graphs:

 [graph hypergeometric_pdf_2]

 [h4 Member Functions]

    hypergeometric_distribution(unsigned r, unsigned n, unsigned N);

 Constructs a hypergeometric distribution with with a population of /N/ objects,
 of which /r/ are defective, and from which /n/ are sampled.

    unsigned total()const;

 Returns the total number of objects /N/.

    unsigned defective()const;

 Returns the number of objects /r/ in population /N/ which are defective.

    unsigned sample_count()const;

 Returns the number of objects /n/ which are sampled from the population /N/.

 [h4 Non-member Accessors]

 All the [link math_toolkit.dist.dist_ref.nmp usual non-member accessor functions]
 that are generic to all distributions are supported: __usual_accessors.

 The domain of the random variable is the unsigned integers in the range
 \[max(0, n + r - N), min(n, r)\].  A __domain_error is raised if the
 random variable is outside this range, or is not an integral value.

 [caution
 The quantile function will by default return an integer result that has been
 /rounded outwards/.  That is to say lower quantiles (where the probability is
 less than 0.5) are rounded downward, and upper quantiles (where the probability
 is greater than 0.5) are rounded upwards.  This behaviour
 ensures that if an X% quantile is requested, then /at least/ the requested
 coverage will be present in the central region, and /no more than/
 the requested coverage will be present in the tails.

 This behaviour can be changed so that the quantile functions are rounded
 differently using
 [link math_toolkit.policy.pol_overview Policies].  It is strongly
 recommended that you read the tutorial
 [link math_toolkit.policy.pol_tutorial.understand_dis_quant
 Understanding Quantiles of Discrete Distributions] before
 using the quantile function on the Hypergeometric distribution.  The
 [link math_toolkit.policy.pol_ref.discrete_quant_ref reference docs]
 describe how to change the rounding policy
 for these distributions.

 However, note that the implementation method of the quantile function
 always returns an integral value, therefore attempting to use a __Policy
 that requires (or produces) a real valued result will result in a
 compile time error.
 ] [/ caution]


 [h4 Accuracy]

 For small N such that
 `N < boost::math::max_factorial<RealType>::value` then table based
 lookup of the results gives an accuracy to a few epsilon.
 `boost::math::max_factorial<RealType>::value` is 170 at double or long double
 precision.

 For larger N such that `N < boost::math::prime(boost::math::max_prime)`
 then only basic arithmetic is required for the calculation
 and the accuracy is typically < 20 epsilon.  This takes care of N
 up to 104729.

 For `N > boost::math::prime(boost::math::max_prime)` then accuracy quickly
 degrades, with 5 or 6 decimal digits being lost for N = 110000.

 In general for very large N, the user should expect to loose log[sub 10]N
 decimal digits of precision during the calculation, with the results
 becoming meaningless for N >= 10[super 15].

 [h4 Testing]

 There are three sets of tests: our implementation is tested against a table of values
 produced by Mathematica's implementation of this distribution. We also sanity check
 our implementation against some spot values computed using the online calculator
 here [@http://stattrek.com/Tables/Hypergeometric.aspx http://stattrek.com/Tables/Hypergeometric.aspx].
 Finally we test accuracy against some high precision test data using
 this implementation and NTL::RR.

 [h4 Implementation]

 The PDF can be calculated directly using the formula:

 [equation hypergeometric1]

 However, this can only be used directly when the largest of the factorials
 is guaranteed not to overflow the floating point representation used.
 This formula is used directly when `N < max_factorial<RealType>::value`
 in which case table lookup of the factorials gives a rapid and accurate
 implementation method.

 For larger /N/ the method described in
 "An Accurate Computation of the Hypergeometric Distribution Function",
 Trong Wu, ACM Transactions on Mathematical Software, Vol. 19, No. 1,
 March 1993, Pages 33-43 is used.  The method relies on the fact that
 there is an easy method for factorising a factorial into the product
 of prime numbers:

 [equation hypergeometric2]

 Where p[sub i] is the i'th prime number, and e[sub i] is a small
 positive integer or zero, which can be calculated via:

 [equation hypergeometric3]

 Further we can combine the factorials in the expression for the PDF
 to yield the PDF directly as the product of prime numbers:

 [equation hypergeometric4]

 With this time the exponents e[sub i] being either positive, negative
 or zero.  Indeed such a degree of cancellation occurs in the calculation
 of the e[sub i] that many are zero, and typically most have a magnitude
 or no more than 1 or 2.

 Calculation of the product of the primes requires some care to prevent
 numerical overflow, we use a novel recursive method which splits the
 calculation into a series of sub-products, with a new sub-product
 started each time the next multiplication would cause either overflow
 or underflow.  The sub-products are stored in a linked list on the
 program stack, and combined in an order that will guarantee no overflow
 or unnecessary-underflow once the last sub-product has been calculated.

 This method can be used as long as N is smaller than the largest prime
 number we have stored in our table of primes (currently 104729).  The method
 is relatively slow (calculating the exponents requires the most time), but
 requires only a small number of arithmetic operations to
 calculate the result (indeed there is no shorter method involving only basic
 arithmetic once the exponents have been found), the method is therefore
 much more accurate than the alternatives.

 For much larger N, we can calculate the PDF from the factorials using
 either lgamma, or by directly combining lanczos approximations to avoid
 calculating via logarithms.  We use the latter method, as it is usually
 1 or 2 decimal digits more accurate than computing via logarithms with
 lgamma.  However, in this area where N > 104729, the user should expect
 to loose around log[sub 10]N decimal digits during the calculation in
 the worst case.

 The CDF and its complement is calculated by directly summing the PDF's.
 We start by deciding whether the CDF, or its complement, is likely to be
 the smaller of the two and then calculate the PDF at /k/ (or /k+1/ if we're
 calculating the complement) and calculate successive PDF values via the
 recurrence relations:

 [equation hypergeometric5]

 Until we either reach the end of the distributions domain, or the next
 PDF value to be summed would be too small to affect the result.

 The quantile is calculated in a similar manner to the CDF: we first guess
 which end of the distribution we're nearer to, and then sum PDFs starting
 from the end of the distribution this time, until we have some value /k/ that
 gives the required CDF.

 The median is simply the quantile at 0.5, and the remaining properties are
 calculated via:

 [equation hypergeometric6]

 [endsect]

 [/ hypergeometric.qbk
   Copyright 2008 John Maddock.
   Distributed under the Boost Software License, Version 1.0.
   (See accompanying file LICENSE_1_0.txt or copy at
   http://www.boost.org/LICENSE_1_0.txt).
 ]
	[section:hypergeometric_dist Hypergeometric Distribution]

	``#include <boost/math/distributions/hypergeometric.hpp>``

	namespace boost{ namespace math{

	template <class RealType = double,
	class ``__Policy`` = ``__policy_class`` >
	class hypergeometric_distribution;

	template <class RealType, class Policy>
	class hypergeometric_distribution
	{
	public:
	typedef RealType value_type;
	typedef Policy policy_type;
	// Construct:
	hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
	// Accessors:
	unsigned total()const;
	unsigned defective()const;
	unsigned sample_count()const;
	};

	typedef hypergeometric_distribution<> hypergeometric;

	}} // namespaces

	The hypergeometric distribution describes the number of "events" /k/
	from a sample /n/ drawn from a total population /N/ ['without replacement].

	Imagine we have a sample of /N/ objects of which /r/ are "defective"
	and N-r are "not defective"
	(the terms "success\/failure" or "red\/blue" are also used). If we sample /n/
	items /without replacement/ then what is the probability that exactly
	/k/ items in the sample are defective? The answer is given by the pdf of the
	hypergeometric distribution `f(k; r, n, N)`, whilst the probability of
	/k/ defectives or fewer is given by F(k; r, n, N), where F(k) is the
	CDF of the hypergeometric distribution.

	[note Unlike almost all of the other distributions in this library,
	the hypergeometric distribution is strictly discrete: it can not be
	extended to real valued arguments of its parameters or random variable.]

	The following graph shows how the distribution changes as the proportion
	of "defective" items changes, while keeping the population and sample sizes
	constant:

	[graph hypergeometric_pdf_1]

	Note that since the distribution is symmetrical in parameters /n/ and /r/, if we
	change the sample size and keep the population and proportion "defective" the same
	then we obtain basically the same graphs:

	[graph hypergeometric_pdf_2]

	[h4 Member Functions]

	hypergeometric_distribution(unsigned r, unsigned n, unsigned N);

	Constructs a hypergeometric distribution with with a population of /N/ objects,
	of which /r/ are defective, and from which /n/ are sampled.

	unsigned total()const;

	Returns the total number of objects /N/.

	unsigned defective()const;

	Returns the number of objects /r/ in population /N/ which are defective.

	unsigned sample_count()const;

	Returns the number of objects /n/ which are sampled from the population /N/.

	[h4 Non-member Accessors]

	All the [link math_toolkit.dist.dist_ref.nmp usual non-member accessor functions]
	that are generic to all distributions are supported: __usual_accessors.

	The domain of the random variable is the unsigned integers in the range
	\[max(0, n + r - N), min(n, r)\]. A __domain_error is raised if the
	random variable is outside this range, or is not an integral value.

	[caution
	The quantile function will by default return an integer result that has been
	/rounded outwards/. That is to say lower quantiles (where the probability is
	less than 0.5) are rounded downward, and upper quantiles (where the probability
	is greater than 0.5) are rounded upwards. This behaviour
	ensures that if an X% quantile is requested, then /at least/ the requested
	coverage will be present in the central region, and /no more than/
	the requested coverage will be present in the tails.

	This behaviour can be changed so that the quantile functions are rounded
	differently using
	[link math_toolkit.policy.pol_overview Policies]. It is strongly
	recommended that you read the tutorial
	[link math_toolkit.policy.pol_tutorial.understand_dis_quant
	Understanding Quantiles of Discrete Distributions] before
	using the quantile function on the Hypergeometric distribution. The
	[link math_toolkit.policy.pol_ref.discrete_quant_ref reference docs]
	describe how to change the rounding policy
	for these distributions.

	However, note that the implementation method of the quantile function
	always returns an integral value, therefore attempting to use a __Policy
	that requires (or produces) a real valued result will result in a
	compile time error.
	] [/ caution]


	[h4 Accuracy]

	For small N such that
	`N < boost::math::max_factorial<RealType>::value` then table based
	lookup of the results gives an accuracy to a few epsilon.
	`boost::math::max_factorial<RealType>::value` is 170 at double or long double
	precision.

	For larger N such that `N < boost::math::prime(boost::math::max_prime)`
	then only basic arithmetic is required for the calculation
	and the accuracy is typically < 20 epsilon. This takes care of N
	up to 104729.

	For `N > boost::math::prime(boost::math::max_prime)` then accuracy quickly
	degrades, with 5 or 6 decimal digits being lost for N = 110000.

	In general for very large N, the user should expect to loose log[sub 10]N
	decimal digits of precision during the calculation, with the results
	becoming meaningless for N >= 10[super 15].

	[h4 Testing]

	There are three sets of tests: our implementation is tested against a table of values
	produced by Mathematica's implementation of this distribution. We also sanity check
	our implementation against some spot values computed using the online calculator
	here [@http://stattrek.com/Tables/Hypergeometric.aspx http://stattrek.com/Tables/Hypergeometric.aspx].
	Finally we test accuracy against some high precision test data using
	this implementation and NTL::RR.

	[h4 Implementation]

	The PDF can be calculated directly using the formula:

	[equation hypergeometric1]

	However, this can only be used directly when the largest of the factorials
	is guaranteed not to overflow the floating point representation used.
	This formula is used directly when `N < max_factorial<RealType>::value`
	in which case table lookup of the factorials gives a rapid and accurate
	implementation method.

	For larger /N/ the method described in
	"An Accurate Computation of the Hypergeometric Distribution Function",
	Trong Wu, ACM Transactions on Mathematical Software, Vol. 19, No. 1,
	March 1993, Pages 33-43 is used. The method relies on the fact that
	there is an easy method for factorising a factorial into the product
	of prime numbers:

	[equation hypergeometric2]

	Where p[sub i] is the i'th prime number, and e[sub i] is a small
	positive integer or zero, which can be calculated via:

	[equation hypergeometric3]

	Further we can combine the factorials in the expression for the PDF
	to yield the PDF directly as the product of prime numbers:

	[equation hypergeometric4]

	With this time the exponents e[sub i] being either positive, negative
	or zero. Indeed such a degree of cancellation occurs in the calculation
	of the e[sub i] that many are zero, and typically most have a magnitude
	or no more than 1 or 2.

	Calculation of the product of the primes requires some care to prevent
	numerical overflow, we use a novel recursive method which splits the
	calculation into a series of sub-products, with a new sub-product
	started each time the next multiplication would cause either overflow
	or underflow. The sub-products are stored in a linked list on the
	program stack, and combined in an order that will guarantee no overflow
	or unnecessary-underflow once the last sub-product has been calculated.

	This method can be used as long as N is smaller than the largest prime
	number we have stored in our table of primes (currently 104729). The method
	is relatively slow (calculating the exponents requires the most time), but
	requires only a small number of arithmetic operations to
	calculate the result (indeed there is no shorter method involving only basic
	arithmetic once the exponents have been found), the method is therefore
	much more accurate than the alternatives.

	For much larger N, we can calculate the PDF from the factorials using
	either lgamma, or by directly combining lanczos approximations to avoid
	calculating via logarithms. We use the latter method, as it is usually
	1 or 2 decimal digits more accurate than computing via logarithms with
	lgamma. However, in this area where N > 104729, the user should expect
	to loose around log[sub 10]N decimal digits during the calculation in
	the worst case.

	The CDF and its complement is calculated by directly summing the PDF's.
	We start by deciding whether the CDF, or its complement, is likely to be
	the smaller of the two and then calculate the PDF at /k/ (or /k+1/ if we're
	calculating the complement) and calculate successive PDF values via the
	recurrence relations:

	[equation hypergeometric5]

	Until we either reach the end of the distributions domain, or the next
	PDF value to be summed would be too small to affect the result.

	The quantile is calculated in a similar manner to the CDF: we first guess
	which end of the distribution we're nearer to, and then sum PDFs starting
	from the end of the distribution this time, until we have some value /k/ that
	gives the required CDF.

	The median is simply the quantile at 0.5, and the remaining properties are
	calculated via:

	[equation hypergeometric6]

	[endsect]

	[/ hypergeometric.qbk
	Copyright 2008 John Maddock.
	Distributed under the Boost Software License, Version 1.0.
	(See accompanying file LICENSE_1_0.txt or copy at
	http://www.boost.org/LICENSE_1_0.txt).
	]