Back to home page

EIC code displayed by LXR

 
 

    


File indexing completed on 2025-01-18 09:39:41

0001 // Kolmogorov-Smirnov 1st order asymptotic distribution
0002 // Copyright Evan Miller 2020
0003 //
0004 // Use, modification and distribution are subject to the
0005 // Boost Software License, Version 1.0. (See accompanying file
0006 // LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
0007 //
0008 // The Kolmogorov-Smirnov test in statistics compares two empirical distributions,
0009 // or an empirical distribution against any theoretical distribution. It makes
0010 // use of a specific distribution which doesn't have a formal name, but which
0011 // is often called the Kolmogorv-Smirnov distribution for lack of anything
0012 // better. This file implements the limiting form of this distribution, first
0013 // identified by Andrey Kolmogorov in
0014 //
0015 // Kolmogorov, A. (1933) "Sulla Determinazione Empirica di una Legge di
0016 // Distribuzione." Giornale dell' Istituto Italiano degli Attuari
0017 //
0018 // This limiting form of the CDF is a first-order Taylor expansion that is
0019 // easily implemented by the fourth Jacobi Theta function (setting z=0). The
0020 // PDF is then implemented here as a derivative of the Theta function. Note
0021 // that this derivative is with respect to x, which enters into \tau, and not
0022 // with respect to the z argument, which is always zero, and so the derivative
0023 // identities in DLMF 20.4 do not apply here.
0024 //
0025 // A higher order order expansion is possible, and was first outlined by
0026 //
0027 // Pelz W, Good IJ (1976). "Approximating the Lower Tail-Areas of the
0028 // Kolmogorov-Smirnov One-sample Statistic." Journal of the Royal Statistical
0029 // Society B.
0030 //
0031 // The terms in this expansion get fairly complicated, and as far as I know the
0032 // Pelz-Good expansion is not used in any statistics software. Someone could
0033 // consider updating this implementation to use the Pelz-Good expansion in the
0034 // future, but the math gets considerably hairier with each additional term.
0035 //
0036 // A formula for an exact version of the Kolmogorov-Smirnov test is laid out in
0037 // Equation 2.4.4 of
0038 //
0039 // Durbin J (1973). "Distribution Theory for Tests Based on the Sample
0040 // Distribution Func- tion." In SIAM CBMS-NSF Regional Conference Series in
0041 // Applied Mathematics. SIAM, Philadelphia, PA.
0042 //
0043 // which is available in book form from Amazon and others. This exact version
0044 // involves taking powers of large matrices. To do that right you need to
0045 // compute eigenvalues and eigenvectors, which are beyond the scope of Boost.
0046 // (Some recent work indicates the exact form can also be computed via FFT, see
0047 // https://cran.r-project.org/web/packages/KSgeneral/KSgeneral.pdf).
0048 //
0049 // Even if the CDF of the exact distribution could be computed using Boost
0050 // libraries (which would be cumbersome), the PDF would present another
0051 // difficulty. Therefore I am limiting this implementation to the asymptotic
0052 // form, even though the exact form has trivial values for certain specific
0053 // values of x and n. For more on trivial values see
0054 //
0055 // Ruben H, Gambino J (1982). "The Exact Distribution of Kolmogorov's Statistic
0056 // Dn for n <= 10." Annals of the Institute of Statistical Mathematics.
0057 // 
0058 // For a good bibliography and overview of the various algorithms, including
0059 // both exact and asymptotic forms, see
0060 // https://www.jstatsoft.org/article/view/v039i11
0061 //
0062 // As for this implementation: the distribution is parameterized by n (number
0063 // of observations) in the spirit of chi-squared's degrees of freedom. It then
0064 // takes a single argument x. In terms of the Kolmogorov-Smirnov statistical
0065 // test, x represents the distribution of D_n, where D_n is the maximum
0066 // difference between the CDFs being compared, that is,
0067 //
0068 //   D_n = sup|F_n(x) - G(x)|
0069 //
0070 // In the exact distribution, x is confined to the support [0, 1], but in this
0071 // limiting approximation, we allow x to exceed unity (similar to how a normal
0072 // approximation always spills over any boundaries).
0073 //
0074 // As mentioned previously, the CDF is implemented using the \tau
0075 // parameterization of the fourth Jacobi Theta function as
0076 //
0077 // CDF=theta_4(0|2*x*x*n/pi)
0078 //
0079 // The PDF is a hand-coded derivative of that function. Actually, there are two
0080 // (independent) derivatives, as separate code paths are used for "small x"
0081 // (2*x*x*n < pi) and "large x", mirroring the separate code paths in the
0082 // Jacobi Theta implementation to achieve fast convergence. Quantiles are
0083 // computed using a Newton-Raphson iteration from an initial guess that I
0084 // arrived at by trial and error.
0085 //
0086 // The mean and variance are implemented using simple closed-form expressions.
0087 // Skewness and kurtosis use slightly more complicated closed-form expressions
0088 // that involve the zeta function. The mode is calculated at run-time by
0089 // maximizing the PDF. If you have an analytical solution for the mode, feel
0090 // free to plop it in.
0091 //
0092 // The CDF and PDF could almost certainly be re-implemented and sped up using a
0093 // polynomial or rational approximation, since the only meaningful argument is
0094 // x * sqrt(n). But that is left as an exercise for the next maintainer.
0095 //
0096 // In the future, the Pelz-Good approximation could be added. I suggest adding
0097 // a second parameter representing the order, e.g.
0098 //
0099 // kolmogorov_smirnov_dist<>(100) // N=100, order=1
0100 // kolmogorov_smirnov_dist<>(100, 1) // N=100, order=1, i.e. Kolmogorov's formula
0101 // kolmogorov_smirnov_dist<>(100, 4) // N=100, order=4, i.e. Pelz-Good formula
0102 //
0103 // The exact distribution could be added to the API with a special order
0104 // parameter (e.g. 0 or infinity), or a separate distribution type altogether
0105 // (e.g. kolmogorov_smirnov_exact_distribution).
0106 //
0107 #ifndef BOOST_MATH_DISTRIBUTIONS_KOLMOGOROV_SMIRNOV_HPP
0108 #define BOOST_MATH_DISTRIBUTIONS_KOLMOGOROV_SMIRNOV_HPP
0109 
0110 #include <boost/math/distributions/fwd.hpp>
0111 #include <boost/math/distributions/complement.hpp>
0112 #include <boost/math/distributions/detail/common_error_handling.hpp>
0113 #include <boost/math/special_functions/jacobi_theta.hpp>
0114 #include <boost/math/tools/tuple.hpp>
0115 #include <boost/math/tools/roots.hpp> // Newton-Raphson
0116 #include <boost/math/tools/minima.hpp> // For the mode
0117 
0118 namespace boost { namespace math {
0119 
0120 namespace detail {
0121 template <class RealType>
0122 inline RealType kolmogorov_smirnov_quantile_guess(RealType p) {
0123     // Choose a starting point for the Newton-Raphson iteration
0124     if (p > 0.9)
0125         return RealType(1.8) - 5 * (1 - p);
0126     if (p < 0.3)
0127         return p + RealType(0.45);
0128     return p + RealType(0.3);
0129 }
0130 
0131 // d/dk (theta2(0, 1/(2*k*k/M_PI))/sqrt(2*k*k*M_PI))
0132 template <class RealType, class Policy>
0133 RealType kolmogorov_smirnov_pdf_small_x(RealType x, RealType n, const Policy&) {
0134     BOOST_MATH_STD_USING
0135     RealType value = RealType(0), delta = RealType(0), last_delta = RealType(0);
0136     RealType eps = policies::get_epsilon<RealType, Policy>();
0137     int i = 0;
0138     RealType pi2 = constants::pi_sqr<RealType>();
0139     RealType x2n = x*x*n;
0140     if (x2n*x2n == 0.0) {
0141         return static_cast<RealType>(0);
0142     }
0143     while (true) {
0144         delta = exp(-RealType(i+0.5)*RealType(i+0.5)*pi2/(2*x2n)) * (RealType(i+0.5)*RealType(i+0.5)*pi2 - x2n);
0145 
0146         if (delta == 0.0)
0147             break;
0148 
0149         if (last_delta != 0.0 && fabs(delta/last_delta) < eps)
0150             break;
0151 
0152         value += delta + delta;
0153         last_delta = delta;
0154         i++;
0155     }
0156 
0157     return value * sqrt(n) * constants::root_half_pi<RealType>() / (x2n*x2n);
0158 }
0159 
0160 // d/dx (theta4(0, 2*x*x*n/M_PI))
0161 template <class RealType, class Policy>
0162 inline RealType kolmogorov_smirnov_pdf_large_x(RealType x, RealType n, const Policy&) {
0163     BOOST_MATH_STD_USING
0164     RealType value = RealType(0), delta = RealType(0), last_delta = RealType(0);
0165     RealType eps = policies::get_epsilon<RealType, Policy>();
0166     int i = 1;
0167     while (true) {
0168         delta = 8*x*i*i*exp(-2*i*i*x*x*n);
0169 
0170         if (delta == 0.0)
0171             break;
0172 
0173         if (last_delta != 0.0 && fabs(delta / last_delta) < eps)
0174             break;
0175 
0176         if (i%2 == 0)
0177             delta = -delta;
0178 
0179         value += delta;
0180         last_delta = delta;
0181         i++;
0182     }
0183 
0184     return value * n;
0185 }
0186 
0187 } // detail
0188 
0189 template <class RealType = double, class Policy = policies::policy<> >
0190     class kolmogorov_smirnov_distribution
0191 {
0192     public:
0193         typedef RealType value_type;
0194         typedef Policy policy_type;
0195 
0196         // Constructor
0197     kolmogorov_smirnov_distribution( RealType n ) : n_obs_(n)
0198     {
0199         RealType result;
0200         detail::check_df(
0201                 "boost::math::kolmogorov_smirnov_distribution<%1%>::kolmogorov_smirnov_distribution", n_obs_, &result, Policy());
0202     }
0203 
0204     RealType number_of_observations()const
0205     {
0206         return n_obs_;
0207     }
0208 
0209     private:
0210 
0211     RealType n_obs_; // positive integer
0212 };
0213 
0214 typedef kolmogorov_smirnov_distribution<double> kolmogorov_k; // Convenience typedef for double version.
0215 
0216 #ifdef __cpp_deduction_guides
0217 template <class RealType>
0218 kolmogorov_smirnov_distribution(RealType)->kolmogorov_smirnov_distribution<typename boost::math::tools::promote_args<RealType>::type>;
0219 #endif
0220 
0221 namespace detail {
0222 template <class RealType, class Policy>
0223 struct kolmogorov_smirnov_quantile_functor
0224 {
0225   kolmogorov_smirnov_quantile_functor(const boost::math::kolmogorov_smirnov_distribution<RealType, Policy> dist, RealType const& p)
0226     : distribution(dist), prob(p)
0227   {
0228   }
0229 
0230   boost::math::tuple<RealType, RealType> operator()(RealType const& x)
0231   {
0232     RealType fx = cdf(distribution, x) - prob;  // Difference cdf - value - to minimize.
0233     RealType dx = pdf(distribution, x); // pdf is 1st derivative.
0234     // return both function evaluation difference f(x) and 1st derivative f'(x).
0235     return boost::math::make_tuple(fx, dx);
0236   }
0237 private:
0238   const boost::math::kolmogorov_smirnov_distribution<RealType, Policy> distribution;
0239   RealType prob;
0240 };
0241 
0242 template <class RealType, class Policy>
0243 struct kolmogorov_smirnov_complementary_quantile_functor
0244 {
0245   kolmogorov_smirnov_complementary_quantile_functor(const boost::math::kolmogorov_smirnov_distribution<RealType, Policy> dist, RealType const& p)
0246     : distribution(dist), prob(p)
0247   {
0248   }
0249 
0250   boost::math::tuple<RealType, RealType> operator()(RealType const& x)
0251   {
0252     RealType fx = cdf(complement(distribution, x)) - prob;  // Difference cdf - value - to minimize.
0253     RealType dx = -pdf(distribution, x); // pdf is the negative of the derivative of (1-CDF)
0254     // return both function evaluation difference f(x) and 1st derivative f'(x).
0255     return boost::math::make_tuple(fx, dx);
0256   }
0257 private:
0258   const boost::math::kolmogorov_smirnov_distribution<RealType, Policy> distribution;
0259   RealType prob;
0260 };
0261 
0262 template <class RealType, class Policy>
0263 struct kolmogorov_smirnov_negative_pdf_functor
0264 {
0265     RealType operator()(RealType const& x) {
0266         if (2*x*x < constants::pi<RealType>()) {
0267             return -kolmogorov_smirnov_pdf_small_x(x, static_cast<RealType>(1), Policy());
0268         }
0269         return -kolmogorov_smirnov_pdf_large_x(x, static_cast<RealType>(1), Policy());
0270     }
0271 };
0272 } // namespace detail
0273 
0274 template <class RealType, class Policy>
0275 inline const std::pair<RealType, RealType> range(const kolmogorov_smirnov_distribution<RealType, Policy>& /*dist*/)
0276 { // Range of permissible values for random variable x.
0277    using boost::math::tools::max_value;
0278    return std::pair<RealType, RealType>(static_cast<RealType>(0), max_value<RealType>());
0279 }
0280 
0281 template <class RealType, class Policy>
0282 inline const std::pair<RealType, RealType> support(const kolmogorov_smirnov_distribution<RealType, Policy>& /*dist*/)
0283 { // Range of supported values for random variable x.
0284    // This is range where cdf rises from 0 to 1, and outside it, the pdf is zero.
0285    // In the exact distribution, the upper limit would be 1.
0286    using boost::math::tools::max_value;
0287    return std::pair<RealType, RealType>(static_cast<RealType>(0), max_value<RealType>());
0288 }
0289 
0290 template <class RealType, class Policy>
0291 inline RealType pdf(const kolmogorov_smirnov_distribution<RealType, Policy>& dist, const RealType& x)
0292 {
0293    BOOST_FPU_EXCEPTION_GUARD
0294    BOOST_MATH_STD_USING  // for ADL of std functions.
0295 
0296    RealType n = dist.number_of_observations();
0297    RealType error_result;
0298    static const char* function = "boost::math::pdf(const kolmogorov_smirnov_distribution<%1%>&, %1%)";
0299    if(false == detail::check_x_not_NaN(function, x, &error_result, Policy()))
0300       return error_result;
0301 
0302    if(false == detail::check_df(function, n, &error_result, Policy()))
0303       return error_result;
0304 
0305    if (x < 0 || !(boost::math::isfinite)(x))
0306    {
0307       return policies::raise_domain_error<RealType>(
0308          function, "Kolmogorov-Smirnov parameter was %1%, but must be > 0 !", x, Policy());
0309    }
0310 
0311    if (2*x*x*n < constants::pi<RealType>()) {
0312        return detail::kolmogorov_smirnov_pdf_small_x(x, n, Policy());
0313    }
0314 
0315    return detail::kolmogorov_smirnov_pdf_large_x(x, n, Policy());
0316 } // pdf
0317 
0318 template <class RealType, class Policy>
0319 inline RealType cdf(const kolmogorov_smirnov_distribution<RealType, Policy>& dist, const RealType& x)
0320 {
0321     BOOST_MATH_STD_USING // for ADL of std function exp.
0322    static const char* function = "boost::math::cdf(const kolmogorov_smirnov_distribution<%1%>&, %1%)";
0323    RealType error_result;
0324    RealType n = dist.number_of_observations();
0325    if(false == detail::check_x_not_NaN(function, x, &error_result, Policy()))
0326       return error_result;
0327    if(false == detail::check_df(function, n, &error_result, Policy()))
0328       return error_result;
0329    if((x < 0) || !(boost::math::isfinite)(x)) {
0330       return policies::raise_domain_error<RealType>(
0331          function, "Random variable parameter was %1%, but must be between > 0 !", x, Policy());
0332    }
0333 
0334    if (x*x*n == 0)
0335        return 0;
0336 
0337    return jacobi_theta4tau(RealType(0), 2*x*x*n/constants::pi<RealType>(), Policy());
0338 } // cdf
0339 
0340 template <class RealType, class Policy>
0341 inline RealType cdf(const complemented2_type<kolmogorov_smirnov_distribution<RealType, Policy>, RealType>& c) {
0342     BOOST_MATH_STD_USING // for ADL of std function exp.
0343     RealType x = c.param;
0344    static const char* function = "boost::math::cdf(const complemented2_type<const kolmogorov_smirnov_distribution<%1%>&, %1%>)";
0345    RealType error_result;
0346    kolmogorov_smirnov_distribution<RealType, Policy> const& dist = c.dist;
0347    RealType n = dist.number_of_observations();
0348 
0349    if(false == detail::check_x_not_NaN(function, x, &error_result, Policy()))
0350       return error_result;
0351    if(false == detail::check_df(function, n, &error_result, Policy()))
0352       return error_result;
0353 
0354    if((x < 0) || !(boost::math::isfinite)(x))
0355       return policies::raise_domain_error<RealType>(
0356          function, "Random variable parameter was %1%, but must be between > 0 !", x, Policy());
0357 
0358    if (x*x*n == 0)
0359        return 1;
0360 
0361    if (2*x*x*n > constants::pi<RealType>())
0362        return -jacobi_theta4m1tau(RealType(0), 2*x*x*n/constants::pi<RealType>(), Policy());
0363 
0364    return RealType(1) - jacobi_theta4tau(RealType(0), 2*x*x*n/constants::pi<RealType>(), Policy());
0365 } // cdf (complemented)
0366 
0367 template <class RealType, class Policy>
0368 inline RealType quantile(const kolmogorov_smirnov_distribution<RealType, Policy>& dist, const RealType& p)
0369 {
0370     BOOST_MATH_STD_USING
0371    static const char* function = "boost::math::quantile(const kolmogorov_smirnov_distribution<%1%>&, %1%)";
0372    // Error check:
0373    RealType error_result;
0374    RealType n = dist.number_of_observations();
0375    if(false == detail::check_probability(function, p, &error_result, Policy()))
0376       return error_result;
0377    if(false == detail::check_df(function, n, &error_result, Policy()))
0378       return error_result;
0379 
0380    RealType k = detail::kolmogorov_smirnov_quantile_guess(p) / sqrt(n);
0381    const int get_digits = policies::digits<RealType, Policy>();// get digits from policy,
0382    std::uintmax_t m = policies::get_max_root_iterations<Policy>(); // and max iterations.
0383 
0384    return tools::newton_raphson_iterate(detail::kolmogorov_smirnov_quantile_functor<RealType, Policy>(dist, p),
0385            k, RealType(0), RealType(1), get_digits, m);
0386 } // quantile
0387 
0388 template <class RealType, class Policy>
0389 inline RealType quantile(const complemented2_type<kolmogorov_smirnov_distribution<RealType, Policy>, RealType>& c) {
0390     BOOST_MATH_STD_USING
0391    static const char* function = "boost::math::quantile(const kolmogorov_smirnov_distribution<%1%>&, %1%)";
0392    kolmogorov_smirnov_distribution<RealType, Policy> const& dist = c.dist;
0393    RealType n = dist.number_of_observations();
0394    // Error check:
0395    RealType error_result;
0396    RealType p = c.param;
0397 
0398    if(false == detail::check_probability(function, p, &error_result, Policy()))
0399       return error_result;
0400    if(false == detail::check_df(function, n, &error_result, Policy()))
0401       return error_result;
0402 
0403    RealType k = detail::kolmogorov_smirnov_quantile_guess(RealType(1-p)) / sqrt(n);
0404 
0405    const int get_digits = policies::digits<RealType, Policy>();// get digits from policy,
0406    std::uintmax_t m = policies::get_max_root_iterations<Policy>(); // and max iterations.
0407 
0408    return tools::newton_raphson_iterate(
0409            detail::kolmogorov_smirnov_complementary_quantile_functor<RealType, Policy>(dist, p),
0410            k, RealType(0), RealType(1), get_digits, m);
0411 } // quantile (complemented)
0412 
0413 template <class RealType, class Policy>
0414 inline RealType mode(const kolmogorov_smirnov_distribution<RealType, Policy>& dist)
0415 {
0416     BOOST_MATH_STD_USING
0417    static const char* function = "boost::math::mode(const kolmogorov_smirnov_distribution<%1%>&)";
0418    RealType n = dist.number_of_observations();
0419    RealType error_result;
0420    if(false == detail::check_df(function, n, &error_result, Policy()))
0421       return error_result;
0422 
0423     std::pair<RealType, RealType> r = boost::math::tools::brent_find_minima(
0424             detail::kolmogorov_smirnov_negative_pdf_functor<RealType, Policy>(),
0425             static_cast<RealType>(0), static_cast<RealType>(1), policies::digits<RealType, Policy>());
0426     return r.first / sqrt(n);
0427 }
0428 
0429 // Mean and variance come directly from
0430 // https://www.jstatsoft.org/article/view/v008i18 Section 3
0431 template <class RealType, class Policy>
0432 inline RealType mean(const kolmogorov_smirnov_distribution<RealType, Policy>& dist)
0433 {
0434     BOOST_MATH_STD_USING
0435    static const char* function = "boost::math::mean(const kolmogorov_smirnov_distribution<%1%>&)";
0436     RealType n = dist.number_of_observations();
0437     RealType error_result;
0438     if(false == detail::check_df(function, n, &error_result, Policy()))
0439         return error_result;
0440     return constants::root_half_pi<RealType>() * constants::ln_two<RealType>() / sqrt(n);
0441 }
0442 
0443 template <class RealType, class Policy>
0444 inline RealType variance(const kolmogorov_smirnov_distribution<RealType, Policy>& dist)
0445 {
0446    static const char* function = "boost::math::variance(const kolmogorov_smirnov_distribution<%1%>&)";
0447     RealType n = dist.number_of_observations();
0448     RealType error_result;
0449     if(false == detail::check_df(function, n, &error_result, Policy()))
0450         return error_result;
0451     return (constants::pi_sqr_div_six<RealType>()
0452             - constants::pi<RealType>() * constants::ln_two<RealType>() * constants::ln_two<RealType>()) / (2*n);
0453 }
0454 
0455 // Skewness and kurtosis come from integrating the PDF
0456 // The alternating series pops out a Dirichlet eta function which is related to the zeta function
0457 template <class RealType, class Policy>
0458 inline RealType skewness(const kolmogorov_smirnov_distribution<RealType, Policy>& dist)
0459 {
0460     BOOST_MATH_STD_USING
0461    static const char* function = "boost::math::skewness(const kolmogorov_smirnov_distribution<%1%>&)";
0462     RealType n = dist.number_of_observations();
0463     RealType error_result;
0464     if(false == detail::check_df(function, n, &error_result, Policy()))
0465         return error_result;
0466     RealType ex3 = RealType(0.5625) * constants::root_half_pi<RealType>() * constants::zeta_three<RealType>() / n / sqrt(n);
0467     RealType mean = boost::math::mean(dist);
0468     RealType var = boost::math::variance(dist);
0469     return (ex3 - 3 * mean * var - mean * mean * mean) / var / sqrt(var);
0470 }
0471 
0472 template <class RealType, class Policy>
0473 inline RealType kurtosis(const kolmogorov_smirnov_distribution<RealType, Policy>& dist)
0474 {
0475     BOOST_MATH_STD_USING
0476    static const char* function = "boost::math::kurtosis(const kolmogorov_smirnov_distribution<%1%>&)";
0477     RealType n = dist.number_of_observations();
0478     RealType error_result;
0479     if(false == detail::check_df(function, n, &error_result, Policy()))
0480         return error_result;
0481     RealType ex4 = 7 * constants::pi_sqr_div_six<RealType>() * constants::pi_sqr_div_six<RealType>() / 20 / n / n;
0482     RealType mean = boost::math::mean(dist);
0483     RealType var = boost::math::variance(dist);
0484     RealType skew = boost::math::skewness(dist);
0485     return (ex4 - 4 * mean * skew * var * sqrt(var) - 6 * mean * mean * var - mean * mean * mean * mean) / var / var;
0486 }
0487 
0488 template <class RealType, class Policy>
0489 inline RealType kurtosis_excess(const kolmogorov_smirnov_distribution<RealType, Policy>& dist)
0490 {
0491    static const char* function = "boost::math::kurtosis_excess(const kolmogorov_smirnov_distribution<%1%>&)";
0492     RealType n = dist.number_of_observations();
0493     RealType error_result;
0494     if(false == detail::check_df(function, n, &error_result, Policy()))
0495         return error_result;
0496     return kurtosis(dist) - 3;
0497 }
0498 }}
0499 #endif