Categories: Sheets : Report an issue :

Errors and other issues with statistical and mathematical functions in GSheets

Showing 1-12 of 12 messages
Errors and other issues with statistical and mathematical functions in GSheets Bas Braams 5/6/12 1:30 PM
Introduction

Several previous postings and conversations on this forum, e.g. [1-7], are concerned with numerical errors or accuracy issues in Google Spreadsheets. The posting [7] is mine and at first I only repeated observations made in earlier conversations that had become closed for further replies. The initial posting in [7] concerned the cumulative normal distribution function, NORMDIST, but then every other day I would look at another function in the Google Spreadsheets program and find further errors or problems, which I then reported on that same conversation. The conversation [7] has meanwhile become a bit a unwieldy and I like to condense my observations together in this fresh posting.

Statistical distributions

Errors in the cumulative normal distribution function NORMDIST and NORMSDIST were identified by @malemi in [2]. Note that NORMDIST takes four arguments conventionally denoted (number, mean, stdev, C). Argument C is an indicator; C=0 for the distribution function and C=1 for the cumulative distribution function. Arguments mean and stdev specify a shift and a scale. I concentrate on the case of normalized arguments, mean=0 and stdev=1, and the cumulative distribution function. For that case we have NORMSDIST as a short-hand: NORMSDIST(x) = NORMDIST(x,0,1,1).

The implementation of NORMDIST (NORMSDIST) is very seriously wrong. The function returns a negative value for x=-5 and a value greater than 1 for x=5; it is oscillatory about 0 for x.le.-5 and oscillatory about 1 for x.ge.5. As analyzed under [7] near 2012-03-31 it is not simply that an inaccurate approximation is used within the code; rather there seems to be a gross coding error. Different approximations are used on different intervals and I conjecture that the interval identification got messed up somehow.

The cumulative lognormal distribution, LOGNORMDIST, is simply related to the cumulative normal distribution function: lognormdist(x,0,1) = normdist(ln(x),0,1,1), = normsdist(ln(x)). This appears to be how it is computed in the Google Spreadsheets program and therefore LOGNORMDIST has the same accuracy and sign issues as does NORMDIST or NORMSDIST. For the cumulative distribution we find (see [7], around 2012-04-14):
=LOGNORMDIST(145,0,1) returns 1.00000003243862 (a number larger than 1)
=LOGNORMDIST(0.0069,0,1) returns -0.00000002866487 (a number less than 0)
Based on the earlier experience with NORMDIST we can expect that the cumulative function LOGNORMDIST(x,0,1) is at best of single precision accuracy and it deteriorates to basically garbage for approximately x<0.01 while 1-LOGNORMDIST(x,0,1) turns to garbage for approximately x>100.

The discrete probability distributions POISSON, BINOMDIST and HYPGEOMDIST are all messed up in a similar way, unrelated to the NORMDIST issue. The problems were identified in [1], [3-5] and under [7] around 2011-11-19 with a further look on 2012-04-14. The root of the issue is that the mathematical definition of these distributions involves a product of factorials and power and exponential functions. For example:

POISSON(n,x,0) = pow(x,n)*exp(-x)/n!
BINOMDIST(k,n,p,0) = pow(p,k)*pow(1-p,n-k)*n!/(k!*(n-k)!)

The final argument is 0 or 1; 0 for the probability and 1 for the cumulative probability and the issue affects both the density and the cumulative distribution function. The expression for HYPGEOMDIST is a bit messier, but there too there is an argument n that represents the total population and a factor n! occurs in the mathematical expression.

For quite modest values of the arguments the individual factors can overflow or underflow when standard floating point arithmetic is used, even when the desired final result is a number that is not very small. If the mathematical expressions are employed factor by factor in standard IEEE 64-bit floating point arithmetic then one should expect overflow for n! whenever n.gt.170. It is unreasonable to calculate the functions this way; there should be nothing amiss with invoking POISSON or BINOMDIST or HYPGEOMDIST for population sizes very much larger than 170. In fact, in Google Spreadsheets the actual limit is even less than 170; it is 136, and this appears to be due to the use of Stirling’s approximation factor by factor to compute the factorial and to some other issues. Anyone interested in the gory details can find it explored under [7].

The inverse of the cumulative normal distribution function is implemented in the Google Spreadsheets function CRITBINOM. An error in the documentation was noted in [8]. In a reply there and also under [7] I noted that the implementation is affected by the same overflow errors identified for BINOMDIST and other discrete probabilities.

Variance, correlation and regression functions

The calculation the various variance (or standard deviation) functions in Google Spreadsheets is inaccurate in a manner that was fixed in EXCEL with the 2003 release. This error (inaccurate computation) affects the functions VAR, VARA, VARP, VARPA, DVAR, DVARP and the functions STDEV, STDEVA, STDEVP, STDEVPA, DSTDEV and DSTDEVP. I’ll explain it here only for STDEV, following the posting under [7] of 2011-11-15.

GDocs spreadsheets computes stdev(1e8,1e8+1)=0. This is wrong. The numbers 1e8 and 1e8+1 are exactly representable and the computed stdev should be invariant under a common shift. The correct and expected result is therefore the same as that of stdev(0,1), which is 0.707106781... I point out that LibreOffice Calc delivers the correct answer even for much larger shift away from the origin. Apparently in GDocs spreadsheets the standard deviation of N data points x[0..N-1] is being calculated as
sqrt((N*sum(x^2)-(sum(x))^2)/(N*(N-1)))
It is mathematically unobjectionable and numerically unsound.

Various functions for correlation and regression are similarly inaccurate due to (one may assume) a failure to shift the arguments to the vicinity of 0 before doing the linear algebra. Under [7], in postings between 2011-11-17 and 2011-11-23, I investigate LINEST, CORREL, TREND, LOGEST and GROWTH and identify similar errors in all.

Mathematical functions

The complementary error function is closely related to NORMSDIST and Google Spreadsheets offers ERFC. They might have expressed one in terms of the other according to the identity NORMSDIST(-x) = ERFC(x/sqrt(2))/2 but this was not done. At first (around 2012-03-31 under [7]) I assumed that ERFC would just be the system-supplied erfc, but it turns out not to be the case. The Google Spreadsheets function ERFC is messed up in its own way as explored under [7] around 2012-04-21.

For argument 5.922 and larger ERFC returns 0. It appears that the computation of ERFC in Google Spreadsheets employs the identity ERFC(x)+ERF(x)=2 the wrong way around; they compute ERF(x) (using a hidden implementation of ERF) where the result value is close to 2 and then compute ERFC(x) = 2-ERF(x).

Near the origin the Google Spreadsheets function ERFC has another problem; it is basically of single precision accuracy and it fails the mathematical property that erfc(0)=1. Witness:
=ERFC(0) returns 1.00000003

In addition, for no reason that I can imagine, ERFC(x) returns #NUM! for x.lt.0. The mathematical function erfc is perfectly well defined for x.lt.0.

Google Spreadsheets computes =FACT(136) to be 3.66e232 while =FACT(137) retuns #NUM!. But the overflow limit for IEEE 64-bit arithmetic is about 1.7977e308. There should be no problem with =FACT(137). Indeed, in LibreOffice Calc we compute =FACT(170) to be 7.2574e306 and =FACT(171) returns #NUM! as is reasonable. The issue is explored under [7] around 2011-11-24, where I discuss also some other strange behavior near the overflow limit. In addition to the Stirling approximation issue it appears that the effective overflow limit in Google Spreadsheets is a factor 1e15 lower than the IEEE 64-bit overflow limit. (I regard this as a curiosity and not as a major issue.)

Google Spreadsheets seems generally to use approximations of less than full accuracy. Under [7] (2011-11-24) I look at the log-gamma function, GAMMALN, and I find a loss of about 7 bits relative to the implementation in LibreOffice Calc (which appears to offer full 64-bit accuracy).

In Google Spreadsheets the multinomial (MULTINOMIAL) is computed as a ratio of factorials, which again results in gratuitous overflow. MULTINOMIAL should be computed as a product of binomials using a recursion in the number of arguments. See the discussion under [7] around 2011-11-17.

The binomial coefficients have their own unreasonable overflow failure: =COMBIN(1e8,2) fails, as does =COMBIN(5e15,1). This is obviously less severe than the problem with the use of factorials in the discrete probability functions.

Elementary functions and documentation

The MROUND function is described as follows [*]: "MROUND(number, multiple): The result is the nearest integer multiple of the number." There is not much that can be done with this except apply common sense to decide that probably it is meant that the value of MROUND(x,y) is x rounded to the nearest integer multiple of y. Indeed =MROUND(2,5) returns 0 and =MROUND(3,5) returns 5. But why on earth should =MROUND(-2,5) return #NUM!?

The ROUNDDOWN function is described as follows: "ROUNDDOWN(number, count): Rounds the given number. Count (optional) is the number of digits to be rounded down to. If the count parameter is negative, only the whole number portion is rounded. It is rounded to the place indicated by the count." For ROUNDUP we read: "ROUNDUP(number, count): Rounds the given number up. Count (optional) is the number of digits to which rounding up is to be done. If the count parameter is negative, only the whole number portion is rounded. It is rounded to the place indicated by the count." I would not have guessed from that description that for the case of negative first argument rounddown rounds up and roundup rounds down.
=ROUNDDOWN(-0.5,0) returns 0
=ROUNDUP(-0.5,0) returns -1.

Finally one further description [*]. "CEILING(number, significance, mode): Rounds the given number to the nearest integer or multiple of significance. Significance is the value to whose multiple of ten the value is to be rounded up (.01, .1, 1, 10, etc.). Mode is an optional value. If it is indicated and non-zero and if the number and significance are negative, rounding up is carried out based on that value." They might as well say: "Something having to do with rounding, try it out to see what it does". The FLOOR function has the same issue.

Functions that still need investigation

I am not aware of any close look at the random number generator that is used in Google Spreadsheets; functions RAND and RANDBETWEEN.

It could be interesting to see how the various errors in the statistical and mathematical functions propagate into the financial analysis functions or in various financial analysis templates that have been built upon Google Spreadsheets.

Closing remarks

As I wrote under [7], I think that the Google Docs and Spreadsheets team is making a big mistake with their casual attitude towards these errors in the accuracy of the Spreadsheets product. Until a few years ago Microsoft Excel was the standard awful example that would be invoked to demonstrate that spreadsheets could not be used for serious computation [9-14]. Inter alia such articles offered, I think, the implication that Microsoft Corp is not worthy of respect from scientists and engineers.

It appears to me that in the past few years Microsoft has risen quite a bit in reputation among scientists and engineers and Google is losing its privileged status in public perception. Public recognition that the numerical behavior of Google Spreadsheets is inferior (severely inferior, in fact) in 2011 (now 2012) to that of Microsoft Excel in 1997 will not help Google's standing. And the errors in Google Spreadsheets have been noticed outside this forum. Ref. [15] by Keeling and Pavur is relevant; in their abstract they write “Google Docs spreadsheet, while convenient, has deficiencies and should not be used for scientific statistical analysis.” (Their article appeared on-line 2012-01-24, otherwise I might have noticed it when I started on [7].) Probably Ref. [16] by McCullough and Yalta is relevant too, but all that I have been able to see of it to-date is the title.

I conclude with an observation that might be the topic of further development, e.g. with regard to the nature of present-day computer science education. It appears to be possible in the years 2007-2012 for a highly respected company that is able to attract the supposedly highest qualified engineers to develop a spreadsheet product without any concern for the numerical quality of the product and without any of their engineers (computer scientists, one assumes) noticing the gross faults in the numerical implementation. I find it disturbing.

References
 
[1] (2010-03-24) BINOMDIST not working ?

[2] (2010-10-14) NORMDIST throws negative value

[3] (2010-10-14) BINOMDIST does not accept TRIALS~140
 
[4] (2011-02-12) binomdist arbitrarily limits maximum argument - please correct function or help file to be consistent

[5] (2011-04-01) Statistical functions binomdist and hygeomdist give overflow messages with reasonable parameter values

[6] (2011-09-01) NORMDIST throws negative value STILL AFTER 1 YEAR!!!
 
[7] (2011-11-13) normdist throws negative value still

[8] (2012-03-15) Description of CRITBINOM function in Spreadsheet is incorrect.

[9] L. Knüsel: “On the Accuracy of Statistical Distributions in Microsoft Excel 97”, Computational Statistics & Data Analysis Vol. 26 (1998) pp. 375-377.

[10] A. Talha Yalta: The Accuracy of Statistical Distributions in Microsoft ® Excel 2007. Working Paper No 10-06 TOBB University of Economics and Technology Department of Economics (Ankara, Turkey).
(See also [13], but I don't have convenient access to the full text there)

[11] B.D. McCullough and David A. Heiser: "On the accuracy of statistical procedures in Microsoft Excel 2007"

[12] Leo Knüsel: "On the accuracy of statistical distributions in Microsoft Excel 2003"

[13] A. Talha Yalta: "The accuracy of statistical distributions in Microsoft® Excel 2007"

[14] M. G. Almiron et al.: "On the Numerical Accuracy of Spreadsheets"

[15] Kellie B. Keeling and Robert J. Pavur: “Statistical Accuracy of Spreadsheet Software”, The American Statistician, Vol. 65 (2011) pp. 265-273.

[16] A. Talha Yalta and B. D. McCullough, "Spreadsheets in the Cloud: Not Ready Yet.” (On 2012-05-06 this is listed as “working paper” or “under review” on the authors’ home pages; I haven’t seen the content, but I understand that Google Spreadsheets is featured.)
Bas Braams 6/16/12 6:04 AM <This message has been deleted.>
Bas Braams 6/17/12 1:42 PM <This message has been deleted.>
Bas Braams 6/26/12 12:02 PM <This message has been deleted.>
Re: Errors and other issues with statistical and mathematical functions in GSheets Bas Braams 7/1/12 10:29 AM
All the errors described in this posting and in the more extended posting [1] are still present on 2012-07-01 without acknowledgement that they are indeed errors. Some of the errors were reported back in 2010 and maybe earlier. In connection with this report and with [1] I have a small spreadsheet [2] that may be used to inspect if anything changes. The primary documentation or commentary for that little spreadsheet is the present posting and the posting [1]. At some point perhaps the behavior of Google Spreadsheets will change and the GSheet [2] will display different numbers than it does today. Today's display is shown in the pdf version [3]. Links [2] and [3] should both be accessible to anyone without sign-in to Google.

A few additions were made to [1] since the OP here on 2012-05-06. See [1] on 2012-06-19 for a diagnosis of the computation of the cumulative Poisson distribution, POISSON(n,x,1), being the sum over (k:0.le.k.le.n) of POISSON(k,x,0) where POISSON(k,x,0) = pow(x,k)*exp(-x)/k!. The cumulative distribution is apparently computed exactly as written, summing from k=0 up to k=n. It means that for n->infinity the calculated cumulative distribution POISSON(n,x,1) differs from the expected value 1 due to accumulated roundoff. See [1] on 2012-06-20 for a similar diagnosis of the computation of the cumulative binomial distribution BINOMDIST(m,n,p,1), which appears likewise to be computed by direct summation from the left for 0.le.k.le.m of the density BINOMDIST(k,n,p,0) and which therefore fails the expected normalization property BINOMDIST(n,n,p,1)=1. The posting [1] provides links to spreadsheets that demonstrate these issues for POISSON and BINOMDIST. See [1] on 2012-06-30 for a discussion of accuracy issues with the Google Spreadsheets function SUMX2MY2, which computes a difference of sums of squares. See [1] on 2012-07-01 for a correction to the diagnosis of the computation of ERFC; the identity that is used is erfc(x) = 1-erf(x), not erfc(x) = 2-erf(x) as I wrote earlier. The identity is mathematically correct (I was confused earlier about the definitions of erf and erfc), but it is numerically unsound to use it for large positive values of x. I identify an approximation from Numerical Recipes, Section 6.2, as the basis for the code that computes the hidden erf(x). 

[1] 2011-11-13: normdist throws negative value still

[2] GSheetsNumerics (an active Google Spreadsheet)

[3] GSheetsNumerics.pdf (produced on 2012-06-17 from the corresponding GSheet)
Re: Errors and other issues with statistical and mathematical functions in GSheets Bas Braams 7/22/12 1:17 PM
The issues and errors described here, and some other issues and errors too, are displayed in this spreadsheet.
It is viewable by anyone that has the link. The sheet is organized into a Main sheet and 10 further sheets, with titles:

ElemOps: Elementary operations
ElemFuncs: Elementary functions
SearchFuncs: Sorting and search and match functions
GoogleFuncs: Google functions
DBFuncs: Database functions
MathElementary: Elementary mathematical operations
MathFuncs: Mathematical functions
StatDists: Statistical distributions
StatAnal: Statistical analysis functions
FinanceOpsFuncs: Financial operations and functions

It is work in progress, but I like to make it available for feedback anyway.
Bas Braams 8/19/12 12:23 PM <This message has been deleted.>
Re: Errors and other issues with statistical and mathematical functions in GSheets Bas Braams 8/19/12 12:29 PM
I believe that all the errors mentioned earlier in this posting are still present on 2012-08-19, and there are a few more that weren't previously described here.

The posting [1] and the GSheetsIssues spreadsheet [2] display some gross faults with the rounding functions. When ROUNDDOWN (to integer) is applied to the exactly representable integer N=2^52 then the GSheets result is N+1, when ROUNDUP is applied then the result is N-1 and when ROUND is applied then the result is N+2. The correct result would be N in all cases. (I can guess what went wrong for ROUND: the coders thought to use a hack whereby the evaluation of round(x,0) for positive x involves a preliminary substitution x->(1+macheps)*x so that ties may get resolved in the desired direction.) The error with rounding to integer propagates into rounding to a fixed number of decimal places behind the dot; e.g. for 6 decimal places, ROUND(2^32,6) returns 2^32+2^(-20); the correct result would be 2^32.

The functions MATCH, VLOOKUP and HLOOKUP are messed up in their application to a sorted array in a manner that strongly suggests faulty implementation of the binary search algorithm. A comprehensive description is in my posting [3] and there is an earlier clear report by ahab in [4]. These issues are also illustrated in the spreadsheet [2]. The GDocs team might be a bit concerned about (a) having messed up a binary search and (b) not having noticed or fixed same for over a year and possibly much longer.

[1] (2012-07-24) rounding error in google spreadsheets

[2] GSheetsIssues (under continuous development)

[3] (2012-07-22) Trouble with searching and matching; functions SORT, MATCH, HLOOKUP and VLOOKUP

[4] (2011-06-26) Re: VLOOKUP function does not work when you include more than 2 rows in the function after last row of data in a sheet.
Re: Errors and other issues with statistical and mathematical functions in GSheets Bas Braams 10/13/12 1:39 PM
All errors described earlier in this conversation are still errors on 2012-10-13. The most remarkable one, for me, is the faulty implementation of the binary search in the lookup and match functions; it provides proof of some real structural problems with the GDocs effort.
Re: Errors and other issues with statistical and mathematical functions in GSheets Bas Braams 2/6/13 2:22 PM
Binary search in an N-element sorted array returning the N+1 element, round to integer returning a non-integer or the wrong integer, cumulative distribution functions returning a negative value, standard deviation and variance functions implemented in known bad ways, error function at 0 not equal to 0, gratuitous overflow errors in evaluation of discrete probability functions and combinatorial functions, function descriptions that read like "we haven't a clue, try it out"; all these errors are still errors on 2013-02-06. Way to go, Google Docs and Spreadsheets technical team and management!
Re: Errors and other issues with statistical and mathematical functions in GSheets Bas Braams 5/20/13 1:15 AM
None of the errors described here have been fixed and none have been acknowledged as errors by Google staff. On the other hand, a previously unreported error can be added to the list. As was found by Yogi Anand [1-2], the CEILING function may round down instead of up if the argument is just slightly larger than an integer. The FLOOR function has the dual behavior. To be precise: Let N be a nonnegative integer (but not so large that roundoff issues further complicate matters). If 0<x<=1e-7 then Google Sheets CEILING(N+x,1) returns the value N although a return value N+1 would be expected and FLOOR(N+1-x,1) returns N+1 although a return value N would be expected. If 1e-7<x<=1 then CEILING(N+x,1) returns N+1 as expected and FLOOR(N+1-x,1) returns N as expected.

The lack of concern from the Google side about these errors (of which the binary search error and the normsdist/erfc complex of errors stand out the most) has led me to the conjecture that the Spreadsheets project has been effectively abandoned by the Apps team; it just hasn't been announced yet. I posted that conjecture here [3].

[1] yogi_Some Anomalies In Use Of CEILING Function Comparing Google Spreadsheet Excel And OpenOffice

[2] (2013-03-13) Re: Is there anyway to round up?

[3] (2013-04-14) Prediction: Docs text editor and spreadsheets to be frozen and abandoned in favor of Quickoffice and other products
Re: Errors and other issues with statistical and mathematical functions in GSheets Alan Payne242 1/13/14 2:19 PM
I do not see any mention of problems with MINUS functions, I surely cannot get it to do even this! Any ideas? Thanks.