I realized something the other day while doing a curve fit in Excel that I figured was worth sharing.

The bottom line is that if you use Excel’s trend line feature to apply a trend line to a set of data in a graph ….

… experiment with the options to find something that is a reasonable fit ….

…. and have Excel put the equation for the line on the graph ….

…. you need to be careful if you use the equation to predict data, especially with higher order polynomials.

It could be pretty tempting to write a formula that used the trend line equation and assume it was correct. And technically, it is correct. But, in some cases, especially with high power polynomials, your predictions could be way off if you did that because of the compounding of rounding errors.

In other words, the coefficients presented in the equation are correct, but rounded off. And for some applications, the digits that were dropped could make the difference between making an accurate prediction from your data and one that was not so good, especially if you multiply them by numbers that have big exponents.

Most of you probably already realize this, and when I noticed the issue, I sort of had a hunch about the reason for it. But in figuring out how to work around it, I learned some things that will probably be useful, so I thought I would share them.

How I got into this was that I was working on a control valve selection for a condenser water system that has a number of different operating flow rates. So, I was looking at the valve performance for my selection at different flow rates. I was basing my selection of a Bray series 30 butterfly valve and had the data for its flow coefficient (a.k.a C_{V})at different disc angles and decided to make a graph so I could just read the C_{V}for angles that were not directly documented.

Then, I got to thinking that if I could do a curve fit, I could use the equation for the curve to solve for the C_{V}; not a big time saver for picking a particular valve, but if I saved the spreadsheet as a tool or wanted to play “what if” games, it could be handy. Plus, I guess I got a little curious. So, I plotted my curve and got this as a result.

Visually, the trend line looked like a pretty good fit with the 5^{th}order polynomial.

So, I then wrote a formula using the coefficients in the trend line equation and got this result when I plotted it to check myself.

As you can see, quite a difference.

Initially, of course, I thought I had miss-entered one of the coefficients. But when that did not prove to be the case, I realized that with the high power polynomials (x to the 5^{th}for instance) even small change in the coefficient would make a big change in the result and that the problem was probably related to the rounding off of the coefficients.

That got me curious about how you would actually get more accurate numbers for the coefficients out of Excel. My reasoning was that Excel must know them; otherwise it could not have drawn the trend line that visually showed a much closer fit.

It turns out that there is an Excel function called LINEST, which is what you can use to do this. In general terms, it is a least squares curve fitting technique where you input your y and x values and the function returns the coefficients for the equation for your line. It can also force the y intercept to be zero and give you all of the statistical data about the line (like the r^{2}values, etc.)

It is one of the statistical functions, and when I read through the discussion at the Excel tutorial link I reference above, I was a bit overwhelmed by the math jargon. Plus, it was not clear to me how to apply it for the valve C_{V}situation.

Then I came across a LINESTarticle on the web that opened the door to my understanding. I would have never figured it out from the Excel tutorial information. In addition to showing how to apply LINEST for a polynomial, the article also shows how to apply it for other data fits including logarithmic, powers, and exponentials.

Courtesy of the article’s author, I learned that for the polynomial fit, to get the coefficients you need, you use the following form of the function:

=LINEST(H150:H158,G150:G158^{1,2,3,4,5})

Where:

H150:H158 represents the known y values; in my case, these were C_{V }data points I read for different disc angles from the Bray valve data sheet.

G150:G158 represents the x valuescorresponding to the y values you know

**^** is apparently the “magic”;in other words, the little “up arrow” symbol seems to be what tells LINEST that you want to have it tell you the coefficients for the equation of the line; in a normal Excel formula, that symbol would be what you used to raise a number to the left of the “up arrow” to the power on the right of the “up arrow” (for instance 3^2 is 3 squared)

{1,2,3,4,5} tells LINEST the order of the polynomial; in other words how many coefficients you are looking for. In my case, I had a 5^{th} order (y = m_{5} * x^{5} + m_{4} * x^{4} + m_{3}* x^{3} + m_{2} * x^{2} + m_{1} *x + b) polynomial trend line that looked like a good curve fit, so I needed 5 coefficients. If it as a third order polynomial (y = m_{3}* x^{3} + m_{2} * x^{2} + m_{1}*x + b) then I would have used {1,2,3}.

If you want the other statistics or to force the intercept to be 0, you would use a form of the function that looks like this:

=LINEST(H150:H158,G150:G158^{1,2,3,4,5},FALSE,TRUE)

Where everything up to “FALSE” has the same meaning as the previous discussion. As far as the new parameters go:

,FALSEcontrols if you do or don’t force the intercept to be zero with “FALSE” forcing it to zero. If you omit this parameter, the intercept is not forced.

,TRUEcontrols if you get the regression statistics with “TRUE” causing the stats to be returned. If you omit this parameter, they are not returned.

There is much more detail on the syntax in the EXCEL tutorial at the link I included above, along with some examples. But, for my aging brain at least, the Excel tutorial did not include an example of how to do what I wanted to do (at least not one that I understood). So, I am very grateful to whom-ever it was that wrote the article I mentioned previously.

Note that the “curly brackets” in the form above are manually entered vs. the “curly brackets” that Excel automatically enters when you make a formula an array formula. That brings me to the other “trick” for using this; you have to enter it as an array formula.

Array formulas are powerful Excel toolsthat allow you to summarize data in many different ways. But for the current discussion, I found it easier to think of them as formulas that work with a range of numbers as inputs that also return answers that are more than one number, thus needing an array of cells (more than one) for their output.

To create an array formula, you type the formula into a cell, which becomes the upper left corner of the range you want to have your output show up in. After you enter the formula, you highlight your output range and then hit “F2” followed by holding down “Shift” plus “Control” plus “Enter” at the same time.

That actually sounds more complicated than it is. The bottom line is that the LINEST function needs an output range that is at least one row high with a column for each coefficient in the polynomial along with the y intercept. (If you want the additional statistics, then you need 5 rows instead of 1 row in addition to including those parameters in the LINEST formula.)

In my case, I was only looking for the coefficients for the polynomial. So I picked a cell to enter my formula in and then highlighted that cell along with the next 5 cells to the right and then did the “F2” “Shift” plus “Control” plus “Enter” thing.

Here is what that looked like in my spreadsheet right after I had entered the formula, highlighted the output range, and hit “F2”.

Note that the labels (m_{5}, m_{4}, m_{3}, m_{2}, m_{1}, and b) were ones that I placed in advance for my own reference; if nothing else, it helps me see which cells to select for the next step. Before I hit “F2”, the cell with the LINEST function in it (cell D14) held a number that turns out to be the m_{5}coefficient.

When I highlight the output range and then hit “F2”, the formula opens up and you can see that it references the C_{V}data for the first parameter required by LINEST (the blue box and blue text in the formula).

You can also see that it references the disc angle data for the second parameter (the green box and green text in the formula).

Finally, you can see the “magic” ^{1,2,3,4,5} parameter, which seems to be what makes this work in terms of assessing the polynomial coefficients.

Hitting “Shift” plus “Control” plus “Enter” at the same time populates all of the cells in the range you have selected with the LINEST formula. But each cell yields a different result, with the results being the parameters of interest.

In this screen shot above, the cell highlighted in green is where I originally entered the formula. The cells highlighted in blue are the cells that I selected along with the green cell as the range for the output of the formula before doing the “F2”, “Control” plus “Shift” plus “Enter” thing.

If you compare the results from the LINEST function with the coefficients from the trend line equation, you see that the trend line coefficients are what you would get if you rounded off the LINEST coefficients.

But those seemingly insignificant digits make a big difference in the results you get if you use them to assess the polynomial.

As you can see, the C_{V} values calculated from the coefficients developed with LINEST (column I; lavender highlight) agree closely with the actual Bray values that are behind them. But the C_{V}values calculated using the rounded off coefficients as presented in the trend line equation (column J; pink highlight and the green line in the graph) diverge from the Bray values significantly.

Specifically, In the graph above:

- The red line is the actual data from Bray.
- The blue line is the trend line you get if you ask Excel to apply a trend line to the red data series and then experiment with the different options until you get something that looks like a fit. In this particular case, I ended up using a 5
^{th}order polynomial, as mentioned previously. - The equation, highlighted in yellow, is the trend line equation provided by the spreadsheet function.
- The green line is what you get if you use that equation to predict the C
_{V}values for various disc angles.

I guess all of this comes back to something that my teachers and mentors have taught me ever since I started adding 1 plus 1 to get 2. That being that its always good to check your answer somehow before moving forward with it. In this case, when I took the step of plotting the results of the equation I had developed from the trend line coefficients, I discovered there was a problem.

That insight led me down the road of discovery to the LINEST function, which in turn led me to the missing digits in my coefficients; the little things that made the difference between using bad data and good data to make a decision for a client.

David Sellers

Senior Engineer – Facility Dynamics Engineering

Good tips on the LINEST function David. Another trick you can use with the functions generated by the trendline function within a graph is to right click on the trendline label, click format trendline label, then change the format category to number and increase the decimal places to however many are appropriate. That method can be a little more expedient depending on the situation and level of precision called for.

As Chuck McClure, one of my mentors would have said, the only thing wrong with that idea is that I didn’t think of it. Thanks much for sharing Ben. That would be much faster than the approach I used.

David

A couple of guys that I work with offered a few comments in a discussion I was having with them about this. One is an amusing quote forwarded by Steve Briggs;

— With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.John von Neumann (via Wikiquote)

The other is from Murphy O’Dea, which has a lot of detailed technical information about curve fits that some folks who find this post might find to be useful, so I thought I would paste it in here for reference if you’re interested.

Murph’s stuff is pretty mathematcially technical and at the most, I sort of vaguely recognize a bit of it from my college days. At this point in my life, I’m probably lucky I can remember how to spell polynomial; probably some combination of age, practice, and the mixed blessing of computers and software that do some of the thinking for you.

The latter is probably the real reason I did the post. Its pretty easy to see stuff in a spreadsheet and just figure its right and off you run with it. The push to be faster and quicker doesn’t help.

I think my first grade teacher was the first person to teach me the importance of checking your work, right after she taught me 1+1=2. That takes a little time, but is usually worth it, even if all it does is give you the comfort of the assurance that you are on the right track.

Thanks to both Steve and Murph for sharing; What follows is Murph’s input.

Don’t have time to try it out right now, but another way to do this is to use better basis functions. Normal polynomial fits use a linear combination (x, x^2, x^3, x^4, … N). Another way to do it is to use one of the orthogonal basis functions (one of a family which are all solutions of singular Sturm-Liouville Partial Differential Equations (PDE)). These functions include Chebyshev, Legendre, Laguerre, etc… For continuous data with no singularities, these give what is known as “Exponential” or “Spectral” convergence, which also reduces the effect of roundoff (faster convergence means less terms required for a given accuracy).I use these (mostly Legendre and Chebyshev) to solve PDEs, which has different math behind it than doing curve fits. I found this link:

http://www.extremeoptimization.com/Samples/CurveFitting.aspx

I know there’s tons of stuff out there, including some apps…

If you want to try to build your own spreadsheet using these functions (I’d try Chebyshev to start…), I think it would be pretty easy. The basic format is the same:

F(x) = c1*T1(x) + c2*T2(x) + c3*T3(x) + N*TN(x)

The c’s are the coefficients to be solved for, the T’s are the Chebyshev basis functions. These can be written as cosine functions with a change of variable, or as adapted polynomials.

So, like any curve fit, you plug in your data points for x1,F1 ; x2,F2 ; …N and you get N simultaneous equations which you solve for the c’s (linear algebra). This then gives you an equation you can use.

Now, I don’t know if you can adapt that LINEST() function to do this, but you could probably write one in excel (VBASIC ?), or use MATLAB…

Another tidbit: At their very core, most transcendental functions (like trig functions, etc) can be represented using exponentials. Trying the exponential feature in LINEST() with fewer terms than the polynomials would be an interesting exercise 😉Working with CFD simulations that have lots and i do mean LOTS of decimals, i found your approach excellent. I will be sure to mention you in my Thesis

Fantastic article on this function. Explained in better detail than the Excel Help section. Thanks for saving my day as I too was struggling to understand why I wasn’t getting the correct values when I use the coefficients from my trendline equation.

Hi Hidayat,

This has turned out to be one of my most popular posts and on occasion, I have wondered if it was actually useful or if it was so far off the mark that there were a bunch of mathemeticians passing it around as the joke of the day. So, its good to hear that it was helpful.

Thanks much for letting me know.

David

So I too had the same issue with excels formulae. In my case I had a fit which reported an r2=1, yet when I plotted the curve using the formula it was way off. Interestingly you can format the formula on the chart in the same way you format the data in a cell. I formatted the formuala to scientific numbering with 6 decimal places and low and behold the plotted data wa perfect.

Hi Steve,

This is by an order of magnitude, my most popular post, which is odd given that the blog is basically about HVAC. For a while, I thought it probably was being passed around by really smart people in Math departments as the joke of the day. But it seems others, like you and I, have been puzzled by the same thing.

And, one of the more amusing things about this post is that I got it up and everything and about two days later got a message similar to yours saying basically “nice post, but did you know you could format the label to show more decimal places”, which, of course, clearly, I did not.

So I learned two things, one being the stuff in the post and the other being the formatting trick. Always something to learn with Excel. Thanks for your feedback and for visiting the blog.

Best,

David

Hi David,

Thanks for that, much easier to understand than the excel tutorials

Glad it helped.

Thanks for visiting the blog,

David

Dear David,

I seems that you’ve made a error in the coefficients from the trendline.

In the file (https://av8rdas.files.wordpress.com/2012/09/compare-coefficients_thumb.jpg?w=644&h=116), you fill in the coefficients from the LINEST and from the trendline. If you look at the m1 coefficient, there is a symbol mistake. The LINEST gives -44.35 while from the trendline, the value is 44.358. The two values differ by a factor of -1.

I assume that you typed trendline values… Typo mistake…

And then this might make a huge difference in your conclusion that the number of significant numbers (or digits after the decimal place) makes a large difference.

Best regards,

Yusuf

Hi Yusuf,

Sharp eyes; out of the 43,159 people who have looked at that post (it is by far, my most popular post, even more popular than the ones about how to make a Jeopardy game), you are the first one to notice the typo, or at least the first one to say something. Thanks much.

I have fixed the graphic and the reality is that it was a typo in my transcription of the equation coefficients. Meaning Excel was actually working with the negative number. And while I agree that if Excel had been working with the positive number, it would have made a difference, the conclusion I reached about the decimal places mattering is still valid. In other words, the conclusion and math was based on the negative number, not the typo. And the number of decimal places you use when you generate the multi-order polynomial can have a huge impact on the result (as will getting the sign wrong if you are not careful when you type things in).

Thanks again for your sharp eyes, and thanks for visiting the blog.

Best,

David

i obtained my data from research, plotted and it gave exponential curve. how can i use excell to find equation for this curve.

Hi Robert,

The procedure is very similar to what I show in the blog post. But instead of picking the “Polynomial” option for the trend line, you pick the “Exponential” option, which shows up at the top of the list. That said, if you don’t get a good fit, you may have to try some of the other options on a “trial and error” basis. I do that a lot myself.

Thanks for visiting the blog.

You can just simply use the “format trendline label” to increase the decimal places…

There are lots of tips and tricks you can do with Excel to curve-fit any given data set, but no matter how perfect you can fit the curve, prediction out side the range of your actual hard data is always risky. It all comes down to how confident you are with your work, and if it was to be scrutinized how well you can defend yourself. Best way to play safe is to stay within the range of your hard data — that is only do interpolations, never extrapolations.

Btw, I was actually googling to confirm if the Excel’s TREND() formula only do linear function. Though I didn’t find my answer from your post, its good to know about the LINEST() formula. Thanks.

Didn’t see it mentioned, may be exceedingly obvious, but you can use the index function to extract each of the coefficients the linest function spits out. This way you can have it retrieve a single y value using the trendline values based on a given x value in a single cell, and update the y value if the x changes.

eg

for a second order poly

=(x-new^2)*INDEX(LINEST(known_y,known_x^{1,2}),1) + $A$2*INDEX(LINEST(known_y,known_x^{1,2}),2) + INDEX(LINEST(known_y,known_x^{1,2}),3)

Great information. Really informative and so much better than the default Excel Help files. Many thanks to all!