Square root algorithms

Bjørn, Sat Apr 22 2006, 11:50AM

Does anyone know any efficient square root algorithms? I made this one in an attempt to make it very short and failsafe but it is the fastest one I have tried out so far.

I was surprised not to find anything faster so if anyone has any ideas or pointers towards something efficient it would be interesting to hear. I saw an algorithm years ago that was very efficient but I can't recall the name.


;r0 = sqrt(r1)
		mov     r0, #0
		mov     r3, #32768
loop1   orr     r0, r0, r3
		mul     r2, r0, r0
		cmp     r2, r1
		bichi   r0, r0, r3
		movnes  r3, r3, lsr #1
		bne     loop1
Re: Square root algorithms
ShawnLG, Sat Apr 22 2006, 03:48PM

This one says it is under 20 clock cycles on some cpus.
Link2
Re: Square root algorithms
JimG, Sat Apr 22 2006, 04:38PM

The code you've shown is equivalent to the following C code:

int const isqrt( int const square )
{
   int guess = 0;
   int bit = 1 << 15;  // do the first 15 LSBs
   while ( bit > 0 ) 
   {
      guess |= bit;   // build up a value
      if ( guess * guess > square ) // is the square of the guess bigger than the orginal?
         guess &= ~bit; // remove the bit 
      bit >>= 1;
   }
   return guess;
}

It doesn't scale very well beyond 32 bit inputs and will always take 96 instructions to execute, but the algorithm you choose is dependant on your situation. If you're ever going to be taking the square root of 8 bit numbers you would use a different algorithm from if you were getting the square root of an 80 bit floating point number.

Normally table lookup with a Newton iteration like Newton-Raphson yeilds fast and accurate results for larger numbers. A purely iterative approach works really well for small numbers. It's always a trade off of overhead

I have used the BSR trick on Intel that is pointed out in the link ShawnLG provided to get an approximation before doing an iterative approach, as well.


Update: Fixed or to an and.
Re: Square root algorithms
Electroholic, Sat Apr 22 2006, 09:20PM

Copied from circuit cellar issue 187

Sum of odds

uint16_t Exh_sumodd-opt
(uint16_t A)
{
uint16_t x_odd = 1;

while (x_odd <= A)
{
A -= x_odd;
x_odd += 2;
}

return (x_odd>>1);
}
Re: Square root algorithms
Alfons, Sat Apr 22 2006, 11:43PM

In BASIC, this will work:

10 INPUT "Calculate square root of";A
20 GOSUB 50
30 PRINT "Square root of "A" is: ";B
40 END
50 O=1:B=0
60 AA=A*100
70 AA=AA-O
80 B=B+1
90 O=O+2
100 IF AA<=0 THEN B=B/10:RETURN
110 GOTO 70

(The code was given to me by Fedor Steeman, a Tandy Color Computer enthousiast)
Re: Square root algorithms
Bjørn, Sun Apr 23 2006, 12:05AM

My code does exactly the same as the following C code except that it exits early if it finds an exact result.

Table lookup in Flash memory is quite slow and the accuracy is not what I was aiming for but I have made a note of the <20 cycle algorithm for other uses.

I tried the sum of odds and i was faster on 8 bit numbers than my first attempt but if I configure my code for 8 bit then it becomes a lot faster than the sum of odds.
Re: Square root algorithms
Simon, Sun Apr 23 2006, 12:15AM

The Newton method should work well. To keep the options coming, what about a Taylor expansion? Ie. approximating the square-root function with a polynomial. Each term in your approximation should only involve two extra multiplications at most.

This isn't the best applicaton for Taylor approximations, however. Exponents and trig are better examples.

I guess this is meant to be a general algorithm, otherwise that would change things a lot.
Re: Square root algorithms
Bjørn, Mon Apr 24 2006, 04:02PM

It is supposed to be a general purpose algorithm for a microcomputer and have these properties that does not mix well:

  • Predictable
  • Accurate
  • Fast
  • Small


My first attempt unrolls well and I can use an unrolled version for 16 bit numbers and use that for small numbers and to approximate the square root of large numbers and use the looped version for 32 bit numbers.
Re: Square root algorithms
Carbon_Rod, Tue Apr 25 2006, 04:43AM

For low speed 8-bit real time fixed-root core design a look-up style hash table is nice as it uses no ram. =) Its cost is O (1) so it’s fast, simple, and predictable. There is a space-time trade off.. However, it also offers controlled number rounding if a weighted result is needed.

Did you mean an n-th root function or specifically square root? Estimated results using successive approximation techniques are common.
Re: Square root algorithms
Bjørn, Tue Apr 25 2006, 12:54PM

I was specifically thinking about square roots since that is what I needed at the time. I would be interested in efficient implementations of all kinds of useful functions, including IEEE 754 floating point functions. There is only room for 8192 instructions and there is a lot of other things to be included.
Re: Square root algorithms
Bjørn, Sat May 13 2006, 12:41PM

I found some fairly optimal code, 51 cycles, no approximations and no tables.


;Wilco Dijkstra 
			;R0 = sqrt(R1)
			;Use R0-R2
sqrt32f?A	MOV		r2, #3 << 30
			MOV		r0, #1 << 30

			CMP		r1, r0, ROR #2 * 0
			SUBHS	r1, r1, r0, ROR #2 * 0
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 1
			SUBHS	r1, r1, r0, ROR #2 * 1
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 2
			SUBHS	r1, r1, r0, ROR #2 * 2
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 3
			SUBHS	r1, r1, r0, ROR #2 * 3
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 4
			SUBHS	r1, r1, r0, ROR #2 * 4
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 5
			SUBHS	r1, r1, r0, ROR #2 * 5
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 6
			SUBHS	r1, r1, r0, ROR #2 * 6
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 7
			SUBHS	r1, r1, r0, ROR #2 * 7
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 8
			SUBHS	r1, r1, r0, ROR #2 * 8
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 9
			SUBHS	r1, r1, r0, ROR #2 * 9
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 10
			SUBHS	r1, r1, r0, ROR #2 * 10
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 11
			SUBHS	r1, r1, r0, ROR #2 * 11
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 12
			SUBHS	r1, r1, r0, ROR #2 * 12
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 13
			SUBHS	r1, r1, r0, ROR #2 * 13
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 14
			SUBHS	r1, r1, r0, ROR #2 * 14
			ADC		r0, r2, r0, LSL #1
			CMP		r1, r0, ROR #2 * 15
			SUBHS	r1, r1, r0, ROR #2 * 15
			ADC		r0, r2, r0, LSL #1

			BIC		r0, r0, #3 << 30  ; for rounding add: CMP r1, r0  ADC r0,r0, #1

			mov		pc,r14