Fast approximate arctan2/atan2 function

(click to interact)

Alternatively titled "exciting adventures in doing trigonometry without CRT".

The context

In what can be considered a recurring mistake, I decide to make a tiny DLL. Again!

"Surely" - I think to myself - "I'm not doing anything weird this time around - no complex containers, no CRT-specific functions. Just a few WinAPI calls and a WndProc callback. Nothing will go wrong this time".

And it mostly doesn't. Up until the point where someone suggests that it'd be cool to have a feature where a native cursor automatically rotates towards specified point (such as at a target, or away from the player) without having to update it from the game code (which has some latency).

Turns out that trig functions are also part of CRT!
Kind of makes sense when you think about it, but you don't usually think about it.

Furthermore, it's an intrinsic so it's nowhere to be found in Windows Kits source code (although Ghidra reveals that it's some 150 lines of code not too unlike Julia's libm implementation).

Continued search for compact arctangent implementations eventually leads me into the wonderful field of approximation functions - (usually) polynomial functions that don't look like they make any sense but provide almost as good of an output as The Real Thing for the specified range.

Among the results, there was this blog post citing "Efficient approximations for the arctangent function", Rajan, S. Sichun Wang Inkol, R. Joyal, A., May 2006 and "Streamlining digital signal processing: a tricks of the trade guidebook". A rather fortunate omission of algorithms of interest being available in the Google Books preview had apparently been corrected in ten years time, but I have been able to find the first paper without paying anyone an outrageous sum of money.

The Efficient approximations paper lists 7 functions total, with error ranging from 0.0015 rad. (0.085 deg.) to the almost amusingly imprecise 0.07 rad. (4 deg.).

The author of the post picks the most precise function out of the bunch,

arctan(x) ≈ π/4x − x(|x| − 1) × (0.2447 + 0.0663|x|), −1 ≤ x ≤ 1.
which seems like a good idea - in my case however long the function takes will be greatly outweighted by the subsequent WinAPI calls anyway.

Next up is turning this arctan into arctan2.

There are specific ways that an arctan2 function should be implemented in, which also happen to be of little use here because our arctan function has a limited input range (read: ±45deg).

The paper shows an example of picking the right formula based on octant in table 2.
After thinking a little bit about what an optimal way of figuring out an octant might look like, I conclude that it is unlikely to make a substantial difference and I might as well have a readable function with a 3 nested checks (side ➜ quadrant ➜ octant).

The code

It's about what you'd expect

constexpr double M_PI   = 3.141592653589793;
constexpr double M_PI_2 = 1.5707963267948966;
constexpr double M_PI_4 = 0.7853981633974483;
double FastArcTan(double x) {
	return M_PI_4 * x - x * (fabs(x) - 1) * (0.2447 + 0.0663 * fabs(x));
double FastArcTan2(double y, double x) {
	if (x >= 0) { // -pi/2 .. pi/2
		if (y >= 0) { // 0 .. pi/2
			if (y < x) { // 0 .. pi/4
				return FastArcTan(y / x);
			} else { // pi/4 .. pi/2
				return M_PI_2 - FastArcTan(x / y);
		} else {
			if (-y < x) { // -pi/4 .. 0
				return FastArcTan(y / x);
			} else { // -pi/2 .. -pi/4
				return -M_PI_2 - FastArcTan(x / y);
	} else { // -pi..-pi/2, pi/2..pi
		if (y >= 0) { // pi/2 .. pi
			if (y < -x) { // pi*3/4 .. pi
				return FastArcTan(y / x) + M_PI;
			} else { // pi/2 .. pi*3/4
				return M_PI_2 - FastArcTan(x / y);
		} else { // -pi .. -pi/2
			if (-y < -x) { // -pi .. -pi*3/4
				return FastArcTan(y / x) - M_PI;
			} else { // -pi*3/4 .. -pi/2
				return -M_PI_2 - FastArcTan(x / y);

A real arctan2 should also handle infinities and NaNs (see the aforementioned libm implementation), but my inputs are technically 16-bit WinAPI coordinates so I couldn't be bothered.

Testing it with the following on MSVC,

inline int64_t getTimer() {
	return std::chrono::duration_cast<std::chrono::microseconds>(

int main(int argc, char** argv) {
	struct { double x, y; } static inputs[36000];
	struct { double real, fake; int diff; } static outputs[std::size(inputs)];
	for (int i = 0; i < std::size(inputs); i++) {
		double angle = (double)i / std::size(inputs) * M_PI * 2;
		double len = 4;
		inputs[i].x = cos(angle) * len;
		inputs[i].y = sin(angle) * len;
	auto t = getTimer();
	for (int i = 0; i < std::size(inputs); i++) outputs[i].real = atan2(inputs[i].y, inputs[i].x);
	t = getTimer() - t;
	double sum = 0; for (int i = 0; i < std::size(inputs); i++) sum += outputs[i].real;
	printf("t=%lld sum~%.3f\n", t, sum);

	t = getTimer();
	for (int i = 0; i < std::size(inputs); i++) outputs[i].fake = FastArcTan2(inputs[i].y, inputs[i].x);
	t = getTimer() - t;
	sum = 0;
	for (int i = 0; i < std::size(inputs); i++) {
		sum += outputs[i].fake;
		outputs[i].diff = (int)((outputs[i].fake - outputs[i].real) * 10000);
	printf("t=%lld sum~%.3f\n", t, sum);

I get ~2150µs with "real" atan2 and ~315µs with FastArcTan2, which is pretty good - about 6 times faster?


As per above:

  • Maximum error of 0.0015 rad. / 0.085 deg.
  • Doesn't handle infinities/NaNs.

That's all!

Related posts:

One thought on “Fast approximate arctan2/atan2 function

  1. _declspec(naked) float _vectorcall arctg2(float y, float x)
      static const float ct[14] =       
        vmulss xmm2,xmm0,xmm0           
        mov edx,offset ct               
        vfmadd231ss xmm2,xmm1,xmm1      
        vmovmskps ecx,xmm1              
        vsqrtss xmm2,xmm2,xmm2          
        jecxz arctg_x_sgn               
        vsubss xmm2,xmm2,xmm1           
        vdivss xmm0,xmm2,xmm0           
        jmp arctg_x_done                
        vaddss xmm2,xmm2,xmm1           
        vdivss xmm0,xmm0,xmm2           
        vshufps xmm1,xmm0,xmm0,0        
        vmovups xmm3,[edx+16]           
        vmulps xmm2,xmm1,xmm1           
        vmulps xmm4,xmm2,xmm2           
        vucomiss xmm2,[edx+40]          
        ja arctg_big                    
        vfmadd231ps xmm3,xmm2,[edx]     
        vmovhlps xmm1,xmm1,xmm3         
        vfmadd231ps xmm3,xmm4,xmm1      1
        vmovshdup xmm2,xmm3             
        vdivss xmm2,xmm2,xmm3           
        vmulss xmm0,xmm0,xmm2           
        vfmadd213ps xmm3,xmm2,[edx]     
        vmovmskpd eax,xmm1              
        vmovhlps xmm0,xmm0,xmm3         
        vfmadd213ps xmm3,xmm4,xmm0      
        vmovss xmm0,[edx+4*eax+32]      
        vucomiss xmm2,[edx+52]          
        jnb arctg_end                   
        vmovshdup xmm4,xmm3             
        vmulss xmm1,xmm1,xmm3           
        vfmsub132ss xmm0,xmm4,xmm1      
        vdivss xmm0,xmm0,xmm1           
        vaddss xmm0,xmm0,[edx+4*eax+36] 

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.