(click to interact)

Alternatively titled "exciting adventures in doing trigonometry without CRT".## The context

In what can be considered a recurring mistake, I decide to make a tiny DLL. Again!

"Surely" - I think to myself - "I'm not doing anything weird this time around - no complex containers, no CRT-specific functions. Just a few WinAPI calls and a WndProc callback. Nothing will go wrong this time".

And it mostly doesn't. Up until the point where someone suggests that it'd be cool to have a feature where a native cursor automatically rotates towards specified point (such as at a target, or away from the player) without having to update it from the game code (which has some latency).

Turns out that trig functions are also part of CRT!

Kind of makes sense when you think about it, but you don't usually think about it.

Furthermore, it's an intrinsic so it's nowhere to be found in Windows Kits source code (although Ghidra reveals that it's some 150 lines of code not too unlike Julia's libm implementation).

Continued search for compact arctangent implementations eventually leads me into the wonderful field of approximation functions - (usually) polynomial functions that don't look like they make any sense but provide almost as good of an output as The Real Thing for the specified range.

Among the results, there was this blog post citing
*"Efficient approximations for the arctangent function", Rajan, S. Sichun Wang Inkol, R. Joyal, A., May 2006* and *"Streamlining digital signal processing: a tricks of the trade guidebook"*. A rather fortunate omission of algorithms of interest being available in the Google Books preview had apparently been corrected in ten years time, but I have been able to find the first paper without paying anyone an outrageous sum of money.

The Efficient approximations paper lists 7 functions total, with error ranging from 0.0015 rad. (0.085 deg.) to the almost amusingly imprecise 0.07 rad. (4 deg.).

The author of the post picks the most precise function out of the bunch,

arctan(x) ≈ π/4x − x(|x| − 1) × (0.2447 + 0.0663|x|), −1 ≤ x ≤ 1.which seems like a good idea - in my case however long the function takes will be greatly outweighted by the subsequent WinAPI calls anyway.

Next up is turning this arctan into arctan2.

There are specific ways that an arctan2 function should be implemented in, which also happen to be of little use here because our arctan function has a limited input range (read: ±45deg).

The paper shows an example of picking the right formula based on octant in table 2.

After thinking a little bit about what an optimal way of figuring out an octant might look like,
I conclude that it is unlikely to make a substantial difference and I might as well have
a readable function with a 3 nested checks (side ➜ quadrant ➜ octant).

## The code

It's about what you'd expect

constexpr double M_PI = 3.141592653589793; constexpr double M_PI_2 = 1.5707963267948966; constexpr double M_PI_4 = 0.7853981633974483; double FastArcTan(double x) { return M_PI_4 * x - x * (fabs(x) - 1) * (0.2447 + 0.0663 * fabs(x)); } double FastArcTan2(double y, double x) { if (x >= 0) { // -pi/2 .. pi/2 if (y >= 0) { // 0 .. pi/2 if (y < x) { // 0 .. pi/4 return FastArcTan(y / x); } else { // pi/4 .. pi/2 return M_PI_2 - FastArcTan(x / y); } } else { if (-y < x) { // -pi/4 .. 0 return FastArcTan(y / x); } else { // -pi/2 .. -pi/4 return -M_PI_2 - FastArcTan(x / y); } } } else { // -pi..-pi/2, pi/2..pi if (y >= 0) { // pi/2 .. pi if (y < -x) { // pi*3/4 .. pi return FastArcTan(y / x) + M_PI; } else { // pi/2 .. pi*3/4 return M_PI_2 - FastArcTan(x / y); } } else { // -pi .. -pi/2 if (-y < -x) { // -pi .. -pi*3/4 return FastArcTan(y / x) - M_PI; } else { // -pi*3/4 .. -pi/2 return -M_PI_2 - FastArcTan(x / y); } } } }

A *real* arctan2 should also handle infinities and NaNs
(see the aforementioned libm implementation), but my inputs are technically
16-bit WinAPI coordinates so I couldn't be bothered.

Testing it with the following on MSVC,

inline int64_t getTimer() { return std::chrono::duration_cast<std::chrono::microseconds>( std::chrono::system_clock::now().time_since_epoch() ).count(); } int main(int argc, char** argv) { struct { double x, y; } static inputs[36000]; struct { double real, fake; int diff; } static outputs[std::size(inputs)]; for (int i = 0; i < std::size(inputs); i++) { double angle = (double)i / std::size(inputs) * M_PI * 2; double len = 4; inputs[i].x = cos(angle) * len; inputs[i].y = sin(angle) * len; } auto t = getTimer(); for (int i = 0; i < std::size(inputs); i++) outputs[i].real = atan2(inputs[i].y, inputs[i].x); t = getTimer() - t; double sum = 0; for (int i = 0; i < std::size(inputs); i++) sum += outputs[i].real; printf("t=%lld sum~%.3f\n", t, sum); t = getTimer(); for (int i = 0; i < std::size(inputs); i++) outputs[i].fake = FastArcTan2(inputs[i].y, inputs[i].x); t = getTimer() - t; sum = 0; for (int i = 0; i < std::size(inputs); i++) { sum += outputs[i].fake; outputs[i].diff = (int)((outputs[i].fake - outputs[i].real) * 10000); } printf("t=%lld sum~%.3f\n", t, sum); }

I get ~2150µs with "real" `atan2`

and ~315µs with `FastArcTan2`

, which is pretty good - about 6 times faster?

## Caveats

As per above:

- Maximum error of 0.0015 rad. / 0.085 deg.
- Doesn't handle infinities/NaNs.

That's all!