This question has already been answered in the Power BI forums here, as well as in a good article by radacad, here, but I'm someone who loves to optimize things. If you're working with a large fact table (holding the first set of GPS coordinates) and a large lookup table (holding the second set of GPS coordinates), the calculation can take a long time. Even on my small data set of 80K records and a lookup table of 1.4K rows, the calculation already takes up to 24 seconds for a calculated column (not a measure). Minimum distance calculations are simply not trivial.
Note: Even though I'm working with a calculated column, the same optimization techniques can be applied to a measure.
So what are we working with? I have a fact table, UFO Sightings, and a lookup table, Airports, each containing latitude and longitude coordinates. The goal is to determine whether UFO sightings mostly occur closer to airports (hint hint, they do).
Notice that there is no relationship between UFO Sightings and Airports. This is intentional for the purposes of this example.
Here is what we start out with (Calculated Column):
Closest Airport Distance (km) =
IF ( ISBLANK ( 'UFO Sightings'[city_latitude] ), BLANK(),
VAR _RowLat = 'UFO Sightings'[city_latitude]
VAR _RowLong = 'UFO Sightings'[city_longitude]
// Find the airport with the closest distance
VAR _ClosestAirportDistance =
MINX(
ALL ( Airports ),
VAR _SampleLat = 'Airports'[Latitude]
VAR _SampleLong = 'Airports'[Longitude]
VAR _P = DIVIDE( PI(), 180 )
VAR _A =
0.5 -
COS( ( _SampleLat - _RowLat ) * _P ) / 2 +
COS( _RowLat * _P ) * COS( _SampleLat * _P ) *
( 1 - COS( ( _SampleLong - _RowLong ) * _P ) ) / 2
VAR _Result = 12742 * ASIN( ( SQRT( _A ) ) )
RETURN
_Result
)
RETURN
_ClosestAirportDistance
)
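For reference, the expression inside MINX is the haversine formula. With latitudes \(\varphi\) and longitudes \(\lambda\) converted to radians (that's the _P = PI()/180 factor), it computes

\[
a = \frac{1}{2} - \frac{\cos(\varphi_2 - \varphi_1)}{2} + \cos\varphi_1 \cos\varphi_2 \cdot \frac{1 - \cos(\lambda_2 - \lambda_1)}{2},
\qquad
d = 2R \arcsin\left( \sqrt{a} \right)
\]

where the hard-coded 12742 is 2R, i.e. twice the Earth's mean radius of 6371 km.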
Although this creates a simple query plan with just two queries, all the work is done by the formula engine (100% of it!). This is because the storage engine cannot compute complex calculations (such as ACOS, SIN, COS, etc.).
1. Remove columns from the lookup table
In my case, the Airports lookup table contains columns that are not needed for the distance calculation (airport name, airport city, etc.). The only columns needed are latitude and longitude, so instead of passing in the entire table, we can use SELECTCOLUMNS to strip out the unneeded columns. The optimizer may already do this for us, but it's good practice to do it ourselves anyway.
VAR _AirportLongLat =
SELECTCOLUMNS(
'Airports',
'Airports'[Latitude],
'Airports'[Longitude]
)
In this instance, it appears the optimizer already knows that our calculation relies solely on the latitude and longitude columns, so our SELECTCOLUMNS isn't really needed and doesn't improve performance. We can confirm this in the second xmSQL query:
2. Reduce cardinality of coordinates
The speed of any query depends greatly on the cardinality of the column(s) you are filtering over. Our case is slightly different because we are iterating over the values (not filtering), but it may still be worth exploring this option.
For my purposes, I'm interested in large distances (greater than 1 km), so I don't need a precision of 6 or more decimal places (which pins a location down to roughly 10 cm). Although a good idea, reducing the precision to 3 decimal places (still about 100 m of positional accuracy, which is plenty here) only gives me a minimal efficiency boost.
37.24315904° N, 115.79308726° W
to
37.243° N, 115.793° W
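The rounding itself is trivial; conceptually it's just a ROUND to 3 decimal places. A minimal sketch in DAX (the "(3dp)" column names are just for illustration):

Latitude (3dp) = ROUND ( 'Airports'[Latitude], 3 )
Longitude (3dp) = ROUND ( 'Airports'[Longitude], 3 )
// same idea for 'UFO Sightings'[city_latitude] and 'UFO Sightings'[city_longitude]

To actually get the compression benefit described next, the rounded values should replace the stored columns (for example by rounding in Power Query or at the source) rather than sit alongside them as calculated columns.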
Even if this technique doesn't improve the speed of the calculation much, it does reduce the size of our pbix file due to better compression. We can see this size reduction by viewing the metrics with DAX Studio, under the Advanced tab.
596KB -> 226KB = 62% size reduction!
3. Avoid checking for nulls with IF()
Some of my latitude and longitude values are missing in the fact table. It would be silly to run expensive trigonometric expressions over these missing values, which is why I check for blanks before doing the distance calculation. However, it turns out that IF and ISBLANK are expensive operations as well.
Removing this logical check improves speed by about 8-10%! The trade-off is that rows with missing coordinates are no longer explicitly blanked out.
4. Pre-Filter the lookup table
We know that the storage engine cannot perform the trigonometry math (ACOS, COS & SIN) required to calculate the distance. So it looks like our only option is to limit the number of iterations the formula engine needs to perform. My first thought is to filter the dimension table (airports in my case) as much as possible before performing the trig math.
One way to do this is to filter the table for close matches. For example, given the GPS coordinate (0, 0) and the comparison coordinates (1, 1) and (10, 10), we can easily (even visually) see that the first coordinate, (1, 1), is closer than (10, 10).
Therefore, we can narrow down the dimension table to latitude and longitude values within a specified range of the current row (within X degrees). In my case, I set this to 5 degrees (represented by the _Buffer variable):
VAR _Buffer = 5
VAR _FilteredAirports =
FILTER(
Airports,
Airports[Latitude] >= _RowLat - _Buffer &&
Airports[Latitude] <= _RowLat + _Buffer &&
Airports[Longitude] >= _RowLong - _Buffer &&
Airports[Longitude] <= _RowLong + _Buffer
)
With the MINX now iterating over _FilteredAirports instead of ALL ( Airports ), this is by far the best improvement so far. The execution time drops to just under 5 seconds: a 4x speed boost!
Combining this technique with the reduced cardinality from technique 2 saves us an additional second as well, so we end up at around 3.6 seconds.
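For clarity, here is roughly what the reworked column looks like with techniques 3 and 4 combined. Treat it as a sketch assembled from the snippets above rather than a character-for-character copy of my final code:

Closest Airport Distance (km) =
VAR _RowLat = 'UFO Sightings'[city_latitude]
VAR _RowLong = 'UFO Sightings'[city_longitude]
VAR _Buffer = 5
// Technique 4: keep only airports within +/- 5 degrees of the sighting
VAR _FilteredAirports =
    FILTER (
        Airports,
        Airports[Latitude] >= _RowLat - _Buffer &&
        Airports[Latitude] <= _RowLat + _Buffer &&
        Airports[Longitude] >= _RowLong - _Buffer &&
        Airports[Longitude] <= _RowLong + _Buffer
    )
// Haversine distance, iterated over the much smaller pre-filtered table
VAR _ClosestAirportDistance =
    MINX (
        _FilteredAirports,
        VAR _SampleLat = 'Airports'[Latitude]
        VAR _SampleLong = 'Airports'[Longitude]
        VAR _P = DIVIDE ( PI (), 180 )
        VAR _A =
            0.5 -
            COS ( ( _SampleLat - _RowLat ) * _P ) / 2 +
            COS ( _RowLat * _P ) * COS ( _SampleLat * _P ) *
            ( 1 - COS ( ( _SampleLong - _RowLong ) * _P ) ) / 2
        RETURN
            12742 * ASIN ( SQRT ( _A ) )
    )
RETURN
    _ClosestAirportDistance

One edge case to keep in mind: if no airport falls within the buffer, MINX returns a blank instead of a distance, so a larger buffer (or a fallback) may be needed if some sightings are far from any airport.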
Using the Pythagorean theorem
Using the Pythagorean theorem (a^2 + b^2 = c^2) will not give accurate distance results. However, if your data is focused on a single part of the globe, say a single country, you can use it to pre-calculate an estimated minimum distance first. That gives you the closest coordinates (the closest airport in my case), and you can then calculate the accurate distance using the more complex formula for just that airport. The DAX code becomes a bit more verbose:
// Add an estimated distance column to the Airports table using Pythagorean Theorem
VAR _ClosestAirportDistances =
ADDCOLUMNS(
'Airports',
"@pth_distance",
VAR _SampleLat = 'Airports'[Latitude]
VAR _SampleLong = 'Airports'[Longitude]
VAR _Result =
SQRT( ( _SampleLat - _RowLat ) * ( _SampleLat - _RowLat ) +
( _SampleLong - _RowLong ) * ( _SampleLong - _RowLong )
)
RETURN
_Result
)
// Find the minimum Pyth. Th. distance
VAR _MinAirportDistanceBasedOnPyth =
MINX (
_ClosestAirportDistances,
[@pth_distance]
)
// Calculate the accurate distance on only the closest airport
VAR _TrueClosestDistance =
MINX(
FILTER(
_ClosestAirportDistances,
[@pth_distance] = _MinAirportDistanceBasedOnPyth
),
VAR _SampleLat = 'Airports'[Latitude]
VAR _SampleLong = 'Airports'[Longitude]
VAR _P = DIVIDE( PI(), 180 )
VAR _A = 0.5 -
COS( ( _SampleLat - _RowLat ) * _P ) / 2 +
COS( _RowLat * _P ) * COS( _SampleLat * _P ) *
( 1 - COS( ( _SampleLong - _RowLong ) * _P ) ) / 2
VAR _Result = 12742 * ASIN( SQRT( _A ) )
RETURN
_Result
)
This technique is about twice as fast as our original DAX code, however...
still not faster than pre-filtering the lookup table using a buffer (technique 4 above), so this technique may be best used in combination with pre-filtering. I would urge caution, though: in my case I ended up with some different distances because the Pythagorean theorem picked different closest airports. Out of 80K rows, 4.7K returned different distances, off by 3.7 km on average, with the largest difference being 56 km.
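For what it's worth, combining the two only changes the table being iterated: build the estimated-distance column over the buffered table from technique 4 instead of the whole lookup table. A minimal sketch, assuming _FilteredAirports is defined earlier in the same column:

VAR _ClosestAirportDistances =
    ADDCOLUMNS (
        _FilteredAirports,    // the pre-filtered table from technique 4
        "@pth_distance",
        VAR _SampleLat = 'Airports'[Latitude]
        VAR _SampleLong = 'Airports'[Longitude]
        RETURN
            SQRT (
                ( _SampleLat - _RowLat ) * ( _SampleLat - _RowLat ) +
                ( _SampleLong - _RowLong ) * ( _SampleLong - _RowLong )
            )
    )

The rest of the column (the MINX over @pth_distance and the accurate haversine on the single closest airport) stays the same.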
Cross-Join on the lookup table
Another option is to do a cross-join between the fact and lookup tables, performing the distance calculation on one big table instead of iterating between two tables. That being said, the query plan only has two queries, and my guess is that performance would be equivalent to, if not worse than, our original code.
And indeed, using a cross-join seems to be the wrong choice.
My code ran for more than 30 seconds, so it's not even worth exploring further. It is interesting, though, that even when the total number of iterations stays the same (80K x 1.4K), iterating over two tables instead of one big table results in better-performing DAX.
Conclusions
Whenever you need to optimize your DAX code, reducing the number of iterations is always the best thing to do. In this article, we achieved a 4x speed-up by pre-filtering the lookup table before applying any kind of calculation. This was further refined with other techniques, such as using fewer decimals for the GPS coordinates and removing logical checks.
Optimizing DAX is a fun and exciting topic. If you're interested in more optimizing techniques, check out my other article here where I talk about SQLBI's book, Optimizing DAX.
This blog was in part a continuation of my other blog post about UFO Sightings.
Cheers!