Deep Dive into Python: Normalizing Location Data for Time Comparison
In the past week, I've begun working with a team to create Germ Theory, a collaborative effort with Dr. Eric Ding at Harvard to map disease outbreaks and determine who may be infected. As part of this effort, I am working directly with Jose Merino to assign a disease index for every user on the application.
Essentially, the app will collect massive amounts of location data from patients with a given disease and people who use the app. To determine someone's likelihood of infection, we determine whether they have come close to someone with a given disease in a given period of time. We then index all this data, so the user will receive an index from 0-1 based on their likelihood of infection. Public health officials will also have the ability to view high-risk citizens in their area and possibly intervene and get them help.
Because we will be processing massive amounts of data, we elected to implement this feature as a python service, and we're setting up a distributing computing framework for it using Apache Spark. Ultimately, the most naive implementation of our algorithm will take two sets of data: locations of a non-infected user and locations of an infected user and output an index between 0 and 1.
However, this data is often inconsistent. We may have data for an infected user at 4:02 PM 4:07 PM on a given day but only have data for a non-infected user at 4:00 PM and 4:10 PM. this data might look something like this:
# data comes in a list of tuples of (id, latitude, longitude, time in unix time) infectedUser = [(1, 37.755447716467756, -122.41640567779541, 1415721720), (1, 37.75237696810695, -122.41748929023743, 1415722020)] comparedUser = [(2, 37.75554102450563, -122.41775751113892, 1415721600), (2, 37.753852979082346, -122.4162769317627, 1415722200)]
On a map, this data looks like this(green is the infected User and Blue is the non-infected user):
Unfortunately, to determine if these two users came into contact, we need to check their positions at the same datetime. To do so, I designed an algorithm that, given a massive data set, will normalize our data so we can compare two users at similar times. Essentially, when we pull data from our postgres database, it will immediately be sorted into long lists of tuples as described above. To "normalize" so we have comparable position vs. time data, I built a function called normalize, as follows:
# Function takes a variety of different timed data and returns data points # userData: a set of position data for a specific user(sorted by time) # numOfPositions: the number of iterations we want to compare # timePeriod: amount of time to analyze, in seconds def normalize(userData, numOfIterations, startTime, endTime): data =  timePeriod = endTime - startTime times =  segment = timePeriod / numOfIterations current = startTime while current <= timePeriod: times.append(current) current += segment for time in times: surroundingIndices = approxBinarySearch(userData, time, 0) # returns a list containing the indices of the surrounding two times, returns one index if there is a match if len(surroundingIndices) == 1: # no interpolation necessary data.append(userData[surroundingIndices]) elif len(surroundingIndices) == 2: newLatLong = interpolate( userData[surroundingIndices], userData[surroundingIndices], time) # create a tuple of the interpolated latitude and longitude newTuple = (userData, newLatLong, newLatLong, time) data.append(newTuple) else: break # break if we are unable to access data for the given time return data
As described in the function above, our normalize function takes a start and an end time, a data set, and a number of iterations(a number of time period for which we want to check where our users are).
There are also two other functions called here, that I defined elsewhere: interpolate and approxBinarySearch. Approximate Binary Search is an interesting function I wrote to comb through the array of tuples and find the exact two time periods which surround a desired time. For example, if we want to find where someone was at 4:05 and we know where they were at 4:00 and 4:10, it will return a tuple of the indexes of 4:00 and 4:10 from the array of locations. The full code for this function is as follows:
# Function will search an array of times for an interval between which our times match # All Times stored in unix time def approxBinarySearch(tupleArray, targetTime, lo, hi = None): if hi is None: hi = len(tupleArray) if targetTime > tupleArray[len(tupleArray ) - 1] or targetTime < tupleArray: return () while lo < hi: mid = (lo + hi)//2 midval = tupleArray[mid] if mid + 1 != len(tupleArray): nextval = tupleArray[mid + 1] else: nextval = None if mid != 0: prevval = tupleArray[mid - 1] else: prevval = None # does midval match our target? if targetTime == midval: return [mid] # test if the target fits between previous and current elif prevval < targetTime < midval: return [mid - 1, mid] # test if the target fits between current and next elif midval < targetTime < nextval: return [mid, mid + 1] elif midval < targetTime: lo = mid + 1 elif midval > targetTime: hi = mid else: return [mid] return ()
As you can see, the function works like a typical binary search, except that it searches neighboring points in the array to test for a match. It was essential to implement this using binary search rather than a less efficient sorting algorithm as we're trying to maximize the efficiency of our overall algorithm. Once we've found the two points that surround the point in time we're looking for, we need to find where the user was at the desired time. To do so, we take the two surrounding values that we just found, and we interpolate the value between them.
# this function finds the actual latitude and longitude coordinates between two points ie 0--x--------0 def interpolate(coords1, coords2, targetTime): # coords 1/2 looks something like this: (userid, latitude, longitude, time) # we calculate the solution as follows: # 1. Calculate alpha -> (target time - coords1 time) / (coords2 time - coords1 time) # 2. solve for latitude: coords1 lat + alpha(coords2 lat - coords1 lat) return (coords1 + (((targetTime - coords1) / (coords2 - coords1)) * (coords2 - coords1)), coords1 + (((targetTime - coords1) / (coords2 - coords1)) * (coords2 - coords1)) )
Although the code above is a bit illegible, it essentially executes the following function:
We execute this function twice. Once for latitude and once for longitude. The p variable stands for position and the t variable stands for time. Essentially, we find the difference in latitude and longitude based on the actual difference in times. For example, if we wanted to get a location at 4:05, we could cross-reference a location at 4:02 and 4:07, and that value would be .6, which is used to find a relatively precise spot that a user was at a given time.
While the answer may seem straightforward now, it was a pretty tough thing to put together. That said, I loved the challenge of working in python, a language I was completely unfamiliar with, and I'm looking forward to integrating our work with Apache Spark soon! If you have any questions about the implementation, share them in the comments below!