How to Find User Geo Location and Perform Large Scale Data Analysis in Data Warehouse

With deluge of data from users in weblogs companies make use of them for multiple purpose including user behavior like how long users spent time on which pages, which are more popular pages, user page traversal, etc. Typical weblog has information (but not any specific user identifiable info) like the time of the day, ip address of user, the page served, status of page served, etc.

One way to make use of IP address is to map it to a location like city, state or country. One can get/purchase products that provide list of IP blocks mapped to locations. One for example is MaxMind. The data is reasonably static but you can get the latest data once every month or quarter and use it. So, now you have users’ IP addresses in one hand and mapping of IP blocks to city/country in another and finding the user location may seem straight forward which it is if you are doing it for few IP addrs. But it can be a big performance issue if you are doing analysis on tens or hundreds of thousands of IPs on regular basis (using MySQL database).

MySQL 4.x or 5.x, do not use index while performing range operations between two sets of data. But GeoCity data maps block of an IP address to a city and for a good reason. This will force you to use range operations like (>=) greater than or equal and (

Note: IPv4 (IP address of 4 octets like with 32 bits unlike IPv6 which are 128 bits long) can have 2^32 (2 to power of 32) addresses, nearly 4 billion. Furthermore, IP authorities allocate group of IP addresses to ISPs or particular country and there is no reason to store all separate IP addresses.