Need optimization advice for querying 2 million IP ranges at high throughput (100s/sec)

Hey Guys!

I’m struggling with performance on IP range lookups and could really use some expert advice.

Core Problem:

  • Database contains ~2 million IP ranges across different entries
  • Need to check which entry owns specific IPs
  • Requirement: Process hundreds of IP lookups per second
  • Current performance: Several seconds per lookup (way too slow)

Database Structure:

{
    "company_name": "Example Corp",
    "entries": [
        {
            "ranges": [
                {
                    // Tried both approaches:
                    // 1. Integer ranges:
                    "ip_start": NumberLong("167772160"),
                    "ip_end": NumberLong("184549375"),
                    // 2. Binary format:
                    "ip_start_bin": Binary("..."),
                    "ip_end_bin": Binary("...")
                }
                // Multiple ranges per entry
            ]
        }
        // Multiple entries possible
    ]
}

Current Implementation:

import ipaddress
import struct

def ip_to_number(ip: str) -> int:
    """Convert an IPv4 address to an integer for range comparison."""
    return int(ipaddress.IPv4Address(ip))

# Also tried the binary format (same value packed as 4 big-endian bytes):
def ip_to_binary(ip: str) -> bytes:
    """Convert an IPv4 address to 4 big-endian bytes."""
    return struct.pack('>I', int(ipaddress.IPv4Address(ip)))

def find_company_for_ip(ip):
    ip_num = ip_to_number(ip)  # or ip_to_binary(ip) for the binary schema

    query = {
        "entries.ranges": {
            "$elemMatch": {
                "ip_start": {"$lte": ip_num},
                "ip_end": {"$gte": ip_num}
            }
        }
    }

    # ip_range_idx is the index on the range bounds
    return collection.find(query, {"company_name": 1}).hint("ip_range_idx")

# CIDR approach (faster, but the regex only prefix-matches the first three
# octets, so I'm unsure it is accurate for ranges that aren't /24-aligned):
def find_company_for_ip_cidr(ip):
    ip_obj = ipaddress.ip_address(ip)
    query = {
        "entries.ranges.cidr": {
            "$regex": f"^{ip_obj.exploded.rsplit('.', 1)[0]}"
        }
    }
    return collection.find(query, {"company_name": 1})

What I’ve Tried:

  1. Data Storage:
  • Stored IP ranges as integers converted from IPs (problematic for IPv6 anyway)
  • Stored them in binary format
  • Converted ranges to CIDR notation (faster queries, but uncertain about accuracy for all range types)
  • Indexed both formats
  2. Indexing (see the index sketch after this list):
  • Compound index on ip_start and ip_end
  • Index on the CIDR field
  • Various index types and combinations
  3. Query Optimization:
  • Batch processing
  • Parallel processing
  • Different query structures
  • Aggregation pipeline approaches
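
For reference, this is roughly how the compound index is defined; the index name matches the hint in the code above, but the exact field paths are my assumption based on the schema shown earlier:

# Sketch: compound index on the embedded range bounds
# ("ip_range_idx" matches the hint used above; paths assumed from the schema).
collection.create_index(
    [("entries.ranges.ip_start", 1), ("entries.ranges.ip_end", 1)],
    name="ip_range_idx"
)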

Key Challenges:

  1. The sheer volume of IP ranges (2 million) makes range queries very slow (2s per request)
  2. Need sub-second response time for lookups
  3. Each lookup potentially needs to scan a large number of ranges

Questions:

  1. Would a different data structure/schema work better for IP range lookups? (A rough sketch of the kind of thing I mean is below this list.)
  2. Should we consider a specialized IP/CIDR database solution instead?
  3. Any other ideas how to make this requirement possible?
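
To make question 1 concrete, here is the kind of alternative schema I have in mind: a minimal sketch that assumes each range is flattened into its own document (e.g. with company_name, ip_start, ip_end fields), that ranges do not overlap, and that there is an ascending index on ip_start:

import ipaddress

def find_company_for_ip_flat(ranges_collection, ip: str):
    """Look up the owning company in a hypothetical flattened ranges collection."""
    ip_num = int(ipaddress.IPv4Address(ip))
    # Grab the range that starts closest below (or at) the IP via the ip_start index...
    doc = ranges_collection.find_one(
        {"ip_start": {"$lte": ip_num}},
        {"company_name": 1, "ip_end": 1},
        sort=[("ip_start", -1)],
    )
    # ...then confirm the IP actually falls inside that range.
    if doc and doc["ip_end"] >= ip_num:
        return doc["company_name"]
    return None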

Technical Details:

  • MongoDB version: 6.0
  • Python driver: PyMongo 4.5.0
  • Dev server with 8 cores and 32 GB of RAM
  • Total IP ranges: ~2 million
  • Required throughput: 100+ lookups/second
  • Current performance: ~2-5 seconds per lookup

Any insights, suggestions, or alternative approaches would be greatly appreciated.
Happy to provide more details if needed.
Thank you very much!

My suggestion is to look at the explain query plan and determine whether it is properly using an efficient index. See the PyMongo 4.10.1 cursor documentation (Tools for iterating over MongoDB query results) and https://www.mongodb.com/docs/manual/reference/command/explain/#output .
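
For example, assuming the collection and query from the snippets above, the plan can be pulled directly from PyMongo like this (a rough sketch, not a definitive recipe):

import ipaddress

ip_num = int(ipaddress.IPv4Address("10.0.0.1"))  # example IP

# Run explain() on the same lookup to see which index (if any) is chosen
# and how many keys/documents are examined.
plan = collection.find(
    {
        "entries.ranges": {
            "$elemMatch": {
                "ip_start": {"$lte": ip_num},
                "ip_end": {"$gte": ip_num},
            }
        }
    },
    {"company_name": 1},
).explain()

print(plan["queryPlanner"]["winningPlan"])
# executionStats is present when the explain verbosity includes execution statistics
print(plan.get("executionStats", {}).get("totalDocsExamined"))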