Need optimization advice for querying 2 million IP ranges at high throughput (100s/sec)

Hey Guys!

I’m struggling with performance on IP range lookups and could really use some expert advice.

Core Problem:

  • Database contains ~2 million IP ranges across different entries
  • Need to check which entry owns specific IPs
  • Requirement: Process hundreds of IP lookups per second
  • Current performance: Several seconds per lookup (way too slow)

Database Structure:

{
    "company_name": "Example Corp",
    "entries": [
        {
            "ranges": [
                {
                    // Tried both approaches:
                    // 1. Integer ranges:
                    "ip_start": NumberLong("167772160"),
                    "ip_end": NumberLong("184549375"),
                    // 2. Binary format:
                    "ip_start_bin": Binary("..."),
                    "ip_end_bin": Binary("...")
                }
                // Multiple ranges per entry
            ]
        }
        // Multiple entries possible
    ]
}

Current Implementation:

import ipaddress
import struct

def ip_to_number(ip: str) -> int:
    """Convert an IPv4 address to an integer for range comparison."""
    return int(ipaddress.IPv4Address(ip))

# Also tried the binary format (same value packed as 4 big-endian bytes):
def ip_to_binary(ip: str) -> bytes:
    """Convert an IPv4 address to 4 big-endian bytes."""
    return struct.pack('>I', int(ipaddress.IPv4Address(ip)))

def find_company_for_ip(ip):
    ip_num = ip_to_number(ip)  # or ip_to_binary(ip) for the binary schema

    query = {
        "entries.ranges": {
            "$elemMatch": {
                "ip_start": {"$lte": ip_num},
                "ip_end": {"$gte": ip_num}
            }
        }
    }

    # ip_range_idx is the index on the range bounds
    return collection.find(query, {"company_name": 1}).hint("ip_range_idx")

# CIDR approach (faster, but the regex only prefix-matches the first three
# octets, so I'm unsure it is accurate for ranges that aren't /24-aligned):
def find_company_for_ip_cidr(ip):
    ip_obj = ipaddress.ip_address(ip)
    query = {
        "entries.ranges.cidr": {
            "$regex": f"^{ip_obj.exploded.rsplit('.', 1)[0]}"
        }
    }
    return collection.find(query, {"company_name": 1})

What I’ve Tried:

  1. Data Storage:
  • Stored IP ranges as integers converted from IPs (problematic for IPv6 anyway)
  • Stored them in binary format
  • Converted ranges to CIDR notation (faster queries, but uncertain about accuracy for all range types)
  • Indexed both formats
  2. Indexing (see the index sketch after this list):
  • Compound index on ip_start and ip_end
  • Index on the CIDR field
  • Various index types and combinations
  3. Query Optimization:
  • Batch processing
  • Parallel processing
  • Different query structures
  • Aggregation pipeline approaches
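
For reference, this is roughly how the compound index is defined; the index name matches the hint in the code above, but the exact field paths are my assumption based on the schema shown earlier:

# Sketch: compound index on the embedded range bounds
# ("ip_range_idx" matches the hint used above; paths assumed from the schema).
collection.create_index(
    [("entries.ranges.ip_start", 1), ("entries.ranges.ip_end", 1)],
    name="ip_range_idx"
)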

Key Challenges:

  1. The sheer volume of IP ranges (2 million) makes range queries very slow (2s per request)
  2. Need sub-second response time for lookups
  3. Each lookup potentially needs to scan a large number of ranges

Questions:

  1. Would a different data structure/schema work better for IP range lookups? (A rough sketch of the kind of thing I mean is below this list.)
  2. Should we consider a specialized IP/CIDR database solution instead?
  3. Any other ideas how to make this requirement possible?
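
To make question 1 concrete, here is the kind of alternative schema I have in mind: a minimal sketch that assumes each range is flattened into its own document (e.g. with company_name, ip_start, ip_end fields), that ranges do not overlap, and that there is an ascending index on ip_start:

import ipaddress

def find_company_for_ip_flat(ranges_collection, ip: str):
    """Look up the owning company in a hypothetical flattened ranges collection."""
    ip_num = int(ipaddress.IPv4Address(ip))
    # Grab the range that starts closest below (or at) the IP via the ip_start index...
    doc = ranges_collection.find_one(
        {"ip_start": {"$lte": ip_num}},
        {"company_name": 1, "ip_end": 1},
        sort=[("ip_start", -1)],
    )
    # ...then confirm the IP actually falls inside that range.
    if doc and doc["ip_end"] >= ip_num:
        return doc["company_name"]
    return None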

Technical Details:

  • MongoDB version: 6.0
  • Python driver: PyMongo 4.5.0
  • Dev server with 8 cores and 32 GB of RAM
  • Total IP ranges: ~2 million
  • Required throughput: 100+ lookups/second
  • Current performance: ~2-5 seconds per lookup

Any insights, suggestions, or alternative approaches would be greatly appreciated.
Happy to provide more details if needed.
Thank you very much!

My suggestion is to look at the explain query plan and determine whether it is properly using an efficient index. See the PyMongo 4.10.1 cursor documentation (Tools for iterating over MongoDB query results) and https://www.mongodb.com/docs/manual/reference/command/explain/#output .
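
For example, assuming the collection and query from the snippets above, the plan can be pulled directly from PyMongo like this (a rough sketch, not a definitive recipe):

import ipaddress

ip_num = int(ipaddress.IPv4Address("10.0.0.1"))  # example IP

# Run explain() on the same lookup to see which index (if any) is chosen
# and how many keys/documents are examined.
plan = collection.find(
    {
        "entries.ranges": {
            "$elemMatch": {
                "ip_start": {"$lte": ip_num},
                "ip_end": {"$gte": ip_num},
            }
        }
    },
    {"company_name": 1},
).explain()

print(plan["queryPlanner"]["winningPlan"])
# executionStats is present when the explain verbosity includes execution statistics
print(plan.get("executionStats", {}).get("totalDocsExamined"))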