Hey Guys!
I’m struggling with performance on IP range lookups and could really use some expert advice.
Core Problem:
- Database contains ~2 million IP ranges across different entries
- Need to check which entry owns specific IPs
- Requirement: Process hundreds of IP lookups per second
- Current performance: Several seconds per lookup (way too slow)
Database Structure:
    {
      "company_name": "Example Corp",
      "entries": [
        {
          "ranges": [
            {
              // Tried both approaches:
              // 1. Integer ranges:
              "ip_start": NumberLong("167772160"),
              "ip_end": NumberLong("184549375"),
              // 2. Binary format:
              "ip_start_bin": Binary("..."),
              "ip_end_bin": Binary("...")
            }
            // Multiple ranges per entry
          ]
        }
        // Multiple entries possible
      ]
    }
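For context, the NumberLong pair above is simply the first and last address of a block as integers; the values shown correspond to 10.0.0.0/8. A minimal sketch of how such a range document is derived with the standard library (network_to_range is a hypothetical helper, field names as in the schema above):

    import ipaddress

    def network_to_range(cidr: str) -> dict:
        """Turn a CIDR block into the ip_start/ip_end integer pair."""
        net = ipaddress.ip_network(cidr)
        return {
            "ip_start": int(net.network_address),
            "ip_end": int(net.broadcast_address),
        }

    print(network_to_range("10.0.0.0/8"))
    # {'ip_start': 167772160, 'ip_end': 184549375}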
Current Implementation:
    import ipaddress
    import re
    import struct

    def ip_to_number(ip: str) -> int:
        """Convert an IP address to an integer for range comparison."""
        return int(ipaddress.IPv4Address(ip))

    # Also tried binary format:
    def ip_to_binary(ip: str) -> bytes:
        """Convert an IP address to 4 big-endian bytes."""
        return struct.pack('>I', int(ipaddress.IPv4Address(ip)))

    def find_company_for_ip(ip):
        ip_num = ip_to_number(ip)  # or ip_to_binary(ip)
        query = {
            "entries.ranges": {
                "$elemMatch": {
                    "ip_start": {"$lte": ip_num},
                    "ip_end": {"$gte": ip_num}
                }
            }
        }
        # ip_range_idx is the compound index on ip_start/ip_end
        return collection.find(query, {"company_name": 1}).hint("ip_range_idx")
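A single lookup is then issued roughly like this (simplified; collection is assumed to be a PyMongo Collection handle):

    # Hypothetical usage of the function above
    for doc in find_company_for_ip("10.1.2.3"):
        print(doc["company_name"])  # e.g. "Example Corp"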
    # CIDR approach (faster but accuracy concerns):
    def find_company_for_ip_cidr(ip):
        ip_obj = ipaddress.ip_address(ip)
        # Escape the dots, otherwise they match any character in the regex
        prefix = re.escape(ip_obj.exploded.rsplit('.', 1)[0])
        query = {
            "entries.ranges.cidr": {
                "$regex": f"^{prefix}"
            }
        }
        return collection.find(query, {"company_name": 1})
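On the accuracy concern: an arbitrary start-end range does not always map to a single CIDR block, but any range can be decomposed into an exact list of blocks with the standard library, which would make a CIDR-based field lossless. A minimal sketch (range_to_cidrs is a hypothetical helper, not part of the current code):

    import ipaddress

    def range_to_cidrs(start: str, end: str) -> list[str]:
        """Decompose an arbitrary IP range into an exact set of CIDR blocks."""
        first = ipaddress.ip_address(start)
        last = ipaddress.ip_address(end)
        return [str(net) for net in ipaddress.summarize_address_range(first, last)]

    print(range_to_cidrs("10.0.0.0", "10.0.0.130"))
    # ['10.0.0.0/25', '10.0.0.128/31', '10.0.0.130/32']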
What I’ve Tried:
- Data Storage:
  - Stored IP ranges as integers converted from the address (doesn't fit in a 64-bit NumberLong for IPv6 anyway)
  - Stored them in binary format
  - Converted ranges to CIDR notation (faster queries, but unsure the conversion is accurate for all range types; see the decomposition sketch above)
  - Indexed both formats
- Indexing:
  - Compound index on ip_start and ip_end
  - Index on the CIDR field
  - Various index types and combinations
- Query Optimization:
  - Batch processing (sketched after this list)
  - Parallel processing
  - Different query structures
  - Aggregation pipeline approaches
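By batch processing I mean grouping several addresses into a single round trip; a simplified sketch of that idea, reusing the query shape from find_company_for_ip (an illustration, not the exact code I ran):

    def find_companies_for_ips(ips):
        """Sketch: one $or query per chunk of addresses. Results still have
        to be matched back to individual IPs client-side, since $or only
        says that *some* clause matched."""
        clauses = [{
            "entries.ranges": {
                "$elemMatch": {
                    "ip_start": {"$lte": ip_to_number(ip)},
                    "ip_end": {"$gte": ip_to_number(ip)}
                }
            }
        } for ip in ips]
        return collection.find({"$or": clauses}, {"company_name": 1})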
Key Challenges:
- The sheer volume of IP ranges (~2 million) makes range queries very slow (2-5 s per request)
- Need sub-second response times: at 2 s per lookup, even perfect parallelism across 8 cores tops out at ~4 lookups/second, far short of the 100+/s target
- Each lookup potentially scans a large share of the ranges, since a compound index can only bound one side of the ip_start <= x AND ip_end >= x condition
Questions:
- Would a different data structure/schema work better for IP range lookups?
- Should we consider a specialized IP/CIDR database solution instead?
- Any other ideas for making this throughput target achievable?
Technical Details:
- MongoDB version: 6.0
- Python driver: PyMongo 4.5.0
- Dev server with 8 cores and 32 GB of RAM
- Total IP ranges: ~2 million
- Required throughput: 100+ lookups/second
- Current performance: ~2-5 seconds per lookup
Any insights, suggestions, or alternative approaches would be greatly appreciated.
Happy to provide more details if needed.
Thank you very much!