Comparing to PyMongo

In this guide, you can learn about the differences between PyMongoArrow and the PyMongo driver. This guide assumes familiarity with basic PyMongo and MongoDB concepts.

Reading Data

The most basic way to read data using PyMongo is:

import pyarrow

coll = db.benchmark  # db is an existing pymongo Database
f = list(coll.find({}, projection={"_id": 0}))  # exclude _id
table = pyarrow.Table.from_pylist(f)

This approach works, but you must exclude the _id field. Otherwise, you get the following error:

pyarrow.lib.ArrowInvalid: Could not convert ObjectId('642f2f4720d92a85355671b3') with type ObjectId: did not recognize Python value type when inferring an Arrow data type

The following code example shows a workaround for the preceding error when using PyMongo:

>>> f = list(coll.find({}))
>>> for doc in f:
...     doc["_id"] = str(doc["_id"])
...
>>> table = pyarrow.Table.from_pylist(f)
>>> print(table)
pyarrow.Table
_id: string
x: int64
y: double

Although this avoids the error, a drawback is that Arrow cannot identify that _id is an ObjectId, as the schema shows _id as a string.
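To make the cost of the string workaround concrete, here is a minimal sketch using only the standard library. It relies on the fact that the first 4 bytes of an ObjectId hold a big-endian Unix timestamp. Once _id is a plain string column, nothing in the schema says so, and the consumer has to know to decode it by hand. The helper name `objectid_hex_to_datetime` is hypothetical and is not part of PyMongo or PyMongoArrow.

```python
from datetime import datetime, timezone

# The _id value from the error message above, after str() conversion.
oid_hex = "642f2f4720d92a85355671b3"


def objectid_hex_to_datetime(hex_string: str) -> datetime:
    """Decode the creation time stored in the first 4 bytes of an ObjectId.

    Hypothetical helper for illustration only.
    """
    if len(hex_string) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    seconds = int(hex_string[:8], 16)  # big-endian Unix timestamp
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = objectid_hex_to_datetime(oid_hex)
print(created.isoformat())
```

A column typed as string gives no hint that this decoding is valid, which is the information an ObjectId-aware type preserves.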

PyMongoArrow supports BSON types through Arrow or Pandas extension types. This allows you to avoid the preceding workaround.

>>> import pyarrow
>>> from pymongoarrow.api import Schema, find_arrow_all
>>> from pymongoarrow.types import ObjectIdType
>>> schema = Schema({"_id": ObjectIdType(), "x": pyarrow.int64(), "y": pyarrow.float64()})
>>> table = find_arrow_all(coll, {}, schema=schema)
>>> print(table)
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
x: int64
y: double

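With this method, Arrow correctly identifies the type. This is of limited use for non-numeric extension types, but it avoids unnecessary casting for certain operations, such as sorting datetimes.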

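# PyMongo: read plain documents, then cast the column before sorting.
f = list(coll.find({}, projection={"_id": 0, "x": 0}))
naive_table = pyarrow.Table.from_pylist(f)
# PyMongoArrow: the schema yields a timestamp column directly.
schema = Schema({"time": pyarrow.timestamp("ms")})
table = find_arrow_all(coll, {}, schema=schema)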
assert (
    table.sort_by([("time", "ascending")])["time"]
    == naive_table["time"].cast(pyarrow.timestamp("ms")).sort()
)

Additionally, PyMongoArrow supports Pandas extension types. With PyMongo, a Decimal128 value behaves as follows:

import pandas as pd
from bson import Decimal128

coll = client.test.test  # client is an existing pymongo MongoClient
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
cursor = coll.find({})
df = pd.DataFrame(list(cursor))
print(df.dtypes)
# _id      object
# value    object

The equivalent in PyMongoArrow is:

from pymongoarrow.api import find_pandas_all
coll = client.test.test
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
df = find_pandas_all(coll, {})
print(df.dtypes)
# _id      bson_PandasObjectId
# value    bson_PandasDecimal128

In both cases, the underlying values are the BSON class type:

print(df["value"][0])
Decimal128("0")

Writing Data

Writing data from an Arrow table using PyMongo looks like the following:

data = arrow_table.to_pylist()
db.collname.insert_many(data)

The equivalent in PyMongoArrow is:

from pymongoarrow.api import write
write(db.collname, arrow_table)

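As of PyMongoArrow 1.0, the main advantage of using the write function is that it iterates over the Arrow table, data frame, or NumPy array, and doesn't convert the entire object to a list.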
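The difference between iterating and converting everything to a list can be sketched with the standard library alone. The example below is not PyMongoArrow's implementation. It is a simplified illustration, assuming columnar data held in a plain dict, of how rows can be produced in bounded batches so that only one batch of row dictionaries exists at a time. The function name `iter_row_batches` and the batch size are hypothetical.

```python
from itertools import islice
from typing import Dict, Iterator, List


def iter_row_batches(columns: Dict[str, List], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of row dicts, at most batch_size rows per list."""
    names = list(columns)
    # Lazy generator: a row dict is only built when a batch asks for it.
    rows = (dict(zip(names, values)) for values in zip(*columns.values()))
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


columns = {"x": [1, 2, 3, 4, 5], "y": [0.1, 0.2, 0.3, 0.4, 0.5]}

# Collected into a list here only to inspect the result; in real use each
# batch would be handed to something like insert_many and then discarded.
batches = list(iter_row_batches(columns, batch_size=2))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

By contrast, `arrow_table.to_pylist()` builds every row dictionary up front before the first document is sent.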

Benchmarks

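The following measurements were taken with PyMongoArrow version 1.0 and PyMongo version 4.4. For insertions, the library performs about the same as conventional PyMongo and uses the same amount of memory.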

ProfileInsertSmall.peakmem_insert_conventional      107M
ProfileInsertSmall.peakmem_insert_arrow             108M
ProfileInsertSmall.time_insert_conventional         202±0.8ms
ProfileInsertSmall.time_insert_arrow                181±0.4ms
ProfileInsertLarge.peakmem_insert_arrow             127M
ProfileInsertLarge.peakmem_insert_conventional      125M
ProfileInsertLarge.time_insert_arrow                425±1ms
ProfileInsertLarge.time_insert_conventional         440±1ms

For reads, the library is slower for small documents and nested documents, but faster for large documents. It uses less memory in all cases.

ProfileReadSmall.peakmem_conventional_arrow     85.8M
ProfileReadSmall.peakmem_to_arrow               83.1M
ProfileReadSmall.time_conventional_arrow        38.1±0.3ms
ProfileReadSmall.time_to_arrow                  60.8±0.3ms
ProfileReadLarge.peakmem_conventional_arrow     138M
ProfileReadLarge.peakmem_to_arrow               106M
ProfileReadLarge.time_conventional_ndarray      243±20ms
ProfileReadLarge.time_to_arrow                  186±0.8ms
ProfileReadDocument.peakmem_conventional_arrow  209M
ProfileReadDocument.peakmem_to_arrow            152M
ProfileReadDocument.time_conventional_arrow     865±7ms
ProfileReadDocument.time_to_arrow               937±1ms
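To read the figures above as relative changes, the following short sketch divides each PyMongoArrow measurement by its conventional counterpart. The numbers are copied from the read benchmarks listed above, with the ± spread omitted. The dictionary layout and the `ratio` helper are illustrative choices, not part of any benchmark tooling. A ratio above 1 means PyMongoArrow took longer or used more memory, and a ratio below 1 means it took less.

```python
# (conventional, pymongoarrow) pairs taken from the read benchmarks above.
reads = {
    "ReadSmall": {"time_ms": (38.1, 60.8), "peak_mb": (85.8, 83.1)},
    "ReadLarge": {"time_ms": (243.0, 186.0), "peak_mb": (138.0, 106.0)},
    "ReadDocument": {"time_ms": (865.0, 937.0), "peak_mb": (209.0, 152.0)},
}


def ratio(pair):
    """Return the PyMongoArrow value relative to the conventional value."""
    conventional, arrow = pair
    return arrow / conventional


summary = {
    name: {metric: ratio(pair) for metric, pair in metrics.items()}
    for name, metrics in reads.items()
}

for name, metrics in summary.items():
    print(f"{name}: time x{metrics['time_ms']:.2f}, memory x{metrics['peak_mb']:.2f}")
```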
