Comparing to PyMongo

In this guide, you can learn about the differences between PyMongoArrow and the PyMongo driver. This guide assumes familiarity with basic PyMongo and MongoDB concepts.

Reading Data

The most basic way to read data using PyMongo is:

import pyarrow
coll = db.benchmark  # db is an existing pymongo Database
f = list(coll.find({}, projection={"_id": 0}))
table = pyarrow.Table.from_pylist(f)

This works, but you must exclude the _id field. Otherwise, you get the following error:

pyarrow.lib.ArrowInvalid: Could not convert ObjectId('642f2f4720d92a85355671b3') with type ObjectId: did not recognize Python value type when inferring an Arrow data type

The following code example shows a workaround for the preceding error when using PyMongo:

>>> f = list(coll.find({}))
>>> for doc in f:
...     doc["_id"] = str(doc["_id"])
...
>>> table = pyarrow.Table.from_pylist(f)
>>> print(table)
pyarrow.Table
_id: string
x: int64
y: double

Although this avoids the error, it has a limitation: Arrow cannot identify that _id is an ObjectId, as the schema shows by listing _id as a string.

PyMongoArrow supports BSON types through Arrow or pandas extension types. This makes the preceding workaround unnecessary.

>>> import pyarrow
>>> from pymongoarrow.api import Schema, find_arrow_all
>>> from pymongoarrow.types import ObjectIdType
>>> schema = Schema({"_id": ObjectIdType(), "x": pyarrow.int64(), "y": pyarrow.float64()})
>>> table = find_arrow_all(coll, {}, schema=schema)
>>> print(table)
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
x: int64
y: double

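With this method, Arrow correctly identifies the type. This has limited use for non-numeric extension types, but it avoids unnecessary casting for certain operations, such as sorting datetimes.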

f = list(coll.find({}, projection={"_id": 0, "x": 0}))
naive_table = pyarrow.Table.from_pylist(f)
schema = Schema({"time": pyarrow.timestamp("ms")})
table = find_arrow_all(coll, {}, schema=schema)
assert (
    table.sort_by([("time", "ascending")])["time"]
    == naive_table["time"].cast(pyarrow.timestamp("ms")).sort()
)

Additionally, PyMongoArrow supports pandas extension types. With PyMongo, a Decimal128 value behaves as follows:

import pandas as pd
from bson import Decimal128
coll = client.test.test
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
cursor = coll.find({})
df = pd.DataFrame(list(cursor))
print(df.dtypes)
# _id      object
# value    object

The equivalent in PyMongoArrow is:

from bson import Decimal128
from pymongoarrow.api import find_pandas_all
coll = client.test.test
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
df = find_pandas_all(coll, {})
print(df.dtypes)
# _id      bson_PandasObjectId
# value    bson_PandasDecimal128

In both cases, the underlying values are the BSON class type:

print(df["value"][0])
Decimal128("0")

Writing Data

Writing data from an Arrow table using PyMongo looks like the following:

data = arrow_table.to_pylist()
db.collname.insert_many(data)

The equivalent in PyMongoArrow is:

from pymongoarrow.api import write
write(db.collname, arrow_table)

As of PyMongoArrow 1.0, the main advantage of using the write function is that it iterates over the Arrow table, DataFrame, or NumPy array and does not convert the entire object to a list.
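The difference between iterating and converting everything to a list can be illustrated with a small standard-library sketch. This is not PyMongoArrow's internal implementation; the helper names iter_batches and insert_in_batches and the default batch size of 1000 are hypothetical, chosen only to show the idea of sending rows in bounded chunks.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size rows without materializing all rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # Only one batch is held in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(
    insert_many: Callable[[List[dict]], object],
    rows: Iterable[dict],
    batch_size: int = 1000,  # hypothetical default, not a library setting
) -> int:
    """Pass each batch to insert_many and return the number of rows sent."""
    total = 0
    for batch in iter_batches(rows, batch_size):
        insert_many(batch)
        total += len(batch)
    return total
```

With PyMongo, insert_many could be db.collname.insert_many, and rows could be any iterable of documents. Peak memory then depends on the batch size rather than on the size of the whole table.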

Benchmarks

The following measurements were taken with PyMongoArrow version 1.0 and PyMongo version 4.4. For insertions, the library performs about the same as conventional PyMongo and uses about the same amount of memory.

ProfileInsertSmall.peakmem_insert_conventional      107M
ProfileInsertSmall.peakmem_insert_arrow             108M
ProfileInsertSmall.time_insert_conventional         202±0.8ms
ProfileInsertSmall.time_insert_arrow                181±0.4ms
ProfileInsertLarge.peakmem_insert_arrow             127M
ProfileInsertLarge.peakmem_insert_conventional      125M
ProfileInsertLarge.time_insert_arrow                425±1ms
ProfileInsertLarge.time_insert_conventional         440±1ms

For reads, the library is slower for small documents and nested documents, but faster for large documents. It uses less memory in all cases.

ProfileReadSmall.peakmem_conventional_arrow     85.8M
ProfileReadSmall.peakmem_to_arrow               83.1M
ProfileReadSmall.time_conventional_arrow        38.1±0.3ms
ProfileReadSmall.time_to_arrow                  60.8±0.3ms
ProfileReadLarge.peakmem_conventional_arrow     138M
ProfileReadLarge.peakmem_to_arrow               106M
ProfileReadLarge.time_conventional_ndarray      243±20ms
ProfileReadLarge.time_to_arrow                  186±0.8ms
ProfileReadDocument.peakmem_conventional_arrow  209M
ProfileReadDocument.peakmem_to_arrow            152M
ProfileReadDocument.time_conventional_arrow     865±7ms
ProfileReadDocument.time_to_arrow               937±1ms
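To put the read figures in relative terms, the following sketch computes the percent change from the conventional PyMongo path to PyMongoArrow for each profile. The numbers are copied from the table above; the dictionary layout and the percent_change helper are illustrative names, not part of either library.

```python
# (conventional, arrow) pairs copied from the read benchmark table.
# Times are in milliseconds; peak memory is in the units the table reports (M).
# The large-profile conventional time comes from the row labeled
# time_conventional_ndarray.
READ_RESULTS = {
    "small": {"time": (38.1, 60.8), "peakmem": (85.8, 83.1)},
    "large": {"time": (243.0, 186.0), "peakmem": (138.0, 106.0)},
    "document": {"time": (865.0, 937.0), "peakmem": (209.0, 152.0)},
}


def percent_change(conventional: float, arrow: float) -> float:
    """Return the change from conventional to arrow, in percent.

    Negative values mean PyMongoArrow used less time or memory.
    """
    return (arrow - conventional) / conventional * 100.0


for name, metrics in READ_RESULTS.items():
    time_delta = percent_change(*metrics["time"])
    mem_delta = percent_change(*metrics["peakmem"])
    print(f"{name:<9} time {time_delta:+.1f}%  peak memory {mem_delta:+.1f}%")
```

On these figures, small-document reads take roughly 60% longer with PyMongoArrow, large-document reads are roughly 23% faster, and peak memory is lower in all three profiles.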
