Comparing to PyMongo

In this guide, you can learn about the differences between PyMongoArrow and the PyMongo driver. This guide assumes familiarity with basic PyMongo and MongoDB concepts.

Reading Data

The most basic way to read data using PyMongo is:

import pyarrow

coll = db.benchmark
# Exclude _id so that pyarrow can infer a type for every remaining field.
f = list(coll.find({}, projection={"_id": 0}))
table = pyarrow.Table.from_pylist(f)

This works, but you must exclude the _id field. Otherwise, you get the following error:

pyarrow.lib.ArrowInvalid: Could not convert ObjectId('642f2f4720d92a85355671b3') with type ObjectId: did not recognize Python value type when inferring an Arrow data type

The following code example shows a workaround for the preceding error when using PyMongo:

>>> f = list(coll.find({}))
>>> for doc in f:
...     doc["_id"] = str(doc["_id"])
...
>>> table = pyarrow.Table.from_pylist(f)
>>> print(table)
pyarrow.Table
_id: string
x: int64
y: double

This avoids the error, but it has a drawback: Arrow can't identify that _id is an ObjectId, as the schema shows by listing _id as a string.
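
To make the drawback concrete: once _id is a plain string column, anything specific to ObjectId, such as its embedded creation time, has to be recovered by hand. The following is a minimal standard-library sketch, assuming the usual ObjectId layout (12 bytes, with a 4-byte big-endian Unix timestamp first). The variable names are illustrative and are not part of PyMongo or PyMongoArrow.

```python
from datetime import datetime, timezone

# Hex string taken from the error message quoted earlier on this page.
oid_hex = "642f2f4720d92a85355671b3"

# An ObjectId is 12 bytes, so its string form is 24 hexadecimal characters.
raw = bytes.fromhex(oid_hex)

# Assumption: the first 4 bytes hold a big-endian Unix timestamp in seconds.
created_seconds = int.from_bytes(raw[:4], "big")
created_at = datetime.fromtimestamp(created_seconds, tz=timezone.utc)

print(len(raw), created_at.isoformat())
```

With a typed column, this kind of manual parsing is not needed because the values keep their BSON class.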

PyMongoArrow supports BSON types through Arrow or Pandas extension types. This lets you avoid the preceding workaround.

>>> from pymongoarrow.api import Schema, find_arrow_all
>>> from pymongoarrow.types import ObjectIdType
>>> schema = Schema({"_id": ObjectIdType(), "x": pyarrow.int64(), "y": pyarrow.float64()})
>>> table = find_arrow_all(coll, {}, schema=schema)
>>> print(table)
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
x: int64
y: double

With this method, Arrow correctly identifies the type. This has limited use for non-numeric extension types, but it avoids unnecessary casting for certain operations, such as sorting datetimes.

f = list(coll.find({}, projection={"_id": 0, "x": 0}))
naive_table = pyarrow.Table.from_pylist(f)
schema = Schema({"time": pyarrow.timestamp("ms")})
table = find_arrow_all(coll, {}, schema=schema)
assert (
    table.sort_by([("time", "ascending")])["time"]
    == naive_table["time"].cast(pyarrow.timestamp("ms")).sort()
)

PyMongoArrow also supports Pandas extension types. With PyMongo, a Decimal128 value behaves as follows:

import pandas as pd
from bson import Decimal128

coll = client.test.test
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
cursor = coll.find({})
df = pd.DataFrame(list(cursor))
print(df.dtypes)
# _id      object
# value    object

The equivalent in PyMongoArrow is:

from pymongoarrow.api import find_pandas_all
coll = client.test.test
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
df = find_pandas_all(coll, {})
print(df.dtypes)
# _id      bson_PandasObjectId
# value    bson_PandasDecimal128

In both cases, the underlying values are the BSON class type:

print(df["value"][0])
Decimal128("0")

Writing Data

To write data from an Arrow table using PyMongo, do the following:

data = arrow_table.to_pylist()
db.collname.insert_many(data)

The equivalent in PyMongoArrow is:

from pymongoarrow.api import write
write(db.collname, arrow_table)

As of PyMongoArrow 1.0, the main advantage of using the write function is that it iterates over the Arrow table, DataFrame, or NumPy array and doesn't convert the entire object to a list.
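
The difference between iterating and converting everything to a list can be sketched with the standard library alone. This is only an illustration of the general idea, not PyMongoArrow's actual implementation: iter_batches, write_in_batches, and the batch size of 1000 are hypothetical, and a lambda stands in for collection.insert_many so that the sketch runs without a server.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Document = Dict[str, object]


def iter_batches(rows: Iterable[Document], batch_size: int) -> Iterator[List[Document]]:
    """Yield lists of at most batch_size documents from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_in_batches(
    rows: Iterable[Document],
    insert_many: Callable[[List[Document]], object],
    batch_size: int = 1000,
) -> int:
    """Pass rows to insert_many one batch at a time and return the row count."""
    written = 0
    for batch in iter_batches(rows, batch_size):
        insert_many(batch)
        written += len(batch)
    return written


# Hypothetical stand-in for collection.insert_many: it only records batch sizes.
batch_sizes: List[int] = []
rows = ({"x": i, "y": i / 2} for i in range(2500))  # a generator, never a full list
total = write_in_batches(rows, lambda batch: batch_sizes.append(len(batch)))
print(total, batch_sizes)
```

Only one batch of documents exists as Python objects at any moment, whereas the to_pylist() approach materializes every row before the first insert.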

Benchmarks

The following measurements were taken with PyMongoArrow version 1.0 and PyMongo version 4.4. For insertions, the library performs about the same as conventional PyMongo and uses roughly the same amount of memory.

ProfileInsertSmall.peakmem_insert_conventional      107M
ProfileInsertSmall.peakmem_insert_arrow             108M
ProfileInsertSmall.time_insert_conventional         202±0.8ms
ProfileInsertSmall.time_insert_arrow                181±0.4ms
ProfileInsertLarge.peakmem_insert_arrow             127M
ProfileInsertLarge.peakmem_insert_conventional      125M
ProfileInsertLarge.time_insert_arrow                425±1ms
ProfileInsertLarge.time_insert_conventional         440±1ms

For reads, the library is slower for small documents and nested documents, but faster for large documents. It uses less memory in all cases.

ProfileReadSmall.peakmem_conventional_arrow     85.8M
ProfileReadSmall.peakmem_to_arrow               83.1M
ProfileReadSmall.time_conventional_arrow        38.1±0.3ms
ProfileReadSmall.time_to_arrow                  60.8±0.3ms
ProfileReadLarge.peakmem_conventional_arrow     138M
ProfileReadLarge.peakmem_to_arrow               106M
ProfileReadLarge.time_conventional_ndarray      243±20ms
ProfileReadLarge.time_to_arrow                  186±0.8ms
ProfileReadDocument.peakmem_conventional_arrow  209M
ProfileReadDocument.peakmem_to_arrow            152M
ProfileReadDocument.time_conventional_arrow     865±7ms
ProfileReadDocument.time_to_arrow               937±1ms
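
The read figures are easier to compare as ratios. The short calculation below is a sketch that copies the numbers from the listing above and divides the PyMongoArrow value by the conventional one, so a ratio below 1 means PyMongoArrow used less time or memory. The dictionary and function names are illustrative.

```python
# (conventional, PyMongoArrow) pairs copied from the read benchmark listing.
# The conventional ReadLarge time is the time_conventional_ndarray entry.
read_time_ms = {
    "ReadSmall": (38.1, 60.8),
    "ReadLarge": (243.0, 186.0),
    "ReadDocument": (865.0, 937.0),
}
# Peak memory in the unit reported by the benchmark ("M").
read_peak_mem = {
    "ReadSmall": (85.8, 83.1),
    "ReadLarge": (138.0, 106.0),
    "ReadDocument": (209.0, 152.0),
}


def ratio(pair):
    """Return the PyMongoArrow value divided by the conventional value."""
    conventional, arrow = pair
    return round(arrow / conventional, 2)


time_ratio = {name: ratio(pair) for name, pair in read_time_ms.items()}
mem_ratio = {name: ratio(pair) for name, pair in read_peak_mem.items()}

for name in read_time_ms:
    print(f"{name}: time x{time_ratio[name]}, peak memory x{mem_ratio[name]}")
```

On these numbers, reads of small documents take about 1.6 times as long and nested documents about 1.08 times as long, while large documents take about 0.77 times as long. Peak memory is lower in all three cases.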
