154 lines
4.3 KiB
Plaintext
154 lines
4.3 KiB
Plaintext
Metadata-Version: 2.4
|
|
Name: fastparquet
|
|
Version: 2025.12.0
|
|
Summary: Python support for Parquet file format
|
|
Home-page: https://github.com/dask/fastparquet/
|
|
Author: Martin Durant
|
|
Author-email: mdurant@anaconda.com
|
|
License: Apache License 2.0
|
|
Classifier: Development Status :: 4 - Beta
|
|
Classifier: Intended Audience :: Developers
|
|
Classifier: Intended Audience :: System Administrators
|
|
Classifier: License :: OSI Approved :: Apache Software License
|
|
Classifier: Programming Language :: Python
|
|
Classifier: Programming Language :: Python :: 3
|
|
Classifier: Programming Language :: Python :: 3.10
|
|
Classifier: Programming Language :: Python :: 3.11
|
|
Classifier: Programming Language :: Python :: 3.12
|
|
Classifier: Programming Language :: Python :: 3.13
|
|
Classifier: Programming Language :: Python :: 3.14
|
|
Classifier: Programming Language :: Python :: Implementation :: CPython
|
|
Requires-Python: >=3.10
|
|
License-File: LICENSE
|
|
Requires-Dist: pandas>=1.5.0
|
|
Requires-Dist: numpy
|
|
Requires-Dist: cramjam>=2.3
|
|
Requires-Dist: fsspec
|
|
Requires-Dist: packaging
|
|
Provides-Extra: lzo
|
|
Requires-Dist: python-lzo; extra == "lzo"
|
|
Dynamic: author
|
|
Dynamic: author-email
|
|
Dynamic: classifier
|
|
Dynamic: description
|
|
Dynamic: home-page
|
|
Dynamic: license
|
|
Dynamic: license-file
|
|
Dynamic: provides-extra
|
|
Dynamic: requires-dist
|
|
Dynamic: requires-python
|
|
Dynamic: summary
|
|
|
|
fastparquet
|
|
===========
|
|
|
|
.. image:: https://github.com/dask/fastparquet/actions/workflows/main.yaml/badge.svg
|
|
:target: https://github.com/dask/fastparquet/actions/workflows/main.yaml
|
|
|
|
.. image:: https://readthedocs.org/projects/fastparquet/badge/?version=latest
|
|
:target: https://fastparquet.readthedocs.io/en/latest/
|
|
|
|
fastparquet is a python implementation of the `parquet
|
|
format <https://github.com/apache/parquet-format>`_, aiming integrate
|
|
into python-based big data work-flows. It is used implicitly by
|
|
the projects Dask, Pandas and intake-parquet.
|
|
|
|
We offer a high degree of support for the features of the parquet format, and
|
|
very competitive performance, in a small install size and codebase.
|
|
|
|
Details of this project, how to use it and comparisons to other work can be found in the documentation_.
|
|
|
|
.. _documentation: https://fastparquet.readthedocs.io
|
|
|
|
Requirements
|
|
------------
|
|
|
|
(all development is against recent versions in the default anaconda channels
|
|
and/or conda-forge)
|
|
|
|
Required:
|
|
|
|
- numpy
|
|
- pandas
|
|
- cython >= 0.29.23 (if building from pyx files)
|
|
- cramjam
|
|
- fsspec
|
|
|
|
Supported compression algorithms:
|
|
|
|
- Available by default:
|
|
|
|
- gzip
|
|
- snappy
|
|
- brotli
|
|
- lz4
|
|
- zstandard
|
|
|
|
- Optionally supported
|
|
|
|
- `lzo <https://github.com/jd-boyd/python-lzo>`_
|
|
|
|
|
|
Installation
|
|
------------
|
|
|
|
Install using conda, to get the latest compiled version::
|
|
|
|
conda install -c conda-forge fastparquet
|
|
|
|
or install from PyPI::
|
|
|
|
pip install fastparquet
|
|
|
|
You may wish to install numpy first, to help pip's resolver.
|
|
This may install an appropriate wheel, or compile from source. For the latter,
|
|
you will need a suitable C compiler toolchain on your system.
|
|
|
|
You can also install latest version from github::
|
|
|
|
pip install git+https://github.com/dask/fastparquet
|
|
|
|
in which case you should also have ``cython`` to be able to rebuild the C files.
|
|
|
|
Usage
|
|
-----
|
|
|
|
Please refer to the documentation_.
|
|
|
|
*Reading*
|
|
|
|
.. code-block:: python
|
|
|
|
from fastparquet import ParquetFile
|
|
pf = ParquetFile('myfile.parq')
|
|
df = pf.to_pandas()
|
|
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])
|
|
|
|
You may specify which columns to load, which of those to keep as categoricals
|
|
(if the data uses dictionary encoding). The file-path can be a single file,
|
|
a metadata file pointing to other data files, or a directory (tree) containing
|
|
data files. The latter is what is typically output by hive/spark.
|
|
|
|
*Writing*
|
|
|
|
.. code-block:: python
|
|
|
|
from fastparquet import write
|
|
write('outfile.parq', df)
|
|
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
|
|
compression='GZIP', file_scheme='hive')
|
|
|
|
The default is to produce a single output file with a single row-group
|
|
(i.e., logical segment) and no compression. At the moment, only simple
|
|
data-types and plain encoding are supported, so expect performance to be
|
|
similar to *numpy.savez*.
|
|
|
|
History
|
|
-------
|
|
|
|
This project forked in October 2016 from `parquet-python`_, which was not designed
|
|
for vectorised loading of big data or parallel access.
|
|
|
|
.. _parquet-python: https://github.com/jcrobak/parquet-python
|
|
|