Metadata-Version: 2.4
Name: fastparquet
Version: 2025.12.0
Summary: Python support for Parquet file format
Home-page: https://github.com/dask/fastparquet/
Author: Martin Durant
Author-email: mdurant@anaconda.com
License: Apache License 2.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: Implementation :: CPython
Requires-Python: >=3.10
License-File: LICENSE
Requires-Dist: pandas>=1.5.0
Requires-Dist: numpy
Requires-Dist: cramjam>=2.3
Requires-Dist: fsspec
Requires-Dist: packaging
Provides-Extra: lzo
Requires-Dist: python-lzo; extra == "lzo"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

fastparquet
===========

.. image:: https://github.com/dask/fastparquet/actions/workflows/main.yaml/badge.svg
    :target: https://github.com/dask/fastparquet/actions/workflows/main.yaml

.. image:: https://readthedocs.org/projects/fastparquet/badge/?version=latest
    :target: https://fastparquet.readthedocs.io/en/latest/

fastparquet is a Python implementation of the `parquet
format <https://github.com/apache/parquet-format>`_, aiming to integrate
into Python-based big data workflows. It is used implicitly by
the projects Dask, Pandas and intake-parquet.

We offer a high degree of support for the features of the parquet format, and
very competitive performance, in a small install size and codebase.

Details of this project, how to use it and comparisons to other work can be found in the documentation_.

.. _documentation: https://fastparquet.readthedocs.io

Requirements
------------

All development is against recent versions of these packages in the default
anaconda channels and/or conda-forge.

Required:

- numpy
- pandas
- cython >= 0.29.23 (if building from pyx files)
- cramjam
- fsspec

Supported compression algorithms:

- Available by default:

  - gzip
  - snappy
  - brotli
  - lz4
  - zstandard

- Optionally supported:

  - `lzo <https://github.com/jd-boyd/python-lzo>`_


Installation
------------

Install using conda, to get the latest compiled version::

   conda install -c conda-forge fastparquet

or install from PyPI::

   pip install fastparquet

You may wish to install numpy first, to help pip's resolver.
Depending on your platform, pip will either install an appropriate wheel or
compile from source. For the latter, you will need a suitable C compiler
toolchain on your system.
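
The package metadata also declares an optional ``lzo`` extra that depends on
``python-lzo``. As a sketch, assuming ``python-lzo`` can be installed on your
platform, the extra can be requested at install time::

   pip install "fastparquet[lzo]"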

You can also install the latest version from GitHub::

   pip install git+https://github.com/dask/fastparquet

in which case you should also have ``cython`` to be able to rebuild the C files.

Usage
-----

Please refer to the documentation_.

*Reading*

.. code-block:: python

    from fastparquet import ParquetFile

    pf = ParquetFile('myfile.parq')
    # load every column
    df = pf.to_pandas()
    # load only two columns, keeping col1 as a categorical
    df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

You may specify which columns to load, and which of those to keep as
categoricals (if the data uses dictionary encoding). The file-path can be a
single file, a metadata file pointing to other data files, or a directory
(tree) containing data files. A directory tree is what Hive and Spark
typically output.
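
As a minimal sketch of the directory case, the snippet below writes a small
hive-style dataset to a temporary directory and reads it back through the
directory path, selecting a single column. The sample frame, its column names
``x`` and ``y``, and the name ``dataset.parq`` are illustrative assumptions,
not part of the project:

.. code-block:: python

    import os
    import tempfile

    import pandas as pd
    from fastparquet import ParquetFile, write

    # Hypothetical sample data; the column names are illustrative only.
    df = pd.DataFrame({"x": range(30), "y": [i % 7 for i in range(30)]})

    with tempfile.TemporaryDirectory() as tmp:
        dataset = os.path.join(tmp, "dataset.parq")

        # file_scheme='hive' produces a directory of data files plus metadata.
        write(dataset, df, row_group_offsets=[0, 10, 20], file_scheme="hive")
        files = sorted(os.listdir(dataset))

        # Point ParquetFile at the directory rather than at one data file.
        pf = ParquetFile(dataset)
        subset = pf.to_pandas(columns=["x"])

    print(files)
    print(list(subset.columns), len(subset))

Loading is done inside the ``with`` block so that the data is fully read
before the temporary directory is removed.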

*Writing*

.. code-block:: python

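    from fastparquet import write

    # df is an existing pandas DataFrame
    write('outfile.parq', df)
    # three row-groups, gzip compression, hive-style directory output
    write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
          compression='GZIP', file_scheme='hive')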

The default is to produce a single output file with a single row-group
(i.e., logical segment) and no compression. At the moment, only simple
data types and plain encoding are supported, so expect performance to be
similar to ``numpy.savez``.
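
To see the default row-group behaviour next to an explicit split, the
following sketch writes the same frame twice and counts the row-groups on
read. The sample data and the file names ``default.parq`` and ``split.parq``
are hypothetical, and the frame is assumed to be small enough to fit in one
default row-group:

.. code-block:: python

    import os
    import tempfile

    import pandas as pd
    from fastparquet import ParquetFile, write

    # Hypothetical sample data, small enough for a single default row-group.
    df = pd.DataFrame({"x": range(30), "y": [i * 0.5 for i in range(30)]})

    with tempfile.TemporaryDirectory() as tmp:
        single = os.path.join(tmp, "default.parq")
        split = os.path.join(tmp, "split.parq")

        # Default settings: one output file.
        write(single, df)
        # Same data, starting new row-groups at rows 0, 10 and 20, with gzip.
        write(split, df, row_group_offsets=[0, 10, 20], compression="GZIP")

        n_default = len(ParquetFile(single).row_groups)
        n_split = len(ParquetFile(split).row_groups)

    print(n_default, n_split)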

History
-------

This project was forked in October 2016 from `parquet-python`_, which was not
designed for vectorised loading of big data or parallel access.

.. _parquet-python: https://github.com/jcrobak/parquet-python