This release includes :
- `{parquetize}` now requires a minimum version (2.4.0) of the `{haven}` dependency package to ensure that conversions are performed correctly from SAS files compressed in BINARY mode #46
- `csv_to_parquet()` now has a `read_delim_args` argument, allowing passing of arguments to `read_delim()` (added by @nikostr); a usage sketch follows this list
- `table_to_parquet()` can now convert files with uppercase extensions (.SAS7BDAT, .SAV, .DTA)
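A minimal usage sketch (not from the release notes): it assumes `read_delim_args` takes a named list of arguments forwarded to `readr::read_delim()`, and builds a temporary semicolon-delimited file so the snippet is self-contained.

```r
library(parquetize)

# Create a semicolon-delimited csv with "," as decimal mark (write.csv2 defaults)
csv_file <- tempfile(fileext = ".csv")
write.csv2(iris, csv_file, row.names = FALSE)

csv_to_parquet(
  path_to_file = csv_file,
  path_to_parquet = tempdir(),
  # assumed usage: a named list of arguments passed on to readr::read_delim()
  read_delim_args = list(delim = ";", locale = readr::locale(decimal_mark = ","))
)
```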
This release includes :
- Use of `@inheritParams` to simplify documentation of functions arguments #38. This leads to some renaming of arguments (e.g. `path_to_csv` -> `path_to_file` …)
- `compression` and `compression_level` are now passed to the `write_parquet_at_once` and `write_parquet_by_chunk` functions and are now available in the main conversion functions of `parquetize` #36 (a usage sketch follows this list)
- `@importFrom` directives are now gathered in a single file to facilitate their maintenance #37
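A hedged sketch of the new compression settings in a main conversion function; the temporary file and argument values are illustrative, and `path_to_file` follows the renaming mentioned above.

```r
library(parquetize)

csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)

csv_to_parquet(
  path_to_file = csv_file,
  path_to_parquet = tempdir(),
  compression = "zstd",       # assumed to be forwarded to the parquet writer
  compression_level = 10
)
```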
This release includes :
You can convert to parquet any query you want on any DBI-compatible RDBMS :
```r
dbi_connection <- DBI::dbConnect(RSQLite::SQLite(),
  system.file("extdata", "iris.sqlite", package = "parquetize"))

# Reading iris table from local sqlite database
# and conversion to one parquet file :
dbi_to_parquet(
  conn = dbi_connection,
  sql_query = "SELECT * FROM iris",
  path_to_parquet = tempdir(),
  parquetname = "iris"
)
```
You can find more information in the `dbi_to_parquet` documentation.
Two arguments are deprecated to avoid confusion with arrow concepts and to keep consistency :

- `chunk_size` is replaced by `max_rows` (chunk size is an arrow concept).
- `chunk_memory_size` is replaced by `max_memory` for consistency (see the sketch after this list).
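A minimal sketch (not from the release notes), assuming `max_memory` takes over the role of `chunk_memory_size`, i.e. a size in Mb once the data are loaded in memory.

```r
library(parquetize)

table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  max_memory = 5000  # assumed: files of around 5 Gb when loaded in memory
)
```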
This release includes numerous contributions from @nbc to `parquetize` !
Due to these numerous contributions, @nbc is now officially part of the project authors !
After a big refactoring, three arguments are deprecated :

- `by_chunk` : `table_to_parquet()` will automatically chunk if you use one of `chunk_memory_size` or `chunk_size`.
- `csv_as_a_zip` : `csv_to_parquet()` will detect if the file is a zip by its extension.
- `url_to_csv` : use `path_to_csv` instead; `csv_to_parquet()` will detect if the file is remote from the file path (see the sketch after this list).

They will raise a deprecation warning for the moment.
The possibility to chunk parquet files by memory size with `table_to_parquet()` : `table_to_parquet()` takes a `chunk_memory_size` argument to convert an input file into parquet files of roughly `chunk_memory_size` Mb each when the data are loaded in memory.

Argument `by_chunk` is deprecated (see above).
Example of use of the argument `chunk_memory_size` :
```r
table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  chunk_memory_size = 5000  # this will create files of around 5 Gb when loaded in memory
)
```
Arguments can now be passed to `write_parquet()` when chunking (through the ellipsis `...`). This can be used for example to pass `compression` and `compression_level`.
Example:
```r
table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  compression = "zstd",
  compression_level = 10,
  chunk_memory_size = 5000
)
```
New function `download_extract` : this function is added to … download and unzip a file if needed.
```r
file_path <- download_extract(
  "https://www.nomisweb.co.uk/output/census/2021/census2021-ts007.zip",
  filename_in_zip = "census2021-ts007-ctry.csv"
)

csv_to_parquet(
  file_path,
  path_to_parquet = tempdir()
)
```
Under the hood, this release hardens the tests.
This release fixes an error when converting a SAS file by chunk.
This release includes :
- `table_to_parquet()` and `csv_to_parquet()` functions #20
- `inst/extdata` directory.

This release includes :
- The `table_to_parquet()` function has been fixed when the argument `by_chunk` is TRUE.

This release removes the `duckdb_to_parquet()` function on the advice of Brian Ripley from CRAN. Indeed, the storage format of DuckDB is not yet stable. The storage will be stabilized when version 1.0 is released.
This release includes corrections for CRAN submission.
This release includes an important feature :
The `table_to_parquet()` function can now convert tables to parquet format with less memory consumption. This is useful for huge tables and for computers with little RAM (#15). A vignette has been written about it. See here.
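A minimal sketch of the chunked conversion, assuming the `by_chunk`, `chunk_size` and `skip` arguments listed below; the chunk size is illustrative, not from the release notes.

```r
library(parquetize)

# Convert the SAS file in chunks of 50 rows to limit memory usage
table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  by_chunk = TRUE,
  chunk_size = 50,
  skip = 0
)
```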
- The `nb_rows` argument in the `table_to_parquet()` function is replaced by `by_chunk`, `chunk_size` and `skip` (see documentation)
- `duckdb_to_parquet()` function to convert duckdb files to parquet format.
- `sqlite_to_parquet()` function to convert sqlite files to parquet format.
- `rds_to_parquet()` function to convert rds files to parquet format.
- `json_to_parquet()` function to convert json and ndjson files to parquet format.
- Check whether `path_to_parquet` exists in functions `csv_to_parquet()` or `table_to_parquet()` (@py-b)
- `table_to_parquet()` function to convert SAS, SPSS and Stata files to parquet format.
- `csv_to_parquet()` function to convert csv files to parquet format.
- `parquetize_example()` function to get the path to package data examples.
- Added a NEWS.md file to track changes to the package.