polars¶
Overview¶
This cookbook contains ideas, concepts, and snippets for using polars in advanced ways.
Code¶
Convert integer to timestamp¶
Convert an epoch timestamp to a Datetime:
import polars as pl
# Create a Polars DataFrame with the epoch timestamp
df = pl.DataFrame({"timestamp": [1136253600]})
df = df.with_columns((pl.col("timestamp") * 1000).cast(pl.Datetime(time_unit="ms", time_zone="UTC")))
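Alternatively, polars ships pl.from_epoch, which handles the unit scaling for you; a minimal sketch:
import polars as pl

df = pl.DataFrame({"timestamp": [1136253600]})
# from_epoch interprets the integers as seconds and returns a Datetime column
df = df.with_columns(pl.from_epoch(pl.col("timestamp"), time_unit="s").alias("datetime"))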
Check that two DataFrames are equal
In addition to comparing frames directly, note that for unit testing there is polars.testing.assert_frame_equal, which provides better error reporting, has more configuration options, and raises an AssertionError on mismatch.
Reference: https://stackoverflow.com/questions/71011161/compare-two-polars-dataframes-for-equality
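A minimal sketch of how it reads in a test (the frames here are made up for illustration):
import polars as pl
from polars.testing import assert_frame_equal

left = pl.DataFrame({"a": [1, 2]})
right = pl.DataFrame({"a": [1, 3]})

assert_frame_equal(left, left)   # passes silently
assert_frame_equal(left, right)  # raises AssertionError with a detailed report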
Apply a function to multiple columns of a dataset¶
Apply a function to multiple columns using struct and map_elements.
Usage: apply a custom row-level function across several columns of a columnar frame; see the sketch after the links below.
See: https://github.com/pola-rs/polars/issues/4374
See: https://stackoverflow.com/questions/74433918/apply-a-function-to-2-columns-in-polars
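A minimal sketch of the struct + map_elements pattern (the frame and the function are made up for illustration):
import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": [10, 20]})

# Pack both columns into a struct; map_elements then receives each row as a dict
df = df.with_columns(
    pl.struct(["a", "b"])
    .map_elements(lambda row: row["a"] + row["b"], return_dtype=pl.Int64)
    .alias("a_plus_b")
)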
Using the .pipe method¶
https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.pipe.html
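A minimal sketch (the normalize function is made up; extra arguments are forwarded to it by pipe):
import polars as pl

def normalize(df: pl.DataFrame, col: str) -> pl.DataFrame:
    return df.with_columns((pl.col(col) - pl.col(col).mean()) / pl.col(col).std())

df = pl.DataFrame({"x": [1.0, 2.0, 3.0]})
df = df.pipe(normalize, col="x")  # keeps transformation chains readable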
Set config global from JSON¶
polars supports configuring the package's behaviour in the following styles:
- Through a context manager with a pl.Config object
- Through inline code with the set_* methods
Let's take an example:
# Style 1: config via the pl.Config object
with pl.Config(
    tbl_formatting="ASCII_FULL_CONDENSED",
    tbl_hide_column_data_types=True,
    tbl_hide_dataframe_shape=True,
    fmt_str_lengths=500,
) as cfg:
    # Style 2: config via the set_* methods
    cfg.set_tbl_cols(-1)
    cfg.set_tbl_rows(-1)
    # Your script is affected within this context
    ...
But when you have multiple scripts that should share the same configuration, the following happens:
- You copy and paste the same config into every script file
- When you change a single global format, you have to search and replace across the whole codebase
- You have to check that every file carries the same config
So the idea is to store the global config attributes somewhere and apply them across all scripts. You can achieve this by saving the config to a JSON file like this:
with pl.Config(...) as cfg:
    # Other settings
    # cfg.set_*
    # Save to file
    cfg.save_to_file("polars_config.json")
Then, in your startup code, load it somewhere early, for example in the __init__.py file (currently this works for package-style projects only).
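A minimal sketch, assuming the JSON file was produced by save_to_file above:
# __init__.py: apply the shared config once, at package import time
import polars as pl

pl.Config.load_from_file("polars_config.json")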
Error: list.gather with a list-literal index¶
df = pl.DataFrame({"a": [[0.1, 0.2, 0.3]]})
df.with_columns(pl.lit([0]).alias("c")).with_columns(gather=pl.col("a").list.gather(pl.col("c"), null_on_oob=True))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\WIN11\AppData\Roaming\Python\Python39\site-packages\polars\dataframe\frame.py", line 8890, in with_columns
return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
File "C:\Users\WIN11\AppData\Roaming\Python\Python39\site-packages\polars\lazyframe\frame.py", line 2027, in collect
return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: could not extract number from any-value of dtype: 'List(Int64)'
Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
import polars as pl
pl.__version__
'1.9.0'
df = pl.DataFrame({"a": [[0.1, 0.2, 0.3]]})
df.with_columns(pl.lit([0]).alias("c")).with_columns(gather=pl.col("a").list.gather(pl.col("c"), null_on_oob=True))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\PROJECT\app-flareon\venv\lib\site-packages\polars\dataframe\frame.py", line 9183, in with_columns
return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
File "D:\PROJECT\app-flareon\venv\lib\site-packages\polars\lazyframe\frame.py", line 2050, in collect
return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: could not extract number from any-value of dtype: 'List(Int64)'
df
shape: (1, 1)
┌─────────────────┐
│ a │
│ --- │
│ list[f64] │
╞═════════════════╡
│ [0.1, 0.2, 0.3] │
└─────────────────┘
df.with_columns(gather=pl.lit([0.1]))
shape: (1, 2)
┌─────────────────┬───────────┐
│ a ┆ gather │
│ --- ┆ --- │
│ list[f64] ┆ list[f64] │
╞═════════════════╪═══════════╡
│ [0.1, 0.2, 0.3] ┆ [0.1] │
└─────────────────┴───────────┘
pl.show_versions()
--------Version info---------
Polars: 1.9.0
Index type: UInt32
Platform: Windows-10-10.0.22631-SP0
Python: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager <not installed>
altair <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel 0.11.5
fsspec <not installed>
gevent <not installed>
great_tables <not installed>
matplotlib <not installed>
nest_asyncio <not installed>
numpy 2.0.1
openpyxl 3.1.5
pandas 2.2.2
pyarrow 16.1.0
pydantic 2.7.4
pyiceberg <not installed>
sqlalchemy 2.0.32
torch <not installed>
xlsx2csv 0.8.3
xlsxwriter 3.2.0
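One workaround (my own assumption; the dump above records no fix) is to materialize the index column as real data instead of building it with pl.lit:
import polars as pl

df = pl.DataFrame({"a": [[0.1, 0.2, 0.3]], "c": [[0]]})
# "c" is now a genuine list[i64] column, so list.gather accepts it
df = df.with_columns(gather=pl.col("a").list.gather(pl.col("c"), null_on_oob=True))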
Get rows up to the first occurrence of the max value in a column¶
import polars as pl
df = pl.DataFrame({"a": [1, 2, 5, None, 5, None]})
shape: (6, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 1 │
│ 2 │
│ 5 │
│ null │
│ 5 │
│ null │
└──────┘
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))))
shape: (6, 2)
┌──────┬───────┐
│ a ┆ c │
│ --- ┆ --- │
│ i64 ┆ bool │
╞══════╪═══════╡
│ 1 ┆ false │
│ 2 ┆ false │
│ 5 ┆ true │
│ null ┆ null │
│ 5 ┆ true │
│ null ┆ null │
└──────┴───────┘
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).cast(pl.Int8).cum_sum())
shape: (6, 2)
┌──────┬──────┐
│ a ┆ c │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 0 │
│ 2 ┆ 0 │
│ 5 ┆ 1 │
│ null ┆ null │
│ 5 ┆ 2 │
│ null ┆ null │
└──────┴──────┘
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).cast(pl.Int8).cum_sum())
shape: (6, 2)
┌──────┬─────┐
│ a ┆ c │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪═════╡
│ 1 ┆ 0 │
│ 2 ┆ 0 │
│ 5 ┆ 1 │
│ null ┆ 1 │
│ 5 ┆ 2 │
│ null ┆ 2 │
└──────┴─────┘
df.with_columns(c=pl.arg_true(pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False)).first())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/polars/__init__.py", line 414, in __getattr__
raise AttributeError(msg)
AttributeError: module 'polars' has no attribute 'arg_true'
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).arg_true().first())
shape: (6, 2)
┌──────┬─────┐
│ a ┆ c │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞══════╪═════╡
│ 1 ┆ 2 │
│ 2 ┆ 2 │
│ 5 ┆ 2 │
│ null ┆ 2 │
│ 5 ┆ 2 │
│ null ┆ 2 │
└──────┴─────┘
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).arg_true())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/polars/dataframe/frame.py", line 8890, in with_columns
return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
File "/usr/local/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 2027, in collect
return wrap_df(ldf.collect(callback))
polars.exceptions.ShapeError: unable to add a column of length 2 to a DataFrame of height 6
df
shape: (6, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 1 │
│ 2 │
│ 5 │
│ null │
│ 5 │
│ null │
└──────┘
df.with_columns(_fil=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).arg_true().first())
shape: (6, 2)
┌──────┬──────┐
│ a ┆ _fil │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞══════╪══════╡
│ 1 ┆ 2 │
│ 2 ┆ 2 │
│ 5 ┆ 2 │
│ null ┆ 2 │
│ 5 ┆ 2 │
│ null ┆ 2 │
└──────┴──────┘
df.with_row_index(name="index", offset=0).with_columns(_fil=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).arg_true().first()).filter(pl.col("index").le(pl.col("_fil")))
shape: (3, 3)
┌───────┬─────┬──────┐
│ index ┆ a ┆ _fil │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ u32 │
╞═══════╪═════╪══════╡
│ 0 ┆ 1 ┆ 2 │
│ 1 ┆ 2 ┆ 2 │
│ 2 ┆ 5 ┆ 2 │
└───────┴─────┴──────┘
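Putting the transcript together, a consolidated sketch of the working approach (the helper names index and _fil are just illustrative; the over(pl.lit(1)) trick from the transcript is not needed, since a plain max() already broadcasts):
import polars as pl

df = pl.DataFrame({"a": [1, 2, 5, None, 5, None]})

result = (
    df.with_row_index(name="index")
    .with_columns(
        # Row position of the first occurrence of the max value
        _fil=pl.col("a").eq(pl.col("a").max()).fill_null(False).arg_true().first()
    )
    .filter(pl.col("index").le(pl.col("_fil")))
    .drop("index", "_fil")
)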
Troubleshooting¶
ValueError: invalid time_unit¶
Expected one of {'ns','us','ms'}, got 's'.
To fix this, convert the epoch timestamp using one of the valid time units. We'll go with milliseconds (ms).
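The same fix as the snippet at the top of the page: scale the epoch seconds to milliseconds before the cast:
import polars as pl

df = pl.DataFrame({"timestamp": [1136253600]})
# time_unit must be 'ns', 'us' or 'ms', so scale seconds to ms first
df = df.with_columns(
    (pl.col("timestamp") * 1000).cast(pl.Datetime(time_unit="ms", time_zone="UTC"))
)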
Broken pipe¶
in__] asset=CADCHF component=STRATEGY strategy=CADCHF_81
Traceback (most recent call last):
File "/usr/bin/sharpedo/portfolio/builder.py", line 1237, in <module>
on_build_result = build_portfolio(
File "/usr/bin/sharpedo/portfolio/builder.py", line 930, in build_portfolio
for s_resu in executor.map(
File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
yield fs.pop().result()
File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/bin/sharpedo/portfolio/builder.py", line 715, in task_construct_strategy
strategy_result.collect(streaming=True, engine="cpu").write_csv(
File "/usr/local/lib/python3.9/site-packages/polars/dataframe/frame.py", line 2877, in write_csv
self._df.write_csv(
BrokenPipeError: Broken pipe (os error 32)
make: *** [Makefile:55: portfolio-run] Error 1
Problem: the output file is still held open elsewhere; check which process has it open and kill it.
.with_row_count() causing: sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'¶
Error
return lf.sink_parquet(
polars.exceptions.InvalidOperationError: sink_Parquet(ParquetWriteOptions { compression: Lz4Raw, statistics: StatisticsOptions { min_value: true, max_value: true, distinct_count: false, null_count: true }, row_group_size: None, data_page_size: None, maintain_order: true }) not yet supported in standard engine. Use 'collect().write_parquet()'
Resolution:
out = (
    df.with_columns(
        pl.when(pl.col("Timestamp") > pl.col("Timestamp_Lab")).then(
            pl.col("Timestamp_Lab", "Hemoglobin", "Leukocytes", "Platelets")
        )
    )
    .map_batches(
        lambda df: df.group_by("Timestamp").agg(
            pl.col("ID_1", "ID_2", "Event").first(),
            pl.col("Hemoglobin", "Leukocytes", "Platelets"),
        ),
        streamable=True,
        schema={
            "Timestamp": pl.Datetime("us"),
            "ID_1": pl.Int64,
            "ID_2": pl.Int64,
            "Event": pl.Int64,
            "Hemoglobin": pl.List(pl.Float64),
            "Leukocytes": pl.List(pl.Float64),
            "Platelets": pl.List(pl.Int64),
        },
    )
)
out.sink_parquet("moo.parquet")
See:
- https://github.com/pola-rs/polars/issues/9740
- https://github.com/pola-rs/polars/issues/15767
Corner case: Big-number¶
import polars as pl
from src.util import str_to_number

if __name__ == "__main__":
    # 2e20 exceeds the Int64 maximum (9,223,372,036,854,775,807),
    # so the non-strict cast turns the overflowing value into null
    dataset = pl.DataFrame(
        data=[["1", "200,000,000,000,000,000,000"]], schema=["very_big_container"]
    ).with_columns(
        pl.col("very_big_container")
        .map_elements(str_to_number, return_dtype=pl.Float64)
        .cast(pl.Int64, strict=False)
        .name.keep()
    )
    print(dataset)
AVOID_LAMBDA_FUNCTION_POLARS.txt¶
# Anti-pattern: the when/then/otherwise already produces the value natively;
# chaining apply with a constant lambda adds a slow Python roundtrip for nothing
output = output.with_columns([
    pl.when(pl.col(col).is_not_null()).then(pl.lit(1)).otherwise(-1)
    .apply(pl.Int64, lambda _: 1, skip_nulls=True).alias(col + "_multiple"),
])
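The lambda-free equivalent, built only from native expressions (a sketch on the same columns):
output = output.with_columns(
    # when/then/otherwise yields 1 / -1 natively; no Python function needed
    pl.when(pl.col(col).is_not_null()).then(pl.lit(1)).otherwise(-1)
    .alias(col + "_multiple")
)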
Snapshot and snippets
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.str.len_bytes.html#polars.Expr.str.len_bytes
After this:
content = content.unique(keep="first", maintain_order=True)
# Strip whitespace from every value in every column
for col in content.columns:
    content = content.with_columns(
        pl.col(col).map_elements(lambda x: str(x).strip(), skip_nulls=True)
    )
content = content.with_columns(
    pl.col("table_position").str.to_integer(base=10, strict=True).cast(pl.Int64).name.keep()
)
content = content.with_columns(
    pl.col("inclusion_date").str.strptime(pl.Date, format="%d/%m/%Y", strict=False)
)
content = content.with_columns(
    pl.col("free_float_rate").map_elements(
        lambda x: str_to_number(x, **seperator_style), skip_nulls=True, return_dtype=pl.Float64
    ).cast(pl.Float64)
)
content = content.with_columns([
    pl.col("free_float_rate").map_elements(lambda _: 1, skip_nulls=True).cast(pl.Int64).alias("free_float_rate_multiple"),
])
content = content.with_columns([
    pl.col(pl.Utf8).map_elements(lambda x: x.strip() if len(x.strip()) != 0 else None).name.keep()
])
Troubleshooting¶
Overview¶
The project's component scripts hit many errors at runtime. For each error, make sure:
- The full traceback has been captured and matched against its context.
- Fixes are applied case by case, so no compound problems are left to untangle.
- The issue is verified and the redeployment/rollback is carried out so the project works as before.
Issues¶
[SOLVED] Illegal instruction¶
Context:
This happens when the compiled instructions of a package dependency conflict with the current OS/CPU.
What the error message means is that the executable contains CPU instructions that the CPU running it doesn't understand.
Related:
- Package dependencies
- Operating system
Current case:
- The deployment OS is CentOS 7, possibly on an older, non-updated CPU
- polars was upgraded to a version greater than 0.19.0
- The error occurs when the executed script involves polars
There are reports of the same issue for other packages, e.g.:
- At numpy: Illegal instruction (core dumped) on import for numpy 1.19.5 on ARM64
- At polars:
  - "Illegal instruction (core dumped)" with pip installation on non-AVX CPU
  - Issue 5999: Illegal instruction when trying to create a dataframe on an old CPU with polars-lts-cpu
- At tensorflow: How to Resolve The Error "Illegal instruction (core dumped)" when Running "import tensorflow" in a Python Program
Checkpoint:
- [1] Make sure you have permission to execute the script
# Recursively on the deployment folder
chmod -R 0777 deployment-folder/
# Or at the script-file level
chmod 0777 deployment-folder/path/to/file/execution.py
- [2] Narrow down the error area by running the script line by line, top to bottom, in interactive mode.
- [3] Then check the versions of the related packages.
- [4] Check your OS CPU:
lscpu
# Architecture: x86_64
# CPU op-mode(s): 32-bit, 64-bit
# Byte Order: Little Endian
# CPU(s): 16
# On-line CPU(s) list: 0-15
# Thread(s) per core: 1
# Core(s) per socket: 1
# Socket(s): 16
# NUMA node(s): 2
# Vendor ID: GenuineIntel
# CPU family: 6
# Model: 45
# Model name: Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
# Stepping: 7
# CPU MHz: 2400.000
# BogoMIPS: 4800.00
# Hypervisor vendor: VMware
# Virtualization type: full
# L1d cache: 32K
# L1i cache: 32K
# L2 cache: 256K
# L3 cache: 20480K
# NUMA node0 CPU(s): 0-7
# NUMA node1 CPU(s): 8-15
# Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm ssbd ibrs ibpb stibp tsc_adjust arat spec_ctrl intel_stibp flush_l1d arch_capabilities
- [5] Find the build that suits that CPU. The flags above include avx but not avx2, which recent standard polars wheels assume, so in this case download polars-lts-cpu instead (pip install polars-lts-cpu), following the comment in Polars Issue 2922.
Annotated¶
# Attach constant metadata columns to every row
content = content.with_columns(pl.lit(index.name).cast(pl.Utf8).alias("code"))
content = content.with_columns(pl.lit(url).cast(pl.Utf8).alias("reference_url"))
content = content.with_columns(pl.lit(lang.code).cast(pl.Utf8).alias("language"))
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.apply.html#polars.Expr.apply