polars

Overview

This cookbook collects ideas, concepts, and snippets that use polars in more advanced ways.

Code

Convert integer to timestamp

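Convert an epoch timestamp (in seconds) to a Datetime

import polars as pl

# Create a Polars DataFrame with the epoch timestamp (seconds since 1970-01-01 UTC)
df = pl.DataFrame({"timestamp": [1136253600]})
# pl.Datetime only accepts "ms", "us" or "ns", so scale seconds to milliseconds before casting
df = df.with_columns(
    (pl.col("timestamp") * 1000).cast(pl.Datetime(time_unit="ms", time_zone="UTC"))
)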

Check that two DataFrames are equal

import polars as pl

# DataFrame.equals is the current name of the deprecated DataFrame.frame_equal
pl.DataFrame({"x": [1, 2, 3]}).equals(pl.DataFrame({"x": [1, 2, 3]}))  # True

For unit testing, polars.testing.assert_frame_equal gives better error reporting, has more configuration options, and raises an AssertionError when the frames differ.

Reference: https://stackoverflow.com/questions/71011161/compare-two-polars-dataframes-for-equality

Apply a function to multiple columns in a single DataFrame

Apply a function to multiple columns of a single DataFrame using struct and map_elements

Usage: apply a custom row-level function that takes several columns as input, while staying inside the columnar expression API

See: https://github.com/pola-rs/polars/issues/4374

See: https://stackoverflow.com/questions/72991324/how-to-apply-custom-functions-with-multiple-parameters-in-polars/72997458#72997458

See: https://stackoverflow.com/questions/74433918/apply-a-function-to-2-columns-in-polars

Using the .pipe method

https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.pipe.html

Set global config from JSON

polars supports the following styles of configuration to control how the package behaves.

  • Through a context manager with the pl.Config object

  • Through inline calls to the set_* methods

Let's take an example:

import polars as pl

# Style 1: options passed to the pl.Config context manager
with pl.Config(
  tbl_formatting="ASCII_FULL_CONDENSED",
  tbl_hide_column_data_types=True,
  tbl_hide_dataframe_shape=True,
  fmt_str_lengths=500,
) as cfg:

  # Style 2: inline set_* calls
  cfg.set_tbl_cols(-1)
  cfg.set_tbl_rows(-1)

  # Code inside this block runs with the config above
  ...

But when you have multiple scripts that should share the same configuration, the following problems appear:

  • The same config is copied and pasted into every script

  • Changing a single global format means searching and replacing across the whole codebase

  • You have to check that every file still carries the same config

So my idea is to store the global config attributes in one place and apply them in every script.

You can achieve this by saving them to a JSON file like this:

with pl.Config(...) as cfg:

  # Another setting
  # cfg.set_*

  # Save to file
  cfg.save_to_file("polars_config.json")

Then load it once in code that runs before everything else, for example in an __init__.py file (currently this works for package-style projects only):

pl.Config.load_from_file("polars_config.json")

Error: list.gather with a list column as the index

df = pl.DataFrame({"a": [[0.1, 0.2, 0.3]]})
df.with_columns(pl.lit([0]).alias("c")).with_columns(gather=pl.col("a").list.gather(pl.col("c"), null_on_oob=True))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\WIN11\AppData\Roaming\Python\Python39\site-packages\polars\dataframe\frame.py", line 8890, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
  File "C:\Users\WIN11\AppData\Roaming\Python\Python39\site-packages\polars\lazyframe\frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: could not extract number from any-value of dtype: 'List(Int64)'
Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
import polars as pl
pl.__version__
'1.9.0'
df = pl.DataFrame({"a": [[0.1, 0.2, 0.3]]})
df.with_columns(pl.lit([0]).alias("c")).with_columns(gather=pl.col("a").list.gather(pl.col("c"), null_on_oob=True))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\PROJECT\app-flareon\venv\lib\site-packages\polars\dataframe\frame.py", line 9183, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
  File "D:\PROJECT\app-flareon\venv\lib\site-packages\polars\lazyframe\frame.py", line 2050, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: could not extract number from any-value of dtype: 'List(Int64)'
df
shape: (1, 1)
┌─────────────────┐
│ a               │
│ ---             │
│ list[f64]       │
╞═════════════════╡
│ [0.1, 0.2, 0.3] │
└─────────────────┘
df.with_columns(gather=pl.lit([0.1]))
shape: (1, 2)
┌─────────────────┬───────────┐
│ a               ┆ gather    │
│ ---             ┆ ---       │
│ list[f64]       ┆ list[f64] │
╞═════════════════╪═══════════╡
│ [0.1, 0.2, 0.3] ┆ [0.1]     │
└─────────────────┴───────────┘
pl.show_versions()
--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            Windows-10-10.0.22631-SP0
Python:              3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            0.11.5
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                2.0.1
openpyxl             3.1.5
pandas               2.2.2
pyarrow              16.1.0
pydantic             2.7.4
pyiceberg            <not installed>
sqlalchemy           2.0.32
torch                <not installed>
xlsx2csv             0.8.3
xlsxwriter           3.2.0

Get rows up to the first occurrence of the max value in a column

import polars as pl
df = pl.DataFrame({"a": [1, 2, 5, None, 5, None]})
df
shape: (6, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ null │
│ 5    │
│ null │
└──────┘
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))))
shape: (6, 2)
┌──────┬───────┐
│ a    ┆ c     │
│ ---  ┆ ---   │
│ i64  ┆ bool  │
╞══════╪═══════╡
│ 1    ┆ false │
│ 2    ┆ false │
│ 5    ┆ true  │
│ null ┆ null  │
│ 5    ┆ true  │
│ null ┆ null  │
└──────┴───────┘

df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).cast(pl.Int8).cum_sum())
shape: (6, 2)
┌──────┬──────┐
│ a    ┆ c    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 0    │
│ 2    ┆ 0    │
│ 5    ┆ 1    │
│ null ┆ null │
│ 5    ┆ 2    │
│ null ┆ null │
└──────┴──────┘
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).cast(pl.Int8).cum_sum())
shape: (6, 2)
┌──────┬─────┐
│ a    ┆ c   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ 1    ┆ 0   │
│ 2    ┆ 0   │
│ 5    ┆ 1   │
│ null ┆ 1   │
│ 5    ┆ 2   │
│ null ┆ 2   │
└──────┴─────┘
df.with_columns(c=pl.arg_true(pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False)).first())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/polars/__init__.py", line 414, in __getattr__
    raise AttributeError(msg)
AttributeError: module 'polars' has no attribute 'arg_true'
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).arg_true().first())
shape: (6, 2)
┌──────┬─────┐
│ a    ┆ c   │
│ ---  ┆ --- │
│ i64  ┆ u32 │
╞══════╪═════╡
│ 1    ┆ 2   │
│ 2    ┆ 2   │
│ 5    ┆ 2   │
│ null ┆ 2   │
│ 5    ┆ 2   │
│ null ┆ 2   │
└──────┴─────┘
df.with_columns(c=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).arg_true())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/polars/dataframe/frame.py", line 8890, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
  File "/usr/local/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ShapeError: unable to add a column of length 2 to a DataFrame of height 6
df
shape: (6, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ null │
│ 5    │
│ null │
└──────┘
df.with_columns(_fil=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).arg_true().first())
shape: (6, 2)
┌──────┬──────┐
│ a    ┆ _fil │
│ ---  ┆ ---  │
│ i64  ┆ u32  │
╞══════╪══════╡
│ 1    ┆ 2    │
│ 2    ┆ 2    │
│ 5    ┆ 2    │
│ null ┆ 2    │
│ 5    ┆ 2    │
│ null ┆ 2    │
└──────┴──────┘
df.with_row_index(name="index", offset=0).with_columns(_fil=pl.col("a").eq(pl.col("a").max().over(pl.lit(1))).fill_null(False).arg_true().first()).filter(pl.col("index").le(pl.col("_fil")))
shape: (3, 3)
┌───────┬─────┬──────┐
│ index ┆ a   ┆ _fil │
│ ---   ┆ --- ┆ ---  │
│ u32   ┆ i64 ┆ u32  │
╞═══════╪═════╪══════╡
│ 0     ┆ 1   ┆ 2    │
│ 1     ┆ 2   ┆ 2    │
│ 2     ┆ 5   ┆ 2    │
└───────┴─────┴──────┘

Troubleshooting

ValueError: invalid time_unit

Expected one of {'ns','us','ms'}, got 's'.

pl.Datetime does not accept seconds as a time unit. To fix this, convert the epoch timestamp to one of the valid time units first; here we use milliseconds (ms).

Broken pipe

in__] asset=CADCHF component=STRATEGY strategy=CADCHF_81
Traceback (most recent call last):
  File "/usr/bin/sharpedo/portfolio/builder.py", line 1237, in <module>
    on_build_result = build_portfolio(
  File "/usr/bin/sharpedo/portfolio/builder.py", line 930, in build_portfolio
    for s_resu in executor.map(
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/bin/sharpedo/portfolio/builder.py", line 715, in task_construct_strategy
    strategy_result.collect(streaming=True, engine="cpu").write_csv(
  File "/usr/local/lib/python3.9/site-packages/polars/dataframe/frame.py", line 2877, in write_csv
    self._df.write_csv(
BrokenPipeError: Broken pipe (os error 32)
make: *** [Makefile:55: portfolio-run] Error 1

Problem: the output file is still open elsewhere. Check which process holds it open and kill that process.

.with_row_count() causing: sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'

Error

    return lf.sink_parquet(

polars.exceptions.InvalidOperationError: sink_Parquet(ParquetWriteOptions { compression: Lz4Raw, statistics: StatisticsOptions { min_value: true, max_value: true, distinct_count: false, null_count: true }, row_group_size: None, data_page_size: None, maintain_order: true }) not yet supported in standard engine. Use 'collect().write_parquet()'

Resolution:

# df is a LazyFrame here
out = (
    df.with_columns(
        pl.when(pl.col("Timestamp") > pl.col("Timestamp_Lab")).then(
            pl.col("Timestamp_Lab", "Hemoglobin", "Leukocytes", "Platelets")
        )
    )
    # Wrap the aggregation in a batch function and declare its output schema.
    # LazyFrame.map and groupby are the older names of map_batches and group_by.
    .map_batches(
        lambda df: df.group_by("Timestamp").agg(
            pl.col("ID_1", "ID_2", "Event").first(),
            pl.col("Hemoglobin", "Leukocytes", "Platelets"),
        ),
        streamable=True,
        schema={
            "Timestamp": pl.Datetime("us"),
            "ID_1": pl.Int64,
            "ID_2": pl.Int64,
            "Event": pl.Int64,
            "Hemoglobin": pl.List(pl.Float64),
            "Leukocytes": pl.List(pl.Float64),
            "Platelets": pl.List(pl.Int64),
        },
    )
)

out.sink_parquet("moo.parquet")

See:

https://github.com/pola-rs/polars/issues/9740

https://github.com/pola-rs/polars/issues/15767?form=MG0AV3

Corner case: big numbers

import polars as pl
from src.util import str_to_number


if __name__ == "__main__":

    dataset = pl.DataFrame(data=[["1", "200,000,000,000,000,000,000"]], schema=["very_big_container"]).with_columns(
        pl.col("very_big_container")
        .map_elements(str_to_number, return_dtype=pl.Float64)
        .cast(pl.Int64, strict=False).name.keep()
    )
    print(dataset)

Avoid lambda functions in polars

output = output.with_columns([
    # Native when/then/otherwise used in place of the per-value lambda:
    #   pl.col(col).apply(lambda _: 1, skip_nulls=True)
    pl.when(pl.col(col).is_not_null()).then(pl.lit(1)).otherwise(pl.lit(-1))
    .cast(pl.Int64).alias(col + "_multiple"),
])

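Snapshots and snippets

# Before: Python lambda called once per value (Expr.apply is the older name of Expr.map_elements)
content = content.with_columns(pl.col(col).apply(lambda x: str(x).strip(), skip_nulls=True))
# After: native string expression
content = content.with_columns(pl.col(col).str.strip_chars())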

https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.str.len_bytes.html#polars.Expr.str.len_bytes

After this

content = content.unique(keep='first', maintain_order=True)

# Before: strip every column with a lambda
for col in content.columns:
    content = content.with_columns(pl.col(col).apply(lambda x: str(x).strip(), skip_nulls=True))

# Before: parse integers with a lambda
content = content.with_columns(pl.col('table_position').apply(lambda x: int(x)))
# After: native parsing (str.to_integer is the current name of str.parse_int, .name.keep() of .keep_name())
content = content.with_columns(pl.col('table_position').str.to_integer(base=10, strict=True).cast(pl.Int64).name.keep())
content = content.with_columns(pl.col('inclusion_date').str.strptime(pl.Date, format='%d/%m/%Y', strict=False))
# str_to_number and seperator_style come from the project code
content = content.with_columns(pl.col('free_float_rate').map_elements(
    lambda x: str_to_number(x, **seperator_style), skip_nulls=True, return_dtype=pl.Float64
).cast(pl.Float64))
content = content.with_columns([
    pl.col("free_float_rate").map_elements(lambda _: 1, skip_nulls=True, return_dtype=pl.Int64).alias("free_float_rate_multiple"),
])
# Strip string columns and turn empty strings into null
content = content.with_columns([
    pl.col(pl.Utf8).map_elements(lambda x: x.strip() if len(x.strip()) != 0 else None, return_dtype=pl.Utf8).name.keep()
])

Troubleshooting

Overview

The component scripts in this project raise many errors at run time.

For each error, make sure that:

  • The full traceback has been captured and matches the context in which it occurred

  • Errors are fixed case by case, so they do not pile up into one complex problem

  • The fix is verified, then re-deployed or rolled back, so that the project works as before

Issues

[SOLVED] Illegal instruction

Context:

This happens when a package dependency uses CPU instructions that the current machine does not support.

The error message means that the executable contains CPU instructions that the CPU running it does not understand.

Related:

  • Package dependencies

  • OS

Current situation:

  • The deployment OS is CentOS 7, possibly running on an older CPU

  • polars was upgraded to a version greater than 0.19.0

  • The error appears when running a script that uses polars

python3 path/to/file/execution.py -v
# Illegal Instruction

The same issue has been reported for other packages, e.g.:

Checkpoint:

  • [1] Make sure you have permission to execute the script

# Recursively, for the whole deployment folder
chmod -R 0777 deployment-folder/

# Or for the script file only
chmod 0777 deployment-folder/path/to/file/execution.py

  • [2] Narrow down where the error occurs by running the script line by line, from top to bottom, in interactive mode.

  • [3] Then check the related package with a verbose import

# Syntax:
# $ python -vc "import <package>"
python -vc "import polars as pl"

  • [4] Check your CPU

lscpu
# Architecture:          x86_64
# CPU op-mode(s):        32-bit, 64-bit
# Byte Order:            Little Endian
# CPU(s):                16
# On-line CPU(s) list:   0-15
# Thread(s) per core:    1
# Core(s) per socket:    1
# Socket(s):             16
# NUMA node(s):          2
# Vendor ID:             GenuineIntel
# CPU family:            6
# Model:                 45
# Model name:            Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
# Stepping:              7
# CPU MHz:               2400.000
# BogoMIPS:              4800.00
# Hypervisor vendor:     VMware
# Virtualization type:   full
# L1d cache:             32K
# L1i cache:             32K
# L2 cache:              256K
# L3 cache:              20480K
# NUMA node0 CPU(s):     0-7
# NUMA node1 CPU(s):     8-15
# Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm ssbd ibrs ibpb stibp tsc_adjust arat spec_ctrl intel_stibp flush_l1d arch_capabilities
  • [5] Find the build of the package that suits that CPU.

In this case, install polars-lts-cpu instead, following the comment in Polars issue 2922.

requirement.txt
polars-lts-cpu==0.19.8
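Step [4] can be scripted. In the lscpu output above the flags include avx but not avx2, which fits the symptom. The sketch below compares a flags line against a set of required flags; `ASSUMED_REQUIRED` is an assumption chosen for illustration, not the authoritative list for any polars build, and `missing_cpu_flags` and `read_local_flags` are hypothetical helpers.

```python
from pathlib import Path

# Assumption for illustration only; check the polars installation notes
# for the flags a given build really needs.
ASSUMED_REQUIRED = {"avx", "avx2", "fma"}

# Shortened excerpt of the "Flags" line from the lscpu output above
SAMPLE_FLAGS = "Flags: fpu sse sse2 ssse3 sse4_1 sse4_2 popcnt aes xsave avx hypervisor"


def missing_cpu_flags(flags_line, required):
    """Return the required flags that are absent from a 'Flags: ...' line."""
    available = set(flags_line.split(":", 1)[-1].split())
    return set(required) - available


def read_local_flags(cpuinfo=Path("/proc/cpuinfo")):
    """Return the first 'flags' line of /proc/cpuinfo, or None if unavailable."""
    if not cpuinfo.exists():
        return None
    for line in cpuinfo.read_text().splitlines():
        if line.startswith("flags"):
            return line
    return None


missing_on_sample = missing_cpu_flags(SAMPLE_FLAGS, ASSUMED_REQUIRED)
print(sorted(missing_on_sample))

local_flags = read_local_flags()
if local_flags is not None:
    print(sorted(missing_cpu_flags(local_flags, ASSUMED_REQUIRED)))
```

A non-empty result on the deployment machine is a hint to try polars-lts-cpu before digging further.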

Other reference:

Annotated

content = content.with_columns(pl.lit(index.name).cast(pl.Utf8).alias("code"))
content = content.with_columns(pl.lit(url).cast(pl.Utf8).alias("reference_url"))
content = content.with_columns(pl.lit(lang.code).cast(pl.Utf8).alias("language"))

https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.apply.html#polars.Expr.apply