Skip to content

[WIP] SEC Filling Metadata

Overview

SEC Filling metadata has been published by FinnHub through Kaggle on website from 1994 to 2020

Prerequisite

Tool Description Access
Kaggle CLI Interactive with kaggle, implemted for Python Official GitHub
fsspec Filesystem interfaces for Python Documentation

Method

[1] Declare environment

Declare requirements both for selected phase: production or development-only

requirements-dev.txt
kaggle
devtools==0.11.0
requirements.txt
fsspec==2023.4.0
polars==0.16.6

then execute to set up

Execute Vitural Environment
# Set up vitural
python -m venv venv
source venv/Scripts/activate;
python -m pip install --upgrade pip==23.0.1;
python -m pip install -r requirements.txt;
python -m pip install -r requirements-dev.txt;

[2] Download dataset:

Using command from copy from Kaggle UI with path and unzip support.

Download dataset from Kaggle
# Download
kaggle datasets download -d finnhub/sec-filings -p temp/ --unzip;

You can see the progress with the total resources downloaded (in MB) and the download rate.

Download=

[2] Using fsspec to interactive with folder and multiple read using polars

*) Declare dependencies to working with

# Global
import os

# External
from fsspec.implementations.local import LocalFileSystem
import polars as pl

a) Get CSV file path in the download folder

if __name__ == "__main__":

    # Declare
    TMP_PATH: str = os.path.join("temp")

    # Create Local Interface
    fs = LocalFileSystem(auto_mkdir=False)

    # List all CSV file in the target folder
    CSV_PATH: list[str] = fs.glob(path="{}/*.csv".format(TMP_PATH))
    # Example output:
    # >>> from devtools import debug
    # >>> debug(CSV_PATH)
    # [
    #     '<path>/temp/1994.QTR1.csv',
    #     '<path>/temp/1994.QTR2.csv',
    #     '<path>/temp/1994.QTR3.csv',
    #     '<path>/temp/1994.QTR4.csv',
    #     '<path>/temp/1995.QTR1.csv',
    #     '<path>/temp/1995.QTR2.csv',
    #     '<path>/temp/1995.QTR3.csv',
    #     '<path>/temp/1995.QTR4.csv',
    #     ...
    # ]

    # See what total files has been downloaded
    # >>> len(CSV_PATH)
    # 115

b) Read by read_csv and merge it together by concat

    # Declare
    SEC_METADATA: pl.DataFrame = pl.DataFrame([])

    # Loop
    # a) Read: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html
    # b) then concat: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.concat.html#polars.concat
    for path in CSV_PATH:
        content: pl.DataFrame = pl.read_csv(path, has_header=True)
        SEC_METADATA = pl.concat(items=[SEC_METADATA, content], how="diagonal")

There are 3 strategy of concat by polars:

flowchart LR
    polars_method[Polars Concat Method]
    polars_method --> vertical[vertical] --> vd[applies multiple vstack operations]
    polars_method --> diagonal[diagonal] --> dd[finds a union between the column schemas and fills missing column values with null]
    polars_method --> horizontal[horizontal] --> hd[stacks Series from DataFrames horizontally and fills with nulls if the lengths dont match]

Download=

As you can see, total of 115 files yield 21M record rows of 8 columns

c) Basic Look on dataframe

    # Inspect schema
    SEC_METADATA.schema
    # {'symbol': Utf8, 'cik': Int64, 'form': Utf8, 'access_number': Utf8, 'filed_date': Utf8, 'accepted_date': Utf8, 'report_url': Utf8, 'filing_url': Utf8}

    # Shape: (Rows, Columns)
    SEC_METADATA.shape
    # (21474416, 8)

    # Total records
    SEC_METADATA.shape[0] * SEC_METADATA.shape[1]

    # Sample of dataset
    SEC_METADATA.glimpse()
    # Rows: 21474416
    # Columns: 8
    # $ symbol        <str> None, None, None, None, None, None, None, None, None, None # noqa: E501
    # $ cik           <i64> 100030, 100030, 100030, 100063, 100063, 100122, 100122, 100166, 100240, 100240 # noqa: E501
    # $ form          <str> SC 13G/A, SC 13G/A, SC 13G/A, SC 13G/A, SC 13G, SC 13D, SC 13D, S-8, 10-K, 424B2 # noqa: E501
    # $ access_number <str> 0000950152-94-000075, 0000950150-94-000279, 0000009749-94-000047, 0000759277-94-000089, 0000912057-94-000316, 0000100122-94-000010, 0000100122-94-000012, 0000950118-94-000015, 0000950144-94-000787, 0000950144-94-000139 # noqa: E501
    # $ filed_date    <str> 1994-02-07, 1994-02-10, 1994-02-14, 1994-02-14, 1994-02-10, 1994-03-23, 1994-03-24, 1994-02-11, 1994-03-31, 1994-01-31 # noqa: E501
    # $ accepted_date <str> 1994-02-07 00:00:00, 1994-02-10 00:00:00, 1994-02-14 00:00:00, 1994-02-14 00:00:00, 1994-02-10 00:00:00, 1994-03-23 00:00:00, 1994-03-24 00:00:00, 1994-02-11 00:00:00, 1994-03-31 00:00:00, 1994-01-31 00:00:00 # noqa: E501
    # $ report_url    <str> https://www.sec.gov/Archives/edgar/data/918711/0000950152-94-000075.txt, https://www.sec.gov/Archives/edgar/data/732812/0000950150-94-000279.txt, https://www.sec.gov/Archives/edgar/data/9749/0000009749-94-000047.txt, https://www.sec.gov/Archives/edgar/data/34066/0000759277-94-000089.txt, https://www.sec.gov/Archives/edgar/data/36104/0000912057-94-000316.txt, https://www.sec.gov/Archives/edgar/data/716634/0000100122-94-000010.txt, https://www.sec.gov/Archives/edgar/data/716634/0000100122-94-000012.txt, https://www.sec.gov/Archives/edgar/data/100166/0000950118-94-000015.txt, https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt, https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000139.txt # noqa: E501
    # $ filing_url    <str> https://www.sec.gov/Archives/edgar/data/918711/000095015294000075/0000950152-94-000075-index.html, https://www.sec.gov/Archives/edgar/data/732812/000095015094000279/0000950150-94-000279-index.html, https://www.sec.gov/Archives/edgar/data/9749/000000974994000047/0000009749-94-000047-index.html, https://www.sec.gov/Archives/edgar/data/34066/000075927794000089/0000759277-94-000089-index.html, https://www.sec.gov/Archives/edgar/data/36104/000091205794000316/0000912057-94-000316-index.html, https://www.sec.gov/Archives/edgar/data/716634/000010012294000010/0000100122-94-000010-index.html, https://www.sec.gov/Archives/edgar/data/716634/000010012294000012/0000100122-94-000012-index.html, https://www.sec.gov/Archives/edgar/data/100166/000095011894000015/0000950118-94-000015-index.html, https://www.sec.gov/Archives/edgar/data/100240/000095014494000787/0000950144-94-000787-index.html, https://www.sec.gov/Archives/edgar/data/100240/000095014494000139/0000950144-94-000139-index.html # noqa: E501

d) Data

Field Type Mandatory Description
acceptedDate string yes Date & time when the filing was accepted by the SEC (format: %Y-%m-%d %H:%M:%S).
accessNumber string no The unique accession (access) number for the filing.
cik string no Central Index Key for the company.
filedDate string no The actual filed timestamp for the document in the same format.
filingUrl string no Web link to the original filing on SEC or other repository.
form string no SEC form type (e.g., 10-K, 8-K, etc.).
reportUrl string no URL to the generated sentiment report (depends on API & subscription).
symbol string no The equity ticker symbol the filing is associated with.

e) Answer some question

Total

Reference

https://finnhub.io/docs/api/filings-sentiment

https://seaborn.pydata.org/examples/timeseries_facets.html