[WIP] SEC Filling Metadata¶
Overview¶
SEC Filling metadata has been published by FinnHub through Kaggle on website from 1994 to 2020
Prerequisite¶
| Tool | Description | Access |
|---|---|---|
| Kaggle CLI | Interactive with kaggle, implemted for Python | Official GitHub |
| fsspec | Filesystem interfaces for Python | Documentation |
Method¶
[1] Declare environment
Declare requirements both for selected phase: production or development-only
then execute to set up
# Set up vitural
python -m venv venv
source venv/Scripts/activate;
python -m pip install --upgrade pip==23.0.1;
python -m pip install -r requirements.txt;
python -m pip install -r requirements-dev.txt;
[2] Download dataset:
Using command from copy from Kaggle UI with path and unzip support.
# Download
kaggle datasets download -d finnhub/sec-filings -p temp/ --unzip;
You can see the progress with the total resources downloaded (in MB) and the download rate.
[2] Using fsspec to interactive with folder and multiple read using polars
*) Declare dependencies to working with
# Global
import os
# External
from fsspec.implementations.local import LocalFileSystem
import polars as pl
a) Get CSV file path in the download folder
if __name__ == "__main__":
# Declare
TMP_PATH: str = os.path.join("temp")
# Create Local Interface
fs = LocalFileSystem(auto_mkdir=False)
# List all CSV file in the target folder
CSV_PATH: list[str] = fs.glob(path="{}/*.csv".format(TMP_PATH))
# Example output:
# >>> from devtools import debug
# >>> debug(CSV_PATH)
# [
# '<path>/temp/1994.QTR1.csv',
# '<path>/temp/1994.QTR2.csv',
# '<path>/temp/1994.QTR3.csv',
# '<path>/temp/1994.QTR4.csv',
# '<path>/temp/1995.QTR1.csv',
# '<path>/temp/1995.QTR2.csv',
# '<path>/temp/1995.QTR3.csv',
# '<path>/temp/1995.QTR4.csv',
# ...
# ]
# See what total files has been downloaded
# >>> len(CSV_PATH)
# 115
b) Read by read_csv and merge it together by concat
# Declare
SEC_METADATA: pl.DataFrame = pl.DataFrame([])
# Loop
# a) Read: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html
# b) then concat: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.concat.html#polars.concat
for path in CSV_PATH:
content: pl.DataFrame = pl.read_csv(path, has_header=True)
SEC_METADATA = pl.concat(items=[SEC_METADATA, content], how="diagonal")
There are 3 strategy of concat by polars:
flowchart LR
polars_method[Polars Concat Method]
polars_method --> vertical[vertical] --> vd[applies multiple vstack operations]
polars_method --> diagonal[diagonal] --> dd[finds a union between the column schemas and fills missing column values with null]
polars_method --> horizontal[horizontal] --> hd[stacks Series from DataFrames horizontally and fills with nulls if the lengths dont match] As you can see, total of 115 files yield 21M record rows of 8 columns
c) Basic Look on dataframe
# Inspect schema
SEC_METADATA.schema
# {'symbol': Utf8, 'cik': Int64, 'form': Utf8, 'access_number': Utf8, 'filed_date': Utf8, 'accepted_date': Utf8, 'report_url': Utf8, 'filing_url': Utf8}
# Shape: (Rows, Columns)
SEC_METADATA.shape
# (21474416, 8)
# Total records
SEC_METADATA.shape[0] * SEC_METADATA.shape[1]
# Sample of dataset
SEC_METADATA.glimpse()
# Rows: 21474416
# Columns: 8
# $ symbol <str> None, None, None, None, None, None, None, None, None, None # noqa: E501
# $ cik <i64> 100030, 100030, 100030, 100063, 100063, 100122, 100122, 100166, 100240, 100240 # noqa: E501
# $ form <str> SC 13G/A, SC 13G/A, SC 13G/A, SC 13G/A, SC 13G, SC 13D, SC 13D, S-8, 10-K, 424B2 # noqa: E501
# $ access_number <str> 0000950152-94-000075, 0000950150-94-000279, 0000009749-94-000047, 0000759277-94-000089, 0000912057-94-000316, 0000100122-94-000010, 0000100122-94-000012, 0000950118-94-000015, 0000950144-94-000787, 0000950144-94-000139 # noqa: E501
# $ filed_date <str> 1994-02-07, 1994-02-10, 1994-02-14, 1994-02-14, 1994-02-10, 1994-03-23, 1994-03-24, 1994-02-11, 1994-03-31, 1994-01-31 # noqa: E501
# $ accepted_date <str> 1994-02-07 00:00:00, 1994-02-10 00:00:00, 1994-02-14 00:00:00, 1994-02-14 00:00:00, 1994-02-10 00:00:00, 1994-03-23 00:00:00, 1994-03-24 00:00:00, 1994-02-11 00:00:00, 1994-03-31 00:00:00, 1994-01-31 00:00:00 # noqa: E501
# $ report_url <str> https://www.sec.gov/Archives/edgar/data/918711/0000950152-94-000075.txt, https://www.sec.gov/Archives/edgar/data/732812/0000950150-94-000279.txt, https://www.sec.gov/Archives/edgar/data/9749/0000009749-94-000047.txt, https://www.sec.gov/Archives/edgar/data/34066/0000759277-94-000089.txt, https://www.sec.gov/Archives/edgar/data/36104/0000912057-94-000316.txt, https://www.sec.gov/Archives/edgar/data/716634/0000100122-94-000010.txt, https://www.sec.gov/Archives/edgar/data/716634/0000100122-94-000012.txt, https://www.sec.gov/Archives/edgar/data/100166/0000950118-94-000015.txt, https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt, https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000139.txt # noqa: E501
# $ filing_url <str> https://www.sec.gov/Archives/edgar/data/918711/000095015294000075/0000950152-94-000075-index.html, https://www.sec.gov/Archives/edgar/data/732812/000095015094000279/0000950150-94-000279-index.html, https://www.sec.gov/Archives/edgar/data/9749/000000974994000047/0000009749-94-000047-index.html, https://www.sec.gov/Archives/edgar/data/34066/000075927794000089/0000759277-94-000089-index.html, https://www.sec.gov/Archives/edgar/data/36104/000091205794000316/0000912057-94-000316-index.html, https://www.sec.gov/Archives/edgar/data/716634/000010012294000010/0000100122-94-000010-index.html, https://www.sec.gov/Archives/edgar/data/716634/000010012294000012/0000100122-94-000012-index.html, https://www.sec.gov/Archives/edgar/data/100166/000095011894000015/0000950118-94-000015-index.html, https://www.sec.gov/Archives/edgar/data/100240/000095014494000787/0000950144-94-000787-index.html, https://www.sec.gov/Archives/edgar/data/100240/000095014494000139/0000950144-94-000139-index.html # noqa: E501
d) Data
| Field | Type | Mandatory | Description |
|---|---|---|---|
| acceptedDate | string | yes | Date & time when the filing was accepted by the SEC (format: %Y-%m-%d %H:%M:%S). |
| accessNumber | string | no | The unique accession (access) number for the filing. |
| cik | string | no | Central Index Key for the company. |
| filedDate | string | no | The actual filed timestamp for the document in the same format. |
| filingUrl | string | no | Web link to the original filing on SEC or other repository. |
| form | string | no | SEC form type (e.g., 10-K, 8-K, etc.). |
| reportUrl | string | no | URL to the generated sentiment report (depends on API & subscription). |
| symbol | string | no | The equity ticker symbol the filing is associated with. |
e) Answer some question
Total

