Introduction
Overview
Crawling is now used by many companies to gather information from the market and to keep up with constant change. The data comes in many forms, and a lot of it has not been captured yet.
It is a bit like looking for a diamond in the sea: if you narrow down the search area and deliver what you find to the right place, it becomes valuable.
Some uses of crawling:
- Scan e-commerce product prices across websites and compare them.
- Collect news articles on related topics from newspapers.
- Collect sale prices for real estate.
So being able to crawl is a useful skill for a data person: it lets you collect as much data as possible from the world around us.
Python Crawler
I am not trying to make you a professional for day-to-day work. I will just show you the requirements you need to meet and the basic process to follow.
Python supports this well with many useful libraries, and we need a little knowledge of them.
Table of libraries related to the crawl process
Task | Libraries | Use-case | Description |
---|---|---|---|
API, HTML,… | requests, BeautifulSoup | --- | --- |
Transformation | re, pandas, math, datetime,… | --- | --- |
Write DB | SQLAlchemy, psycopg2 | --- | --- |
Log | logging | --- | --- |
Schedule | Airflow, Argo | --- | --- |
There are a lot of concepts to absorb before you master the libraries above.
The process depends on the source, but it keeps the same structure and can go as follows:
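To make the table more concrete, here is a minimal sketch of how the stages fit together. It uses standard-library stand-ins so it runs without installing anything: `sqlite3` in place of SQLAlchemy/psycopg2, and a hard-coded string in place of a `requests` call. The function names, the `prices` table, and the sample lines are hypothetical, and scheduling is left out.

```python
import logging
import re
import sqlite3
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")

# Hypothetical raw payload; in a real crawl this would be the text of an HTTP response.
RAW = "Product A: 1,200,000 VND\nProduct B: 950,000 VND\nbroken line"


def transform(raw):
    """Turn raw text lines into (name, price, crawled_at) rows, skipping bad lines."""
    crawled_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for line in raw.splitlines():
        match = re.match(r"^(.+?):\s*([\d,]+)\s*VND$", line)
        if match is None:
            log.warning("skipping unparseable line: %r", line)
            continue
        name = match.group(1)
        price = int(match.group(2).replace(",", ""))
        rows.append((name, price, crawled_at))
    return rows


def write_db(conn, rows):
    """Create the (hypothetical) prices table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices (name TEXT, price INTEGER, crawled_at TEXT)"
    )
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    conn.commit()
    log.info("wrote %d rows", len(rows))


def run(raw=RAW):
    """Run transform -> write DB against an in-memory database and read it back."""
    conn = sqlite3.connect(":memory:")
    write_db(conn, transform(raw))
    return conn.execute("SELECT name, price FROM prices ORDER BY name").fetchall()
```

Calling `run()` pushes the sample text through every stage; swapping `RAW` for a fetched page and `:memory:` for a real database is where the libraries from the table would come in.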
- Define the data demand and sources: what kind of input/output?
Define: where is the data? What should the data look like?
Website: https://www.hnx.vn/
Define: does the site have an API?
Press F12 (Window + F12) to open the developer tools >> select Network
Can you reverse-engineer the API?
Or does it only have HTML elements?
Use
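The checklist above comes down to one decision: if the Network tab shows a request that returns JSON, call that endpoint directly; otherwise fall back to parsing HTML. Here is a minimal sketch of that decision, assuming you already have the response's `Content-Type` header and body as strings. `classify_response`, `extract_records`, the `data` key, and the sample payloads are all hypothetical; real endpoints name their fields differently.

```python
import json


def classify_response(content_type, body):
    """Return 'json' if the response looks like an API payload, else 'html'."""
    if "json" in content_type.lower():
        return "json"
    # Some servers send JSON with a generic content type, so try parsing it.
    try:
        json.loads(body)
        return "json"
    except ValueError:
        return "html"


def extract_records(body, key="data"):
    """Pull a list of records out of a JSON body.

    Assumption: the records are either the top-level list or sit under `key`.
    """
    payload = json.loads(body)
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        return payload.get(key, [])
    return []
```

With `requests`, the two inputs would be `response.headers["Content-Type"]` and `response.text`. If `classify_response` says `html`, the page has to be parsed element by element instead.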
There are a lot of exercises you can walk through.
Exercise:
Sample: get the tickers listed on HOSE
Let's say, based on
Official website: https://www.hsx.vn/
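One possible starting point for the exercise, assuming the ticker list is only available as an HTML table and the ticker sits in the first cell of each row. The markup in `SAMPLE_HTML` is invented for illustration and is not the actual structure of hsx.vn, so inspect the real page before adapting it. The three-character ticker rule is also an assumption. The sketch uses the standard-library `html.parser` so it runs anywhere; BeautifulSoup would be the more usual choice.

```python
import re
from html.parser import HTMLParser

# Invented markup with placeholder tickers; the real page will look different.
SAMPLE_HTML = """
<table>
  <tr><th>Ticker</th><th>Company</th></tr>
  <tr><td> aaa </td><td>Example Company A</td></tr>
  <tr><td>BBB</td><td>Example Company B</td></tr>
  <tr><td>n/a</td><td>Not a ticker</td></tr>
</table>
"""


class FirstCellParser(HTMLParser):
    """Collect the text of the first <td> in every table row."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._cell_index = -1  # position of the current <td> within its row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1
        elif tag == "td":
            self._cell_index += 1
            self._in_td = True
            if self._cell_index == 0:
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._cell_index == 0:
            self.cells[-1] += data


def extract_tickers(html):
    """Return unique, upper-cased tickers in page order."""
    parser = FirstCellParser()
    parser.feed(html)
    tickers = []
    for cell in parser.cells:
        candidate = cell.strip().upper()
        # Assumption: a ticker is exactly three letters or digits.
        if re.fullmatch(r"[A-Z0-9]{3}", candidate) and candidate not in tickers:
            tickers.append(candidate)
    return tickers
```

The cleaning step (strip, upper-case, validate, de-duplicate) is the part most likely to carry over unchanged once the real markup is known.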
Some notes
Be a good guy on the internet
Try not to get caught at
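Being a good guy on the internet usually means respecting `robots.txt`, identifying yourself, and pacing your requests. Below is a minimal sketch using the standard-library `urllib.robotparser`. The robots.txt content, the user-agent string, and the URLs are made up, and the one-second default delay is an assumption rather than a rule.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical name; use one that identifies you

# Made-up robots.txt; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set without touching the network."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_plan(rules, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) according to the rules."""
    allowed = [url for url in urls if rules.can_fetch(USER_AGENT, url)]
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    return allowed, delay


def crawl(urls, fetch, rules, sleep=time.sleep):
    """Fetch only allowed URLs, pausing between requests.

    `fetch` is whatever function downloads one URL; it is passed in so the
    pacing logic can be tried out without real HTTP calls.
    """
    allowed, delay = polite_plan(rules, urls)
    results = []
    for url in allowed:
        results.append(fetch(url))
        sleep(delay)
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the politeness rules separate from the download code, so the same loop works with `requests` or any other client.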