Start to Be a Crawler

Overview

Crawling is now used by many companies to gather information from the market and to keep up with constant change. The data comes in many forms, and a lot of it is still not being captured.

It is a bit like finding a diamond in the sea: if you narrow down the area to search and send what you find to the right place, it becomes valuable.

Some crawling use cases:

  1. Scanning e-commerce product prices across websites and comparing them.

  2. Collecting news articles on related topics.

  3. Collecting sale prices in real estate.

So being able to crawl gives a data person the capacity to capture as much of the world's data as possible.

Python Crawler

I am not trying to make you a professional in a day. I just want to show you the requirements you need to meet and the basic process you need to follow.

To do that, Python offers a lot of support through useful libraries, and we need a little knowledge about them.

Libraries related to the crawl process

| Feature | Library | Use-case | Description |
| --- | --- | --- | --- |
| API, HTML, … | requests, BeautifulSoup | | |
| Transformation | re, pandas, math, datetime, … | | |
| Write to DB | SQLAlchemy, psycopg2 | | |
| Logging | logging | | |
| Scheduling | Airflow, Argo | | |

There are a lot of concepts you need to pick up before you can master the libraries above.
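To see how the pieces fit together, here is a minimal sketch of a fetch → transform → store → log run. The URL, the table name, and the database connection string are placeholders I made up for illustration, not a real data source:

```python
import logging
from datetime import datetime, timezone

import pandas as pd
import requests
from sqlalchemy import create_engine

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawler")

# Placeholder endpoint -- replace with the API or page you actually crawl.
URL = "https://example.com/api/prices"


def run_once():
    # 1. Fetch: call the source (assumed here to return a JSON list of records).
    response = requests.get(URL, timeout=30)
    response.raise_for_status()
    records = response.json()

    # 2. Transform: normalise into a DataFrame and stamp the crawl time.
    df = pd.DataFrame(records)
    df["crawled_at"] = datetime.now(timezone.utc)

    # 3. Write to DB: append into a Postgres table via SQLAlchemy + psycopg2.
    engine = create_engine("postgresql+psycopg2://user:password@localhost/crawl")
    df.to_sql("prices", engine, if_exists="append", index=False)

    # 4. Log: record how much was loaded so the run can be monitored.
    logger.info("Loaded %d rows from %s", len(df), URL)


if __name__ == "__main__":
    run_once()
```

Scheduling is the remaining piece: you would wrap `run_once` in an Airflow task or a cron job so it runs on a regular interval.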

The process depends on the case, but it usually follows the same structure:

  1. Define the data demand and the sources: what kind of input and output do you need?

Define: Where is the data? What does the data need to look like?

Website: https://www.hnx.vn/

Define: Is there an API?

Press F12 to open DevTools >> select the Network tab.

Can you reverse-engineer the API?

Or does the page serve only HTML elements?

If so, use requests with BeautifulSoup to parse the HTML (see the sketch below).
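When there is no usable API and the page only serves HTML, the common fallback is requests plus BeautifulSoup. A minimal sketch, assuming the data sits in a plain `<table>` on the page; the URL and the selector are placeholders, so adjust them to the real page:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder page -- substitute the real listing page you inspected in DevTools.
URL = "https://example.com/listing"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Assume the data lives in the first <table>; change the selector for the real layout.
rows = []
for tr in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip header rows that only contain <th>
        rows.append(cells)

print(rows[:5])
```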

There are a lot of exercises you can walk through.

Exercise:

Sample: getting the list of tickers on HOSE.

Let's say we start from the exchange's official website:

Official website: https://www.hsx.vn/
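A hedged starting point for this exercise: open https://www.hsx.vn/ with DevTools on the Network tab, watch for the XHR/JSON request that loads the ticker list, and replay that request with requests. The endpoint and query parameters below are purely hypothetical placeholders; copy the real ones from the request DevTools shows you.

```python
import requests

# Hypothetical endpoint -- replace it with the XHR request you see in the
# Network tab when the ticker list on https://www.hsx.vn/ loads.
API_URL = "https://www.hsx.vn/some/ticker/endpoint"

# Hypothetical query parameters -- copy the real ones from DevTools.
params = {"pageIndex": 1, "pageSize": 100}

headers = {
    # A descriptive User-Agent; some sites reject requests without one.
    "User-Agent": "learning-crawler/0.1 (contact: you@example.com)",
}

response = requests.get(API_URL, params=params, headers=headers, timeout=30)
response.raise_for_status()

payload = response.json()
# Inspect the structure first, then pull out the ticker symbols you need.
print(type(payload), str(payload)[:500])
```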

Some notes

Be a good citizen on the internet.

Try not to get caught or blocked.
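Concretely, being a good citizen usually means checking robots.txt, identifying yourself honestly, and throttling your requests. A minimal sketch using only the standard library and requests; the site and URLs are placeholders:

```python
import time
from urllib import robotparser

import requests

BASE = "https://example.com"          # placeholder site
USER_AGENT = "learning-crawler/0.1"   # identify yourself honestly

# 1. Respect robots.txt: ask before you fetch.
robots = robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()

urls = [f"{BASE}/page/{i}" for i in range(1, 4)]

for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        print(f"robots.txt disallows {url}, skipping")
        continue

    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    print(url, response.status_code)

    # 2. Throttle: a pause between requests keeps the load on the server low.
    time.sleep(2)
```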