BR_PROCON
Description
Crawler developed in Python 3 to extract the information contained in a PDF at https://southernchemical.com/historical-pricing/
- Crawler type: PDF (camelot)
- Does it follow the template:
- Yes
- No
- Requires some kind of authentication:
- Yes (If Yes, describe)
- No
- URL retrieval:
- Yes (If Yes, describe)
- No
Reference Video
Solution
Architecture
Crawler flow
The crawler accesses the page, extracts the relevant data, and saves it in a pandas DataFrame. The program is set up to extract information such as title, description, and links, but can easily be adapted to extract other types of data.
- read_excel: Read the Excel file and extract the ID and Name information.
- start_urls: Generate an initial list of URLs to be scraped.
- run_crawler: Run the Scrapy crawler, passing the initial URLs as a parameter, and collect the resulting URLs as output.
- read_data: Read the data from the URLs.
- transform_data: Transform the data read from the URLs.
- validate: Validate the data with the Costdrivers platform.
- upload: Upload the data to the Costdrivers platform.
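The order of the steps above can be sketched as a simple pipeline. Everything below is a hypothetical stand-in, not the real crawler: the class name, the stub bodies, and the sample values are invented for illustration, and the actual extraction relies on Scrapy and camelot rather than these stubs.

```python
import pandas as pd


class CrawlerFlowSketch:
    """Hypothetical stand-in illustrating the crawler flow described above."""

    def __init__(self) -> None:
        self.steps_run: list[str] = []

    def read_excel(self) -> pd.DataFrame:
        # Stand-in: the real step reads ID and Name from the Excel file.
        self.steps_run.append("read_excel")
        return pd.DataFrame({"ID": [1], "Name": ["Product A"]})

    def start_urls(self, df_excel: pd.DataFrame) -> list[str]:
        self.steps_run.append("start_urls")
        return ["https://southernchemical.com/historical-pricing/"]

    def run_crawler(self, urls: list[str]) -> list[str]:
        # Stand-in: the real step runs a Scrapy spider over the URLs.
        self.steps_run.append("run_crawler")
        return urls

    def read_data(self, urls: list[str]) -> pd.DataFrame:
        self.steps_run.append("read_data")
        return pd.DataFrame({"price": [310.0]})

    def transform_data(self, df: pd.DataFrame) -> pd.DataFrame:
        self.steps_run.append("transform_data")
        return df

    def validate(self, df: pd.DataFrame) -> None:
        self.steps_run.append("validate")

    def upload(self, df: pd.DataFrame) -> None:
        self.steps_run.append("upload")

    def run(self) -> pd.DataFrame:
        # Chain the steps in the order listed above.
        df_excel = self.read_excel()
        urls = self.start_urls(df_excel)
        collected = self.run_crawler(urls)
        df = self.read_data(collected)
        df = self.transform_data(df)
        self.validate(df)
        self.upload(df)
        return df
```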
Differentials
- Yes (If Yes, add a flowchart)
- No
Installation
To run the code locally:
- 1 - Clone the repository:
git clone https://github.com/github/costdrivers.git
- 2 - Install the libraries needed to run the code with the command:
pip install -r requirements.txt
Python version: ^3.10
Use
To run the crawler, simply call the .py file. The program will access the page, extract the relevant information, and display it in a pandas DataFrame.
INTLSouthernChemicalCrawler Objects
class INTLSouthernChemicalCrawler()
Class for crawling INTL Southern Chemical Excel data.
Attributes
- df_excel (pd.DataFrame): DataFrame with the Excel data.
- df_results (pd.DataFrame): DataFrame with the columns that should be sent to the platform.
__init__
def __init__(filename: str = "INTL_SouthernChemical.xlsx") -> None
Initializes the INTLSouthernChemicalCrawler with default attributes.
Arguments:
filename
str, optional - The name of the Excel file to read. Defaults to "INTL_SouthernChemical.xlsx".
read_excel
def read_excel() -> None
Read the Excel file and extract the relevant information, such as the ID and name.
Returns:
None.
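A minimal sketch of the column selection this method is described to perform. In the real crawler the frame would come from `pd.read_excel("INTL_SouthernChemical.xlsx")`; here it is passed in so the sketch stays self-contained, and the column names `ID` and `Name` are assumptions taken from the flow description above.

```python
import pandas as pd


def read_excel_sketch(df_excel: pd.DataFrame) -> pd.DataFrame:
    # Keep only the ID and Name columns (column names are assumptions);
    # any other columns in the spreadsheet are dropped.
    return df_excel[["ID", "Name"]].copy()
```

For example, a frame with columns `ID`, `Name`, and `Extra` would come back with only `ID` and `Name`.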
transform_data
def transform_data() -> pd.DataFrame
Transforms the dataframes extracted from the Excel files and returns a single dataframe.
Returns:
pd.DataFrame
- A dataframe that contains the transformed data from the Excel files.
validate
def validate() -> None
Validate the data with the Costdrivers platform.
upload_info
def upload_info() -> None
Upload the data to the Costdrivers platform.
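The Costdrivers endpoint, authentication, and field names are not documented in this README, so the sketch below only shows the payload-building half of an upload: converting `df_results` to JSON-serializable records. The actual HTTP call (e.g. with `requests.post`) is deliberately left out rather than invented.

```python
import pandas as pd


def build_upload_payload(df_results: pd.DataFrame) -> list[dict]:
    # One dict per row, ready to be serialized as the body of an HTTP POST.
    return df_results.to_dict(orient="records")
```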
Scraper Objects
class Scraper()
Initialize the Scraper class.
Attributes
- df (Optional[pd.DataFrame]): The extracted DataFrame from the PDF table.
- _month (Optional[str]): The name of the current month.
- _year (Optional[int]): The current year.
- _current_date (datetime.datetime): The current date and time.
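A hedged sketch of how the `_month` and `_year` attributes can be derived from `_current_date`; the helper name and the `now` parameter are invented for illustration.

```python
from datetime import datetime
from typing import Optional, Tuple


def current_month_year(now: Optional[datetime] = None) -> Tuple[str, int]:
    # Default to the current date and time, as _current_date does.
    now = now or datetime.now()
    # %B gives the full month name, e.g. "May".
    return now.strftime("%B"), now.year
```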
get_pdf_table
def get_pdf_table() -> pd.DataFrame
Fetches table data from a PDF URL and returns it as a pandas dataframe.
Returns:
A pandas dataframe containing the table data.
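Tables extracted with camelot typically come back with a default integer header and the real column names in row 0. The sketch below shows the usual cleanup on an in-memory frame standing in for `tables[0].df`; the PDF download itself is omitted, and the assumption that this PDF's first row holds the headers is not confirmed by the README.

```python
import pandas as pd


def promote_header(raw: pd.DataFrame) -> pd.DataFrame:
    # Use row 0 as the column names, stripping stray whitespace,
    # then drop that row and reindex the remaining data rows.
    cleaned = raw.copy()
    cleaned.columns = [str(c).strip() for c in cleaned.iloc[0]]
    return cleaned.iloc[1:].reset_index(drop=True)
```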