
COL_FENAVI

Description

Crawler developed in Python 3 to extract the information contained in a PDF on https://southernchemical.com/historical-pricing/

  • Crawler type: PDF (camelot)
  • Should it follow the template: Yes / No
  • Requires some kind of authentication: Yes / No (if yes, describe)
  • URL retrieval: Yes / No (if yes, describe)

Reference Video

Solution

Architecture

(Architecture diagram)

Crawler flow

The crawler accesses the page, extracts the relevant data, and saves it in a pandas DataFrame. The program is set up to extract information such as titles, descriptions, and links, but it can easily be adapted to extract other types of data.

(Crawler flow diagram)

  • read_excel: Read the Excel file and extract the ID and Name information.
  • start_urls: Generate an initial list of URLs to be scraped.
  • run_crawler: Run the scrapy crawler, passing the initial URLs as a parameter, and collect the resulting URLs as output (see the scrapy sketch after this list).
  • read_data: Read the data from the collected URLs.
  • transform_data: Transform the data read from each URL.
  • validate: Validate the data with the Costdrivers platform.
  • upload: Upload the data to the Costdrivers platform.
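
The start_urls / run_crawler steps follow the usual scrapy pattern. The sketch below is illustrative only: the spider class, its parsing logic, and the hard-coded start URL are placeholders, not the project's actual spider.

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class LinkSpider(scrapy.Spider):
        """Placeholder spider: yields the href of every link found on the start pages."""
        name = "link_spider"

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                yield {"url": response.urljoin(href)}

    def run_crawler(urls):
        """Run the placeholder spider against the given start URLs."""
        process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
        process.crawl(LinkSpider, start_urls=urls)  # initial URLs passed in as a parameter
        process.start()  # blocks until the crawl finishes

    run_crawler(["https://southernchemical.com/historical-pricing/"])

In the real crawler the yielded URLs would be collected as the output of this step and handed to read_data.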

Differentials

  • Yes / No
  • If yes, add a flowchart

Installation

To run the code locally:

  • 1 - Clone the repository:
    git clone https://github.com/github/costdrivers.git
    
  • 2 - Install the libraries needed to run the code:
    pip install -r requirements.txt

Python version: ^3.10

Use

To run the crawler, simply call the .py file: the program will access the page, extract the relevant information, and display it in a pandas DataFrame.
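
If you prefer to drive the crawler from another script or an interactive session instead of calling the .py file directly, the flow looks roughly like this. The import is a placeholder, since the actual module name is not given in this README; the methods used are the ones documented below.

    # Placeholder import: replace "crawler" with the project's actual module name.
    from crawler import INTLSouthernChemicalCrawler

    scc = INTLSouthernChemicalCrawler()  # defaults to filename="INTL_SouthernChemical.xlsx"
    scc.read_excel()                     # extract the ID and Name information
    df = scc.transform_data()            # single DataFrame in the platform layout
    print(df.head())                     # display the extracted data
    scc.validate()                       # validate against the Costdrivers platform
    scc.upload_info()                    # upload to the Costdrivers platform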

INTLSouthernChemicalCrawler Objects

class INTLSouthernChemicalCrawler()

Class for crawling INTL Southern Chemical Excel data.

Attributes

  • df_excel: pd.DataFrame - DataFrame with the Excel data.
  • df_results: pd.DataFrame - DataFrame with the columns that should be sent to the platform.

__init__

def __init__(filename: str = "INTL_SouthernChemical.xlsx") -> None

Initializes the INTLSouthernChemicalCrawler with default attributes.

Arguments:

  • filename str, optional - The name of the Excel file to read. Defaults to "INTL_SouthernChemical.xlsx".
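
For example, passing the workbook name explicitly (equivalent to the default):

    crawler = INTLSouthernChemicalCrawler(filename="INTL_SouthernChemical.xlsx")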

read_excel

def read_excel() -> None

Reads the Excel file and extracts the relevant information, such as the ID and name of each entry.

Returns:

None.
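
Internally this amounts to a pandas read of the workbook. A minimal sketch, assuming "ID" and "Name" columns (the column names are an assumption, and the real method stores the result on df_excel rather than returning it):

    import pandas as pd

    # Sketch only: keep just the identifying columns from the workbook.
    df_excel = pd.read_excel("INTL_SouthernChemical.xlsx")[["ID", "Name"]]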

transform_data

def transform_data() -> pd.DataFrame

Transforms the DataFrames extracted from the Excel files and returns a single DataFrame.

Returns:

  • pd.DataFrame - A DataFrame that contains the transformed data from the Excel files.
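
The general pattern here is a pandas concatenation of the per-file tables into one frame. A rough sketch under that assumption (the actual column mapping is not documented in this README):

    import pandas as pd

    def transform_data(frames: list[pd.DataFrame]) -> pd.DataFrame:
        """Concatenate per-file tables and tidy the column labels (sketch only)."""
        combined = pd.concat(frames, ignore_index=True)
        combined.columns = [str(c).strip() for c in combined.columns]
        return combined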

validate

def validate() -> None

Validates the data with the Costdrivers platform.

upload_info

def upload_info() -> None

Uploads the data to the Costdrivers platform.

Scraper Objects

class Scraper()

Initialize the Scraper class.

Attributes

  • df: Optional[pd.DataFrame] - The DataFrame extracted from the PDF table.
  • _month: Optional[str] - The name of the current month.
  • _year: Optional[int] - The current year.
  • _current_date: datetime.datetime - The current date and time.

get_pdf_table

def get_pdf_table() -> pd.DataFrame

Fetches table data from a PDF URL and returns it as a pandas DataFrame.

Returns:

A pandas DataFrame containing the table data.
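
Since this is a camelot-based PDF crawler, the step likely resembles the sketch below. The PDF URL is a placeholder (this README only names the page that hosts the PDF), and the pages setting is an assumption; the project's actual method takes no arguments and presumably resolves the URL internally.

    import tempfile

    import camelot
    import pandas as pd
    import requests

    def get_pdf_table(pdf_url: str) -> pd.DataFrame:
        """Download the PDF and return every detected table as one DataFrame (sketch)."""
        response = requests.get(pdf_url, timeout=60)
        response.raise_for_status()
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
            tmp.write(response.content)               # persist the PDF so camelot can read it
            path = tmp.name
        tables = camelot.read_pdf(path, pages="all")  # default "lattice" flavor
        return pd.concat([t.df for t in tables], ignore_index=True)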

Updates