BR_PROCON
Description
Crawler developed in Python 3 to extract the information contained in a PDF at https://southernchemical.com/historical-pricing/
- Crawler type: PDF (camelot)
- Does it follow the template:
- Yes
- No
- Requires some kind of authentication:
- Yes (If Yes, describe)
- No
- URL retrieval:
- Yes (If Yes, describe)
- No
Reference Video
Solution
Architecture
Crawler flow
The crawler accesses the page, extracts the relevant data, and saves it in a pandas DataFrame. The program is set up to extract information such as title, description, and links, but can easily be adapted to extract other types of data.
- read_excel: Read the Excel file and extract the ID and Name information.
- start_urls: Generate an initial list of URLs to be scraped.
- run_crawler: Run the Scrapy crawler, passing the initial URLs as a parameter, and collect the resulting URLs as output.
- read_data: Read the data from the URLs.
- transform_data: Transform the data read from the URLs.
- validate: Validate the data with the Costdrivers platform.
- upload: Upload the data to the Costdrivers platform.
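The order of the steps above can be sketched as a simple pipeline. Everything below is a hypothetical stand-in, not the real crawler: the class name, the stub bodies, and the sample values are invented for illustration, and the actual extraction relies on Scrapy and camelot rather than these stubs.

```python
import pandas as pd


class CrawlerFlowSketch:
    """Hypothetical stand-in illustrating the crawler flow described above."""

    def __init__(self) -> None:
        self.steps_run: list[str] = []

    def read_excel(self) -> pd.DataFrame:
        # Stand-in: the real step reads ID and Name from the Excel file.
        self.steps_run.append("read_excel")
        return pd.DataFrame({"ID": [1], "Name": ["Product A"]})

    def start_urls(self, df_excel: pd.DataFrame) -> list[str]:
        self.steps_run.append("start_urls")
        return ["https://southernchemical.com/historical-pricing/"]

    def run_crawler(self, urls: list[str]) -> list[str]:
        # Stand-in: the real step runs a Scrapy spider over the URLs.
        self.steps_run.append("run_crawler")
        return urls

    def read_data(self, urls: list[str]) -> pd.DataFrame:
        self.steps_run.append("read_data")
        return pd.DataFrame({"price": [310.0]})

    def transform_data(self, df: pd.DataFrame) -> pd.DataFrame:
        self.steps_run.append("transform_data")
        return df

    def validate(self, df: pd.DataFrame) -> None:
        self.steps_run.append("validate")

    def upload(self, df: pd.DataFrame) -> None:
        self.steps_run.append("upload")

    def run(self) -> pd.DataFrame:
        # Chain the steps in the order listed above.
        df_excel = self.read_excel()
        urls = self.start_urls(df_excel)
        collected = self.run_crawler(urls)
        df = self.read_data(collected)
        df = self.transform_data(df)
        self.validate(df)
        self.upload(df)
        return df
```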
Differentials
- Yes (If Yes, add a flowchart)
- No
Installation
To run the code locally:
- 1 - Clone the repository:
git clone https://github.com/github/costdrivers.git
- 2 - Install the libraries needed to run the code with the command:
pip install -r requirements.txt
Python version: ^3.10
Use
To run the crawler, simply call the .py file. The program will access the page, extract the relevant information, and display it in a pandas DataFrame.
INTLSouthernChemicalCrawler Objects
class INTLSouthernChemicalCrawler()
Class for crawling INTL Southern Chemical Excel data.
Attributes
- df_excel (pd.DataFrame): DataFrame with the Excel data.
- df_results (pd.DataFrame): DataFrame with the columns that should be sent to the platform.
__init__
def __init__(filename: str = "INTL_SouthernChemical.xlsx") -> None
Initializes the INTLSouthernChemicalCrawler with default attributes.
Arguments:
filename
str, optional - The name of the Excel file to read. Defaults to "INTL_SouthernChemical.xlsx".
read_excel
def read_excel() -> None
Read the Excel file and extract the relevant information, such as the ID and name.
Returns:
None.
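A minimal sketch of the column selection this method is described to perform. In the real crawler the frame would come from `pd.read_excel("INTL_SouthernChemical.xlsx")`; here it is passed in so the sketch stays self-contained, and the column names `ID` and `Name` are assumptions taken from the flow description above.

```python
import pandas as pd


def read_excel_sketch(df_excel: pd.DataFrame) -> pd.DataFrame:
    # Keep only the ID and Name columns (column names are assumptions);
    # any other columns in the spreadsheet are dropped.
    return df_excel[["ID", "Name"]].copy()
```

For example, a frame with columns `ID`, `Name`, and `Extra` would come back with only `ID` and `Name`.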
transform_data
def transform_data() -> pd.DataFrame
Transforms the dataframes extracted from the Excel files and returns a single dataframe.
Returns:
pd.DataFrame
- A dataframe that contains the transformed data from the Excel files.
validate
def validate() -> None
Validate the data with the Costdrivers platform.
upload_info
def upload_info() -> None
Upload the data to the Costdrivers platform.
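The Costdrivers endpoint, authentication, and field names are not documented in this README, so the sketch below only shows the payload-building half of an upload: converting `df_results` to JSON-serializable records. The actual HTTP call (e.g. with `requests.post`) is deliberately left out rather than invented.

```python
import pandas as pd


def build_upload_payload(df_results: pd.DataFrame) -> list[dict]:
    # One dict per row, ready to be serialized as the body of an HTTP POST.
    return df_results.to_dict(orient="records")
```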
Scraper Objects
class Scraper()
Initialize the Scraper class.
Attributes
- df (Optional[pd.DataFrame]): The extracted DataFrame from the PDF table.
- _month (Optional[str]): The name of the current month.
- _year (Optional[int]): The current year.
- _current_date (datetime.datetime): The current date and time.
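A hedged sketch of how the `_month` and `_year` attributes can be derived from `_current_date`; the helper name and the `now` parameter are invented for illustration.

```python
from datetime import datetime
from typing import Optional, Tuple


def current_month_year(now: Optional[datetime] = None) -> Tuple[str, int]:
    # Default to the current date and time, as _current_date does.
    now = now or datetime.now()
    # %B gives the full month name, e.g. "May".
    return now.strftime("%B"), now.year
```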
get_pdf_table
def get_pdf_table() -> pd.DataFrame
Fetches table data from a PDF URL and returns it as a pandas dataframe.
Returns:
A pandas dataframe containing the table data.
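Tables extracted with camelot typically come back with a default integer header and the real column names in row 0. The sketch below shows the usual cleanup on an in-memory frame standing in for `tables[0].df`; the PDF download itself is omitted, and the assumption that this PDF's first row holds the headers is not confirmed by the README.

```python
import pandas as pd


def promote_header(raw: pd.DataFrame) -> pd.DataFrame:
    # Use row 0 as the column names, stripping stray whitespace,
    # then drop that row and reindex the remaining data rows.
    cleaned = raw.copy()
    cleaned.columns = [str(c).strip() for c in cleaned.iloc[0]]
    return cleaned.iloc[1:].reset_index(drop=True)
```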