Learning Data Engineering

Jul 4, 2024

Data Engineering


I tweeted some time ago that I am interested in learning data engineering because of my frustrations at the startups where I have worked.


A common problem I encounter is that data is scattered everywhere, which makes it tough for a product manager like me to use it to make good decisions.


So, to empathise more with the situation, I decided to learn about the role through the IBM data engineering course suggested by a friend.




The crux of a data engineer’s job is to extract data from multiple sources (the web, APIs, files, etc.) and store it somewhere accessible for data analysis, data science (building models and predictive analysis to make decisions), or any other way it can help the business.


In this article, I write about my first practice project in the course. The goal of the task was to practise the ETL (extract, transform, and load) functions. The data was a list of countries by nominal GDP, sourced from a webpage here. The task was to extract the data from the webpage, transform it, and load it into JSON and CSV files and an SQLite database.


The Steps I Took


1. Extraction


I wrote a function to extract the webpage data by scraping the website. To achieve this, I imported several Python modules, libraries, and packages, including:


  1. The requests module to access the webpage from the application

  2. The Pandas library to create and modify data frames to suit the task

  3. BeautifulSoup to parse the HTML text from the website.


I sent a request to the website and then saved the HTML text to a variable through Beautiful Soup; from that variable, I was able to locate the GDP table and save the result to a Pandas data frame (which is 70% of the task).


import requests
import pandas as pd
from bs4 import BeautifulSoup

# url holds the address of the GDP webpage linked above
html_page = requests.get(url).text
data = BeautifulSoup(html_page, 'html.parser')

def extract():
    # Locate the GDP table and skip the header rows
    table = data.find('table', {'class': 'wikitable'})
    rows = table.find_all('tr')[3:]
    df = pd.DataFrame(columns=["Country", "GDP_USD_billion"])

    for row in rows:
        col = row.find_all('td')
        if col:
            country_name = col[0].find('a').text.strip()
            GDP_str = col[2].text.strip().replace(',', '')

            data_dict = {
                "Country": country_name,
                "GDP_USD_billion": GDP_str
            }
            # Append this country's row to the running data frame
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df, df1], ignore_index=True)
    return df
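
As a quick sanity check (a hypothetical snippet, not part of the task itself), the extracted frame can be previewed before any transformation:

df = extract()
print(df.head())  # GDP values are still strings at this point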


2. Transformation

The instruction from the task was to ensure that the GDP amounts were loaded in billions (USD) and rounded to 2 decimal places. The source table lists GDP in millions of USD, and the extracted values were strings, so I had some transformations to do:


def transform_data(extracted_data):
    # Convert the GDP strings to numbers (non-numeric entries become NaN),
    # then divide by 1,000 to turn millions of USD into billions
    extracted_data["GDP_USD_billion"] = pd.to_numeric(extracted_data["GDP_USD_billion"], errors='coerce') / 1000
    extracted_data["GDP_USD_billion"] = extracted_data["GDP_USD_billion"].round(2)

    return extracted_data
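
To see the conversion on a throwaway value (the country and figure below are made up for illustration):

toy = pd.DataFrame({"Country": ["Atlantis"], "GDP_USD_billion": ["1500000"]})
print(transform_data(toy))  # 1,500,000 million USD becomes 1500.0 billion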


3. Load


Lastly, I wrote a function to save the transformed data to CSV and JSON files and to an SQLite database as a table. The process for the CSV and JSON was simple because Pandas has a straightforward way of writing data frames to these two formats. The slightly tricky one was loading to SQLite because I had to create and manage a connection.


import sqlite3

def load_data(target_file, transformed_data):
    # SQLite: open a connection, write the frame as a table, then close it
    # (db_name and table_name are defined elsewhere in the script)
    conn = sqlite3.connect(db_name)
    transformed_data.to_sql(table_name, conn, if_exists='replace', index=False)
    conn.close()

    # JSON is a one-liner; writing the CSV works the same way via to_csv
    transformed_data.to_json(target_file)
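
Putting the three steps together, the script runs top to bottom. The glue below is a sketch: it assumes url, db_name, and table_name are defined near the top of the script, and the JSON file name is an illustrative stand-in:

df = extract()
df = transform_data(df)
load_data('Countries_by_GDP.json', df)  # assumed output path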


Bonus


  1. I defined another function to log these processes with a message and a timestamp into a .txt file (a sketch of what that can look like follows this list).

  2. I set up a public GitHub repository where I pushed the script and the outputs (including the logs). You can find the repository here, and you can keep checking it to track my progress.
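
Here is a minimal sketch of that logging function; the log file name and the timestamp format are assumptions rather than the exact ones in my script:

from datetime import datetime

def log_progress(message):
    # Append a timestamped message to a plain-text log file
    # ('etl_project_log.txt' is an assumed file name)
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    with open('etl_project_log.txt', 'a') as log_file:
        log_file.write(f'{timestamp} : {message}\n')

Calling it at each stage, e.g. log_progress('Extraction complete'), leaves a simple audit trail of the run.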


I hope I find the strength to keep up.