End to End Data Project: IMDB Database

Objective
The goal of this project is to practice your data engineering skills by extracting, transforming, loading (ETL), and visualizing data from the Internet Movie Database (IMDB). You will scrape data from the IMDB top movies list, clean and process the data, store it in a database for further analysis, and create visualizations to gain insights from the data.
If you get stuck, reference this GitHub repo.
Tasks:
- Web scraping: Create a Python script named first.py that uses the requests and Beautiful Soup libraries to collect data on feature-film releases from IMDB's website, from the first day of the year to the present day. The script should follow every "Next" page until no further pages are available.
(https://www.imdb.com/search/title/?title_type=feature&year=2023-01-01,%20today)
You should extract the following from the link:
- titles
- years
- ratings
- genres
- runtimes
- imdb_ratings
- metascores
- votes
Create another script named main.py that does the same thing but only collects data from the last 24-hour period. This script does not need to paginate. A minimal scraper sketch follows.
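Here is one possible starting point for first.py. The CSS selectors below match IMDB's older "advanced search" list layout and are assumptions; inspect the live page and adjust them if the markup has changed.

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://www.imdb.com"
# IMDB may block the default requests user agent, so set a browser-like one
HEADERS = {"User-Agent": "Mozilla/5.0"}

url = BASE + "/search/title/?title_type=feature&year=2023-01-01,%20today"
movies = []

while url:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    for item in soup.select("div.lister-item.mode-advanced"):
        # Guard each lookup: .select_one() returns None when a field is absent
        title_tag = item.select_one("h3.lister-item-header a")
        metascore_tag = item.select_one("span.metascore")
        movies.append({
            "title": title_tag.text.strip() if title_tag else None,
            "metascore": metascore_tag.text.strip() if metascore_tag else None,
            # extract years, ratings, genres, runtimes, imdb_ratings, votes similarly
        })

    # Follow the "Next" link; the loop ends when no such link exists
    next_link = soup.select_one("a.lister-page-next")
    url = BASE + next_link["href"] if next_link else None

print(f"Scraped {len(movies)} movies")
```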
- Data cleaning: Clean and transform the data using pandas and present it in a DataFrame. Write your code to handle scenarios where a field is missing and the scraper returns None, as in the sketch below.
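A minimal cleaning sketch, assuming `movies` is the list of dicts produced by the scraper above; the column names are illustrative and should match whatever keys your scraper emits:

```python
import pandas as pd

# `movies` is the list of dicts from the scraper; missing fields arrive as None
df = pd.DataFrame(movies)

# Strip units/separators, then coerce to numeric; errors="coerce" turns
# unparseable or missing values into NaN instead of raising
df["runtime_min"] = pd.to_numeric(
    df["runtimes"].str.replace(" min", "", regex=False), errors="coerce"
)
df["imdb_rating"] = pd.to_numeric(df["imdb_ratings"], errors="coerce")
df["metascore"] = pd.to_numeric(df["metascores"], errors="coerce")
df["votes"] = pd.to_numeric(df["votes"].str.replace(",", ""), errors="coerce")

# Drop rows with no title; keep partially complete records for analysis
df = df.dropna(subset=["title"]).reset_index(drop=True)
```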
- Database creation: Create a PostgreSQL database on any cloud service provider of your choice (AWS, Azure, or GCP) and insert your data into a new table. Verify that the data has been inserted correctly. Note: make sure your database firewall rules allow connections from your machine's IP address.
- With your first.py script, push the DataFrame to your database table, for example as sketched below.
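One way to do this is with SQLAlchemy and pandas' to_sql. The table name movies and the environment-variable names are placeholders; substitute your own connection details:

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Connection details come from environment variables (names are placeholders)
engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:5432/{os.environ['DB_NAME']}"
)

# Append so that later scheduled runs of main.py can add new rows
df.to_sql("movies", engine, if_exists="append", index=False)

# Quick check that the rows actually landed in the table
print(pd.read_sql("SELECT COUNT(*) AS row_count FROM movies", engine))
```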
- Add an SQL statement to your main.py file that removes duplicates from your database table (a basic data quality step). Then run your script and confirm that your data is being stored in your database table in the right format.
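One way to express the de-duplication step, assuming the table is named movies and that two rows are duplicates when their title and years values match (ctid is PostgreSQL-specific row identity):

```python
from sqlalchemy import text

# ctid is PostgreSQL's physical row identifier; keeping the row with the
# largest ctid deletes every earlier duplicate of the same title/year pair
dedupe_sql = text("""
    DELETE FROM movies a
    USING movies b
    WHERE a.ctid < b.ctid
      AND a.title = b.title
      AND a.years = b.years;
""")

with engine.begin() as conn:  # `engine` from the previous snippet
    conn.execute(dedupe_sql)
```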
- Automate and schedule your main.py script to run every 24 hours using Prefect.
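A minimal sketch of the schedule using the Prefect 2.x API; the task bodies are placeholders for your main.py scrape/clean/load functions:

```python
from prefect import flow, task

@task
def scrape_last_24h():
    ...  # requests + Beautiful Soup logic from main.py

@task
def clean_and_load(raw):
    ...  # pandas cleaning, to_sql append, de-duplication SQL

@flow(log_prints=True)
def imdb_daily_pipeline():
    raw = scrape_last_24h()
    clean_and_load(raw)

if __name__ == "__main__":
    # Serve the flow with a 24-hour interval (86400 seconds)
    imdb_daily_pipeline.serve(name="imdb-daily", interval=86400)
```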
- Connect your database to either Metabase or Power BI, visualize your data, and draw out valuable insights.
Deliverables:
- A Jupyter notebook containing your Python code and explanations for each step.
- A database containing your cleaned and processed data.
- Visualizations created from the data.
- A slide deck explaining your process.
Evaluation Criteria:
- Accuracy of the web scraping code in extracting the required information.
- Effectiveness of the data cleaning process in handling missing values and converting data types.
- Successful creation of a PostgreSQL database and insertion of data into a new table.
- Insightfulness and clarity of the data visualizations.
- Clarity and organization of the Python code and explanations in the Jupyter notebook.
Tips:
- Always check the website’s robots.txt file and terms of service before scraping to ensure you’re allowed to scrape and that you’re doing so responsibly.
- Be sure to handle exceptions and potential errors in your code to make it robust and reliable.
- Comment your code thoroughly to make it easier for others (and future you) to understand.
- Use libraries like matplotlib or seaborn for data visualization in Python; a minimal example follows.
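For instance, a quick scatter plot of the cleaned data, assuming the `df` DataFrame from the pandas step:

```python
import matplotlib.pyplot as plt

# `df` is the cleaned DataFrame from the pandas step
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df["imdb_rating"], df["metascore"], alpha=0.5)
ax.set_xlabel("IMDB rating")
ax.set_ylabel("Metascore")
ax.set_title("IMDB rating vs. Metascore, 2023 feature films")
plt.tight_layout()
plt.show()
```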
Good luck! This project will give you practical experience with important data engineering skills, including web scraping, data cleaning, database management, and data visualization.