Walmart Store Sales Project
Business Objective:
Walmart, a multinational retail corporation, operates a chain of hypermarkets, discount department stores, and grocery stores. The dataset provided contains historical sales data for 45 stores located in different regions. Each store contains a number of departments, and the company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas.
The goal of this project is to analyze the sales data, understand the factors influencing the sales, and provide insights to the management team to help them make informed decisions. The factors can include holidays, temperature, fuel price, Consumer Price Index (CPI), and unemployment rates.
Dataset:
Download the dataset here.
The dataset contains historical weekly sales data for 45 Walmart stores located in different regions. We are tasked with predicting weekly sales for each store, taking into account the promotional markdown events and holiday weeks described above.
The data includes the following fields:
- Store: The store number
- Date: The week of sales
- Weekly_Sales: Sales for the given store in the given week
- Holiday_Flag: Whether the week is a special holiday week (1) or not (0)
- Temperature: Average temperature in the region
- Fuel_Price: Cost of fuel in the region
- CPI: Consumer Price Index
- Unemployment: The unemployment rate
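As a starting point, the fields above can be loaded into a pandas DataFrame. The sketch below is only an illustration: the sample rows are made up, and the day-month-year date format is an assumption that should be checked against the actual download.

```python
import io

import pandas as pd

# Made-up sample rows in the documented column layout (not real Walmart data).
SAMPLE_CSV = """Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,05-02-2010,1500.0,0,42.0,2.57,211.1,8.1
1,12-02-2010,1800.0,1,38.0,2.55,211.2,8.1
2,05-02-2010,2100.0,0,45.0,2.57,210.8,8.3
"""


def load_sales(source):
    """Read the sales file and parse Date (assumed to be DD-MM-YYYY strings)."""
    df = pd.read_csv(source)
    df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
    # Sort so that each store's weeks appear in chronological order.
    return df.sort_values(["Store", "Date"]).reset_index(drop=True)


sales = load_sales(io.StringIO(SAMPLE_CSV))
print(sales.dtypes)
```

For the real file, pass its path to `load_sales` instead of the in-memory sample.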
Project Goals:
- Exploratory Data Analysis (EDA): Understand the data through statistical summaries and visualizations. Identify any trends, correlations, or anomalies in the data.
- Feature Engineering: Create new features that might help improve the performance of our predictive models.
- Data Visualization: Visualize the data to gain more insights. This includes sales trends over time, sales distribution across different stores, and the relationship between sales and other variables.
- Hypothesis Testing: Test hypotheses related to the impact of holidays, temperature, fuel price, CPI, and unemployment on sales.
- Machine Learning: Develop machine learning models to predict future sales based on the given variables. This includes regression models, time series forecasting models, and possibly more advanced models like neural networks.
- Evaluation: Evaluate the performance of the models using appropriate metrics and techniques.
- Insights and Recommendations: Provide actionable insights and recommendations based on the results of our analysis and modeling.
By achieving these goals, we aim to provide Walmart with a robust tool for predicting future sales and a comprehensive analysis of the factors affecting sales. This will support informed decision-making and strategic planning.
Deliverables
You can analyze the data in any tool you like (Tableau, Power BI, Python, R, Excel, etc.).
Your manager would like a dashboard. The dashboard will be used by upper management to monitor performance.
She would also like you to generate a slide deck to present your analysis and recommendations to the management team. She would like to know the factors that impact sales and which stores are impacted the most.
The slide deck can be done in Google Slides, PowerPoint, or any other tool. Just save it as a PDF.
Additional Instructions
Feel free to explore the data however you see fit. We have provided some guided questions to help direct your analysis and spark your own ideas.
Guiding Questions
Exploratory Data Analysis
- How many unique stores are there in the dataset?
- Relevance: Understanding the number of unique stores can help us understand the scope of the data.
- What is the time range of the data?
- Relevance: Knowing the time range of the data can help us understand the period we’re analyzing.
- What is the average weekly sales across all stores?
- Relevance: This gives us an idea of the overall sales performance.
- What is the correlation between Temperature, Fuel_Price, CPI, Unemployment, and Weekly_Sales?
- Relevance: This can help us understand how these variables are related to each other.
- How many holiday weeks are there in the dataset?
- Relevance: This can help us understand the frequency of holiday weeks.
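The questions above map onto a handful of pandas calls. The following is a minimal sketch on made-up rows, assuming the column names from the field list; "holiday weeks" is interpreted here as distinct calendar weeks rather than store-week rows, which is one possible reading of the question.

```python
import pandas as pd

# Made-up rows in the documented layout, only to make the sketch runnable.
df = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19"] * 2),
    "Weekly_Sales": [1500.0, 1800.0, 1400.0, 2100.0, 2500.0, 2000.0],
    "Holiday_Flag": [0, 1, 0, 0, 1, 0],
    "Temperature": [42.0, 38.0, 40.0, 45.0, 41.0, 43.0],
    "Fuel_Price": [2.57, 2.55, 2.51, 2.57, 2.55, 2.51],
    "CPI": [211.1, 211.2, 211.3, 210.8, 210.9, 211.0],
    "Unemployment": [8.1, 8.1, 8.1, 8.3, 8.3, 8.3],
})

n_stores = df["Store"].nunique()
date_range = (df["Date"].min(), df["Date"].max())
mean_weekly_sales = df["Weekly_Sales"].mean()

# Count distinct calendar weeks flagged as holidays, not store-week rows.
n_holiday_weeks = df.loc[df["Holiday_Flag"] == 1, "Date"].nunique()

# Pairwise Pearson correlations between the numeric drivers and sales.
numeric_cols = ["Temperature", "Fuel_Price", "CPI", "Unemployment", "Weekly_Sales"]
corr = df[numeric_cols].corr()

print(n_stores, date_range, round(mean_weekly_sales, 2), n_holiday_weeks)
print(corr["Weekly_Sales"])
```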
Data Cleaning and Preprocessing
- Check for missing values in the dataset.
- Relevance: Missing values can affect the performance of our models.
- If there are missing values, impute them using an appropriate strategy.
- Relevance: This can help us deal with missing values.
- Check for outliers in the Weekly_Sales, Temperature, Fuel_Price, CPI, and Unemployment variables.
- Relevance: Outliers can affect the performance of our models.
- If there are outliers, handle them using an appropriate strategy.
- Relevance: This can help us deal with outliers.
- Normalize the Temperature, Fuel_Price, CPI, and Unemployment variables.
- Relevance: Normalization can help improve the performance of our models.
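One way to work through these cleaning steps is sketched below. The choices are assumptions rather than requirements of the brief: median imputation, the 1.5 × IQR rule with clipping for outliers, and min-max scaling for normalization. The rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with two gaps and one extreme sales value.
df = pd.DataFrame({
    "Weekly_Sales": [1500.0, 1800.0, np.nan, 2100.0, 2500.0, 9000.0],
    "Temperature": [42.0, 38.0, 40.0, np.nan, 41.0, 43.0],
    "Fuel_Price": [2.57, 2.55, 2.51, 2.60, 2.62, 2.58],
    "CPI": [211.1, 211.2, 211.3, 211.4, 211.5, 211.6],
    "Unemployment": [8.1, 8.0, 7.9, 8.3, 8.2, 8.4],
})

# 1. Missing values: count per column, then fill with the column median.
missing = df.isna().sum()
df_filled = df.fillna(df.median(numeric_only=True))


def iqr_fences(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 2. Outliers: flag values outside the fences, then clip them to the fences.
low, high = iqr_fences(df_filled["Weekly_Sales"])
outlier_mask = (df_filled["Weekly_Sales"] < low) | (df_filled["Weekly_Sales"] > high)
sales_clipped = df_filled["Weekly_Sales"].clip(lower=low, upper=high)

# 3. Normalization: min-max scale each driver to [0, 1].
# A constant column would divide by zero here and needs separate handling.
drivers = df_filled[["Temperature", "Fuel_Price", "CPI", "Unemployment"]]
scaled = (drivers - drivers.min()) / (drivers.max() - drivers.min())
```

Whether clipping is appropriate depends on the cause of the extreme values; holiday peaks, for example, may be genuine signal rather than noise.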
Feature Engineering
- Create a new feature Month extracted from the Date.
- Relevance: This can help us analyze sales trends on a monthly basis.
- Create a new feature Year extracted from the Date.
- Relevance: This can help us analyze sales trends on a yearly basis.
- Create a new feature Week extracted from the Date.
- Relevance: This can help us analyze sales trends on a weekly basis.
- Create a new feature Is_Holiday_Week based on the Holiday_Flag (1 if it’s a holiday week, 0 otherwise).
- Relevance: This can help us analyze sales trends during holiday weeks.
- Create a new feature Sales_Per_Store which is the total sales per store.
- Relevance: This can help us understand which stores are performing better in terms of sales.
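The features listed above can be derived roughly as follows. This is a sketch on made-up rows; it assumes Date has already been parsed to a datetime column and uses ISO week numbers for Week, which is one reasonable convention among several.

```python
import pandas as pd

# Made-up rows; Date is assumed to be parsed already.
df = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12"] * 2),
    "Weekly_Sales": [1500.0, 1800.0, 2100.0, 2500.0],
    "Holiday_Flag": [0, 1, 0, 1],
})

# Calendar features extracted from the date.
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
df["Week"] = df["Date"].dt.isocalendar().week.astype(int)  # ISO week number

# Mirrors Holiday_Flag as an explicit 0/1 indicator.
df["Is_Holiday_Week"] = df["Holiday_Flag"].eq(1).astype(int)

# Total sales of the row's store, repeated on every row of that store.
df["Sales_Per_Store"] = df.groupby("Store")["Weekly_Sales"].transform("sum")

print(df)
```

Note that Sales_Per_Store is computed from the target itself, so using it as a model input would leak information unless it is derived from the training period only.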
Data Visualization
- Plot the weekly sales over time.
- Relevance: This can help us visualize the sales trend over time.
- Plot the average weekly sales per store.
- Relevance: This can help us visualize the performance of each store.
- Plot the distribution of Temperature, Fuel_Price, CPI, and Unemployment.
- Relevance: This can help us understand the distribution of these variables.
- Plot the correlation matrix of Temperature, Fuel_Price, CPI, Unemployment, and Weekly_Sales.
- Relevance: This can help us visualize the relationship between these variables.
- Plot the average weekly sales during holiday weeks and non-holiday weeks.
- Relevance: This can help us understand the impact of holidays on sales.
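If you work in Python, several of these plots can be produced with pandas and matplotlib. The sketch below uses made-up rows and covers three of the five plots (sales over time, average per store, holiday versus non-holiday); the figure layout is an arbitrary choice.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows in the documented layout.
df = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19"] * 2),
    "Weekly_Sales": [1500.0, 1800.0, 1400.0, 2100.0, 2500.0, 2000.0],
    "Holiday_Flag": [0, 1, 0, 0, 1, 0],
})

# Aggregate first, then plot the aggregates.
total_by_week = df.groupby("Date")["Weekly_Sales"].sum()
avg_by_store = df.groupby("Store")["Weekly_Sales"].mean()
avg_by_holiday = df.groupby("Holiday_Flag")["Weekly_Sales"].mean()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
total_by_week.plot(ax=axes[0], marker="o", title="Total weekly sales over time")
avg_by_store.plot.bar(ax=axes[1], title="Average weekly sales per store")
avg_by_holiday.rename({0: "Non-holiday", 1: "Holiday"}).plot.bar(
    ax=axes[2], title="Average weekly sales by week type"
)
fig.tight_layout()
# fig.savefig("sales_overview.png")  # hypothetical output file name
```

The distributions and the correlation matrix can be added in the same way, for example with `DataFrame.hist` and `Axes.imshow` on the output of `DataFrame.corr`.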
Hypothesis Testing
- Is there a significant difference in sales during holiday weeks and non-holiday weeks?
- Relevance: This can help us understand the impact of holidays on sales.
- Is there a significant correlation between Temperature and Weekly_Sales?
- Relevance: This can help us understand if temperature affects sales.
- Is there a significant correlation between Fuel_Price and Weekly_Sales?
- Relevance: This can help us understand if fuel price affects sales.
- Is there a significant correlation between CPI and Weekly_Sales?
- Relevance: This can help us understand if the Consumer Price Index affects sales.
- Is there a significant correlation between Unemployment and Weekly_Sales?
- Relevance: This can help us understand if unemployment rates affect sales.
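These questions can be approached with standard tests from SciPy. The sketch below assumes Welch's t-test for the holiday comparison and Pearson correlation tests for the four drivers; other tests may suit the real data better. The data is synthetic, generated with a built-in holiday effect so the example has something to detect.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: 100 regular weeks and 20 holiday weeks with a built-in uplift.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "Holiday_Flag": np.repeat([0, 1], [100, 20]),
    "Temperature": rng.normal(60, 15, n),
    "Fuel_Price": rng.normal(3.3, 0.4, n),
    "CPI": rng.normal(170, 40, n),
    "Unemployment": rng.normal(8, 1.5, n),
})
df["Weekly_Sales"] = 1000 + 300 * df["Holiday_Flag"] + rng.normal(0, 50, n)

# Holiday vs non-holiday: Welch's t-test (does not assume equal variances).
holiday = df.loc[df["Holiday_Flag"] == 1, "Weekly_Sales"]
regular = df.loc[df["Holiday_Flag"] == 0, "Weekly_Sales"]
t_stat, p_holiday = stats.ttest_ind(holiday, regular, equal_var=False)
print(f"holiday effect: t = {t_stat:.2f}, p = {p_holiday:.3g}")

# Each driver vs sales: Pearson correlation with its two-sided p-value.
correlation_tests = {}
for col in ["Temperature", "Fuel_Price", "CPI", "Unemployment"]:
    r, p = stats.pearsonr(df[col], df["Weekly_Sales"])
    correlation_tests[col] = (float(r), float(p))
    print(f"{col}: r = {r:.3f}, p = {p:.3g}")
```

Both tests assume independent observations. Weekly sales from the same store are correlated over time, so p-values on the real data should be read with some caution.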
Machine Learning
- Build a linear regression model to predict Weekly_Sales based on Temperature, Fuel_Price, CPI, and Unemployment.
- Relevance: This can help us understand the relationship between these variables and weekly sales, and can also be used for sales forecasting.
- Evaluate the performance of the linear regression model using appropriate metrics (e.g., RMSE, R-squared).
- Relevance: This can help us understand how well our model is performing.
- Build a time series forecasting model (e.g., ARIMA, SARIMA) to predict future Weekly_Sales.
- Relevance: This can help us forecast future sales.
- Evaluate the performance of the time series forecasting model using appropriate metrics (e.g., RMSE, MAPE).
- Relevance: This can help us understand how well our model is performing.
- Build a classification model to predict whether a week is a holiday week based on Weekly_Sales, Temperature, Fuel_Price, CPI, and Unemployment.
- Relevance: This can help us understand the relationship between these variables and whether a week is a holiday week.
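To make the regression and evaluation steps concrete, here is a minimal sketch of an ordinary least squares fit with RMSE, R-squared, and MAPE computed on a held-out set. It uses NumPy's least squares solver rather than a particular modeling library, and synthetic data with made-up coefficients; the rows are assumed to be in date order so that the 80/20 split is chronological.

```python
import numpy as np

# Synthetic drivers: Temperature, Fuel_Price, CPI, Unemployment (made-up scales).
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([
    rng.normal(60, 15, n),
    rng.normal(3.3, 0.4, n),
    rng.normal(170, 40, n),
    rng.normal(8, 1.5, n),
])
true_coef = np.array([-10.0, 200.0, 3.0, -100.0])  # hypothetical effects
y = 5000.0 + X @ true_coef + rng.normal(0, 10, n)


def add_intercept(features):
    """Prepend a column of ones so the model learns an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Chronological split: first 80% of rows for training, the rest for testing.
split = int(n * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Ordinary least squares fit on the training rows.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
pred = add_intercept(X_test) @ coef

# Evaluation on the held-out rows.
residuals = y_test - pred
rmse = float(np.sqrt(np.mean(residuals ** 2)))
r2 = float(1 - np.sum(residuals ** 2) / np.sum((y_test - y_test.mean()) ** 2))
mape = float(np.mean(np.abs(residuals / y_test)) * 100)  # requires y_test != 0

print(f"RMSE = {rmse:.2f}, R^2 = {r2:.4f}, MAPE = {mape:.2f}%")
```

The same three metrics apply to a time series model such as ARIMA or SARIMA, as long as the forecasts are compared against weeks that were not used for fitting.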