Kaggle’s New York Stock Exchange S&P 500 dataset


In this project, we will analyze real-life data from the New York Stock Exchange. You will work with a subset of a large dataset provided by Kaggle that contains historical financial data from S&P 500 companies. We have created this smaller subset of the data for you to use in the project.

What do I need to install?

You may use any spreadsheet application you like. This includes Google Sheets, Microsoft Excel, etc.

Why this Project?

This project will introduce you to the data analysis process that you will be using throughout the rest of the Nanodegree program. In this project, you will go through the process of calculating summary statistics, drawing an inference from the statistics, calculating business metrics and using models to forecast future growth prospects for the companies. The goal is for you to perform an analysis and also create visual tools to communicate the results in informative ways.

We have provided a clean data set for this project. In real-life scenarios, however, data sets often need to be cleaned and processed before analysis can proceed. This project allows you to see what a clean data set should look like.


We used the Fundamentals.csv and Securities.csv files provided by Kaggle. The Fundamentals file provides the fundamental financial data gathered from SEC 10-K annual filings for 448 companies listed on the S&P 500 index. The Securities file provides the industry or sector information under which the companies are categorized on the S&P 500 index.

What skills will I use?

The main goal of this project is for you to demonstrate your ability to:

  • interpret the measures of central tendency and spread (mean, median, standard deviation, range)
  • use a combination of Excel or Google Sheets functions (e.g., IF statements, INDEX and MATCH, calculating descriptive statistics with the IF statement, drop downs, data validation, VLOOKUP).
  • analyze and forecast financial business metrics using Excel or Google Sheets.
  • create visualizations of a business metric and use Excel or Google Sheets to create a financial forecast model.

Project Set Up

This project is made up of two parts. For each part, you will be using the same dataset, which you can find in the Supporting Materials as Projectdata NYSE.csv at the bottom of this page. If you are using Google Sheets, you can access the link to the data here:

  1. The first part of the project is a set of quiz questions, which you will find in the upcoming concepts. These concepts are intended to help you get familiar with the dataset and to test that you have mastered the core concepts in the previous lessons. Correctly answering each of the quiz questions will ensure you are on the right track before you dive into the second part of the project. This part of the project will not be submitted for review.

  2. The second part of your project is the portion you will turn in for review. You will need to create a presentation and a spreadsheet to be reviewed. The details of this submission are provided on the last page of this lesson. Pay attention to the details of the Rubric to ensure you have all deliverables. In order to have your presentation reviewed, you will need to save your slides as a PDF. You can save your spreadsheet as a Microsoft Excel workbook or Google spreadsheet.

Supporting Materials

(The file is available at the link: Project data NYSE.)

Understanding the Data

Cleaning Up The Data

You do not need to follow these suggestions to set up the dataset, but they may help:

  1. Change all the column names so that they have no spaces but are still informative. This isn’t necessary, just a recommendation. Depending on what you do with the data in the future, spaces or special characters in the column names may cause problems. You will see this in the next content on SQL.
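If you ever work with the file outside a spreadsheet, the renaming suggested above can be sketched in a few lines of Python. This is only an illustration: the helper name `clean_column_name` and the choice of lowercase words joined by underscores are assumptions, not a project requirement.

```python
import re


def clean_column_name(name: str) -> str:
    """Turn a header such as 'Cost of goods sold' into 'cost_of_goods_sold'."""
    # Replace every run of characters that is not a letter or digit
    # (spaces, commas, ampersands, ...) with a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    # Remove underscores left at either end, then lowercase the result.
    return cleaned.strip("_").lower()


# A few of the column names listed for the project data file.
headers = [
    "Ticker symbol",
    "Period ending",
    "Cost of goods sold",
    "Sales, General and Administrative expenses",
]

for header in headers:
    print(header, "->", clean_column_name(header))
```

In a spreadsheet you would simply retype the header cells; the point is only that the new names keep their meaning while dropping spaces and punctuation.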

The following information is included in the Project data NYSE file:

  • Ticker symbol: Stock symbol
  • Years: Number of years for which data is provided
  • Period ending
  • Total revenue
  • Cost of goods sold
  • Sales, General and Administrative expenses
  • Research and Development expenses
  • Other Operating expense items
  • Global Industry Classification Standard (GICS) Sector: Industry sector the company is categorized under (e.g., American Airlines with the ticker symbol AAL is categorized under Industrials.)
  • GICS Sub Industry: Sub-industry sector the company is categorized under (e.g., AAL is further categorized under the sub-category of Airlines industry.)

How Do You Complete this Project?

This project is connected with the Introduction to the Data part of the course, but depending on your background knowledge, you may not need to take this module to complete this project.


For the final project, you will complete three tasks: 1) conduct your own data analysis and create a presentation to share your findings, 2) develop a dashboard for a Profit and Loss Statement, and 3) create a Financial Forecasting Model using three scenarios. You should start by taking a look at your dataset and brainstorming which sub-category and company you want to focus your data analysis on – the questions leading to this page should have assisted in this process! Then you should use a spreadsheet application to conduct your analysis on the sub-category and company you are most interested in. This project is open-ended in that there is no one right answer.

Project Goals:

Here are the three tasks that you will complete in the final project.

Task 1:
a. Identify the question about the data that you will answer based on your data analysis, and include this in your slide presentation.

  • Your question should include at least one categorical variable (GICS Sector or GICS Sub Industry) and one quantitative variable (one of the financial metrics) and require the use of at least one of the summary statistics.
  • A tab within the Excel spreadsheet that you submit should include the summary statistics [measures of central tendency (e.g., mean, median) and measures of spread (standard deviation and range)] you used to answer your question.
  • Deliverable: Slide presentation, Spreadsheet with tab for Summary statistics

b. Your slide presentation should provide at least one visualization to help with your answer.

  • This visualization might be a bar chart, histogram, scatterplot, box-plot or other visual that you learned to make. Include your insights from the measures of center and spread and at least one numeric summary statistic in the description.
  • Deliverable: Slide presentation (includes visualization)
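The kind of question Task 1 asks for pairs a categorical variable with a quantitative one and summarizes the quantitative values within each category. A minimal Python sketch of that grouping is shown here. The function name `stats_by_category` and the revenue figures are made up for illustration; in the project itself you would do the same thing with spreadsheet functions or a pivot table.

```python
from collections import defaultdict
from statistics import mean, median


def stats_by_category(records):
    """Group (category, value) pairs and summarize the values per category."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return {
        category: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
        }
        for category, values in groups.items()
    }


# Made-up total revenue values (in $ millions) tagged with a GICS Sector.
records = [
    ("Industrials", 100.0),
    ("Industrials", 300.0),
    ("Health Care", 200.0),
    ("Health Care", 400.0),
    ("Health Care", 900.0),
]

for sector, stats in stats_by_category(records).items():
    print(sector, stats)
```

Comparing the mean and median within a category is often a useful first insight: a mean well above the median suggests a few very large companies are pulling the average up.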

Task 2:

  • Create a dashboard for a Profit and Loss Statement that calculates the Gross Profit, Operating Profit or EBIT for a company selected from a drop-down list.
  • Your drop-down list should pull historical fundamentals data to create the P&L Statement.
  • The P&L statement should include the Gross Profit, Operating Profit or EBIT values for all the years there is historical data available for that company in the dataset.
  • Deliverable: Spreadsheet with tab for Dynamic P&L statement
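The logic behind the dashboard can be sketched in Python as a lookup by ticker followed by two subtractions. The formulas used here are the usual accounting definitions and are an assumption about what the course expects: Gross Profit is total revenue minus cost of goods sold, and Operating Profit (EBIT) is Gross Profit minus SG&A, R&D and other operating expense items. The tickers `XYZ` and `ABC` and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class YearRow:
    ticker: str
    period_ending: str
    total_revenue: float
    cost_of_goods_sold: float
    sga_expenses: float
    rnd_expenses: float
    other_operating_items: float


def gross_profit(row: YearRow) -> float:
    # Assumed definition: revenue less cost of goods sold.
    return row.total_revenue - row.cost_of_goods_sold


def operating_profit(row: YearRow) -> float:
    # Assumed definition: gross profit less all operating expenses.
    return (gross_profit(row) - row.sga_expenses
            - row.rnd_expenses - row.other_operating_items)


def pnl_for(ticker: str, rows: list) -> list:
    """Return one P&L line per year of data for the selected ticker."""
    return [
        {
            "period_ending": row.period_ending,
            "gross_profit": gross_profit(row),
            "operating_profit": operating_profit(row),
        }
        for row in rows
        if row.ticker == ticker
    ]


# Hypothetical rows standing in for the project data.
ROWS = [
    YearRow("XYZ", "Year 1", 1000.0, 600.0, 150.0, 50.0, 20.0),
    YearRow("XYZ", "Year 2", 1100.0, 650.0, 160.0, 55.0, 25.0),
    YearRow("ABC", "Year 1", 500.0, 200.0, 100.0, 0.0, 10.0),
]

for line in pnl_for("XYZ", ROWS):
    print(line)
```

In the spreadsheet, the `ticker` argument plays the role of the drop-down cell, and the filter on `row.ticker` corresponds to a lookup such as INDEX and MATCH or VLOOKUP.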

Task 3:

  • Create a financial model for a company of your choice (different from Task 2) that forecasts out the Gross Profit, Operating Profit or EBIT for two more years using three scenarios (Strong case, Base case and Weak case).
  • Your assumptions for revenue growth, gross margin and operating margin should change for each scenario.
  • The forecasting model should be dynamic for the selection of the case (Weak, Base, Strong). However, the forecasting model can be static for the chosen company ticker symbol.
  • Deliverable: Spreadsheet with tab for Forecasting Model
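One way to picture the scenario model is as a small table of assumptions plus a loop that rolls revenue forward. The sketch below is an illustration only: the assumption values are invented, and deriving Gross Profit and Operating Profit as a margin times forecast revenue is one reasonable modeling choice rather than the required method.

```python
# Hypothetical assumptions per scenario; choose your own values and justify them.
SCENARIOS = {
    "Weak": {"revenue_growth": 0.01, "gross_margin": 0.30, "operating_margin": 0.08},
    "Base": {"revenue_growth": 0.04, "gross_margin": 0.35, "operating_margin": 0.12},
    "Strong": {"revenue_growth": 0.08, "gross_margin": 0.40, "operating_margin": 0.16},
}


def forecast(last_revenue: float, scenario: str, years: int = 2) -> list:
    """Project revenue, gross profit and operating profit for the given scenario."""
    assumptions = SCENARIOS[scenario]
    revenue = last_revenue
    rows = []
    for year_ahead in range(1, years + 1):
        # Grow revenue from the previous year by the scenario's growth rate.
        revenue = revenue * (1 + assumptions["revenue_growth"])
        rows.append({
            "year_ahead": year_ahead,
            "revenue": revenue,
            "gross_profit": revenue * assumptions["gross_margin"],
            "operating_profit": revenue * assumptions["operating_margin"],
        })
    return rows


# Switching the scenario name is the equivalent of changing the case selector cell.
for name in SCENARIOS:
    print(name, forecast(1000.0, name))
```

In the workbook, the dictionary lookup `SCENARIOS[scenario]` corresponds to pulling the active column of assumptions with functions such as OFFSET and MATCH.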

Step One – Get Organized
When you complete your analysis and presentation you’ll want to submit your project. Get organized before you begin. I recommend creating a single folder that will eventually contain:

  • The presentation with the visual and summary
  • The original data set
  • A copy of the spreadsheet workbook you will use to do the analysis for your report that contains at least the following tabs:
    1. Data file
    2. Summary statistics
    3. P&L Statement Dashboard
    4. Forecast scenarios

Step Two – Analyze Your Data
Look through the Tasks described above and select the qualitative variable and quantitative variable you want to focus your analysis on for the various tasks. Then use the .csv file to conduct your data analysis.

Step Three – Create Your Presentation
Once you have finished analyzing the data, create a presentation that shares the visual and summary paragraph. The summary paragraph should clearly communicate your findings based on your analysis, and provide visual or numeric values associated with your summary.


The submission template is a Google Slides file. Make a copy of the submission template to complete your project. We suggest you use the layout provided, though it is not a requirement.

Step Four – Assemble your Worksheet
You will need to include the Excel file with the summary statistics, dashboard and financial model scenarios.

Put your presentation and spreadsheet workbook you used to do the analysis in a folder and zip it. Then submit the zipped folder for your project.

Step Five – Check the Rubric
Use the Project Rubric located here. If you see room for improvement, keep working to improve your project.

Step Six – Assemble your folder ready for submission
If you are happy with your submission, then you’re ready to submit your project. Put your presentation and spreadsheet workbook in a folder and zip it. Then submit the zipped folder for your project.

Finished Example Slide

The above slide and graphs were generated with the project data and are meant to be examples. You can see how this example slide meets the rubric requirements.

  1. Clear question in the title indicating what is being investigated
  2. Descriptive title on each chart describing its contents
  3. y-axis title
  4. x-axis title
  5. Detailed insight based on the descriptive statistics.
  6. Summary statistics about the data

If you have more questions about what you need in your project, double check the rubric.

Helpful Ideas

This page reviews some ideas that, based on previous project submissions, are commonly missed.


In the last Excel lesson you were introduced to some ways to visually display your data. However, you should know that the plots you can make are tied to different data types. We go over those once again here.

Plots You Can Use For Categorical Variables

If you have categorical data, here is a list of the possible univariate (one variable) plots you can make:

  1. Bar Chart
  2. Pie Chart

Plots You Can Use For Quantitative Variables

If you have quantitative data, here is a list of the possible univariate plots you can make:

  1. Histogram
  2. Box Plot

Plots to Compare 2 Variables

If you are interested in comparing two quantitative variables, then the main way to perform this comparison is with a scatterplot. However, if one of the variables is related to time, then a line plot is frequently used.


Quantitative Variables

When describing quantitative variables, it is common to use the statistics discussed earlier:

  1. Measures of center – mean, median, mode
  2. Measures of spread – standard deviation, range, IQR
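For reference, the measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up numbers; the function name `summarize` is hypothetical. Note that `statistics.stdev` is the sample standard deviation, and that the quartile method used by `statistics.quantiles` may not match the one your spreadsheet's quartile function uses, so the IQR can differ slightly between tools.

```python
import statistics


def summarize(values):
    """Return common measures of center and spread for a list of numbers."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }


# Made-up values standing in for one financial metric.
values = [2, 4, 4, 5, 10]
for name, value in summarize(values).items():
    print(name, value)
```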

Categorical Variables

However, when you are analyzing categorical variables, measures of center and spread do not make sense.

When describing categorical variables, you need to use percentages or counts, not means, medians, modes, standard deviations, or ranges.
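Counts and percentages for a categorical column can be sketched as follows. The function name `category_shares` is hypothetical and the list of sector labels is made up; in a spreadsheet, a pivot table or COUNTIF gives the same result.

```python
from collections import Counter


def category_shares(labels):
    """Return {label: (count, percentage of all rows)} for a categorical column."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: (count, 100 * count / total)
        for label, count in counts.items()
    }


# Made-up GICS Sector values for four rows.
sectors = ["Industrials", "Health Care", "Industrials", "Energy"]

for sector, (count, percent) in category_shares(sectors).items():
    print(f"{sector}: {count} rows ({percent:.1f}%)")
```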

Important Last Thought

With this in mind, think of the variable type of the columns you are analyzing, and determine which plots and statistics make sense for your analysis.



What to include in your submission:

  1. A presentation file that should include a slide with:
    • A statement of the question you posed
    • Summary statistics and plots communicating your final results

Formatting of your submission:

  1. Feel free to use our template to develop the presentation.
  2. In order to submit your presentation for review, you will need to save your slides as a PDF or PowerPoint PPT file. You can create a PDF file from within Google Slides by selecting File > Download as > PDF Document.
  3. Excel Workbook or Google Sheets with tabs for each of the following:
    1. Dataset
    2. Summary statistics
    3. Profit and Loss statement dashboard
    4. Forecast model for three scenarios
    5. Your workbook can include additional tabs you may need for your project (e.g., pivot tables).

Formatting of your submission:

  1. The Forecast model should be set to the company of your choice. You may choose to have the ticker symbol or name of the company in text at the top for the forecast model.
  2. You will also need to save your spreadsheet in .xlsx format OR provide a link to your Google Sheets. You should provide the link to the Google sheet in your presentation slides since Google Sheets formulas do not download properly into Excel and the reviewer will not be able to see all the formulas.
  3. A list of websites, books, forums, blog posts, etc. that you referred to or used in creating your submission (add N/A if you did not use any such resources).
  4. Zip (compress) the folder and submit this zipped folder with both files in it.


Analyze NYSE Data

Submission Phase

Exploration of Summary Statistics

The student is able to calculate measures of center for quantitative data and interpret them correctly. Student uses the measures of center and spread and at least one numeric summary statistic to generate insights. Stating the summary statistics is insufficient. Please include in the written description a short insight related to each one. For example, here is an insight based on the mean:
The mean total revenue for companies categorized under Pharmaceutical industry ($26,325,440,909.09) was higher compared to mean total revenue for all healthcare industries ($23,142,217,458.76). It looks like companies in the Pharmaceutical industry have a higher total revenue on average than all industries categorized under Health Care.
The student is able to calculate measures of spread for quantitative data and interpret them correctly. Student uses standard deviation and range to generate insights. Stating the standard deviation and range is insufficient. Please include in the written description a short insight related to each one. For example, please review the finished slide example in the classroom, which can be found in the Analyze NYSE S&P 500 dataset project lesson (Finished Example Slide).
The student is able to build graphs for quantitative and categorical data. Student uses at least one plot to explore the data. The plots may include histograms, box-plots, scatterplots, and bar charts to explore data and gain insights. All slides must contain a visualization. Screenshots of values in a table do not count.
The student is able to present findings in an understandable way. An appropriate visual is chosen to present the data. All labels are legible and the visual has appropriate axis labels. Every visualization should have:
  • chart title (including which year’s data the chart depicts)
  • x axis title
  • x axis labels
  • y axis title
  • y axis labels
Please refer to the finished slide example page in the classroom for an example.
The student has uploaded a PDF report necessary for review. A PDF report has been uploaded as part of a zipped folder.
The student has provided a link to a Google Sheet or an Excel file necessary for review. This file should include their Profit and Loss statement and forecasts. In case the student did not include an Excel file as part of the submission, the Google link should be included in the PDF or slides document. Student provided an Excel file as part of a zipped folder or a link to a Google Sheet (in case the student used Google Sheets instead of Excel) necessary for review. This file should include their Profit and Loss statement and forecasts. The Google link should be included in the PDF or slides document. The spreadsheet (Excel or Google Sheets) should contain individual tabs for the dataset, calculation of the summary statistics, dashboard for Profit and Loss statement, and Forecasting model with scenarios. There can be additional tabs in the Workbook that are needed for the dashboard and forecasting model.

Communication Phase

The student is able to avoid making inferential or causal statements when using descriptive statistics. The results of the analysis are presented such that any limitations are clear. The analysis does not state or imply that one change causes another based solely on a correlation. The results do not imply facts about a larger group of individuals based on descriptive values. Language is only applied to the specific data provided, unless a correct analysis beyond the course material is conducted that allows for inference.
The student is able to choose the correct analysis or plot for a given data type. The analysis associated with answering a particular question uses the appropriate variables, summary statistics, and plots that could provide an answer.

Business Metrics

The student is able to correctly use the financial metrics. Student has input the correct formula for each business metric in the income statement and forecast model. Student has built a forecast model for any company of choice. A dropdown for a company in the forecast model is NOT required.
The student provides appropriate assumptions for the financial model scenarios. The student provides appropriate assumptions based on gross margin, revenue growth and operating margin for the financial model scenarios.

Excel Functions and Modeling

The student can use various Excel functions. Student demonstrates using VLOOKUP or INDEX and MATCH statements. The student can use the appropriate functions such as OFFSET and MATCH to create forecast scenarios.

