5 Reasons NOT To Include Projects of Public Datasets In Your Portfolio

Payton Soicher
4 min read · Jan 20, 2023
Tons of datasets to choose from, but which ones should be in a portfolio?

When you decide to get started in data science, the first thing that gets employers and recruiters to take you seriously is a portfolio of projects you’ve built. These can involve simple data analysis, predictive modeling, or data extraction. Not every project has to teach a car how to drive using machine learning, but it should at minimum showcase a skill an employer would find important for the job they are hiring for.

Sometimes the hardest part of a project isn’t building a machine learning model or creating a killer visualization, but actually picking the topic of the project you want to work on. Everyone has already analyzed the Kaggle Titanic dataset! Creating projects using data that has already been processed for you can be great to build your skill set in private, but should never be showcased in public.

But isn’t any accessible dataset on the internet public?

Sure, but there’s a difference between accessible and pre-collected. For example, you can extract sports statistics from a website. That is accessible data; however, if you want to look at a very specific problem, you have to put in some work to get the exact data you need and format it in a way that showcases your results. Pre-collected datasets, by contrast, are bland: they have already been cleaned and processed, and they serve only one functional purpose. These datasets should not be included in your portfolio for the following reasons:

1. It shows your level of expertise.

Every intermediate data scientist has access to the same data. That also means the likelihood it will be on other portfolios is very high. Most easily accessible datasets are for intermediate data scientists for a reason. They’re great for learning the skill sets, but make no mistake, nobody will be fooled into thinking you can handle real-world data just because you were able to create a solution to MNIST.

2. Getting my own data is difficult…exactly the point!

Just like at any company, obtaining the data is half the battle! Working with data that has already been packed into CSVs and served on a platter seems lazy. You tell me what sounds more impressive:

  • I created a machine learning model using data in a forked data repository
  • I created a web scraping algorithm to collect data, then created a machine learning model on it

The first scenario showcases one skill set (machine learning), while the second scenario showcases multiple skills (software engineering, data engineering, machine learning).
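To make the second scenario a little more concrete, here is a minimal sketch of the collection step, using only Python’s standard library. Everything specific in it is an assumption for illustration: the sample HTML, the column names, and the helper names `TableParser` and `table_to_records` are all made up, and a real project would fetch the page over HTTP (and check the site’s terms of use) instead of using a hard-coded string.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Use the first row as column names and turn the rest into dicts."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample page; in practice this would be a downloaded response.
SAMPLE_HTML = """
<table>
  <tr><th>player</th><th>games</th><th>points</th></tr>
  <tr><td>Player A</td><td>10</td><td>215</td></tr>
  <tr><td>Player B</td><td>8</td><td>144</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)

# Build a feature the raw page does not provide.
for record in records:
    record["points_per_game"] = int(record["points"]) / int(record["games"])
```

Even a small step like this is work that a pre-packaged CSV never asks of you: deciding which fields to keep, converting them from text, and deriving the column your question actually needs.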

3. It sends a negative first impression

As the saying goes, “You only get one chance at a first impression,” so make it a great one! Someone who recognizes a common dataset won’t be excited about the work you did, no matter how interesting it is. If you want to stand out, work on unique data that is interesting to whoever is viewing your work.

4. Businesses solve custom problems with non-standard solutions

Do you know how difficult it is to find a data scientist who can run a linear regression? A logistic regression? Not difficult at all: about 99% of them on the planet can. I’m not saying those aren’t important solutions to problems, but a personal portfolio should show off your creativity. Public datasets typically come with one problem to solve. Whether it’s classifying an image correctly or predicting the next value in a time series, public datasets exist to help beginners learn different ways to solve problems. A business, on the other hand, will assume you can solve the basic problems, so you need to show you can work on problems a beginner would struggle with.

5. Your Passion Isn’t Showing

When you invest the time and energy into finding data you find interesting and challenging, you will not skimp on the small details in your project. It will push you to dig deeper and find additional sources of data to help solve a problem or prove a point. Public datasets almost exclusively do the opposite. Nobody works with the Boston Housing dataset because they are in love with real estate data; they work with it to learn how to build strong regression models. If you want to learn how to build an accurate regression model, practice on the Boston dataset, but use a more distinctive dataset in your portfolio to boost your credibility.

It’s not difficult to get started with your own portfolio, but it is hard to get people to take you seriously in data science. It takes enough time as it is to analyze one dataset, so use that time wisely by building a portfolio of datasets that catch the eye of a recruiter. There are tons of datasets out there that are rarely touched but have a lot of potential for showcasing skills. A portfolio isn’t supposed to show how well you can build an accurate machine learning model. It is supposed to show the full range of what you can do with data, so that a business believes having you on staff will benefit them in the long run.
