When it comes to breaking into the field of data science, you need to use every trick in the book to give yourself that one advantage that pushes you over the finish line. So why not try to emulate the habits of the best in the business? This article isn’t a “get rich quick” method for becoming an efficient data scientist. Instead, it shows the habits that have helped the best data scientists get to where they are. It’s often said that a data scientist’s worth is determined by the impact they can have on an organization. That impact begins with becoming an efficient and effective data scientist through the development of good habits.
How many current data science technologies arose only in the last ten or so years? Most of them.
By entering the realm of data science with the motivation that you’re going to take a good crack at it, you’ve committed yourself to a lifetime of constant learning. Don’t worry, it’s not as bleak as it sounds.
However, what should be kept in the back of your mind at all times is that to remain relevant in the workforce, you need to stay up to date with technology. So, if you’ve been doing data analysis with MATLAB your whole career, try learning to code in Python. If you’ve been creating your visualizations with Matplotlib, try using Plotly for something fresh.
How to implement this habit: Take an hour every week (or as much time as you can spare), and experiment with new technologies. Figure out which technologies are relevant by reading blog posts, and pick a couple you would like to add to your stack. Then, create some personal projects to learn how to use the new technologies to the best of their abilities.
I always seem to be blessed with getting to read and deal with code that has terrible documentation and no supporting comments to help me understand what the heck is going on.
Part of me used to chalk it up to the wistful meanderings of programmers, until one day, I realized that it’s just the sign of a bad programmer.
All good programmers I’ve dealt with are those who provide clear, concise documentation to support their work and litter their programs with helpful comments to describe what certain lines of code are doing. This is especially pertinent for data scientists who are using complex algorithms and machine learning models to solve problems.
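As a rough illustration of what that kind of documentation can look like in Python, here is a minimal sketch. The function name `zscore_normalize` and its behavior are assumptions made up for this example, not something taken from a particular project. The docstring states what goes in, what comes out, and what can go wrong, and the inline comments explain why a step exists rather than restating the code.

```python
from statistics import mean, stdev


def zscore_normalize(values):
    """Rescale numbers to zero mean and unit standard deviation.

    Args:
        values: A sequence of at least two numbers that are not all equal.

    Returns:
        A list of floats, one per input value, in the original order.

    Raises:
        ValueError: If the spread is zero, so a z-score is undefined.
    """
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    # Center first, then scale, so features measured in different units
    # end up on a comparable scale.
    return [(v - center) / spread for v in values]
```

A reader who only sees the docstring, for example through `help(zscore_normalize)`, should be able to use the function without reading its body.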
How to implement this habit: Take some time to either read good code documentation or articles on how to write good code documentation. To practice, write documentation for old personal projects, or take some time to revamp the documentation of your current projects. Since a good portion of the data science world runs on Python, check out this really well-written article on how to document Python code:
The stereotype that developers are pasty-skinned social outcasts who lock themselves into solitude to write code destined for world domination is an outdated generalization that doesn’t reflect the modern complexities of the tech industry as a whole.
Refactoring is the process of cleaning up your code without changing its original function. While refactoring is a process born from necessity in software development situations, refactoring can be a useful habit for data scientists.
My mantra when refactoring is “less is more”.
I find that when I initially write code to solve data science problems, I usually throw good coding practices out of the door in favor of writing code that works when I need it to. In other words, a lot of spaghetti code happens. Then, after I get my solution to work, I’ll go back and clean up my code.
How to implement this habit: Take a look at old code and ask if the same code could be written more efficiently. If so, take some time to educate yourself on best coding practices and look for ways where you can shorten, optimize, and clarify your code. Check out this great article that outlines best practices for code refactoring:
There are so many productivity-enhancing extensions for IDEs out there that, surprisingly, some people haven’t chosen to optimize their workflows yet.
This habit is so unique to everyone that it really comes down to determining which tools, workspaces, and workflows make you the most effective and efficient data scientist you could be.
How to implement this habit: Once a year (or more often if that works better for you), take stock of your overall effectiveness and efficiency, and determine where you could improve. Perhaps this means working on your machine learning algorithms first thing in the morning, or sitting on an exercise ball instead of a chair, or adding a new extension to your IDE that will lint your code for you. Experiment with different workspaces, tools, and workflows until you enter your optimal form.
From what I’ve seen, data science is 75% understanding business problems and 25% writing models to figure out how to solve them.
Coding, algorithms, and mathematics are the easy part. Understanding how to implement them so they can solve a specific business problem, not so much. By taking more time to understand the business problem and the objectives you’re trying to solve, the rest of the process will be much smoother.
To understand the problems facing the industry you’re working in, you need to do a little investigation to gather some context with which to support your knowledge of the problems you’re trying to solve. For instance, you need to understand what makes the customers of a particular business tick, or the specific goals an engineering firm is trying to reach.
How to implement this habit: Take some time to research the specific company you’re working for and the industry that they’re in. Write a cheat sheet that you can refer to, containing the major goals of the company, and the issues it may face within its specific industry. Don’t forget to include algorithms that you may want to use to solve business problems or ideas for machine learning models that could be useful in the future. Add to this cheat sheet whenever you discover something useful and soon you’ll have a treasure trove of industry-related tidbits.
No, not in life. In your code and your workflow.
It’s often argued that the best data scientists use the least amount of code, the least amount of data, and the simplest algorithms to get the job done.
By minimalist, though, I don’t want you to immediately assume scarcity. Often, when someone discusses the importance of minimalism in code, it leads people to try to develop outrageous solutions that use only a few lines of code. Stop that. Yes, it’s impressive, but is that really the best use of your time?
Instead, once you get comfortable with data science concepts, begin to look for ways that you can optimize your code to make it simple, clean, and short. Use simple algorithms to get the job done, and don’t forget to write re-usable functions to remove redundancies.
How to implement this habit: As you progress as a data scientist, begin to push yourself to write more efficient solutions, write less code, and use simpler algorithms and models to get the job done. Learn how to shorten your code without reducing its effectiveness, and leave plenty of comments to explain how contracted versions of code work.
I’ll be the first to admit that I severely neglect functions when I’m writing data analysis code for the first time. Spaghetti code fills my IDE as I struggle to reason my way through different analyses. If you looked at my code you would probably deem it too far gone and volunteer to take it out behind the barn to put it out of its misery.
Once I’ve managed to cobble together a half-decent result, I’ll then go back to try to fix the equivalent of a bad accident. By packaging my code into functions, I quickly remove unnecessary complexities and redundancies. If that’s the only thing I do to my code, then I will already have simplified it to a point that I can revisit the solution and understand how I got to that point.
How to implement this habit: Don’t forget the importance of functions when writing code. It’s often said that the best developers are lazy developers because they figure out how to create solutions that don’t require much work. After you’ve written a solution, go back and bundle redundant or complex code into functions to help organize and simplify your code.
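To make the idea concrete, here is a small sketch of bundling repeated analysis steps into one function. The function `summarize` and the sample measurements are hypothetical and exist only for illustration. Instead of copying the same count, mean, and range calculation for every column, the logic lives in one place and is called once per column.

```python
def summarize(values):
    """Return the count, mean, and range of a non-empty list of numbers."""
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "range": max(values) - min(values),
    }


# Hypothetical measurements, made up for this example.
heights_cm = [160, 172, 181]
weights_kg = [55, 70, 85]

# One reusable function replaces a copy-pasted block per column.
columns = [("height", heights_cm), ("weight", weights_kg)]
report = {name: summarize(column) for name, column in columns}

for name, stats in report.items():
    print(name, stats)
```

If the summary later needs a median or a missing-value count, there is only one function to change, and every column picks up the change.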
Test-driven development (TDD) is a software development principle that focuses on writing code with incremental improvements that are constantly tested. TDD runs on a “Red, Green, Refactor” system that encourages developers to write a test that fails at first (red), write just enough implementation code to make it pass (green), and then optimize the codebase while the tests keep passing (refactor).
TDD can be implemented successfully by data scientists to produce analytics pipelines, develop a proof of concept, work with data subsets, and ensure that functioning code isn’t broken during the development process.
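A minimal sketch of what that cycle might look like for a single data-cleaning step follows. The function `drop_missing` and the sample rows are assumptions invented for this example. In a TDD workflow the two test functions would be written first and would fail until `drop_missing` existed and behaved as expected. Plain `assert` statements are used here so the sketch runs without a test framework.

```python
def drop_missing(rows):
    """Return only the rows in which no field is None."""
    return [
        row for row in rows
        if all(value is not None for value in row.values())
    ]


def test_drop_missing_removes_incomplete_rows():
    # Hypothetical records; the second one has a missing age.
    rows = [
        {"age": 31, "city": "Oslo"},
        {"age": None, "city": "Lima"},
    ]
    assert drop_missing(rows) == [{"age": 31, "city": "Oslo"}]


def test_drop_missing_keeps_complete_rows_untouched():
    rows = [{"age": 40, "city": "Pune"}]
    assert drop_missing(rows) == rows


if __name__ == "__main__":
    test_drop_missing_removes_incomplete_rows()
    test_drop_missing_keeps_complete_rows_untouched()
    print("all tests passed")
```

Once both tests pass, the body of `drop_missing` can be refactored freely, and rerunning the tests shows whether the behavior still holds.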
How to implement this habit: Study up on test-driven development, and determine whether or not this technique can add something to your workflow. TDD isn’t the perfect answer for every problem, but it can be useful if implemented thoughtfully. Check out this article that gives a great description of TDD and offers an example of how to implement it into data science projects:
Ever pull the latest changes and have your computer blow up with error messages and issues coming out of the wazoo? I have. It sucks.
During those moments when you feel like introducing whoever made such a large commit to your fist, take a breath, and remember that this person obviously didn’t take the time to implement good habits growing up.
What’s the golden rule of team-based software development? Make small, frequent commits.
How to implement this habit: Get into the practice of frequently committing your code changes and just as regularly pulling to get the latest code. Every change you or another person makes could break the whole project, so it’s important to make small changes that are easy to revert and likely only affect one part or layer of the project.
Depending on who you ask, the industry either has too many data scientists or too few.
Regardless of whether the industry is becoming saturated or arid, you will be competing with tons of highly qualified, and often over-qualified, candidates for a single job. This means that in the lead-up to applying for jobs, you need to have already developed the habit of self-improvement. Today, everyone is obsessed with upskilling, and for good reason. Data scientists should be no exception to this trend.
How to implement this habit: Make a skill inventory and see how you stack up to the requirements employers include in job postings. Are you a Pythonista who can efficiently use relevant libraries such as Keras, NumPy, Pandas, PyTorch, TensorFlow, Matplotlib, Seaborn, and Plotly? Can you write a memo detailing your latest findings and how they can improve the efficiency of your company by 25%? Are you comfortable with working as part of a team to complete a project? Identify any shortcomings and find some good online courses or resources to bolster your skills.
In 7 Habits of Highly Effective People, Stephen Covey discusses the principle of “beginning with the end in mind”.
To effectively relate this to data science projects, you need to ask yourself in the planning phase of a project what the desired outcome of the project is. This will help shape the path of the project and will give you a roadmap of outcomes that need to be met to reach the final goal. Not only that, but determining the outcome of the project will give you an idea of the feasibility and sustainability of the project as a whole.
How to implement this habit: Begin each project with a planning session that lays out exactly what you hope to achieve at the end of the development period. Determine which problem you will be attempting to solve, or which piece of evidence you are trying to gather. Then, you can begin to answer feasibility and sustainability questions that will shape the milestones and outcomes of your project. From there, you can start writing code and machine learning models with a clear plan in place to guide you to the end of your project.
After attempting unsuccessfully to prepare a freshman lecture on why spin-1/2 particles obey Fermi-Dirac statistics, Richard Feynman famously said “I couldn’t reduce it to the freshman level. That means we really don’t understand it.” Known as “The Great Explainer”, Feynman left a legacy that data scientists can only hope to emulate.
Data science, the art of using data to tell a compelling story, is only successful if the storyteller understands the story they are trying to tell. In other words, it’s your task to understand so that you can be understood. Developing this habit early on of understanding what you’re trying to accomplish, such that you can share it with someone else to a fair level of comprehension, will make you the most effective data scientist in the room.
How to implement this habit: Use The Feynman Technique to develop a deep level of understanding of the concepts you’re trying to discover and the problems you’re trying to solve. This method aligns itself well with the data science process of analyzing data and then explaining the results to generally non-data science stakeholders. In short, you refine your explanation of the topic to such a point that you can explain it in simple, non-jargon terms that can be understood by anyone.
In a field dominated by master’s and Ph.D. holders, research papers are often used to share industry news and insight.
Research papers are useful ways to see how others are solving problems, widen our perspectives, and keep up to date with the latest trends.
How to implement this habit: Pick one or two research papers to read every week that are relevant to your current work or to technologies that you’re interested in pursuing or studying. Try to set aside time for this literature review every week to make this a priority. Become familiar with the Three-Pass Approach to reading research papers, which helps you gather pertinent information quickly. To really solidify your understanding of the papers, try to implement something that you learned from your reading into a personal project, or share what you learned with work colleagues.
The world of data science is changing rapidly, from the technologies used to the goals being attained. Don’t be that data scientist who is stuck in their ways, unwilling to change.
Not only does being open to change force you to continue improving as a professional, but it also keeps you relevant in a quickly changing industry that will spit you out the moment you fall behind.
How to implement this habit: Whenever a new technology or practice makes the news, take a test-drive and see what that new technology or practice brings to the table. Even if you just read the documentation, you can keep yourself up-to-date on the changing trends of the industry.
Furthermore, you can bring a perspective on the technology to your company and help them navigate technological changes and advances. Being that person in the office with your ear to the ground can help you stay ahead of the curve, and can also help you guide your team and company to better, more efficient solutions.
Developing good habits, at any stage in your data science career, allows you to fulfill your potential of becoming an effective member of the team who makes a large impact on whatever problem they’re trying to solve.
There’s no better time than right now to take the time to set yourself up for future success.