Once you know the amount and type of data you’re working with, it’s very easy to tell which laptop is the best one for your data analysis.
If you are using R, a computer with extra RAM will do; for something more distributed like MapReduce, even a set of simple commodity machines will do.
Having extra RAM will somewhat help you when you set up a pseudo-distributed cluster on one machine (though it will still be very limited).
Lastly, if your work involves GPU computing for parallel programming (e.g., using NVIDIA’s CUDA architecture to accelerate machine learning algorithms), you obviously want a good graphics card, but when shopping for one your focus should be on the number of cores/shaders rather than vRAM.
If you didn’t get some of what I said…
Like every other post on our site, it’s meant to be readable by everyone.
So what we’ll do in this post is:
- Briefly go over how Data Analysis uses computer hardware.
- Explain what hardware requirements may be more important in specific types of data analysis
- List the 2022 best laptops for the most common types of data analysis
- List laptops for people getting started with data analysis
There’s a problem with our plan though.
Going over how computer hardware is useful for very different types of data analysis and how each specific computer component can help make it faster can take a few pages of discussion.
So we’ll do this in two parts: we’ll go over the specs you need for data analysis and leave the rest of the details to the last section (bookmark this post to read the last section later if you are serious about data analysis).
Recommended Specs for Data Analysis
The motto for data analysis is:
“With greater data sets comes greater insights”.
Consequently, greater data sets demand more hardware resources.
So what’s a good configuration to make you a data scientist?
A KDnuggets poll points to a 3-4 core Windows system with 5-16GB of RAM.
A StackExchange thread averages out to a 16GB RAM, 1TB SSD Linux system with a dGPU.
A Quora thread converges around 16GB of RAM…
Enough jokes…let’s get started.
Experience will tell you RAM is the single most important factor for data science. As data sets grow larger, you’ll notice RAM becoming the biggest bottleneck. Things can speed up by an order of magnitude when all your processing happens in memory (RAM) as opposed to on the storage drive.
16GB RAM: this is ideal but it isn’t always available on budget laptops. However, most laptops can be upgraded to 16GB and even 32GB.
Do not go below 8GB! I warn you.
The second biggest factor is storage. A budget SSD is 2-3 times faster than a regular hard drive. Good SSDs are 4-5 times faster. Recent NVMe SSDs on the latest laptops can read/write data up to 17x faster.
More processing power is always good, but you’ll be bottlenecked by RAM and then by storage speed long before the CPU comes into play. Think about it: if your CPU can do 10^5 calculations per second but your RAM/storage drive can only serve 1,000 pieces of data per second, there’s obviously going to be a bottleneck.
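To see the idea concretely, here is a minimal sketch (with made-up data) that computes the same aggregate once from RAM and once from a file on the storage drive. Time the two paths yourself and the in-memory one wins by a wide margin, which is exactly why RAM and storage speed matter before CPU speed.

```python
import os
import tempfile

import numpy as np

# Made-up data; illustrative only.
rng = np.random.default_rng(seed=0)
data = rng.integers(0, 100, size=1_000_000)

# In-memory: the CPU reads straight from RAM.
in_memory_total = int(data.sum())

# On-disk: write the data out, then read it back from the drive.
with tempfile.NamedTemporaryFile(delete=False, suffix=".npy") as f:
    np.save(f, data)
    path = f.name

on_disk = np.load(path, mmap_mode="r")  # lazily reads from the storage drive
on_disk_total = int(on_disk.sum())
del on_disk
os.remove(path)

# Same answer either way; the difference is how fast the data can be served.
assert in_memory_total == on_disk_total
```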
After you max out on RAM & storage, you can then spend the rest of your budget on a “modern” CPU, not necessarily the fastest CPU, because for data analysis purposes modern CPUs are all roughly equally fast.
If you can afford to be picky, prioritize the number of cores first. (Don’t go for Xeon processors though*.)
If we are talking about laptops, unlike RAM and storage, CPUs are not really upgradeable so get the best CPU you can afford.
If you work with deep neural networks or other parallel computing workloads, then get a dedicated GPU with as many shaders/cores as you can afford.
Getting a superb keyboard on a laptop isn’t always possible. If you can’t find one, get an external keyboard and a mouse/trackball.
Either way my advice to you is to get the best ergonomics:
RSI and tendonitis are nasty.
At least 15 inches. You will probably end up ssh’ing into more powerful machines at some point, so you’ll appreciate any extra screen space.
Make sure to have a Thunderbolt (USB Type-C) port so you can transfer data to/from external drives at lightning-fast speeds. Most laptops today have one, so as long as you avoid old laptops you should get it automatically.
Mac vs. Windows vs. Linux – it depends on the industry/company you work for or your personal preference. But if budget allows, going for a laptop that supports a Linux-flavored OS seamlessly, like a Lenovo ThinkPad, MacBook or Dell XPS, will help.
A Linux-flavored OS may at some point become your default OS (Windows does not SSH well and requires a lot of extras to set up a typical workflow aimed towards the Cloud).
Though it’s not a deal breaker: with some workarounds, most other laptops can be made fully compatible.
The best rule of thumb when choosing a laptop is to get as many CPU cores as possible and make sure whatever laptop you end up with supports up to 16GB of RAM and a PCIe NVMe SSD (Solid State Drive).
So if you find RAM/Storage lacking, you can always upgrade them.
*Graphics cards are not upgradeable, so if you’re going for one, get one with as many CUDA cores as possible. More details in the last section.
Top 6 Best Laptops for Data Analysis
In this list I’ve tried to include a laptop for EVERYONE: beginners, students and data scientists (those into parallel programming, machine learning, deep learning, and those using AWS/cloud services, etc.).
Just read the descriptions carefully and you’ll be sure to find your best pick.
AMD Ryzen 5 5600H 6 Cores 4.2GHz*
NVIDIA GeForce GTX 1650 4GB vRAM
256GB PCIe NVMe
15” Full HD IPS 120Hz refresh rate
This is the most basic yet still high-performance laptop for pretty much ANY type of data analysis. It’s also more than ideal for anyone getting started with data analysis, including students doing research with OR taking classes in data analysis.
This is enough for simple statistical and ML/DL models of small data sets.
Obviously, you can also run any data analysis package and software (R/MATLAB/SAS, etc.).
The question is then what is considered a small data set?
Small is anything around 300k rows with 4 variables each, or data sets that take up about 300MB of space.
Since Windows itself eats ~2GB and background programs ~1GB, you still have around 5GB of RAM left.
The upper limit that this laptop can handle should be about:
5000MB (5GB) / 300MB ≈ 16, so about 16 × 300k ≈ 4,800k rows with 4 variables. The likelihood that someone getting started with data analysis has to work with a data set this big is very slim. If you ever need to work on bigger data sets, remember the Cloud is always there for you, with even much faster processing.
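If you want to check these numbers for your own data, pandas can report a DataFrame’s memory footprint directly. This is a hedged sketch with synthetic data; note that the ~300MB figure above assumes much wider rows than the 4 numeric columns used here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a "small" data set: 300k rows x 4 variables.
rows, cols = 300_000, 4
df = pd.DataFrame(
    np.random.default_rng(0).random((rows, cols)),
    columns=[f"var{i}" for i in range(cols)],
)

# deep=True also counts object/string columns accurately.
footprint_bytes = df.memory_usage(deep=True).sum()
print(f"{footprint_bytes / 1024**2:.1f} MB")
```

Four float64 columns work out to roughly rows × cols × 8 bytes, far less than 300MB; real-world CSVs with string columns and hundreds of variables get there quickly.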
And actually that’s the truth about data analysis today.
Q: But but but…I don’t want to use the cloud. What if I run into even bigger data sets?
Like I said, you can still run MUCH bigger data sets on a laptop as long as you up your RAM. This laptop can be upgraded to 16GB, and all it takes is removing one screw and VOILA: you have the slot open for an additional 8GB RAM stick.
If you throw even bigger data sets at it, the computer will run out of RAM and resort to the storage drive.
In such a scenario (~12,000k rows w/ 4 variables), computers with a PCIe NVMe SSD will not be as slow as HDD computers since, as I mentioned before, these are about 5-17 times faster at reading/writing data.
Before you set out to cheap out on a laptop, be aware that cheaper laptops will not have ANY type of dedicated GPU. Even those with dedicated GPUs will either be more expensive or cost you just a tad less than this one, and they will most likely come with a 1050Ti/MX450, which are several times weaker than a GTX 1650 (these two have far fewer CUDA cores).
The truth is, all laptops with a dGPU are good enough to tinker around with parallel computing, but why pay 30 bucks less for a much weaker GPU? What if the time comes when you land in an industry that makes good use of parallel processing (currently this aspect is growing across many, many libraries)?
*When you compare AMD chips to Intel chips, it’s more about “clock speed performance” than which one has more “clock speed” on paper. One may claim to have more, yet it may be outdone by CPUs with lower clock speeds.
Best Cheap Laptop For Data Analysis
Intel Core i3-1115G4 2 Cores 4.1GHz
Intel UHD Graphics
15” full HD IPS
This is the type of model that I’d recommend for those that are REALLY just getting started with Data Analysis OR a data scientist who will PURELY use the CLOUD FOR ANY PROCESSING.
Since you’ll be mostly taking classes and/or teaching yourself how to code/use libraries which have very small samples, you will need nothing more than a cheap laptop with Windows 10 on it. This is also the case for someone who’s just using this laptop to code and a cloud service to do all the processing (although you can still use it to handle very BIG data sets such as those in the first laptop as long as you up the RAM accordingly).
Well, if you are a student or if you are someone who is going to let the monstrous data sets run on the Cloud, then you have the luxury of choosing portability and battery life over power.
Q: Hold on, hold on, this laptop is way too cheap; is it really going to be as good as the first laptop?
For data science purposes, YES.
Most people have no clue how fast modern processors are regardless of how cheap they are, or that it’s all about RAM and then storage. Look at it closely: it also has an NVMe SSD.
It also has an FHD resolution, which means it’s going to give you just as much screen space as the first laptop (two windows open side by side, one for documentation and another for the IDE/visualization of data, is comfortable).
Note: this laptop doesn’t have a dedicated GPU, so any parallel processing will be restricted to the number of threads given by the CPU, which is 4.
M1 Pro Chip 8-10 Cores
16-64GB RAM DDR4
Up to 32-core GPU
512GB-2TB PCIe SSD
The release of the M1 Pro MacBooks has made going for an Apple laptop one of the smartest choices for ANY DATA SCIENTIST. If you are already making the big bucks in this field, it is definitely a no-brainer; it’s probably not going to get any better than an M1 Pro MacBook.
Here are the reasons:
- The UNIX-like environment. As laid out in the “how to buy the best computer for data science” section, OSX is the most work-efficient OS for data science, just as good as a Linux system. It’s not just that Python works best on a UNIX system, but also that all the software packages and languages are readily available out of the box.
- Portability and unrivaled battery life: for those always ssh’ing their way into a server, these two features become super handy. You can move from place to place and still run huge chunks of data from anywhere and at any time (as long as you have a WiFi connection or even cellular data).
And lastly the M1 Pro CHIP:
- For data science purposes it completely blows any “mobile” CPU from Intel/AMD out of the water, performing faster in this field at parallel processing thanks to its high core count.
- Like I said, it’s a high core count, higher than what you will find on any mobile Intel/AMD chip.
- The API and architecture are said to be optimized for “machine learning” and it’s true.
There’s one big catch though:
- For GPU processing purposes, not many libraries have adopted the GPU’s architecture. You can still find the GPU useful in some applications where it will actually outperform any CPU (again due to high core count).
MacBook Air M1 vs MacBook Pro M1 Pro:
85% of data science users will do okay with the MacBook Air, since it has even more “data science” power than most gaming laptops for non-parallel-processing purposes (it can be configured with 16GB of RAM and a 1TB NVMe SSD).
The remaining 15% may have to consider the MacBook Pro if they wish to run data sets that require 32GB of RAM.
Best Windows UltraBook For Data Science
Core i5-10210U 4 cores 4.2GHz
8GB RAM DDR3
256GB SSD M.2
13” full HD 1080p IPS
You can get into the same groove with any laptop just like you would with a MacBook, as long as you install LINUX on it.
If you have the budget, it’s always good to go with a Dell XPS or a Lenovo ThinkPad imo, for two reasons:
- Both of these will work right out of the box after you set up a Linux Distro on it. There will be zero compatibility issues with every piece of hardware.
- You will get ALMOST the same build quality/battery life and weight distribution of a MacBook.
Dell XPS 15: Dedicated GPU
The model I’m showing is a Dell XPS 13 and it doesn’t have a dGPU.
If you know you’re going to be using CUDA technology for parallel processing then you might as well get the Dell XPS 15.
Unlike the MacBook Pro, it has an NVIDIA GPU, and the current model has a mid-range chip with several times the number of CUDA cores of the first laptop.
Why do you keep pushing for a Unix-like environment? Why should I bother with it? What’s wrong with Windows?
There’s nothing wrong with Windows. Most, if not all, of the statistics platforms like R, scikit-learn, and the many others available work across all three operating systems.
However, from my experience it’s always best to have the unix terminal readily available, and this is only the case on Linux and Mac systems (although Windows has released its own terminal, it isn’t nearly as good as these two).
You will find it super useful if you are getting started in this field. I myself found it a blessing when I was a grad student and had to use a Dell machine with Ubuntu on it. It wasn’t just the terminal, though; I was also able to connect to computing infrastructures much more easily.
Unix systems just make it so much easier (and natural) to work with large data sets and it doesn’t have to be a MacBook. MacBooks will accomplish the same tasks (though in a fancier way).
Obviously, you are not going to get all the data from one company and process it all through your Dell XPS or MacBook Pro. What these will do for you instead is let you test a large sample of these data sets before you ssh into computer farms where the real processing happens and the final results come to fruition.
Best Windows Laptop for Data Analysis
AMD Ryzen 5 5500U 6 Cores 4GHz
16-24GB DDR4 (Up to 64GB)
AMD Radeon Graphics (integrated)
512GB-1TB NVMe SSD
15” FHD IPS
The last laptop that will let you seamlessly work with Unix-like environments, and that has all the necessary specs to crunch a very large sample of data, is the Lenovo ThinkPad.
In fact, ThinkPads are the de facto choice for anyone who wants to use a Linux distro on a laptop.
There are several models and versions of the ThinkPads, and most will let you choose whatever RAM size you want (all the way to 64GB); you also have a wide variety of choices for the processor, as shown here: Lenovo OEM E15 ThinkPad. Though it’s always best to do the upgrades yourself, because RAM costs almost nothing, making it far cheaper.
No dedicated GPU:
Most ThinkPads do not offer a dGPU. So if you are thinking of running parallel computing for Deep Learning / Machine Learning / Neural Networks on your RIG (without the use of a farm service), then you should take a look at other options.
Just like the MacBook Pro and the Dell XPS, you can use the ThinkPads to test the largest sample of your data set that can fit in memory (play variations with it), but leave the final processing to a computer farm.
Why do you still mention Cloud Computing?
Well, REALISTIC modern data sets (although they may fit into a laptop’s or desktop’s RAM) are usually processed on computer farms anyway, because there they are processed almost instantly (in hours or even minutes as opposed to days).
And when you throw yourself into the industry looking for a job, that’s going to help you stand out as a candidate. It isn’t just about experience; it’s about the willingness and the ability to get acquainted with a lot of stuff.
For cloud computing purposes, you just need a basic machine really, and there’s no need to invest in laptops like the MacBook Pro, Dell XPS or even these more budget-friendly ThinkPads. However, I know when I started I wasn’t super ready to switch from my laptop (a cozy local environment) to a remote cloud service or just any server with better resources. Most of you still want to play around and test as much data as possible on a little workhorse.
Best Laptop for Large Data Analysis
AMD Ryzen 9 5900HX 8 Cores 3.6GHz
32GB DDR4 (Up to 64GB)
NVIDIA GeForce RTX 3080
1TB PCIe NVMe SSD
15” FHD 300Hz IPS
If you want to get as much computing goodness out of a laptop, then you have to go for the high-tier gaming laptops: the ASUS ROG Strix Scar, Razer Blade, Alienware and MSI Stealth are all good models.
These will always come with the latest of the latest CPUs and GPUs released. The thicker they are and bigger they are, the more RAM and extra storage drives they can fit in.
I personally like MSI and ASUS; they have shown to have the best cooling systems, which keep their laptops up and running for several years without the need to downclock CPU/GPU speeds.
The RTX 3080 is one of the latest and most powerful GPUs available (bested only by the RTX 3090 and the Ti variants), with an insane number of CUDA cores. For machine learning/deep learning purposes this means you have thousands of extra cores (although working at lower clock speeds); this can speed up data crunching, especially from deep learning scripts, from one week to one day.
How To Buy the Best Laptop or Desktop Computer For Data Analysis
This section will put an emphasis on how to get as much computing as possible from a computer or a laptop for each type of data analysis.
I will start with some basics because I know many of the people who’ve landed on this page are just getting started and teaching themselves along the way.
Doing Data Analysis
There are two ways to do Data Analysis: using the cloud or using your rig.
A) Using the Cloud (I highly recommend you start doing it)
Using the cloud is another way to say “rent a computer service” from big companies like Amazon to do all calculations through their huge cluster of computers.
A good cloud environment like Amazon Web Services will give you access to on-demand EMR multi-machine clusters per hour. It includes access to all of their data stores like ElasticSearch and Redshift and more.
To use it, all you need is a computer with 4-8GB RAM and an internet connection. This is not just time saving but also money saving. Extra battery is always a good thing so you can check the progress away from home.
Using the cloud is not as uncommon as you think. I myself had to work a lot with Hadoop clusters; eventually I left the real machine learning and data munging to a computer farm.
Ever since, all I do now is download a small sample and test it on my laptop before using these clusters.
All you need for this is a laptop with a terminal to ssh into.
B) Using your Rig at Home
Building a rig for “big data analysis” can be time consuming and challenging. You’ll need several machines with at least:
- A multi-core CPU (multi-core AMD CPUs have better specs per dollar)
- 16GB of RAM.
- Solid State Drives in RAID configurations.
If you are on a low budget but still want your own little cluster back home, go for a refurbished server setup:
- Browse around Amazon, Ebay or any other e-commerce site.
- Join a data science Facebook group and make a post asking if there’s anyone selling their old server
When I tried to set up my server, that’s what I did, and I ended up with a 32-core, 64GB Linux server for about 400 bucks.
Software & Specs
With clusters out of the way, let’s now talk about single-machine set ups for Data Analysis and the hardware you need for that. Before we get into the very details of how each computer component helps with data analysis, let us go through the most common software and libraries and briefly mention hardware specs that help.
A) Learning Data Science
As a student you’ll end up using a combination of the following software/languages:
(Except for a few) Most of these are just libraries; any laptop with at least a dual-core 2.5GHz CPU and 8GB of RAM can handle these libraries and languages.
You won’t need to do any big data crunching for sure. If you do, universities have loads of servers for that.
The modules for Data science available in R and Python take a while to install so the only real difficulty here is to get the ecosystem ready to go.
The first time I had to install these I had to spend a whole week to make them work with each other. You don’t have to spend a whole week though since there are guides everywhere.
Whether the process will be difficult or not will depend on the OS you’re working with. I think it’s far easier to install them on Linux systems than MacBooks, and it’s much easier to install them on MacBooks than Windows machines.
If you aren’t willing to use a Linux system, I recommend OSX as a back-up choice. If price is an issue, then you probably want to look at older models; they will all work fine since they still have their OS regularly updated.
B) As a Data Scientist
There are three types of data science depending on the problem: volume, velocity or variety.
For volume and velocity problems: it’s always best to just get a computer/laptop that lets you seamlessly connect to cloud services.
For variety problems: It’s just better to have a rig back home or a laptop fully specced out for data science which will be the main point of this post.
Good results from machine-learning algorithms are highly dependent on the size of data, the more you have the better the results. This translates to more memory and CPU power.
If your focus is machine learning, then your budget should be spent on RAM and the number of CPU cores. If you use R (the RevoScaleR package) for this, then you can also use GPU cores/shaders to speed up the process.
Machine learning or not, working with large data sets through R is much much easier with more cores.
However, only up to a point, since the main bottleneck is still disk I/O and memory. In other words, you will run out of memory to hold the data you have long before you run out of cores. Given the constraints of computers/laptops, the upper limit is about 8 cores.
Hadoop has an innovative way to improve models despite limited hardware. It rose out of the need to use large datasets in view of the price and technological constraints at some point in time.
Since machine-learning algorithms output better results with more and more data (particularly for techniques such as clustering, outlier detection and product recommenders), a good approach is to use a “small sample” of the full data set: the small sample is basically whatever amount can fit into your computer’s memory, so you can run exploratory tasks and later get back results on the full dataset without having to sample it.
All you have to do now is write a map-reduce job (a Pig or Hive script) and launch it directly on Hadoop over the full dataset to get the results back to your laptop.
Regardless, Hadoop ALSO provides linearly scalable storage and processing power so you can now store ALL of the data in RAW format, and use the FULL dataset to get better and more accurate models.
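The sample-then-scale workflow above can be sketched in a few lines of pandas. The data and column names here are invented for illustration; in practice the exact figure would come from the Pig/Hive job over the full data set.

```python
import numpy as np
import pandas as pd

# Pretend this is the largest slice of the full data set that fits in RAM.
full = pd.DataFrame({
    "user_id": np.arange(1_000_000),
    "spend": np.random.default_rng(1).gamma(2.0, 50.0, 1_000_000),
})

# Reproducible 1% sample that comfortably fits in memory.
sample = full.sample(frac=0.01, random_state=42)

# Exploratory estimate on the sample...
estimate = sample["spend"].mean()
# ...versus the "cluster" answer over the full data.
exact = full["spend"].mean()
assert abs(estimate - exact) < 5  # close enough to steer the exploration
```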
Most people also use pandas to read CSV and Excel files for cleaning, filtering, partitioning, aggregating and summarizing data, with the intent to produce simple charts. This doesn’t ask for any special hardware; you can do it on any laptop whatsoever.
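As a minimal sketch of that pandas workflow (the CSV content here is invented):

```python
import io

import pandas as pd

# A tiny in-memory "file" standing in for a real CSV on disk.
csv = io.StringIO(
    "region,sales\n"
    "north,100\n"
    "north,\n"      # incomplete row to clean out
    "south,250\n"
    "south,150\n"
)

df = pd.read_csv(csv)
df = df.dropna(subset=["sales"])               # cleaning/filtering
summary = df.groupby("region")["sales"].sum()  # aggregating/summarizing
print(summary)
```

This prints a per-region total (north 100, south 400), ready to feed a simple chart.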
Likewise, if you have an application that requires the fusion of large tables with billions of rows to create a vector for each data object you can use HIVE or PIG scripts which should make this job very efficient on any computer/laptop.
On the other hand, training a heavy neural network is definitely out of reach for laptops, because the repeated passes over the data and the sheer number of parameters will quickly make your computer run out of resources.
Now that we’ve got the software and the basic ideas of hardware requirements out of the way, let’s get into the details of exactly what you need to know when building a desktop or buying a laptop for data science. If there’s any computer terminology you don’t understand, check my posts on the sidebar.
Quick CPU Lesson
Older generation CPUs aren’t necessarily a bad thing. For Data Science purposes, 8th generation CPUs are just as good as 12th generation CPUs (since the bottleneck will usually be #Cores/RAM memory).
There are two types: low voltage “U” processors and high-performance “H” processors. Ex: the Core i5-11100U is a low voltage CPU; these basically have lower clock speed performance compared to high-performance CPUs, which may also have more cores (usually 2 more).
Using the Cloud
If you are going to use the cloud for large datasets and leave the rest to your computer, basically any CPU regardless of generation will do. They all have more than 2 cores (four threads, if you are worried about “multitasking”) and 3GHz+ clock speeds.
Even 6th and 5th generation CPUs which are much much cheaper will be great for testing code, programming and uploading the work up in the cloud.
I would particularly recommend 8th generation “low voltage” Intel Core i3 CPUs or 3rd generation Ryzen CPUs because they have at least 2 cores (4 in the case of Ryzen), they consume much less power (this means more battery) AND are all very very cheap.
Using your Laptop or Desktop
You do not need to worry about type of generation, just get more cores. Although you will find that the most recent “H” High performance processors have more cores and clock speeds than low voltage CPUs anyways.
2022 Intel CPUs*
2022 AMD CPUs
|CPU|Base (GHz)|Boost (GHz)|Cores|
|---|---|---|---|
|Ryzen 9 5900HX|3.3|4.6|8|
|Ryzen 9 4800HS|2.2|4.4|8|
|Ryzen 7 5800H|3.3|4.4|8|
|Ryzen 7 3750H|2.3|4.0|4|
|Ryzen 7 5800U|1.9|4.4|8|
|Ryzen 7 5700U|1.8|4.3|8|
|Ryzen 7 3700U|2.3|4.0|4|
|Ryzen 5 5600H|3.3|4.2|6|
|Ryzen 5 3550H|2.1|3.7|4|
|Ryzen 5 5500U|2.1|4.4|6|
|Ryzen 5 3500U|2.1|3.7|4|
|Ryzen 3 5300U|2.6|3.8|8|
|Ryzen 3 3300U|2.1|3.5|4|
CPUs for Laptops
If we are talking about laptops, you’ll get the best performance out of Ryzen 7/9 or Core i7/Core i9 CPUs. I would not get a Core i9/Ryzen 9 CPU though, for a few reasons:
- They are very very expensive
- They will probably not be very helpful since your main bottleneck will be RAM/Storage.
- You only gain a few hundreds of extra MHz over Ryzen 7 / Core i7 CPUs.
- By the time you really need that extra boost in CPU power, you’ll find out you’ll be better off using a cloud service.
*Note that in the case of laptops, getting a high performance CPU ensures you can upgrade the RAM to 32 or even 64GB (though you will still find low voltage CPUs reaching 32GB if you look around long enough).
*Likewise, only the most recent CPUs (6th generation for Intel and 3rd generation for Ryzen) ensure the storage is compatible with NVMe PCIe SSD upgrades.
CPUs for Desktops
If we are talking about desktops, the situation changes: you can choose whatever you want.
- Desktops can be the closest thing to a server because RAM and storage can be upgraded to crazy amounts. Even systems with low voltage CPUs can take AT LEAST 32GB.
- You don’t have to worry about dangerous temperatures so you can leave that CPU crunching data for days.
RAM

The single most important component for data science. Luckily for data scientists, it’s the cheapest to upgrade.
How does it work?
If data is written on the front page of a piece of paper, it’s hard to read if you have the paper backwards; it’s possible, but much slower.
When the data set is loaded into RAM, it’s like the computer turns the page over, making it much easier and faster to read.
It’s hard to give a ballpark of how much RAM is good to have for relatively fast data crunching without knowing how much data there is. If you are using…
A) The Cloud
You will only need about 8GB RAM because:
- You don’t need RAM to upload data to a server
- You only need this much to create a reasonable amount of test data to use on your desktop or laptop first before uploading it to the cloud.
- This much is sufficient for the multitasking that comes with it too.
B) A Laptop or a Desktop
First, we need to know how much RAM a data set of a specific size takes up.
4GB: enough for a small data set
A data set of 100,000 to 200,000 rows with 200 variables each will take up around 300MB of RAM.
So assuming you work with up to this much data and you are not doing anything crazy with it (like trying to visualize all of this data at once), even a small laptop with 4GB will be fine.
8GB: Large Data Sets
A large data set is about 25x the size of a small data set. The minimum size (25 × 200,000 rows w/ 200 variables, roughly 7.5GB) will barely make it in 8GB of RAM, because the software and the OS also take up RAM space.
However, if your data analysis/scripting/programming skills are good, you can process the data in chunks (and leverage more cores/threads) to make it fit into 8GB of RAM. This is good practice and you should start learning how to write such scripts right away.
16GB: The Best
You can do amazing stuff with 2x the RAM of your largest chunk of data, so I recommend you quickly upgrade your RAM if you barely make it.
As a ballpark:
Experience leads me to believe that 75% would be happy with 8GB, 85% with 16GB and 95% with 32GB.
Q: How is 16GB going to help me if I have a small data set?
A: Well, even if your data set takes 1GB, having more RAM means you spend less time thinking about whether you can afford a new local variable to store some permutation of your data set.
Another reason is to have multiple versions of the same data set when you’re doing the experimentation.
Q: How do I find out exactly how much I will need?
A: If you don’t really know how much RAM your data set takes:
1. Press CTRL+ALT+DEL to open the task manager.
2. Under the Performance tab, click Memory to see RAM usage
3. Look at the “memory” and “virtual memory” columns. This will give you a sense of the memory footprint used when large datasets are open.
Write down that number and aim to have 2x the memory footprint + OS overhead + apps (~500MB).
OS overhead is approx. 3GB.
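That rule of thumb is easy to encode; the constants below are just the article’s estimates, not hard numbers.

```python
# 2x the data set's footprint, plus OS overhead and other apps.
OS_OVERHEAD_GB = 3.0   # approximate OS overhead from above
APPS_GB = 0.5          # background apps (~500MB)

def recommended_ram_gb(dataset_footprint_gb: float) -> float:
    return 2 * dataset_footprint_gb + OS_OVERHEAD_GB + APPS_GB

# A 6GB data set suggests about 15.5GB, i.e. aim for a 16GB machine.
print(recommended_ram_gb(6.0))  # 15.5
```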
Q: Why is my entire system running so slow when running large data sets?
When you don’t have enough RAM, the OS will start to “thrash”, which means it’ll move some things out of memory to let others run.
Q: But Quora told me RAM doesn’t matter!? My laptop can still run large data sets regardless of how much RAM I have…
That’s partly true.
Let’s say you have a dataset that’s about 6GB. (let’s assume you only need mem allocation for the dataset)
A laptop with 4GB RAM can run the script no problem ONLY IF you divide the dataset into reasonably sized partitions and process them separately. You can combine the results later to get the full view and you’re done.
However, if you have 8GB RAM, this is enough to handle the data as a whole and process the whole thing in one go. This will be much faster; the partitioned option is doable but slow.
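The partitioned option looks like this in pandas, with a toy CSV standing in for the 6GB file; only one chunk is ever in memory at a time.

```python
import io

import pandas as pd

# Toy stand-in for a file too big to load whole.
csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(1_000)))

total, count = 0.0, 0
for chunk in pd.read_csv(csv, chunksize=100):  # 100-row partitions
    total += chunk["value"].sum()              # partial result per chunk
    count += len(chunk)

chunked_mean = total / count                   # combine the partial results
assert chunked_mean == sum(range(1_000)) / 1_000  # matches the one-shot answer
```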
Q: What about the Data Preparation Process?
How much RAM you need also depends on the data preparation step.
What’s data preparation?
Data scientists have two sets of skills: preparing big data (usually on-disk processing through Unix grep, AWK, Python, Apache Spark, etc.) AND in-memory analytics (R, Python, SciPy, etc.) programming skills.
When you have RAM to spare, YOU DON’T NEED THE FIRST SKILL most of the time. It will become important sometimes though, such as in text analytics, where the amount of input data is naturally big.
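The “big data preparation” skill is mostly streaming: touch one record at a time so the file never has to fit in RAM. Here’s a hedged sketch of a grep-style filter (the log format is invented for illustration):

```python
import os
import tempfile

# Write a fake log file to stream over.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".log") as f:
    for i in range(10_000):
        f.write(f"{'ERROR' if i % 100 == 0 else 'INFO'} line {i}\n")
    path = f.name

# grep-style filter: one line in memory at a time, O(1) RAM.
matches = 0
with open(path) as f:
    for line in f:
        if line.startswith("ERROR"):
            matches += 1

os.remove(path)
print(matches)  # 100
```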
Q: How much RAM is good for Machine Learning?
Real-life ML models generally involve cluster computing, with in-memory datasets (no storage drives) spread over tens or hundreds of machines, each with 32, 64 or 128GB of RAM.
So if you want to get started with machine learning and try it out on your rig (desktop or laptop), start with 16GB, making sure your laptop has spare slots (it should be upgradeable to at least 32GB).
Hard Disk Drives vs Solid State Drives
There are basically two types of storage drives: Solid State Drives and Hard Disk Drives.
Despite being more expensive to manufacture, Solid State Drives are nearly ubiquitous in computers and laptops, and as you probably know they have much faster read/write speeds than Hard Disk Drives.
Hard Disk Drives can still be found, however, and if you are building a desktop they are much, much cheaper. On laptops they also have the advantage of being a nice, cheap upgrade when you need more storage.
Still, you should prioritize getting an SSD over an HDD whenever possible.
A) Using the Cloud
The truth is, if you are using the Cloud, you don’t need to worry about storage types or capacities. An SSD will still come in handy for speeding up your workflow (though not the data crunching itself): the OS, the software and the terminal all load fast, and searching for a particular piece of code somewhere on your computer takes a split second.
B) Using your Rig
If you are using your own server/laptop/desktop and your data sets are BIGGER than what your RAM can handle, then Solid State Drives are no longer optional; they are a MUST-HAVE if you want results quicker. When your computer runs out of RAM, it resorts to the storage drive as the “reservoir” where the data crunching takes place, so get the fastest drive you can afford if upgrading the RAM is not possible.
Obviously, if your data fits in RAM, you don’t need to worry about this. Data is always stored on your storage drive and then transferred into RAM for crunching, but that one-time transfer isn’t the bottleneck, so an SSD won’t meaningfully speed up the crunching itself.
However you should still get an SSD… for the reasons I mentioned above.
If you are on a budget, I recommend two disks: one SSD for the operating system and one old, slow HDD for the data. The SSD should be at least 64GB. The first loads the OS/software and the second stores data for crunching. If you know your machine is going to run out of memory, be sure to move the data to be processed onto the SSD first.
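That last step can be automated. A minimal sketch, where `free_ram_bytes` is assumed to come from your own measurement (e.g. `psutil.virtual_memory().available`) and the demo uses temp directories to stand in for the HDD and SSD:

```python
import os
import shutil
import tempfile

def stage_for_crunching(data_path, ssd_dir, free_ram_bytes):
    """If the dataset won't fit in free RAM, copy it onto the SSD first."""
    if os.path.getsize(data_path) > free_ram_bytes:
        dest = os.path.join(ssd_dir, os.path.basename(data_path))
        return shutil.copy(data_path, dest)  # crunch from the fast disk
    return data_path  # fits in RAM, so the drive won't be the bottleneck

# Demo with a tiny file and an artificially small "free RAM" figure
with tempfile.TemporaryDirectory() as hdd, tempfile.TemporaryDirectory() as ssd:
    src = os.path.join(hdd, "data.csv")
    with open(src, "w") as f:
        f.write("a,b,c\n" * 1000)
    staged = stage_for_crunching(src, ssd, free_ram_bytes=100)  # forces the copy
    assert staged.startswith(ssd)  # the data now lives on the "SSD"
```

The `stage_for_crunching` helper is hypothetical; the point is simply to compare the dataset’s size against available RAM before deciding where it should live.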
GPUs have only recently become important in data science. Parallel processing has found its way into nearly every application. Neural networks, for example, now benefit from dedicated GPUs.
NVIDIA: not Intel, not AMD and not M1 Pro Chips
This is really only true for NVIDIA GPUs, though, since most data science apps have only implemented tools for NVIDIA’s “CUDA cores”.
AMD, although it has gained a lot of traction in the dGPU business, doesn’t have nearly the same footprint in the parallel processing world as NVIDIA, although some apps have implemented its architecture into their data science tools.
Q: So which Data Science Software/Service/Tools make use of NVIDIAs CUDA core technology?
It’s only legacy software that doesn’t. Even AWS provides CUDA-capable GPU instances for processing at this point.
You should still double-check whether a library/tool actually uses parallel processing. You’d be surprised how many people think that plugging a GPU into their computer/AWS package will somehow bring parallel computation to old, non-parallel legacy software.
Machine and Deep Learning
Most deep learning and machine learning libraries (TensorFlow & PyTorch) now use NVIDIA GPUs.
In fact, deep learning algorithms are now optimized to run on GPUs instead. Algorithms that took a week on a CPU now take only a day on a dGPU.
Image Analysis has been using CUDA cores for ages now.
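Rather than assuming your library will use the GPU, it’s worth checking explicitly. A minimal sketch for PyTorch (assuming PyTorch; TensorFlow has its own `tf.config.list_physical_devices('GPU')`), which falls back gracefully if no framework is installed:

```python
def pick_device():
    """Return 'cuda' if PyTorch can see an NVIDIA GPU, otherwise 'cpu'."""
    try:
        import torch  # assumption: PyTorch is the framework in use
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"  # no framework installed: CPU only

device = pick_device()
print(f"Training will run on: {device}")
```

If this prints `cpu` on a machine with a dGPU, your CUDA drivers or framework build are the problem, not the hardware.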
Q: Wait, that’s only for desktop GPUs right? I’ve heard laptop’s dGPUs are useless.
Wherever you read that was probably written 10 years ago. Laptop GPUs (especially from the 10 series onwards) are basically the same GPUs used on desktops, except that they’ve been downclocked to accommodate the high-temperature environment of laptops.
You’d be surprised to know that even the cheapest dedicated graphics card found in a laptop (the MX450) will give you a 50-100% performance boost for data analysis.
Q: Which dGPUs do you specifically recommend? Which ones should I avoid?
2022 Consumer Gaming GPUs
- I would avoid any of the GPUs in gray. They will work fine, but they are outdated and no longer give the best performance across all apps.
- Pick 10-series or newer GPUs over older generations whenever possible. They have far more CUDA cores, as you can see in the table.
Size & Resolution
You need a large screen. Staring at large data sets for a long time on a laptop is not easy on the eyes. Besides, the extra screen space will help when:
- SSHing into more powerful machines/cloud services
- Getting a better view of graphics and visualizations
FHD resolution: Make sure it has an FHD (1920x1080) display too. This creates even more workspace by rendering everything at smaller sizes. DO NOT get HD or HD+ displays unless you have eyesight issues.
All laptops have connectors for an external display, and you should take advantage of that. A second screen to dock into at home will do wonders for your productivity and will be much, much easier on your poor eyes. You can even get two external displays: one for writing scripts or developing software and the other for reading documentation and writing reports.
Cloud Services (For Newbies)
When a data scientist is working with a set of data that requires more computational resources than their desktop or laptop can provide, they use a more powerful computer called a server.
A server is generally a very powerful computer which is dedicated to a specific task (for example running a file system, running a database, doing data analysis, running a web application or even all of the above!).
For example, if you are dealing with a 100GB data set, one option for a computer without enough RAM would be to load the data into a database and run the analytics there.
A faster (and arguably better) option would be to use a server with enough RAM (more than 100GB) and run all the analytics in RAM, just as you would with smaller data sets.
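The two options above can be sketched with the standard library: `sqlite3` stands in for the disk-backed database, and a plain in-memory mean stands in for the in-RAM analytics. (The real trade-off only shows up at scale; here the connection is in-memory just to keep the demo self-contained.)

```python
import sqlite3
import statistics

rows = [(i, i * 2.5) for i in range(1000)]  # stand-in dataset

# Option 1: load into a database and push the analytics to it
# (for a real 100GB set, this would be a file-backed database on disk)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, value REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)
db_mean = con.execute("SELECT AVG(value) FROM t").fetchone()[0]

# Option 2: enough RAM -> do the analytics entirely in memory
ram_mean = statistics.fmean(v for _, v in rows)

assert abs(db_mean - ram_mean) < 1e-9  # same answer either way
```

Both routes give the same result; the in-RAM route is simply much faster once the data set outgrows what the disk can serve up quickly.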
The math always works out against buying a “better laptop”, and that will hold true for a very long time with Linode, AWS, Microsoft and DigitalOcean selling incredibly cheap compute power.
As of today, I have subscriptions to two of these, DigitalOcean and AWS, and the cost is almost nothing compared to what I’ve saved by not buying desktops with 128GB of RAM.
AWS (Amazon Web Services)
AWS is the biggest dog in the cloud service market. Sooner or later you’re going to end up using AWS or another cloud service. It’s not just that real data is too big to fit on a local machine now; it’s also a crucial skill in the marketplace right now.
Moreover to get the real flavor of being a data scientist you actually have to work with cloud systems sooner or later.
If you plan on doing more intense stuff (Neural Networks, Support Vector Machines on a lot of data), even the most powerful desktop GPU will not cut it and you will be better off renting an AWS instance.
Note that AWS has a free tier for you to get started with, so you’ve got nothing to lose at this point.
Using a VNC (Virtual Network Computing)
Oh I tried it. I already had built my ideal (i.e. powerful) data analytics computer about a year prior, but it was a desktop.
I figured I could actually just buy a really cheap laptop, keep my desktop running all the time, and then use RDP*, Teamviewer*, or a VNC* programme to connect to it whenever I needed to do some data analysis.
So I bought a cheap $350 Windows laptop and started trying to set up a VNC.
I got it working. But not only did it mean that I had to always leave my desktop running: it was fairly laggy too.
Amazon AWS EC2
That’s how I got fed up with it and discovered Amazon AWS EC2.
This service actually does something similar. It lets you create virtual computers with any operating system you want and customize how you access them.
I set up one of these (Linux), then taught myself how to use Linux.
The most useful thing about it is that I installed a web based IDE for R on it (Rstudio), which allowed me to go to a website hosted by my EC2 server and use R as if I was sitting at that computer.
Now, whenever I want to do some work, I can do it from any computer in the world with an internet connection, simply by visiting a website, and all the processing is done on the Amazon server.
Cost: You have to pay for the server, but they are inexpensive, and you pay different amounts based on the (virtual) processor, RAM, GPU, etc. of the server.
Also, there is a one-year free tier which lets you use the least powerful virtual server at no cost.
I understand that R may not be the only language you wish to use, but given that you can install anything you want on your server, it seems like a viable option for pretty much any data scientist.
Pros:
- Can access the server from any device with an internet connection
- Files are always accessible; you don’t even need to download them (like you would with Dropbox), just view them on the server
- Costs much less than a powerful laptop
- The server can be programmatically scaled to match your analysis needs using an API
Cons:
- A laptop screen is quite small, though I now find I access the server mostly from other desktops or a 17” laptop
- Requires an internet connection to use
- Can take some time to learn how to use Elastic Compute Cloud (EC2)
OS: Mac vs Windows vs Linux
For some it may seem like only Mac and Linux are the way to go. But it all comes down to preference anyway. Most of the packages you will need work across all platforms (Octave and R are good examples and have been available on all OSs for ages).
There are still a few advantages and disadvantages to each, though:
Working with Python on macOS/Linux is much, much easier due to better package management. Since Python is still the most widely used language for Data Science, that may imply that Macs/Linux distros are the only option.
In a way that’s true because that also means you’ll have access to the latest libraries.
If you use Windows, you may have to wait for libraries to be compiled as binaries.
On Windows, you will be required to do a few more tweaks to successfully set up your machine to run data science scripts. Unfortunately, most of the (sporadic and poorly written) documentation available for cutting-edge data science tools assumes you are working on a Unix system.
Let us not forget the almighty advantage of getting a Windows laptop though: the hardware.
Cheaper Hardware, dGPUs and more
One can get the same specs for half the price on a Windows laptop, and you also have the option to choose the style, format and specs to your liking.
There’s also the upgradeability. Only Windows machines let you do the upgrade on your own. Need a bigger, faster SSD? No biggie, you can upgrade it yourself. More memory? A five-year-old with a screwdriver can do it.
You can also repair PCs, while Macs are sealed boxes (open one and you lose the warranty).
NVIDIA GPUs: These have never been mainstream on Macs. Only very few older models have them.
Excel: Excel, which in my mind is still one of the most effective tools for data analysis, works best on Windows. If you’re going deep into pivot tables and more complex models that involve macros, you either need a PC or you need to run a virtualized copy of Windows on a Mac.
Budget issues out of the way, it comes down to whether or not you need an NVIDIA dGPU.
MATLAB, S-Plus, SPSS, Python, pandas, all the machine learning/deep learning algorithms, and databases like PostgreSQL/MySQL will all work on either.
If you have any questions or suggestions, please leave a comment below. Your input is taken seriously and will also be used in future updates.