Without knowing exactly the type and amount of data and, more importantly, the type of analysis, it is very hard to just go ahead and tell you this is the “best laptop for data analysis”.
If you are using R, having more memory is always a good thing but for something distributed like map-reduce, even a simple set of commodity machines would suffice.
Having more memory also helps set up a pseudo-distributed cluster on a single machine to some extent, but it will still be very limited.
Lastly, if you are into GPU computing for parallel programming (e.g. using Nvidia’s CUDA architecture to optimize machine learning algorithms), having a better graphics card (more cores) will be very helpful.
If you did not understand anything I just mentioned…
Don’t worry that’s fine.
Like every other post on this website, it’s meant to be read by everyone.
So what we’ve done with this post is:
- Go briefly over what Data Analysis means for computer hardware.
- Explain why some hardware requirements are important for certain types of data analysis
- List the current best laptops for the most common types of data analysis being done today
- List laptops for those who are just trying to get started in the field.
There’s one problem though: going over data analysis and explaining how each hardware component helps make it faster can take a few pages of discussion.
For now we’ll quickly summarize the specs you need for data analysis and leave all the details to the last section (you should give it a read later if you are serious about data analysis).
Recommended Specs for Data Analysis
The general motto for data analysis is:
“With greater data sets comes greater insights”.
Unfortunately, that also translates to bigger demand for hardware resources.
So what’s a good configuration to get started for someone who wants to be a data scientist?
A KDnuggets poll points to Windows systems with 3-4 cores and 5-16GB of RAM.
A StackExchange thread recommends 16GB of RAM and a 1TB SSD, running Linux, with a GPU.
A Quora thread converges around 16GB of RAM.
Ok let’s be more practical.
My experience is that RAM is the most important thing for data science because it is the biggest bottleneck with large datasets. Things speed up by an order of magnitude when all your processing is done in memory (in RAM). 16GB of RAM is ideal, but it isn’t always available on laptops around 600 bucks. You can always upgrade a cheap 350$ laptop to 16GB, though.
Do not go below 8GB! I warn you.
Second is the hard drive. An SSD is going to make an enormous difference: a budget SSD is going to be 2-3 times faster than a regular hard drive, a good SSD is going to be 4-5 times faster, and an NVMe SSD like those found in MacBook Pros and the newest laptops can be up to 17 times faster.
Processing power is always good but you’ll more than likely be bottlenecked by either Storage Speed or RAM.
No point in being able to do a million calculations per second if your hard drive can only serve up 1000 pieces of data per second.
After you max out on these, invest the rest of your budget in a “modern” CPU, not necessarily the fastest one, because they’re all fast today. Note that unlike RAM and storage, the CPU is not upgradeable, so get the best one you can still afford.
If you work with deep neural networks or neural networks in general (parallel computing), get the graphics card with as many CUDA cores/shaders as you can afford. Go for NVIDIA or AMD, not Intel HD cards.
Getting a superb keyboard with all the computer goodness mentioned isn’t always possible. So if you’re going to be doing a lot of typing get an external keyboard and mouse/trackball.
My recommendation is to make sure they’re ergonomic:
RSI and tendonitis are nasty.
Minimum 15 inches. You will probably end up ssh’ing into more powerful machines at some point, so screen real estate becomes super important too.
Another plus is to make sure your laptop has a Thunderbolt port (over USB Type-C) so you can transfer data to/from external drives at lightning-fast speeds. Most laptops today have one, so as long as you avoid cheap old laptops, you should get it automatically.
Mac vs. Windows vs. Linux: it depends on the industry/company you work for or your personal preference. But I’d recommend going for a laptop that can support a Linux-flavored OS seamlessly, like a Lenovo or a MacBook.
A Linux-flavored OS may at some point become your default OS (Windows does not connect as well and requires a lot of extras to fit in with a typical workflow that will end up on the cloud).
The best rule of thumb when choosing a laptop is to get as many cores from a processor as possible and make sure your laptop can support up to 16GB of RAM.
So if a laptop comes up short on storage or RAM, you can always do the upgrade.
*Graphics cards are not upgradeable, so get the one with as many CUDA cores as you can afford too (if you know you need them, of course). More details in the last section.
Top 6 Best Laptops for Data Analysis
In this list I’ve included laptops for beginners, students and every type of data scientist (those into parallel programming, machine learning, deep learning, those using AWS/cloud services, etc.).
Just keep scrolling down and read the descriptions carefully and you’ll be sure to get the best bang for your buck.
1. Acer Nitro 5
This is the most basic laptop for pretty much any type of data analysis, and it’s ideal for those getting started with data analysis too, especially students doing research or taking classes that involve data analysis.
RAM sitting at 8GB is enough for simple statistical and ML/DL models on small data sets, although the GPU is definitely overkill for anything that simple.
Obviously, you’ll also be able to run any data analysis package/software (R, MATLAB, SAS, etc.). The question is really: what size is a small data set?
I consider small anything around 300k rows with 4 variables each, or anything that takes about 300MB. Since Windows itself eats about 2GB and other programs about 1GB, an 8GB laptop leaves you with roughly 5GB of RAM.
The biggest data set this laptop can handle (before resorting to using your disk storage “as memory”) should be about
5000MB / 300MB ≈ 16.7, and 16.7 × 300k rows ≈ 5000k rows with 4 variables.
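The back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only a rough illustration: the function name and parameters are made up for this example, and the inputs (about 2GB for Windows, about 1GB for other programs, 300k rows per 300MB) are the ballpark figures quoted above, not measurements.

```python
def max_rows_in_ram(total_ram_mb, os_mb, other_programs_mb, sample_rows, sample_mb):
    """Estimate how many rows fit in the RAM left over for data.

    sample_rows / sample_mb describe a reference data set of known size;
    all values are rough assumptions, not measured numbers.
    """
    free_mb = total_ram_mb - os_mb - other_programs_mb
    if free_mb <= 0:
        return 0
    rows_per_mb = sample_rows / sample_mb
    return int(free_mb * rows_per_mb)


# 8GB laptop, ~2GB for Windows, ~1GB for other programs,
# reference point of 300k rows taking 300MB.
rows = max_rows_in_ram(8000, 2000, 1000, 300_000, 300)
print(f"{rows:,} rows")  # 5,000,000 rows
```

Swap in 16000 for the first argument to see what the 16GB upgrade buys you.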
The likelihood of you, as a student or someone starting with data analysis, encountering a data set of this size is very slim.
And if you do, you can always hop into the cloud for faster processing.
I know half of you are aware of what I’m going to say (and I talk about it in the last section): it is far easier now to start doing data analysis and data science with a really basic laptop because as long as you can pay for a cloud-based service to host your files you can do computation remotely on a cluster.
However, if you still want to run larger data sets on this laptop, you also have the option to upgrade it to 16GB. All it takes is removing one screw and inserting another RAM stick.
If you run into even bigger data sets (~12000k rows w/ 4 variables), this laptop will not be that slow because it has a Solid State Drive: when your computer starts using disk storage as RAM, it’s going to be about 5 times faster than it would’ve been on a traditional cheap laptop with an HDD.
Lastly, before resorting to cheaper or similar models from other brands, note that this laptop also comes with a mid-range, late-generation GPU (the GTX 1650, with a lot more CUDA cores than the 1050Ti/MX150/MX250 found in laptops at almost the same price).
So if you want to start tinkering with parallel processing tasks (which you should, as this is an essential tool in the industry that keeps expanding across many libraries), it becomes even more important.
Best Cheap Laptop For Data Analysis
Core i3-8100U 3.4GHz
8GB RAM DDR4
128GB PCIe NVMe SSD
15” IPS full HD 1080p
This one is for students.
Since it’s mostly classes and learning to code/use libraries with small samples, most students will not even need the power of the Acer Nitro. What will be required is portability and battery life. Lecture halls, coffee bars and libraries will all have their sockets taken.
Although at first glance the price may suggest that it’s a weak laptop (along with the fact that most people have no clue about Ryzen CPUs), this laptop has plenty of power to run data analysis software, learn libraries, etc. (not for medium to large data sets though).
It is portable, and portable is usually expensive, but its battery life is pretty amazing too: ~10 hours. On top of that it sports a full HD resolution, which is a must if you are just beginning (it lets you keep two windows open side by side: one for documentation, another for your IDE or data visualization).
All of that for 350$-450$ (350$ if you already have a copy of Windows 10). Unlike the older version of the Acer Aspire, the new one is not as heavy; it’s actually 1lb lighter.
But if you are looking for a laptop to lug everywhere and still feel like you are just carrying a book, you need to get one weighing 3lbs.
Note: this laptop doesn’t have a dedicated GPU, so any parallel processing will be restricted to the number of threads given by the CPU, which is 4.
| Model | CPU | RAM | OS | Price |
|---|---|---|---|---|
| ASUS ImagineBook | Core M3-8100Y | 4GB | Windows 10S | 349$ |
| Acer Aspire 5 | Ryzen 3 3200U | 4GB | Windows 10S | 349$ |
| Acer Aspire 5 | Core i3 1005G1 | 4GB | Windows 10S | 399$ |
| ASUS VivoBook 15 | Ryzen 3 3200U | 8GB | Windows 10 HOME | 450$ |
| HP 15.6″ | Core i3 1005G1 | 4GB | Windows 10 HOME | 424$ |
| Lenovo IdeaPad 3 | Ryzen 5 3500U | 8GB | Windows 10 HOME | 449$ |
3. MacBook Air
Best Mac Laptop For Data Analysis
Intel Core i5 2.9GHz
8GB RAM LPDDR3
13” 1440×900 TN
A much better choice for any data scientist is sticking to a UNIX-like environment. The MacBook Air here is adequate for both working data scientists and students.
It’s probably the most user-friendly option for those people getting into the field too. Installing libraries, getting python to work and anything else is just a command away when using the terminal.
Although you have to be extremely careful on how you install python, we’ll write a tutorial on that.
If you do end up doing all the heavy lifting on cloud services or by ssh’ing your way into a server because you’re moving around all over the place (from conferences to coffee shops and whatnot), then you’ll also appreciate the portability only the Air has: 13+ hours of battery life and 3lb.
Note that I am talking about the old MacBook Air. The new Air is a bit more powerful and has a much better display, but at the cost of a shorter battery life (~10 hours) and an okay keyboard compared to the old Air.
One of the main issues of the Air is that it doesn’t have the biggest RAM options: it’s limited to 16GB at the most. Luckily this may be sufficient for 85% of the data scientists out there who plan on running simulations on it.
The remaining 15% of data scientists who still want all the perks of the Air and the UNIX-like environment should consider the 13” MacBook Pro which has the option to up the RAM to 32GB!
Best Dell Laptop For Data Analysis
8GB RAM DDR3
Intel UHD 620
13” full HD 1080p
I am aware you can get into the same groove (or an even better one) with a Linux distro installed on a laptop, that is, get all the advantages of a full UNIX environment at a much more affordable price.
I’d recommend you carefully consider buying a decent laptop to install a Linux distro on, since this will be your main tool of work.
The Dell XPS 13 (along with the Lenovos) gives you some of the best out-of-the-box compatibility with all Linux flavors (especially Ubuntu) while keeping almost the same design as the Air (thinness, weight, battery life).
Compared to the Air, though, the Dell XPS 13 has a full-blown Core i5 CPU that hasn’t been downclocked or modified. This comes with the caveat of only giving you 10 hours of battery life instead of the ~13 hours you’d get with the Air.
Dell XPS 15:
If you have to use CUDA for parallel computing at some point, you also have the Dell XPS 15, which has the same power as the weaker version of the 16” MacBook Pro but comes with an NVIDIA GPU instead of an AMD GPU.
Why do you keep pushing for a Unix-like environment? Why should I bother with it?
We are aware that most of the statistics platforms like R, scikit-learn, or the many many others, are relatively independent of the big OSs.
But from my experience the most important thing for me was to have a unix terminal readily available.
I’ve found this useful since I was a grad student and postdoc and had to use a Dell (not this one) with Ubuntu on it to be able to connect to a computing infrastructure.
Having a Unix-like environment just made it so much easier and more natural to work with large data sets.
MacBooks, which I was always provided with for these tasks, will accomplish the same thing (though in a slightly fancier way), but I know there’s still a big stigma against Apple and of course they’re expensive too. This is why I’m listing some Windows options that can accomplish the same tasks.
Obviously, you can’t just download all the data from big infrastructures and use the Dell XPS or MacBooks to process it. What these can do for you, though, is test your code on as much data as these machines can handle, so you can later ssh into computer farms where the real processing comes into play.
Best Windows Laptop for Data Analysis
Core i5-8265U Up to 3.9 GHz
8-40GB DDR4 RAM
Intel HD 620
14” TN FHD
Another family of laptops capable of running a Unix-like environment seamlessly, with all the powerful specs needed to run large data sets, is the Lenovo ThinkPads.
In fact, they are the de facto choice for anyone who wants to install Linux distros on a Windows laptop.
You can configure the ThinkPads to whatever RAM size you think you are going to need (8-40GB) and the processor too.
Note that the ThinkPads do not have a dedicated GPU, so you won’t be able to take advantage of GPU parallel computing for deep learning/machine learning/neural networks.
Just like the MacBook Pro and the Dell XPS 15, you can use the ThinkPads to test your code with as much data as you can fit into their hardware resources, and even play with variations of that same data, before ssh’ing into computer farms.
Why am I still mentioning Cloud Computing?
Well, to be honest (and I discuss this in the last section too), real modern data sets, although they may fit into a laptop’s or desktop’s RAM, will be processed several times faster (I mean hours as opposed to days) using AWS or any other computer farm service.
And if you decide to go into industry, that experience will help you stand out as a candidate (not just the experience itself, but the self-drive and the initiative to do everything).
Although for that you just need a basic machine (no need to invest in laptops like the Pro, the XPS 15 or high-end ThinkPads), I’m aware that most people are not really ready to switch from a cozy local environment on a laptop to a remote AWS machine for analytics, and would rather test as much data as possible on their little workhorses.
These are basic machines that have the sleek design and portability of the MacBook Air and Dell XPS 13 . These ones will serve you just as well if you plan to do most of your work on the Cloud rather than your laptop.
If you are not ready for that yet, then ThinkPads like this one are your best cheap choice if you plan to run heavy calculations on a Windows machine, provided there is no parallel processing that requires a dGPU. If you require that, then the Dell XPS 15 is a better “portable choice”.
Best Laptop for Large Data Analysis
Intel Core i7-10750H
NVIDIA GeForce RTX 2080 Super
512GB PCIe NVMe SSD
15” full HD 300Hz IPS
Lastly, if you want to get as much computing goodness as you can to process all of your data on it (no matter how big the data set is), your best bet is to go for a high-end gaming laptop. These are the ones that have all their specs nearly maxed out: CPU, GPU, RAM and storage. Workstation laptops are not really that much more useful or powerful unless their GPUs have way more CUDA cores, and they can get really, really expensive.
I personally like the MSI brand due to the specs-to-price ratio they offer and the cooling systems that keep their laptops up and running for several years without you having to downclock the CPU/GPU.
The GPU here is the most powerful (after its desktop version and a workstation GPU that sells for like 5000$), so you can speed up any GPU process like image analysis or parallel computing (deep learning, machine learning, etc.).
Unfortunately, the battery life and weight are pretty bad, and you also end up with a few specs that are useless for a data scientist but useful for gaming (G-Sync, 240Hz refresh rates). There’s not much you can do about it, since you’ll rarely find a high-end laptop without those. Still, try to find cheaper models by looking for low refresh rates and no G-Sync.
How To Buy the Best Laptops For Data Analysis
Like I mentioned before there is no best laptop for Data Analysis out there.
In fact, any laptop would be good for analysis purposes if you do all the computing in the cloud.
So this section will be mainly focused for those trying to do as much computing as possible on their new rigs and this in turn depends on the kind of software they use and the type of data analysis as well.
I’m going to start with the basics for those who are just getting started into the field, perhaps using Lynda.com, or teaching themselves the tools along the way. If you are not a beginner and plan to do all your data analysis back at home just skip over to the hardware section.
Doing Data Analysis
There are two ways to do Data Analysis: using the cloud or with your own rig.
A) The Cloud – Recommended for Learning Data Analysis
Using the Cloud means renting computing services from big companies like Amazon. You are basically leaving all the computing/processing to their huge clusters of computers.
If you opt for a good cloud environment with an AWS subscription, you’ll get access to on-demand EMR multi-machine clusters at hourly rates. You’ll also get access to their other data stores like ElasticSearch and Redshift and so on.
All you need at home is a basic laptop or desktop with 4-8GB RAM and just a decent internet connection (1 Mbps). Not only will this save you a ton of money but time as well.
Other specs to consider when going this route are a long battery life (so you can do this away from home too), a multi-core CPU (so you can smoothly multitask) and perhaps a backlit keyboard to work at night.
As for me during my past role as a data scientist, I worked a lot with hadoop clusters but the real machine learning and data munging was done on computer farms where I just needed a terminal to ssh into.
On my personal laptop I might download a small sample to test my code before going to the big machines and this is what I suggest you do.
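One way to pull that small test sample is reservoir sampling, which picks a uniform random sample from a file in a single pass without loading the whole thing into memory. This is a generic sketch, not the exact workflow I used on those computer farms; the function name and the seed parameter are just for illustration.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return up to k items chosen uniformly at random from any iterable.

    Only k items are held in memory at a time, so this works on files
    that are far larger than RAM (classic "Algorithm R").
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Demo on a generated sequence; with a real file you would pass the open
# file object instead (after reading the header line separately).
demo = reservoir_sample(range(1_000_000), 5, seed=42)
print(demo)
```

The sample comes back in no particular order, which is fine for testing code but worth remembering if row order matters to your analysis.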
B) Building a Rig at Home
Building a rig back home for “big data analysis” is quite challenging. Laptops are out of the question. You’ll need multiple machines with:
- Multi-core processors (8-core AMDs are cheaper)
- Minimum of 16GB of RAM per machine.
- Storage Drives in RAID configurations
On the other hand, if you are on a shoestring budget and would still like to build a cluster back at home, you can always go for a used server setup:
- Go through listings on Amazon, Ebay or any other e-commerce site.
- Make a post on social media and ask if anyone’s selling their old server
When I got started I set up a 32-core, 64GB Linux server for about 400 bucks.
Software & Specs
Just saying statistical analysis doesn’t really tell you what exactly you are going to need in a laptop.
So in this section I’m going to briefly go over the most used software in data analysis and talk about the specs you should be focusing on.
These will be highlighted as well.
If you are a student you’ll probably end up using a combination of the following software/languages:
- Rapid Miner
For that you’ll just need a laptop with a decent workspace (keyboard + display), since modern laptops have enough CPU and RAM for all these silly languages and software. Any laptop with a 2.5GHz+ dual-core CPU and 8GB of RAM should make working with all of that a breeze.
Besides, what won’t be required is any big-data crunching. Universities have loads of servers for that stuff.
What’s going to be a real pain is getting the ecosystem fully installed and working on your machine. Both R and Python have dozens of modules you can install for data science, and none of them are easy to install. The first time I had to install these, I spent a whole week trying to get all of them to work with each other.
There are guides everywhere, but it’s also a matter of luck: sometimes it may be easy, depending on your OS and on how exactly you install each of these.
I found them much, much easier to install on Linux systems than on MacBooks, although I did manage to install them on OS X.
If you can’t stand a Linux system, I would recommend a MacBook. Any would do fine, even the old models, since they still get software updates.
The software is pretty much the same, perhaps with the addition of RStudio, Rapid Miner, Spotfire and, most importantly, Hadoop.
The latter implies of course using data sets in the range of GB.
I’d say there are three types of data scientists depending on the problem they wish to solve: volume, velocity or variety.
If you are a volume or velocity type of Data Scientist, the best laptop rig you should get is a laptop that allows you to easily connect to the cloud environments described before.
If you’re frequently working on the third V, variety problems, you will benefit much more from an expensive laptop (relatively speaking).
And if you deal with machine-learning algorithms then, as you probably know, you’ll get better results with more and more data, which translates to algorithms that are both CPU and memory hungry. If you plan to do your data analysis on your laptop, then focus on CPU and memory.
If you use R, and especially the RevoScaleR package, you might go as far as needing more cores, even from your GPU. So pay close attention to the CPU/Memory/GPU sections.
Dealing with large data sets in R is also easier with more cores.
Getting more cores can also help, but only up to a point.
R itself can generally only use one core at a time internally.
In addition, for many data analysis problems the bottlenecks are disk I/O and the speed of RAM, so efficiently using more than 4 or 8 cores on commodity hardware can be difficult.
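Amdahl’s law gives a feel for why extra cores only help “up to a point”. The sketch below is an illustration under a simplifying assumption: that the disk- and memory-bound part of a job behaves like a fixed portion that cannot be spread across cores. The 75% parallel fraction is an arbitrary example value, not a measurement of R or of any particular workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of a job can use extra cores.

    parallel_fraction: share of the run time (0.0 to 1.0) that scales
    with the number of cores; the rest is treated as strictly serial.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assume 75% of the work parallelizes (an illustrative figure only).
for cores in (1, 2, 4, 8, 16, 32):
    print(f"{cores:>2} cores -> {amdahl_speedup(0.75, cores):.2f}x")
```

With those assumptions the speedup can never exceed 4x no matter how many cores you add, and most of the gain is already there by 4 to 8 cores.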
Data set sizes will range into the GB. Maybe a few others, but my lecturers haven’t returned my emails yet.
A common approach is to use a sample of the large dataset, as large a sample as can fit in memory. With Hadoop, you can now run many exploratory data analysis tasks on full datasets, without sampling.
Just write a map-reduce job, PIG or HIVE script, launch it directly on Hadoop over the full dataset, and get the results right back to your laptop.
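If you have never seen a map-reduce job, the idea fits in a few lines of plain Python. To be clear, this is a toy that runs on one machine and does not use the Hadoop, PIG or HIVE APIs; the function names are made up for this sketch. It only shows the three steps (map, shuffle, reduce) that a real cluster spreads across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (key, value) pair for every word in a record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["big data big insights", "data beats opinions"])
print(counts)
```

On a cluster the map and reduce steps run in parallel on different chunks of the data, which is what lets the same logic scale to full datasets.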
In many cases, machine-learning algorithms achieve better results when they have more data to learn from, particularly for techniques such as clustering, outlier detection and product recommenders.
Historically, large datasets were not available or too expensive to acquire and store, and so machine-learning practitioners had to find innovative ways to improve models with rather limited datasets.
With Hadoop as a platform that provides linearly scalable storage and processing power, you can now store ALL of the data in RAW format, and use the full dataset to build better, more accurate models.
Data analysis: using pandas to read CSV and Excel files, to clean, filter, partition, aggregate and summarise data, and to produce simple charts
Similarly, if your application requires joining large tables with billions of rows to create feature vectors for each data object, HIVE or PIG are very useful and efficient for this task.
Training a heavy neural network might be out of reach for any laptop, as might a way-too-big repeated measurements analysis (the variance/covariance matrix grows quadratically with the number of measurements).
Pay close attention to those sections.
Most of the algorithms are CPU intensive and memory hungry. Look for a current-generation processor; 4 cores is ideal when you have to take advantage of threading for big data sets. Remember, I am also talking about data munging work along with computation.
CPU generation is the first digit of the 4-digit model number of the CPU, e.g. the i7-8750H is an 8th gen whereas the i7-6700HQ is a 6th gen CPU (10th gen chips such as the i7-10750H use a 5-digit number whose first two digits give the generation). Also, there are two variants of i7 and i5 chips out there: the low-voltage dual-core CPUs (model number ends with a U, mostly used for ultrabooks) and the performance-oriented four- to six-core chips (model number ends with either HQ or H, mostly used in high-end gaming and performance-oriented computers).
If you are crunching large datasets, your main focus is obviously to get one with an H/HQ series CPU.
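The naming convention above is easy to turn into a small helper for scanning spec sheets. Treat this as a sketch with known gaps: the function names are made up, it only understands names shaped like i7-8750H (digits followed by an optional U, H or HQ style suffix), and it deliberately returns None for anything else, such as the i3 1005G1 style names, rather than guessing.

```python
import re

# Matches e.g. "i7-8750H", "i5-8265U", "i7-6700HQ", "i7-10750H".
_MODEL = re.compile(r"i[3579]-(\d{4,5})([A-Z]{0,2})$")


def parse_intel_model(name):
    """Return (generation, suffix) for a supported model name, else None."""
    match = _MODEL.search(name.strip())
    if match is None:
        return None
    digits, suffix = match.groups()
    # Everything before the last three digits is the generation:
    # "8750" -> 8, "10750" -> 10.
    return int(digits[:-3]), suffix


def cpu_class(suffix):
    """Classify a suffix using the U vs. H/HQ split described in the text."""
    if suffix == "U":
        return "low-voltage (ultrabook)"
    if suffix in ("H", "HQ"):
        return "performance"
    return "unknown"


for model in ("Core i7-8750H", "Core i5-8265U", "Core i7-6700HQ", "Core i7-10750H"):
    generation, suffix = parse_intel_model(model)
    print(model, "->", generation, cpu_class(suffix))
```

AMD Ryzen names and newer Intel naming schemes follow different rules, so check the manufacturer’s spec page for anything this helper rejects.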
Using the Cloud
You’re only going to need a decent processor for multitasking. Anything from the 8th generation will work wonders even if it’s a Core i3, e.g. Core i3-8100U, Core i3-8145U, Core i5-8250U, etc.
Even 7th generation CPUs, which are far cheaper, will do for testing code before uploading the work to the cloud.
Just avoid Core i7 processors altogether: they’re expensive, you don’t need them, and you’ll get less battery life out of your laptop. The exception, of course, is when a particular model has perks/hardware that no other model offers.
Using your Laptop
You need to get the fastest processor your budget can afford. You will be stuck with whatever processor you decide on because you can’t upgrade it. If it doesn’t have the power you need for your work, you’re out of luck!
If your laptop needs more memory or faster storage, you can easily do the upgrade and it’s not too expensive.
Here’s a list of the specs of the most popular & fastest processors out there:
Note that having a Core i7 not only gives you an edge in computing power but also in upgradeability, since most Core i7 laptops are bulky enough to make inserting new RAM or an SSD relatively simple.
Opting for a late-generation processor (6th through 10th) will also give you the possibility of PCIe NVMe SSD support, which can be up to 17 times faster than HDDs.
As a data scientist you also will run multiple applications in parallel and/or run data analysis apps which can support parallel processing.
So the number of cores is more essential than the clock speed.
Although two cores may run faster in day-to-day use, four cores pay for themselves when running algorithms.
Two cores will save you money, but four to six cores should be preferred.
BEWARE: You may have to dig into the technical specs of the processor to see how many cores it offers. It’s not easy to figure out just from the label.
Depending on your specialization you may need to work for real with the Hadoop stack, Sols or other tools that require an AWS subscription or other cloud SaaS providers.
For now you should get a reasonably fast processor (Core i5 or Core i7).
Core i5: only buy from the 8th generation onwards. These have 4 cores and easily go past 3GHz.
Core i7: can be bought from the 7th or 8th generation onwards. All of them have 4+ cores (the latter has 6) and go up to as much as 4GHz.
If you use the MapR Sandbox, then a four-core CPU is really a priority. Beware that most Core i3 and Core i5 processors, and even 7th generation Core i7s, only give you 2 cores.
RAM is probably the most important component and, luckily, the easiest/cheapest to get on laptops. So this section is going to be quite lengthy to try and convince you to get as much RAM goodness as you can.
You can say that the computer turns the data set over on its head when doing data analysis, or simply that the data set is loaded from the storage device into RAM for any kind of computation (the storage device can also be used for this, but it is several times slower).
It’s hard to say how much RAM you may need as this depends on the size of the data set.
A) Using the Cloud
Get a minimum 8GB of RAM (so you can load/create a reasonable amount of test data then use the Cloud) plus 8GB also allows for smooth multitasking.
B) Using your Laptop
We need to figure out how data set size relates to RAM.
What’s a small data set?
A data set of 100,000 to 200,000 records with about 200 variables each will be around 300 MB. Assuming you are not doing anything memory intensive (like visualizing this data), 4GB may be enough in this case (as you can see, there’s plenty left).
When is 8GB a minimum?
If you are working with large datasets (30 times bigger than the above), 8GB might just barely make it.
Why? Most software, like R, will usually load everything into memory.
In general, if working in R or Python, 8–16 GB should be enough.
Note that you can always reduce the need for so much RAM if your data analysis/scripting/programming skills are good enough to leverage more cores/threads in your code. There are a few tutorials on how to do this, for example Microsoft has written one here. It’s good practice to learn these tricks anyway.
16GB RAM – The best
However there’s a general good rule of thumb.
A data scientist can do amazing things with about twice the RAM as their largest chunk of data.
Not the whole data set, just some complete chunk.
My experience leads me to believe that 75% would be happy at 8GB, and 85% at 16GB and 95% at 32GB.
Getting more memory will not only let you finish your data analysis but also speed things up several times over.
As an example, algorithms on large data sets that take 4 hours to run with 8GB can take 20 minutes with 16GB. Talk about savings!
16GB RAM for small to medium data sets?
Even if you only use your laptop for small data analysis (e.g. less than a GB), the more RAM you have, the less you have to think before using a new local variable to store some permutation of your data.
Another reason is that having memory to have multiple versions of the same thing around during experimentation is really useful.
How do I find out exactly how much I will need?
If you’ve landed on this post, my guess is that most of you don’t know what a large data set looks like and if the memory you currently have may be enough for that.
Press CTRL+ALT+DEL and look at the “memory” and “virtual memory” columns to get a sense of the memory footprint your computer uses when you open up large datasets. This will give you some idea.
You should write that number down, but aim for 2x the memory footprint plus OS overhead in case the application tries to copy the dataset.
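That rule of thumb can be written down as a tiny calculator. This is a sketch under stated assumptions: the function name is made up, the 2GB default for OS overhead is the rough Windows figure mentioned earlier in this post, and the list of RAM sizes is simply the common configurations laptops are sold with.

```python
COMMON_RAM_SIZES_GB = (4, 8, 16, 32, 64, 128)


def recommended_ram_gb(footprint_gb, os_overhead_gb=2.0):
    """Smallest common RAM size covering 2x the footprint plus OS overhead.

    Returns None when even the largest size in the list is not enough,
    which is a hint that the job belongs on a server or in the cloud.
    """
    needed = 2 * footprint_gb + os_overhead_gb
    for size in COMMON_RAM_SIZES_GB:
        if size >= needed:
            return size
    return None


# Footprints you might read off the memory column, in GB.
for footprint in (0.3, 3.0, 6.0, 14.0):
    print(f"{footprint} GB footprint -> {recommended_ram_gb(footprint)} GB RAM")
```

A 3GB footprint just fits in 8GB by this rule (2 × 3 + 2 = 8), while anything slightly bigger pushes you to 16GB.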
Data Preparation vs In-Memory Analytics
Another aspect of the memory issue is the data preparation step. Today data scientists need two sets of skills: preparing “big data” (usually on-disk processing using Unix grep, awk, Python, Apache Spark in standalone mode, etc.) and in-memory analytics (R, Python scipy).
However, if you have a large amount of memory you may not need the first skill because you can prepare data in R or Python directly.
This is even more important for text analytics where the amount of input data is naturally huge.
So, data processing becomes simplified with the large amount of memory in your machine.
Why is my system running so slow under large data sets?
This is because your operating system starts to “thrash” when it gets low on memory, removing some things from memory to let others continue to run. This can slow your system to a crawl.
C) Machine Learning
Machine learning is the most dependent on RAM size: more memory is always better for machine learning.
Note that real life ML models generally involve cluster compute time, with in-memory datasets, spread over tens or hundreds of machines, each with 32, 64 or 128 GB of RAM.
For Machine learning in your own rig I suggest you get 16GB and possibly check if your laptop has another slot for a future 32GB upgrade.
But I’ve heard RAM isn’t that important, my laptop can still handle larger sizes than what my RAM allows
Let’s say you have a dataset that is about 6GB in size. (For the sake of convenience, I’ll assume that you only need memory allocation for the dataset to analyze.)
You can still run it in a laptop with 4GB RAM, if you divide the dataset into reasonable sized partitions and process separately.
Then, you can combine the results later to gain the full view.
On the other hand, you can run it in a laptop with 8GB RAM, which will have enough RAM space to handle the data as a whole and process it all in one go. The latter will be faster, but the former is still doable but slow.
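To make the "partition, process, combine" idea concrete, here is a rough sketch using only Python’s standard library. It computes the mean of one column while holding only a few rows in memory at a time. The function name, the column name "price" and the tiny sample data are made up for the example; with a real file you would pass an open file object and a much larger chunk size.

```python
import csv
import io
from itertools import islice

def chunked_mean(lines, column, chunk_rows=2):
    """Mean of `column`, reading at most `chunk_rows` rows at a time."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    while True:
        # Only this partition of the data is held in memory.
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            break
        # Partial result for this partition.
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    # Combine the partial results to get the full view.
    return total / count if count else float("nan")

sample = io.StringIO("price\n10\n20\n30\n40\n50\n")
print(chunked_mean(sample, "price"))  # 30.0
```

This works neatly for sums, counts and means because partial results combine trivially. Things like medians or model fitting need more care, which is part of why doing it all in RAM is so much more convenient.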
First of all, data is always stored on your storage drive and then transferred into RAM for computing. This transfer is not necessarily faster with a Solid State Drive.
A) Using the Cloud
If you are using the Cloud, you don’t need to learn about hard drive type/speed. Just get the largest capacity you can afford.
Most laptops today offer 1 TB of storage. This should be enough.
B) Using your Rig
If your data takes up much more space than your RAM can fit (even if you have 32GB), then your analysis will be I/O bound. In this particular scenario, you will benefit quite a lot from a Solid State Drive.
On the other hand, if your data fits in memory, then most data access is sequential and you don’t need to worry about it. Just make sure to have 1TB of space; this is enough for the average data scientist.
Should I still get an SSD?
You should. You’d benefit from a Solid State Drive with decent space (~256GB) just so your operating system, software and everything else launch and open in a flash. It will not necessarily speed up your analysis, but your computer will fly with one on board.
As for my laptop: I used to have 2 disks, 1 SSD for the operating system and 1 old slow HDD for the data.
You want the first one to be at least 64 GB and the second a 1TB HDD.
The first one for OS/software and the second to carry data with you.
Keep an encrypted partition for security reasons.
Note that you can get two SSDs on a laptop. I have since upgraded to this setup. You can see how in this tutorial.
An SSD has no moving parts and is therefore much quieter. SSDs are usually more expensive and often store less.
Size & Resolution
It is a no-brainer that you need a large display. Staring at large data sets on a laptop is not easy. Besides, you will need to:
- SSH into more powerful machines
- Use graphics and visualizations
So anything below 14” is really asking for trouble. Ideally you’d want 15” + full HD and above. Note that resolution also plays a huge role in being able to see more data at once. Do not settle for HD or HD+ resolutions, only look at full HD models.
All laptops have connectors for an external display. A second screen may be useful if you have to develop software (or scripts) as well as write reports.
Today, with parallel processing finding its way into nearly every application, it is not uncommon for data analytics apps to make use of GPUs. Neural networks, for example, benefit from dGPUs most of the time.
AMD vs Intel vs NVIDIA
By dedicated GPUs I specifically mean NVIDIA’s line of GPUs that have “CUDA Core” technology.
Intel HD chips aren’t used for parallel processing, and although AMD makes great products, they don’t have as much traction in the parallel processing world as NVIDIA.
Which Software/Apps/Type of Data Analysis make use of it?
It’s great and all that even AWS provides a graphics upgrade for parallel processing with CUDA capability, but that doesn’t mean that non-parallel applications like ARC can use it. You need to make sure your application is capable of GPU parallel processing.
You’d be surprised to find out how many people think that plugging a GPU into their desktop/AWS package will somehow bring parallel computation capability to some old non-parallel legacy software. It won’t.
Machine and Deep Learning
There are several books and articles written about it, and you could also check the application’s website. But most of the deep learning and machine learning libraries (TensorFlow, Torch) now use NVIDIA’s CUDA. In fact, in the case of deep learning, most algorithms are optimized to run on the GPU instead of the CPU. An algorithm that takes a week to run on a CPU can take about a day on a GPU.
It goes without saying that learning deep learning with a CPU is going to be a struggle these days.
Image Analysis is also making use of CUDA cores.
Which GPU to use?
Some may say that laptop GPUs are useless, but whatever you read saying so was probably written several years ago. Today’s laptop GPUs are nearly on par with the performance of their desktop counterparts. This is especially true for the 10 series (the latest and current generation) released by NVIDIA.
If your app does make use of it, you’d be surprised to know that even the low-end graphics cards found on laptops can easily give you a 50-100x speed improvement for data analysis.
- At least a GTX 1060. The 9 series cards are fine, but they are outdated if you want the best performance. Just take a look at this table:
- 10 and 20 series GPUs have way more CUDA cores than their 9 series counterparts.
Cloud Services (For Newbies)
When a data scientist is working with a larger set of data that requires more computational resources than their desktop or laptop can provide, they use a more powerful computer called a server.
A server is generally a very powerful computer which is dedicated to a specific task (for example running a file system, running a database, doing data analysis, running a web application or even all of the above!).
For example, if you are dealing with a set of data which is 100 GB, one option for a computer without enough RAM would be to load the data into a database and run the analytics in the database.
A faster (and arguably better) option would be to use a server with enough RAM (more than 100 GB) and do all of the analytics in RAM, just like with smaller sets of data.
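As a rough idea of what "analytics in the database" looks like, here is a small sketch with SQLite, which ships with Python. The table, its columns and the three rows are invented for the example, and an in-memory database is used only so the snippet is self-contained. For data bigger than RAM you would point it at a file on disk instead.

```python
import sqlite3

def build_demo_db():
    """Create a tiny example database (hypothetical table and rows)."""
    conn = sqlite3.connect(":memory:")  # use a file path for a real on-disk database
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10.0), ("south", 5.0), ("north", 2.5)],
    )
    conn.commit()
    return conn

def total_by_region(conn):
    """Let the database do the aggregation; only the summary reaches Python."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())

print(total_by_region(build_demo_db()))  # {'north': 12.5, 'south': 5.0}
```

The point is that the heavy lifting happens inside the database engine, so your script only ever holds the small result and not the full 100 GB.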
The cost nearly always works out better than buying a “better laptop”, and that will hold true for a very long time with Linode, AWS, Microsoft, and Digital Ocean selling incredibly cheap compute power. As of today, I have subscriptions to two of these, Digital Ocean and AWS, and what I pay is almost nothing compared to what I’ve saved by not buying desktops with 128GB RAM.
AWS (Amazon Web Services)
AWS is the biggest dog in the cloud service market. Sooner or later you’re going to end up using AWS or another cloud service. It’s not just that real data is too big to fit on a local machine now; it’s also a crucial skill in the marketplace right now.
Moreover to get the real flavor of being a data scientist you actually have to work with cloud systems sooner or later.
If you plan on doing more intense stuff (Neural Networks, Support Vector Machines on a lot of data), even the most powerful desktop GPU will not cut it and you will be better off renting an AWS instance.
Note that AWS has a free tier for you to get started with, so you’ve got nothing to lose at this point.
Using a VNC (Virtual Network Computing)
Oh I tried it. I already had built my ideal (i.e. powerful) data analytics computer about a year prior, but it was a desktop.
I figured I could actually just buy a really cheap laptop, keep my desktop running all the time, and then use RDP, TeamViewer, or a VNC programme to connect to it whenever I needed to do some data analysis.
So I bought a cheap $350 Windows laptop and started trying to set up a VNC.
I got it working. But not only did it mean that I had to always leave my desktop running: it was fairly laggy too.
Amazon AWS EC2
That’s how I got fed up with it and discovered Amazon AWS EC2.
This service actually does something similar. It lets you create virtual computers with any operating system you want and customize how you access them.
I set up one of these (Linux), then taught myself how to use Linux.
The most useful thing about it is that I installed a web-based IDE for R on it (RStudio), which allowed me to go to a website hosted by my EC2 server and use R as if I was sitting at that computer.
Now, whenever I want to do some work, I can do it from any computer in the world with an internet connection, simply by visiting a website, and all the processing is done on the Amazon server.
Cost: You have to pay for the server, but they are inexpensive, and you pay different amounts based on the (virtual) processor, RAM, GPU etc of the server.
Also, there is a 1-year free trial which lets you use the least powerful virtual server at no cost.
I understand that R may not be the only language you wish to use, but given that you can install anything you want on your server, it seems like a viable option for pretty much any data scientist.
Pros:
- Can access the server from any device with an internet connection
- Files are always accessible. You don’t even need to download them (like you would with Dropbox), just view them on the server
- Costs much less than a powerful laptop
- Server can be programmatically designed to scale depending on analysis needs using an API
Cons:
- Laptop screen is quite small, but I now find I access the server mostly from other desktops or a 17” laptop
- Requires an internet connection to use
- Can take some time to learn how to use Elastic Compute Cloud (EC2)
OS: Mac vs Windows vs Linux
Although it may look like Mac and Linux are the way to go, these days it comes down to preference. Most packages one will need for data exploration and analysis work on all platforms. Octave and R are great examples and are very widely used.
I have to admit that working with Python on a Mac is much easier than on Windows and even Linux due to better package management. With Python being the most widely used language among data scientists, that may imply that you have to go Mac.
In a way that’s true: if you go Linux or Mac, you’ll also have access to the latest libraries, while on Windows you’ll often have to wait for libraries to be compiled as binaries.
A Windows machine will require far more tweaks to successfully run code than the Apple machines. Most of the (sporadic and poorly written) documentation available for cutting edge data science tools assumes you are working on an Apple machine.
Or at least Unix.
However, the advantage of going Windows will never change: you can get the same hardware for half the price in the Windows world, along with a choice of style, format and features.
There’s also the issue of upgradeability: If you need a bigger hard drive, you can fit one yourself. More memory – no problem.
You can repair PCs, while Macs are a sealed box – you are stuck with the hardware you thought you needed (or could afford) at the time you bought it.
There’s also the issue of NVIDIA GPUs for parallel processing: these are still not available in Macs.
Lastly, Excel, which is still in my mind one of the most effective tools for data analysis, works best on Windows and works okayish in most cases on OS X. If you’re going deep into pivot tables and more complex models that involve macros, use a PC or run a virtualized copy of Windows on your Mac; Office for Mac can still be somewhat frustrating.
Either is fine when…
Using MATLAB, S-Plus and SPSS.
In terms of the databases, you’re also free to choose any platform. PostgreSQL and MySQL will work on any platform. If you’re dealing with a Hadoop cluster, you’ll be connecting remotely, so any client operating system will work.