1) Learn about matrix factorizations
- Take a Computational Linear Algebra course (sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis, or Matrix Analysis; it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard “machine learning” curriculum. With terabytes of data, traditional tools such as Matlab are no longer suitable for the job; you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout are trying to fill this void, but you need to understand how the numeric algorithms and LAPACK/BLAS routines work in order to use them properly, adjust for special cases, build your own, and scale them up to terabytes of data on a cluster of commodity machines. Numerics courses are usually built upon undergraduate algebra and calculus, so you should be fine with the prerequisites.
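To get a feel for what eig() actually does under the hood, here is a minimal sketch of power iteration in Python, the building block behind many large-scale eigenvalue methods; the matrix, starting vector, and iteration count are toy values, and in a real distributed setting the matrix-vector product would be spread across the cluster:

```python
import math

def power_iteration(A, iters=200):
    """Estimate the dominant eigenvalue/eigenvector of a square matrix A."""
    n = len(A)
    v = [1.0] * n  # arbitrary starting vector
    for _ in range(iters):
        # The only operation touching A is a matrix-vector product,
        # which is exactly what gets distributed at scale.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue estimate
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    lam = sum(v[i] * Av[i] for i in range(n))
    return lam, v

A = [[2.0, 1.0], [1.0, 2.0]]  # symmetric toy matrix, eigenvalues 3 and 1
lam, v = power_iteration(A)
print(round(lam, 6))  # dominant eigenvalue, approximately 3.0
```

The point is that iterative methods like this need only repeated matrix-vector products, which is why they survive the jump from Matlab to a cluster.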
2) Learn about distributed computing
- It is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms if you want to work with big data.
- The Crays and Connection Machines of the past can now be replaced with farms of cheap cloud instances: computing costs dropped to less than $1.80/GFLOPS in 2011, versus $15M in 1984.
- If you want to squeeze the most out of your (rented) hardware, it is also becoming increasingly important to be able to utilize the full power of multicore.
- Note: this topic is not part of a standard Machine Learning track, but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog.
- After studying the basics of networking and distributed systems, I’d focus on distributed databases, which will soon become ubiquitous as the data deluge hits the limits of vertical scaling.
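As a first taste of the distributed mindset, here is a toy word count written in the MapReduce style, the canonical “hello world” of this field. Both phases run locally here; on a real cluster (Hadoop, Mahout’s substrate) the mappers and reducers would run on many machines:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit (word, 1) pairs, as a mapper task would.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Group by key and sum the counts, as reducer tasks would.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big cluster", "big deal"]  # toy corpus
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(pairs)
print(counts["big"])  # 3
```

The design constraint to internalize: mappers see one record at a time and share nothing, which is what lets the same code scale from two strings to terabytes.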
3) Learn about statistical analysis
- Start learning statistics by coding with R, and experiment with real-world data.
- Some great materials on computational statistics have been compiled; check out the lecture slides from university courses on the subject.
- I’ve found that learning statistics in a particular applied domain is much more enjoyable than taking Stats 101. My personal recommendation is the course at Columbia (also available online).
- You can also choose a field where the use of quantitative statistics and causality principles is inevitable, say molecular biology, or a fun sub-field such as cancer research, or an even narrower domain, e.g. genetic analysis of tumor angiogenesis, and try answering important questions in that particular field, learning what you need in the process.
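Before you fire up R, even a few lines of code convey the resampling spirit of computational statistics. Here is a minimal bootstrap confidence interval for a mean, sketched in Python with made-up data (the resample count and seed are arbitrary choices):

```python
import random

def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(data, k=len(data))) / len(data)  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

data = [2.1, 2.5, 1.9, 2.8, 2.4, 2.2, 2.6, 2.0]  # toy sample
lo, hi = bootstrap_ci(data)
print(lo <= sum(data) / len(data) <= hi)  # the interval brackets the sample mean
```

The appeal of methods like this is that they trade closed-form theory for computation, which is exactly the trade a data scientist with a cluster can afford to make.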
4) Learn about optimization
- This subject is essentially a prerequisite to understanding many machine learning and data mining algorithms, besides being important in its own right.
- Start with video lectures on convex optimization.
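To see why optimization underpins so much of machine learning, here is gradient descent on a one-dimensional quadratic, the simplest possible instance of the pattern behind most model training (the step size and iteration count are arbitrary toy values):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function given its gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step downhill along the negative gradient
    return x

# f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # approximately 3.0
```

Swap the toy gradient for the gradient of a loss function over data and you have, in miniature, how most large-scale learning algorithms are trained.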
5) Learn about machine learning
- Before you start thinking about algorithms, look carefully at the data and select features that help you filter signal from noise.
- Statistics vs. machine learning, fight!
- You can structure your study program according to the online course catalogs and curricula of MIT, Stanford, or other top schools. Experiment with data a lot, hack some code, ask questions, talk to good people, set up a web crawler in your garage.
- You can also join a data-focused startup and learn by doing.
- The alternative (and rather expensive) option is to enroll in a CS program/Machine Learning track if you prefer studying in a formal setting.
- Try to avoid overspecialization. The breadth-first approach often works best when learning a new field and dealing with hard problems.
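The advice above about separating signal from noise before choosing algorithms can be made concrete. Here is a minimal sketch that ranks candidate features by their absolute Pearson correlation with the label; the data and feature names are entirely hypothetical:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

labels = [0, 0, 1, 1, 1, 0]  # toy binary labels
features = {
    "signal": [0.1, 0.2, 0.9, 0.8, 0.95, 0.15],  # tracks the label
    "noise":  [0.7, 0.2, 0.6, 0.1, 0.90, 0.40],  # does not
}
# Rank features by how strongly they co-vary with the label
ranked = sorted(features, key=lambda f: -abs(pearson(features[f], labels)))
print(ranked[0])  # "signal"
```

Correlation screening is only the crudest form of feature selection, but running something this simple before reaching for a fancy model is exactly the habit the talk-the-data-first advice is pointing at.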
6) Learn about information retrieval
- Machine learning is not as cool as it sounds.
7) Learn about signal detection and estimation
- This is a classic topic and “data science” par excellence, in my opinion. Some of these methods were used to guide the Apollo mission or detect enemy submarines, and they are still in active use in many fields. This is often part of the EE curriculum.
- Good references are Robert F. Stengel’s lecture slides on optimal control and estimation, and Alan V. Oppenheim’s signal processing courses. A good topic to focus on first is recursive state estimation, widely used for forecasting.
- Talking about data, you probably want to know something about information: its transmission, compression, and filtering signal from noise. The methods developed by communication engineers in the 1960s, some of which now run in about a billion cellphones, are applicable to a surprising variety of data analysis tasks.
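As a taste of estimation in this tradition, here is a one-dimensional Kalman-style filter tracking a constant hidden value through noisy measurements. The readings and variances below are made-up toy numbers, and real filters handle multivariate state and dynamics:

```python
def kalman_1d(measurements, meas_var=1.0, process_var=0.0001):
    """Scalar Kalman filter for a (nearly) constant hidden value."""
    x, p = 0.0, 1000.0           # initial estimate and large initial uncertainty
    for z in measurements:
        p += process_var          # predict: uncertainty grows slightly
        k = p / (p + meas_var)    # Kalman gain: how much to trust the measurement
        x += k * (z - x)          # update: move estimate toward the measurement
        p *= (1 - k)              # uncertainty shrinks after each update
    return x

# Noisy readings of a true value of 10
readings = [9.8, 10.3, 9.9, 10.1, 10.2, 9.7, 10.0]
estimate = kalman_1d(readings)
print(round(estimate, 1))  # approximately 10.0
```

The same predict/update loop, generalized to vectors and matrices, is what guided spacecraft and still drives many forecasting and tracking systems today.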
8) Master algorithms and data structures
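A quick illustration of why this mastery pays off at data-mining scale: finding the top-k items in a stream with a bounded min-heap costs O(n log k), versus O(n log n) for a full sort, and the gap matters when n is billions. A minimal sketch with toy numbers:

```python
import heapq

def top_k(stream, k):
    """Return the k largest items of an iterable, largest first."""
    heap = []  # min-heap holding the k largest items seen so far
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # evict the smallest of the current top-k
    return sorted(heap, reverse=True)

print(top_k([5, 1, 9, 3, 7, 8, 2], 3))  # [9, 8, 7]
```

Note the memory profile: the stream is never materialized, only k items are held at once, which is what makes this pattern viable on data that does not fit on one machine.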
If you do decide to go for a Master’s degree:
10) Study Engineering
I’d go for CS with a focus on either IR or Machine Learning, or a combination of both, and take some systems courses along the way. As a “data scientist” you will have to write a ton of code and will probably develop distributed algorithms/systems to process massive amounts of data. An MS in Statistics will teach you how to do modeling, regression analysis, etc., not how to build systems, and I think the latter is more urgently needed these days as the old tools become obsolete under the avalanche of data. There is a shortage of engineers who can build a data mining system from the ground up. You can pick up statistics from books and experiments with R (see item 3 above), or take some statistics classes as part of your CS studies.