Machine Learning's Next Big Thing
One thing I've been wrestling with recently is where applied machine learning is going. Part of this is pure curiosity, but there's also the question of whether there are opportunities to take advantage of, and a somewhat defensive angle as well: is my job at risk?
When I started in consulting, you could make a nice living just going to a client, pulling some data, running some analysis, and giving them a deck or a piece of code. Those days are pretty much over. Now, most firms will have at least one person who can manage data, do solid analysis in Excel, and be on their way in R, Python, or Tableau. The work we do now relates more to managing very large data systems (usually Hadoop or some other distributed computing environment), managing portfolios of models, and all of the people issues that come along with these. There are occasional point solutions, but they need to be built quickly and in a standard way.
Unfortunately, one area I'm behind on is looking at new products in the various parts of a knowledge management system. For instance, one issue we wrestle with on a regular basis is code standardization. This is nothing technically complex: how do you get multiple, geographically dispersed people writing similar code onto the same page?
There are a couple of ideas I'm working through, such as creating R packages of reusable functions or seeing if a GUI-based tool like Alteryx can help, but I need a working platform in terms of data, models, and associated viz / monitoring tools to have a baseline. With that in mind, I spent the weekend putting one together.
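To make the reusable-function idea concrete, here's a minimal sketch, in Python rather than R for illustration. The function name, docstring conventions, and input validation are my assumptions about what a standardized shared utility might look like, not anything I've actually built yet:

```python
# Hypothetical shared utility: one agreed-upon function everyone calls,
# instead of each analyst writing their own version.

def summarize_closes(rows):
    """Summarize closing prices from (symbol, close) pairs.

    Returns a (min, max, mean) tuple. Raises ValueError on empty input
    so failures surface loudly rather than as silent NaNs.
    """
    if not rows:
        raise ValueError("summarize_closes: no rows supplied")
    closes = [close for _symbol, close in rows]
    return min(closes), max(closes), sum(closes) / len(closes)

if __name__ == "__main__":
    sample = [("AAA", 10.0), ("BBB", 20.0), ("CCC", 30.0)]
    print(summarize_closes(sample))  # → (10.0, 30.0, 20.0)
```

The point isn't the arithmetic; it's that a docstring, explicit error behavior, and a single shared definition are the kinds of conventions a package (R or otherwise) can enforce across a dispersed team.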
I started out by getting a new laptop: a fairly cheap, 11.6-inch HP machine with 4 GB of RAM, a Pentium processor, and a 500 GB hard drive (link). This might seem small, but I'm not looking to go crazy here, just enough to build a reasonable working environment and something light enough to travel with. Also, considering my last non-work laptop was a Lenovo 100S with 2 GB of RAM and 32 GB of storage, this is a huge step up.
Next, I installed and transferred a bunch of software, including Office, R, Git, SQL Server Express, and Sublime Text. This all took several hours but was not as bad as I anticipated. The next step was to find a data set that changes with a pretty high degree of regularity. The first one that came to mind was securities prices. In the past I've used EODData.com, as it's fairly cheap and I already have some existing ingestion scripts. I went and found some old data (you can find a sample here: http://www.eoddata.com/NASDAQ_20101105.zip) and used it to build some tables in SQL Server. The scripts I used will be in a GitHub folder that I'll share later.
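The ingestion step looks roughly like the sketch below. I'm using Python's built-in sqlite3 as a stand-in for SQL Server Express so the example is self-contained, and the column layout (Symbol, Date, Open, High, Low, Close, Volume) plus the sample rows are my assumptions about the EODData file format; check an actual download before relying on either:

```python
import csv
import io
import sqlite3

# Illustrative sample in the assumed EODData CSV layout (values made up).
SAMPLE = """\
Symbol,Date,Open,High,Low,Close,Volume
AAPL,20101105,317.99,319.89,316.80,317.13,16085500
MSFT,20101105,26.93,27.00,26.63,26.85,59161600
"""

def load_prices(conn, text):
    """Parse EOD-style CSV text and load it into a daily_prices table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_prices ("
        "symbol TEXT, trade_date TEXT, open REAL, high REAL, "
        "low REAL, close REAL, volume INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(text))
    rows = [
        (r["Symbol"], r["Date"], float(r["Open"]), float(r["High"]),
         float(r["Low"]), float(r["Close"]), int(r["Volume"]))
        for r in reader
    ]
    conn.executemany(
        "INSERT INTO daily_prices VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load_prices(conn, SAMPLE))  # → 2 (rows loaded)
```

In practice the text would come from the unzipped daily file and the connection would point at the SQL Server instance, but the shape of the job (parse, type-convert, bulk insert) is the same.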
One point of note here: in my experience, it's very important to build out the entire system (even if it's pretty poor quality) just to figure out whether everything connects and to discover the potential pain points. I'd use the analogy of building a car, where you can put a rough prototype together to see how all the necessary systems fit, versus building a perfect steering wheel and then realizing it doesn't fit the steering column. Andrew Ng has a lecture somewhere about a colleague who was building a facial recognition system and spent months on the eye-recognition portion, only to discover it was fairly worthless in the overall scheme of things.
I’ll end here and do another post once I’ve made some progress.