Mind your mining

I have been drifting lately from one project idea to the next without completing any of them for the next app that I would like to eventually publish at maisonlabs.com. Since this process is taking longer than expected, I decided to cover in this post the outcome of a small exercise I ran over the long Easter weekend. The topic is remote browser driving as applied to data mining. The tools I selected for the experiment were Selenium Grid and Python. In my case, remote browser driving happened in a host-guest environment: a virtual machine set up on my home laptop with the help of Oracle VirtualBox.

I was introduced to VirtualBox by my Russian developer friend Eugene. After spending a few minutes with the documentation and noticing its Linux support, I had to give it a try as my VM platform. For my Ubuntu 14.04 LTS "Trusty" host I downloaded the 64-bit version of VirtualBox from the official repository. Installing the package along with its dependencies can be done simply with the "gdebi" command. Note, however, that in my case I had to disable the "secure boot" feature in the PC BIOS, otherwise the kernel modules for VirtualBox would not build. Researching a bit online, I came across a solution that keeps secure boot enabled while still getting the modules to build. Despite its sophistication I did not use it and was content with secure boot disabled, as all I was trying to achieve was an MVP for this post.

With VirtualBox operational, the next step was to pick an OS to install on the guest virtual machine. The choice was a no-brainer for a cheapo like me. Linux, of course! For a bit of flavour I settled on Mint running the Cinnamon desktop. Installing Linux Mint on the VM did not require any special tricks, just the steps that can be found on the official site and many others. Next came the Python bindings for Selenium, which are easily installed with the "pip" command. If you want to run your Python code in a neat, interactive way, I suggest installing the Jupyter notebook as well.

At this point only two more steps were left before I could drive browsers remotely in my particular environment. The first was downloading the Selenium Java server (make sure the Java version on both host and guest is compatible with the one used to build the server .jar file). The second was configuring "port forwarding" on the VM's network interface (NAT by default) for the port where the Selenium server listens for connections (4444 by default). After that, all that was left to do was to launch the Selenium Grid hub and node on the host and guest and write some code to drive browsers using the "webdriver.Remote" class from Selenium.

And voila! This concludes my brief post for today. For the grand finale, see the short recording posted below of what my browser driving exercise looked like (Python code running in a Jupyter notebook on the host, and the driven Firefox browser running on the VM guest).
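For anyone who wants to try the last step, here is a minimal sketch of the kind of code involved. It is not the exact notebook from the recording. It assumes the Selenium 4 Python bindings (older releases took a "desired_capabilities" argument instead of "options"), a grid reachable from the notebook at 127.0.0.1:4444, and the legacy "/wd/hub" path; the helper names hub_url and fetch_title are made up for illustration, so adjust all of these to your own setup.

```python
def hub_url(host="127.0.0.1", port=4444, path="/wd/hub"):
    """Build the address of the Selenium server as seen from the notebook.

    The defaults are assumptions: port 4444 is the server default mentioned
    above, and "/wd/hub" is the legacy endpoint path.
    """
    return "http://{}:{}{}".format(host, port, path)


def fetch_title(page_url, hub=None):
    """Open page_url in a remote Firefox session and return the page title."""
    # Imported here so the URL helper can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    driver = webdriver.Remote(
        command_executor=hub or hub_url(),
        options=options,
    )
    try:
        driver.get(page_url)
        return driver.title
    finally:
        # Always release the remote browser, even if the page load fails.
        driver.quit()
```

With the hub and node running, calling fetch_title("https://www.python.org") from a notebook cell should open Firefox on the other machine, load the page, and hand back its title.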


PS: If you are wondering what I had in mind beyond amusing myself with this exercise over the long weekend, here is a clue: Distil Networks is a rising star of the SF tech scene providing bot mitigation services for web apps. Many sites with very interesting "free" business data sources, such as Owler, are now Distil customers. For those looking to mine online business data for non-commercial purposes, here is a new motto as it applies to you and the data: "Until Distil do us part..." You up for a challenge?
