Friday, May 3, 2013
As shown in the Notebook, Google dominates the search engine market in most countries, except China, where government regulation has largely pushed Google out of the market. Russia also tells a slightly different story, as Google faces fairly strong competition from the local search engine Yandex.
Friday, March 15, 2013
I recently came across an article demonstrating how to count words in a txt file using GPUs with a MapReduce algorithm. Having access to a monster rig at work with 4 NVIDIA Tesla C2075 GPUs, I decided to give it a try. Their code didn't work out of the box, but I made some changes to it and managed to get it working while also speeding it up a bit.
Here is my version of their script:
from pycuda import gpuarray
from pycuda.reduction import ReductionKernel
import pycuda.autoinit
import numpy as np
import time

def createCudawckernal():
    # 32 is the ascii code for a space; a word starts wherever a
    # space is followed by a non-space character, so summing the
    # map expression over adjacent character pairs counts words
    mapper = "(a[i] == 32)*(b[i] != 32)"
    reducer = "a+b"
    cudafunctionarguments = "char* a, char* b"
    wckernal = ReductionKernel(np.dtype(np.float32), neutral="0",
                               reduce_expr=reducer, map_expr=mapper,
                               arguments=cudafunctionarguments)
    return wckernal

def createBigDataset(filename):
    print "Reading data"
    dataset = np.fromfile(filename, dtype=np.int8)
    originaldata = dataset.copy()
    # Append 100 copies of the data to get a dataset well over 1MB
    for k in xrange(100):
        dataset = np.append(dataset, originaldata)
    print "Dataset size = ", len(dataset)
    return np.array(dataset, dtype=np.uint8)

def wordCount(wckernal, bignumpyarray):
    print "Uploading array to gpu"
    gpudataset = gpuarray.to_gpu(bignumpyarray)
    datasetsize = len(bignumpyarray)
    start = time.time()
    # Pass the array against itself shifted by one position so the
    # kernel compares each character with its successor
    wordcount = wckernal(gpudataset[:-1], gpudataset[1:]).get()
    stop = time.time()
    seconds = (stop - start)
    estimatepersecond = (datasetsize / seconds) / (1024 * 1024 * 1024)
    print "Word count took ", seconds * 1000, " milliseconds"
    print "Estimated throughput ", estimatepersecond, " Gigabytes/s"
    return wordcount

if __name__ == "__main__":
    print 'Downloading the .txt file'
    from urllib import urlretrieve
    txtfileurl = 'https://s3.amazonaws.com/econpy/shakespeare.txt'
    urlretrieve(txtfileurl, 'shakespeare.txt')
    print 'Go Baby Go!'
    bignumpyarray = createBigDataset("shakespeare.txt")
    wckernal = createCudawckernal()
    wordcount = wordCount(wckernal, bignumpyarray)
    print 'Word Count: %s' % wordcount
Assuming you have GPUs on your machine and PyCUDA installed, run the code by saving the script above as gpu_wordcount.py, then open up a terminal and run:

python gpu_wordcount.py
The output on my machine looks like this:
Downloading the .txt file
Reading data
Dataset size = 101250379
Uploading array to gpu
Word count took 8.30793380737 milliseconds
Estimated throughput 11.3502064216 Gigabytes/s
Word Count: 17726106.0
The script downloads a 1MB txt file, reads it in as a numpy array, makes the array roughly 100 times longer by appending 100 copies of the data (just to try it out easily with a dataset that is significantly larger than 1MB), then counts the number of words in the array by counting every transition from a whitespace character to a non-whitespace character, which gives the same count you'd get by splitting on white spaces.
Comparing the GPU script to a regular Python script that does the same thing, the GPU script was 355 times faster! Eventually I'd like to build out this example to count the frequency of all the unique words in a txt file, rather than just counting the number of total words. Easier said than done, so if you have any advice I'm all ears!
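The plain-Python comparison script isn't reproduced in this post, but a minimal CPU version along these lines captures the idea (a sketch of the comparison, not the exact script I timed):

import time

def cpuWordCount(filename):
    # The straightforward pure-Python approach: split the whole file
    # on whitespace and count the pieces. Note that split() treats
    # newlines and tabs as whitespace while the GPU kernel only looks
    # at ascii 32, so the two counts can differ slightly.
    text = open(filename).read()
    start = time.time()
    wordcount = len(text.split())
    seconds = time.time() - start
    print "CPU word count took ", seconds * 1000, " milliseconds"
    return wordcount

if __name__ == "__main__":
    print 'Word Count: %s' % cpuWordCount("shakespeare.txt")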
Thursday, March 7, 2013
In this IPython Notebook, I go through some statistical tests in Python with Google Domestic Trends data using searches by automotive buyers (queries such as "cars, kelly blue book, auto, used cars, toyota, autotrader") to try and predict the volume of search queries related to automotive financing (queries such as "lease, mileage, loan calculator, auto loan, car payment").
I also do some basic tests of periodicity in the data, as well as provide a Python wrapper for querying Google Domestic Trends to return a pandas DataFrame.
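The Notebook has the full details, but the core regression step looks something like this sketch, where the trends data is assumed to already be in a pandas DataFrame and the column names 'buyers' and 'financing' are hypothetical stand-ins for the two query-volume series:

import statsmodels.api as sm

def regress_financing_on_buyers(df):
    # Regress financing-related query volume on buyer-related query
    # volume ('buyers' and 'financing' are hypothetical column names)
    X = sm.add_constant(df['buyers'])
    model = sm.OLS(df['financing'], X).fit()
    print model.summary()
    return model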
Friday, December 28, 2012
Take a look at this Github repo for a Python script that can be used to fetch data from the Google Ngram Viewer. See the README file in the repo for instructions on how to use the script, as well as the PLOTTING file for instructions on how to plot the data in Python using pandas.
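The PLOTTING file has the specifics, but the general idea is simple. Assuming the fetch script has saved its results to a CSV (the filename and layout here are hypothetical) with a year column and one frequency column per phrase, plotting is just:

import pandas as pd
import matplotlib.pyplot as plt

# 'ngrams.csv' is a hypothetical output file from the fetch script,
# with a 'year' column and one column of frequencies per phrase
df = pd.read_csv('ngrams.csv', index_col='year')
df.plot()
plt.ylabel('Ngram frequency')
plt.show()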
The Python script is a modified version of the one available on the culturomics.org website.
To see an example of the type of research that's being done using the Google Ngram Viewer, check out this TED Talk by the creators of culturomics.org at Harvard.
NOTE: Be nice to the Google servers and don't beat them to death with this script.
Monday, December 17, 2012
Check out this repo of mine on Github for a set of Python scripts for scraping various data from Newegg.com and storing it in an SQLite database.
The scripts use the mobile Newegg.com site to retrieve the list of all the products in a category, then use the product ID for each product in that category to fetch and parse data from the Newegg JSON API, transforming it into a pandas DataFrame and dumping it into a table in the SQLite database.
The mobile Newegg.com site is used because it is lighter weight than the desktop version of the site.
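In outline, each script does something like the sketch below. The endpoint URL and the assumption of a flat JSON record are hypothetical stand-ins here, not the exact details from the repo:

import json
import sqlite3
import urllib2
import pandas as pd

# Hypothetical endpoint; see the repo for the real request URLs
API_URL = 'http://example.newegg-api.com/Products/%s'

def fetch_product(product_id):
    # Fetch and parse one product's JSON record (assumed to be a
    # flat dict of fields for this sketch)
    response = urllib2.urlopen(API_URL % product_id)
    return json.loads(response.read())

def store_products(product_ids, dbfile, table):
    # One row per product, one table per product category
    records = [fetch_product(pid) for pid in product_ids]
    df = pd.DataFrame(records)
    conn = sqlite3.connect(dbfile)
    df.to_sql(table, conn, if_exists='append')
    conn.close()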
As of right now, I have scripts in the Github repo set up to collect data on the following:
- Desktop CPUs
- Desktop Memory
- Hard Drives
- LCD/LED/Plasma TVs
- PS3 Games
- XBox 360 Games
Each script dumps the data into a separate table in the SQLite database file. Feel free to tweak these scripts however you like, perhaps to retrieve different data from the Newegg JSON API for each product, or even to grab data from another set of products on Newegg. Enjoy!
Monday, April 2, 2012
The application analyzed in the tutorial involves determining the factors that influence the price of a desktop CPU, using the prices and features of processors currently listed at newegg.com as the dataset for the model.
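In sketch form, a price model along those lines can be fit with statsmodels' formula interface; the column names below (price, cores, clock_ghz, cache_mb) are hypothetical stand-ins for whatever features get pulled from newegg.com:

import statsmodels.formula.api as smf

def fit_price_model(df):
    # df is assumed to hold one row per CPU listing, with hypothetical
    # columns for its price and a few spec features
    model = smf.ols('price ~ cores + clock_ghz + cache_mb', data=df).fit()
    print model.summary()
    return model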
Lastly, if you haven't been to econpy.org in a while, check out the recently updated interface which now includes an IPython terminal right in the browser -- special thanks to PythonAnywhere on that one!
Sunday, November 20, 2011
from pandas.io.data import DataReader

DataReader is what we'll use to retrieve stock price data from Yahoo! Finance (YF). The DataReader class can also retrieve economic data from two other remote sources, namely the St. Louis Federal Reserve's FRED database and Kenneth French's Data Library, both of which will be the topics of future posts.
from datetime import datetime
For now, let's say we want to grab historical stock price data on Microsoft (MSFT). We'll do so by defining an object msft that contains daily prices for Microsoft, like so:
msft = DataReader("MSFT", "yahoo")The first input of DataReader is the name of the dataset (e.g. the stock ticker for YF) and the second is the location of the dataset ("yahoo" for YF). The msft object is a DataFrame object whose rows are days of the year and columns are the prices for each day, labelled Open, High, Low, Close, Volume and Adj Close, respectively.
By default, the data contains observations from the past year, but that can be changed by providing a datetime object as the third input to DataReader:
msft = DataReader("MSFT", "yahoo", datetime(2009,1,1))The msft object now contains daily price data from the start of 2009 up to today's date. To print a particular column of msft, such as the stock's daily volume, enter:
print msft["Volume"]As another example, to print the adjusted closing price of MSFT, but only for the last 100 days, enter:
print msft["Adj Close"][-100:]That's it for now. Expect a follow-up post soon that'll include a more involved example using the FRED database.