I recently came across an article demonstrating how to count words in a .txt file on the GPU with a MapReduce-style algorithm. Having access to a monster rig at work with four NVIDIA Tesla C2075 GPUs, I decided to give it a try. Their code didn't work out of the box, but with a few changes I got it running and sped it up a bit.
Here is my version of their script:
from pycuda import gpuarray
from pycuda.reduction import ReductionKernel
import pycuda.autoinit
import numpy as np
import time

def createCudawckernal():
    # 32 is the ASCII code for a space. The map expression flags every
    # position where a space is followed by a non-space character
    # (i.e. the start of a word); the reduce expression sums the flags.
    mapper = "(a[i] == 32)*(b[i] != 32)"
    reducer = "a+b"
    cudafunctionarguments = "char* a, char* b"
    wckernal = ReductionKernel(np.dtype(np.float32), neutral="0",
                               reduce_expr=reducer, map_expr=mapper,
                               arguments=cudafunctionarguments)
    return wckernal

def createBigDataset(filename):
    print "Reading data"
    dataset = np.fromfile(filename, dtype=np.int8)
    originaldata = dataset.copy()
    # Append 100 copies of the file to itself to get a dataset of ~100 MB
    for k in xrange(100):
        dataset = np.append(dataset, originaldata)
    print "Dataset size = ", len(dataset)
    return np.array(dataset, dtype=np.uint8)

def wordCount(wckernal, bignumpyarray):
    print "Uploading array to gpu"
    gpudataset = gpuarray.to_gpu(bignumpyarray)
    datasetsize = len(bignumpyarray)
    start = time.time()
    # Pair each byte with its successor and count space -> non-space transitions
    wordcount = wckernal(gpudataset[:-1], gpudataset[1:]).get()
    stop = time.time()
    seconds = (stop - start)
    estimatepersecond = (datasetsize / seconds) / (1024 * 1024 * 1024)
    print "Word count took ", seconds * 1000, " milliseconds"
    print "Estimated throughput ", estimatepersecond, " Gigabytes/s"
    return wordcount

if __name__ == "__main__":
    print 'Downloading the .txt file'
    from urllib import urlretrieve
    txtfileurl = 'https://s3.amazonaws.com/econpy/shakespeare.txt'
    urlretrieve(txtfileurl, 'shakespeare.txt')
    print 'Go Baby Go!'
    bignumpyarray = createBigDataset("shakespeare.txt")
    wckernal = createCudawckernal()
    wordcount = wordCount(wckernal, bignumpyarray)
    print 'Word Count: %s' % wordcount
Assuming you have an NVIDIA GPU in your machine and PyCUDA installed, save the script above as gpu_wordcount.py, then open a terminal and run:
python gpu_wordcount.py
The output on my machine looks like this:
Downloading the .txt file
Reading data
Dataset size = 101250379
Uploading array to gpu
Word count took 8.30793380737 milliseconds
Estimated throughput 11.3502064216 Gigabytes/s
Word Count: 17726106.0
The script downloads a roughly 1 MB .txt file, reads it into a numpy array of bytes, appends 100 copies of the data to itself (just to get a dataset significantly larger than 1 MB), then counts the words by counting every transition from a space to a non-space character.
Compared to a regular Python script that does the same thing, the GPU script was 355 times faster! Eventually I'd like to build this example out to count the frequency of every unique word in a .txt file, rather than just the total number of words. Easier said than done, so if you have any advice I'm all ears!
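For anyone who wants to reproduce the comparison, a straightforward CPU baseline looks something like the sketch below. It isn't the exact script behind the 355x number, just the obvious split-based approach applied to the same duplicated data:

import time

# Rough sketch of a plain-Python baseline: read the file, mimic the 100x
# duplication from the GPU script, and count words by splitting on whitespace.
def cpu_wordcount(filename, copies=100):
    with open(filename) as f:
        text = f.read()
    text = text * (copies + 1)  # original data plus 100 appended copies
    start = time.time()
    count = len(text.split())
    print "CPU word count took ", (time.time() - start) * 1000, " milliseconds"
    return count

if __name__ == "__main__":
    print 'Word Count: %s' % cpu_wordcount("shakespeare.txt")

Note that str.split() treats newlines and tabs as separators too, so its count won't exactly match the GPU kernel, which only looks at the space character (more on that in the comments below).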
I believe there is a major issue with this implementation. Although it is a big speedup over the linked article, it now misses a lot of words. I used http://www.gutenberg.org/cache/epub/10/pg10.txt as my dataset, and compared against two online word counters and Microsoft Word, this program's count falls well below theirs. I believe this is because newline and carriage return characters are not treated as word separators, so the first word of every line doesn't get counted. In the dataset, it looks something like this:
"this is part of the dataset\nAnd the start of a new line"
Notice how dataset\nAnd will be counted as one word if we don't treat the newline as a separator?
Anyway, just a warning for others attempting to use this as a reference.
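One way to address this (a sketch only, not benchmarked on the rig from the post) is to count tab, newline, and carriage return as word separators in the map expression as well, keeping the rest of the kernel setup the same. The function name here is just a placeholder for a drop-in replacement for createCudawckernal:

from pycuda.reduction import ReductionKernel
import numpy as np

# Sketch of a whitespace-aware kernel: treat space (32), tab (9), newline (10),
# and carriage return (13) as word separators, so the first word of each line
# is counted too.
def createWhitespaceAwareKernal():
    is_ws_a = "(a[i] == 32 || a[i] == 9 || a[i] == 10 || a[i] == 13)"
    is_ws_b = "(b[i] == 32 || b[i] == 9 || b[i] == 10 || b[i] == 13)"
    return ReductionKernel(np.dtype(np.float32), neutral="0",
                           reduce_expr="a+b",
                           map_expr=is_ws_a + " * !" + is_ws_b,
                           arguments="char* a, char* b")

This still misses the very first word of the file when the file doesn't start with whitespace, so the count can remain off by one relative to tools that split on all whitespace.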