Archive for the ‘Uncategorized’ Category

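I could not find a source for the magic z40 in the .git/hooks/pre-push.sample file. A quick Google search turned up this:

z40 is a regular expression matching the empty blob/commit/tree SHA: “0000000000000000000000000000000000000000”. Found here: https://github.com/git-lfs/git-lfs/blob/master/git/rev_list_scanner.go

In the sample hook itself, z40 is just a shell variable holding those forty zeros. Git reports this all-zero SHA for a ref that does not exist, so the hook compares against it to detect a branch being deleted (the local SHA is all zeros) or newly created (the remote SHA is all zeros).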

Hopefully somebody can comment and let me know of a more detailed explanation.
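In the meantime, here is a minimal Python sketch of how the all-zero SHA is typically used in a pre-push hook. Git feeds the hook one line per pushed ref on stdin, in the form "<local ref> <local sha> <remote ref> <remote sha>". The function name classify_push and the sample lines are made up for illustration; this is not the sample hook's own code.

```python
# The all-zero SHA-1 that git uses to mean "this ref does not exist".
Z40 = "0" * 40


def classify_push(line: str) -> str:
    """Classify one stdin line of a pre-push hook.

    Expected form: <local ref> <local sha> <remote ref> <remote sha>
    """
    local_ref, local_sha, remote_ref, remote_sha = line.split()
    if local_sha == Z40:
        return "delete"      # nothing local to push: the remote ref is being deleted
    if remote_sha == Z40:
        return "new branch"  # the remote has no such ref yet
    return "update"          # both sides exist: an ordinary update


# Hypothetical input lines, with made-up SHAs.
sha_a = "a" * 40
sha_b = "b" * 40
print(classify_push(f"refs/heads/main {sha_a} refs/heads/main {sha_b}"))
print(classify_push(f"refs/heads/topic {sha_a} refs/heads/topic {Z40}"))
print(classify_push(f"(delete) {Z40} refs/heads/old {sha_b}"))
```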


I could not easily find a solution to this problem, hence this post.

Here are two ways of checking that all rows of a pandas DataFrame df are identical:
df.apply(pd.Series.nunique, axis=0).unique().tolist() == [1]
df.groupby(df.columns.tolist()).ngroups == 1

The groupby method was *slightly* faster in my timing tests, using the following code:

import pandas as pd
import timeit

# two identical rows: {'a': 1, 'b': 2, 'c': 3}
data = [
    dict(zip(list('abc'), range(1, 4))),
    dict(zip(list('abc'), range(1, 4))),
]

df = pd.DataFrame(data)

# check 1: every column holds exactly one distinct value
print(df.apply(pd.Series.nunique, axis=0).unique().tolist() == [1])

# check 2: grouping by all columns yields a single group
print(df.groupby(df.columns.tolist()).ngroups == 1)

# time 1000 runs of each check
print(timeit.timeit('df.apply(pd.Series.nunique, axis=0).unique().tolist() == [1]',
                    number=1000, globals=globals()))

print(timeit.timeit('df.groupby(df.columns.tolist()).ngroups == 1', number=1000,
                    globals=globals()))
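
The timing script only exercises the case where the rows match. As a quick sanity check, the sketch below wraps both expressions in functions and also runs them on a frame whose rows differ. The function names and the sample frames are my own, chosen only for illustration.

```python
import pandas as pd


def rows_identical_nunique(df: pd.DataFrame) -> bool:
    # True when every column holds exactly one distinct value.
    return df.apply(pd.Series.nunique, axis=0).unique().tolist() == [1]


def rows_identical_groupby(df: pd.DataFrame) -> bool:
    # True when grouping by all columns yields a single group.
    return df.groupby(df.columns.tolist()).ngroups == 1


same = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}] * 3)
mixed = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 1, 'b': 5, 'c': 3}])

print(rows_identical_nunique(same), rows_identical_groupby(same))
print(rows_identical_nunique(mixed), rows_identical_groupby(mixed))
```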


from sklearn.feature_extraction import stop_words


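currently there are 318 words in that frozenset (ENGLISH_STOP_WORDS). note that newer scikit-learn releases no longer ship the stop_words module; there the same frozenset is imported from sklearn.feature_extraction.text.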

NLTK also has its own stopwords

from nltk.corpus import stopwords

there are 153 words in its english list at the time of writing.
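
as a rough sketch of how such a list is typically used, the snippet below filters the stop words out of a sentence. it assumes a recent scikit-learn, where the frozenset is imported from sklearn.feature_extraction.text; the helper name remove_stop_words and the sample sentence are made up for illustration.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def remove_stop_words(text: str) -> list:
    # Lowercase, split on whitespace, and drop anything in the stop word set.
    return [word for word in text.lower().split() if word not in ENGLISH_STOP_WORDS]


print(len(ENGLISH_STOP_WORDS))
print(remove_stop_words("The quick brown fox jumps over the lazy dog"))
```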


select a/sum(a) from foo; -- WRONG!

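this will not work: sum(a) without an over clause is an ordinary aggregate, so it cannot be combined with the per-row value of a. hive normally rejects the query because a is not in a group by, and adding “group by a” only sums within each group, which for distinct values gives all 1, since sum(n) = n where n is a single number.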

this is where hive's windowing and analytics functions come in handy.

the solution is to use the following:

select a/sum(a) over () from foo; -- RIGHT !!!

here we are instructing that the sum be performed over the entire column ‘a’.
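
to see the effect without a hive cluster, here is a small sketch that runs the same kind of query against an in-memory SQLite database from python. SQLite is only a stand-in here (it needs version 3.25 or newer for window functions), and the four sample values are made up. one difference to note: SQLite does integer division on ints, so the sketch multiplies by 1.0, whereas hive's "/" already returns a double.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table foo (a int)")
conn.executemany("insert into foo values (?)", [(n,) for n in (1, 2, 3, 4)])

# sum(a) over () computes the total over all rows and repeats it on every row,
# so each row can be divided by the column total.
rows = conn.execute(
    "select a, a * 1.0 / sum(a) over () as fraction from foo order by a"
).fetchall()

for a, fraction in rows:
    print(a, fraction)
```

each row's fraction is its share of the column total, so the fractions add up to 1.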

you could of course make 2 queries: calculate the sum of a first and then hardcode it in a second query. but that means 2 passes over the table for sure. i can’t be sure how many passes over the table the right query above makes; if you know the answer please post in the comments below.


It has been a while since I played with Arch Linux. Meanwhile the AUR has transitioned and now uses version-controlled PKGBUILDs. So here is how to install a package from it now.

Let us take the example of the package cower.

If you visit its AUR page you will find a “Download snapshot” link in the Package Actions box at the top right of the page. Click it to download a compressed tarball, cower.tar.gz in this case. Uncompress that to find the actual PKGBUILD in it. I also noticed a hidden file called .SRCINFO in the same folder. Now you can simply issue the command “makepkg -irs” in the same directory and you are all set: -s installs missing dependencies with pacman, -i installs the package once it is built, and -r removes the build-time dependencies afterwards.

The other way is to git clone the repo. The repo link is right at the top of the page under Git Clone URL. If you clone the repo you will find the PKGBUILD, the .SRCINFO file and a .git directory in there. Again, use “makepkg -irs” to install the package.
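
The git route can be sketched as a small Python helper. This is only an illustration: the function names are made up, and the clone URL pattern https://aur.archlinux.org/<package>.git is an assumption that should be checked against the Git Clone URL shown on the package page. The sketch only prints the commands; call aur_install to actually run them.

```python
import subprocess


def aur_install_commands(package: str, base_url: str = "https://aur.archlinux.org"):
    """Return (argv, working directory) pairs for the git-clone route."""
    return [
        (["git", "clone", f"{base_url}/{package}.git"], "."),
        (["makepkg", "-irs"], package),  # run inside the cloned directory
    ]


def aur_install(package: str) -> None:
    # Runs the commands for real; requires git and makepkg on an Arch system.
    for argv, cwd in aur_install_commands(package):
        subprocess.run(argv, cwd=cwd, check=True)


# Dry run: show what would be executed for the example package.
for argv, cwd in aur_install_commands("cower"):
    print(f"(in {cwd}) {' '.join(argv)}")
```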


window functions let you look at the previous or next values of a column. for example, to subtract the previous row's value from the current row's value, the window functions lag and lead can be used.

let us take up an example.

first create a text file containing the numbers 1 through 9, with a single number on each line, like so

1
2
3
4
5
6
7
8
9

call the file data.txt.

next we create a table in hive.

create table foo (a int);

next we load our data.txt file into this table using the following

load data local inpath 'data.txt' overwrite into table foo;

we want to access the previous and next values of column ‘a’; note, therefore, the over clause in the following query

select lag(a, 1) over (order by a) as previous, a, lead(a, 1) over (order by a) as next from foo;

which outputs the following:

previous a next
NULL 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
8 9 NULL

note how the previous and next values are NULL at the edges. you could specify a default value for such cases, as in the following query, which specifies 0.

select lag(a, 1, 0) over (order by a) as previous, a, lead(a, 1, 0) over (order by a) as next from foo;

which outputs the following:

previous a next
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
8 9 0

lag(a, 1) will fetch the previous value while lag(a, 2) will fetch the value two rows back.
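
the subtraction mentioned at the start can be sketched as follows. to keep it runnable without hive, this uses an in-memory SQLite database from python as a stand-in (SQLite 3.25 or newer supports lag and lead with the same arguments); the column aliases two_back and diff_from_previous are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table foo (a int)")
conn.executemany("insert into foo values (?)", [(n,) for n in range(1, 10)])

query = """
select a,
       lag(a, 2) over (order by a)        as two_back,
       a - lag(a, 1, 0) over (order by a) as diff_from_previous
from foo
order by a
"""
rows = conn.execute(query).fetchall()

# two_back is NULL (None in python) for the first two rows;
# the first row's difference uses the default value 0.
for row in rows:
    print(row)
```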


case matters.

i had created the table like so

create table foo(
  a int,
  b string
) stored as orc tblproperties("orc.compress"="snappy");

but when i went to populate the table using an “insert overwrite table foo select * from …” statement i got an error.

turns out i should have used “SNAPPY” instead of “snappy”, i.e. tblproperties("orc.compress"="SNAPPY"). case matters.



if you do not want markdown formatting rules applied to a chunk of text then wrap it in the following tags.

opening tag: ```text i.e. 3 backticks followed by the word text

closing tag: ``` i.e. 3 backticks

this gives you unformatted text, displayed exactly as it was typed.


import datetime

print(datetime.datetime.strftime(datetime.datetime.now(), '%c'))
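
for what it is worth, '%c' asks for the locale's own date and time representation, so the exact output depends on the locale in effect. the sketch below shows the shorter, equivalent method-call form, plus an explicit format string for when a fixed layout is wanted; the variable names are just for illustration.

```python
import datetime

now = datetime.datetime.now()

# Calling strftime through the class or on the instance gives the same string.
via_class = datetime.datetime.strftime(now, '%c')
via_method = now.strftime('%c')
print(via_method)
print(via_class == via_method)

# An explicit format string gives a locale-independent layout.
fixed_layout = now.strftime('%Y-%m-%d %H:%M:%S')
print(fixed_layout)
```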

hope this becomes the number 1 search result when people type this post’s title into google.


on trying to plot anything in ipython using matplotlib i got the following error

This application failed to start because it could not find or load the Qt platform plugin "xcb".

Available platform plugins are: eglfs, kms, linuxfb, minimal, minimalegl, offscreen, xcb.

Reinstalling the application may fix this problem.
Aborted (core dumped)

for example the following command will produce the error

ipython -c 'import pylab; pylab.plot()'

matplotlib backend: Qt5Agg

uname -srvmo :: Linux 3.16.1-1-ARCH #1 SMP PREEMPT Thu Aug 14 07:40:19 CEST 2014 x86_64 GNU/Linux

$ pacman -Q ipython python-matplotlib
ipython 2.2.0-1
python-matplotlib 1.4.0-2

solution :: install libxkbcommon-x11 (e.g. with pacman -S libxkbcommon-x11)
