Sorting Columns¶
Use the sort
function in Python or the arrange
function in R to create a new frame that is sorted by column(s) in ascending (default) or descending order. Note that when using sort
, the original frame cannot contain any string columns.
If only one column is specified in the sort, then the final results are sorted according to that one single column either in ascending (default) or in descending order. However, if you specify more than one column in the sort, then H2O performs as described below:
Assuming two columns, X (first column) and Y (second column):
H2O will sort on the first specified column, so in the case of [0,1], the X column will be sorted first. Similarly, in the case of [1,0], the Y column will be sorted first.
H2O will sort on subsequent columns in the order they are specified, but only on those rows that have the same values as the first sorted column. No sorting will be done on subsequent columns if the values are not also duplicated in the first sorted column.
# Currently, this function only supports `all.x = TRUE`. All other permutations will fail.
library(h2o)
h2o.init()
# Import the smallIntFloats dataset
X <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/synthetic/smallIntFloats.csv.zip")
X
C1 C10
1 68379 -1.618668e+07
2 67108864 3.276800e+04
3 32768 -8.709456e+08
4 32 1.310720e+05
5 268435456 -2.910033e+01
6 105383117 -2.397206e+08
[180000 rows x 2 columns]
# Sort on the first column only in ascending order (default)
X_sorted1 <- h2o.arrange(X,C1)
X_sorted1
C1 C10
1 -1073593184 7.474380e+05
2 -1073563127 -2.097152e+06
3 -1073521109 5.110769e+06
4 -1073416724 2.220942e+06
5 -1073361973 -5.707598e+00
6 -1073357712 -4.650334e+03
[180000 rows x 2 columns]
# Sort on both columns in descending order, specifying to sort on C1 first
X_sorted2 <- h2o.arrange(X, desc(C1),desc(C10))
X_sorted2
C1 C10
1 1073593184 256.000000
2 1073521109 -128.000000
3 1073257966 15.616867
4 1073072648 1.884208
5 1072757094 441.816579
6 1072669626 -512.000000
[180000 rows x 2 columns]
# Sort on the second column in descending order
X_sorted3 <- h2o.arrange(X, desc(C10))
X_sorted3
C1 C10
1 321417689 1073662860
2 448 1073574390
3 85 1073288384
4 -4096 1072908385
5 28 1072890306
6 -4194304 1072750253
[180000 rows x 2 columns]
import h2o
h2o.init()
# Import the smallIntFloats dataset
df1 = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/synthetic/smallIntFloats.csv.zip")
df1
C1 C10
---------------- ----------------
68379 -1.61867e+07
6.71089e+07 32768
32768 -8.70946e+08
32 131072
2.68435e+08 -29.1003
1.05383e+08 -2.39721e+08
350191 21551.4
-188 2.39872e+07
493 525.825
9.31041e+07 -1.63828e+08
[180000 rows x 2 columns]
# Sort on the first column only in ascending order (default)
sorted_column_indices=[0]
df2 = df1.sort(0)
df2
C1 C10
------------ ----------------
-1.07359e+09 747438
-1.07356e+09 -2.09715e+06
-1.07352e+09 5.11077e+06
-1.07342e+09 2.22094e+06
-1.07336e+09 -5.7076
-1.07336e+09 -4650.33
-1.07326e+09 -1.04858e+06
-1.07307e+09 8192
-1.07291e+09 -1.49017
-1.07291e+09 -9337.5
[180000 rows x 2 columns]
# Sort on both columns in descending order, specifying to sort on C1 first
df3 = df1.sort([0,1], ascending=[False, False])
df3
C1 C10
----------- -----------------
1.07359e+09 256
1.07352e+09 -128
1.07326e+09 15.6169
1.07307e+09 1.88421
1.07276e+09 441.817
1.07267e+09 -512
1.07233e+09 1444.14
1.07184e+09 -231812
1.07096e+09 2.00296e+07
1.07082e+09 5.36871e+08
[180000 rows x 2 columns]
# Sort on the second column in descending order
df4 = df1.sort(1, ascending=False)
df4
C1 C10
----------------- -----------
3.21418e+08 1.07366e+09
448 1.07357e+09
85 1.07329e+09
-4096 1.07291e+09
28 1.07289e+09
-4.1943e+06 1.07275e+09
6.61688e+06 1.07254e+09
-50127 1.07235e+09
-262144 1.07207e+09
55 1.07175e+09
[180000 rows x 2 columns]