Vaex is open source software. If you need support, contact us at https://vaex.io.
Vaex is a Python library for lazy out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count and standard deviation on an N-dimensional grid at a rate of more than a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted).
Tab completion, e.g. ds.mean<tab>, feels very similar to Pandas.
vaex-core: Dataset and core algorithms, takes numpy arrays as input columns.
vaex-hdf5: Provides memory mapped numpy arrays to a Dataset.
vaex-arrow: Arrow support for cross language data sharing.
vaex-viz: Visualization based on matplotlib.
vaex-jupyter: Interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet.
vaex-astro: Astronomy related transformations and FITS file support.
vaex-server: Provides a server to access a dataset remotely.
vaex-distributed: (Proof of concept) combines multiple servers / a cluster into a single dataset for distributed computations.
vaex-qt: Program written using Qt GUI.
vaex: Meta package that installs all of the above.
vaex-ml: Machine learning with automatic pipelines.
Using conda:
conda install -c conda-forge vaex
Using pip:
pip install vaex
Or read the detailed instructions
We assume you have installed vaex and are running a Jupyter notebook server. We start by importing vaex and asking it for an example dataset.
import vaex
ds = vaex.example() # open the example dataset provided with vaex
Alternatively, you can download some larger datasets, or read in your own CSV file.
ds # will pretty print a table
# | x | y | z | vx | vy | vz | E | L | Lz | FeH |
---|---|---|---|---|---|---|---|---|---|---|
0 | -0.777470767 | 2.10626292 | 1.93743467 | 53.276722 | 288.386047 | -95.2649078 | -121238.171875 | 831.0799560546875 | -336.426513671875 | -2.309227609164518 |
1 | 3.77427316 | 2.23387194 | 3.76209331 | 252.810791 | -69.9498444 | -56.3121033 | -100819.9140625 | 1435.1839599609375 | -828.7567749023438 | -1.788735491591229 |
2 | 1.3757627 | -6.3283844 | 2.63250017 | 96.276474 | 226.440201 | -34.7527161 | -100559.9609375 | 1039.2989501953125 | 920.802490234375 | -0.7618109022478798 |
3 | -7.06737804 | 1.31737781 | -6.10543537 | 204.968842 | -205.679016 | -58.9777031 | -70174.8515625 | 2441.724853515625 | 1183.5899658203125 | -1.5208778422936413 |
4 | 0.243441463 | -0.822781682 | -0.206593871 | -311.742371 | -238.41217 | 186.824127 | -144138.75 | 374.8164367675781 | -314.5353088378906 | -2.655341358427361 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
329995 | 3.76883793 | 4.66251659 | -4.42904139 | 107.432999 | -2.13771296 | 17.5130272 | -119687.3203125 | 746.8833618164062 | -508.96484375 | -1.6499842518381402 |
329996 | 9.17409325 | -8.87091351 | -8.61707687 | 32.0 | 108.089264 | 179.060638 | -68933.8046875 | 2395.633056640625 | 1275.490234375 | -1.4336036247720836 |
329997 | -1.14041007 | -8.4957695 | 2.25749826 | 8.46711349 | -38.2765236 | -127.541473 | -112580.359375 | 1182.436279296875 | 115.58557891845703 | -1.9306227597361942 |
329998 | -14.2985935 | -5.51750422 | -8.65472317 | 110.221558 | -31.3925591 | 86.2726822 | -74862.90625 | 1324.5926513671875 | 1057.017333984375 | -1.225019818838568 |
329999 | 10.5450506 | -8.86106777 | -4.65835428 | -2.10541415 | -27.6108856 | 3.80799961 | -95361.765625 | 351.0955505371094 | -309.81439208984375 | -2.5689636894079477 |
Using square brackets [], we can easily filter or get different views on the dataset.
ds_negative = ds[ds.x < 0] # easily filter your dataset, without making a copy
ds_negative[:5][['x', 'y']] # take the first five rows, and only the 'x' and 'y' columns (no memory copy!)
# | x | y |
---|---|---|
0 | -0.777471 | 2.10626 |
1 | -7.06738 | 1.31738 |
2 | -5.17174 | 7.82915 |
3 | -15.9539 | 5.77126 |
4 | -12.3995 | 13.9182 |
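The "no memory copy" behaviour of a filter can be pictured with a small plain-Python sketch. This is not vaex's implementation, only an illustration of the idea: the class FilteredView and its methods are hypothetical, and the sample values are the first five rows of the table above, rounded to two decimals. The view keeps a reference to the original columns and stores only the filter condition; rows are selected when they are requested.

```python
class FilteredView:
    """Hypothetical view: shares the parent's columns and stores only a predicate."""

    def __init__(self, columns, predicate=None):
        self.columns = columns      # dict of name -> list, shared, never copied
        self.predicate = predicate  # function taking a row index, or None

    def filter(self, predicate):
        # A new view over the same column objects; no data is duplicated.
        return FilteredView(self.columns, predicate)

    def head(self, n, names):
        """Return up to n matching rows as tuples of the requested columns."""
        rows = []
        length = len(next(iter(self.columns.values())))
        for i in range(length):
            if len(rows) >= n:
                break
            if self.predicate is None or self.predicate(i):
                rows.append(tuple(self.columns[name][i] for name in names))
        return rows


data = {
    "x": [-0.78, 3.77, 1.38, -7.07, 0.24],
    "y": [2.11, 2.23, -6.33, 1.32, -0.82],
}
ds_sketch = FilteredView(data)
negative = ds_sketch.filter(lambda i: data["x"][i] < 0)

print(negative.head(5, ["x", "y"]))        # [(-0.78, 2.11), (-7.07, 1.32)]
print(negative.columns["x"] is data["x"])  # True: the column is shared
```

The filtered view costs one extra object regardless of how many rows the columns hold, which is the property the comments in the vaex example above refer to.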
When dealing with huge datasets, say a billion rows (10^9), computations with the data can waste memory: up to 8 GB for a single new column (10^9 values at 8 bytes per float64 value). Instead, vaex uses lazy computation: only a representation of the computation is stored, and the computation is done on the fly when needed. Even so, you can use many of the numpy functions, as if it were a normal array.
import numpy as np
# creates an expression (nothing is computed)
r = np.sqrt(ds.x**2 + ds.y**2 + ds.z**2)
r # for convenience, we print out some values
<vaex.expression.Expression(expressions='sqrt((((x ** 2) + (y ** 2)) + (z ** 2)))')> instance at 0x11bcc4780 values=[2.9655450396553587, 5.77829281049018, 6.99079603950256, 9.431842752707537, 0.8825613121347967 ... (total 330000 values) ... 7.453831761514681, 15.398412491068198, 8.864250273925633, 17.601047186042507, 14.540181524970293]
These expressions can be added to the dataset, creating what we call a virtual column. These virtual columns are similar to normal columns, except they do not waste memory.
ds['r'] = r # add a (virtual) column that will be computed on the fly
ds.mean(ds.x), ds.mean(ds.r) # calculate statistics on normal and virtual columns
(-0.06713149126400597, 9.407082338299773)
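To illustrate what "only a representation of the computation is stored" can mean, here is a minimal plain-Python sketch. It does not reproduce vaex internals: the class LazyColumn, the helper column and the three-row sample data are all hypothetical. Each operation builds a new object holding a label and a recipe, and values are produced only when evaluate is called.

```python
import math


class LazyColumn:
    """Hypothetical stand-in for a lazy expression: stores a recipe, not results."""

    def __init__(self, label, compute):
        self.label = label        # human-readable form of the computation
        self._compute = compute   # function mapping a row index to a value

    def __add__(self, other):
        return LazyColumn(f"({self.label} + {other.label})",
                          lambda i: self._compute(i) + other._compute(i))

    def __pow__(self, exponent):
        return LazyColumn(f"({self.label} ** {exponent})",
                          lambda i: self._compute(i) ** exponent)

    def sqrt(self):
        return LazyColumn(f"sqrt({self.label})",
                          lambda i: math.sqrt(self._compute(i)))

    def evaluate(self, start, stop):
        # Values are only produced here, and only for the requested rows.
        return [self._compute(i) for i in range(start, stop)]


def column(name, data):
    """Wrap an in-memory list as a LazyColumn (illustrative helper)."""
    return LazyColumn(name, lambda i: data[i])


x = column("x", [3.0, 0.0, 1.0])
y = column("y", [4.0, 5.0, 2.0])
z = column("z", [0.0, 12.0, 2.0])

r_lazy = (x**2 + y**2 + z**2).sqrt()  # nothing is computed yet
print(r_lazy.label)                   # sqrt((((x ** 2) + (y ** 2)) + (z ** 2)))
print(r_lazy.evaluate(0, 3))          # [5.0, 13.0, 3.0]
```

Because r_lazy holds only closures over the source columns, its memory footprint does not grow with the number of rows, which is the property a virtual column relies on.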
One of the core features of vaex is its ability to calculate statistics on a regular (N-dimensional) grid. The dimensions of the grid are specified by the binby argument (analogous to SQL's GROUP BY), together with the shape and limits arguments.
ds.mean(ds.r, binby=ds.x, shape=32, limits=[-10, 10]) # create statistics on a regular grid (1d)
array([15.01058183, 14.43693006, 13.72923338, 12.90294499, 11.86615103, 11.03563695, 10.12162553, 9.2969267 , 8.58250973, 7.86602644, 7.19568442, 6.55738773, 6.01942499, 5.51462457, 5.15798991, 4.8274218 , 4.7346551 , 5.1343761 , 5.46017944, 6.02199777, 6.54132124, 7.27025256, 7.99780777, 8.55188217, 9.30286584, 9.97067561, 10.81633293, 11.60615795, 12.33813552, 13.10488982, 13.86868565, 14.60577266])
ds.mean(ds.r, binby=[ds.x, ds.y], shape=32, limits=[-10, 10]) # or 2d
ds.count(ds.r, binby=[ds.x, ds.y], shape=32, limits=[-10, 10]) # or 2d counts/histogram
array([[22., 33., 37., ..., 58., 38., 45.], [37., 36., 47., ..., 52., 36., 53.], [34., 42., 47., ..., 59., 44., 56.], ..., [73., 73., 84., ..., 41., 40., 37.], [53., 58., 63., ..., 34., 35., 28.], [51., 32., 46., ..., 47., 33., 36.]])
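To make the roles of binby, shape and limits concrete, the following plain-Python sketch computes a mean on a regular 1d grid. It is only an illustration of the idea and not how vaex implements it: the function name binned_mean and the sample data are hypothetical, and both ignoring rows outside the limits and marking empty bins with NaN are assumptions made for this sketch.

```python
import math


def binned_mean(values, bin_by, shape, limits):
    """Mean of `values` in `shape` equal-width bins of `bin_by` over `limits`."""
    low, high = limits
    width = (high - low) / shape
    sums = [0.0] * shape
    counts = [0] * shape
    for value, position in zip(values, bin_by):
        if not (low <= position < high):
            continue  # assumption: rows outside the limits are ignored
        # Guard against floating-point rounding at the upper edge.
        index = min(int((position - low) / width), shape - 1)
        sums[index] += value
        counts[index] += 1
    # Empty bins have no defined mean; NaN marks them in this sketch.
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


x_values = [-9.0, -8.5, -1.0, 0.5, 4.0, 12.0]
r_values = [10.0, 12.0, 2.0, 1.0, 5.0, 99.0]

means = binned_mean(r_values, bin_by=x_values, shape=4, limits=[-10, 10])
print(means)  # [11.0, 2.0, 3.0, nan]
```

Here shape=4 and limits=[-10, 10] give four bins of width 5. The row with x = 12.0 falls outside the limits and does not contribute, and the last bin receives no rows.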
These one- and two-dimensional grids can be visualized using any plotting library, such as matplotlib, but the setup can be tedious. For convenience we can use plot1d or plot, or see the list of plotting commands.
Continue the tutorial or check the examples
If you like vaex, please let us know by giving us a star on GitHub.
Regards,
The vaex.io team