Bitcoin: how big is the bubble?

Why are we looking at Bitcoin, while there are more than a thousand cryptocurrencies out there?
There are a few reasons.
Apart from its intrinsic value, Bitcoin has a number of subjective added values that are impossible to quantify.
The added value specifically for us as mathematicians is that the markets for cryptocurrencies are not regulated. This is a big opportunity for machine learning algorithms: the raw data is “clean” (compared to a regular stock, there are no mergers, no acquisitions, no dividend payments and no other corrections you need to apply before you can actually see what is going on with the price).
How much value is in Bitcoin? This is a hot topic for everyone who is trading cryptocurrencies. The question is impossible to answer exactly, since the value of Bitcoin is determined by the trust and belief in the new technology. How do you estimate the value of trust? Most of the articles I could find on the value of Bitcoin build on the market capitalisation of Bitcoin.
At the time of writing this article the market cap of Bitcoin is just above 66 billion USD, which is larger than the Coca-Cola brand (55.6 billion USD). The number of Bitcoins in circulation has almost reached 80% of the 21 million Bitcoins that can ever be mined.
There are some calculations that “in 10 years Bitcoin’s market capitalization would be 10 times the average daily volume, giving a figure of $1.75 trillion for the market cap. In 10 years, the analyst thinks that there will be 17 million Bitcoin in circulation, up from the current 16.3 million figure. If the potential 17 million of Bitcoins in supply is divided by the $1.75 trillion market cap estimate, then each Bitcoin would be worth just over $100,000”.
There is a problem with valuing Bitcoin based only on market capitalisation: it only reflects the value at a given moment in time, with no reference to past price developments. Moreover, it is nearly impossible to predict the market cap in the future.
Bitcoin is a currency with a fixed limit on total supply: 21,000,000 Bitcoins. Assuming the mining power stays constant, the last Bitcoin will be mined in 2140. Because of this property, Bitcoin is often compared to gold.
The gold market is estimated at 8.7 trillion USD. If we assume that gold will be replaced by Bitcoin, then one Bitcoin should be worth ~414,000 USD, which is 100 times more than the current price of Bitcoin.
The same calculation can be done with the world’s money, estimated at 5.2 trillion USD. If we assume that eventually Bitcoin will replace all fiat money, then 1 Bitcoin will be worth ~247,000 USD, which is 60 times more than the current price of Bitcoin.
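These replacement scenarios are simple arithmetic and easy to reproduce. A minimal sketch, using the market sizes and the ~4100 USD Bitcoin price quoted in this post (figures as of September 2017):

```python
TOTAL_SUPPLY = 21_000_000   # maximum number of Bitcoins that can ever be mined
BTC_PRICE = 4_100           # USD, approximate price in September 2017

def implied_price(market_size_usd):
    """Price per Bitcoin if it absorbed the whole market."""
    return market_size_usd / TOTAL_SUPPLY

gold = implied_price(8.7e12)    # gold market, ~8.7 trillion USD
money = implied_price(5.2e12)   # world's money, ~5.2 trillion USD

print(round(gold))               # ~414,286 USD per Bitcoin
print(round(gold / BTC_PRICE))   # ~101x the 2017 price
print(round(money))              # ~247,619 USD per Bitcoin
print(round(money / BTC_PRICE))  # ~60x the 2017 price
```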
If you do believe that Bitcoin will replace money or gold, then there is definitely room for growth. However, this is based on a belief. How can we quantify the value of a Bitcoin?
We thought that the only value we could possibly estimate is the amount of invested money. At least this will give us a lower bound on the intrinsic or fundamental value of Bitcoin.
Unfortunately, in this case too we cannot simply read the invested amounts from the blockchain. So here is a way to estimate them.
We know that several transactions are validated and combined into a block during Bitcoin mining. For this the miner gets a reward: 12.5 Bitcoins plus transaction fees. The reward per block halves every 4 years: currently it is 12.5 Bitcoins, and in 2020 it will drop to 6.25. Since transaction fees are negligible compared to the 12.5 Bitcoins, we skip them in our calculations. Basically, every Bitcoin block “costs” 12.5 Bitcoins. If we assume that the miner sells mined Bitcoins shortly after they are mined, then we can estimate the invested money for one block.
Now we take the cumulative sum over all mined blocks and we get the amount of injected/invested money in the Bitcoin blockchain.
The rate of block creation is predetermined: about 6 blocks per hour. It is adjusted every two weeks, so for our exercise we can assume it is constant over time. That means 144 blocks per day. Note that crypto markets function continuously, 24/7. If we multiply the number of blocks per day by the investment per block, we arrive at the investment per day.
Summing up over all days of Bitcoin’s life, we finally obtain the cumulative injected money in the Bitcoin chain.
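Under these assumptions the estimate is a few lines of code. This is a simplified sketch (constant 144 blocks per day, fees ignored, miners selling immediately); the halving dates are the historical ones, and `prices` stands for the daily price series obtained below:

```python
from datetime import date

BLOCKS_PER_DAY = 144  # ~6 blocks per hour, 24/7

def block_reward(day):
    """Block reward in BTC: 50 initially, halving roughly every 4 years."""
    halvings = [(date(2012, 11, 28), 25.0), (date(2016, 7, 9), 12.5)]
    reward = 50.0
    for halving_date, new_reward in halvings:
        if day >= halving_date:
            reward = new_reward
    return reward

def injected_usd(prices):
    """Cumulative USD injected, given {date: BTC price in USD}."""
    total = 0.0
    for day, price in sorted(prices.items()):
        total += BLOCKS_PER_DAY * block_reward(day) * price
    return total
```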
Of course, not all of the mined Bitcoins are sold shortly after they are mined. Because the blockchain does not contain this information, we make this assumption to get an estimate.
To get the daily Bitcoin price, we use the cryptocompare API and Python. This post describes nicely how to do it. The cryptocompare API provides prices from 2010-07-17 onwards. For earlier cumulative prices we consult the blockchain.
The estimated amount of dollars injected into the Bitcoin blockchain is about 3 billion USD (as of 2017-09-20). The market cap at 2017-09-20, however, is just above 66 billion USD. In terms of price, this means that on average the Bitcoin community has paid about 200 USD per Bitcoin, whereas the current price is just above 4100 USD.
Is Bitcoin a bubble?
As you can see, the market cap is much larger than the invested dollars, so the answer might be: yes, it looks like a bubble. If we take the market cap as 100%, then the invested dollars comprise roughly 5%.
How big is the bubble?
The difference between invested dollars and the market cap is a factor of about 20 (as of 2017-09-20).
The figure below shows the historical market cap in USD (green curve) and the cumulative injected dollars in USD (blue curve). The horizontal axis represents time, starting at 2010-07-17, the earliest date for which cryptocompare data is available. The vertical axis represents the amount of USD on a logarithmic scale.
If we have a look at the development over time, we see that before 2011 the market cap and the injected dollars were overlapping. However, from 2011 until now the market cap curve lies above the injected-dollars curve, with a few exceptions. This means that Bitcoin might currently be overvalued compared to the investment. Of course, as we mentioned earlier, this is only one part of the intrinsic value; there are a number of subjective added values that we can’t quantify.
The technology behind cryptocurrencies is very promising and can be applied in many areas, not only in finance. Whether Bitcoin is going to make it into the future is still an open question. Right now Bitcoin is the market leader among cryptocurrencies.
Whether Bitcoin is a bubble or not, the fact is that its market capitalisation is much larger than the amount of invested dollars. If Bitcoin owners decide to sell more than roughly 5% of their Bitcoins at the same time, the cryptocurrency exchanges will not be able to pay out.
What do you think is the real value of Bitcoin? Is Bitcoin a bubble? Let us know in the comments.
Liked this article? Get EZNumeric’s future articles in your inbox:
VCRS matrix compression for GPUs

Well, the good news is that we have developed a matrix compression scheme that you can use on GPUs. We call it VCRS – Very Compressed Row Storage. In this post we first focus on why we need matrix compression. Then we describe what VCRS is and how to use it for compression. Finally, we give a few examples from real-world applications.
If you are doing modelling or simulation of a physical process, most of the time you end up with differential equations describing this process. Very often we cannot solve these differential equations analytically in continuous space. Therefore, we need to discretise them and solve numerically. Discretisation can be viewed as representing your differential equation as a system of linear equations in matrix form

    A x = b,

where A is the matrix, x is the solution vector, and b is the right-hand side vector.
Most of the time, the matrix A is huge (millions of rows) and sparse (few non-zero elements per row). We can’t just invert this matrix to get the solution, since matrix inversion is very costly in terms of memory and computation. To solve this system of linear equations, we use iterative methods. Very often we can speed up iterative methods using a preconditioner M, solving

    M A x = M b.

Basically, the preconditioner M is a matrix which is close to the inverse of the matrix A, in other words M ≈ A⁻¹.
If M A is the identity matrix, then we automatically obtain the solution of our system of linear equations.
Sometimes, the original matrix can be implemented matrix-free, meaning that the elements are calculated on the fly and not stored. However, most of the time we have to store the preconditioner matrix in memory. The less storage both matrices take, the larger the problems we can compute. This is especially important if we use accelerators, for example a GPU with limited memory.
An important aspect of a preconditioner is that it does not have to be exact: it is acceptable to have a preconditioner that is an approximation of an approximation. Therefore, lossy compression of the preconditioner enables bigger problems to be computed on accelerators.
There are many ways to compress a matrix. In this blog we suggest using VCRS compression. This way we kill two birds with one stone:
VCRS stands for Very Compressed Row Storage format. This method was developed during Hans‘s PhD at TU Delft. VCRS format was inspired by the well-known CSR (Compressed Sparse Row) format.
To illustrate compression, let’s consider a small matrix from a one-dimensional Poisson equation with Dirichlet boundary conditions, see picture above.
CSR format consists of two integer and one floating point arrays:
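For the small one-dimensional Poisson matrix above (tridiagonal, stencil [-1, 2, -1], here taken as 4×4 with Dirichlet boundary conditions), a sketch of the CSR arrays looks like this. The array names `row_ptr`, `col_idx` and `data` are the conventional ones, not necessarily the labels used in the thesis:

```python
# 4x4 matrix from the 1D Poisson equation with Dirichlet BCs:
#  [ 2 -1  0  0]
#  [-1  2 -1  0]
#  [ 0 -1  2 -1]
#  [ 0  0 -1  2]

# CSR: one floating-point array with the non-zero values, row by row...
data = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
# ...one integer array with the column index of each non-zero...
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
# ...and one integer array marking where each row starts in `data`.
row_ptr = [0, 2, 5, 8, 10]

def matvec(x):
    """Sparse matrix-vector product straight from the CSR arrays."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += data[k] * x[col_idx[k]]
    return y

print(matvec([1.0, 1.0, 1.0, 1.0]))  # [1.0, 0.0, 0.0, 1.0]
```

Note the redundancy VCRS targets: the interior rows repeat the same values (-1, 2, -1) and the same relative column pattern over and over.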
To take advantage of the redundancy in the column indices of a matrix constructed by a discretization with finite differences or finite elements on structured meshes, we introduce a new sparse storage format: VCRS.
VCRS format consists of five integer and one floating point arrays:
At first glance, it seems that the VCRS format is based on more arrays than the CSR format: six versus three. However, the large arrays in the CSR format are cidx and data, and they contain redundant information: repeated indices and values of the matrix. For small matrices the overhead of VCRS can be significant, but for large matrices it can be beneficial, especially on GPUs with a limited amount of memory.
Summarizing, the following factors contribute to the usage of the VCRS format:
Of course you can already use VCRS format as described above. If you want to get even more advantages of the VCRS format, here we list two mechanisms to adjust the data redundancy:
Quantization is a lossy compression technique that maps a range of values to a single value. It has well-known applications in image processing and digital signal processing. The simplest example of quantization is rounding a real number to the nearest integer.
However, we need to make sure that the data loss introduced by lossy compression does not affect the accuracy of the solution.
The quantization technique can be used to make the matrix elements in different rows similar to each other for better compression. The quantization mechanism is based on the maximum and minimum values of a matrix and on a number of so-called bins, or sample intervals.
The figure above illustrates the quantization process for a matrix with values on the interval [0, 1]. In this example the number of bins is set to 5, meaning there are 5 sample intervals. The matrix entries are normally distributed between 0 and 1, as shown by the black dots connected with the solid line. By applying quantization, the matrix values that fall into a bin are assigned a new value equal to the bin center. Therefore, instead of the whole range of matrix entries, we only get 5 values. Obviously, the larger the number of bins, the more accurate the representation of the matrix entries.
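A minimal sketch of this binning step (function and variable names are ours, not from the thesis):

```python
def quantize(values, vmin, vmax, nbins):
    """Map each value to the center of the bin it falls into."""
    width = (vmax - vmin) / nbins
    out = []
    for v in values:
        b = min(int((v - vmin) / width), nbins - 1)  # clamp v == vmax into last bin
        out.append(vmin + (b + 0.5) * width)
    return out

print(quantize([0.07, 0.12, 0.55, 0.99], 0.0, 1.0, 5))
# bin centers: [0.1, 0.1, 0.5, 0.9]
```

After this step, many matrix entries collapse onto the same few values, so many rows become identical and compress well.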
Next, we introduce row classification as a mechanism to define similarity of two different matrix rows.
Given a sorted array of rows and a tolerance, we can easily search for two rows that are similar within a certain tolerance. The main assumption for row comparison is that the rows have the same number of non-zero elements.
Let a_i be the i-th row of matrix A of length n, and a_j be the j-th row of A.
The comparison of two rows is summarized in Algorithm 3 below. If a_i is not smaller than a_j and a_j is not smaller than a_i, then the rows a_i and a_j are “equal within the given tolerance λ”.
Algorithm 4 then describes the comparison of two complex values and Algorithm 5 compares two floating-point numbers.
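The precise Algorithms 3–5 are in the thesis; the sketch below only illustrates the idea, using a simple relative tolerance for the scalar comparison and real values instead of complex ones:

```python
def less_than(a, b, tol):
    """a is 'smaller' than b only if it is smaller by more than tol (relative)."""
    scale = max(abs(a), abs(b), 1e-30)  # guard against division by zero
    return (b - a) / scale > tol

def rows_equal(row_a, row_b, tol):
    """Two rows are 'equal within tol' if, element-wise, neither value is
    'smaller' than the other. Rows must have the same number of non-zeros."""
    if len(row_a) != len(row_b):
        return False
    for a, b in zip(row_a, row_b):
        if less_than(a, b, tol) or less_than(b, a, tol):
            return False
    return True

print(rows_equal([2.0, -1.0], [2.05, -1.01], 0.1))  # True
print(rows_equal([2.0, -1.0], [2.5, -1.0], 0.1))    # False
```

Rows that compare equal under this test can share a single stored pattern, which is where the compression comes from.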
The figure below illustrates the classification of a complex matrix entry c. Within a distance λ the numbers are assumed to be equal to c. Then, c is smaller than the numbers in the dark gray area, and larger than the numbers in the light gray area.
The number of bins and the tolerance have an influence on:
From our previous posts, you might know that one of the problems we had to solve is the Helmholtz equation – the wave equation in the frequency domain.
Using the VCRS format to store this matrix results in three to four times smaller memory requirements and three to four times faster matrix-vector multiplication, depending on the compression parameters. In general, we observed a compression factor between 3 and 20, depending on the matrix.
Based on our experiments, the most reasonable parameter choice would be a tolerance λ = 0.1 and a number of bins equal to 100,000.
Summarising our experiments for the whole solver (Bi-CGSTAB preconditioned with a shifted Laplace matrix-dependent multigrid method), we can conclude that the VCRS format can be used:
The VCRS compression can also be used in other applications where the solvers are based on a preconditioned system of linear equations. For example, an iterative solver for linear equations is also an important part of a reservoir simulator.
It appears within a Newton step when solving the discretized non-linear partial differential equations describing fluid flow in porous media. The basic partial differential equations include a mass-conservation equation, Darcy’s law, and an equation of state relating the fluid pressure to its density. In its original form the values of the matrix are scattered. Although the matrix looks full due to the scattered entries, the most common number of non-zero elements per row is 7, while the maximum number of elements per row is 210.
The distribution of the matrix values is shown in the figure below. Note that the matrix has real-valued entries only. It can be seen that there is a large variety of matrix values, which makes quantization and row classification effective.
Using the VCRS format to store this matrix results in two to three times smaller memory requirements and two to three times faster matrix-vector multiplication, depending on the compression parameters.
In this post
What matrix compression are you using? Let us know in the comment box below.
Liked this article? Get EZNumeric’s future articles in your inbox:
How to Solve the “Dependency Hell” with Python

There is another dimension to this problem – different platforms have different package systems. To name a few: Mac OS X uses MacPorts or brew, Ubuntu uses apt, and CentOS uses yum. If you are like me, then you might prefer to use Mac OS X at home and one of the Linux distributions on a computational cluster. All those systems/distributions have their own libraries with different versions, compiler versions, versions of dependencies, etc.
Even if you get the libraries you need, they might have different versions on different platforms, because distributions ship their own versions. There is an official term for this: dependency hell.
How nice would it be to have this done automatically? No more configure scripts! A huge time saver!
The idea is to have the same set of libraries across all (Unix-like) systems: you compile your software on different platforms, and you get the same versions of the required libraries everywhere.
What libraries are we talking about? If you are programming, then there is a good chance you want to install CMake, gcc, Qt, you name it.
We can break down the installation of a library into 6 actions for each platform that you are using:
There are two ways to implement the solution:
To implement the smart approach, i.e. to automate the installation process, we suggest writing a Python script. Actually, two Python scripts: one containing the class definitions (we will discuss them below), and a second one that applies these classes to each library you need to install.
Let’s start with the first Python script that contains a base class called ExternalLibrary().
class ExternalLibrary(object):
    def __init__(self):
        # all_libraries is a module-level registry mapping names to instances
        name=self.GetName()
        all_libraries[name]=self
        self.environment_file_=""
The most important function in this class is Run(), which executes steps 1-6 from the previous section. Note that this function is recursive, because step 1, “Installing dependencies”, also calls Run() for each dependency.
The code snippet below shows the function Run() and the steps 1-6 to install a library.
    def InstallDependencies(self):
        # Step 1
        dependencies=self.GetDependencies()
        for dependency in dependencies:
            external_library=all_libraries[dependency]
            external_library.Run()

    def Download(self):
        # Step 2
        if self.IsDownloaded():
            print self.GetArchiveFilename(),"already downloaded"
            return
        filename=os.path.join( self.GetCommonDownloadDirectory(),
                               self.GetArchiveFilename() )
        url = self.GetArchiveURL()
        if type(url) is list:
            url = url[0]
        self.DownloadFile( url, filename )

    def ApplyPatch(self,source_dir,patch_directory):
        # Step 3
        pass

    def ConfigureBuild(self):
        # Step 4
        # This function will be overwritten depending on the configuration system
        build_directory=self.GetPackageBuildDirectory()
        self.CreateDirectory(build_directory)
        dir_cmd="cd {0}".format( build_directory )
        print "dir_cmd=",dir_cmd
        configure_command="{0} {1}".format( self.GetConfigureCommand(),
                                            self.GetConfigureFlags() )
        cmd="{0} && {1} && {2}".format( self.GetEnvironmentCommand(),
                                        dir_cmd, configure_command )
        print "configure command chain=",cmd
        t=os.system(cmd)
        if t!=0:
            raise RuntimeError("Unable to configure")

    def Compile(self):
        # Step 5
        dir_cmd="{0} && cd {1}".format( self.GetEnvironmentCommand(),
                                        self.GetPackageBuildDirectory() )
        print "dir_cmd=",dir_cmd
        print "#cores=",self.GetNumberOfCores()
        make_command = "{0} && make -j {1} {2}".format( dir_cmd,
                                                        self.GetNumberOfCores(),
                                                        self.GetCCCommand() )
        print "make_command=",make_command
        t=os.system(make_command)
        if t!=0:
            raise RuntimeError("Unable to make")

    def Install(self):
        # Step 6
        dir_cmd="{0} && cd {1}".format( self.GetEnvironmentCommand(),
                                        self.GetPackageBuildDirectory() )
        make_command = "{0} && make install".format(dir_cmd)
        print "make install command=",make_command
        t=os.system(make_command)
        if t!=0:
            raise RuntimeError("Unable to do make install")

    def Run(self):
        # Preparations
        self.environment_file_=self.CreateEnvironmentFile()
        # Create directories
        self.CreateDirectory( self.GetCommonDownloadDirectory() )
        self.CreateDirectory( self.GetCommonSourcesDirectory() )
        self.CreateDirectory( self.GetCommonInstallDirectory() )
        # Step 1 - Install dependencies
        self.InstallDependencies()
        name=self.GetName()
        # Go to the install directory and check whether the library is already installed
        install_dir=self.GetPackageInstallDirectory()
        self.CreateDirectory( install_dir )
        if self.IsInstalled():
            print name,"is already installed"
            return
        print name,"is not installed"
        # Step 2 - Download package
        self.Download()
        self.ExtractSource()
        # Step 3 - Apply patch if needed
        self.ApplyPatch( self.GetPackageSourceDirectory(),
                         self.GetPatchDirectory() )
        self.CreateDirectory( self.GetPackageBuildDirectory() )
        # Step 4 - Build configuration
        self.ConfigureBuild()
        # Step 5 - Compile
        try:
            self.Compile()
        except:
            print "Compile failed, try again..."
            self.Compile()
        # Step 6 - Install
        self.Install()
Let us have a closer look at step 4: configure. Configuration is done by build systems, which create a makefile to compile the library itself and its dependencies. Different libraries use different build systems. To name a few: Qt and gcc use gnuconf, while CMake and MySQL use CMake on Linux platforms. To distinguish between build systems, we have created two derived classes: ExternalLibraryCMake() and ExternalLibraryGNUConf(). The only difference between the two is the Configure() function.
The second script is a list of the libraries you need, with their specific parameters. For example, let’s consider installing CMake 3.4.3. Of course, we have to know:
Then the Python script will look like this:
#!/usr/bin/python
from ExternalLibrary import *
import os

class ExternalLibraryCMake_3_4_3(ExternalLibrary):
    def GetName(self):
        return "CMake343"

    def GetArchiveURL(self):
        if self.LocalDownload():
            url="http://10.0.0.17/Downloads/cmake-3.4.3.tar.gz"
        else:
            url="https://cmake.org/files/v3.4/cmake-3.4.3.tar.gz"
        return url

    def GetPackageSourceDirectory(self):
        source_dir=os.path.join( self.GetUnpackDirectory(),"cmake-3.4.3" )
        print "source_dir=",source_dir
        return source_dir

    def GetFileForTestingInstallation(self):
        return os.path.join( self.GetPackageInstallDirectory(), "bin", "cmake" )

    def ConfigureBuild(self):
        dir_cmd="cd {0}".format( self.GetPackageSourceDirectory() )
        print "dir_cmd=",dir_cmd
        configure_command="./configure --prefix={0}".format(
            self.GetPackageInstallDirectory() )
        cmd="{0} && {1} && {2}".format( self.GetEnvironmentCommand(),
                                        dir_cmd, configure_command )
        t=os.system(cmd)
        if t!=0:
            raise RuntimeError("Unable to configure")
As you can see, there are some checks to avoid doing the same operation twice, e.g. when the package has already been downloaded or extracted.
A recursive function can easily produce the dependency graph. This graph is a visual check that all dependencies are correctly set-up:
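Because Run() already walks the dependencies recursively, the same walk can print the graph instead of installing. A small self-contained sketch (Python 3, with a made-up dependency table rather than the real classes):

```python
# Hypothetical dependency table; in the real scripts this information
# comes from each class's GetDependencies().
deps = {
    "Qt560":    ["CMake343", "OpenSSL"],
    "CMake343": [],
    "OpenSSL":  [],
    "MyApp":    ["Qt560", "CMake343"],
}

def tree(name, indent=0, seen=None):
    """Walk the same recursion as Run(), collecting lines instead of installing."""
    seen = set() if seen is None else seen
    lines = ["  " * indent + name]
    if name not in seen:              # like the IsInstalled() check: visit once
        seen.add(name)
        for dep in deps[name]:
            lines += tree(dep, indent + 1, seen)
    return lines

print("\n".join(tree("MyApp")))
```

Running this prints an indented tree of MyApp's dependencies, which is exactly the visual check described above.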
A package update can easily be added with another small class that derives from the earlier version of the package; only a couple of functions need to be overridden. For example, once a more recent version became available, to install Qt 5.7.0 we added a simple derived class ExternalLibraryQt570(ExternalLibraryQt560) based on the previous version, Qt 5.6.0. The code looks like this:
class ExternalLibraryQt570(ExternalLibraryQt560):
    def GetName(self):
        return "Qt570"

    def GetArchiveURL(self):
        if self.LocalDownload():
            url="http://10.0.0.17/Downloads/qt-everywhere-opensource-src-5.7.0.tar.gz"
        else:
            url="http://download.qt.io/official_releases/qt/5.7/5.7.0/single/qt-everywhere-opensource-src-5.7.0.tar.gz"
        return url

    def GetPackageSourceDirectory(self):
        return os.path.join( self.GetUnpackDirectory(),
                             "qt-everywhere-opensource-src-5.7.0" )
A side effect of using a Python script for automatic installation of 3rd-party libraries is parallel compilation. At runtime, the script can check how many processors are available on your computer, then automatically launch a parallel compilation and utilise all the CPU power you’ve got.
Let us summarise the advantages of using Python scripts to automate the installation of 3rd-party libraries and solve dependency hell:
To quote Elsa from the Disney movie Frozen: “Recursivity never bothered me anyway”
How do you update 3rd-party libraries?
Let us know in the comment box below if you would like to have the code in GitHub.
Get EZNumeric’s future articles in your inbox:
Heterogeneous Map-Reduce: Scientific Visualisation (Part 3/3)

In the previous posts we introduced the heterogeneous map-reduce framework and applied it to a seismic imaging problem. Here, we will use the heterogeneous map-reduce framework for scientific visualisation.
Scientific visualisation is a way to showcase numerical modelling or physical simulations with computer graphics. Why is it important? Because using scientific visualisation we can observe three-dimensional structures and processes in (close to) real time and zoom in or out as we wish. It also makes it possible to produce cool movies like this one using rendering software (a ray-tracer):
Rendering is the automatic process of generating images (also called frames) from the 2D or 3D models that build up a scene. Basically, the scene is what is viewed by the camera, and it includes objects, lighting, shading, viewpoint, etc.
To make a simulation movie, we need to import the simulation results or objects into the scene. Afterwards, we need to render the scenes and produce frames/images. Finally, the images are collected into a movie.
Paraview makes it possible to visualise scientific data, but it renders the images with OpenGL. This means that an image can be rendered very quickly, but at the cost of quality. A ray-tracer, on the other hand, is slower but produces better quality images (even photo-realistic quality).
Level 1: Since a movie consists of frames (separate images that compose a moving picture), the highest level of parallelism is over the frames.
Level 2: The second level of parallelism is either the internal multi-threading or multi-GPU implementation of the rendering software. In our case, we can use either the CPUs or GPUs to render the image, so 2 instances of the ray-tracer can run on one machine.
The idea is to render each individual image or frame in parallel which makes it easy to assign one task to one frame. By using multiple threads for rendering, an image can be processed in parallel within a task.
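The mapping “one frame = one task” is easy to mimic with a plain worker pool. This sketch (Python 3) is not the actual task system, just the idea; `render_frame` is a placeholder for importing the objects and invoking the ray-tracer:

```python
from concurrent.futures import ThreadPoolExecutor

FRAMES = range(360)  # e.g. one frame per degree of camera rotation

def render_frame(frame):
    """Placeholder for 'import the objects into the scene, then render'.
    In the real setup this would launch the external ray-tracer process."""
    return "frame_%04d.png" % frame

# One task per frame; the pool hands frames to idle workers (auto load-balancing).
# Threads suffice here because the heavy work would run in external processes.
with ThreadPoolExecutor(max_workers=8) as pool:
    images = list(pool.map(render_frame, FRAMES))

print(images[0], images[-1])  # frame_0000.png frame_0359.png
```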
It sounds pretty easy: one task per frame. The complexity comes with monitoring the tasks in real time. Ideally, we would like to know what each compute node is doing at each moment in time, and be able to add or remove compute nodes after the computations have started.
For this we have developed a web server for the task system. The web server allows monitoring, as well as interacting with the server. Through the web-interface, we can check progress, and other indicators such as memory usage or CPU load.
In the figures below you can see snapshots of the web server during scientific visualisation: the list of frames to be rendered in the available queue, the list of running tasks, statistics, settings, and system information such as CPU load and memory usage in real time.
Firstly, to make a movie we need to import the objects computed by a numerical simulation into the scene. In the previous example of the heterogeneous map-reduce (Part 2/3) we described seismic migration; there, the objects are images from seismic migration. For each frame, a Python script reads and imports the objects of interest into the ray-tracer and creates the scene.
Secondly, the scene has to be rendered.
We observed that most of the time was being spent importing the objects. The total time per frame was too high, so the work had to be done in a parallel, or distributed, way. Therefore, we use the task system for importing the objects as well as for rendering the scenes.
Here is an example of using the task system for scientific visualisation from Hans. During the final stage of his PhD work at TU Delft, he implemented several seismic migration algorithms in parallel (on CPUs and GPUs). He thought it would be nice to show in a movie how the migration images from each shot form the final migration image (see more details here).
The Little Green Machine (the smallest Dutch supercomputer, built from several CPU nodes each connected to two GPUs) was perfect for that, as he could use the power of the GPUs for rendering. Many thanks to his supervisors Prof. Kees Vuik and Prof. Kees Oosterlee for giving access to the Little Green Machine.
In this series we introduced the heterogeneous map-reduce approach as a universal parallel framework. A very important tool within this framework is the task system, which allows us to split the work amongst compute nodes and monitor the execution.
We showed in the previous post how to use the task system in a seismic imaging application to do seismic migration in a parallel, distributed way. In this post we have shown how to use the task system in scientific visualisation for importing and rendering the images.
What tools do you use for scientific visualisation?
Get EZNumeric’s future articles in your inbox:
Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3)

In the previous part we described the heterogeneous map-reduce framework. Here, we start with an example from a seismic imaging application.
The oil and gas industry makes use of computationally intensive algorithms such as reverse-time migration and full waveform inversion to provide an image of the subsurface. The image is obtained by sending wave energy into the subsurface and recording the signal of the seismic waves reflected back to the surface from interfaces between layers with different physical properties. A seismic wave is usually generated by shots at known frequencies, placed close to the surface on land or to the water surface at sea. Returning waves are usually recorded in time by hydrophones in the marine environment or by geophones during land acquisition. The goal of seismic imaging is to transform the seismograms into a spatial image of the subsurface.
Migration algorithms produce an image of the subsurface given seismic data measured at the surface. In particular, pre-stack depth migration produces the depth locations of reflectors by mapping seismic data from the time domain to the depth domain, assuming that a sufficiently accurate velocity model is available. The classic imaging principle is based on the correlation of the forward propagated wavefield from a source and a backward propagated wavefield from the receivers. In the frequency domain the image is calculated as follows:

    I(x) = Σ_ω W_s(x, ω) · conj( W_r(x, ω) ),

where W_s and W_r denote the wavefields propagated from the source and from the receivers, respectively, and ω denotes the frequency. That means for every shot and every frequency we need to simulate the wave propagation twice.
Level 1: The highest level of parallelization for frequency-domain migration is over the shots. Each shot is treated independently. We assume that the migration volume for one shot is computed on one compute node connected to zero, one or more GPUs.
Level 2: The next level of parallelism involves the frequencies. For each frequency, a linear system of equations needs to be solved.
Level 3: The third level of parallelism includes matrix decomposition, where the matrix for the linear system of equations is decomposed into subsets of rows that fit on a single GPU.
Level 4: The last level of parallelism for migration in frequency domain is parallelization of matrix-vector multiplications (MVMs) and vector-vector operations.
Here we describe how the task system works for the migration in frequency domain.
Mostly we will use the first level of parallelism. The server, or ‘master node’, creates one task per source. Each task is added to an “Available” queue. When a client requests a task, the task is moved from the queue to the “Running” list.
As we saw earlier, the migration algorithm consists of forward and backward modelling in the frequency domain for each source. Therefore, the second level of parallelism for migration consists of parallelization over all frequencies for each source.
Let’s assume that we have N_s sources and N_f frequencies. Then one task consists of the computations for one frequency ω_j for a given source s_i, with i = 1, …, N_s and j = 1, …, N_f. In total, we have N_s × N_f tasks.
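Enumerating the tasks is then a simple cross product over sources and frequencies. A tiny sketch with hypothetical counts:

```python
N_SOURCES = 4       # hypothetical N_s
N_FREQUENCIES = 3   # hypothetical N_f

# One task per (source, frequency) pair -> N_s * N_f tasks in total.
tasks = [(s, f) for s in range(N_SOURCES) for f in range(N_FREQUENCIES)]

print(len(tasks))   # 12
```

Each of these pairs goes into the "Available" queue and is handed out to whichever compute node asks next.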
For each frequency, a linear system of equations needs to be solved. The matrix size and memory requirements are the same for each frequency, but the lower frequencies require less compute time than the higher ones. Here, we assume that one frequency for one source in the frequency domain fits on one compute node. At this point, the automatic load balancing of the tasks comes into play: there is no need to know beforehand how to distribute the shots over the compute nodes.
If a compute node is connected to one or more GPUs, we can make use of the third level of parallelism and decompose the matrix across GPU(s). However, this is done within a task.
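A sketch of the row decomposition, with plain NumPy arrays standing in for GPU device buffers (function names are ours):

```python
import numpy as np

def split_rows(A, n_gpus):
    """Decompose matrix A into blocks of consecutive rows, one per GPU.

    In the real solver each block would be copied to a different GPU;
    here host arrays stand in for device memory.
    """
    return np.array_split(A, n_gpus, axis=0)

def distributed_matvec(blocks, x):
    # Each GPU multiplies its row block with the (replicated) vector;
    # concatenating the partial results gives the full product.
    return np.concatenate([B @ x for B in blocks])

A = np.arange(20.0).reshape(5, 4)
x = np.ones(4)
blocks = split_rows(A, n_gpus=2)
y = distributed_matvec(blocks, x)
```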
The movie starts with the velocity model. We run migration on the SEG/EAGE Overthrust model, which represents an acoustic constant-density medium with complex, layered structures and faults. Then the camera rotates to show the migrated image, which initially appears empty. The migrated data from each shot is added and shown in the movie as it is received by the server (the ‘Reduce’ part of the algorithm). We can see how the layers become visible. At the end we see the final image obtained using migration in the frequency domain.
The volume has a size of 1000×1000×620 m³. The problem is discretized on a grid with 301×301×187 points and a spacing of 3.33 m in each coordinate direction.
The discretization for migration in the frequency domain is 2nd-order in space. A Ricker wavelet with a peak frequency of 15 Hz is chosen for the source, and the maximum frequency in this experiment is 30 Hz. Note that by reducing the maximum frequency, we can increase the grid spacing. For instance, by choosing a maximum frequency of 8 Hz, the grid spacing can be chosen as 25 m in each direction. The line of sources is located at a depth of 10 m and is equally spaced with an interval of 18.367 m in the x-direction.
The receivers are equally distributed in the two horizontal directions with the same spacing as the sources, at a depth of 20 m. The sampling interval for the modelled seismic data is 4 ms. The maximum simulation time is 0.5 s. For migration in the frequency domain we chose a frequency interval of 2 Hz.
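The frequency/spacing trade-off mentioned above comes from keeping a fixed number of grid points per minimum wavelength. A back-of-the-envelope sketch, with an illustrative minimum velocity and points-per-wavelength count (not the values of the experiment):

```python
def grid_spacing(v_min, f_max, points_per_wavelength):
    """Largest grid spacing h that still resolves the shortest wavelength.

    The minimum wavelength is v_min / f_max; dividing it by the desired
    number of points per wavelength gives the spacing.
    """
    return v_min / (f_max * points_per_wavelength)

# Illustrative values only: v_min = 1500 m/s, 10 points per wavelength.
h_30hz = grid_spacing(1500.0, 30.0, 10)  # 5.0 m
h_8hz = grid_spacing(1500.0, 8.0, 10)    # 18.75 m
# Halving the maximum frequency doubles the admissible spacing.
```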
In this post we showed how to use the heterogeneous map-reduce framework and the task system on an application from seismic imaging: migration in the frequency domain.
In the next post of this series we are going to have a look at scientific visualisation.
Get EZNumeric’s future articles in your inbox:
The post Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3) appeared first on EZNumeric.
]]>The post Heterogeneous Map-Reduce: meet the Task System (Part 1/3) appeared first on EZNumeric.
]]>Did it ever happen to you: you just finished a parallel implementation of a project and are happily enjoying the speedups, until the next project arrives, where you have to use completely different hardware and start over again with a new parallel framework? It happened to me several times! And to Hans as well! Until he thought of a heterogeneous map-reduce approach that, once implemented, can easily be adjusted to different hardware architectures (CPUs, GPUs, Intel’s Knights Hill, you name it).
The idea of the heterogeneous map-reduce approach is a universal parallel framework. It assumes commodity hardware with several types of processors with different capabilities and performance (for example a cluster where some computers have one GPU, others two GPUs or none), hence the name “heterogeneous”. The performance can also be affected by the number of users sharing a node.
The heterogeneous map-reduce approach is actually a way to distribute the computations across available hardware to achieve a good load balancing. After all, you would like to make use of the available compute resources.
The “map-reduce” component comes from the setup of the task system, which runs computations in parallel.
There is no need to know explicitly how to distribute the work beforehand.
The task system allows us to split the work among compute nodes and monitor the execution. By a compute node we mean a multi-core CPU that might be connected to one or more GPUs, where the GPUs can be used either as a replacement or as an accelerator.
The common approach for parallelization across multiple CPU nodes in a cluster is the so-called server-to-client approach. The server is aware of the number of nodes and the amount of work involved, and distributes the workload equally among the nodes. This approach is efficient on clusters with homogeneous nodes, as all CPU nodes have the same characteristics.
Since we are targeting heterogeneous clusters, we propose a client-to-server approach, where clients request tasks from the server.
The philosophy behind a task system is that a compute node will ask for a task, process it, and ask for another task as soon as the previous one has finished. A compute node is “busy” 100% of the time, regardless of its performance. The work is spread dynamically according to the speed of the compute nodes and load balancing is automatically achieved. Moreover, the task system gives the flexibility to launch new child-processes if one or more compute nodes crash or hang.
Each task is added to an “Available” queue. When a client (one compute node) requests a task, that task is moved from the “Available” queue to the “Running” list. When the task is done, it is removed from that list.
It can happen that a node crashes due to a hardware failure. In that case, the tasks in the “Running” list are ranked by starting time, and eventually the task with the earliest starting time is reassigned to a new compute node. It may happen that two compute nodes end up working on the same task; in that case, the first compute node that delivers the output “wins”.
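The bookkeeping described above can be sketched in a few lines of Python (class and method names are ours; a real server would talk to clients over the network):

```python
import time
from collections import deque

class TaskSystem:
    """Server-side bookkeeping: an 'Available' queue and a 'Running' list."""

    def __init__(self, tasks):
        self.available = deque(tasks)
        self.running = {}  # task -> start time

    def request_task(self):
        """Called by a client that became idle."""
        if self.available:
            task = self.available.popleft()
        elif self.running:
            # All tasks handed out: re-issue the one with the earliest
            # start time in case its node crashed; first result wins.
            task = min(self.running, key=self.running.get)
        else:
            return None  # everything is done
        self.running[task] = time.monotonic()
        return task

    def task_done(self, task):
        # A duplicate result for an already-finished task is ignored.
        self.running.pop(task, None)

server = TaskSystem(["shot-0", "shot-1", "shot-2"])
while (t := server.request_task()) is not None:
    server.task_done(t)  # a real client would run the migration here
```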
We have chosen the client-to-server approach over MPI for two reasons:
You can still use MPI to spawn the child processes.
Depending on your algorithm, you could use GPU(s) at the program or the loop level.
In this post we have defined the heterogeneous map-reduce approach and the task system, which allow work to be spread dynamically across your computing system and achieve automatic load balancing.
The next step is to demonstrate it on some applications from different fields. One example we will take from seismic imaging and another one from scientific visualisation. We will cover these in our next posts.
How many levels of parallelism do you use in your application?
The post Heterogeneous Map-Reduce: meet the Task System (Part 1/3) appeared first on EZNumeric.
]]>The post 2 ways of using a GPU to speedup computations appeared first on EZNumeric.
]]>Did you know that you can use a GPU in different ways to speed up your computations? Let’s have a closer look.
High-performance computer architectures are developing quickly by having more and faster cores in the CPUs (Central Processing Units) or GPUs (Graphics Processing Units). Recently, a new generation of GPUs appeared, offering teraflop performance on a single card.
The GPU and CPU architectures have their own advantages and disadvantages.
CPUs are optimized for sequential performance: they are good at instruction-level parallelism, pipelining, etc., and a powerful hierarchical cache and scheduling mechanism keep single-thread execution fast.
In contrast, GPUs are designed for high instruction throughput, with much weaker caching and scheduling abilities. In GPU programming, the user has to spend more time ensuring good scheduling, load balancing and memory access, work that is done automatically on a CPU. As a result, GPU kernels are typically simple and computationally intensive.
The GPU was originally designed to accelerate the manipulation of images in a frame buffer that was mapped to an output display. GPUs were used as part of a so-called graphics pipeline, meaning that the graphics data was sent through a sequence of stages implemented as a combination of CPU software and GPU hardware. Nowadays GPUs are more and more used as GPGPUs (General-Purpose GPUs) to speed up computations.
A GPU can be used in two different ways:
In the first case, the algorithm is split into a number of independent sub-problems that are transferred to the GPU and computed separately (with little or no communication). To achieve the best performance, the data is kept on the GPU whenever possible. As GPUs generally have much less memory available than CPUs, this significantly limits the size of the problem.
In the second case, the GPU is considered as an accelerator, which means that the problem is solved on the CPU while some computationally intensive parts of the algorithm are off-loaded to the GPU. Here, the data is transferred to and from the GPU for each new task.
Let’s take the wave equation as an example. The wave equation can be formulated in the time or the frequency domain. The wave equation in the time domain is usually solved with a time-stepping scheme, which does not require solving a linear system of equations. The wave equation in the frequency domain (the Helmholtz equation), in contrast, is solved with an iterative method that requires solving a linear system of equations in matrix form.
The simplicity of time-stepping algorithms makes it easy to use GPUs of modest size as accelerators to speed up the computations.
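As an illustration of why time stepping is GPU-friendly, here is an explicit scheme for the 1D wave equation: each step is a pure stencil operation on arrays, with no linear system to solve (a textbook sketch, not production code):

```python
import numpy as np

def step_wave_1d(u_prev, u_curr, c, dt, dx):
    """One explicit time step of the 1D wave equation u_tt = c^2 u_xx.

    Only stencil operations on arrays are needed, no linear solve,
    which is what makes this scheme easy to off-load to a GPU.
    """
    u_next = np.zeros_like(u_curr)
    lap = u_curr[2:] - 2 * u_curr[1:-1] + u_curr[:-2]
    u_next[1:-1] = 2 * u_curr[1:-1] - u_prev[1:-1] + (c * dt / dx) ** 2 * lap
    return u_next  # fixed (zero) boundaries

# Toy run: CFL number c*dt/dx = 0.5 keeps the scheme stable.
nx, c, dx = 101, 1.0, 0.01
dt = 0.5 * dx / c
u_prev = np.exp(-((np.linspace(0, 1, nx) - 0.5) / 0.05) ** 2)  # initial pulse
u_curr = u_prev.copy()
for _ in range(100):
    u_prev, u_curr = u_curr, step_wave_1d(u_prev, u_curr, c, dt, dx)
```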
However, it is not trivial to use GPUs as accelerators for iterative methods that require the solution of a linear system of equations. The main reason is that most iterative methods consist of matrix-vector and vector-vector operations (e.g. matrix-vector multiplication). When using the GPU as an accelerator, the matrices need to be distributed across the GPUs, while the vectors “live” on the CPU and are transferred to the relevant GPU whenever a matrix-vector multiplication has to be executed.
Ideally, GPUs would be used as a replacement, but the limited memory makes this difficult for large numerical problems. There seems to be a trend of CPUs and GPUs merging, so that the same memory can be accessed equally fast from either. In that case the question “accelerator or replacement?” becomes irrelevant, as one can alternate between both kinds of hardware without taking the data location into account.
How do you use a GPU: as a replacement or as an accelerator? Let us know in the comment box.
Wave propagation simulation in 2D
Parallel wave modelling on a 8-GPU monster machine
The post 2 ways of using a GPU to speedup computations appeared first on EZNumeric.
]]>The post Parallel wave modelling on a 8-GPU monster machine appeared first on EZNumeric.
]]>For our numerical experiments NVIDIA provided a Westmere-based 12-core machine connected to eight Tesla 2050 GPUs, as shown in the figure above. The 12-core machine has 48 GB of RAM. Each socket holds six Intel Xeon X5670 CPU cores running at 2.93 GHz and is connected through two PCIe buses to four graphics cards. Note that two GPUs share one PCIe bus connected to a socket. Each GPU consists of 448 cores with a clock rate of 1.5 GHz and has 3 GB of memory.
There are several approaches to deal with multi-GPU:
We have chosen the data-parallel approach and a split of the algorithm, and made a comparison between multi-core and multi-GPU implementations. We leave out the domain-decomposition approach because the convergence of the Helmholtz solver is then not guaranteed.
You can do computations on multi-GPU by either:
For our purposes we have chosen the second option, since it was easier to understand and implement.
Implementation on multi-GPUs requires careful consideration of possibilities and optimization options. The issues we encountered during our work are listed below:
The Helmholtz equation represents the time-harmonic wave propagation in the frequency domain and has applications in many fields of science and technology, e.g. in aeronautics, marine technology, geophysics, and optical problems. In particular we consider the Helmholtz equation discretized by a second order finite difference scheme.
As a solver for the Helmholtz equation (wave equation in frequency domain) we use Bi-CGSTAB with a shifted Laplacian multigrid preconditioner.
Since Bi-CGSTAB is a collection of vector additions, dot products and matrix-vector multiplications, the multi-GPU version of Bi-CGSTAB is straightforward. We have seen that the speedup on multiple GPUs is smaller than on a single GPU due to the data transfer between the CPU and the GPUs. However, it is possible to compute a problem of realistic size on multiple GPUs, and the computation is still many times faster than on the 12-core Westmere.
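To see why, here is an unpreconditioned textbook Bi-CGSTAB in NumPy: every line is a matrix-vector product, a dot product or a vector update, each of which maps directly onto GPU kernels (a generic sketch, not the solver used in the experiments):

```python
import numpy as np

def bicgstab(A, b, tol=1e-8, max_iter=200):
    """Unpreconditioned Bi-CGSTAB (van der Vorst, 1992)."""
    x = np.zeros_like(b)
    r = b - A @ x
    r_hat = r.copy()
    rho = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    for _ in range(max_iter):
        rho_new = r_hat @ r                    # dot product
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)         # vector update
        v = A @ p                              # matrix-vector product
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v
        if np.linalg.norm(s) < tol:            # converged at the half step
            x = x + alpha * p
            break
        t = A @ s                              # matrix-vector product
        omega = (t @ s) / (t @ t)
        x = x + alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) < tol:
            break
        rho = rho_new
    return x

# Small nonsymmetric, diagonally dominant test system.
A = np.array([[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
x = bicgstab(A, b)
```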
The shifted Laplace preconditioner consists of a coarse grid correction based on the Galerkin method with matrix-dependent prolongation, and Gauss-Seidel as a smoother. The implementation of the coarse grid correction on multiple GPUs is straightforward, since its main ingredient is matrix-vector multiplication. The coarse grid matrices are constructed on the CPU and then transferred to the GPUs.
The Gauss-Seidel smoother on multiple GPUs requires an adaptation of the algorithm. We use eight-colour Gauss-Seidel, since the Helmholtz equation is given in three dimensions and the computations at each discretization point should be independent of the neighbours to allow parallelism.
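A sketch of one eight-colour sweep for the 3D 7-point stencil (pure Python on a tiny Poisson problem; the real smoother applies the same colouring in GPU kernels to the Helmholtz operator):

```python
import numpy as np
from itertools import product

def gauss_seidel_8colour(u, f, h):
    """One eight-colour Gauss-Seidel sweep for -laplace(u) = f with zero
    Dirichlet boundaries on a cubic grid with spacing h.

    Points whose index parities (i%2, j%2, k%2) coincide share a colour
    and have no stencil neighbours in common, so each colour can be
    updated fully in parallel, the property the GPU smoother exploits.
    """
    n = u.shape[0]
    for colour in product((0, 1), repeat=3):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                for k in range(1, n - 1):
                    if (i % 2, j % 2, k % 2) == colour:
                        u[i, j, k] = (u[i-1, j, k] + u[i+1, j, k] +
                                      u[i, j-1, k] + u[i, j+1, k] +
                                      u[i, j, k-1] + u[i, j, k+1] +
                                      h * h * f[i, j, k]) / 6.0
    return u

n, h = 9, 1.0 / 8
f = np.ones((n, n, n))
u = np.zeros((n, n, n))
for _ in range(20):
    gauss_seidel_8colour(u, f, h)
```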
We implemented the three-dimensional Helmholtz wave equation on multiple GPUs. To keep double-precision convergence, the solver (the Bi-CGSTAB method) is implemented on the GPU in double precision and the preconditioner in single precision.
Two multi-GPU approaches have been considered: a data-parallel approach and a split of the algorithm.
For the data-parallel approach, we were able to solve larger problems than on one GPU and achieved better performance than the multi-threaded CPU implementation. However, due to the communication between the GPUs and the CPU, the resulting speedups were considerably smaller than for the single-GPU implementation.
To minimize the communication while still being able to solve large problems, we introduced a split of the algorithm. In this case the speedup of the multi-GPU implementation over the multi-core implementation is similar to that of a single GPU.
See the full comparison of the multi-GPU implementation with a single-GPU and a multi-threaded CPU implementation on a realistic problem size here, or request a copy by filling in the contact form or by sending us an email.
We would like to thank again NVIDIA Corporation (in particular François Courteille) for access to their many-core multi-GPU architecture.
This work has been presented during the following conferences:
The post Parallel wave modelling on a 8-GPU monster machine appeared first on EZNumeric.
]]>The post Epidemic Simulation: Will You Survive a Virus Outbreak in the Netherlands? appeared first on EZNumeric.
]]>It started at the gym, talking to Rory de Vries, who is a Taekwondo teacher and holds a PhD in virology. Rory works at the Erasmus MC doing research on vaccines for the influenza virus. With Rory’s feedback we devised a simulation model that should be more realistic than existing models.
The first thing we did was download the data from OpenStreetMap. Basically, it is a list of nodes, ways and relations. We downloaded the map of the Netherlands (about 1 GB; the whole world is about 30 GB). Then the list of nodes was decimated to about 15 million nodes (roughly the population of the Netherlands). The assumption is that the node density of the map matches the population density, meaning that areas with a high population also have a high node density.
We start with the discretization of the map. For this we define mobility as the average distance that a person travels per day. This is one of the parameters that you can easily modify in the user interface; as an example and default value we took 10 km. The mobility defines the size of the discretization cell: the larger the mobility, the larger the cell size. The cell size is calculated as the product of the mobility and the time step, which is set by the user.
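In code the cell size is just a product (the time-step value below is an illustrative choice, not the simulation’s default):

```python
def cell_size(mobility_km_per_day, dt_days):
    """Discretization cell size: the product of mobility and time step."""
    return mobility_km_per_day * dt_days

# Default mobility of 10 km/day with, say, a half-day time step:
size = cell_size(10.0, 0.5)  # cell size in km
```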
The next step is to define a contact rate, which we denote by C: the average number of contacts a person has during the infectious period. It is assumed to be constant over the population.
Another parameter is the probability P that this person infects its contacts. For example, in the case of influenza, one person infects on average 3 other persons. Here we assume a uniform probability over the population.
Now that we have defined the global parameters for the epidemic simulation, let us have a look at the infection stages per person. We distinguish the following stages:
Actually, we combined some of the infection stages compared to the medical definition (here), to simplify the simulation and notation.
Basically, the disease development can be described by a function f(t), where f can be modified for a specific virus and the time t is expressed in days.
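A sketch of such a stage function (the stage boundaries below are illustrative defaults, not values from the simulation):

```python
def infection_stage(t, incubation=2.0, infectious=5.0):
    """Infection stage of one person, t days after exposure.

    The stage durations (2 days incubation, 5 days infectious) are
    illustrative defaults, not values from the simulation.
    """
    if t < 0:
        return "susceptible"
    if t < incubation:
        return "incubating"    # infected, not yet contagious
    if t < incubation + infectious:
        return "infectious"    # can infect C contacts with probability P
    return "recovered"         # immune from here on
```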
At the beginning of the epidemic simulation we see a map of the population distribution in the Netherlands (people are represented by white dots). On the right-hand side you see a panel with the simulation time, a transparency slider, plot options and the global parameters described above. The plot options are infection stage (blue means the infectious period, red means recovered people who are immune to the disease) or sickness (the front of the spreading disease). The disease outbreak starts in The Hague. By clicking on the map we added another disease source in Groningen.
The Netherlands is a very densely populated country. Therefore, it is almost impossible to avoid an epidemic, as we have also seen in the movie above. The point of this post is that we can model almost anything; whether it reflects the truth, history will tell.
If you liked this article, let’s connect on LinkedIn or Facebook, or subscribe to the bi-weekly EZNumeric newsletter.
The post Epidemic Simulation: Will You Survive a Virus Outbreak in the Netherlands? appeared first on EZNumeric.
]]>