CPU, graphics cards ,FPGA, multi pics/audino speed

Andy, Mon Nov 25 2013, 06:40AM

Hi
I'm writing a program and would like to find the fastest platform for it. The code has if statements and a lot of branching, not many for loops, but a lot of basic blocks that are the same.

I was thinking of using OpenMP on the CPU, as I don't think the graphics card would speed up the code. Is an FPGA good, or do the branches slow it down?
What about 1000 PICs or Arduinos?

Thanks
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Carbon_Rod, Mon Nov 25 2013, 07:27AM

For some small platforms, people will dedicate FPGA space to kernel modules that solve specific problems efficiently in parallel.

OpenCL is part of the nVidia SDK, but whether it runs "faster" depends on the problem. A 24+ core multi-CPU machine will churn through data more quickly, given that it doesn't need to copy the data into GPU memory space. However, the same machine cannot match 400+ dedicated 1.8 GHz GPU vector cells running in parallel.
OpenMP is fine when the problem can be broken apart, and doesn't need set locality in the cluster host partition slice. But in other situations... it can run "slower" than even a single core.

Note it takes far less time to learn Intel's Threading Building Blocks libraries and proper algorithm design...
A good compiler will usually in-line small functions to exploit pipelining.

The "Cloud" demand has shifted technologies into a new class of design problem.
This code example is very helpful in learning about these new paradigms: Link2

Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Tue Nov 26 2013, 01:35AM

I don't mind purchasing the hardware(would like to), but don't want to send it into the cloud.
What info do you need?
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Carbon_Rod, Tue Nov 26 2013, 09:06AM

Describe big Theta, start here:
Link2

Read: Link2
Boost: Link2
STL: Link2


Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Tue Nov 26 2013, 11:37PM

Thanks Carbon_Rod
Got rid of most of the if statements and replaced them with a lookup table. Have you got any information on server racks? Do you set them up like a desktop, or do you think a gaming rig would be a better option?
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Carbon_Rod, Wed Nov 27 2013, 06:59AM

A reliable rack server costs a bit more as they use special parity-checked DRAM to detect errors, reliable Intel CPUs, and have several management options most people never encounter. If you already have a SAN, then the incremental cost of adding cores is minimal. There are also special GPU modules for this type of server, but they're not really useful for the task of serving files.

Desktops have some advantages as they are inexpensive if purchased used, and have more space for random parts. The low-end $50 GeForce GTX 295 is a great deal thanks to Microsoft, as they no longer really work for modern games and the linux CUDA developer drivers are mature for these cards. Note, don't bother putting more than 1 GPU card in a machine, and use at least a “750” watt power supply.

Hosts with an older Intel quad run these cards just fine, as do the 24 core >i7 workstations...
Ignore modern sleaze-box labels, and use a benchmark cpu list when buying.
Link2
You will find a discrepancy between value and performance in modern retail outlets.

What problem are you trying to solve?
Link2
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Wed Nov 27 2013, 07:51AM

It's a chemical-finder type of program, and I have to brute-force a lot of combinations. I checked out a desktop motherboard with four sockets that can take 12-core AMD Opterons, but that plus RAM would set the price at 6 grand... so still dreaming. That would take 4 hours for 500 chemical combinations, and I'm hopefully looking at 100 million combinations...

With graphics cards, do you get lower performance with if statements and for loops? I can unroll some of the for loops. Last time I tried to write a kernel for a graphics card it wasn't much faster than the CPU. Could you post or link to a good reference on how to program them?

Thanks for your help
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Shrad, Wed Nov 27 2013, 08:14AM

have a look at blade servers, they provide multiple CPUs and RAM in a small form factor, much smaller than the desktop multi-CPU units
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Wed Nov 27 2013, 10:15AM

There was an 8-core, 1-socket machine for $1000. I still haven't ruled out graphics cards or PICs; the code could easily fit on a 32-bit PIC. Maybe have each PIC with a 2 GB SD card, with another layer for the second-stage processing. $1000 would give 1k cores, doable?

Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Shrad, Wed Nov 27 2013, 12:02PM

you would have to add some RAM or you will eat through read/write cycles of your SD card pretty quickly

maybe with playstations or something alike? I read somewhere you could install unix on some playstation 2 or 3 or xbox, I don't remember
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Thu Nov 28 2013, 12:19AM

Yep.
Is there any way to speed up code like this?

for(l1=0;l1<largenumber;l1++) {
    for(l2=0;l2<largenumber;l2++) {
        for(l3=0;l3<largenumber;l3++) {
            /* do stuff */
        }
    }
}

I've got four cores, but if I get more, there will be more for loops.

Thanks
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Carbon_Rod, Thu Nov 28 2013, 04:07AM

Not sure why you nested the loops instead of an abstraction for a dynamic programming solution.
And there is still an instantiation-like cost when going parallel.
Link2
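
The start-up cost mentioned above is one reason to parallelise only the outermost loop, so each thread is created once and then gets a large chunk of work. A minimal OpenMP sketch in C follows, assuming the loop body is independent from one iteration to the next. The names count_hits, values and target are hypothetical and simply stand in for "do stuff".

```c
#include <stddef.h>

/* Hypothetical stand-in for "do stuff": count index triples (l1, l2, l3)
 * whose values sum to target. Build with -fopenmp (GCC/Clang) to run the
 * outer loop in parallel; without the flag the pragma is ignored and the
 * code runs serially with the same result. */
long count_hits(const unsigned int *values, int n, unsigned int target)
{
    long hits = 0;

    /* Only the outermost loop is split across threads, so the thread
     * start-up cost is paid once. reduction(+:hits) gives every thread a
     * private counter and adds the counters together at the end. */
    #pragma omp parallel for reduction(+:hits) schedule(dynamic)
    for (int l1 = 0; l1 < n; l1++) {
        for (int l2 = 0; l2 < n; l2++) {
            for (int l3 = 0; l3 < n; l3++) {
                if (values[l1] + values[l2] + values[l3] == target)
                    hits++;
            }
        }
    }
    return hits;
}
```

If the real loop body writes to a shared file or shared variables, those writes would need their own protection, which can cancel out much of the gain.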

Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Thu Nov 28 2013, 06:49AM

Did you mean ?
for(l1=0;l1<largenumber;l1++) {
    for(l2=l1+1;l2<largenumber;l2++) {
        for(l3=l2+1;l3<largenumber;l3++) {
            /* do stuff */
        }
    }
}
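
Starting each inner loop one past the enclosing index visits every unordered triple exactly once, which is n(n-1)(n-2)/6 iterations instead of n^3. A small C sketch to check that arithmetic is below; count_triples and choose3 are illustrative names, not from the thread.

```c
/* Count the iterations of the triangular triple loop by brute force. */
unsigned long long count_triples(unsigned int n)
{
    unsigned long long count = 0;
    for (unsigned int l1 = 0; l1 < n; l1++)
        for (unsigned int l2 = l1 + 1; l2 < n; l2++)
            for (unsigned int l3 = l2 + 1; l3 < n; l3++)
                count++;
    return count;
}

/* Closed form for the same count: n choose 3 = n(n-1)(n-2)/6. */
unsigned long long choose3(unsigned int n)
{
    unsigned long long m = n;
    if (n < 3)
        return 0;
    return m * (m - 1) * (m - 2) / 6;
}
```

For a table of 85 entries both functions give 98770, compared with 85^3 = 614125 iterations when all three loops start at 0.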
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Wastrel, Thu Nov 28 2013, 03:29PM

GPUs have a very small amount of memory per computational unit and have big penalties for conditional and branched code. In the case of your loops, optimising depends mostly on what 'stuff' actually is.

It sounds like you are trying to bruteforce a problem instead of solving it. Can you tell us any more?
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Thu Nov 28 2013, 07:03PM

I'm not sure how to work out the numbers apart from brute force. The code below just checks whether the combinations, when mixed with water, can occupy the space.

Thanks


unsigned int atoms[85];

atoms[0] = 100000;
for(i=1;i<85;i++) {
    atoms[i]=atoms[0]/i;
}

H2O = atoms[7]+(atoms[0]*2);

for(l1=0;l1<largenumber;l1++) {
    for(l2=l1+1;l2<largenumber;l2++) {
        for(l3=l2+1;l3<largenumber;l3++) {

            test = atoms[l1]+atoms[l2]+atoms[l3];
            test = test + H2O;
            if(test == 100000 || test == 200000 || test == 300000) {
                fprintf(out,"%s%s%s",atomsname[l1],atomsname[l2],atomsname[l3]);
            }
            /* some other test based on the numbers */

        }
    }
}
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Wastrel, Fri Nov 29 2013, 03:01PM

The code is currently a bit broken and I'm assuming the actual table is more complicated for real atoms. It is important to get the method working first, then optimise the algorithm and lastly optimise the code.

In the case of this algorithm the first two tests fail automatically because they are less than the value for water on its own. The goal value, 300000, would best be represented by a variable set to 300000-H2O, which pushes code out of the inner loop (and out of all of them, but this is where it costs the most).
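
The hoisting described above can be sketched as follows. This is an illustration under assumptions, not the original program: count_matches is a hypothetical name, and the loop bounds follow the l2=l1+1, l3=l2+1 form used earlier in the thread.

```c
#include <stddef.h>

/* Count triples whose sum plus h2o equals goal. The subtraction goal - h2o
 * is done once, before the loops, instead of adding h2o to every sum. */
unsigned long count_matches(const unsigned int *atoms, size_t n,
                            unsigned int goal, unsigned int h2o)
{
    unsigned long matches = 0;

    /* If water alone already exceeds the goal, no combination can match. */
    if (h2o > goal)
        return 0;

    const unsigned int target = goal - h2o;   /* loop-invariant value */

    for (size_t l1 = 0; l1 < n; l1++)
        for (size_t l2 = l1 + 1; l2 < n; l2++)
            for (size_t l3 = l2 + 1; l3 < n; l3++)
                if (atoms[l1] + atoms[l2] + atoms[l3] == target)
                    matches++;
    return matches;
}
```

With the posted table, H2O works out to 214285, so the goals 100000 and 200000 are rejected by the early return before any loop runs.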

If the real values for atom[x] are a progressive sequence then the inner loop can be replaced by a binary search tree, which would speed up the code by about ten times. If the real values for atom[x] are not in a progressive sequence, then reorder them. The absolute representation of the atoms is not important to the code.
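
The binary-search idea can be sketched like this, assuming the values have already been sorted into ascending order and contain no duplicates. A sorted array with a hand-written binary search is used here in place of an explicit tree, which does the same job for a fixed table. The names find_from and count_matches_sorted are hypothetical.

```c
#include <stddef.h>

/* Binary search in sorted[lo..n-1] for key. Returns the index of a match,
 * or n if the key is not present. */
size_t find_from(const unsigned int *sorted, size_t lo, size_t n,
                 unsigned int key)
{
    size_t hi = n;                      /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] < key)
            lo = mid + 1;
        else if (sorted[mid] > key)
            hi = mid;
        else
            return mid;
    }
    return n;
}

/* For each pair (l1, l2), look up the one value that would complete the
 * target instead of scanning every l3. Assumes sorted[] is ascending with
 * no duplicate values, so at most one l3 can match per pair. */
unsigned long count_matches_sorted(const unsigned int *sorted, size_t n,
                                   unsigned int target)
{
    unsigned long matches = 0;
    for (size_t l1 = 0; l1 < n; l1++) {
        for (size_t l2 = l1 + 1; l2 < n; l2++) {
            unsigned int pair = sorted[l1] + sorted[l2];
            if (pair > target)
                break;                  /* later l2 values are only larger */
            if (find_from(sorted, l2 + 1, n, target - pair) != n)
                matches++;
        }
    }
    return matches;
}
```

The inner scan of up to n entries becomes roughly log2(n) comparisons per pair, and the early break on sorted data skips pairs that are already too large.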
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Mon Dec 02 2013, 06:42AM

Thanks Wastrel
I have to run the brute-force code more than once, so I've saved the combinations to the HDD and will load them into RAM. What I understood from the binary search tree idea: split the combinations into 5 files, then select the part to brute-force with a check.

@Shrad
Checked out a blade server: 24 cores, 64 GB of RAM and 2 x 1 TB HDDs for $340 a month, Link2
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Shrad, Mon Dec 02 2013, 08:40AM

yup, that's the kind of machines you would use for pure CPU taskforce

you have to use a unix server distro to take full advantage of it through some paralleling libraries

if you have multiple less-capable scavenged servers you also have the opportunity to cluster them, there are specialized distros but that's another story
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Mon Dec 02 2013, 04:59PM

I installed zorin 64bit and have been running from that.

"you would have to add some RAM or you will eat through read/write cycles of your SD card pretty quickly"

I think I fried my second HDD. The program now comes up with "Bus error (core dumped)" :( What do you think could be wrong, and is it fixable?
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Shrad, Mon Dec 02 2013, 09:57PM

it occurs only with flash memory.. as it has a limited read/write count before a memory cell is dead, if you use some small files which are updated really fast in permanent use, you'll eat the cells pretty fast..

the same will occur with mechanical devices but it will be from mechanical use/abuse

the solution is to create a small ramdisk (something like 4 GB) and do all your file access from there, so you also gain in speed (by a factor of 100 sometimes)
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Fri Dec 06 2013, 08:31AM

Hi, I created a ramdisk, which sped it up from 1.16 min to 59 sec. I think the number of combinations might be too high: it's 10 chars, each from 0 to 0xff. I read somewhere that 10 chars for an MD5 hash is doable with the right setup; I think they use graphics cards. Below is some of the code. Apart from asm, I'm lost as to how to speed it up.

for(;hexq[8]<0x01;) {
    if(hexq[0] > 0xff) {
        hexq[0] = 0;
        hexq[1]++;
    }
    if(hexq[1] > 0xff) {
        for(i=1;i<=7;i++) {
            if(hexq[i] > 0xff) {
                hexq[i] = 0x00;
                hexq[i+1]++;    /* carry into the next digit */
                if(type1 < i) type1 = i;
            }
        }
    }


    for(s=0,eip=0,pie=0;s<1;s++) {
        if(hex[i] > 100 && hex[i] <= 200) goto ten;
        if(hex[i] > 200 && hex[i] <= 300) goto ten1;
        if(hex[i] > 300 && hex[i] <= 400) goto ten2;
        if(hex[i] > 400 && hex[i] <= 500) goto ten3;
        if(hex[i] > 500 && hex[i] <= 600) goto ten4;
        if(hex[i] > 600 && hex[i] <= 700) goto ten5;
        if(hex[i] > 700 && hex[i] <= 800) goto ten6;

        if(hex[i] == 0) {
            /* do something */
            /* fprintf(data) */
            goto bot;
        }
        if(hex[i] == 1) {
            /* do something */
            /* fprintf(data) */
            goto bot;
        }
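
The size of the search space described in this post (10 positions, each from 0 to 0xff) can be estimated with a few lines of C. The names search_space and years_at_rate are illustrative, and the rate of one billion candidates per second used in the note after the code is an assumed round figure, not a measurement.

```c
/* Number of combinations for `positions` bytes, each taking 256 values.
 * Every intermediate result is a power of two, so the double stays exact. */
double search_space(int positions)
{
    double combos = 1.0;
    for (int i = 0; i < positions; i++)
        combos *= 256.0;
    return combos;
}

/* Years needed to try `combos` candidates at `per_second` candidates/s,
 * using a 365-day year. */
double years_at_rate(double combos, double per_second)
{
    const double seconds_per_year = 365.0 * 24.0 * 3600.0;
    return combos / per_second / seconds_per_year;
}
```

For 10 positions this gives 2^80, about 1.2e24 combinations. At the assumed 1e9 candidates per second that is on the order of 38 million years, which suggests shrinking the problem matters far more than shaving time off each iteration.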


Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Carbon_Rod, Fri Dec 06 2013, 10:35AM

I recommended a dynamic programming abstraction given your algorithm has some overlapping subproblems (you add the same numbers many times in this example). However, you are skipping over some crucial steps by attempting to find more complex ways of permuting through Ω(k*n^3), rather than solving the actual problem.
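
One possible reading of the dynamic programming suggestion is the classic subset-sum table, which counts how many ways a fixed number of entries can reach a target sum without nesting one loop per entry. The sketch below is an assumption about what such an abstraction could look like, not the code Rod had in mind; count_subsets, picks and ways are hypothetical names.

```c
#include <stdlib.h>

/* Count the ways to pick exactly `picks` distinct entries of values[] that
 * sum to target. Returns 0 if the table cannot be allocated. */
unsigned long long count_subsets(const unsigned int *values, size_t n,
                                 unsigned int picks, unsigned int target)
{
    size_t width = (size_t)target + 1;
    /* ways[k * width + s] = number of ways to reach sum s with k entries */
    unsigned long long *ways = calloc(((size_t)picks + 1) * width, sizeof *ways);
    if (ways == NULL)
        return 0;
    ways[0] = 1;                        /* zero entries reach sum 0 in one way */

    for (size_t i = 0; i < n; i++) {
        unsigned int v = values[i];
        if (v > target)
            continue;
        /* k runs downwards so row k-1 still excludes entry i when it is
         * read, which keeps every entry from being used twice. */
        for (unsigned int k = picks; k >= 1; k--)
            for (size_t s = v; s <= target; s++)
                ways[k * width + s] += ways[(k - 1) * width + (s - v)];
    }

    unsigned long long result = ways[(size_t)picks * width + target];
    free(ways);
    return result;
}
```

The work grows with n * picks * target rather than with n raised to the number of loops, so it only pays off when more entries are combined or the target is small; for three entries from a short table the plain loops may well be faster.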

ASM can be faster, but in most cases it is not due to architectural and automatic code optimizations of modern C/C++ compilers.
Link2

Best of luck,
Rod
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Shrad, Fri Dec 06 2013, 06:58PM

replace < and > with a mask

a good thing would be to use a single binary mask and << or >> the mask or the number to crunch

it is easy to see whether a number is higher than a power of 2 with a mask, and to increment the value either by the mask or by the value using a shift operator
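
The mask idea can be sketched as below, assuming the range boundaries can be moved to powers of two (for example buckets of 128 instead of the 100-wide ranges in the posted code). The helpers at_least_pow2 and bucket_of are hypothetical.

```c
/* Nonzero if x >= 2^bit, tested with a mask instead of a comparison:
 * any set bit at position `bit` or above means x is at least 2^bit.
 * `bit` must be smaller than the width of unsigned int. */
int at_least_pow2(unsigned int x, unsigned int bit)
{
    unsigned int high_mask = ~0u << bit;   /* ones from `bit` upwards */
    return (x & high_mask) != 0;
}

/* Bucket index for ranges of 128: x >> 7 maps 0..127 to 0, 128..255 to 1,
 * and so on. The result can index a table or drive a switch in place of a
 * chain of range tests. */
unsigned int bucket_of(unsigned int x)
{
    return x >> 7;
}
```

A single shift followed by a table lookup or switch replaces the whole chain of paired comparisons, at the cost of changing where the range boundaries fall.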
Re: CPU, graphics cards ,FPGA, multi pics/audino speed
Andy, Wed Dec 11 2013, 06:02PM

Thanks, you three, for the help. The project's on hold: I let too much dust cover the CPU heat sink and have fried it :( but it's mostly done.