I'd forgotten about the fast inverse square root function; I remember it only because of that strange constant in the algorithm. At least Newton's method is in there somewhere.
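For anyone else trying to place it, this is the classic fast inverse square root (the "strange constant" is 0x5f3759df), written here as a minimal sketch using memcpy instead of the original pointer cast so it is well-defined C:

[code]
#include <stdint.h>
#include <string.h>

/* Classic fast inverse square root: the magic constant seeds an
   initial guess from the float's bit pattern, then one step of
   Newton's method refines it. */
float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);          /* reinterpret float bits as an int */
    i = 0x5f3759df - (i >> 1);         /* the "strange constant" */
    memcpy(&x, &i, sizeof x);
    return x * (1.5f - half * x * x);  /* one Newton iteration */
}
[/code]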
Quoted:
I'd forgotten about the fast inverse square root function; I remember it only because of that strange constant in the algorithm. At least Newton's method is in there somewhere.

My code running on the 16 cores is now noticeably faster than the same code running on the ARM alone. Before I found this magic code it was way slower. There are still performance tweaks to make, but the parallel programming model is a lot less 'black magic voodoo' to me now.
Quoted:
So if you all can plot out the universe movements on this thing, can you model or simulate how drug molecules dock into proteins? Now that'd be cool and very useful in the pharma world. Just sayin'

There are already accelerated ASIC solutions that are faster than the board talked about in this thread. The Parallella (the board in this thread) gives you 16 or 64 "general purpose" CPUs to work together on a problem, and they can all be programmed in C, making them relatively easy to use. The USB "miners" for bitcoin (or protein folding) are ASICs (Application-Specific Integrated Circuits): they can only do what they were designed to do (fold proteins or mine bitcoins). They are parallel computers with only one possible task. The Epiphany chip on the Parallella board, on the other hand, gives you general-purpose CPUs that all run at the same time (in parallel), letting you do any calculation faster as long as it can be broken into parts, such as each core rendering part of an image.

Not all programs scale easily to parallel computing. On the PC, for example, Firefox still mostly runs on one core of your CPU, even if you have 8 cores available. Software needs to be rewritten substantially to run on several cores at once. Photoshop is an example: different "chunks" of the image and the effect you are applying are sent to different cores, so with 8 of them working on a blur, it finishes almost 8 times faster. That's the beauty of parallel computing. Going beyond 8 cores, a video card (like an NVIDIA GTX 960) has thousands of cores, each a wizard at math (floating point and matrix), all designed to render video quickly.
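To make the "chunks to cores" idea concrete, here is a minimal sketch (not from the thread) of splitting an array across worker threads in C with pthreads; the 8-way Photoshop blur works on the same principle:

[code]
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 8
#define N 80000

static float data[N];

struct slice { int start, end; };

/* Each worker processes its own contiguous slice of the array. */
static void *work(void *arg)
{
    struct slice *s = arg;
    for (int i = s->start; i < s->end; i++)
        data[i] *= 2.0f;   /* stand-in for a real per-element effect */
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    struct slice slices[NWORKERS];
    int per = N / NWORKERS;

    for (int w = 0; w < NWORKERS; w++) {
        slices[w].start = w * per;
        slices[w].end = (w == NWORKERS - 1) ? N : (w + 1) * per;
        pthread_create(&tid[w], NULL, work, &slices[w]);
    }
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(tid[w], NULL);   /* wait for every slice to finish */
    printf("done\n");
    return 0;
}
[/code]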
Crappy video demo of the difference between the nbody code running strictly on the CPU and the parallel version on the 16-core Epiphany chip.
The first run is CPU only, the second run is on the 16 cores. Both are calculating an 800-star system over 10 iterations. The more stars I add, the bigger the difference between the two.
http://www.youtube.com/watch?v=7z6G5J-OB9Y
Quoted:
Crappy video demo of the difference between the nbody code running strictly on the CPU and the parallel version on the 16-core Epiphany chip. The first run is CPU only, the second run is on the 16 cores. Both are calculating an 800-star system over 10 iterations. The more stars I add, the bigger the difference between the two. http://www.youtube.com/watch?v=7z6G5J-OB9Y

That is a remarkable speed-up. How much do the "answers" differ between the sqrt library function and the fast inverse root function? Is the error small enough that you are still modeling a galaxy once you are billions of iterations in?
Quoted:
That is a remarkable speed-up. How much do the "answers" differ between the sqrt library function and the fast inverse root function? Is the error small enough that you are still modeling a galaxy once you are billions of iterations in?

I am currently doing 5 Newton iterations and the numbers are close enough for me; I am not doing any kind of rigorous scientific analysis. For comparison, after 10 iterations the x coordinate for star #1 is -0.319215 on the CPU, and -0.319304 on the 16-core run.
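A quick way to quantify that drift, as a sketch (not from the posted repo), assuming the fast_rsqrt routine sketched earlier is linked in:

[code]
#include <math.h>
#include <stdio.h>

float fast_rsqrt(float x);   /* the sketch from earlier in the thread */

int main(void)
{
    /* Compare libm against the fast routine across several magnitudes
       and print the relative error of each result. */
    for (float v = 0.25f; v < 1.0e6f; v *= 10.0f) {
        float exact = 1.0f / sqrtf(v);
        float fast  = fast_rsqrt(v);
        printf("v=%12.2f  exact=%.7f  fast=%.7f  rel err=%+.2e\n",
               v, exact, fast, (fast - exact) / exact);
    }
    return 0;
}
[/code]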
I bumped the code up to take the 800 stars through 1000 iterations instead of 10.
The CPU took 6 minutes 24 seconds; the Epiphany took 1 minute 17 seconds (about a 5x speed-up). There is still room to squeeze performance out of the parallel code. This is just my first attempt, and I am sure there are tricks and techniques I can apply to make it even faster.
Quoted:
I bumped the code up to take the 800 stars through 1000 iterations instead of 10. The CPU took 6 minutes 24 seconds; the Epiphany took 1 minute 17 seconds. There is still room to squeeze performance out of the parallel code.

Is any of that time spent writing to the frame buffer / display?
Quoted:
Is any of that time spent writing to the frame buffer / display?

One thing I did change: previously I was writing the 800-star data to all 16 cores' local memory. I changed that to write only to core #1's memory, and all the cores read from there, to cut down on data moving back and forth. I also have the Epiphany code set up so that core #1 sums up the results from all the other cores when they signal they are done; previously I was downloading all the data to the ARM and summing there. One more thing I noticed with the fast inverse square root code: if I do an even number of Newton iterations the values diverge rapidly, while any odd number of iterations keeps everything on track.
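For experimenting with the iteration count, the Newton refinement can be parameterized; this is a sketch under the same assumptions as the earlier one. In principle each extra step only tightens the estimate, so even counts diverging may be worth a second look at the surrounding code:

[code]
#include <stdint.h>
#include <string.h>

/* fast_rsqrt with a configurable number of Newton iterations */
float fast_rsqrt_n(float x, int iters)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759df - (i >> 1);       /* initial bit-level guess */
    memcpy(&x, &i, sizeof x);
    for (int k = 0; k < iters; k++)
        x = x * (1.5f - half * x * x);  /* each pass refines the guess */
    return x;
}
[/code]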
Someone requested to see the code; it's too big for IM, so I'm posting it here.
The code isn't cleaned up yet. I tend to code to get something working, then go back and make it 'pretty'. It is a work in progress, and the logic will change as I get more experience with these things. Cut/paste screwed up the formatting. ETA: code formatted properly below.
Let me see if the [code] / [/code] tags work.
This part of the Epiphany code, is it running on one core or all cores? I lost myself in the braces and think it is under the core 1 code:

[code]
// all cores done processing this iteration; calculate new x, y, z for next iteration
for (i = 0; i < *n; i++) {
    p[i].x += p[i].vx * dt;
    p[i].y += p[i].vy * dt;
    p[i].z += p[i].vz * dt;
}
[/code]

Why not have each core add the velocity * time step only to the objects in its own "sector"? Maybe make it an extra subroutine? With a small object count and time step it doesn't matter, but when you get to millions of points across 64 cores, falling back on one core to compute all the new positions may slow things down a good deal.

Second, this part:

[code]
for (i = 0; i < 4; i++) {
    for (j = 0; j < 4; j++) {
        z[num++] = e_get_global_address(i, j, o);
[/code]

Isn't that something that would only need to happen once, outside the loop, rather than running on all cores every pass? That's a lot of calls to e_get_global_address(). Again, my understanding of the architecture is imperfect here.

Lastly, can you make the cores into two workgroups, so that the cores don't all have to check whether they are core 1? Define 15 cores to run the worker code without the check, and the 16th core as core 1. Do they still get to share memory the way you are using it when you make groups of cores into virtual CPUs? And did you time the code to see if summing on the ARM was appreciably slower than summing on core #1?
Quoted:
This part of the Epiphany code, is it running on one core or all cores? ... Did you time the code to see if summing on the ARM was appreciably slower than summing on core #1?

I am keeping it on core #1 to avoid the data move from local memory to ARM memory; there is a performance penalty in moving it over and moving it back after summing. The algorithm is brute force, comparing every body to every other body to get new vx, vy, vz values from x, y, z, and I don't want to start calculating new x, y, z positions until all bodies have been compared to all the others. I am keeping the code the same on all cores for simplicity.
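On the address-table point above, the hoist itself is small; here is a device-side sketch reusing the names z, num, and o from the posted snippet:

[code]
/* Build the table of remote addresses once, before the main loop,
   instead of recomputing it every iteration. */
void *z[16];
int num = 0;
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
        z[num++] = e_get_global_address(i, j, o);  /* e-lib, as in the posted code */

/* The main simulation loop then just indexes z[] each pass,
   with no further e_get_global_address() calls. */
[/code]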
Just saw this from the airport bar (off on a cruise). I am impressed. Can we talk when I get back?
Quoted:
Just saw this from the airport bar (off on a cruise). I am impressed. Can we talk when I get back?

I am mostly just piecing together logic I find online. Here is a run with just 64 stars; it shows the gravitational entanglement of the stars better.
That's pretty awesome! Nice work adding the display to it. I assume the ones that zip off the display at the start are "slingshots"?
What are your planned addition/expansion ideas?
Quoted:
That's pretty awesome! Nice work adding the display to it. I assume the ones that zip off the display at the start are "slingshots"? What are your planned addition/expansion ideas?

Kinda like NASA using a planet's gravity to fling a space probe off towards its final destination. Once I get the 4-board cluster built, I will try to get this scaled up to 64 processors. One thing I will add is the ability to assign a mass to each star; right now they are all treated as equal mass. Then I can create a few massive stars that should attract clouds of lighter stars, forming mini galaxies. Then, as 'God' of my digital universe, I can send a couple of mini galaxies on a collision course or try to get them to orbit each other. All kinds of possibilities.
Having mass increase as stars 'glom' together would make for an awesome simulation.
Maybe make the boundaries toroidal, so no mass is lost? Toroidal space gives wraparound, like the Asteroids game: if something goes off the left side of the display, it reappears at the right side with the same mass and velocity, and the same for the top and bottom edges. ETA: You don't have to actually model and transform the objects as traveling on the surface of a toroid, just do edge checking, like if x > 720 then x = 0, and if x < 0 then x = 720 (or whatever your display width/height is).
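A minimal sketch of that edge check, assuming float positions p[i].x / p[i].y as in the posted code and an 800x600 display. Subtracting the dimension instead of snapping to zero keeps the overshoot, so a fast mover doesn't lose distance on the wrap:

[code]
#define WIDTH  800.0f
#define HEIGHT 600.0f

/* Toroidal wraparound: leaving one edge re-enters at the opposite
   edge, with mass and velocity unchanged. */
if (p[i].x >= WIDTH)  p[i].x -= WIDTH;
if (p[i].x <  0.0f)   p[i].x += WIDTH;
if (p[i].y >= HEIGHT) p[i].y -= HEIGHT;
if (p[i].y <  0.0f)   p[i].y += HEIGHT;
[/code]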
I am having so much fun with this.
Here is a 192-star system with 2 mega stars at 100,000x the mass of the others. I was wondering why more stars were not captured in orbit, but now I realize a stable orbit in the universe is probably rare; unless I had billions of stars to play with, or let it run for a hundred years, I am not going to get many at random. The two mega stars are slowly being pulled together as well, though it would need to run a long time to get there. Also, the space has 'depth', so a star may look like it passes near a mega star without being affected much by its gravity when in reality it is far in front of or behind it. What's interesting is that I have been letting this run for quite a while now, and stars that left the screen are coming back, so there are some long elliptical orbits going on. It's down to just a handful of stars now, but more keep coming back from off screen periodically.
http://www.youtube.com/watch?v=GDupI2e5SQo
I am running it now with 800 stars. I think that with more stars they keep each other in check, since each star gravitationally influences all the others. The star system seems more stable now, or is just moving more slowly due to the extra stars, lol.
If I could scale this up to 50,000 stars or so, it would probably make a pretty good simulation. There is the Barnes-Hut algorithm (Josh Barnes and Piet Hut, the latter from the Institute for Advanced Study of Einstein fame) that works with larger datasets and reduces the number of calculations required. I will look into implementing that.
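For reference, the heart of Barnes-Hut is a tree plus an "opening test": if a cluster of stars is small relative to its distance, treat it as a single point mass at its center of mass; otherwise descend into it. That turns the O(n^2) all-pairs loop into roughly O(n log n). Here is a hypothetical sketch; the node structure and names are illustrative, tree construction is omitted, and G is folded into the units:

[code]
#include <math.h>

#define THETA 0.5f   /* opening angle; smaller = more accurate, slower */

struct bh_node {
    float cx, cy, cz;         /* center of mass */
    float mass;               /* total mass inside this node */
    float size;               /* side length of this node's cube */
    struct bh_node *child[8]; /* NULL for empty octants */
    int is_leaf;
};

/* Accumulate the acceleration on body (x,y,z) from the tree.
   A self-interaction check is omitted for brevity. */
static void bh_accel(const struct bh_node *n, float x, float y, float z,
                     float *ax, float *ay, float *az)
{
    if (!n || n->mass == 0.0f)
        return;
    float dx = n->cx - x, dy = n->cy - y, dz = n->cz - z;
    float dist = sqrtf(dx * dx + dy * dy + dz * dz) + 1e-9f;

    if (n->is_leaf || n->size / dist < THETA) {
        /* Far enough away: treat the whole node as one point mass. */
        float f = n->mass / (dist * dist * dist);
        *ax += f * dx; *ay += f * dy; *az += f * dz;
    } else {
        for (int i = 0; i < 8; i++)   /* too close: open the node */
            bh_accel(n->child[i], x, y, z, ax, ay, az);
    }
}
[/code]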
Quoted:
I am having so much fun with this. Here is a 192-star system with 2 mega stars at 100,000x the mass of the others. ... It's down to just a handful of stars now, but more keep coming back from off screen periodically. http://www.youtube.com/watch?v=GDupI2e5SQo

That's actually pretty amazing considering you've had the board less than a week and weren't doing a ton of parallel code before that! Keep the vids going! I have no idea how to capture video directly on the board. You could use fbgrab for screenshots, but I think you would need to run it through another computer to save it as a video file; the camera at the display works well enough. It's pretty cool to see how the objects react to the large-mass objects.
Quoted:
That's actually pretty amazing considering you've had the board less than a week and weren't doing a ton of parallel code before that! Keep the vids going!

Google helps! I have found most of the solutions for this via Google and the Parallella forum.
Quoted:
Google helps! I have found most of the solutions for this via Google and the Parallella forum.

You can't give Google all the credit; don't sell yourself short as a True Geek¹. You've shown that you fully comprehend the code you are running. All Google may have done is save you some typing and let you skip past a couple of rounds of trial and error.
¹ Meant as a badge of pride.
Quoted:
You can't give Google all the credit; don't sell yourself short as a True Geek¹. You've shown that you fully comprehend the code you are running.

Yeah, I showed this sim running to the wife, and she rolled her eyes and called me a geek! lol
Achieved a stable orbit! I also found a defect in my code: I have 768 layers of depth in the virtual space, but all the stars were sitting at depths between 1 and 2. Now there are not so many stars slingshotting off into space, and more orbits going on.
Here is one stable orbit around a mega star that has been going for 10 minutes now.
http://www.youtube.com/watch?v=NVhRDWDPjmI
Quoted:
Achieved a stable orbit! I also found a defect in my code: I have 768 layers of depth in the virtual space, but all the stars were sitting at depths between 1 and 2. Here is one stable orbit around a mega star that has been going for 10 minutes now. http://www.youtube.com/watch?v=NVhRDWDPjmI

Can you add a splash of color to show Z-buffer (in/out) distance? A varying shade of blue or something, using the top 3 bits of depth for a blue intensity from 128-255?
Quoted:
Can you add a splash of color to show Z-buffer (in/out) distance? A varying shade of blue or something, using the top 3 bits of depth for a blue intensity from 128-255?

I was thinking about adding some 'red shift' / 'blue shift' depending on whether the star is moving toward the screen or away from it. Will try to tackle that today sometime.
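A minimal sketch of brass's depth cue, assuming the sim's 768 depth layers and a hypothetical put_pixel(x, y, r, g, b) framebuffer helper (not from the posted code):

[code]
/* Map a 0..767 depth value into a blue intensity of roughly 128..255:
   take the top 3 bits of the 10-bit depth and scale. Nearer stars
   stay pale; deeper stars get noticeably bluer, in coarse steps. */
unsigned depth = (unsigned)p[i].z;                  /* 0..767 here */
unsigned blue  = 128 + ((depth >> 7) & 0x7) * 18;
put_pixel((int)p[i].x, (int)p[i].y, 0x80, 0x80, blue);
[/code]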
This thread is interesting, but confusing. Can someone explain in plain, non-tech speak what is happening?
Quoted:
This thread is interesting, but confusing. Can someone explain in plain, non-tech speak what is happening?

I bought a little computer that has a custom 16-core processor. Think of the 16-core processor as 16 little computers. I coded up a simulation that shows stars moving around, with gravity from all the stars affecting the speed and direction of motion of every other star. Say I have 1600 stars in the simulation: instead of 1 CPU having to do all the calculations for all 1600 stars, my code has 100 stars processed on each of the 16 cores. All 16 are processing at the same time, so it finishes the 1600-star calculation much faster than one CPU chewing through it alone.
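In code, the split is just each core taking its own contiguous slice of the star array; a sketch where core_id and update_star are illustrative names:

[code]
#define CORES 16

/* With n = 1600 stars, each core handles a 100-star slice. */
int per_core = n / CORES;
int start = core_id * per_core;
int end   = (core_id == CORES - 1) ? n : start + per_core;

for (int i = start; i < end; i++)
    update_star(i);   /* this core's share of the gravity math */
[/code]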
Quoted:
I bought a little computer that has a custom 16-core processor. Think of the 16-core processor as 16 little computers. ... All 16 are processing at the same time, so it finishes the 1600-star calculation much faster than one CPU chewing through it alone.

Thank you, and that is fucking awesome. Kind of like what I read a while back, that a bunch of regular computers linked together can be equivalent to one supercomputer... Could this be done for large Excel files that crash regular Windows?
Quoted:
Thank you, and that is fucking awesome. Kind of like what I read a while back, that a bunch of regular computers linked together can be equivalent to one supercomputer... Could this be done for large Excel files that crash regular Windows?

Only if a version of Excel is written that takes advantage of multiple cores. Fucking Bill Gates.
Quoted:
I was thinking about adding some 'red shift' / 'blue shift' depending on whether the star is moving toward the screen or away from it. Will try to tackle that today sometime.

It makes it a little easier to visualize what is going on. Now a guy in Australia wants the code; he saw the vid I posted on the Parallella forum. I told him I would email it to him. I guess I need to get on GitHub.
That's great! GitHub is free for public repositories, and it will walk you through how to set it up and such. Pretty easy.
Don't know if you got this resolved, but you don't need hardware multiply/divide for powers of two. That's what shift is for. Shifts and masking are awesome speed-ups over library calls.
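For anyone following along, the standard equivalences are below. They only apply to powers of two, a right shift rounds differently than division for negative signed values, and modern compilers usually make these rewrites automatically:

[code]
unsigned x = 100;

unsigned a = x << 3;   /* x * 8  */
unsigned b = x >> 4;   /* x / 16 */
unsigned c = x & 15;   /* x % 16 (mask with 2^k - 1) */
[/code]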
Quoted:
That's great! GitHub is free for public repositories, and it will walk you through how to set it up and such. Pretty easy.

https://github.com/capnrob97/rob_nbody
Quoted:
https://github.com/capnrob97/rob_nbody

Yep. Now you can see the colorized and formatted code without the forum messing it up.

Following on from what was posted above about mashing bits for speed, I've been thinking of a way to do the color shift quickly. Rather than comparing the velocity to 100, it would essentially compare to zero, but you wouldn't notice. Start all particles out grey (127,127,127, or 7F,7F,7F in RGB). If the MSB of the velocity vector is set (negative velocity), fold the z-velocity into the blue value; otherwise fold it into the red value. When the velocities are under 10 or so, the color shift won't be very noticeable; if they are higher, clear the sign bit and shift right before folding it in. It might be quicker or slower than a couple of compares followed by addition and subtraction, or the compiler may optimize either form to something similar anyway, resulting in code that is just a bit harder to read without much gain. A sketch of the idea follows below.

Do you have a video of your colorized output version? Lastly, you may want to change the comments in the code to match yours a bit, or at least make the title in the comments match your file name instead of hello_world.c.
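Here is a hedged sketch of one way to read that suggestion, reusing the assumed p[i] fields and the hypothetical put_pixel helper from earlier. Setting the high bit and ORing in the masked magnitude gives the 128-255 intensity range suggested above (an AND against the grey 0x7F could only darken the channel):

[code]
/* Start grey; tint blue if approaching (vz < 0), red if receding. */
unsigned char r = 0x7F, g = 0x7F, b = 0x7F;
int vz = (int)p[i].vz;

if (vz < 0)
    b = 0x80 | ((unsigned)(-vz) & 0x7F);   /* blue shift: moving toward viewer */
else if (vz > 0)
    r = 0x80 | ((unsigned)vz & 0x7F);      /* red shift: moving away */

put_pixel((int)p[i].x, (int)p[i].y, r, g, b);
[/code]

Note the magnitude is masked to 7 bits rather than clamped, so very fast stars wrap back toward grey; a saturating clamp would be the more careful choice.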
Yeah, I'll clean up the code at some point; I just used the hello_world example as a starting point.
I turned off the red shift / blue shift; it didn't look as good as I thought it would. I also had to slow things down for smaller numbers of stars, since they zip around so fast you can't see what is happening. The dt variable in e_rob_nbody.c controls how much time advances each iteration; I made it 100x smaller with 32 stars, which makes the motion easier to watch. I'm going to code it to auto-set dt based on the number of stars created. The next step is to try shared memory for the star data, to see if I can scale past 800 stars; shared memory can be 32 MB, while local memory on the Epiphany is 32 KB per core. Then I'll try to implement the Barnes-Hut algorithm. I think this will get me started:
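A tiny sketch of the auto-set idea; the scaling law is purely illustrative (linear in star count, with the 800-star run as the reference) and BASE_DT is a hypothetical name:

[code]
/* Fewer stars -> smaller dt -> slower, easier-to-watch motion. */
#define BASE_DT 0.01f   /* hypothetical dt tuned for the 800-star run */
float dt = BASE_DT * ((float)nstars / 800.0f);
[/code]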
Quoted:
Excel has used multiple cores for a number of years now.

Oh?? OK, I take it back, Mr. Gates... Is there an easy way that you know of that a non-tech guy like me could explain to our IT guys how to implement that on our network? I.e., any simple-to-follow instructions / guidelines out there?
Quoted:
Oh?? OK, I take it back, Mr. Gates... Is there an easy way that you know of that a non-tech guy like me could explain to our IT guys how to implement that on our network? I.e., any simple-to-follow instructions / guidelines out there?

If you have Office 2010 or higher, just running it will use all the cores on your CPU (typically 2 to 8). Hitting CTRL+SHIFT+ESC and clicking the "Performance" tab will show the CPU load for each core. If only one is running at peak, look at your processes to see which one that is.
I got a shared-memory version going, but it is very slow; shared memory access is far more time-consuming than local memory.
If I can't scale above 800 stars, I probably won't fiddle with the Barnes-Hut algorithm either, as the brute-force code is handling that size fine.
Not sure how I've missed this thread until now.
I did some MPI programming in college but haven't touched it since. I might have to pick up one of these boards, but I don't have a very good use for one other than as a toy.
Quoted:
I got a shared-memory version going, but it is very slow; shared memory access is far more time-consuming than local memory. If I can't scale above 800 stars, I probably won't fiddle with the Barnes-Hut algorithm either.

Could you work it in quadrants: use local RAM for 1/4 of the stars, write back to main RAM, move to the next quadrant, work it in local RAM and write back, and so on?
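That quadrant idea is classic tiling: stream the big shared-memory array through the 32 KB local SRAM in chunks. A sketch with hypothetical names (star_t mirrors the posted fields; local_buf and process_chunk are illustrative):

[code]
#include <string.h>

typedef struct { float x, y, z, vx, vy, vz; } star_t;  /* matches the posted fields */

#define CHUNK 200   /* stars per tile, sized to fit the 32 KB local SRAM */

void process_chunk(star_t *s, int n);   /* per-tile physics, defined elsewhere */

void tile_pass(star_t *shared, star_t *local_buf, int nstars)
{
    for (int base = 0; base < nstars; base += CHUNK) {
        int cnt = (nstars - base < CHUNK) ? nstars - base : CHUNK;
        memcpy(local_buf, &shared[base], cnt * sizeof(star_t));  /* pull tile into local RAM */
        process_chunk(local_buf, cnt);                           /* fast local math */
        memcpy(&shared[base], local_buf, cnt * sizeof(star_t));  /* write tile back */
    }
}
[/code]

For the all-pairs gravity step you would stream two tiles at a time (the stars being updated and the stars being compared against), but the pull-in / compute / write-back shape stays the same.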