Are there any limits on how many can be chained to run parallel?
How hard would it be to program a password cracker to run a dictionary attack with substitutions and then run through random character combinations? |
|
Quoted:
Are there any limits on how many can be chained to run parallel? How hard would it be to program a password cracker to run a dictionary attack with substitutions and then run through random character combinations?
Much faster with a thousand CUDA cores on a modern graphics card - code already exists. |
|
Quoted:
There is a dma_copy() function I think I can use to speed things up.
Quoted:
I got a shared memory version going but it is very slow; shared memory access is way more time consuming than local memory. If I can't scale above 800 stars I probably won't fiddle with the Barnes-Hut algorithm either, as the brute force code is handling it.
Quoted:
Could you work it in quadrants: use local RAM for 1/4 of the stars, write back to main RAM, go to the next quadrant, work it in local RAM, write back, and so on?
DMA copy should do the trick. What frequency does the Parallella run at, and what frequency is the memory controller/RAM running at? You could divide the work either as stars 1-n split across cores, or as an x by y chunk of the display for each core to work. The latter would have boundary issues and need more memory, though. Once that is done, you can set an interrupt on the ARM to put the bodies on the display from RAM 30 times per second, though you may run into a jitter issue if using x by y chunks of display on the parallel cores. |
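The quadrant idea above can be sketched in plain C. This is only an illustration, not the board's actual API: memcpy stands in for the platform's DMA routine (the dma_copy() mentioned earlier), and the star count and struct layout are made-up values for the sketch.

```c
#include <string.h>

#define N_STARS  800           /* total bodies kept in shared DRAM        */
#define QUADRANT (N_STARS / 4) /* bodies staged into fast local RAM       */

typedef struct { float x, y, vx, vy, m; } Star;

static Star ext_ram[N_STARS];    /* stand-in for the slow shared memory   */
static Star local_ram[QUADRANT]; /* stand-in for a core's fast local SRAM */

/* Process the stars one quadrant at a time: stage a chunk into local
 * memory, update it there, then write the results back out. memcpy
 * stands in for the platform's DMA copy routine. */
void step_by_quadrants(void (*update)(Star *, int))
{
    for (int q = 0; q < 4; q++) {
        memcpy(local_ram, &ext_ram[q * QUADRANT], sizeof local_ram);
        update(local_ram, QUADRANT);  /* all the work happens in fast RAM */
        memcpy(&ext_ram[q * QUADRANT], local_ram, sizeof local_ram);
    }
}
```

The win comes from doing the per-star arithmetic entirely against the fast local buffer, paying the slow-memory cost only twice per quadrant.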
|
Quoted: Are there any limits on how many can be chained to run parallel? How hard would it be to program a password cracker to run a dictionary attack with substitutions and then run through random character combinations?
Here is the academic paper: https://www.usenix.org/system/files/conference/woot14/woot14-malvoni.pdf I believe one of the latest parallella examples has a password cracking program.
|
|
Quoted:
Much faster with a thousand CUDA cores on a modern graphics card - code already exists.
I've got a solid CUDA setup that cooks, but core for core, CUDA > parallella? In other words, if I had 100 parallella boards chained, am I still better off with an equivalent CUDA graphics card? |
|
Quoted:
There is a paper I saw on this and an example program I believe related to this. Give me a sec to find it. Here is the academic paper: https://www.usenix.org/system/files/conference/woot14/woot14-malvoni.pdf I believe one of the latest parallella examples has a password cracking program
Thanks! |
|
Quoted:
I've got a solid CUDA setup that cooks, but core for core, CUDA > parallella? In other words, if I had 100 parallella boards chained, am I still better off with an equivalent CUDA graphics card?
Yes. The video card GPU has much more and far faster RAM to work with, in addition to being clocked much, much faster than the hobbyist board. |
|
Quoted: Yes. The video card GPU has much more and far faster RAM to work with, in addition to being clocked much, much faster than the hobbyist board.
I could heat my house in winter with the heat my Nvidia card throws off. |
|
Funny but true. I think there is a company that rents out (or pays you for hosting) server cabinets in homeowners' houses in the winter, piping the residual heat into the home.
I dread the high-temp alarms on heavy processing days. |
|
If Adapteva gets a paying customer for a 1024-core chip, they will build one. The founder posted on their forum that they aren't going to build these and then wait for a customer to come around anymore.
Being a small company, it is probably hard to land that first big client, but I believe Ericsson has already pumped $3 million into the company. |
|
Quoted: DMA copy should do the trick. What frequency does the Parallella run at, and what frequency is the memory controller/RAM running at? You could divide the work either as stars 1-n split across cores, or as an x by y chunk of the display for each core to work. The latter would have boundary issues and need more memory, though. Once that is done, you can set an interrupt on the ARM to put the bodies on the display from RAM 30 times per second, though you may run into a jitter issue if using x by y chunks of display on the parallel cores.
I will be getting the 4-board case next week, and will probably spend more time trying to distribute this over 4 boards. This is more a learning exercise in parallel programming for me than squeezing every last compute cycle out of it, for now at least. I am easily amused, and am enjoying building little universes with what I have now and watching the stars move around with various mega-stars placed about.
Had a good figure-8 orbit going earlier today, with a small star going back and forth between the mega-stars. |
|
|
|
Quoted: Another one. I self-identify now as the uber-geek of arfcom, until a challenger emerges (goatboy? subnet?) Goatboy, I need a little crown under my avatar until someone takes the uber-geek title. http://www.youtube.com/watch?v=ZXyV_9p56BU |
|
Another way to do it is to set it at 1k stars and have it render to the SD card, so you could play back the video anywhere.
The other thing to add that would be computationally easy is "glomming" bodies together on low delta-v collisions, with elastic behavior above that delta-v. |
|
A little off topic, but I'm looking for a quality Thunderbolt 2 PCI-e expansion chassis for my CUDA cards. Any suggestions?
|
|
Quoted:
Only if a version of Excel is written that takes advantage of multiple cores. Fucking Bill Gates
Quoted:
Excel has used multiple cores for a number of years now.
Quoted:
Oh?? Ok, I take it back, Mr. Gates... Is there an easy way that you know of that a non-tech guy like me could explain to our IT guys on how to implement that on our network? I.e. any simple-to-follow instructions/guidelines out there?
You don't have to do anything, it's already built in since 2010. |
|
Quoted:
I've got a solid CUDA setup that cooks, but core for core, CUDA > parallella? In other words, if I had 100 parallella boards chained, am I still better off with an equivalent CUDA graphics card?
Far better with a modern graphics card. Thousands of cores vice a couple hundred, and cores highly optimized for password cracking. |
|
This is a password cracker that runs on the Parallella:
https://github.com/parallella/parallella-examples/tree/master/john |
|
Improved the graphics output today; there was a slight flicker going on with the display.
Went with a double-buffering technique, and now the display is rock solid, no flicker at all. It looks a lot more polished and 'professional'. Graphics programming is not my thing, so that was a new concept for me, lol. New code is on GitHub. GitHub is pretty sweet, by the way. |
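For anyone who hasn't seen double buffering before, the core of it is tiny. This is a generic sketch, not the sim's actual display code: the resolution, pixel type, and render callback are placeholders, and a real driver would repoint the display hardware at the front buffer after the swap.

```c
#include <stdint.h>

#define W 640
#define H 480

/* Two full frame buffers. Draw into the hidden (back) one while the
 * visible (front) one is being scanned out, then swap the pointers.
 * The pointer swap is what kills the flicker: the display never sees
 * a half-drawn frame. */
static uint16_t buf0[W * H], buf1[W * H];
static uint16_t *front = buf0, *back = buf1;

void draw_frame(void (*render)(uint16_t *))
{
    render(back);          /* all drawing goes to the hidden buffer    */
    uint16_t *t = front;   /* swap: the finished frame becomes visible */
    front = back;
    back  = t;
    /* a real driver would now point the display hardware at `front` */
}
```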
|
I am so tempted to program the little UFO from 'Asteroids' to randomly appear and shoot a few stars as it flies across the screen.
|
|
Quoted:
Another one. I self-identify now as the uber-geek of arfcom, until a challenger emerges (goatboy? subnet?) Goatboy, I need a little crown under my avatar until someone takes the uber-geek title. http://www.youtube.com/watch?v=ZXyV_9p56BU
No, you've definitely got me out-geeked by a mile, with this thread (and I love following it so far). I'm barely a pretender to the throne, at this point. If I want a legitimate shot at the title, I'll have to go a different route. |
|
|
Fixed the star slingshot issue once and for all. There is a softening constant in the calculations that prevents the force between two stars from approaching infinity as they get extremely close; I had that value set too low.
Here is an 850-star system collapsing on itself. None of the stars have any velocity when it starts; the gravity of the star system pulls them together, and I get a nice little star cluster after the collapse. |
|
Quoted:
Fixed the star slingshot issue once and for all. There is a softening constant in the calculations that prevents the force between two stars from approaching infinity as they get extremely close; I had that value set too low. Here is an 850-star system collapsing on itself. None of the stars have any velocity when it starts; the gravity of the star system pulls them together, and I get a nice little star cluster after the collapse. http://youtu.be/_Uefw9bbNP0
Best one yet! If you let it run for a long time, do you end up with any orbits? Does it eventually fall into a pseudo-static system after an overnight run? |
|
Quoted: Best one yet! If you let it run for a long time, do you end up with any orbits? Does it eventually fall into a pseudo-static system after an overnight run? |
|
I have 6 hours of driving this weekend to pick up a boat.
I will think long and hard while driving about how to scale this up to more stars once the 4-board cluster is built out next week. Trouble is, I have to do some soldering to get the four boards powered off the four corner connects from a single power source when I build the little tower. I suck at soldering. |
|
Adapteva just sent an email: the desktop version dropped from $149 to $99 on Amazon.
Ordering another now. I love this thing. I will have 5 boards now to satisfy my peabrain interest in astrophysics. |
|
Quoted: Adapteva just sent an email: the desktop version dropped from $149 to $99 on Amazon. Ordering another now. I love this thing. I will have 5 boards now to satisfy my peabrain interest in astrophysics.
Better download all the Linux distros for it I can find. |
|
I modified my code to restart the 850-star system collapse every 1500 iterations, with random star positions to start each time.
It would make a nice demo in the background at the Adapteva booth next trade show ('hint, hint') if they get tired of the Mandelbrot one. |
|
Ha, I will have a ready-made n-body sim to play with on the 4-board cluster as soon as I get it built, courtesy of the US Army:
https://github.com/USArmyResearchLab/mpi-epiphany |
|
lol, I am looking at their nbody code and they did the exact same thing I did: use that function from the Quake III game.
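The Quake III function in question is presumably the famous fast inverse square root, which is handy in n-body loops because 1/sqrt shows up in every pairwise distance calculation. A sketch of it, with memcpy replacing the original's pointer cast to stay within strict-aliasing rules:

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root, as popularized by the Quake III source:
 * reinterpret the float's bits as an integer, subtract from the magic
 * constant 0x5f3759df to get a rough first guess, then refine with one
 * Newton-Raphson step. Accurate to a fraction of a percent. */
float q_rsqrt(float number)
{
    float x2 = number * 0.5f, y = number;
    uint32_t i;
    memcpy(&i, &y, sizeof i);       /* bit-level view of the float   */
    i = 0x5f3759df - (i >> 1);      /* magic initial approximation   */
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - x2 * y * y);    /* one Newton-Raphson iteration  */
    return y;                       /* approximately 1/sqrt(number)  */
}
```

On modern hardware a native rsqrt instruction or plain 1.0f/sqrtf() is usually as fast or faster, but on a small core without fast hardware sqrt this trick still earns its keep.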
|
|
Quoted:
lol, I am looking at their nbody code and they did the exact same thing I did: use that function from the Quake III game.
Great minds and all of that. Another place to look for code would be the Parallax Propeller code base. That's an 8-core parallel system-on-a-chip from about 8 years ago. They've got some very tight code running on it for things like NTSC display, and many other applications. You could easily scale up some of their demos to your system. I was thinking on what you said about connecting the power buses together on your boards... In the videos where they are connecting multiple boards, they appear to be using an independent power supply for each board: two cords per board, one Ethernet, one power, and on one of them, the I/O. I'm unsure what you are using for a power supply, though. A single 2A "wall wart" would be overloaded by 4 boards from what I've read. Heat management also turns into an issue. I think one 5" fan would move cooling air more effectively than a 1" fan on the heatsink of each board. |
|
Quoted:
I'm unsure what you are using for a power supply, though. A single 2A "wall wart" would be overloaded by 4 boards from what I've read. Heat management also turns into an issue. I think one 5" fan would move cooling air more effectively than a 1" fan on the heatsink of each board.
It comes with a power supply to power all four boards: https://github.com/abopen/parallella-cluster-case |
|
Got a solid strategy in my brain on how to scale this up to a few thousand stars on 16 cores.
Just got to get home and start coding. |
|
Quoted:
Got a solid strategy in my brain on how to scale this up to a few thousand stars on 16 cores. Just got to get home and start coding.
That'd be cool. Do you have an 8-port gigabit Ethernet switch to plug all the boards into? If you thought memory copy was slow... I'm unsure how to do an optimal parallel implementation, now that there are three bus speeds (local RAM to DRAM, local RAM DMA, and board to board). Possibly give each board n/4 stars to work on; while the 16 cores are computing, they'd update local DRAM at the end of each cycle with the 6 numbers for each particle, then the ARM processors would consolidate all of that data into double-buffered frames on all boards, with board 0 responsible for buffering it out to the display. |
|
Quoted: That'd be cool. Do you have an 8-port gigabit Ethernet switch to plug all the boards into? If you thought memory copy was slow... I'm unsure how to do an optimal parallel implementation, now that there are three bus speeds (local RAM to DRAM, local RAM DMA, and board to board). Possibly give each board n/4 stars to work on; while the 16 cores are computing, they'd update local DRAM at the end of each cycle with the 6 numbers for each particle, then the ARM processors would consolidate all of that data into double-buffered frames on all boards, with board 0 responsible for buffering it out to the display.
I was copying all the star data to core 1's local memory, with each core reading from there, processing 1/16th of the data, and writing it back. Now I am putting 1/16th of the star data in each core's local memory (call these base stars), then copying another 1/16th chunk to each core to start (call these work-unit stars). Say 4096 stars: I put 256 base stars on each core to start. Then I put 256 work-unit stars on each core, so each core now has 256 work-unit stars to calculate forces against its 256 base stars.
When that is done, each core reads the other cores' base stars and calculates the forces of its work-unit stars against those base stars. Once they cycle through all 16 cores, all 4096 stars have had force calculations against all the other stars in the system. That's the theory, and it is sort of working right now; some stars aren't moving, so I have a bug in the code somewhere that I am trying to figure out. |
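The tiling scheme described above, flattened into a serial C sketch so the traversal order is easy to check. This is not the actual Epiphany code: the two outer loops model "each core" and "each rotation of the remote tiles", and a real implementation would run the outer loop concurrently with DMA fetches of the remote tiles.

```c
#define CORES 16
#define TILE  256                   /* stars per core (4096 total)      */
#define N     (CORES * TILE)

typedef struct { float x, y, ax, ay, m; } Star;

/* Serial model of the per-core tiling: core c keeps its TILE "base"
 * stars resident; on each of CORES rounds it works one tile of remote
 * stars against them. After CORES rounds, every star has been paired
 * with every other star in the system exactly once per direction. */
void tiled_step(Star stars[N],
                void (*pair)(Star *base, const Star *other))
{
    for (int c = 0; c < CORES; c++) {           /* each core          */
        Star *base = &stars[c * TILE];
        for (int r = 0; r < CORES; r++) {       /* rotate the tiles   */
            const Star *unit = &stars[r * TILE];
            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++)
                    if (&base[i] != &unit[j])   /* skip self-pairing  */
                        pair(&base[i], &unit[j]);
        }
    }
}
```

Counting the calls is a quick way to convince yourself the partitioning is complete: N stars each interact with the other N-1, exactly the brute-force total, just reordered so each core touches one tile at a time.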
|
This thing is kicking my ass today.
There is something strange I just don't understand yet between the ARM local memory and each core's local memory; I think it is some kind of byte-alignment issue I need to wrap my head around. When I start trying to use pointer magic to move pieces of the data to each core from ARM memory, there is strange behavior going on. My pointer logic is sound, I believe, so there may be some differences in byte alignment between the two sides or something. Grr. The good news is that when I finally nail this issue, I will have a pretty good mental grasp of this thing, I think. |
|
|
Quoted:
Got it working; a stupid logic issue on my part slowed me down. Here with 4096 stars, slow as shit though; need to go to the Barnes-Hut algorithm to cut down on the number of calculations. http://www.youtube.com/watch?v=dM-zDACmznE
1 frame per second is a bit slow, but you are doing something that you thought was impossible last week. Is that simulation on one board or the 4 boards together? |
|
Quoted: 1 frame per second is a bit slow, but you are doing something that you thought was impossible last week. Is that simulation on one board or the 4 boards together?
In the nbody simulation, when doing a brute-force all-body compare, the number of calculations goes through the roof as you add more stars to the sim. That's why no one does it that way on large sims, and why I will now try to migrate to faster algorithms to keep up with the increasing star count. The good news is my strategy worked for getting more stars on this thing. |
|
So your current algorithm is O(n²) with your implementation, while Barnes-Hut's optimum speed is O(n log n)?
Or have you made tweaks to get a bit faster than n² currently? With n being star count, expanding that Big-O notation with 4096 stars: n² -> 16,777,216, while n·log10(n) -> 14,796. That's a pretty huge performance boost from the Barnes-Hut algorithm: 16 million vs 15 thousand! |
|
Yeah, the brute force is computationally expensive.
There is another algorithm I may play around with as well: http://www-hpcc.astro.washington.edu/faculty/marios/papers/ppl/node7.html |
|
Here is an astrophysicist using PS3s for his black hole research, I love it!
http://www.space.com/26943-sony-playstations-calculate-black-hole-motion.html |
|
|
Quoted: Ordered one of the Parallella boards today, similar to a Raspberry Pi, but has a 16 core Epiphany processor as well. http://www.amazon.com/Adapteva-Parallella-16-Desktop-Computer/dp/B0091UD6TM/ref=sr_1_1?ie=UTF8&qid=1433186643&sr=8-1&keywords=parallella No idea what I am going to do with it yet, lol, but I will figure out something. Only uses 5 watts of power, so much more efficient energy wise than an Nvidia type card. |
|
Quoted: Ordered one of the Parallella boards today, similar to a Raspberry Pi, but has a 16 core Epiphany processor as well. http://www.amazon.com/Adapteva-Parallella-16-Desktop-Computer/dp/B0091UD6TM/ref=sr_1_1?ie=UTF8&qid=1433186643&sr=8-1&keywords=parallella No idea what I am going to do with it yet, lol, but I will figure out something. Only uses 5 watts of power, so much more efficient energy wise than an Nvidia type card. http://www.youtube.com/watch?v=hFWIC3RF0f8
This little board has a 16-core coprocessor, kinda like a graphics card does. |
|
Quoted:
Curious, what do you mean by an Nvidia type card? A video card? Kind of an apples and oranges comparison if so. GPUs are designed to brute force graphics and are power hungry by nature, although they do get more efficient with each newer iteration and die shrink. Those little hobbyist mini computers are specifically designed around low cost/low power consumption chips and aren't meant to compete with a dedicated GPU. This precludes the need for things like cooling solutions more complicated than simple airflow over the chip, which would drive up cost and increase the size of the end product.
Nvidia graphics cards (GPUs) are made up of hundreds or thousands of little processing cores and are used in scientific computing all over the world. This little board has a 16-core coprocessor, kinda like a graphics card does. http://www.youtube.com/watch?v=B6DGSIH6h08 |
|
Probably going to order one of these Nvidia Jetsons as well:
192 cores on this guy, and it runs Linux, for less than $200. http://www.amazon.com/NVIDIA-Jetson-TK1-Development-Kit/dp/B00L7AWOEC ETA: just ordered one. If I write my star sim in OpenCL I should be able to port it over to the Nvidia relatively easily. |
|
Quoted:
Probably going to order one of these Nvidia Jetsons as well: 192 cores on this guy, and it runs Linux, for less than $200. http://www.amazon.com/NVIDIA-Jetson-TK1-Development-Kit/dp/B00L7AWOEC ETA: just ordered one. If I write my star sim in OpenCL I should be able to port it over to the Nvidia relatively easily.
That's closer to what I've dabbled with, but that is CHEAP for a Kepler-GPU-powered board! I'm wondering if they can be paralleled as easily, or if a dual-SLI-type connection isn't even an option. |
|
Copyright © 1996-2024 AR15.COM LLC. All Rights Reserved.