Hello all!
I am trying to run the following code with different n sizes on a Xeon Phi KNC (61 cores, 4 threads per core) and on a Xeon host (2 sockets of Xeon E5-2660 v2).
I am getting the timings shown in the tables below. I am trying to understand why the MIC's performance is poorer than the Xeon's. What am I doing wrong here, and how can I fix it (if possible)?
Thanks!
CODE:
program prog
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n, time_start, time_end
  n=481
  do while (n .le. 481000000)
    allocate(arr1(n),arr2(n))
    call system_clock(time_start)
    !dir$ offload begin target(mic)
    !$omp SIMD
    do i=1,n
      arr1(i) = arr1(i) + arr2(i)
    end do
    !dir$ end offload
    call system_clock(time_end)
    write (*,*) "n=",n," time=",time_end-time_start
    deallocate(arr1,arr2)
    n = n*10
  end do
end program
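A note on units: the time values reported in this post are raw system_clock tick differences, and the tick rate depends on the compiler and on the integer kind passed in, so it may help to report seconds instead. Here is a minimal sketch of how that could look for a single n, without the offload directives. The names timing_sketch, ticks_per_sec and seconds are only illustrative and are not part of the program above, and the arrays are initialized here (which the program above does not do) so that the result can be checked.

```fortran
program timing_sketch
  use iso_fortran_env, only: int64
  implicit none
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n
  integer(int64) :: t_start, t_end, ticks_per_sec  ! illustrative names
  real :: seconds

  n = 481000
  allocate(arr1(n), arr2(n))
  arr1 = 1   ! assumption: initialize so the sum is well defined
  arr2 = 2

  ! The second argument returns the number of clock ticks per second.
  call system_clock(t_start, ticks_per_sec)
  !$omp simd
  do i = 1, n
    arr1(i) = arr1(i) + arr2(i)
  end do
  call system_clock(t_end)

  ! Convert the tick difference to seconds before printing.
  seconds = real(t_end - t_start) / real(ticks_per_sec)
  write (*,*) "n=", n, " time[s]=", seconds, " check=", arr1(n)
  deallocate(arr1, arr2)
end program timing_sketch
```

Printing one array element after the loop also keeps the compiler from discarding the loop as dead code when optimization is enabled.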
Xeon-Phi RESULTS:
n= 481 time= 8881
n= 4810 time= 75
n= 48100 time= 53
n= 481000 time= 261
n= 4810000 time= 1991
n= 48100000 time= 18912
n= 481000000 time= 188203
Settings:
#!/bin/bash
#SBATCH -N 1
#SBATCH -o out_122
#SBATCH --exclusive
export MIC_KMP_AFFINITY=verbose,granularity=fine,scatter
export MIC_OMP_NUM_THREADS=122
./prog.exe
sbatch -p xphi -N 1 --exclusive run_par.sh
All of the settings are in run_par.sh, and xphi is the name of the device.
It's also worth mentioning that a native run (addition of !dir$ offload begin target(mic) before the !$omp SIMD) yields much better results:
n= 481 time= 0
n= 4810 time= 0
n= 48100 time= 6
n= 481000 time= 55
n= 4810000 time= 455
n= 48100000 time= 4342
n= 481000000 time= 43322
In the native run the settings are:
#!/bin/bash
#SBATCH -N 1
#SBATCH -o out_244_native
#SBATCH --exclusive
export SINK_LD_LIBRARY_PATH=...intel/compilers_and_libraries/linux/lib/mic:$SINK_LD_LIBRARY_PATH
micnativeloadex ./prog.exe.MIC -e "KMP_AFFINITY=verbose,granularity=fine,scatter"
Xeon RESULTS:
n= 481 time= 0
n= 4810 time= 0
n= 48100 time= 2
n= 481000 time= 19
n= 4810000 time= 93
n= 48100000 time= 706
n= 481000000 time= 7006
Here is the output of lscpu command on my Xeon machine:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping: 4
CPU MHz: 1203.382
BogoMIPS: 4405.99
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
My MIC specs are (tail of /proc/cpuinfo):
processor : 239
vendor_id : GenuineIntel
cpu family : 11
model : 1
model name : 0b/01
stepping : 3
cpu MHz : 1052.630
cache size : 512 KB
physical id : 0
siblings : 240
core id : 59
cpu cores : 60
apicid : 239
initial apicid : 239
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips : 2112.44
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: