Hi Cong,
We recently added more fine-grained timers to the NEST source code that measure the elapsed time of the different phases, see here: https://nest-simulator.readthedocs.io/en/v3.3/guides/built-in_timers.html
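A minimal sketch of how the timers could be read from PyNEST, assuming they appear in the kernel status dictionary under keys that start with "time_" (which ones you get depends on your NEST version and build options, see the guide linked above). The helper name collect_timers and the values in the fallback dictionary are made up for illustration:

```python
def collect_timers(kernel_status):
    """Return all entries of a kernel status dict whose key starts with 'time_'."""
    return {key: value for key, value in kernel_status.items()
            if key.startswith("time_")}


try:
    import nest
    # Query this after nest.Simulate() so the timers have been filled.
    status = nest.GetKernelStatus()
except ImportError:
    # Stand-in dictionary with invented values, used only if NEST is not installed.
    status = {"time_simulate": 1.5, "time_construction_create": 0.2, "resolution": 0.1}

for name, value in sorted(collect_timers(status).items()):
    print(f"{name}: {value}")
```

Filtering by prefix is just a convenience so that the script does not break if a given timer is not available in your build.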
If you are interested in the memory consumption, you can query it from the Python level like this: nest.ll_api.sli_func('memory_thisjob')
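As a rough sketch, you could call this before and after a phase to see how much the memory grows across it. The helper memory_increase, the neuron model and the neuron count are my own illustrative choices, and the fallback branch only fakes the numbers so that the snippet also runs without NEST:

```python
def memory_increase(query_memory, action):
    """Run action() and return how much query_memory() grew across it."""
    before = query_memory()
    action()
    return query_memory() - before


try:
    import nest

    def query():
        # The call from above; returns the memory used by this job.
        return nest.ll_api.sli_func("memory_thisjob")

    def build():
        # Illustrative network construction step.
        nest.Create("iaf_psc_alpha", 1000)

except ImportError:
    # Stand-in with invented numbers, used only if NEST is not installed.
    fake_memory = [100]

    def query():
        return fake_memory[0]

    def build():
        fake_memory[0] += 50

print("memory increase during construction:", memory_increase(query, build))
```

The same pattern works for the simulation phase if you pass a function that calls nest.Simulate() instead.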
If you are looking for more sophisticated performance metrics and measurements, you might want to try tools such as VTune or AMD uProf, depending on your architecture. They both support MPI and OpenMP, work on the C++ level and can be used in conjunction with PyNEST. If you are only interested in specific parts of the code, you can insert explicit start and stop points into the source code to restrict data collection to those parts. E.g. if you are only interested in the simulation phase and not in the network construction phase, you could exclude the latter. In my experience this declutters the output and helps to arrive at interpretable results. This could tackle the issue you had with gprof.
Hope this helps, and let me know if you have more questions,
Jari