Intermittent SIGSEGV (segmentation fault), SIGABRT, and process hangs in C# code running on Mono

Last updated: 2023-09-27 18:10:15

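We are seeing intermittent segmentation faults and process hangs in a C# Mono project running on Ubuntu. I have spent a considerable amount of time trying to debug the problem, including following the instructions at http://www.mono-project.com/docs/debug+profile/debug/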

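Data points:

  • How often this happens varies widely between environments. In our UAT environment it rarely happens. In production it happens every few hours, and on our development machines we are lucky if the process runs for 20 minutes without failing.

  • We upgraded Mono to version 4.03, with no improvement.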

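Symptoms:

The process either hangs, not responding to SIGQUIT or SIGTERM, or fails with SIGSEGV or SIGABRT.

Below is a sample dump. The dumps vary slightly, and most of them do not contain the assertion failure shown here: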

* Assertion: should not be reached at sgen-scan-object.h:101
Native stacktrace:
        /usr/bin/mono() [0x4b23ac]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7fbaa5e50340]
        /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fbaa5ab1cc9]
        /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fbaa5ab50d8]
        /usr/bin/mono() [0x629839]
        /usr/bin/mono() [0x629a47]
        /usr/bin/mono() [0x629b96]
        /usr/bin/mono() [0x5d85a8]
        /usr/bin/mono() [0x5cbd56]
        /usr/bin/mono() [0x5cd458]
        /usr/bin/mono() [0x5cdaab]
        /usr/bin/mono() [0x5d0d32]
        /usr/bin/mono(mono_gc_collect+0x28) [0x5d1458]
        /usr/bin/mono() [0x59c18a]
        /usr/bin/mono() [0x623a06]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fbaa5e48182]
        /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fbaa5b7547d]
Debug info from gdb:
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No threads.
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
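
The "Could not attach to process" lines in the dump above point at /proc/sys/kernel/yama/ptrace_scope. A minimal sketch for checking that setting follows; the function name describe_ptrace_scope is made up for illustration, and the value meanings are paraphrased from the kernel's Yama documentation rather than taken from this thread.

```shell
#!/bin/sh
# Describe a Yama ptrace_scope value.
# describe_ptrace_scope is a hypothetical helper name, not part of any tool.
describe_ptrace_scope() {
    case "$1" in
        0) echo "classic: any process with the same uid may attach" ;;
        1) echo "restricted: only a parent (or declared ptracer) may attach" ;;
        2) echo "admin-only: attaching requires CAP_SYS_PTRACE" ;;
        3) echo "disabled: no process may attach" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Path taken from the gdb message in the dump above.
scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
    describe_ptrace_scope "$(cat "$scope_file")"
else
    echo "Yama ptrace_scope is not available on this system"
fi
```

If the value is 1, the gdb that the runtime spawns on a crash is probably being refused because it is a child trying to attach to its parent. Running gdb as root, as the message suggests, or temporarily lowering the value with `sudo sysctl kernel.yama.ptrace_scope=0` on a test machine, may let the crash handler produce a usable backtrace.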

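I am not 100% sure that the hangs, segfaults, and aborts are all caused by the same problem, but I suspect they are. The hangs do not feel like an ordinary deadlock, because the process does not respond to SIGQUIT or SIGTERM.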
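
To make the "does not respond to SIGTERM" observation repeatable, a small probe along these lines could be used. It is only a sketch: the function name exits_after_signal and its return codes are assumptions made for illustration. It checks only whether the process exits, so it suits SIGTERM rather than signals a process may handle without exiting.

```shell
#!/bin/sh
# Send a signal to a PID and report whether the process exits within a timeout.
# Usage: exits_after_signal <pid> <signal-name> <timeout-seconds>
# Returns 0 if the process went away, 1 if it is still alive after the
# timeout, 2 if the signal could not be delivered at all.
exits_after_signal() {
    pid=$1
    sig=$2
    timeout=$3
    kill -s "$sig" "$pid" 2>/dev/null || return 2
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 delivers nothing; it only checks that the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    return 1
}
```

For example, `exits_after_signal 7763 TERM 10` would return 1 for a process that is still present ten seconds after SIGTERM was sent, which is the behaviour described for the hung daemon.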

I have tried attaching gdb, following the instructions at http://www.mono-project.com/docs/debug+profile/debug/, but the results were less than spectacular.

Here is my .gdbinit:

less ~/.gdbinit
handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint
define mono_stack
 set $mono_thread = mono_thread_current ()
 if ($mono_thread == 0x00)
   printf "No mono thread associated with this thread\n"
 else
   set $ucp = malloc (sizeof (ucontext_t))
   call (void) getcontext ($ucp)
   call (void) mono_print_thread_dump ($ucp)
   call (void) free ($ucp)
 end
end

Below is the output of one of my gdb debugging sessions (attached to a hung process):

(gdb) where
#0  0x00007f2bbba05062 in do_sigsuspend (set=0x945300) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
#1  __GI___sigsuspend (set=0x945300) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#2  0x00000000005c8ccc in ?? ()
#3  <signal handler called>
#4  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#5  0x00000000005fdda7 in ?? ()
#6  0x0000000000610aac in ?? ()
#7  0x0000000000585f6e in ?? ()
#8  0x0000000000586ee9 in ?? ()
#9  0x00000000403eb416 in ?? ()
#10 0x000000000290e8b0 in ?? ()
#11 0x00007fff29bfacb0 in ?? ()
#12 0x0000000000000000 in ?? ()
(gdb) p mono_pmip (0x00000000005fdda7)
$1 = 0
(doesn’t seem to print anything either to gdb console or process stdout)
(gdb) call mono_locks_dump (0)
$2 = 0
Total locks (in 10 array(s)): 16368, used: 399, on freelist: 213, to recycle: 15752
(gdb) mono_stack()
"<unnamed thread>" tid=0x0x7f2bbc8d47c0 this=0x0x7f2bbc858140 thread handle 0x403 state : waiting on 0x41a : Event  owns ()
  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) System.Threading.WaitHandle.WaitOne_internal (System.Threading.WaitHandle,intptr,int,bool) <IL 0x0001c, 0xffffffff>
  at System.Threading.WaitHandle.WaitOne (System.TimeSpan,bool) <0x0009b>
  at System.Threading.WaitHandle.WaitOne (System.TimeSpan) <0x0001d>
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.RunUntilSignaled () [0x00073] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:184
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.Run (string[]) [0x00019] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:35
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.Main (string[]) [0x00000] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:24
  at (wrapper runtime-invoke) <Module>.runtime_invoke_int_object (object,intptr,intptr,intptr) <IL 0x0006c, 0xffffffff>

"<unnamed thread>" tid=0x0x7f2bbc8d47c0 this=0x0x7f2bbc858140 thread handle 0x403 state : waiting on 0x41a : Event  owns ()
  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) System.Threading.WaitHandle.WaitOne_internal (System.Threading.WaitHandle,intptr,int,bool) <IL 0x0001c, 0xffffffff>
  at System.Threading.WaitHandle.WaitOne (System.TimeSpan,bool) <0x0009b>
  at System.Threading.WaitHandle.WaitOne (System.TimeSpan) <0x0001d>
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.RunUntilSignaled () [0x00073] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:184
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.Run (string[]) [0x00019] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:35
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.Main (string[]) [0x00000] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:24
  at (wrapper runtime-invoke) <Module>.runtime_invoke_int_object (object,intptr,intptr,intptr) <IL 0x0006c, 0xffffffff>
call mono_locks_dump (0)
$1 = 51700864
(gdb) call mono_locks_dump (1)
$2 = 56715296
Total locks (in 10 array(s)): 16368, used: 399, on freelist: 213, to recycle: 15752
Lock 0x29d68d0 in object 0x7f2ba8d13590 untaken
Lock 0x29d68f8 in object 0x7f2b7482c2c0 untaken
Lock 0x29d6920 in object 0x7f2b7482cd00 untaken
Lock 0x29d6948 in object 0x7f2b7482cb70 untaken
Lock 0x29d6970 in object 0x7f2b7482c760 untaken
Lock 0x29d6998 in object 0x7f2b7482d380 untaken
Lock 0x29d69c0 in object 0x7f2b7482c540 untaken
Lock 0x29d69e8 in object 0x7f2b7482c240 untaken
…...
times lots

(gdb) call mono_object_describe (0x41a)
The following is printed to the gdb console. 
Program received signal SIGSEGV, Segmentation fault.
0x000000000052c1a2 in mono_object_describe ()
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mono_object_describe) will be abandoned.
When the function is done executing, GDB will silently stop.
(gdb) quit
A debugging session is active.
        Inferior 1 [process 7763] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/bin/mono-sgen, process 7763
As soon as gdb finishes, the process writes remaining log messages to gdb console and then restarts (possibly by upstart)
ubuntu@shim-megastore-prod:/var/log/upstart$ 2015-08-20 01:48:20,124  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  Service check complete.
2015-08-20 01:48:22,641  INFO   (  5) iri.PdSDaemon.Services.CloudWatchService  ::  936 metrics averaged...
2015-08-20 01:48:22,716  INFO   (  5) iri.PdSDaemon.Services.CloudWatchService  ::  4 metrics posted to CloudWatch.
2015-08-20 01:48:29,568  INFO   (ker) piri.PdSDaemon.Services.PriceSyncService  ::  98.8% synchronised (15.1/sec)
2015-08-20 01:48:39,820  DEBUG  (  4) ri.PdSDaemon.Services.ProductSyncService  ::  Zzzz
Process restarts, or is restarted by Upstart
2015-08-20 06:51:20,163  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  Ponte dei Sospiri Daemon Version 1.0.5695.31695
2015-08-20 06:51:20,172  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  Process ID: 12625
2015-08-20 06:51:20,172  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::
2015-08-20 06:51:20,182  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  ProductSyncService is not running, firing it up...
2015-08-20 06:51:20,183  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  CloudWatchService is not running, firing it up...
2015-08-20 06:51:20,185  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  OrderProcessingService is not running, firing it up...
The above is all written to the gdb console window. From then on, the output goes to the upstart console log.

Here is the project's list of dependencies:

  <package id="AWSSDK" version="2.3.20.0" targetFramework="net40" />
  <package id="CsvHelper" version="2.10.0" targetFramework="net40" />
  <package id="FluentMigrator" version="1.4.0.0" targetFramework="net40" />
  <package id="Mono.Options" version="1.1" targetFramework="net40" />
  <package id="Npgsql" version="2.2.5" targetFramework="net40" />
  <package id="ServiceStack.Common" version="3.9.71" targetFramework="net40" />
  <package id="ServiceStack.OrmLite.PostgreSQL" version="3.9.71" targetFramework="net40" />
  <package id="ServiceStack.OrmLite.Sqlite.Mono" version="3.9.71" targetFramework="net40" />
  <package id="ServiceStack.Text" version="3.9.71" targetFramework="net40" />
targetFramework="net40" />
  <package id="log4net" version="2.0.3" targetFramework="net40" />

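Any ideas or suggestions on how I can get more specific information about what is causing this? It looks like it could be a bug in Mono or in a native library (since we have no unsafe code), but I can't seem to pin down where the problem is coming from.

Any help is greatly appreciated!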

Well, this is a known bug in the Ubuntu kernel.

Xamarin has a bug report for it: https://bugzilla.xamarin.com/show_bug.cgi?id=29827

So if you update the kernel on those machines, the bug should go away (let's hope).
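
To see which machines still need the update, the running kernel can be compared against a minimum version. The sketch below assumes GNU `sort -V` is available; version_ge and MIN_KERNEL are names invented here, and the actual fixed kernel version has to be taken from the bug report, since it is not stated in this thread.

```shell
#!/bin/sh
# Succeed if version $1 is greater than or equal to version $2.
# Uses GNU sort's version ordering (-V); version_ge is a hypothetical helper.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

running=$(uname -r)
# MIN_KERNEL is a placeholder: set it to the fixed version from the bug report.
min=${MIN_KERNEL:-0}

if version_ge "$running" "$min"; then
    echo "kernel $running is at or above $min"
else
    echo "kernel $running is older than $min, consider updating"
fi
```

Running this on the UAT, production, and development machines might also show whether the difference in failure frequency lines up with different kernel versions.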

Cheers!