Ticket #30 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

regress.pike timing issue (locking problem?)

Reported by: arend
Owned by: gnugo
Priority: normal
Milestone: 3.7.7
Component: regressions
Version:
Severity: minor
Keywords:
Cc:
patch: yes

Description (last modified by arend)

Running ./regress.pike reading:220 atari_atari:29 nngs1:42 almost always stalls after atari_atari:29. Adding '--options "--mode gtp --level 1"' makes the stall almost certain.

Apparently Gunnar cannot reproduce this.

A second failure is the spurious "test result missing" error messages, which are most easily triggered by './regress.pike --check-unoccupied'.

Attachments

arend_7_7.7-regress.pike_locking.diff (5.0 KB) - added by arend 5 years ago.
Revise synchronization between threads; localize write_queue.

Regression Results

Attachment Rev. PASS FAIL Nodes Status
arend_7_7.7-regress.pike_locking.diff never tested

Change History

Changed 5 years ago by arend

  • description modified

Changed 5 years ago by arend

Revise synchronization between threads; localize write_queue.

Changed 5 years ago by arend

The second problem is easily explained: it happens when the program_reader thread "overtakes" the program_writer thread, i.e. when it obtains the test result before the program_writer has analyzed the next line in the .tst-file that contains the correct test result.

The patch in the first attachment solves this somewhat drastically by having the reader wait until the complete .tst-file has been processed by the program_writer; this is implemented via a Thread.Queue() object.
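
The "reader waits for the writer" idea can be sketched roughly as follows. This is not the code from the patch: it is a Python analogue (queue.Queue standing in for Pike's Thread.Queue), and the "id:result" line format, the function signatures and run_suite are assumptions made up for the illustration.

```python
import queue
import threading

def program_writer(tst_lines, expected, done):
    """Record the expected result of every test, then report completion."""
    for line in tst_lines:
        # Hypothetical "id:result" format, not the real .tst syntax.
        test_id, result = line.split(":", 1)
        expected[test_id] = result.strip()
    done.put(True)  # the complete file has been processed

def program_reader(results, expected, done, report):
    """Compare results only after the writer has finished the whole file."""
    done.get()  # blocks until the writer's put(); an earlier put() is kept
    for test_id, got in results:
        want = expected.get(test_id)
        if want is None:
            report.append((test_id, "test result missing"))
        else:
            report.append((test_id, "PASS" if got == want else "FAIL"))

def run_suite(tst_lines, results):
    expected, report, done = {}, [], queue.Queue()
    reader = threading.Thread(target=program_reader,
                              args=(results, expected, done, report))
    writer = threading.Thread(target=program_writer,
                              args=(tst_lines, expected, done))
    reader.start()  # started first on purpose; it still cannot overtake
    writer.start()
    reader.join()
    writer.join()
    return report
```

Because the reader blocks on done.get() before looking anything up, it cannot report "test result missing" for a test whose expected result the writer simply has not parsed yet.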

The first problem happens, I believe, when the program_reader sends the cond->signal() before the program_writer reaches the cond->wait(condmutexkey) point, so that the signal gets lost.

The patch solves this by using a queue instead, and by localizing the write_queue variable. I am not sure the latter is necessary; I have to think about that again.
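
For illustration, the difference between a condition signal and a queue can be shown in a few lines. This is a Python analogue, not Pike code from regress.pike: threading.Condition and queue.Queue are assumed to behave like Thread.Condition and Thread.Queue in the relevant respect, and the function names are invented for the sketch. It runs in a single thread so that the ordering is deterministic.

```python
import queue
import threading

def signal_before_wait_is_lost(timeout=0.2):
    """Return True if a notify() issued before wait() has no effect."""
    cond = threading.Condition()
    with cond:
        cond.notify()  # nobody is waiting yet, so nothing is recorded
    with cond:
        woken = cond.wait(timeout)  # False when the wait times out
    return not woken

def put_before_get_is_kept(timeout=0.2):
    """Return the item that was put() before anyone called get()."""
    q = queue.Queue()
    q.put("result")  # stored until somebody asks for it
    return q.get(timeout=timeout)
```

A condition variable carries no memory of earlier signals, whereas a queue stores the item until it is fetched, so the relative order of the two threads stops mattering.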

Changed 5 years ago by arend

My analysis of the first problem is wrong, as adding a condmutex->lock() before sending the signal doesn't solve it.

It seems that some commands get sent to the wrong instance of GNU Go when one test suite finishes and the next one starts. I am at a loss to explain why, but I think that is why localizing the write_queue variable, as in the attached patch, solves the problem.
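
What "localizing" the queue means can be sketched like this. It is a hypothetical Python illustration, not the patch itself: SuiteRun, its methods and the example commands are assumptions, and the sketch does not claim to reproduce the actual cause of the stall.

```python
import queue

class SuiteRun:
    """One test suite with its own engine and its own command queue."""

    def __init__(self, name):
        self.name = name
        self.write_queue = queue.Queue()  # owned by this run, not a global
        self.sent = []                    # stands in for the engine's stdin

    def enqueue(self, command):
        self.write_queue.put(command)

    def drain(self):
        """Send everything queued so far to this run's engine."""
        while True:
            try:
                command = self.write_queue.get_nowait()
            except queue.Empty:
                return
            self.sent.append(command)

def demo():
    first, second = SuiteRun("atari_atari"), SuiteRun("nngs1")
    first.enqueue("quit")                  # left over from the first suite
    second.enqueue("loadsgf example.sgf")  # hypothetical command
    second.drain()                         # only sees the second run's queue
    return first.sent, second.sent
```

With one queue per run, a command that is still pending when a suite ends cannot be picked up by the writer of the next suite, which is the kind of cross-talk a shared variable would allow.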

Changed 5 years ago by arend

For now, I have failed to analyze the first problem correctly. However, the attached patch works, and I suggest using it.

Changed 5 years ago by arend

  • status changed from new to closed
  • patch set
  • resolution set to fixed
  • milestone set to 3.7.7