
2019-07-16 Unplanned Outage

Dragoneer

Site Developer
Staff member
Site Director
Outage resolved. We were performing maintenance on the site and one of the services became unresponsive, leading to an unplanned outage.

EDIT: Apparently the issue which caused the outage has crept back. We are investigating.
 

Dragoneer

Site Developer
Staff member
Site Director
Been on a conference call for a few hours now. We're continuing to work with IMVU's Ops team to try to pinpoint the root cause of the failure. This is not a hardware issue. We're still not 100% sure where the issue originated, and we're working on getting the site back up and running.

When we brought the site up several times earlier, it ran fine for 15-45 minutes without issue, but then the database would lock up and become unresponsive. The system is giving us no specific errors as to what's causing it. At the moment we're trying to see if we can trigger the specific error so we can pinpoint the problem.
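
For anyone curious what pinpointing this looks like in practice, the usual starting points on a MySQL/Percona 5.7 server are the InnoDB monitor output and the information_schema transaction table. The sketch below is generic and illustrative, not a record of the exact commands being run during this outage.

Code:
# Generic diagnostics for a MySQL/Percona 5.7 server that locks up
# without reporting a specific error; illustrative only.

# Full InnoDB report; its SEMAPHORES section is where waits like the
# ones quoted later in this thread show up.
mysql -e "SHOW ENGINE INNODB STATUS\G"

# Transactions that have been open or waiting the longest, with the
# statement each one is running.
mysql -e "
  SELECT trx_mysql_thread_id, trx_state, trx_started, trx_wait_started,
         LEFT(trx_query, 100) AS query
  FROM information_schema.INNODB_TRX
  ORDER BY trx_started
  LIMIT 10;"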
 

Dragoneer

Site Developer
Staff member
Site Director
Site is currently up. Continuing to monitor. We're not 100% sure we've resolved the problem, and will have to see whether the issue recurs from this point onward. I will post an update later on whether this fix works, along with a description of what happened.
 

Dragoneer

Site Developer
Staff member
Site Director
We are making an additional backup of site data and will be attempting to reload our database.

Part of the problem we're experiencing is that we keep running into queries which conflict with one another. They become interlocked and eventually time out, and when that timeout occurs they take the site down. We believe we have triggered some specific, obscure bug in our database software which is causing the problem. No settings or configuration options have been changed recently.
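
For context, on 5.7 installs that include the bundled sys schema, the kind of interlocked lock waits described above can be listed directly. This is an illustrative query rather than the exact one we're using, and the InnoDB-internal semaphore waits discussed further down only appear in SHOW ENGINE INNODB STATUS, not in this view.

Code:
# Illustrative sketch, assuming the sys schema bundled with
# MySQL/Percona Server 5.7 is installed: lists which session is
# blocking which, and for how long. Ordinary row-lock waits appear
# here; InnoDB-internal semaphore waits do not.
mysql -e "
  SELECT wait_age, locked_table,
         waiting_pid, waiting_query,
         blocking_pid, blocking_query
  FROM sys.innodb_lock_waits;"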

We're hoping that by restoring the database we can resolve this issue and fix what originally caused the problem. This will take some time, however.

Again, we apologize for these issues. We understand this is not ideal and creates an inconvenience to the community.

Getting the site up and reliably running is our top priority.
 

yak

Site Developer
Staff member
Administrator
We appear to have encountered some rare case of undo tablespace corruption, or something else which is causing individual database queries to lock up waiting on each other's semaphores; because of that circular dependency, the semaphores never get released.
This only appears to happen during heavy write loads. Below is an example:

Code:
--Thread 112943326208 has waited at trx0undo.ic line 171 for 57.00 seconds the semaphore:
X-lock on RW-latch at 0x1474aff4f8 created in file buf0buf.cc line 1425
a writer (thread id 110968400384) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file trx0undo.ic line 190
Last time write locked in file /usr/ports/databases/percona57-server/work/percona-server-5.7.26-29/storage/innobase/include/trx0undo.ic line 171


--Thread 110968400384 has waited at trx0rseg.ic line 48 for 57.00 seconds the semaphore:
X-lock on RW-latch at 0x146c407458 created in file buf0buf.cc line 1425
a writer (thread id 112943326208) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.cc line 1101
Last time write locked in file /usr/ports/databases/percona57-server/work/percona-server-5.7.26-29/storage/innobase/include/trx0rseg.ic line 48

All of the data appears to be fine and readily accessible, provided heavy write activity doesn't take place, which isn't how websites usually work. After trying out numerous possible solutions, we are left with only one option: re-installing and re-importing the database as a whole, which should take care of any possible corruption of InnoDB's internal control structures by virtue of a complete do-over. We also have backups available which should allow us to restore everything to the point when the issues started taking place.
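
For readers wondering what re-installing and re-importing the database involves, a logical dump-and-reload on MySQL/Percona 5.7 generally looks like the sketch below. The paths and option set are placeholders; the actual procedure being run here may differ.

Code:
# Illustrative sketch only; not the exact procedure being run here.
# A logical dump-and-reload rebuilds every InnoDB table and its internal
# control structures (undo pages included) from scratch.

# 1. Take a consistent logical dump while the server is still readable.
mysqldump --single-transaction --routines --triggers --all-databases > full_dump.sql

# 2. Stop the server, set the old data directory aside, and initialize
#    a fresh one (the path is a placeholder).
mysqld --initialize --datadir=/var/db/mysql_new

# 3. Start the server on the new data directory and reload the dump.
mysql < full_dump.sql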

This is being worked on right now and will unfortunately take a while to finish due to the sheer amount of data being moved around.
 

Dragoneer

Site Developer
Staff member
Site Director
No change at this time. Import is still in progress. I know this doesn't add anything new, but we want to make sure we're communicating with the community so you're kept in the loop -- even if that means "We're still waiting for the import to finish."
 