[SOLR-5961] Solr gets crazy on /overseer/queue state change - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 4.7.1
Fix Version/s: 4.10.4, 5.0
Component/s: SolrCloud
Labels: None
Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes (separate machines)

Description

No idea how to reproduce it, but sometimes Solr starts littering the log with the following messages:

419158 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.DistributedQueue ? LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged

419190 [Thread-3] INFO org.apache.solr.cloud.Overseer ? Update state numShards=1 message={
"operation":"state",
"state":"recovering",
"base_url":"http://${IP_ADDRESS}/solr",
"core":"${CORE_NAME}",
"roles":null,
"node_name":"${NODE_NAME}_solr",
"shard":"shard1",
"collection":"${COLLECTION_NAME}",
"numShards":"1",
"core_node_name":"core_node2"}

It keeps spamming these messages with no delay, and restarting all the nodes does not help. I even tried stopping every node in the cluster first, but as soon as I start one of them the behavior is the same: it floods the log with this "/overseer/queue state" message again.
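For anyone trying to quantify the flood before filing logs, here is a minimal sketch that counts log events per logger class and measures the time span they cover. It assumes the leading integer on each line is milliseconds since startup (this is a guess from the excerpt above, not something the log format confirms), and the names `LINE_RE` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: each event line starts with "<millis> [<thread>] INFO <logger>".
# The meaning of the leading integer (ms since startup) is a guess.
LINE_RE = re.compile(r"^(\d+) \[[^\]]+\] INFO\s+(\S+)")


def summarize(lines):
    """Return (events per logger class, span in ms between first and last event)."""
    counts = Counter()
    first_ms = None
    last_ms = None
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Continuation lines (e.g. the JSON message body) are skipped.
            continue
        millis = int(match.group(1))
        counts[match.group(2)] += 1
        if first_ms is None:
            first_ms = millis
        last_ms = millis
    span_ms = 0 if first_ms is None else last_ms - first_ms
    return counts, span_ms


# Lines copied from the excerpt in this report.
sample = [
    "419158 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.DistributedQueue ? "
    "LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged",
    "419190 [Thread-3] INFO org.apache.solr.cloud.Overseer ? Update state numShards=1 message={",
    '"operation":"state",',
    '"state":"recovering",',
]

counts, span_ms = summarize(sample)
for logger, n in counts.items():
    print(logger, n)
print("span_ms:", span_ms)
```

Running it over a full log file (for example `summarize(open(path))`, with `path` being wherever your Solr log lives) gives a rough events-per-second figure for the Overseer and DistributedQueue loggers.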

PS: The only way to recover was to stop everything, manually clean up all the Solr-related data in ZooKeeper, and then rebuild everything from scratch. As you can imagine, that is unacceptable in a production environment.

Issue Links

incorporates: SOLR-6480 Too Many Open files trying to ask a replica to recover (Closed)

relates to: SOLR-7033 RecoveryStrategy should not publish any state when closed / cancelled. (Closed)

People

Assignee: Shalin Shekhar Mangar
Reporter: Maxim Novikov
Votes: 4
Watchers: 14

Dates

Created: 04/Apr/14 20:43
Updated: 13/Mar/16 07:59
Resolved: 27/Feb/15 18:45