Cassandra troubleshooting guide: Site24x7

Apache Cassandra is a NoSQL distributed database designed to handle large amount of data across multiple servers. It follows peer-to-peer architecture and any node in the cluster provides the same functionality as any other node in the cluster.

Cluster status

Before jumping to troubleshooting, let's discuss how to check the overall health of the cluster. You can use nodetool status to access the status of cluster, which will be useful if one or more nodes are down or unreachable.

nodetool status

Nodetool status provides various information such as node's status, state, address, owns percentage, host id, and rack.

Status - U (Up) or D (Down)
State - N (Normal), L (Leaving), J (Joining), M (Moving).

Leaving - Indicates a node is being removed/decommissioned from the cluster.
Joining - Indicates a node is being added to the cluster.
Moving - Indicates a node's token range is being moved.

Datacenter: Datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens   Owns      Host ID                               Rack
UN  10.40.49.3    64.3 GB    256      48.9%     c469cbec-e0e0-46a0-b31e-f1c49860b03f  RAC2
UN  10.40.40.55   55.76 GB   256      50.8%     91cab37f-d427-454e-9656-4a25b93d9fe2  RAC1
UN  10.40.38.120  58.01 GB   256      49.7%     47219cd8-f55a-41f8-bb17-b21dbb4092fa  RAC3
UN  10.40.40.56   58.88 GB   256      50.4%     2bc08dc1-f132-469f-a4bc-4a6156237061  RAC3
UN  10.40.8.75    58.54 GB   256      51.1%     27fcfee0-2a3f-4e16-878e-cc4897233802  RAC2
DN  10.40.17.20   57 GB      256      49.2%     0934da93-f777-445c-a801-9ed0d4972fef  RAC1

Here you can see one of the nodes is unreachable or down. If you have many nodes, use the grep command to identify the abnormal nodes.

nodetool status | grep -v '^UN'

This will list only the nodes that are not Up and Normal.

Datacenter: Datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens   Owns      Host ID                               Rack
DN  10.40.17.20   57 GB      256      49.2%     0934da93-f777-445c-a801-9ed0d4972fef  RAC1

Common troubleshooting steps

Identify the problematic nodes
Narrow down the problems
Fix the root cause

Identify the problematic node

The first step in troubleshooting is to identify where the problem is. With the help of client logs, Cassandra logs, and other tools provided by Cassandra we have to identify if the problem is in client or any one node, or a subset of nodes.

Client logs

Client logs will help us identify the source of the problem. Some of the common errors you could find in logs are:

NoHostAvailable Exception

When the client or driver is not able to establish connection with any node in the cluster, this error will be thrown. It can happen when:

Multiple or all nodes are down in the cluster. Use nodetool status to verify the cluster status and node availability.
An incorrect configuration in the whitelist policy. For instance, you might have configured DCAwareRoundRobinPolicy with an incorrect localDC or node address
When all nodes are busy due to a high load and are not able to accept any new incoming requests.

Unavailable Exception

In this case, the request has reached the coordinator node and there is not enough replicas available to meet the requested consistency level.

If this exception is thrown from multiple coordinator nodes, then multiple replicas are not alive or unreachable at that time. Use nodetool getendpoints to get the possible list of nodes for that query and start from there.
If the exception is always thrown from a single coordinator node, then it is highly likely that the coordinator node is separated from the cluster. Use nodetool status on the coordinator node to find its status.

OperationTimedOutException

Operation timed out occurs when the coordinator does not respond within the timeout specified by the client. It can occur when the client side timeout is short or the particular query is fetching large amounts of data that could not be processed within the specified time. The error message generally displays the last host tried by the client. The coordinator will be a good starting point to debug the issue.

ReadTimeOutException or WriteTimeOutException

This is a server-side timeout that occurs between coordinator and replicas. This happens when a request reaches the coordinator node, and the coordinator believes there are enough replicas available for processing and request data from replicas. But for some reason, one or more replicas do not respond within the expected time, causing the read timeout or write timeout. This exception generally refers to a server-side problem as the default timeout is high.

Identifying the endpoints

If only a particular query consistently fails, you can use nodetool getendpoints to get all the nodes that own a particular partition key. You can start debugging from those list of nodes.

Assuming the following query is always failing, it means one or more nodes that own the given partition key is the problem.

Select * from sampleks.table1 where partitionkey=123;

Use the following to query the endpoints of "123":

nodetool getendpoints sampleks table1 123
#Lists the replica nodes
10.40.40.55
10.40.38.59
10.40.41.66

Narrow down the problem

Once you have a list of faulty nodes, the next step is to narrow down the problems in node. You can use Cassandra logs and various nodetool commands to find out the status of node.

Read latency

One of the common problems faced by Cassandra users is high read latency. High read latency can be either due to coordinator or individual replicas.

Coordinator query latency

If coordinator node is the suspect, then you can use nodetool proxyhistograms to find the read and write latency.

./nodetool proxyhistograms
#Lists the coordinator node's latency
proxy histograms
Percentile      Read Latency     Write Latency     Range Latency
                (micros)          (micros)          (micros)
50%             315.85            379.02            263.21
75%             545.79            454.83            315.85
95%             943.13            654.95            454.83
98%             1358.10           943.13            454.83
99%             1955.67           1358.10           454.83
Min             51.01             88.15             152.32
Max             8409.01           2816.16           454.83

The common problem for coordinator's high latency are:

Cross DC traffic
Consistency level
Large batches
Querying multiple partitions by using the IN clause from the partition key
Network latency

Local query latency

Local query latency refers to the latency of the replica nodes to process the request. Common causes for high local latency are:

Reading a large number of sstables: Use nodetool tablehistograms to find the sstable count.
Reading a large partition: Use nodetool tablestats to find the average and max partition size and use logs to see the max partition size that is flushed or compacted.
High tombstone count: Use logs to find the tombstone count read and nodetool tablestats to check the average tombstone per slice.
Insufficient threads: Use nodetool tpstats to find the thread stats.
Low machine configuration such as low RAM.

Table Histograms

The nodetool tablehistograms provides latency of the local read/write.

./nodetool tablehistograms keyspace table
#provides table level latency
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)                  
50%         3.00       51.01         454.83        88148           258
75%         3.00       73.46         785.94        263210          770
95%         5.00       88.15         1358.10       1131752         2759
98%         5.00       105.78        1955.67       3379391         3311
99%         5.00       126.93        4055.27       4866323         4768
Min         0.00       8.24          51.01         30              0
Max         5.00       263.21        4866.32       802187438       42510

Tablehistograms provides the number of sstables read per query, read latency, write latency, partition size, and cell count.

If the number of sstables read per query is high, then read queries will have a high latency. This can happen if the wrong compaction strategy is chosen. You have to choose a compaction strategy based on your read and write load. Another situation could be if the minor compaction is disabled.

The partition size and cell count can help tune your data pattern.

Table stats

If you suspect issues with one particular table, you can use nodetool tablestats to get the table level latency and other statistics.

nodetool tablestats cassandra.table1
#List the tablestats
Keyspace: cassandra
    Read Count: 28498299
    Read Latency: 0.3223316537594051 ms.
    Write Count: 35110896
    Write Latency: 0.04817532471971094 ms.
    Pending Flushes: 0
        Table: table1
        SSTable count: 13
        Space used (live): 4551417734
        Space used (total): 4551417734
        Space used by snapshots (total): 0
        Off heap memory used (total): 4548182
        SSTable Compression Ratio: 0.14427946324401666
        Number of partitions (estimate): 183506
        Memtable cell count: 474
        Memtable data size: 195955
        Memtable off heap memory used: 0
        Memtable switch count: 529
        Local read count: 28498299
        Local read latency: 0.122 ms
        Local write count: 35110896
        Local write latency: 0.039 ms
        Pending flushes: 0
        Bloom filter false positives: 17515
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 431528
        Bloom filter off heap memory used: 431424
        Index summary off heap memory used: 53606
        Compression metadata off heap memory used: 4063152
        Compacted partition minimum bytes: 30
        Compacted partition maximum bytes: 43388628
        Compacted partition mean bytes: 181607
        Average live cells per slice (last five minutes): 2.69670113244707
        Maximum live cells per slice (last five minutes): 35
        Average tombstones per slice (last five minutes): 1.0
        Maximum tombstones per slice (last five minutes): 1

Threadpool Stats

Cassandra maintains multiple thread pools for different tasks. Each threadpool has a specific function to perform. You can use nodetool tpstats to find the state of these thread pools, such as number of tasks that are active, completed, and blocked.

You can also obtain the number of dropped messages, which can shed some light on which stage is creating a problem. In the example below the mutation and request response is dropped.

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0      385618082         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0      537370386         0                 0
RequestResponseStage              0         0      376599718         0                 0
ReadRepairStage                   0         0        1603479         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0       38697073         0                 0
MemtableReclaimMemory             0         0          41849         0                 0
PendingRangeCalculator            0         0              2         0                 0
GossipStage                       0         0       70732668         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0            188         0                 0
MigrationStage                    0         0             11         0                 0
MemtablePostFlush                 0         0          68006         0                 0
ValidationExecutor                0         0              0         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0          41555         0                 0
InternalResponseStage             0         0           1122         0                 0
AntiEntropyStage                  0         0              0         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests         0         0     1668155897         0                72
UserDefinedFunctions              0         0              0         0                 0
    
Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                    10
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             2
PAGED_RANGE                  0
READ_REPAIR                  0

Cassandra logs

Cassandra provides multiple log files for various purposes. Some of the main logs are:

system.log
debug.log
gc.log

system.log

This is the default general logging provided by Cassandra. Various fields available are level, thread_name/id, date, class name, line number, and message.

Using log level

Sometimes checking the whole log file can be overwhelming. Use log level to filter the messages

# Below command will list last 5 warning/error messages
grep 'WARN\|ERROR' system.log | tail -5

Search all rotated log files

By default, Cassandra will compress the old rotated log files, so if you want to filter through all the rotated logs use zgrep

# Search all compressed rotated files
zgrep 'WARN\|ERROR' system.log.* | less

GC pauses

GCInspector.java class logs message when there is a long gcpause.

#List last 10 gcpauses in system.log
grep 'GCInspector' system.log | tail -10

Reading too many tombstones

Tombstones can bring the entire cluster or node down if not handled properly. The read response time can increase when a large number of tombstones is required to be read.

grep 'tombstone cells' system.log | tail -10

Sample Output

WARN  [SharedPool-Worker-1] 2021-07-30 02:01:53,990 ReadCommand.java:520 - Read 452 live rows and 1392 tombstone cells for query SELECT * FROM keyspace.table1 WHERE id = 456 LIMIT 5000 (see tombstone_warn_threshold)
WARN  [SharedPool-Worker-1] 2021-07-30 02:01:53,990 ReadCommand.java:520 - Read 2 live rows and 1892 tombstone cells for query SELECT * FROM keyspace.table1 WHERE id = 9887 LIMIT 5000 (see tombstone_warn_threshold)

In this warning log, too many tombstones are read which will make the query slow. Modify gc_grace_seconds according to the load.

Large partition warning

grep "Writing large partition" system.log | tail -10

Sample Output

WARN  [CompactionExecutor:13606] 2021-08-08 00:00:35,968 BigTableWriter.java:171 - Writing large partition keyspace/table1:8354000001876068 (251029397 bytes to sstable /home/cassandra/data_directories/data/keyspace/table1-01e71f5bfc3d36fd80b7abac29bd7766/mc-77869-big-Data.db) 

WARN  [CompactionExecutor:14390] 2021-08-11 00:00:36,176 BigTableWriter.java:171 - Writing large partition keyspace/table1:8354000001876068 (258559119 bytes to sstable /home/cassandra/data_directories/data/keyspace/table1-01e71f5bfc3d36fd80b7abac29bd7766/mc-77873-big-Data.db)

Debug log

Debug log is similar to system.log but provides some additional debugging information. This log is more noisier than system.log. It's recommended that you utilize any centralized logging or log analysis tool to accomplish this efficiently.

If you plan to search manually, it's better if you have a list of messages you are going to search.

#Lists the count of column familys flushed
grep 'Flushing largest CFS' debug.log | awk '{print $9, $10}' | sort | uniq -c 
    
4 CFS(Keyspace='cassandra', ColumnFamily='table1')
1 CFS(Keyspace='cassandra', ColumnFamily='table2')
2 CFS(Keyspace='cassandra', ColumnFamily='table1')
1 CFS(Keyspace='system', ColumnFamily='size_estimates')
1 CFS(Keyspace='sampleks', ColumnFamily='sampletable')

gc.log

This a normal conventional java GC log. You can use this to fine-tune JVM and GC configurations.

Fixing the issue

In this article, we went through various troubleshooting steps, including narrowing down the problem. We also determined the appropriate actions to take to resolve common issues, such as changing the data model, optimizing the querying pattern, and scaling nodes.

Cassandra troubleshooting guide