[Spacewalk-list] Run away DELETE query eating my disk space

Thu May 1 04:09:59 UTC 2014

Chris <dmagick at gmail.com> writes:

> On 30/04/14 14:05, Jon Miller wrote:
>> Before asking for more disk space, I scheduled some down time to perform a
>> "VACUUM FULL VERBOSE ANALYZE" on my PostgreSQL database today along with
>> rebuilding all of my indexes in the DB. After bringing Spacewalk back up
>> again, the DELETE query commenced again and is right back in the same cycle
>> again. I was hoping that a full maintenance cycle would magically fix things
>> for me.
>>
>> So more detail on my situation. Here is the EXPLAIN plan which also gives you
>> insight into our DB size with the row counts:
>>
>>    EXPLAIN DELETE FROM rhnPackageChangeLogData WHERE id NOT IN ( SELECT DISTINCT changelog_data_id FROM rhnPackageChangeLogRec );
>>                                                  QUERY PLAN
>>    ------------------------------------------------------------------------------------------------------
>>     Seq Scan on rhnpackagechangelogdata  (cost=318312.96..134742963.76 rows=192456 width=6)
>>       Filter: (NOT (SubPlan 1))
>>       SubPlan 1
>>         ->  Materialize  (cost=318312.96..318905.08 rows=42512 width=8)
>>               ->  HashAggregate  (cost=317678.33..318103.45 rows=42512 width=8)
>>                     ->  Seq Scan on rhnpackagechangelogrec  (cost=0.00..281315.66 rows=14545066 width=8)
>>
>> Beyond just raw row counts, the size of each table are
>> rhnPackageChangeLogData = 110 MB and rhnPackageChangeLogRec = 2469 MB. Trying
>> my best to impersonate a DBA here. A couple things I'm contemplating now:
>>
>> 1. Try increasing the working memory parameter.
>>     I'm thinking this due to the fact that it looks like it wants to load a
>>     ~2.5G hash in memory for that subquery. Within my postgresql.conf file,
>>     the "work_mem" is not explicitly set and I'm still trying to figure out
>>     how to query the current size.
>>
>> 2. Drop the DISTINCT clause from the subquery.
>>     I'm thinking this because the EXPLAIN without the 'DISTINCT' clause makes
>>     it drop the HashAggregate step:
>>    EXPLAIN DELETE FROM rhnPackageChangeLogData WHERE id NOT IN ( SELECT changelog_data_id FROM rhnPackageChangeLogRec );
>>                                               QUERY PLAN
>>    ------------------------------------------------------------------------------------------------
>>     Seq Scan on rhnpackagechangelogdata  (cost=352677.73..45926203886.33 rows=192456 width=6)
>>       Filter: (NOT (SubPlan 1))
>>       SubPlan 1
>>         ->  Materialize  (cost=352677.73..554945.39 rows=14545066 width=8)
>>               ->  Seq Scan on rhnpackagechangelogrec  (cost=0.00..281315.66 rows=14545066 width=8)
>>
>>    I actually see the following comment within the Task_queries.xml file:
>>    "DISTINCT makes the DELETE faster for postgresql" but as I'm thinking "less
>>    memory and able to complete" might be preferable than potential speed.
>>
>> 3. Others? Perhaps there an alternate SQL approach that can be taken? Or would you
>>     suspect something else is terrible wrong here?
>
> Reverse the logic so you find the id's to explicitly delete:
>
> delete from rhnPackageChangeLogData where id in (SELECT d.id from 
> rhnPackageChangeLogData d left join rhnPackageChangeLogRec l on 
> (d.id=l.changelog_data_id) where l.changelog_data_id is null);
>
> I don't think I have the same amount of data as you, but here are some 
> quick stats:
>
> spacewalk=# SELECT count(*) from rhnPackageChangeLogData;
>   count
> --------
>   176325
> (1 row)
>
> Time: 20.140 ms
> spacewalk=#
> spacewalk=# select count(*) from rhnPackageChangeLogRec;
>    count
> ---------
>   5121684
> (1 row)
>
> Time: 421.497 ms
> spacewalk=# begin;
> BEGIN
> spacewalk=# delete from rhnPackageChangeLogData where id in (SELECT d.id 
> from rhnPackageChangeLogData d left join rhnPackageChangeLogRec l on 
> (d.id=l.changelog_data_id) where l.changelog_data_id is null);
> DELETE 0
> Time: 2646.490 ms
> spacewalk=# rollback;
> ROLLBACK
> Time: 0.155 ms

Thanks for the reply. I wondered if I could do that but it would have been a
while before I would have came up with that query. Perhaps upstream folks
would like to test that version for Task_queries.xml within the git repo.

Like before, I like to turn it into a "select count(*)" version first as a
test. I ran it and forgot about it until I realized it was still running and
killed it after ~3.5hours. 

  spaceschema=# select count(*) from rhnPackageChangeLogData where id in
  spaceschema-#   (SELECT d.id from rhnPackageChangeLogData d
  spaceschema(#     left join rhnPackageChangeLogRec l on (d.id=l.changelog_data_id)
  spaceschema(#     where l.changelog_data_id is null);
  ^C
  Session terminated, killing shell... ...killed.

I now fear that something more fundamental is wrong with my DB.

-- 
Jon Miller