[Spacewalk-list] Run away DELETE query eating my disk space

Jon Miller jonebird at gmail.com
Fri May 2 22:54:35 UTC 2014


My current plan is to stop Spacewalk this evening, dump the DB, recreate the
DB, and load my data back. Why? I was troubleshooting another issue related
to clients that use a Spacewalk proxy, and while strace'ing Tomcat I found
that forked processes were segfaulting; the last thing each did before
crashing was read from a PostgreSQL socket. My evening procedure is going off
the theory that my DB is somehow corrupted.
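
In outline, the plan is something like this (the dump path is illustrative,
and the recreate step needs whatever owner and encoding the DB has today):

  # with Spacewalk stopped and only PostgreSQL left running:
  su - postgres -c 'pg_dump spaceschema > /var/tmp/spaceschema.sql'
  su - postgres -c 'dropdb spaceschema'
  su - postgres -c 'createdb -O <owner> spaceschema'
  su - postgres -c 'psql spaceschema < /var/tmp/spaceschema.sql'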

Meanwhile, here are some specs on our Spacewalk server:
- VMware instance running RHEL 6.4
  + 8GB of memory, 2 vCPUs
- PostgreSQL version 8.4.13-1.el6_3.x86_64
- Spacewalk version 2.0.1-1 (spacewalk-postgresql)

Thanks for the interest,
Jon Miller

Paul Robert Marino <prmarino1 at gmail.com> writes:

> Increase your working memory (work_mem); also increase your maintenance
> working memory (maintenance_work_mem).
> That will help, but we need to know more details before we can help:
> - The exact specs of the box.
> - The PostgreSQL configuration.
> - The exact version of PostgreSQL you are running; there are huge
>   differences between 8 and 9.
> - What OS and version you are running on. For example, I run PostgreSQL
>   9.x on RHEL 6, which causes huge problems unless you change the default
>   Java driver (at the time I did it, that was a manual operation). This
>   can also be a problem for external PostgreSQL databases.
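>
> Those settings live in postgresql.conf. A placeholder sketch (the values
> only show which knobs I mean; they are not recommendations for your box):
>
>   work_mem = 32MB                # per-sort/hash memory; 8.4 defaults to 1MB
>   maintenance_work_mem = 256MB   # used by VACUUM, CREATE INDEX, REINDEX
>
> Reload PostgreSQL after changing them.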
>
> -- Sent from my HP Pre3
>
> ---------------------------------------------------------------------------
> On May 1, 2014 21:52, Jon Miller <jonebird at gmail.com> wrote: 
>
> Chris <dmagick at gmail.com> writes: 
>
>>> Thanks for the reply. I wondered if I could do that, but it would have
>>> been a while before I would have come up with that query. Perhaps
>>> upstream folks would like to test that version for Task_queries.xml
>>> within the git repo.
>>>
>>> Like before, I like to turn it into a "select count(*)" version first
>>> as a test. I ran it and forgot about it until I realized it was still
>>> running, and killed it after ~3.5 hours.
>>>
>>> spaceschema=# select count(*) from rhnPackageChangeLogData where id in
>>> spaceschema-# (SELECT d.id from rhnPackageChangeLogData d
>>> spaceschema(# left join rhnPackageChangeLogRec l on (d.id=l.changelog_data_id)
>>> spaceschema(# where l.changelog_data_id is null);
>>> ^C
>>> Session terminated, killing shell... ...killed.
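>>>
>>> Next time I'll EXPLAIN it before letting it run that long. I gather 8.4
>>> can plan a direct NOT EXISTS as an anti-join, so the equivalent form
>>> below might behave better (untested):
>>>
>>> SELECT count(*)
>>>   FROM rhnPackageChangeLogData d
>>>  WHERE NOT EXISTS (SELECT 1
>>>                      FROM rhnPackageChangeLogRec l
>>>                     WHERE l.changelog_data_id = d.id);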
>> 
>> Yikes. 
>> 
>> Time to check indexes? Check there is one on 
>> rhnpackagechangelogrec(changelog_data_id): 
>> 
>> spacewalk=# \d rhnpackagechangelogrec
>>            Table "public.rhnpackagechangelogrec"
>>       Column       |           Type           |       Modifiers
>> -------------------+--------------------------+------------------------
>>  id                | numeric                  | not null
>>  package_id        | numeric                  | not null
>>  changelog_data_id | numeric                  | not null
>>  created           | timestamp with time zone | not null default now()
>>  modified          | timestamp with time zone | not null default now()
>> Indexes:
>>     "rhn_pkg_clr_id_pk" PRIMARY KEY, btree (id)
>>     "rhn_pkg_clr_pid_cld_uq" UNIQUE, btree (package_id, changelog_data_id)
>>     "rhn_pkg_clr_cld_uq" btree (changelog_data_id)
>> 
>> <snip> 
>
> Yes, I have the same indexes: 
> spaceschema=# \d rhnpackagechangelogrec
>            Table "public.rhnpackagechangelogrec"
>       Column       |           Type           |       Modifiers
> -------------------+--------------------------+------------------------
>  id                | numeric                  | not null
>  package_id        | numeric                  | not null
>  changelog_data_id | numeric                  | not null
>  created           | timestamp with time zone | not null default now()
>  modified          | timestamp with time zone | not null default now()
> Indexes:
>     "rhn_pkg_clr_id_pk" PRIMARY KEY, btree (id)
>     "rhn_pkg_clr_pid_cld_uq" UNIQUE, btree (package_id, changelog_data_id)
>     "rhn_pkg_clr_cld_uq" btree (changelog_data_id)
>
>> and the one on 'id' in changelogData should be there because it's a 
>> primary key: 
>> 
>> spacewalk=# \d rhnpackagechangelogdata
>>       Table "public.rhnpackagechangelogdata"
>>  Column  |           Type           |       Modifiers
>> ---------+--------------------------+------------------------
>>  id      | numeric                  | not null
>>  name    | character varying(128)   | not null
>>  text    | character varying(3000)  | not null
>>  time    | timestamp with time zone | not null
>>  created | timestamp with time zone | not null default now()
>> Indexes:
>>     "rhn_pkg_cld_id_pk" PRIMARY KEY, btree (id)
>> 
>> <snip> 
>
> Same here as well, or perhaps an extra one? (Maybe you snipped too much?)
> spaceschema=# \d rhnpackagechangelogdata
>       Table "public.rhnpackagechangelogdata"
>  Column  |           Type           |       Modifiers
> ---------+--------------------------+------------------------
>  id      | numeric                  | not null
>  name    | character varying(128)   | not null
>  text    | character varying(3000)  | not null
>  time    | timestamp with time zone | not null
>  created | timestamp with time zone | not null default now()
> Indexes:
>     "rhn_pkg_cld_id_pk" PRIMARY KEY, btree (id)
>     "rhn_pkg_cld_nt_idx" btree (name, "time")
>
> Two days ago I actually shut down Spacewalk so I could run a "VACUUM FULL
> VERBOSE ANALYZE;", and then I rebuilt every index on the DB before bringing
> Spacewalk back up again.
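>
> In outline (spaceschema is the database; spacewalk-service is the stock
> control script, though exact service handling may differ on your box):
>
>   spacewalk-service stop          # stop Tomcat, taskomatic, etc.
>   service postgresql start        # make sure the DB itself is still up
>   su - postgres -c 'psql spaceschema'
>   spaceschema=# VACUUM FULL VERBOSE ANALYZE;
>   spaceschema=# REINDEX DATABASE spaceschema;
>   spaceschema=# \q
>   spacewalk-service start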
>
> While the query was running yesterday, I was watching "vmstat" output, and
> while the system was busy there didn't seem to be any blatant resource
> constraint on the whole: a lot of block I/O and the occasional swap
> activity, but definitely not thrashing.
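>
> Specifically, something like:
>
>   vmstat 5    # si/so = swap in/out, bi/bo = blocks in/out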
>
> Have you made any tweaks to your postgresql.conf that you can share? That
> is where my head is now, but I've got more reading / learning to do before
> I start introducing tuning changes.
>
> Thanks, 



