2012-06-26

Trac: Fixing the dreaded "Duplicate entry" Error during SVN Rescan

So, The other day I was minding my own business, resyncing an SVN repository in trac and I ran across this error:
File "/usr/bin/trac-admin", line 7, in ? sys.exit(
  File "/usr/lib/python2.4/site-packages/Trac-0.11.6-py2.4.egg/trac/admin/console.py", line 1325, in run admin.run()
  File "/usr/lib/python2.4/site-packages/Trac-0.11.6-py2.4.egg/trac/admin/console.py", line 155, in run self.cmdloop()
  File "/usr/lib64/python2.4/cmd.py", line 142, in cmdloop stop = self.onecmd(line)
  File "/usr/lib/python2.4/site-packages/Trac-0.11.6-py2.4.egg/trac/admin/console.py", line 138, in onecmd rv = cmd.Cmd.onecmd(self, line) or 0
  File "/usr/lib64/python2.4/cmd.py", line 219, in onecmd return func(arg)
  File "/usr/lib/python2.4/site-packages/Trac-0.11.6-py2.4.egg/trac/admin/console.py", line 680, in do_resync repos = env.get_repository().sync(self._resync_feedback)
  File "/usr/lib/python2.4/site-packages/Trac-0.11.6-py2.4.egg/trac/versioncontrol/cache.py", line 214, in sync (str(next_youngest),
  File "/usr/lib/python2.4/site-packages/Trac-0.11.6-py2.4.egg/trac/db/util.py", line 64, in execute return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/lib64/python2.4/site-packages/MySQLdb/cursors.py", line 163, in execute    self.errorhandler(self, exc, value)
  File "/usr/lib64/python2.4/site-packages/MySQLdb/connections.py", line 35, in defaulterrorhandler raise errorclass, errorvalue
_mysql_exceptions.IntegrityError: (1062, "Duplicate entry '37295-releases/may0712-D' for key 'PRIMARY'")

How did we get here?!

As background, this is a copy of a repository we've been working with for a while. We're copying it to a new repository as we have two sets of files in it that were originally connected projects but have grown farther and farther apart to the point where they don't really reference each other or need to be together. At one point we had a few people who somehow checked in duplicate deletions of the same file in the original repository. (How that got past SVN is another story, but it wasn't anything out of the ordinary). In any case, this isn't the first time I've run across this error and since we made a copy of the repo that had that issue, we now have it here.

Now, if this was the very last revision and the repo was quiescent, you could do a little clever hacking on the svn repo to remove the duplication... but that's not possible to do here. No, instead we're going to get MySQL Crafty.

Time for Mysql Sluthing

First off, I open up my trusty commandline mysql client and pick my database. Looking at the list of tables you'll see the "revision" and "node_change" tables which look like this:
mysql> describe revision;
+---------+---------+------+-----+---------+-------+
| Field   | Type    | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| rev     | text    | NO   | PRI | NULL    |       |
| time    | int(11) | YES  | MUL | NULL    |       |
| author  | text    | YES  |     | NULL    |       |
| message | text    | YES  |     | NULL    |       |
+---------+---------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> describe node_change;
+-------------+------+------+-----+---------+-------+
| Field       | Type | Null | Key | Default | Extra |
+-------------+------+------+-----+---------+-------+
| rev         | text | NO   | PRI | NULL    |       |
| path        | text | NO   | PRI | NULL    |       |
| node_type   | text | YES  |     | NULL    |       |
| change_type | text | NO   | PRI | NULL    |       |
| base_path   | text | YES  |     | NULL    |       |
| base_rev    | text | YES  |     | NULL    |       |
+-------------+------+------+-----+---------+-------+

A Plan Comes Together

'rev' is a Primary Key (or part of one) in both cases and, according to the message above, we're trying to insert the same primary key twice. So how are we going to fix this? If I delete the offending records, it'll rescan the repo up to that point, see that it's not there, add it, then add the second one. So, that doesn't help us at all. We need to leave it there but not let it complain the next time. If we leave it there but drop the unique requirement, it'll write it for the first rev, and then write over it again with the second rev, or (if it doesn't look for an existing record first) it'll just add the line twice. In the latter case, we can clean that up later and then re-institute the unique requirement. Sounds like a win, let's do it.

Dropping Constraints

mysql> alter table revision drop primary key;
Query OK, 37295 rows affected (0.11 sec)
Records: 37295  Duplicates: 0  Warnings: 0

mysql> alter table node_change drop primary key;
Query OK, 328916 rows affected (3.13 sec)
Records: 328916  Duplicates: 0  Warnings: 0


mysql> describe revision;
+---------+---------+------+-----+---------+-------+
| Field   | Type    | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| rev     | text    | NO   |     | NULL    |       |
| time    | int(11) | YES  | MUL | NULL    |       |
| author  | text    | YES  |     | NULL    |       |
| message | text    | YES  |     | NULL    |       |
+---------+---------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> describe node_change;
+-------------+------+------+-----+---------+-------+
| Field       | Type | Null | Key | Default | Extra |
+-------------+------+------+-----+---------+-------+
| rev         | text | NO   | MUL | NULL    |       |
| path        | text | NO   |     | NULL    |       |
| node_type   | text | YES  |     | NULL    |       |
| change_type | text | NO   |     | NULL    |       |
| base_path   | text | YES  |     | NULL    |       |
| base_rev    | text | YES  |     | NULL    |       |
+-------------+------+------+-----+---------+-------+
6 rows in set (0.00 sec)
Alright! Let's resync again... just running "tracadmin resync"... And watching it count... for a while. Eventually we'll see something like:
37309 revisions cached.
Done.

Searching for UFO's

So, now we've got all that data in the db... including the duplicate revision. We can't just add back the keys because it'll fail the constraint validation. (If you run "alter table node_change add primary key (rev(16), path(512), change_type(1));" you'll get "ERROR 1062 (23000): Duplicate entry '32152-branches/...' for key 'PRIMARY'".We're going to have to clear that extra one out first, so let's find it.
mysql> select count(*) as c,rev,path,change_type from node_change group by concat(rev, path, change_type) order by c desc limit 10;
+---+-------+---------------------------------+-------------+
| c | rev   | path                            | change_type |
+---+-------+---------------------------------+-------------+
| 2 | 37295 | releases/May0712                | D           |
| 1 | 33277 | releases/S2611/myApp/main.cpp   | E           |
| 1 | 33277 | releases/S2611/myApp            | E           |
...
Well, there she is, 'c'=2, so we've got two of the same thing, just like we thought. I put a limit of 10 instead of only pulling the top result so we could make sure that there were no other problem entries pulled in after that one during the sync. If there were, we'd have to do the following steps for each.

Dropping the Dupes

mysql> select * from node_change where rev=37295 and path='releases/May0712';
+-------+------------------+-----------+-------------+------------------+----------+
| rev   | path             | node_type | change_type | base_path        | base_rev |
+-------+------------------+-----------+-------------+------------------+----------+
| 37295 | releases/May0712 | D         | D           | releases/May0712 | 37294    |
| 37295 | releases/may0712 | D         | D           | releases/may0712 | 37294    |
+-------+------------------+-----------+-------------+------------------+----------+
2 rows in set (0.20 sec)

mysql> delete from node_change where rev=37295 and path='releases/may0712' limit 1;
Query OK, 1 row affected (0.19 sec)

mysql> select * from node_change where rev=37295 and path='releases/may0712';
+-------+------------------+-----------+-------------+------------------+----------+
| rev   | path             | node_type | change_type | base_path        | base_rev |
+-------+------------------+-----------+-------------+------------------+----------+
| 37295 | releases/may0712 | D         | D           | releases/may0712 | 37294    |
+-------+------------------+-----------+-------------+------------------+----------+
1 row in set (0.24 sec)
You can see that MySQL honors the 'limit' clause on deletes, saving us a little work here. Now we should be able to add our constraint back on.
mysql> alter table node_change add primary key (rev(16), path(512), change_type(1));
Query OK, 329329 rows affected (6.94 sec)
Records: 329329  Duplicates: 0  Warnings: 0

mysql> alter table revision add primary key (rev(16));
Query OK, 37309 rows affected (0.29 sec)
Records: 37309  Duplicates: 0  Warnings: 0
And we're done!