retweet.py 10 billion bug

This weekend @SanMo (original post) started freaking out and reposting the same tweets over and over again. Code superstar Chris Finke wasn’t available to help me track down the issue so I reanimated Chris Nelson for assistance. My initial thought was that Twitter had changed the way it served mentions. retweet.py stores the status_id of each status it retweets in a sqlite database so it doesn’t repeat itself. Browsing the database, I noticed that the two tweets it was repeating were the first with ids above 10 billion (a recent milestone). Chris pointed out that running sqlite> SELECT MAX(status_id) FROM retweets; Returned the id ‘9663742534’ and not the true maximum. The table that retweet.py creates has two text columns, one for status_id and one for the timestamp. Changing the status_id column to integer causes MAX() to work properly (I’m not sure what the technical reason behind this failure is). To get retweet.py running again, I did the following (via Chris) from the command line:

# sqlite3 sanmo.sqlite
sqlite> CREATE TABLE retweets2 (status_id INTEGER PRIMARY KEY, timestamp TEXT);
sqlite> INSERT INTO retweets2 SELECT * FROM retweets;
sqlite> DROP TABLE retweets;
sqlite> ALTER TABLE retweets2 RENAME TO retweets;
sqlite> .quit

That will shift all the old data into a new table. The initial database creation routine needs to be fixed in retweet.py and will probably be in version 1.3. The fix above works for me but your mileage may vary.

2 Responses to “retweet.py 10 billion bug”

  1. cstone says:

    The reason the sorting didn’t work the way you wanted is because that MAX was using a text collation scheme to order things, not a numeric comparison. retweets.status_id’s previous type was varchar(100), and there’s nothing implicitly converting it to integer just for your comparison. ‘9’ is higher than ‘1’ in most of the world of text comparisons. It’s the same result you’d get if you had sorted the text IDs with the unix “sort” command without using -n.

    sqlite3 has a CAST expression, so you could also get the comparison you wanted without copying the table by doing SELECT MAX(CAST(status_id AS integer)) FROM retweets. But don’t do this: fixing the table is the right way.

  2. cnelson says:

    You are totally right cstone, don’t know how I missed that. I had assumed doing a MAX() would cause it to do an implicit cast and there was a bounds issue with that cast, but that doesn’t seem to be the case in this test:


    $ sqlite3 test.db
    SQLite version 3.1.3
    Enter ".help" for instructions
    sqlite> create table test (x varchar(100));
    sqlite> insert into test (x) values (1);
    sqlite> insert into test (x) values (2);
    sqlite> insert into test (x) values (3);
    sqlite> insert into test (x) values (4);
    sqlite> insert into test (x) values (5);
    sqlite> insert into test (x) values (6);
    sqlite> insert into test (x) values (7);
    sqlite> insert into test (x) values (8);
    sqlite> insert into test (x) values (9);
    sqlite> insert into test (x) values (10);
    sqlite> select * from test;
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    sqlite> select max(x) from test;
    9

Leave a Reply