Cleaning up the SWU GTFS feed with pygtfs and sqlite3

For a Top Secret Project™ we are currently using SWU’s official GTFS feed. Since we had to clean up the feed a bit and ran into some errors, I thought it best to document our workflow to save it for posterity – it might even help others ;

Step 1: Loading into sqlite

For me, the first step is always to load the feed into sqlite3, since it makes it easier to follow through with all the manipulations I have in mind for later.

My current tool of choice is pygtfs, which can be installed through pip. If you don’t have sqlite installed, you might want to do that now, too.

pip3 install pygtfs
apt install sqlite3

pygtfs brings along the handy gtfs2db tool, which allows for importing your feed into a database right from the command line:

gtfs2db append ~/Downloads/gtfs.zip gtfs.db

This command will, however, fail after importing all the records and trying to write it to the database, because it can’t read a datetime (ValueError: time data “ does not match format ‚%Y%m%d‘). After some digging I found the missing feed_start_date and feed_end_date in feed_info.txt to be the culprit. According to spec, they are merely optional, but the gtfs2db script seems to depend on it. Nothing a quick edit in the text editor of your choice can’t fix. All there is left to do is to retry the import, and open a sqlite3 session with the newly created database. I am in the habit of switching on the headers and into column mode right from the get-go.

sqlite3 -header -column gtfs.db

Step 2: Exploring the stops table

For our project, we wanted to clean up the stops table a bit. First of all, in violation of the spec, in the SWU feed the stop_id is multiply assigned to every stop point („Haltepunkt“) within a stop („Haltestelle“). We can find this rather easily:

SELECT stop_name, count(stop_code) FROM stops GROUP BY stop_code ORDER BY count(stop_code) DESC limit 15;
 stop_name   count(stop_code)
 ----------  ----------------
 ZOB Ost     9               
 Staufenrin  9               
 Universit  8               
 Donaustadi  6               
 Egertweg    6               
 Römerplat  6               
 Gewerbesch  6               
 ZUP         6               
 Stadtwerke  5               
 Eselsberg   5               
 Sonnenstra  5               
 Theodor-He  5               
 Kuhberg Sc  5               
 Hauptbahnh  4               
 Theater     4

Furthermore, some stop points (i.e., the platforms within one stop) are assigned doubly or even thrice with the same coordinates. This happens if one stop point is being served by different operational branches. In the case of the SWU feed, the internal data format distinguishes between bus, tram and night bus service. This means that if a stop point is being served by all three of those branches, it will appear thrice in the data set:

SELECT stop_id, stop_code, stop_name, stop_lat, stop_lon, COUNT(distinct stop_lon) FROM stops GROUP BY stop_code HAVING count(distinct stop_lon) > 1 ORDER BY count(distinct stop_lon) DESC limit 20;

stop_id     stop_code   stop_name          stop_lat    stop_lon    COUNT(distinct stop_lon)

----------  ----------  -----------------  ----------  ----------  ------------------------

3579        1240        Universität Süd  48.421839   9.956403    6                       

3559        1383        Gewerbeschulen K�  48.384625   9.959908    6                       

1869        1700        ZUP                48.39179    10.0032     6                       

117         1050        Staufenring        48.403521   10.004344   5                       

3745        1200        Eselsberg Hasenko  48.414326   9.962903    5                       

3557        1390        Kuhberg Schulzent  48.383775   9.955289    5                       

1235        1052        Donaustadion       48.405101   10.006927   4                       

145         1072        Mecklenburgweg     48.43453    10.024493   4                       

147         1073        Thüringenweg      48.433379   10.020177   4                       

149         1074        Haslacher Weg      48.429919   10.013786   4                       

3260        1087        Egertweg           48.42583    10.01243    4                       

217         1171        Manfred-Börner-S  48.41967    9.942766    4                       

3581        1241        Botanischer Garte  48.424912   9.956829    4                       

3584        1245        Kliniken Wissensc  48.424331   9.952453    4                       

3586        1246        Universität West  48.422234   9.947201    4                       

3565        1360        Römerplatz        48.39077    9.975428    4                       

3561        1393        Grimmelfinger Weg  48.38564    9.965312    4                       

774         1506        Benzstraße        48.365706   9.941817    4                       

75          1008        Hauptbahnhof       48.39983    9.98404     3                       

3571        1020        Stadtwerke         48.4031     9.986038    3

I hacked together the following python script that will unify these stop points so that each coordinate will appear exactly once. That is, it will output SQL statements to be pasted into sqlite which should do the trick. Hacky and crappy, but there we go. Note that I have uncommented (and did not test) the transfers bit, since the SWU feed does not use transfers. Note that this leaves the multiply assigned stop_ids untreated, as of now.

sched = pygtfs.Schedule("gtfs.db")
sq = sched.stops_query
d = {}
for each in sq.all():
   if each.stop_code in d:
     for existing in d[each.stop_code]:
       if (d[each.stop_code][existing].stop_lat == each.stop_lat) and (d[each.stop_code][existing].stop_lon == each.stop_lon):
         print('UPDATE stop_times SET stop_id = ' + d[each.stop_code][existing].stop_id + ' WHERE stop_id = ' + each.stop_id + ';')
         #print('UPDATE transfers SET from_stop_id = '+ d[each.stop_code][existing].stop_id + ' WHERE from_stop_id = ' + each.stop_id + ';')
         #print('UPDATE transfers SET to_stop_id = '+ d[each.stop_code][existing].stop_id + ' WHERE to_stop_id = ' + each.stop_id + ';')
         print('DELETE FROM stops WHERE stop_id = ' + each.stop_id + ';')
     d[each.stop_code][each.stop_id] = each;
   else:
     d[each.stop_code] = {};
     d[each.stop_code][each.stop_id] = each

Step 3: Removing Deadheads

One bug in the current SWU feed is that it includes trips that should not be facing towards the customer. Namely, all the trips between the bus depot and the first stop on one vehicle’s Umlauf, and the return trips towards the depot. This might result in strange routing results and will definitely make the transitfeed validator complain. The following SQL statements will delete both the stop_times and the trips for those deadheads:

DELETE FROM trips WHERE trip_id IN (SELECT DISTINCT trip_id FROM stop_times WHERE stop_id IN ("170", "171"));
DELETE FROM stop_times WHERE trip_id IN (SELECT DISTINCT trip_id FROM stop_times WHERE stop_id IN ("170", "171");

Step 4: Re-Exporting the data, casting times past 24:00:00 without using strftime

This was one part which kept me scratching my head for way too long. Simple full-table dumps are not a big issue:

sqlite3 -header -csv gtfs.db 'SELECT * FROM trips'  > trips.txt

After some successfull exports, I hit a bump in the road with the stop_times table because of the way GTFS and sqlite each store times. A vehicle departing one minute before midnight from one stop and arriving two minutes past midnight at the next stop will be modeled with the textual entries of „23:59:00“ and „24:02:00“, respectively. Since the operational day ends somewhen between midnight and the wee hours of the next day, it is not unusual to encounter departure/arrival times that read something like „27:35:00“ – at 03:35 a.m. of the following day, but still within the validity period of the previous/current day (I am getting mixed up with terminology myself, I guess)

However, pygtfs will convert all those textual time entries into DateTime objects, which means that the two times above would be stored as 1970-01-01 23:59:00.000000 and 1970-01-02 00:02:00.000000. And any manipulation with strftime and other functions to manipulate Date and Time didn’t get me anywhere – or, to put it differently, all the usual functions correctly and stubbornly refused to return non-standard times that use funny hours like 24 or anything above.

In the end, the statement to get GTFS compliant times was not that hard, after all. If you ever want to calculate hours, minutes and seconds past a date, this is the way to go:

SELECT trip_id, printf("%02d",((strftime("%s", "arrival_time") - strftime("%s", "1970-01-01 00:00:00.00000")) / (60*60))) || ":" || printf("%02d",(((strftime("%s", "arrival_time") - strftime("%s", "1970-01-01 00:00:00.00000")) % (60*60)) / 60)) || ":" || printf("%02d",(((strftime("%s", "arrival_time") - strftime("%s", "1970-01-01 00:00:00.00000")) % 60) % 60)) AS arrival_time, printf("%02d",((strftime("%s", "departure_time") - strftime("%s", "1970-01-01 00:00:00.00000")) / (60*60))) || ":" || printf("%02d",(((strftime("%s", "departure_time") - strftime("%s", "1970-01-01 00:00:00.00000")) % (60*60)) / 60)) || ":" || printf("%02d",(((strftime("%s", "departure_time") - strftime("%s", "1970-01-01 00:00:00.00000")) % 60) % 60)) AS departure_time, stop_id, stop_sequence, stop_headsign, pickup_type, drop_off_type, shape_dist_traveled, timepoint FROM stop_times

But, as @LucasWerkmeistr pointed out, in this case we are always subtracting 1970-01-01 00:00:00.00000 – which is zero. The simplified version for exporting the tables is as follows:

sqlite3 -header -csv gtfs.db 'SELECT trip_id, printf("%02d",(strftime("%s", "arrival_time") / (60*60))) || ":" || printf("%02d",(strftime("%s", "arrival_time") % (60*60)) / 60) || ":" || printf("%02d",(strftime("%s", "arrival_time") % 60) % 60) AS arrival_time, printf("%02d",(strftime("%s", "departure_time")) / (60*60)) || ":" || printf("%02d",(strftime("%s", "departure_time")  % (60*60)) / 60) || ":" || printf("%02d",(strftime("%s", "departure_time") % 60) % 60) AS departure_time, stop_id, stop_sequence, stop_headsign, pickup_type, drop_off_type, shape_dist_traveled, timepoint FROM stop_times'  > stop_times.txt

Note that the export mentions the shape_dist_traveled and timepoint columns. This is because this iteration was not made on a squeaky clean SWU feed, but after some massaging with Conveyal’s GTFS Editor. But that will be a topic for another time 😉

edited on 2019-03-16 to fix errors in the last two scripts and to clarify stuff

stk

schlampig sortiertes Zeug, das ich schon immer mal sagen wollte.

Cleaning up the SWU GTFS feed with pygtfs and sqlite3

Step 1: Loading into sqlite

Step 2: Exploring the stops table

Step 3: Removing Deadheads

Step 4: Re-Exporting the data, casting times past 24:00:00 without using strftime

Ein Gedanke zu „Cleaning up the SWU GTFS feed with pygtfs and sqlite3“

Schreibe einen Kommentar Antworten abbrechen