I promised I would replace the image files in my demographic transition post with HTML tables, and now I’ve followed through. My program didn’t generate the whole table. I just copied all the rows with numbers into a text file and then turned them into HTML rows that I could stick in a table. The table structure those rows sit in was made by hand, as were the rows not containing numbers. I also had to remove all the superscripts, since they were screwing things up. My code is below.
def convertToHTML(infilename, outfilename):
    """Take text grabbed from a pdf table, turn it into html rows"""
    # Read the whole input file; the with-block closes it automatically
    with open(infilename) as infile:
        intext = infile.read()
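    # Skip blank lines (such as a trailing newline), which would otherwise produce empty rows
    inrows = [row for row in intext.split("\n") if row.strip()]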
    # I'd like to generalize this so as not to rely on the following invariant
    numcountries = 6
    outrows = []
    for inrow in inrows:
        tokens = inrow.split(" ")
        #There are a bunch of empty strings due to consecutive spaces
        while "" in tokens:
            tokens.remove("")
        headerlength = len(tokens) - numcountries
        headertokens = tokens[:headerlength]
        bodytokens = tokens[headerlength:]
        outrow = ""
        outrow += "<tr align=\"center\">"
        outrow += "<th scope=\"row\" align=\"left\">"
        outrow += " ".join(headertokens)
        outrow += "</th>"
        for token in bodytokens:
            outrow += "<td>"
            outrow += token
            outrow += "</td>"
        outrow += "</tr>"
        outrows.append(outrow)
    outtext = "\n".join(outrows)
    with open(outfilename, "w") as outfile:
        outfile.write(outtext)

convertToHTML("input.txt", "output.txt")
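
The comment in the code mentions wanting to stop relying on a fixed number of countries. One possible way to do that, as a rough sketch only, is to treat the trailing run of number-like tokens in each row as the data cells and everything before it as the row label. The helper name `split_row` and the `NUMERIC` pattern are hypothetical and not part of the script above, and the pattern assumes the cells look like plain numbers, possibly with a sign, commas, a decimal point, or a percent sign.

```python
import re

# Hypothetical pattern, an assumption about what the table cells look like:
# optional sign, digits with optional commas and decimal point, optional "%".
NUMERIC = re.compile(r"^[-+]?[\d,]*\.?\d+%?$")


def split_row(line):
    """Split one text row into (header tokens, body tokens)."""
    # split() with no argument already discards the empty strings
    # produced by consecutive spaces.
    tokens = line.split()
    split_at = len(tokens)
    # Walk backwards for as long as the tokens still look numeric.
    while split_at > 0 and NUMERIC.match(tokens[split_at - 1]):
        split_at -= 1
    return tokens[:split_at], tokens[split_at:]


header, body = split_row("Total fertility rate  1.9 2.1 1.4")
print(header)  # ['Total', 'fertility', 'rate']
print(body)    # ['1.9', '2.1', '1.4']
```

The obvious weakness is a row label that itself ends in a number, such as a year: that number would be counted as a data cell, so rows like that would still need fixing by hand.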
