Lost in the rabbit hole of Google Takeout


Back up your data, they said. All of it, they said. And then came Google Takeout.

/images/import_gplus_post.png

G+ post HTML file from Takeout

I'll have the 4, the 20 and a Google+

Takeout is Google's user data archive system for numerous products. My primary goal was to download an archive of my Google+ activities.

The approach is quite simple: choose the product in Takeout and wait until the archive(s) have been generated. The downloadable archive stays valid for a week, but you can generate new archives at any time.

Lesson 1:
Choose zip as the archive format if you use umlauts; tgz files can have encoding issues.

I remembered that there was an import plugin for Nikola and imagined throwing in the archive and getting a usable local site in return. At this point of the article the reader may already suspect that this did not work in the slightest.

Unpacking presents

The first inspection reveals:

  1. All G+ posts are located in Google+ stream/Posts as HTML files. These files appear usable.
  2. Image links just point to filenames. The path is missing, so only images located in the same directory are displayed, but
  3. Images are scattered across different directories (in Posts and Photos and their subfolders). The majority of image files are stored in Photos of posts, in subfolders named after dates.
  4. There are different date formats in peaceful co-existence (a small parsing sketch follows below):
Photos of posts/
 ├── 02.06.14
 ├── 02.06.16
 ├── 22. Juli 2013
 ├── 23.01.17
 ├── 2011-08-14
 └── 2012-03-13
  5. There is a corresponding JSON file for every image, but not for the HTML files.
  6. Structure of the HTML files:
/images/import_gplus_inspector.thumbnail.png

Dumdidumdumdum...Inspektor Gadget
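
To give an idea of what normalizing those folder names would take, here is a minimal sketch. The format list and the German month map are assumptions based on the folder names shown above; extend them as your archive requires.

import datetime
import re

# German month names as they appear in folder names like "22. Juli 2013"
GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5,
    "Juni": 6, "Juli": 7, "August": 8, "September": 9,
    "Oktober": 10, "November": 11, "Dezember": 12,
}

def parse_folder_date(name):
    """Return a datetime.date for a photo folder name, or None."""
    # ISO ("2011-08-14") and short ("02.06.14") formats
    for fmt in ("%Y-%m-%d", "%d.%m.%y"):
        try:
            return datetime.datetime.strptime(name, fmt).date()
        except ValueError:
            pass
    # verbose German format ("22. Juli 2013")
    match = re.match(r"(\d{1,2})\. (\w+) (\d{4})$", name)
    if match and match.group(2) in GERMAN_MONTHS:
        return datetime.date(int(match.group(3)),
                             GERMAN_MONTHS[match.group(2)],
                             int(match.group(1)))
    return None

print(parse_folder_date("22. Juli 2013"))  # 2013-07-22
print(parse_folder_date("02.06.14"))       # 2014-06-02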

Lesson 2:
You can only open single posts and there are a lot of dead links in image posts, but share and reaction information is displayed (public/private/collection/community post, +1s, reshares and comments).

Enter Nikola

With low expectations I install the import plugin for Nikola and see what happens. Nothing. The posts used to be provided as JSON files, but not anymore.

I brachiate through the files, importing the HTML files first. The import plugin instantiates a new Nikola site, so I can trial and error like hell. Then I take care of dead links, then titles; it keeps getting better with every build.

The result is a static website of my Google+ stream, including +1s and comments and a link to the original post.

Theming

In general the import is independent of any theme. I personally recommend hyde, which can even be improved with the custom.css that is included in the archive.

Wishlist

  • local search function
  • filter posts by share status

Attention!

If you consider making the stream backup publicly accessible, keep in mind that the imported data also includes all privately shared posts.
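
Before publishing, it can help to list everything that was not shared publicly. Here is a minimal sketch that reads the same "visibility" div the import plugin uses; the PUBLIC_MARKERS strings and the Posts path are assumptions and depend on the language of your archive.

import os
import bs4

# Strings that mark a publicly shared post in the "visibility" div;
# adjust to the language of your Takeout archive (assumption).
PUBLIC_MARKERS = ("Shared publicly", "Öffentlich geteilt")

def non_public_posts(posts_dir):
    """Yield (filename, visibility text) for posts that look non-public."""
    for name in sorted(os.listdir(posts_dir)):
        if not name.endswith(".html"):
            continue
        with open(os.path.join(posts_dir, name)) as f:
            soup = bs4.BeautifulSoup(f, "html.parser")
        visibility = soup.find("div", "visibility")
        text = visibility.text.strip() if visibility else ""
        if not any(marker in text for marker in PUBLIC_MARKERS):
            yield name, text

for name, text in non_public_posts("Takeout/Google+ stream/Posts"):
    print("{}: {}".format(name, text or "no visibility information"))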

Conclusion

As a long-term heavy Google+ user you are used to inconsistencies and "improvements" that constantly make things worse, so a Takeout archive is no more than a sparring partner to train with. It is only a matter of time until my version of the import plugin goes the way of all those Google messengers before it.

/images/takeout_gplus_slow.gif

static Google+ Nikola site (hyde theme)

If you want to try it yourself:

Listings

Usage (README.md)

import_gplus_README.md (Source)

This plugin does a rough import of a Google+ stream as provided by [Google Takeout](http://google.com/takeout/).

Videos work, content in general works, attached images may or may not work depending on the source.

The output is HTML, and there is little to no configuration done in the resulting site.

## IMPORTANT

As of today (July 2018) this mostly rewritten plugin works until Google plays around again.

If you consider releasing this into the wild, keep in mind that the import includes not only public but also private and community shares.

Enjoy.

## Usage

 * Download the Google Takeout archive as a zip file if you use umlauts; tgz files may have encoding issues.
 * Extract the dump file, merge parts if you have multiple files.
 * Additional Python package requirement: [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/).
 * Extract the plugin into the `plugins` folder of an existing Nikola site. The plugin will create a new site in a subfolder, so there won't be any contamination of actual data. If you are unsure or don't want that, you can easily initiate an empty site for this purpose: `nikola init dummy_site`.
 * Open `plugins/import_gplus.py` and adapt folder names to your language settings.
 * Run `nikola import_gplus path/to/takeout_folder`.
 * The plugin inits a new Nikola site called `new_site` (no shit, Sherlock); you will have to change into that directory to run build commands.
 * Building the site can take a long time and possibly wake up your fans. You may want to test the output with a fraction of the available data.
 * Although the output should work with any theme, it looks quite nice with [hyde](https://themes.getnikola.com/v7/hyde/); hpstr is okay, too. Consider copying the included `custom.css` into the `themes/THEME_NAME/assets/css` directory for an even better experience.
 * Tweaking `conf.py` (a rough example follows this list):
  * disable comments:  `COMMENT_SYSTEM = ""`
  * link to your G+ profile in `NAVIGATION_LINKS`, disable link to RSS feed
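
For illustration, those two `conf.py` tweaks might look roughly like this; the Google+ profile URL is a placeholder, and the surrounding structure follows Nikola's sample configuration.

# conf.py excerpt (sketch): disable comments and replace the RSS link
# in the navigation with a link to the Google+ profile (placeholder URL)
COMMENT_SYSTEM = ""

NAVIGATION_LINKS = {
    DEFAULT_LANG: (
        ("/archive.html", "Archives"),
        ("https://plus.google.com/+YourProfileName", "Google+ profile"),
    ),
}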

Plugin

import_gplus.py (Source)

# -*- coding: utf-8 -*-

from __future__ import unicode_literals, print_function
import os
import shutil

try:
    import bs4
except ImportError:
    bs4 = None  # reported via req_missing in _execute

from nikola.plugin_categories import Command
from nikola import utils
from nikola.utils import req_missing
from nikola.plugins.basic_import import ImportMixin
from nikola.plugins.command.init import SAMPLE_CONF, prepare_config

LOGGER = utils.get_logger('import_gplus', utils.STDERR_HANDLER)


class CommandImportGplus(Command, ImportMixin):
    """Import a Google+ dump."""

    name = "import_gplus"
    needs_config = False
    doc_usage = "[options] extracted_dump_file_folder"
    doc_purpose = "import a Google+ dump"
    cmd_options = ImportMixin.cmd_options

    def _execute(self, options, args):
        '''
            Import Google+ dump
        '''

        if not args:
            print(self.help())
            return

        if bs4 is None:
            req_missing(['bs4'], 'import a Google+ dump')
            return

        options['filename'] = args[0]
        self.export_folder = options['filename']
        self.output_folder = options['output_folder']
        self.import_into_existing_site = False
        self.url_map = {}

        # Google Takeout folder structure, adapt to your language settings

        # Takeout/
        # ├── +1/
        # ├── Google+ stream/
        # |   ├── Posts/
        # |   ├── Photos/
        # |   |   ├── Photos of posts/
        # |   |   └── Photos of polls/
        # |   ├── Activities/
        # |   ├── Collections/
        # |   └── Events/
        # ├── Google+ Communities/
        # └── index.html

        gto_root = "Takeout"
        gto_plus1 = "+1"
        gto_stream = "Stream in Google+"
        gto_posts = "Beiträge"
        gto_photos = "Fotos"
        gto_photos_posts = "Fotos von Beiträgen"
        gto_photos_polls = "Umfragefotos"
        gto_activity = "Aktivitätsprotokoll"
        gto_collections = "Sammlungen"
        gto_events = "Veranstaltungen"
        gto_communities = "Google+ Communities"

        # path to HTML formatted post files
        post_path = os.path.join(self.export_folder,
                                 gto_root,
                                 gto_stream,
                                 gto_posts)

        # collect all files
        files = [f for f in os.listdir(os.path.join(post_path)) if os.path.isfile(os.path.join(post_path, f))]

        # filter relevant HTML files
        html_files = [f for f in files if f.endswith(".html")]
        LOGGER.info("{} posts ready for import".format(len(html_files)))

        # init new Nikola site "new_site", edit conf.py to your needs
        # change to this folder for the build process
        self.context = self.populate_context(self.export_folder, html_files, post_path)
        conf_template = self.generate_base_site()
        self.write_configuration(self.get_configuration_output_path(), conf_template.render(**prepare_config(self.context)))
        self.import_posts(self.export_folder, html_files, post_path)

        # In the Takeout archive photos are linked to the main working
        # directory although they do not necessarily exist there (hello,
        # dead links!). The image files are spread across several folders.

        # All archive photos will be copied to the "images" folder.
        try:
            os.makedirs(os.path.join(self.output_folder, "images"))
            LOGGER.info("Image folder created.")
        except OSError:
            # folder already exists
            pass

        for root, dirs, files in os.walk(os.path.join(self.export_folder, gto_root)):
            for f in files:
                if f.lower().endswith((".jpg", ".jpeg", ".png")):
                    if not os.path.isfile(os.path.join(self.output_folder, "images", f)):
                        shutil.copy2(os.path.join(root, f), os.path.join(self.output_folder, "images"))
                        LOGGER.info("{} copied to Nikola image folder.".format(f))

    @staticmethod
    def populate_context(folder, names, path):
        # We don't get much data here
        context = SAMPLE_CONF.copy()
        context['DEFAULT_LANG'] = 'de'
        context['BLOG_DESCRIPTION'] = ''
        context['SITE_URL'] = 'http://localhost:8000/'
        context['BLOG_EMAIL'] = ''
        context['BLOG_TITLE'] = "Static G+ stream archive"

        # Get any random post, all have the same data
        with open(os.path.join(path, names[0])) as f:
            soup = bs4.BeautifulSoup(f, "html.parser")
            context['BLOG_AUTHOR'] = soup.find("a", "author").text

        context['POSTS'] = '''(
            ("posts/*.html", "posts", "post.tmpl"),
            ("posts/*.rst", "posts", "post.tmpl"),
        )'''
        context['COMPILERS'] = '''{
        "rest": ('.txt', '.rst'),
        "html": ('.html', '.htm')
        }
        '''
        return context

    def import_posts(self, folder, names, path):
        """Import all posts."""
        self.out_folder = 'posts'

        for name in names:
            with open(os.path.join(path, name)) as f:
                soup = bs4.BeautifulSoup(f, "html.parser")

                description = ""
                tags = []

                title_string = str(soup.title.string)
                title = self.prettify_title(title_string)

                # post date is the 2nd link on the page
                post_date = soup.find_all("a")[1].text

                # collect complete post content
                post_text = soup.find("div", "main-content")
                link_embed = soup.find("a", "link-embed")
                media_link = soup.find_all("a", "media-link")
                album = soup.find("div", "album")
                video = soup.find("div", "video-placeholder")
                visibility = soup.find("div", "visibility")
                activity = soup.find("div", "post-activity")
                comments = soup.find("div", "comments")

                if video is not None:
                    tags.append("video")

                for link in media_link:
                    # link to image in image folder if not external link
                    if not link["href"].startswith("http"):
                        filename = link["href"]
                        try:
                            link["href"] = os.path.join("..", "images", filename)
                            tags.append("photo")
                        except TypeError:
                            LOGGER.warn("No href attribute to convert link destination ({})".format(link))
                        try:
                            link.img["src"] = os.path.join("..", "images", filename)
                        except TypeError:
                            LOGGER.warn("No src attribute to convert link destination ({})".format(link))
                    # throw away redundant p tag filled with the post text
                    try:
                        link.p.decompose()
                    except AttributeError:
                        pass

                # multiple entries occur only in albums, so we only need the
                # first item; find_all() always returns a list (possibly
                # empty), so media_link itself is never None
                try:
                    media_link = media_link[0]
                except IndexError:
                    media_link = None

                if album is not None:
                    tags.append("photo_album")
                    # we don't need media_link if album is available
                    media_link = None

                if link_embed is not None:
                    tags.append("link")
                    # we don't need media_link if we got external link
                    media_link = None

                content = ""
                for part in [post_text,
                             link_embed,
                             album,
                             media_link,
                             visibility,
                             activity,
                             comments]:
                    if part is not None:
                        content = "{}\n{}\n".format(content, part)

                # get the link to the original post from the post date anchor
                link = soup.find_all("a")[1].get("href")

                slug = utils.slugify("{}_{}".format(post_date.split()[0], title), lang="de")

                if not slug:  # should never happen
                    LOGGER.error("Error converting post: {}".format(title))
                    return

                # additional metadata
                more = {"link": link, # original G+ post
                        "hidetitle": True, # doesn't work for index pages
                        }

                self.write_metadata(os.path.join(self.output_folder, self.out_folder, slug + ".meta"), title, slug, post_date, description, tags, more)
                self.write_content(
                    os.path.join(self.output_folder, self.out_folder, slug + ".html"),
                    content)

    def write_metadata(self, filename, title, slug, post_date, description, tags, more):
        super(CommandImportGplus, self).write_metadata(
            filename,
            title,
            slug,
            post_date,
            description,
            tags,
            **more
            )

    def prettify_title(self, t):
        """
            Titles are generated from post text.
            Cut junk and shorten to one line
            for readability and convenience.
        """
        # reduce title string to one line
        t = t.split("<br>")[0]
        # link in title? just cut it out, ain't nobody got time for that
        t = t.split("<a ")[0]
        # same for user link
        t = t.split("span class=")[0]
        # cut trailing dots
        if t.endswith("..."):
            t = t[:-3]
        # cut html elements and fix quotation marks
        for tag in [("<b>", ""),
                     ("</b>", ""),
                     ("&quot;", "\""),
                     ("&#39;", "'"),
                     ("<b", ""),
                     ("</", ""),
                     ("<i>", ""),
                     ("</i>", ""),
                     ("<", ""),]:
            t = t.replace(tag[0], tag[1])

        return t
